All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
NOTE: As semantic versioning states all 0.y.z releases can contain breaking changes in API (flags, grpc API, any backward compatibility)
We use breaking
- #3204 Mixin: Use sidecar's metric timestamp for healthcheck.
- #3862 Sidecar, Store, Query, Ruler, Receiver, Query-Frontend: Added request logging for gRPC and HTTP in the server side.
- #3740 Query: Added --query.default-step flag to set default step.
- #3700 ui: make old bucket viewer UI work with vanilla Prometheus blocks
- #2641 Query Frontend: Added --query-range.request-downsampled flag enabling additional queries for downsampled data in case of empty or incomplete response to range request.
- #3792 Receiver: Added --tsdb.allow-overlapping-blocks flag to allow overlapping tsdb blocks and enable vertical compaction
- #3031 Compact/Sidecar/other writers: added --hash-func. If a hash function has been specified, writers use it to calculate a hash of each file in a block before uploading it. If those hashes exist in the meta.json file, Compact does not download files that already exist on disk with the same hash. This also means that the data directory passed to Thanos Compact is only cleared once at boot or if everything succeeds. So if, for example, you use persistent volumes on k8s and your Thanos Compact crashes or fails to complete an iteration, the last downloaded files are not wiped from the disk. The directories that were created the last time are only wiped again after a successful iteration or if the previously picked up blocks have disappeared.
- #3686 Query: Added federated metric metadata support.
- #3773 Compact: Pad compaction planner size check
- #3795 s3: A truncated "get object" response is reported as error.
- #3814 Store: Decreased memory utilisation while fetching block's chunks.
- #3815 Receive: Improve handling of empty time series from clients
- #3705 Store: Fix race condition leading to failing queries or possibly incorrect query results.
- #3854 Mixin: Remove assumed metrics. Use thanos_info instead of kube_pod_info for dashboard selectors.
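The --hash-func behaviour described in #3031 above can be pictured with a small sketch. This is an illustration in Python, not Thanos code: the helper names `file_sha256` and `needs_download` are hypothetical, SHA-256 is an assumed choice of hash function, and the way the recorded hash is passed in stands in for whatever meta.json actually stores.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(local_path, recorded_hash):
    """Decide whether a block file has to be fetched again.

    recorded_hash is the hash stored at upload time (hypothetical
    representation); None means the writer did not record one.
    """
    if recorded_hash is None or not os.path.exists(local_path):
        return True
    return file_sha256(local_path) != recorded_hash


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "000001")
        with open(path, "wb") as f:
            f.write(b"chunk data")
        recorded = file_sha256(path)
        print("matching hash, download needed:", needs_download(path, recorded))
        print("no recorded hash, download needed:", needs_download(path, None))
```

The point of the check is that a file left on disk by a crashed iteration is reused only when its content hash matches the recorded one; in every other case the sketch falls back to downloading.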
v0.18.0 - 2021.01.27
- #3380 Mixin: Add block deletion panels for compactor dashboards.
- #3568 Store: Optimized inject label stage of index lookup.
- #3566 StoreAPI: Support label matchers in labels API.
- #3531 Store: Optimized common cases for time selecting smaller amount of series by avoiding looking up symbols.
- #3469 StoreAPI: Added hints field to LabelNamesRequest and LabelValuesRequest. Hints are an opaque data structure that can be used to carry additional information from the store and its content is implementation-specific.
- #3421 Tools: Added thanos tools bucket rewrite command allowing to delete series from given block.
- #3509 Store: Added a CLI flag to limit the number of series that are touched.
- #3444 Query Frontend: Make POST request to downstream URL for labels and series API endpoints.
- #3388 Tools: Bucket replicator now can specify block IDs to copy.
- #3385 Tools: Bucket prints extra statistics for block index with debug log-level.
- #3121 Receive: Added --receive.hashrings alternative to receive.hashrings-file flag (lower priority). The flag expects the literal hashring configuration in JSON format.
- #3567 Mixin: Reintroduce thanos_objstore_bucket_operation_failures_total alert.
- #3527 Query Frontend: Fix query_range behavior when start/end times are the same
- #3760 Store: Fix panic caused by a race condition happening on concurrent index-header reader usage and unload, when --store.enable-index-header-lazy-reader is enabled.
- #3759 Store: Fix panic caused by a race condition happening on concurrent index-header lazy load and unload, when --store.enable-index-header-lazy-reader is enabled.
- #3560 Query Frontend: Allow separate label cache
- #3672 Rule: Prevent crashing due to no such host error when using dnssrv+ or dnssrvnoa+.
- #3461 Compact, Shipper, Store: Fixed panic when no external labels are set in block metadata.
- #3496 S3: Respect SignatureV2 flag for all credential providers.
- #2732 Swift: Switched to a new library ncw/swift providing large objects support. By default, segments will be uploaded to the same container directory segments/ if the file is bigger than 1GB. To change the defaults see the docs.
- #3626 Shipper: Failed upload of meta.json file doesn't cause block cleanup anymore. This has a potential to generate corrupted blocks under specific conditions. Partial block is left in bucket for later cleanup.
v0.17.2 - 2020.12.07
- #3532 compact: do not cleanup blocks on boot. Reverts the behavior change introduced in #3115 as in some very bad cases the boot of Thanos Compact took a very long time since there were a lot of blocks-to-be-cleaned.
- #3520 Fix index out of bound bug when comparing ZLabelSets.
v0.17.1 - 2020.11.24
- #3480 Query Frontend: Fixed regression.
- #3734 pkg/rules/proxy: fix hotlooping when receiving client errors
- #3498 Enabled debug.SetPanicOnFault(true), which allows us to recover from queries causing SEG FAULTs (e.g. unmapped memory access).
v0.17.0 - 2020.11.18
- #3259 Thanos BlockViewer: Added a button in the blockviewer that allows users to download the metadata of a block.
- #3261 Thanos Store: Use segment files specified in meta.json file, if present. If not present, Store does the LIST operation as before.
- #3276 Query Frontend: Support query splitting and retry for label names, label values and series requests.
- #3315 Query Frontend: Support results caching for label names, label values and series requests.
- #3346 Ruler UI: Fix a bug preventing the /rules endpoint from loading.
- #3115 compact: now deletes partially uploaded blocks and blocks with deletion marks concurrently. It does that at the beginning and then every --compact.cleanup-interval time period. By default it is 5 minutes.
- #3312 s3: add list_objects_version config option for compatibility.
- #3356 Query Frontend: Add a flag to disable step alignment middleware for query range.
- #3378 Ruler: added the ability to send queries via the HTTP method POST. This helps when alerting/recording rules are extra long, because the parameters are encoded in the request body instead of the URI. Thanos Ruler now uses POST by default unless --query.http-method is set to GET.
- #3381 Querier UI: Add ability to enable or disable metric autocomplete functionality.
- #2979 Replicator: Add the ability to replicate blocks within a time frame by passing --min-time and --max-time
- #3398 Query Frontend: Add default config for query frontend memcached config.
- #3277 Thanos Query: Introduce dynamic lookback interval. This allows queries with large step to make use of downsampled data.
- #3409 Compactor: Added support for no-compact-mark.json which excludes the block from compaction.
- #3245 Query Frontend: Add query-frontend.org-id-header flag to specify HTTP header(s) to populate slow query log (e.g. X-Grafana-User).
- #3431 Store: Added experimental support to lazy load index-headers at query time. When enabled via --store.enable-index-header-lazy-reader flag, the store-gateway will load into memory an index-header only once it's required at query time. Index-header will be automatically released after --store.index-header-lazy-reader-idle-timeout of inactivity. This generally reduces baseline memory usage of store when inactive, as well as the total number of mapped files (which is limited to 64k in some systems).
- #3437 StoreAPI: Added hints field to LabelNamesResponse and LabelValuesResponse. Hints are an opaque data structure that can be used to carry additional information from the store and its content is implementation specific.
- #3415 Tools: Added thanos tools bucket mark command that allows to mark given block for deletion or for no-compact
- #3257 Ruler: Prevent Ruler from crashing when using default DNS to lookup hosts that results in "No such hosts" errors.
- #3331 Disable Azure blob exception logging
- #3341 Disable Azure blob syslog exception logging
- #3414 Set CORS for Query Frontend
- #3437 Add external labels to Labels APIs.
- #3452 Store: Index cache posting compression is now enabled by default. Removed experimental.enable-index-cache-postings-compression flag.
- #3410 Compactor: Changed metric thanos_compactor_blocks_marked_for_deletion_total to thanos_compactor_blocks_marked_total with marker label. Compactor will now automatically disable compaction for blocks with large index that would output blocks after compaction larger than specified value (by default: 64GB). This automatically handles the Prometheus format limit.
- #2906 Tools: Refactor Bucket replicate execution. Removed all thanos_replicate_origin_.* metrics. thanos_replicate_origin_meta_loads_total can be replaced by blocks_meta_synced{state="loaded"}. thanos_replicate_origin_partial_meta_reads_total can be replaced by blocks_meta_synced{state="failed"}.
- #3309 Compact: breaking ⚠️ Rename metrics to match naming convention. This includes metrics starting with thanos_compactor to thanos_compact, thanos_querier to thanos_query and thanos_ruler to thanos_rule.
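For dashboards and alerts affected by the #3309 prefix renames, a rewrite along the following lines may help. It is only a sketch: `migrate_expr` is a hypothetical helper, it rewrites plain text rather than parsing PromQL, and it assumes every metric with one of the old prefixes was renamed by swapping the prefix alone.

```python
import re

# Prefix renames listed in #3309 (old prefix -> new prefix).
PREFIX_RENAMES = {
    "thanos_compactor_": "thanos_compact_",
    "thanos_querier_": "thanos_query_",
    "thanos_ruler_": "thanos_rule_",
}

# \b keeps the match anchored to the start of a metric name, so a name that
# merely contains one of the old prefixes in the middle is left alone.
_OLD_PREFIX = re.compile(r"\b(" + "|".join(map(re.escape, PREFIX_RENAMES)) + r")")


def migrate_expr(expr):
    """Rewrite old metric prefixes in a query expression (text-based)."""
    return _OLD_PREFIX.sub(lambda m: PREFIX_RENAMES[m.group(1)], expr)


if __name__ == "__main__":
    print(migrate_expr("rate(thanos_compactor_blocks_marked_total[5m])"))
```

Because the rewrite is textual, the result is worth reviewing by hand before it is committed to a dashboard or rule file.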
v0.16.0 - 2020.10.26
Highlights:
- New Thanos component, Query Frontend has more options and supports shared cache (currently: Memcached).
- Added debug mode in Thanos UI that allows to filter Stores to query from by their IPs from Store page (!). This helps enormously in e.g debugging the slowest store etc. All raw Thanos API allows passing storeMatch[] arguments with __address__ matchers.
- Improved debuggability on all Thanos components by exposing off-CPU profiles thanks to fgprof endpoint.
- Significantly improved sidecar latency and CPU usage for metrics fetches.
- #3234 UI: Fix assets not loading when --web.prefix-header is used.
- #3184 Compactor: Fixed support for web.external-prefix for Compactor UI.
- #3114 Query Frontend: Added support for Memcached cache.
- breaking Renamed flag log_queries_longer_than to log-queries-longer-than.
- #3166 UIs: Added UI for passing a storeMatch[] parameter to queries.
- #3181 Logging: Added debug level logging for responses between 300-399
- #3133 Query: Allowed passing a storeMatch[] to Labels APIs; Time range metadata based store filtering is supported on Labels APIs.
- #3146 Sidecar: Significantly improved sidecar latency (reduced ~2x). Added thanos_sidecar_prometheus_store_received_frames histogram metric.
- #3147 Querier: Added query.metadata.default-time-range flag to specify the default metadata time range duration for retrieving labels through Labels and Series API when the range parameters are not specified. The zero value means range covers the time since the beginning.
- #3207 Query Frontend: Added cache-compression-type flag to use compression in the query frontend cache.
- #3122 *: All Thanos components have now /debug/fgprof endpoint on HTTP port allowing to get off-CPU profiles as well.
- #3109 Query Frontend: Added support for Cache-Control HTTP response header which controls caching behaviour. So far no-store value is supported and it makes the response skip cache.
- #3092 Tools: Added tools bucket cleanup CLI tool that deletes all blocks marked to be deleted.
- #3136 Sidecar: breaking Added metric thanos_sidecar_reloader_config_apply_operations_total and rename metric thanos_sidecar_reloader_config_apply_errors_total to thanos_sidecar_reloader_config_apply_operations_failed_total.
- #3154 Querier: breaking Added metric thanos_query_gate_queries_max. Remove metric thanos_query_concurrent_selects_gate_queries_in_flight.
- #3154 Store: breaking Renamed metric thanos_bucket_store_queries_concurrent_max to thanos_bucket_store_series_gate_queries_max.
- #3179 Store: context.Canceled will not increase thanos_objstore_bucket_operation_failures_total.
- #3136 Sidecar: Improved detection of directory changes for Prometheus config.
- breaking Added metric thanos_sidecar_reloader_config_apply_operations_total and rename metric thanos_sidecar_reloader_config_apply_errors_total to thanos_sidecar_reloader_config_apply_operations_failed_total.
- #3022 *: Thanos images are now built with Go 1.15.
- #3205 *: Updated TSDB to ~2.21
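The storeMatch[] parameter mentioned in the highlights and in #3166/#3133 can be passed on a raw API call roughly as sketched below. The base URL, the `build_query_url` helper and the exact matcher syntax (`{__address__="host:port"}`) are assumptions made for illustration; check the Querier documentation for the supported form.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(base_url, promql, store_addresses):
    """Build an instant-query URL restricted to the given store addresses.

    The matcher format is an assumption for this sketch, not taken from
    the Thanos documentation.
    """
    params = [("query", promql)]
    for addr in store_addresses:
        params.append(("storeMatch[]", '{__address__="%s"}' % addr))
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder Querier address and store address.
    url = build_query_url("http://localhost:10902", "up", ["10.0.0.1:10901"])
    print(url)
    print(parse_qs(urlsplit(url).query))
```

Repeating the parameter once per address keeps each matcher independent, which is the usual way list-valued parameters ending in `[]` are sent.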
v0.15.0 - 2020.09.07
Highlights:
- Added new Thanos component: Query Frontend responsible for response caching, query scheduling and parallelization (based on Cortex Query Frontend).
- Added various new, improved UIs to Thanos based on React: Querier BuildInfo & Flags, Ruler UI, BlockViewer.
- Optimized Sidecar, Store, Receive, Ruler data retrieval with new TSDB ChunkIterator (capping chunks to 120 samples), which fixed various leaks.
- Fixed sample limit on Store Gateway.
- Added S3 Server Side Encryption options.
- Tons of other important fixes!
- #2665 Swift: Fix issue with missing Content-Type HTTP headers.
- #2800 Query: Fix handling of --web.external-prefix and --web.route-prefix.
- #2834 Query: Fix rendered JSON state value for rules and alerts should be in lowercase.
- #2866 Receive, Querier: Fixed leaks on receive and querier Store API Series, which were leaking on errors.
- #2937 Receive: Fixing auto-configuration of --receive.local-endpoint.
- #2895 Compact: Fix increment of thanos_compact_downsample_total metric for downsample of 5m resolution blocks.
- #2858 Store: Fix --store.grpc.series-sample-limit implementation. The limit is now applied to the sum of all samples fetched across all queried blocks via a single Series call, instead of applying it individually to each block.
- #2936 Compact: Fix ReplicaLabelRemover panic when replicaLabels are not specified.
- #2956 Store: Fix fetching of chunks bigger than 16000 bytes.
- #2970 Store: Upgrade minio-go/v7 to fix slowness when running on EKS.
- #2957 Rule: breaking ⚠️ Now sets all of the relevant fields properly; avoids a panic when /api/v1/rules is called and the time zone is not UTC; rules field is an empty array now if no rules have been defined in a rule group. Thanos Rule's /api/v1/rules endpoint no longer returns the old, deprecated partial_response_strategy. The old, deprecated value has been fixed to WARN for quite some time. Please use partialResponseStrategy.
- #2976 Query: Better rounding for incoming query timestamps.
- #2929 Mixin: Fix expression for 'unhealthy sidecar' alert and increase the timeout for 10 minutes.
- #3024 Query: Consider group name and file for deduplication.
- #3012 Ruler,Receiver: Fix TSDB to delete blocks in atomic way.
- #3046 Ruler,Receiver: Fixed framing of StoreAPI response, it was one chunk by one.
- #3095 Ruler: Update the manager when all rule files are removed.
- #3105 Querier: Fix overwriting maxSourceResolution when auto downsampling is enabled.
- #3010 Querier: Added --query.lookback-delta flag to override the default lookback delta in PromQL. The lookback delta should be set to at least 2 times the slowest scrape interval. If unset it will use the PromQL default of 5m.
- #2305 Receive,Sidecar,Ruler: Propagate correct (stricter) MinTime for TSDBs that have no block.
- #2849 Query, Ruler: Added request logging for HTTP server side.
- #2832 ui React: Add runtime and build info page
- #2926 API: Add new blocks HTTP API to serve blocks metadata. The status endpoints (/api/v1/status/flags, /api/v1/status/runtimeinfo and /api/v1/status/buildinfo) are now available on all components with a HTTP API.
- #2892 Receive: Receiver fails when the initial upload fails.
- #2865 ui: Migrate Thanos Ruler UI to React
- #2964 Query: Add time range parameters to label APIs. Add start and end fields to Store API LabelNamesRequest and LabelValuesRequest.
- #2996 Sidecar: Add reloader_config_apply_errors_total metric. Add new flags --reloader.watch-interval, and --reloader.retry-interval.
- #2973 Add Thanos Query Frontend component.
- #2980 Bucket Viewer: Migrate block viewer to React.
- #2725 Add bucket index operation durations: thanos_bucket_store_cached_series_fetch_duration_seconds and thanos_bucket_store_cached_postings_fetch_duration_seconds.
- #2931 Query: Allow passing a storeMatch[] to select matching stores when debugging the querier. See documentation
- #2893 Store: Rename metric thanos_bucket_store_cached_postings_compression_time_seconds to thanos_bucket_store_cached_postings_compression_time_seconds_total.
- #2915 Receive,Ruler: Enable TSDB directory locking by default. Add a new flag (--tsdb.no-lockfile) to override behavior.
- #2902 Querier UI: Separate dedupe and partial response checkboxes per panel in new UI.
- #2991 Store: breaking ⚠️ operation label value getrange changed to get_range for thanos_store_bucket_cache_operation_requests_total and thanos_store_bucket_cache_operation_hits_total to be consistent with bucket operation metrics.
- #2876 Receive,Ruler: Updated TSDB and switched to ChunkIterators instead of sample one, which avoids unnecessary decoding / encoding.
- #3064 s3: breaking ⚠️ Add SSE/SSE-KMS/SSE-C configuration. The S3 encrypt_sse: true option is now deprecated in favour of sse_config. If you used encrypt_sse, the migration strategy is to set up the following block:
sse_config:
  type: SSE-S3
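The encrypt_sse migration from #3064 can also be applied programmatically to an already parsed S3 configuration. The sketch below works on a plain dict; `migrate_sse` is a hypothetical helper, and keys other than encrypt_sse and sse_config are passed through untouched.

```python
import copy


def migrate_sse(s3_config):
    """Return a copy of an S3 config dict that uses sse_config.

    encrypt_sse is dropped in all cases; sse_config is only added when
    encrypt_sse was truthy and no sse_config is present yet.
    """
    migrated = copy.deepcopy(s3_config)
    encrypt = migrated.pop("encrypt_sse", None)
    if encrypt and "sse_config" not in migrated:
        migrated["sse_config"] = {"type": "SSE-S3"}
    return migrated


if __name__ == "__main__":
    # "bucket" is an illustrative key; only the SSE keys matter here.
    old = {"bucket": "example-bucket", "encrypt_sse": True}
    print(migrate_sse(old))
```

Working on a copy leaves the original configuration intact, so the old and new forms can be compared before anything is written back.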
v0.14.0 - 2020.07.10
- #2637 Compact: Detect retryable errors that are inside of a wrapped tsdb.MultiError.
- #2648 Store: Allow index cache and caching bucket to be configured at the same time.
- #2728 Query: Fixed panics when using larger number of replica labels with short series label sets.
- #2787 Update Prometheus mod to pull in prometheus/prometheus#7414.
- #2807 Store: Decreased memory allocations while querying block's index.
- #2809 Query: /api/v1/stores now guarantees to return a string in the lastError field.
- #2658 #2703 Upgrade to Prometheus @3268eac2ddda which is after v2.18.1.
- TSDB now does memory-mapping of Head chunks and reduces memory usage.
- #2667 Store: Removed support for the legacy index.cache.json. The hidden flag --store.disable-index-header was removed.
- #2613 Store: Renamed the caching bucket config option chunk_object_size_ttl to chunk_object_attrs_ttl.
- #2667 Compact: The deprecated flag --index.generate-missing-cache-file and the metric thanos_compact_generated_index_total were removed.
- #2671 breaking Tools: Bucket replicate flag --resolution is now in Go duration format.
- #2671 Tools: Bucket replicate now replicates by default all blocks.
- #2739 Changed bucket tool bucket verify --id-whitelist flag to --id.
- #2748 Upgrade Prometheus to @66dfb951c4ca which is after v2.19.0.
- PromQL now allows us to execute concurrent selects.
- #2671 Tools: Bucket replicate now allows passing repeated --compaction and --resolution flags.
- #2754 UI: Add stores page in the React UI.
- #2752 Compact: Add flag --block-viewer.global.sync-block-interval to configure metadata sync interval for the bucket UI.
v0.13.0 - 2020.06.22
- #2548 Query: Fixed rare cases of double counter reset accounting when querying rate with deduplication enabled.
- #2536 S3: Fixed AWS STS endpoint url to https for Web Identity providers on AWS EKS.
- #2501 Query: Gracefully handle additional fields in SeriesResponse protobuf message that may be added in the future.
- #2568 Query: Don't close the connection of strict, static nodes if establishing a connection had succeeded but Info() call failed.
- #2615 Rule: Fix bugs where rules were out of sync.
- #2614 Tracing: Disabled Elastic APM Go Agent default tracer on initialization to disable the default metric gatherer.
- #2525 Query: Fixed logging for dns resolution error in the Query component.
- #2484 Query/Ruler: Fixed issue #2483, when web.route-prefix is set, it is added twice in HTTP router prefix.
- #2416 Bucket: Fixed issue #2416 bug in inspect --sort-by doesn't work correctly in all cases.
- #2719 Query: irate and resets use now counter downsampling aggregations.
- #2705 minio-go: Added support for af-south-1 and eu-south-1 regions.
- #2753 Sidecar, Receive, Rule: Fixed possibility of out of order uploads in error cases. This could potentially cause Compactor to create overlapping blocks.
- #2012 Receive: Added multi-tenancy support (based on header)
- #2502 StoreAPI: Added hints field to SeriesResponse. Hints are an opaque data structure that can be used to carry additional information from the store and its content is implementation specific.
- #2521 Sidecar: Added thanos_sidecar_reloader_reloads_failed_total, thanos_sidecar_reloader_reloads_total, thanos_sidecar_reloader_watch_errors_total, thanos_sidecar_reloader_watch_events_total and thanos_sidecar_reloader_watches metrics.
- #2412 UI: Added React UI from Prometheus upstream. Currently only accessible from Query component as only /graph endpoint is migrated.
- #2532 Store: Added hidden option --store.caching-bucket.config=<yaml content> (or --store.caching-bucket.config-file=<file.yaml>) for experimental caching bucket, that can cache chunks into shared memcached. This can speed up querying and reduce number of requests to object storage.
- #2579 Store: Experimental caching bucket can now cache metadata as well. Config has changed from #2532.
- #2526 Compact: In case there are no labels left after deduplication via --deduplication.replica-label, assign first replica-label with value deduped.
- #2621 Receive: Added flag to configure forward request timeout. Receive write will complete request as soon as quorum of writes succeeds.
- #2194 Updated to golang v1.14.2.
- #2505 Store: Removed obsolete thanos_store_node_info metric.
- #2513 Tools: Moved thanos bucket commands to thanos tools bucket, also moved thanos check rules to thanos tools rules-check. thanos tools rules-check also takes rules by --rules repeated flag not argument anymore.
- #2548 Store, Querier: remove duplicated chunks on StoreAPI.
- #2596 Updated Prometheus dependency to @cd73b3d33e064bbd846fc7a26dc8c313d46af382 which falls in between v2.17.0 and v2.18.0.
- Receive,Rule: TSDB now supports isolation of append and queries.
- Receive,Rule: TSDB now holds less WAL files after Head Truncation.
- #2450 Store: Added Regex-set optimization for label=~"a|b|c" matchers.
- #2526 Compact: In case there are no labels left after deduplication via --deduplication.replica-label, assign first replica-label with value deduped.
- #2603 Store/Querier: Significantly optimize cases where StoreAPIs or blocks returns exact overlapping chunks (e.g Store GW and sidecar or brute force Store Gateway HA).
v0.12.2 - 2020.04.30
- #2459 Compact: Fixed issue with old blocks being marked and deleted in a (slow) loop.
- #2533 Rule: do not wrap reload endpoint with /. Makes /-/reload accessible again when no prefix has been specified.
v0.12.1 - 2020.04.20
- #2411 Query: fix a bug where queries might not time out sometimes due to issues with one or more StoreAPIs.
- #2475 Store: remove incorrect optimizations for queries with =~".*" and !=~".*" matchers.
- #2472 Compact: fix a bug where partial blocks were never deleted, causing spam of warnings.
- #2474 Store: fix a panic caused by concurrent memory access during block filtering.
v0.12.0 - 2020.04.15
- #2288 Ruler: fixes issue #2281, a bug causing incorrect parsing of query address with path prefix.
- #2238 Ruler: fixed issue #2204, where a bug in alert queue signaling filled up the queue and alerts were dropped.
- #2231 Bucket Web: sort chunks by thanos.downsample.resolution for better grouping.
- #2254 Bucket: fix issue where metrics were registered multiple times in bucket replicate.
- #2271 Bucket Web: fixed issue #2260, where the bucket passes null when storage is empty.
- #2339 Query: fix a bug where --store.unhealthy-timeout was never respected.
- #2208 Query and Rule: fix handling of web.route-prefix to correctly handle / and prefixes that do not begin with a /.
- #2311 Receive: ensure receive component serves TLS when TLS configuration is provided.
- #2319 Query: fixed inconsistent naming of metrics.
- #2390 Store: fixed bug that was causing all posting offsets to be used instead of only 1/32 as intended; added hidden flag to control this behavior.
- #2393 Store: fixed bug causing certain not-existing label values queried to fail with "invalid-size" error from binary header.
- #2382 Store: fixed bug causing partial writes of index-header.
- #2383 Store: handle expected errors correctly, e.g. do not increment failure counters.
- #2252 Query: add new --store-strict flag. More information available here.
- #2265 Compact: add --wait-interval to specify compaction wait interval between consecutive compact runs when --wait is enabled.
- #2250 Compact: enable vertical compaction for offline deduplication (experimental). Uses --deduplication.replica-label flag to specify the replica label on which to deduplicate (hidden). Please note that this uses a NAIVE algorithm for merging (no smart replica deduplication, just chaining samples together). This works well for deduplication of blocks with precisely the same samples like those produced by Receiver replication. We plan to add a smarter algorithm in the following weeks.
- #1714 Compact: the compact component now exposes the bucket web UI when it is run as a long-lived process.
- #2304 Store: added max_item_size configuration option to memcached-based index cache. This should be set to the max item size configured in memcached (-I flag) in order to not waste network round-trips to cache items larger than the limit configured in memcached.
- #2297 Store: add --experimental.enable-index-cache-postings-compression flag to enable re-encoding and compressing postings before storing them into the cache. Compressed postings take about 10% of the original size.
- #2357 Compact and Store: the compact and store components now serve the bucket UI on :<http-port>/loaded, which shows exactly the blocks that are currently seen by compactor and the store gateway. The compactor also serves a different bucket UI on :<http-port>/global, which shows the status of object storage without any filters.
- #2172 Store: add support for sharding the store component based on the label hash.
- #2113 Bucket: added thanos bucket replicate command to replicate blocks from one bucket to another.
- #1922 Docs: create a new document to explain sharding in Thanos.
- #2230 Store: optimize conversion of labels.
- #2136 breaking Store, Compact, Bucket: schedule block deletion by adding deletion-mark.json. This adds a consistent way for multiple readers and writers to access object storage.
Since there are no consistency guarantees provided by some Object Storage providers, this PR adds a consistent lock-free way of dealing with Object Storage irrespective of the choice of object storage. In order to achieve this co-ordination, blocks are not deleted directly. Instead, blocks are marked for deletion by uploading the deletion-mark.json file for the block that was chosen to be deleted. This file contains Unix time of when the block was marked for deletion. If you want to keep existing behavior, you should add --delete-delay=0s as a flag.
- #2090 breaking Downsample command: the downsample command has moved and is now a sub-command of the thanos bucket sub-command; it cannot be called via thanos downsample any more.
- #2294 Store: optimizations for fetching postings. Queries using =~".*" matchers or negation matchers (!=... or !~...) benefit the most.
- #2301 Ruler: exit with an error when initialization fails.
- #2310 Query: report timespan 0 to 0 when discovering no stores.
- #2330 Store: index-header is no longer experimental. It is enabled by default for store Gateway. You can disable it with new hidden flag: --store.disable-index-header. The --experimental.enable-index-header flag was removed.
- #1848 Ruler: allow returning error messages when a reload is triggered via HTTP.
- #2270 All: Thanos components will now print stack traces when they error out.
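The delayed deletion introduced in #2136 above boils down to comparing the mark time with the configured delay. A minimal sketch follows, assuming the mark file is JSON with a `deletion_time` field holding the Unix timestamp; the field name and the `is_deletable` helper are assumptions for illustration, not taken from this changelog.

```python
import json
import time


def is_deletable(mark_json, delete_delay_seconds, now=None):
    """Return True once the delete delay has elapsed since the block was marked.

    mark_json: contents of a deletion mark file (assumed layout).
    now: Unix time to compare against; defaults to the current time.
    """
    mark = json.loads(mark_json)
    now = time.time() if now is None else now
    return now - mark["deletion_time"] >= delete_delay_seconds


if __name__ == "__main__":
    mark = json.dumps({"deletion_time": time.time() - 3600})
    print("delay 0s :", is_deletable(mark, 0))
    print("delay 48h:", is_deletable(mark, 48 * 3600))
```

With a delay of zero the block is eligible as soon as it is marked, which corresponds to the --delete-delay=0s setting mentioned in the entry.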
v0.11.0 - 2020.03.02
- #2033 Minio-go: Fixed Issue #1494 support Web Identity providers for IAM credentials for AWS EKS.
- #1985 Store Gateway: Fixed case where series entry is larger than 64KB in index.
- #2051 Ruler: Fixed issue where ruler does not expose shipper metrics.
- #2101 Ruler: Fixed bug where thanos_alert_sender_errors_total was not registered.
- #1789 Store Gateway: Improve timeouts.
- #2139 Properly handle SIGHUP for reloading.
- #2040 UI: Fix URL of alerts in Ruler
- #2033 Ruler: Fix tracing in Thanos Ruler
- #2003 Query: Support downsampling for /series.
- #1952 Store Gateway: Implemented binary index header. This significantly reduces resource consumption (memory, CPU, net bandwidth) for startup and data loading processes as well as baseline memory. This means that adding more blocks into object storage, without querying them will use almost no resources. This, however, still means that querying large amounts of data will result in high spikes of memory and CPU use as before, due to simply fetching large amounts of metrics data. Since we fixed baseline, we are now focusing on query performance optimizations in separate initiatives. To enable experimental index-header mode run store with hidden experimental.enable-index-header flag.
- #2009 Store Gateway: Minimum age of all blocks before they are being read. Set it to a safe value (e.g 30m) if your object storage is eventually consistent. GCS and S3 are (roughly) strongly consistent.
- #1963 Mixin: Add Thanos Ruler alerts.
- #1984 Query: Add cache-control header to not cache on error.
- #1870 UI: Persist settings in query.
- #1969 Sidecar: allow setting http connection pool size via flags.
- #1967 Receive: Allow local TSDB compaction.
- #1939 Ruler: Add TLS and authentication support for query endpoints with the --query.config and --query.config-file CLI flags. See documentation for further information.
- #1982 Ruler: Add support for Alertmanager v2 API endpoints.
- #2030 Query: Add thanos_proxy_store_empty_stream_responses_total metric for number of empty responses from stores.
- #2049 Tracing: Support sampling on Elastic APM with new sample_rate setting.
- #2008 Querier, Receiver, Sidecar, Store: Add gRPC health check endpoints.
- #2145 Tracing: track query sent to prometheus via remote read api.
- #1970 breaking Receive: Use gRPC for forwarding requests between peers. Note that existing values for the --receive.local-endpoint flag and the endpoints in the hashring configuration file must now specify the receive gRPC port and must be updated to be a simple host:port combination, e.g. 127.0.0.1:10901, rather than a full HTTP URL, e.g. http://127.0.0.1:10902/api/v1/receive.
- #1933 Add a flag --tsdb.wal-compression to configure whether to enable tsdb wal compression in ruler and receiver.
- #2021 Rename metric thanos_query_duplicated_store_address to thanos_query_duplicated_store_addresses_total and thanos_rule_duplicated_query_address to thanos_rule_duplicated_query_addresses_total.
- #2166 Bucket Web: improve the tooltip for the bucket UI; it was reconstructed and now exposes much more information about blocks.
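The endpoint change required by #1970 can be scripted when many hashring entries have to be updated. The helper below is a hypothetical sketch: it keeps the host from the old HTTP URL and attaches a gRPC port that the caller must supply, since the gRPC port cannot be derived from the HTTP one. IPv6 literals are not handled.

```python
from urllib.parse import urlsplit


def to_grpc_endpoint(http_url, grpc_port):
    """Turn a full HTTP receive URL into a plain host:port gRPC endpoint."""
    host = urlsplit(http_url).hostname
    if host is None:
        raise ValueError("not a full URL: %r" % (http_url,))
    return "%s:%d" % (host, grpc_port)


if __name__ == "__main__":
    # Values taken from the example in the #1970 entry.
    print(to_grpc_endpoint("http://127.0.0.1:10902/api/v1/receive", 10901))
```

Values that already are plain host:port combinations have no URL scheme, so the helper rejects them rather than guessing.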
v0.10.1 - 2020.01.24
- #2015 Sidecar: Querier /api/v1/series bug fixed when time range was ignored inside sidecar. The bug was noticeable for example when using Grafana template variables.
- #2120 Bucket Web: Set state of status prober properly.
v0.10.0 - 2020.01.13
- #1919 Compactor: Fixed potential data loss when uploading older blocks, or upload taking long time while compactor is running.
- #1937 Compactor: Improved synchronization of meta JSON files. Compactor now properly handles partial block uploads for all operation like retention apply, downsampling and compaction. Additionally:
- Removed thanos_compact_sync_meta_* metrics. Use thanos_blocks_meta_* metrics instead.
- Added thanos_consistency_delay_seconds and thanos_compactor_aborted_partial_uploads_deletion_attempts_total metrics.
- #1936 Store: Improved synchronization of meta JSON files. Store now properly handles corrupted disk cache. Added meta.json sync metrics.
- #1856 Receive: close DBReadOnly after flushing to fix a memory leak.
- #1882 Receive: upload to object storage as 'receive' rather than 'sidecar'.
- #1907 Store: Fixed the duration unit for the metric thanos_bucket_store_series_gate_duration_seconds.
- #1931 Compact: Fixed the compactor successfully exiting when actually an error occurred while compacting a blocks group.
- #1872 Ruler: /api/v1/rules now shows a properly formatted value
- #1945 master container images are now built with Go 1.13
- #1956 Ruler: now properly ignores duplicated query addresses
- #1975 Store Gateway: fixed panic caused by memcached servers selector when there's 1 memcached node
- #1852 Add support for `AWS_CONTAINER_CREDENTIALS_FULL_URI` by upgrading to minio-go v6.0.44
- #1854 Update Rule UI to support alerts count displaying and filtering.
- #1838 Ruler: Add TLS and authentication support for Alertmanager with the `--alertmanagers.config` and `--alertmanagers.config-file` CLI flags. See documentation for further information.
- #1838 Ruler: Add a new `--alertmanagers.sd-dns-interval` CLI option to specify the interval between DNS resolutions of Alertmanager hosts.
- #1881 Store Gateway: memcached support for index cache. See documentation for further information.
- #1904 Add a skip-chunks option in Store Series API to improve the response time of `/api/v1/series` endpoint.
- #1910 Query: `/api/v1/labels` now understands `POST` - useful for sending bigger requests
- #1947 Upgraded Prometheus dependencies to v2.15.2. This includes:
  - Compactor: Significant reduction of memory footprint for compaction and downsampling process.
  - Querier: Accepting spaces between time range and square bracket, e.g. `[ 5m]`
  - Querier: Improved PromQL parser performance.
- #1833 `--shipper.upload-compacted` flag has been promoted to non hidden, non experimental state. More info available here.
- #1867 Ruler: now sets a `Thanos/$version` `User-Agent` in requests
- #1887 Service discovery now deduplicates targets between different target groups
v0.9.0 - 2019.12.03
- #1678 Add Lightstep as a tracing provider.
- #1687 Add a new `--grpc-grace-period` CLI option to components which serve gRPC to set how long to wait until gRPC Server shuts down.
- #1660 Sidecar: Add a new `--prometheus.ready_timeout` CLI option to the sidecar to set how long to wait until Prometheus starts up.
- #1573 `AliYun OSS` object storage, see documents for further information.
- #1680 Add a new `--http-grace-period` CLI option to components which serve HTTP to set how long to wait until HTTP Server shuts down.
- #1712 Bucket: Rename flag on bucket web component from `--listen` to `--http-address` to match other components.
- #1733 Compactor: New metric `thanos_compactor_iterations_total` on Thanos Compactor which shows the number of successful iterations.
- #1758 Bucket: `thanos bucket web` now supports `--web.external-prefix` for proxying on a subpath.
- #1770 Bucket: Add `--web.prefix-header` flags to allow for bucket UI to be accessible behind a reverse proxy.
- #1668 Receiver: Added TLS options for both server and client remote write.
- #1656 Store Gateway: Store now starts metric and status probe HTTP server earlier in its start-up sequence. `/-/healthy` endpoint now starts to respond with success earlier. `/metrics` endpoint starts serving metrics earlier as well. Make sure to point your readiness probes to the `/-/ready` endpoint rather than `/metrics`.
- #1669 Store Gateway: Fixed store sharding. Now it does not load excluded meta.jsons and load/fetch index-cache.json files.
- #1670 Sidecar: Fixed un-ordered blocks upload. Sidecar now uploads the oldest blocks first.
- #1568 Store Gateway: Store now retains the first raw value of a chunk during downsampling to avoid losing some counter resets that occur on an aggregation boundary.
- #1751 Querier: Fixed labels for StoreUI
- #1773 Ruler: Fixed the /api/v1/rules endpoint that returned 500 status code with `failed to assert type of rule ...` message.
- #1770 Querier: Fixed `--web.external-prefix` 404s for static resources.
- #1785 Ruler: The /api/v1/rules endpoint now returns the original rule filenames.
- #1791 Ruler: Ruler now supports identical rule filenames in different directories.
- #1562 Querier: Downsampling option now carries through URL.
- #1675 Querier: Reduced resource usage while using certain queries like `offset`.
- #1725 & #1718 Store Gateway: Per request memory improvements.
- #1666 Compact: `thanos_compact_group_compactions_total` now counts block compactions, that is, operations that resulted in a compacted block. The old behaviour is now exposed by the new metrics `thanos_compact_group_compaction_runs_started_total` and `thanos_compact_group_compaction_runs_completed_total`, which count compaction runs overall.
- #1748 Updated all dependencies.
- #1694 `prober_ready` and `prober_healthy` metrics are removed, for sake of `status`. Now `status` exposes same metric with a label, `check`. `check` can have "healthy" or "ready" depending on status of the probe.
- #1790 Ruler: Fixes subqueries support for ruler.
- #1769 & #1545 Adjusted most of the metrics histogram buckets.
v0.8.1 - 2019.10.14
- #1632 Removes the duplicated external labels detection on Thanos Querier; warning only; Made Store Gateway compatible with older Querier versions.
  - NOTE: `thanos_store_nodes_grpc_connections` metric is now per `external_labels` and `store_type`. It is a recommended metric for Querier storeAPIs. `thanos_store_node_info` is marked as obsolete and will be removed in next release.
  - NOTE2: Store Gateway is now advertising artificial: `"@thanos_compatibility_store_type=store"` label. This is to have the current Store Gateway compatible with Querier pre v0.8.0. This label can be disabled by hidden `debug.advertise-compatibility-label=false` flag on Store Gateway.
v0.8.0 - 2019.10.10
Lots of improvements this release! Noteworthy items:
- First Katacoda tutorial! 🐱
- Fixed Deletion order causing Compactor to produce unneeded 👻 blocks with missing random files.
- Store GW memory improvements (more to come!).
- Querier allows multiple deduplication labels.
- Both Compactor and Store Gateway can be sharded within the same bucket using relabelling!
- Sidecar exposed data from Prometheus can be now limited to given `min-time` (e.g 3h only).
- Numerous Thanos Receive improvements.
Make sure you check out Prometheus 2.13.0 as well. New release drastically improves usage and resource consumption of both Prometheus and sidecar with Thanos: https://prometheus.io/blog/2019/10/10/remote-read-meets-streaming/
- #1619 Thanos sidecar allows to limit min time range for data it exposes from Prometheus.
- #1583 Thanos sharding:
  - Add relabel config (`--selector.relabel-config-file` and `selector.relabel-config`) into Thanos Store and Compact components. Selecting blocks to serve depends on the result of block labels relabeling.
  - For store gateway, advertise labels from "approved" blocks.
- #1540 Thanos Downsample added `/-/ready` and `/-/healthy` endpoints.
- #1538 Thanos Rule added `/-/ready` and `/-/healthy` endpoints.
- #1537 Thanos Receive added `/-/ready` and `/-/healthy` endpoints.
- #1460 Thanos Store Added `/-/ready` and `/-/healthy` endpoints.
- #1534 Thanos Query Added `/-/ready` and `/-/healthy` endpoints.
- #1533 Thanos inspect now supports the timeout flag.
- #1496 Thanos Receive now supports setting block duration.
- #1362 Optional `replicaLabels` param for `/query` and `/query_range` querier endpoints. When provided, it overrides the `query.replica-label` CLI flags.
- #1482 Thanos now supports Elastic APM as tracing provider.
- #1612 Thanos Rule added `resendDelay` flag.
- #1480 Thanos Receive flushes storage on hashring change.
- #1613 Thanos Receive now traces forwarded requests.
- #1362 `query.replica-label` configuration can be provided more than once for multiple deduplication labels like: `--query.replica-label=prometheus_replica --query.replica-label=service`.
- #1581 Thanos Store now can use smaller buffer sizes for Bytes pool; reducing memory for some requests.
- #1622 & #1590 Upgraded to Go 1.13.1
- #1498 Thanos Receive changed flag `labels` to `label` to be consistent with other commands.
- #1525 Thanos now deletes block's file in correct order allowing to detect partial blocks without problems.
- #1505 Thanos Store now removes invalid local cache blocks.
- #1587 Thanos Sidecar cleans up all cache dirs after each compaction run.
- #1582 Thanos Rule correctly parses Alertmanager URL if there is more than one `+` in it.
- #1544 Iterating over object store is resilient to the edge case for some providers.
- #1469 Fixed Azure potential failures (EOF) when requesting more data than blob has.
- #1512 Thanos Store fixed memory leak for chunk pool.
- #1488 Thanos Rule now correctly links to query URL from rules and alerts.
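To illustrate what multiple deduplication labels (#1362 above) mean in practice, the sketch below groups series whose label sets differ only in the configured replica labels. It is a conceptual illustration, not the Querier's implementation: the function name and the sample series are hypothetical, and the way samples inside a group get merged is not covered here.

```python
from collections import defaultdict


def group_by_dedup_key(series, replica_labels):
    """Group label sets that are identical once replica labels are dropped."""
    groups = defaultdict(list)
    for labels in series:
        # The dedup key is the label set without any replica label.
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in replica_labels))
        groups[key].append(labels)
    return dict(groups)


# Hypothetical series, using the two replica labels from the changelog example.
series = [
    {"__name__": "up", "job": "api", "prometheus_replica": "a", "service": "x"},
    {"__name__": "up", "job": "api", "prometheus_replica": "b", "service": "y"},
    {"__name__": "up", "job": "db", "prometheus_replica": "a", "service": "x"},
]
groups = group_by_dedup_key(series, {"prometheus_replica", "service"})
print(len(groups))  # 2
```

With only `prometheus_replica` configured, the first two series would stay separate because their `service` values differ.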
v0.7.0 - 2019.09.02
Accepted into CNCF:
- Thanos moved to new repository https://github.com/thanos-io/thanos
- Docker images moved to https://quay.io/thanos/thanos and mirrored at https://hub.docker.com/r/thanosio/thanos
- Slack moved to https://slack.cncf.io `#thanos`/`#thanos-dev`/`#thanos-prs`
- #1478 Thanos components now expose gRPC server metrics as soon as server starts, to provide more reliable data for instrumentation.
- #1378 Thanos Receive now exposes `thanos_receive_config_hash`, `thanos_receive_config_last_reload_successful` and `thanos_receive_config_last_reload_success_timestamp_seconds` metrics to track latest configuration change
- #1268 Thanos Sidecar added support for newest Prometheus streaming remote read added here. This massively improves memory required by single request for both Prometheus and sidecar. Single requests now should take constant amount of memory on sidecar, so resource consumption prediction is now straightforward. This will be used if you have Prometheus `2.13` or `2.12-master`.
- #1358 Added `part_size` configuration option for HTTP multipart requests minimum part size for S3 storage type
- #1363 Thanos Receive now exposes `thanos_receive_hashring_nodes` and `thanos_receive_hashring_tenants` metrics to monitor status of hash-rings
- #1395 Thanos Sidecar added `/-/ready` and `/-/healthy` endpoints to Thanos sidecar.
- #1297 Thanos Compact added `/-/ready` and `/-/healthy` endpoints to Thanos compact.
- #1431 Thanos Query added hidden flag to allow the use of downsampled resolution data for instant queries.
- #1408 Thanos Store Gateway now allows specifying the time ranges it will serve (time sharding). Flags: `min-time` & `max-time`
- #1414 Upgraded important dependencies: Prometheus to 2.12-rc.0. TSDB is now part of Prometheus.
- #1380 Upgraded important dependencies: Prometheus to 2.11.1 and TSDB to 0.9.1. Some changes affecting Querier:
- [ENHANCEMENT] Query performance improvement: Efficient iteration and search in HashForLabels and HashWithoutLabels. #5707
- [ENHANCEMENT] Optimize queries using regexp for set lookups. tsdb#602
- [BUGFIX] prometheus_tsdb_compactions_failed_total is now incremented on any compaction failure. tsdb#613
- [BUGFIX] PromQL: Correctly display `{__name__="a"}`.
- #1338 Thanos Query still warns on store API duplicate, but allows a single one from duplicated set. This is to gracefully warn about the problematic logic and not disrupt immediately.
- #1385 Thanos Compact exposes flag to disable downsampling `downsampling.disable`.
- #1327 Thanos Query `/series` API end-point now properly returns an empty array just like Prometheus if there are no results
- #1302 Thanos now efficiently reuses HTTP keep-alive connections
- #1371 Thanos Receive fixed race condition in hashring
- #1430 Thanos fixed value of GOMAXPROCS inside container.
- #1410 Fix for CVE-2019-10215
- #1458 Thanos Query and Receive now use common instrumentation middleware. As a result, for sake of `http_requests_total` and `http_request_duration_seconds_bucket`, Thanos Query no longer exposes `thanos_query_api_instant_query_duration_seconds`, `thanos_query_api_range_query_duration_second` metrics and Thanos Receive no longer exposes `thanos_http_request_duration_seconds`, `thanos_http_requests_total`, `thanos_http_response_size_bytes`.
- #1423 Thanos Bench deprecated.
v0.6.0 - 2019.07.18
- #1097 Added `thanos check rules` linter for Thanos rule rules files.
- #1253 Add support for specifying a maximum amount of retries when using Azure Blob storage (default: no retries).
- #1244 Thanos Compact now exposes new metrics `thanos_compact_downsample_total` and `thanos_compact_downsample_failures_total` which are useful to catch when errors happen
- #1260 Thanos Query/Rule now exposes metrics `thanos_querier_store_apis_dns_provider_results` and `thanos_ruler_query_apis_dns_provider_results` which tell how many addresses were configured and how many were actually discovered respectively
- #1248 Add a web UI to show the state of remote storage.
- #1217 Thanos Receive gained basic hashring support
- #1262 Thanos Receive got a new metric `thanos_http_requests_total` which shows how many requests were handled by it
- #1243 Thanos Receive got an ability to forward time series data between nodes. Now you can pass the hashring configuration via `--receive.hashrings-file`; the refresh interval `--receive.hashrings-file-refresh-interval`; the name of the local node `--receive.local-endpoint`; and finally the header's name which is used to determine the tenant `--receive.tenant-header`.
- #1147 Support for the Jaeger tracer has been added!
breaking New common flags were added for configuring tracing: `--tracing.config-file` and `--tracing.config`. You can either pass a file to Thanos with the tracing configuration or pass it in the command line itself. Old `--gcloudtrace.*` flags were removed.
To migrate over the old `--gcloudtrace.*` configuration, your tracing configuration should look like this:
---
type: STACKDRIVER
config:
- service_name: 'foo'
  project_id: '123'
  sample_factor: 123
The other `type` you can use is `JAEGER` now. The `config` keys and values are Jaeger specific and you can find all of the information here.
- #1284 Add support for multiple label-sets in Info gRPC service. This deprecates the single `Labels` slice of the `InfoResponse`; in a future release backward compatible handling for the single set of Labels will be removed. Upgrading to v0.6.0 or higher is advised. breaking If you have duplicate queries in your Querier configuration with hierarchical federation of multiple Queries, this PR makes Thanos Querier detect this case and block all duplicates. Refer to 0.6.1 which at least allows for single replica to work.
- #1314 Removes `http_request_duration_microseconds` (Summary) and adds `http_request_duration_seconds` (Histogram) from http server instrumentation used in Thanos APIs and UIs.
- #1287 Sidecar now waits on Prometheus' external labels before starting the uploading process
- #1261 Thanos Receive now exposes metrics `thanos_http_request_duration_seconds` and `thanos_http_response_size_bytes` properly of each handler
- #1274 Iteration limit has been lifted from the LRU cache so there should be no more spam of error messages as they were harmless
- #1321 Thanos Query now fails early on a query which only uses external labels - this improves clarity in certain situations
- #1227 Some context handling issues were fixed in Thanos Compact; some unnecessary memory allocations were removed in the hot path of Thanos Store.
- #1183 Compactor now correctly propagates retriable/haltable errors which means that it will not unnecessarily restart if such an error occurs
- #1231 Receive now correctly handles SIGINT and closes without deadlocking
- #1278 Fixed inflated values problem with `sum()` on Thanos Query
- #1280 Fixed a problem with concurrent writes to a `map` in Thanos Query while rendering the UI
- #1311 Fixed occasional panics in Compact and Store when using Azure Blob cloud storage caused by lack of error checking in client library.
- #1322 Removed duplicated closing of the gRPC listener - this gets rid of harmless messages like `store gRPC listener: close tcp 0.0.0.0:10901: use of closed network connection` when those programs are being closed
- #1216 the old "Command-line flags" has been removed from Thanos Query UI since it was not populated and because we are striving for consistency
v0.5.0 - 2019.06.05
TL;DR: Store LRU cache is no longer leaking, Upgraded Thanos UI to Prometheus 2.9, Fixed auto-downsampling, Moved to Go 1.12.5 and more.
This version moved tarballs to Golang 1.12.5 from 1.11 as well, so same warning applies if you use `container_memory_usage_bytes` from cadvisor. Use `container_memory_working_set_bytes` instead.
breaking As announced couple of times this release also removes gossip with all configuration flags (`--cluster.*`).
- #1142 fixed major leak on store LRU cache for index items (postings and series).
- #1163 sidecar is no longer blocking for custom Prometheus versions/builds. It only checks if flags return non 404, then it performs optional checks.
- #1146 store/bucket: make getFor() work with interleaved resolutions.
- #1157 querier correctly handles duplicated stores when some store changes external labels in place.
- #1094 Allow configuring the response header timeout for the S3 client.
- #1118 breaking swift: Added support for cross-domain authentication by introducing `userDomainID`, `userDomainName`, `projectDomainID`, `projectDomainName`. The outdated terms `tenantID`, `tenantName` are deprecated and have been replaced by `projectID`, `projectName`.
- #1066 Upgrade Thanos ui to Prometheus v2.9.1.
Changes from the upstream:
- query:
- rule:
- #1156 Moved CI and docker multistage to Golang 1.12.5 for latest mem alloc improvements.
- #1103 Updated go-cos deps. (COS bucket client).
- #1149 Updated google Golang API deps (GCS bucket client).
- #1190 Updated minio deps (S3 bucket client). This fixes minio retries.
- #1133 Use prometheus v2.9.2, common v0.4.0 & tsdb v0.8.0.
  Changes from the upstreams:
  - store gateway:
    - [ENHANCEMENT] Fast path for EmptyPostings cases in Merge, Intersect and Without.
  - store gateway & compactor:
    - [BUGFIX] Fix fd and vm_area leak on error path in chunks.NewDirReader.
    - [BUGFIX] Fix fd and vm_area leak on error path in index.NewFileReader.
  - query:
    - [BUGFIX] Make sure subquery range is taken into account for selection #5467
    - [ENHANCEMENT] Check for cancellation on every step of a range evaluation. #5131
    - [BUGFIX] Exponentiation operator to drop metric name in result of operation. #5329
    - [BUGFIX] Fix output sample values for scalar-to-vector comparison operations. #5454
  - rule:
    - [BUGFIX] Reload rules: copy state on both name and labels. #5368
- #1008 breaking Removed Gossip implementation. All `--cluster.*` flags removed and Thanos will error out if any is provided.
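Because Thanos now errors out when any `--cluster.*` flag is passed, it can help to scan existing launch arguments before upgrading. The check below is a minimal sketch; the function name and the sample argument list are illustrative only.

```python
def find_removed_gossip_flags(argv):
    """Return the arguments that use the removed --cluster.* flag family."""
    return [arg for arg in argv if arg.startswith("--cluster.")]


# Illustrative argument list; replace it with the arguments from your own manifests.
args = ["query", "--cluster.disable", "--http-address=0.0.0.0:10902"]
print(find_removed_gossip_flags(args))  # ['--cluster.disable']
```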
v0.4.0 - 2019.05.3
This release also disables gossip mode by default for all components. See this for more details.
On Linux, the runtime now uses MADV_FREE to release unused memory. This is more efficient but may result in higher reported RSS. The kernel will reclaim the unused data when it is needed. To revert to the Go 1.11 behavior (MADV_DONTNEED), set the environment variable GODEBUG=madvdontneed=1.
If you want to see exact memory allocation of Thanos process:
- Use `go_memstats_heap_alloc_bytes` metric exposed by Golang or `container_memory_working_set_bytes` exposed by cadvisor.
- Add `GODEBUG=madvdontneed=1` before running Thanos binary to revert memory releasing to the pre 1.12 logic.
Using cadvisor `container_memory_usage_bytes` metric could be misleading e.g: google/cadvisor#2242
- thanos.io website & automation 🎉
- #1053 compactor: Compactor & store gateway now handles incomplete uploads gracefully. Added hard limit on how long block upload can take (30m).
- #811 Remote write receiver component ❤️ ❤️ thanks to RedHat (@brancz) contribution.
- #910 Query's stores UI page is now sorted by type and old DNS or File SD stores are removed after 5 minutes (configurable via the new `--store.unhealthy-timeout=5m` flag).
- #905 Thanos support for Query API: /api/v1/labels. Notice that the API was added in Prometheus v2.6.
- #798 Ability to limit the maximum number of concurrent requests to Series() calls in Thanos Store and the maximum amount of samples we handle.
- #1060 Allow specifying region attribute in S3 storage configuration
Note on `--store.grpc.series-max-concurrency` (added by #798): most likely you will want to make it the same as `--query.max-concurrent` on Thanos Query.
New options:
New Store flags:
* `--store.grpc.series-sample-limit` limits the number of samples that might be retrieved on a single Series() call. By default it is 0. Consider enabling it by setting it to more than 0 if you are running on limited resources.
* `--store.grpc.series-max-concurrency` limits the number of concurrent Series() calls in Thanos Store. By default it is 20. Consider making it lower or higher depending on the scale of your deployment.
New Store metrics:
* `thanos_bucket_store_queries_dropped_total` shows how many queries were dropped due to the samples limit;
* `thanos_bucket_store_queries_concurrent_max` is a constant metric which shows how many Series() calls can concurrently be executed by Thanos Store;
* `thanos_bucket_store_queries_in_flight` shows how many queries are currently "in flight" i.e. they are being executed;
* `thanos_bucket_store_gate_duration_seconds` shows how many seconds it took for queries to pass through the gate in both cases - when that fails and when it does not.
New Store tracing span:
* `store_query_gate_ismyturn` shows how long it took for a query to pass (or not) through the gate.
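The two limits described above can be pictured as a gate in front of each Series() call: a bounded number of slots, plus a per-call sample cap where 0 means no limit. The sketch below is a simplified model under those assumptions, not the Store's actual code; `SeriesGate`, `SampleLimitExceeded` and the `dropped` counter are hypothetical names, and the real Store may enforce the sample limit while data is still being fetched rather than afterwards.

```python
import threading


class SampleLimitExceeded(Exception):
    """Raised when one call returns more samples than the configured limit."""


class SeriesGate:
    def __init__(self, max_concurrency=20, sample_limit=0):
        # Defaults mirror the flag defaults quoted in the changelog (20 and 0).
        self._slots = threading.BoundedSemaphore(max_concurrency)
        self._sample_limit = sample_limit
        self._lock = threading.Lock()
        self.dropped = 0  # rough analogue of a "queries dropped" counter

    def run(self, fetch_samples):
        # Callers block here until one of the concurrency slots is free.
        with self._slots:
            samples = fetch_samples()
            if self._sample_limit and len(samples) > self._sample_limit:
                with self._lock:
                    self.dropped += 1
                raise SampleLimitExceeded(f"{len(samples)} > {self._sample_limit}")
            return samples


gate = SeriesGate(max_concurrency=2, sample_limit=3)
print(gate.run(lambda: [1, 2]))  # [1, 2]
```

Keeping the concurrency value aligned with the Querier's own concurrency setting, as the note above suggests, avoids queries queueing at the gate for no benefit.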
- #1016 Added option for another DNS resolver (miekg/dns client). Note that this is required to have SRV resolution working on Golang 1.11+ with KubeDNS below v1.14
  New Querier and Ruler flag: `--store.sd-dns-resolver`, which allows specifying the resolver to use. Either `golang` or `miekgdns`
- #986 Saves some startup & sync time in store gateway, as it no longer needs to compute the index cache from the block index on its own for larger blocks. The store Gateway can still do it, but it first checks the bucket for an already uploaded index cache. At the same time, compactor precomputes the index cache file on every compaction.
  New Compactor flag: `--index.generate-missing-cache-file` was added to allow quicker addition of index cache files. If enabled it precomputes missing files on compactor startup. Note that it will take time and it's only a one-off step per bucket.
- #887 Compact: Added new `--block-sync-concurrency` flag, which allows you to configure number of goroutines to use when syncing block metadata from object storage.
- #928 Query: Added `--store.response-timeout` flag. If a Store doesn't send any data in this specified duration then a Store will be ignored and partial data will be returned if it's enabled. 0 disables timeout.
- #893 S3 storage backend has graduated to `stable` maturity level.
- #936 Azure storage backend has graduated to `stable` maturity level.
- #937 S3: added trace functionality. You can add `trace.enable: true` to enable the minio client's verbose logging.
- #953 Compact: now has a hidden flag `--debug.accept-malformed-index`. Compaction index verification will ignore out of order label names.
- #963 GCS: added possibility to inline ServiceAccount into GCS config.
- #1010 Compact: added new flag `--compact.concurrency`. Number of goroutines to use when compacting groups.
- #1028 Query: added `--query.default-evaluation-interval`, which sets default evaluation interval for sub queries.
- #980 Ability to override Azure storage endpoint for other regions (China)
- #1021 Query API `series` now supports POST method.
- #939 Query API `query_range` now supports POST method.
- #970 Deprecated `partial_response_disabled` proto field. Added `partial_response_strategy` instead. Both in gRPC and Query API. No `PartialResponseStrategy` field for `RuleGroups` by default means `abort` strategy (old PartialResponse disabled) as this is recommended option for Rules and alerts.
  Metrics:
  - Added `thanos_rule_evaluation_with_warnings_total` to Ruler.
  - DNS `thanos_ruler_query_apis*` are now `thanos_ruler_query_apis_*` for consistency.
  - DNS `thanos_querier_store_apis*` are now `thanos_querier_store_apis_*` for consistency.
  - Query Gate `thanos_bucket_store_series*` are now `thanos_bucket_store_series_*` for consistency.
  - Most of the Thanos Ruler metrics related to the rule manager have a `strategy` label.
  Ruler tracing spans:
  - `/rule_instant_query HTTP[client]` is now `/rule_instant_query_part_resp_abort HTTP[client]` if request is for abort strategy.
- #1009: Upgraded Prometheus (~v2.7.0-rc.0 to v2.8.1) and TSDB (`v0.4.0` to `v0.6.1`) deps.
  Changes that affect Thanos:
  - query:
    - [ENHANCEMENT] In histogram_quantile merge buckets with equivalent le values. #5158.
    - [ENHANCEMENT] Show list of offending labels in the error message in many-to-many scenarios. #5189
    - [BUGFIX] Fix panic when aggregator param is not a literal. #5290
  - ruler:
    - [ENHANCEMENT] Reduce time that Alertmanagers are in flux when reloaded. #5126
    - [BUGFIX] prometheus_rule_group_last_evaluation_timestamp_seconds is now a unix timestamp. #5186
    - [BUGFIX] prometheus_rule_group_last_duration_seconds now reports seconds instead of nanoseconds. Fixes our issue #1027
    - [BUGFIX] Fix sorting of rule groups. #5260
  - store: [ENHANCEMENT] Fast path for EmptyPostings cases in Merge, Intersect and Without.
  - tooling: [FEATURE] New dump command to tsdb tool to dump all samples.
  - compactor:
    - [ENHANCEMENT] When closing the db any running compaction will be cancelled so it doesn't block.
    - [CHANGE] breaking Renamed flag `--sync-delay` to `--consistency-delay` #1053
  For ruler essentially whole TSDB CHANGELOG applies between v0.4.0-v0.6.1: https://github.com/prometheus/tsdb/blob/master/CHANGELOG.md
  Note that this was added on TSDB and Prometheus: [FEATURE] Time-overlapping blocks are now allowed. #370 However, due to the nature of Thanos compaction (distributed systems), for safety reasons this is disabled for Thanos compactor for now.
- #868 Go has been updated to 1.12.
- #1055 Gossip flags are now disabled by default and deprecated.
- #964 repair: Repair process now sorts the series and labels within block.
- #1073 Store: index cache for requests. It now calculates the size properly (includes slice header), has anti-deadlock safeguard and reports more metrics.
- #921 `thanos_objstore_bucket_last_successful_upload_time` now does not appear when no blocks have been uploaded so far.
- #966 Bucket: verify no longer warns about overlapping blocks, that overlap `0s`
- #848 Compact: now correctly works with time series with duplicate labels.
- #894 Thanos Rule: UI now correctly shows evaluation time.
- #865 Query: now properly parses DNS SRV Service Discovery.
- #889 Store: added safeguard against merging posting groups segfault
- #941 Sidecar: added better handling of intermediate restarts.
- #933 Query: Fixed 30 seconds lag of adding new store to query.
- #962 Sidecar: Make config reloader file writes atomic.
- #982 Query: now advertises Min & Max Time accordingly to the nodes.
- #1041 Ruler is now able to return long time range queries.
- #904 Compact: Skip compaction for blocks with no samples.
- #1070 Downsampling works back again. Deferred closer errors are now properly captured.
v0.3.2 - 2019.03.04
`index-cache-size`. Handling of limit for this cache was broken so it was unbounded all the time. From this release actual value matters and is extremely low by default. To "revert" the old behaviour (no boundary), use a large enough value.
- #833 Store Gateway matcher regression for intersecting with empty posting.
- #867 Fixed race condition in sidecar between reloader and shipper.
v0.3.1 - 2019.02.18
- #829 Store Gateway crashing due to `slice bounds out of range`.
- #834 Store Gateway matcher regression for `<>` `!=`.
v0.3.0 - 2019.02.08
- Support for gzip compressed configuration files before envvar substitution for reloader package.
- `bucket inspect` command for better insights on blocks in object storage.
- Support for Tencent COS object storage.
- Partial Response disable option for StoreAPI and QueryAPI.
- Partial Response disable button on Thanos UI
- We have initial docs for goDoc documentation!
- Flags for Querier and Ruler UIs: `--web.route-prefix`, `--web.external-prefix`, `--web.prefix-header`. Details here
- #649 - Fixed store label values api to add also external label values.
- #396 - Fixed sidecar logic for proxying series that has more than 2^16 samples from Prometheus.
- #732 - Fixed S3 authentication sequence. You can see new sequence enumerated here
- #745 - Fixed race conditions and edge cases for Thanos Querier fanout logic.
- #651 - Fixed index cache when asked buffer size is bigger than cache max size.
- #529 Massive improvement for compactor. Downsampling memory consumption was reduced to only store labels and single chunks per each series.
- Querier UI: Store page now shows the store APIs per component type.
- Prometheus and TSDB deps are now up to date with ~2.7.0 Prometheus version. Lots of things have changed. See details here #704 Known changes that affect us:
- prometheus/prometheus/discovery/file
- [ENHANCEMENT] Discovery: Improve performance of previously slow updates of changes of targets. #4526
- [BUGFIX] Wait for service discovery to stop before exiting #4508 ??
- prometheus/prometheus/promql:
- [ENHANCEMENT] Subqueries support. #4831
- [BUGFIX] PromQL: Fix a goroutine leak in the lexer/parser. #4858
- [BUGFIX] Change max/min over_time to handle NaNs properly. #438
- [BUGFIX] Check label name for `count_values` PromQL function. #4585
- [BUGFIX] Ensure that vectors and matrices do not contain identical label-sets. #4589
- [ENHANCEMENT] Optimize PromQL aggregations #4248
- [BUGFIX] Only add LookbackDelta to vector selectors #4399
- [BUGFIX] Reduce floating point errors in stddev and related functions #4533
- prometheus/prometheus/rules:
- New metrics exposed! (prometheus evaluation!)
- [ENHANCEMENT] Rules: Error out at load time for invalid templates, rather than at evaluation time. #4537
- prometheus/tsdb/index: Index reader optimizations.
- Thanos store gateway flag for sync concurrency (`block-sync-concurrency` with `20` default, so no change by default)
- S3 provider:
  - Added `put_user_metadata` option to config.
  - Added `insecure_skip_verify` option to config.
- Tests against Prometheus below v2.2.1. This does not mean lack of support for those. Only that we don't test the compatibility anymore. See #758 for details.
v0.2.1 - 2018.12.27
- Relabel drop for Thanos Ruler to enable replica label drop and alert deduplication on AM side.
- Query: Stores UI page available at `/stores`.
- Thanos Rule Alertmanager DNS SD bug.
- DNS SD bug when having SRV results with different ports.
- Move handling of HA alertmanagers to be the same as Prometheus.
- Azure iteration implementation flaw.
v0.2.0 - 2018.12.10
Next Thanos release adding support to new discovery method, gRPC mTLS and two new object store providers (Swift and Azure).
Note the many necessary breaking changes in flags that relate to bucket configuration.
- breaking: Removed all bucket specific flags as we moved to config files:
- --gcs-bucket=<bucket>
- --s3.bucket=<bucket>
- --s3.endpoint=<api-url>
- --s3.access-key=<key>
- --s3.insecure
- --s3.signature-version2
- --s3.encrypt-sse
- --gcs-backup-bucket=<bucket>
- --s3-backup-bucket=<bucket>
- breaking: Removed support of those environment variables for bucket:
- S3_BUCKET
- S3_ENDPOINT
- S3_ACCESS_KEY
- S3_INSECURE
- S3_SIGNATURE_VERSION2
- breaking: Removed provider specific bucket metrics e.g `thanos_objstore_gcs_bucket_operations_total` in favor of generic bucket operation metrics.
- breaking: Added `thanos_` prefix to memberlist (gossip) metrics. Make sure to update your dashboards and rules.
- S3 provider:
  - Set `"X-Amz-Acl": "bucket-owner-full-control"` metadata for s3 upload operation.
- Support for heterogeneous secure gRPC on StoreAPI.
- Handling of scalar result in rule node evaluating rules.
- Flag `--objstore.config-file` to reference the bucket configuration file in YAML format. Detailed information can be found in the storage document.
- File service discovery for StoreAPIs:
  - In `thanos rule`, static configuration of query nodes via `--query`
  - In `thanos rule`, file based discovery of query nodes using `--query.file-sd-config.files`
  - In `thanos query`, file based discovery of store nodes using `--store.file-sd-config.files`
- `/-/healthy` endpoint to Querier.
- DNS service discovery to static and file based configurations using the `dns+` and `dnssrv+` prefixes for the respective lookup. Details here
- `--cluster.disable` flag to disable gossip functionality completely.
- Hidden flag to configure max compaction level.
- Azure Storage.
- OpenStack Swift support.
- Thanos Ruler `thanos_rule_loaded_rules` metric.
- Option for JSON logger format.
- Issue whereby the Proxy Store could end up in a deadlock if there were more than 9 stores being queried and all returned an error.
- Ruler tracing causing panics.
- GatherIndexStats panics on duplicated chunks check.
- Clean up of old compact blocks on compact restart.
- Sidecar too frequent Prometheus reload.
- `thanos_compactor_retries_total` metric not being registered.
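The v0.2.0 notes above mention `dns+` and `dnssrv+` prefixes that select a DNS lookup for statically or file-configured addresses. As a rough illustration of how such a prefix could be separated from the address, here is a minimal sketch. It is not the Thanos implementation: `splitAddr`, the `lookupKind` type, and the sample addresses are hypothetical, and it assumes `dns+` asks for an address-record lookup while `dnssrv+` asks for an SRV lookup.

```go
package main

import (
	"fmt"
	"strings"
)

// lookupKind names the resolution strategy implied by an address prefix.
type lookupKind string

const (
	lookupNone lookupKind = "none"   // no prefix: use the address as given
	lookupHost lookupKind = "dns"    // assumed: address-record lookup
	lookupSRV  lookupKind = "dnssrv" // assumed: SRV record lookup
)

// splitAddr is a hypothetical helper (not a Thanos API). It strips a known
// prefix from addr and reports which kind of lookup the prefix requested.
func splitAddr(addr string) (lookupKind, string) {
	switch {
	case strings.HasPrefix(addr, "dnssrv+"):
		return lookupSRV, strings.TrimPrefix(addr, "dnssrv+")
	case strings.HasPrefix(addr, "dns+"):
		return lookupHost, strings.TrimPrefix(addr, "dns+")
	default:
		return lookupNone, addr
	}
}

func main() {
	// Sample addresses, made up for illustration.
	addrs := []string{
		"dns+stores.example.internal:10901",
		"dnssrv+_grpc._tcp.stores.example.internal",
		"10.0.0.5:10901",
	}
	for _, a := range addrs {
		kind, target := splitAddr(a)
		fmt.Printf("%-45s lookup=%s target=%s\n", a, kind, target)
	}
}
```

Addresses without a recognized prefix are returned unchanged, so plain `host:port` entries keep working alongside prefixed ones.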
v0.1.0 - 2018.09.14
Initial version to have a stable reference before gossip protocol removal.
- Gossip layer for all components.
- StoreAPI gRPC proto.
- TSDB block upload logic for Sidecar.
- StoreAPI logic for Sidecar.
- Config and rule reloader logic for Sidecar.
- On-the-fly result merge and deduplication logic for Querier.
- Custom Thanos UI (based mainly on Prometheus UI) for Querier.
- Optimized object storage fetch logic for Store.
- Index cache and chunk pool for Store for better memory usage.
- Stable support for Google Cloud Storage object storage.
- StoreAPI logic for Querier to support Thanos federation (experimental).
- Support for S3 minio-based AWS object storage (experimental).
- Compaction logic of blocks from multiple sources for Compactor.
- Optional Compaction fixed retention.
- Optional downsampling logic for Compactor (experimental).
- Rule (including alerts) evaluation logic for Ruler.
- Rule UI with hot rules reload.
- StoreAPI logic for Ruler.
- Basic metric orchestration for all components.
- Verify commands with potential fixes (experimental).
- Compact / Downsample offline commands.
- Bucket commands.
- Downsampling support for UI.
- Grafana dashboards for Thanos components.