Remove references to max_series_per_query from docs #6889

Open · wants to merge 18 commits into master

1 change: 1 addition & 0 deletions .golangci.yml
@@ -12,6 +12,7 @@ run:
- integration_querier
- integration_ruler
- integration_query_fuzz
- integration_remote_write_v2
- slicelabels
output:
formats:
1 change: 1 addition & 0 deletions ADOPTERS.md
@@ -21,3 +21,4 @@ This is the list of organisations that are using Cortex in **production environm
* [Platform9](https://platform9.com/)
* [REWE Digital](https://rewe-digital.com/)
* [SysEleven](https://www.syseleven.de/)
* [Twilio](https://www.twilio.com/)
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,7 @@
* [CHANGE] StoreGateway/Alertmanager: Add default 5s connection timeout on client. #6603
* [CHANGE] Ingester: Remove EnableNativeHistograms config flag and instead gate keep through new per-tenant limit at ingestion. #6718
* [CHANGE] Validate a tenantID when to use a single tenant resolver. #6727
* [FEATURE] Distributor: Add an experimental `-distributor.otlp.enable-type-and-unit-labels` flag to add `__type__` and `__unit__` labels for OTLP metrics. #6969
* [FEATURE] Distributor: Add an experimental `-distributor.otlp.allow-delta-temporality` flag to ingest delta temporality otlp metrics. #6934
* [FEATURE] Query Frontend: Add dynamic interval size for query splitting. This is enabled by configuring experimental flags `querier.max-shards-per-query` and/or `querier.max-fetched-data-duration-per-query`. The split interval size is dynamically increased to maintain a number of shards and total duration fetched below the configured values. #6458
* [FEATURE] Querier/Ruler: Add `query_partial_data` and `rules_partial_data` limits to allow queries/rules to be evaluated with data from a single zone, if other zones are not available. #6526
@@ -22,6 +23,7 @@
* [FEATURE] Querier: Allow choosing PromQL engine via header. #6777
* [FEATURE] Querier: Support for configuring query optimizers and enabling XFunctions in the Thanos engine. #6873
* [FEATURE] Query Frontend: Add support /api/v1/format_query API for formatting queries. #6893
* [ENHANCEMENT] Ingester: Add `cortex_ingester_tsdb_wal_replay_unknown_refs_total` and `cortex_ingester_tsdb_wbl_replay_unknown_refs_total` metrics to track unknown series references during wal/wbl replaying. #6945
* [ENHANCEMENT] Ruler: Emit an error message when the rule synchronization fails. #6902
* [ENHANCEMENT] Querier: Support snappy and zstd response compression for `-querier.response-compression` flag. #6848
* [ENHANCEMENT] Tenant Federation: Add a # of query result limit logic when the `-tenant-federation.regex-matcher-enabled` is enabled. #6845
2 changes: 1 addition & 1 deletion Makefile
@@ -174,7 +174,7 @@ lint:
golangci-lint run

# Ensure no blocklisted package is imported.
GOFLAGS="-tags=requires_docker,integration,integration_alertmanager,integration_backward_compatibility,integration_memberlist,integration_querier,integration_ruler,integration_query_fuzz" faillint -paths "github.com/bmizerany/assert=github.com/stretchr/testify/assert,\
GOFLAGS="-tags=requires_docker,integration,integration_alertmanager,integration_backward_compatibility,integration_memberlist,integration_querier,integration_ruler,integration_query_fuzz,integration_remote_write_v2" faillint -paths "github.com/bmizerany/assert=github.com/stretchr/testify/assert,\
golang.org/x/net/context=context,\
sync/atomic=go.uber.org/atomic,\
github.com/prometheus/client_golang/prometheus.{MultiError}=github.com/prometheus/prometheus/tsdb/errors.{NewMulti},\
2 changes: 1 addition & 1 deletion build-image/Dockerfile
@@ -1,4 +1,4 @@
FROM golang:1.24.3-bullseye
FROM golang:1.24.6-bullseye
ARG goproxyValue
ENV GOPROXY=${goproxyValue}
RUN apt-get update && apt-get install -y curl file gettext jq unzip protobuf-compiler libprotobuf-dev && \
10 changes: 4 additions & 6 deletions docs/configuration/arguments.md
@@ -318,11 +318,10 @@ overrides:
tenant1:
ingestion_rate: 10000
max_series_per_metric: 100000
max_series_per_query: 100000
Contributor: We can probably change max_series_per_query to max_fetched_series_per_query as it is the new limit.
max_fetched_series_per_query: 10000
tenant2:
max_samples_per_query: 1000000
max_series_per_metric: 100000
max_series_per_query: 100000
Member: it should be max_fetched_series_per_query for consistency

Author: 🤦, thanks for pointing this out.

max_fetched_series_per_query: 100000
multi_kv_config:
mirror_enabled: false
@@ -348,11 +347,10 @@ overrides:
tenant1:
ingestion_rate: 10000
max_series_per_metric: 100000
max_series_per_query: 100000
max_fetched_series_per_query: 100000
tenant2:
max_samples_per_query: 1000000
max_series_per_metric: 100000
max_series_per_query: 100000
max_fetched_series_per_query: 100000
```

Valid per-tenant limits are (with their corresponding flags for default values):
12 changes: 11 additions & 1 deletion docs/configuration/config-file-reference.md
@@ -3143,7 +3143,7 @@ ha_tracker:
# EXPERIMENTAL: If true, accept prometheus remote write v2 protocol push
# request.
# CLI flag: -distributor.remote-writev2-enabled
[remote_write2_enabled: <boolean> | default = false]
[remote_writev2_enabled: <boolean> | default = false]

ring:
kvstore:
@@ -3265,6 +3265,11 @@ otlp:
# EXPERIMENTAL: If true, delta temporality otlp metrics to be ingested.
# CLI flag: -distributor.otlp.allow-delta-temporality
[allow_delta_temporality: <boolean> | default = false]

# EXPERIMENTAL: If true, the '__type__' and '__unit__' labels are added for
# the OTLP metrics.
# CLI flag: -distributor.otlp.enable-type-and-unit-labels
[enable_type_and_unit_labels: <boolean> | default = false]
```

### `etcd_config`
@@ -4114,6 +4119,11 @@ The `limits_config` configures default and per-tenant limits imposed by Cortex s
# CLI flag: -frontend.max-queriers-per-tenant
[max_queriers_per_tenant: <float> | default = 0]

# [Experimental] Number of shards to use when distributing shardable PromQL
# queries.
# CLI flag: -frontend.query-vertical-shard-size
[query_vertical_shard_size: <int> | default = 0]

# Enable to allow queries to be evaluated with data from a single zone, if other
# zones are not available.
[query_partial_data: <boolean> | default = false]
1 change: 1 addition & 0 deletions docs/configuration/v1-guarantees.md
@@ -118,6 +118,7 @@ Currently experimental features are:
- `alertmanager-sharding-ring.final-sleep` (duration) CLI flag
- OTLP Receiver
- Ingest delta temporality OTLP metrics (`-distributor.otlp.allow-delta-temporality=true`)
- Add `__type__` and `__unit__` labels (`-distributor.otlp.enable-type-and-unit-labels`)
- Persistent tokens in the Ruler Ring:
- `-ruler.ring.tokens-file-path` (path) CLI flag
- Native Histograms
3 changes: 1 addition & 2 deletions docs/guides/overrides-exporter-runtime.yaml
@@ -8,5 +8,4 @@ overrides:
max_global_series_per_user: 300000
max_series_per_metric: 0
max_series_per_user: 0
max_samples_per_query: 100000
max_series_per_query: 100000
max_fetched_series_per_query: 100000
4 changes: 1 addition & 3 deletions docs/guides/overrides-exporter.md
@@ -36,8 +36,7 @@ overrides:
max_global_series_per_user: 300000
max_series_per_metric: 0
max_series_per_user: 0
max_samples_per_query: 100000
max_series_per_query: 100000
max_fetched_series_per_query: 100000
```
The `overrides-exporter` is configured to run as follows:
@@ -59,7 +58,6 @@ cortex_overrides{limit_name="max_global_series_per_user",user="user1"} 300000
cortex_overrides{limit_name="max_local_series_per_metric",user="user1"} 0
cortex_overrides{limit_name="max_local_series_per_user",user="user1"} 0
cortex_overrides{limit_name="max_samples_per_query",user="user1"} 100000
cortex_overrides{limit_name="max_series_per_query",user="user1"} 100000
```

With these metrics, you can set up alerts to know when tenants are close to hitting their limits
209 changes: 209 additions & 0 deletions docs/proposals/partition-ring-multi-az-replication.md
@@ -0,0 +1,209 @@
---
title: "Partition Ring with Multi-AZ Replication"
linkTitle: "Partition Ring Multi-AZ Replication"
weight: 1
slug: partition-ring-multi-az-replication
---

- Author: [Daniel Blando](https://github.com/danielblando)
- Date: July 2025
- Status: Proposed

## Background

Distributors use a token-based ring to shard data across ingesters. Each ingester owns random tokens (32-bit numbers) in a hash ring. For each incoming series, the distributor:

1. Hashes the series labels to get a hash value
2. Finds the primary ingester (smallest token > hash value)
3. When replication is enabled, selects additional replicas by moving clockwise around the ring
4. Ensures replicas are distributed across different availability zones

The issue arises when replication is enabled: each series in a request is hashed independently, so different series in the same request are routed to different groups of ingesters.

```mermaid
graph TD
A[Write Request] --> B[Distributor]
B --> C[Hash Series 1] --> D[Ingesters: 5,7,9]
B --> E[Hash Series 2] --> F[Ingesters: 5,3,10]
B --> G[Hash Series 3] --> H[Ingesters: 7,27,28]
B --> I[...] --> J[Different ingester sets<br/>for each series]
```
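
The per-series replica selection described in the steps above can be sketched as follows. This is a minimal illustration under simplifying assumptions (one token per ingester, a plain clockwise walk that skips already-used zones, hypothetical names), not the actual Cortex implementation:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ingester struct {
	name  string
	zone  string
	token uint32
}

// pickReplicas walks the ring clockwise from the series hash and picks
// one ingester per availability zone until the replication factor is met.
func pickReplicas(ring []ingester, seriesHash uint32, rf int) []ingester {
	sort.Slice(ring, func(i, j int) bool { return ring[i].token < ring[j].token })

	// Primary ingester: the one owning the smallest token greater than the hash.
	start := sort.Search(len(ring), func(i int) bool { return ring[i].token > seriesHash })

	var replicas []ingester
	usedZones := map[string]bool{}
	for i := 0; i < len(ring) && len(replicas) < rf; i++ {
		ing := ring[(start+i)%len(ring)] // wrap around the ring
		if usedZones[ing.zone] {
			continue // keep replicas in distinct AZs
		}
		usedZones[ing.zone] = true
		replicas = append(replicas, ing)
	}
	return replicas
}

func main() {
	ring := []ingester{
		{"ing-az1-1", "az1", 100}, {"ing-az2-1", "az2", 200}, {"ing-az3-1", "az3", 300},
		{"ing-az1-2", "az1", 400}, {"ing-az2-2", "az2", 500}, {"ing-az3-2", "az3", 600},
	}
	h := fnv.New32a()
	h.Write([]byte(`{__name__="http_request_latency",api="/push",status="2xx"}`))
	fmt.Println(pickReplicas(ring, h.Sum32()%700, 3))
}
```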

## Problem

### Limited AZ Failure Tolerance with Replication

While the token ring effectively distributes load across the ingester fleet, the independent hashing and routing of each series creates an amplification effect where a single ingester failure can impact a large number of write requests.

Consider a ring with 30 ingesters and a replication factor of 3, where each series is distributed to three different ingesters:

```
Sample 1: {name="http_request_latency",api="/push", status="2xx"}
→ Ingesters: ing-5, ing-7, ing-9
Sample 2: {name="http_request_latency",api="/push", status="4xx"}
→ Ingesters: ing-5, ing-3, ing-10
Sample 3: {name="http_request_latency",api="/push", status="2xx"}
→ Ingesters: ing-7, ing-27, ing-28
...
```
If ingesters `ing-15` and `ing-18` (in different AZs) are offline, any request containing a series that needs to write to both these ingesters will fail completely:

```
Sample 15: {name="http_request_latency",api="/push", status="5xx"}
→ Ingesters: ing-10, ing-15, ing-18 // Request fails
```

Given two failed ingesters in different AZs, each individual series has only a small chance of requiring both of them. However, as request batch sizes increase, the probability that at least one series in the batch hashes to both failed ingesters approaches certainty, which makes request failure a critical concern in replicated deployments.
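
A back-of-the-envelope illustration of this amplification, assuming 3 AZs with 10 ingesters each, a replication factor of 3, and replica choices that are roughly uniform within each AZ (all simplifying assumptions, not derived from the actual ring distribution):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const ingestersPerAZ = 10.0

	// Two ingesters in different AZs are down. A single series fails only if
	// its replica set contains both of them; under the uniform assumption the
	// chance is roughly (1/ingestersPerAZ) per affected AZ.
	perSeriesFailure := (1 / ingestersPerAZ) * (1 / ingestersPerAZ) // ~0.01

	// A request fails if at least one of its series hits both failed ingesters.
	for _, batchSize := range []float64{1, 100, 500, 1000} {
		requestFailure := 1 - math.Pow(1-perSeriesFailure, batchSize)
		fmt.Printf("batch=%4.0f  per-series failure=%.2f%%  request failure=%.1f%%\n",
			batchSize, perSeriesFailure*100, requestFailure*100)
	}
}
```

Under these assumptions a 1,000-series batch fails with near certainty even though each individual series fails only about 1% of the time.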

**Note**: This problem specifically affects Cortex deployments that use replication. Deployments with a replication factor of 1 are not affected by this availability amplification issue.

## Proposed Solution

### Partition Ring Architecture

A new Partition Ring is proposed where the ring is divided into partitions, with each partition containing a set of tokens and a group of ingesters. Ingesters are allocated to partitions based on their order in the zonal StatefulSet, ensuring that scaling operations align with StatefulSet's LIFO behavior. Each partition contains a number of ingesters equal to the replication factor, with exactly one ingester per availability zone.

This approach provides **reduced failure probability**: the chance that two failed ingesters fall within the same partition is significantly lower than the chance that two random failures together affect some series under independent per-series hashing. It also enables **deterministic replication**, where data sent to `ing-az1-1` always replicates to `ing-az2-1` and `ing-az3-1`, making the system's behavior more predictable and easier to troubleshoot.

```mermaid
graph TD
subgraph "Partition Ring"
subgraph "Partition 3"
P1A[ing-az1-3]
P1B[ing-az2-3]
P1C[ing-az3-3]
end
subgraph "Partition 2"
P2A[ing-az1-2]
P2B[ing-az2-2]
P2C[ing-az3-2]
end
subgraph "Partition 1"
P3A[ing-az1-1]
P3B[ing-az2-1]
P3C[ing-az3-1]
end
end

T1[Tokens 34] --> P1A
T2[Tokens 56] --> P2A
T3[Tokens 12] --> P3A
```

Within each partition, ingesters maintain identical data, acting as true replicas of each other. Distributors maintain similar hashing logic but select a partition instead of individual ingesters. Data is then forwarded to all ingesters within the selected partition, making the replication pattern deterministic.
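
Under the partition ring, the same series hash selects a partition rather than an individual ingester, and every ingester in that partition receives the series. A minimal sketch, with hypothetical types and token layout:

```go
package main

import (
	"fmt"
	"sort"
)

type partition struct {
	id        string
	tokens    []uint32
	ingesters []string // exactly one per AZ, e.g. ing-az1-1, ing-az2-1, ing-az3-1
}

// pickPartition finds the partition owning the smallest token greater than
// the series hash, wrapping around the ring if necessary.
func pickPartition(partitions []partition, seriesHash uint32) partition {
	type tokenOwner struct {
		token uint32
		idx   int
	}
	var ring []tokenOwner
	for i, p := range partitions {
		for _, t := range p.tokens {
			ring = append(ring, tokenOwner{t, i})
		}
	}
	sort.Slice(ring, func(i, j int) bool { return ring[i].token < ring[j].token })
	pos := sort.Search(len(ring), func(i int) bool { return ring[i].token > seriesHash })
	return partitions[ring[pos%len(ring)].idx]
}

func main() {
	partitions := []partition{
		{"1", []uint32{12}, []string{"ing-az1-1", "ing-az2-1", "ing-az3-1"}},
		{"2", []uint32{56}, []string{"ing-az1-2", "ing-az2-2", "ing-az3-2"}},
		{"3", []uint32{34}, []string{"ing-az1-3", "ing-az2-3", "ing-az3-3"}},
	}
	p := pickPartition(partitions, 20)
	// The series is replicated to every ingester in the selected partition.
	fmt.Println("partition", p.id, "->", p.ingesters)
}
```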

### Protocol Buffer Definitions

```protobuf
message PartitionRingDesc {
map<string, PartitionDesc> partitions = 1;
}

message PartitionDesc {
PartitionState state = 1;
repeated uint32 tokens = 2;
map<string, InstanceDesc> instances = 3;
int64 registered_timestamp = 4;
}

// Unchanged from current implementation
message InstanceDesc {
string addr = 1;
int64 timestamp = 2;
InstanceState state = 3;
string zone = 7;
int64 registered_timestamp = 8;
}
```

### Partition States

Partitions maintain a simplified state model that provides **clear ownership**, where each series belongs to exactly one partition, but requires **additional state management** for partition states and their lifecycle:

```go
type PartitionState int

const (
NON_READY PartitionState = iota // Insufficient ingesters
ACTIVE // Fully operational
READONLY // Scale-down in progress
)
```

State transitions:
```mermaid
stateDiagram-v2
[*] --> NON_READY
NON_READY --> ACTIVE : Required ingesters joined<br/>across all AZs
ACTIVE --> READONLY : Scale-down initiated
ACTIVE --> NON_READY : Ingester removed
READONLY --> NON_READY : Ingesters removed
NON_READY --> [*] : Partition deleted
```
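
One possible reading of the diagram above is that a partition's state can be derived from its membership plus a scale-down flag; the zone-counting rule below is an illustrative assumption, not the proposed implementation:

```go
package main

import "fmt"

type PartitionState int

const (
	NON_READY PartitionState = iota
	ACTIVE
	READONLY
)

type instance struct{ zone string }

// deriveState: a partition is ACTIVE only while it has exactly one ingester
// in every expected AZ; READONLY is an explicit flag set when a scale-down
// starts; anything missing an AZ is NON_READY.
func deriveState(instances []instance, expectedZones []string, scalingDown bool) PartitionState {
	perZone := map[string]int{}
	for _, inst := range instances {
		perZone[inst.zone]++
	}
	for _, z := range expectedZones {
		if perZone[z] != 1 {
			return NON_READY // an AZ is missing (or has a duplicate) replica
		}
	}
	if scalingDown {
		return READONLY
	}
	return ACTIVE
}

func main() {
	zones := []string{"az1", "az2", "az3"}
	full := []instance{{"az1"}, {"az2"}, {"az3"}}
	fmt.Println(deriveState(full, zones, false))     // ACTIVE (1)
	fmt.Println(deriveState(full[:2], zones, false)) // NON_READY (0)
	fmt.Println(deriveState(full, zones, true))      // READONLY (2)
}
```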

### Partition Lifecycle Management

#### Creating Partitions

When a new ingester joins the ring (see the sketch after this list):
1. Check if a suitable partition exists with available slots
2. If no partition exists, create a new partition in `NON_READY` state
3. Add partition's tokens to the ring
4. Add the ingester to the partition
5. Wait for required number of ingesters across all AZs (one per AZ)
6. Once all AZs are represented, transition partition to `ACTIVE`
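
A minimal sketch of the join flow referenced above; the partition sizing, token generation, and type names are illustrative assumptions:

```go
package main

import (
	"fmt"
	"math/rand"
)

type joinPartition struct {
	id        int
	tokens    []uint32
	instances map[string]string // zone -> ingester name
	active    bool
}

// join finds a partition that is still missing an ingester in the joining
// ingester's zone; if none exists, it creates a new NON_READY partition with
// fresh tokens. The partition becomes ACTIVE once every AZ is represented.
func join(parts []*joinPartition, name, zone string, allZones []string, tokensPerPartition int) []*joinPartition {
	var target *joinPartition
	for _, p := range parts {
		if _, ok := p.instances[zone]; !ok {
			target = p
			break
		}
	}
	if target == nil {
		target = &joinPartition{id: len(parts) + 1, instances: map[string]string{}}
		for i := 0; i < tokensPerPartition; i++ {
			target.tokens = append(target.tokens, rand.Uint32()) // add the partition's tokens to the ring
		}
		parts = append(parts, target)
	}
	target.instances[zone] = name
	target.active = len(target.instances) == len(allZones) // one ingester per AZ
	return parts
}

func main() {
	zones := []string{"az1", "az2", "az3"}
	var parts []*joinPartition
	for _, ing := range []struct{ name, zone string }{
		{"ing-az1-1", "az1"}, {"ing-az2-1", "az2"}, {"ing-az3-1", "az3"}, {"ing-az1-2", "az1"},
	} {
		parts = join(parts, ing.name, ing.zone, zones, 4)
	}
	for _, p := range parts {
		fmt.Printf("partition %d active=%v instances=%v\n", p.id, p.active, p.instances)
	}
}
```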

#### Removing Partitions

The scale-down process follows these steps:
1. **Mark READONLY**: Partition stops accepting new writes but continues serving reads
2. **Data Transfer**: Wait for all ingesters in partition to transfer data and become empty
3. **Coordinated Removal**: Remove one ingester from each AZ simultaneously
4. **State Transition**: Partition automatically transitions to `NON_READY` (insufficient replicas)
5. **Cleanup**: Remove remaining ingesters and delete partition from ring

If READONLY mode is not used, removing an ingester marks the partition as NON_READY. Once all ingesters are removed, the last one deletes the partition if the `unregister_on_shutdown` configuration is set to true.

### Multi-Ring Migration Strategy

To address the migration challenge for production clusters currently running token-based rings, this proposal also introduces a multi-ring infrastructure that allows gradual traffic shifting from token-based to partition-based rings:

```mermaid
sequenceDiagram
participant C as Client
participant D as Distributor
participant MR as Multi-Ring Router
participant TR as Token Ring
participant PR as Partition Ring

C->>D: Write Request (1000 series)
D->>MR: Route request
MR->>MR: Check percentage config<br/>(e.g., 80% token, 20% partition)
MR->>TR: Route 800 series to Token Ring
MR->>PR: Route 200 series to Partition Ring

Note over TR,PR: Both rings process their portion
TR->>D: Response for 800 series
PR->>D: Response for 200 series
D->>C: Combined response
```
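
A hedged sketch of how the per-series percentage split might work; the hashing rule, threshold math, and names are assumptions for illustration, since the proposal does not prescribe a specific splitting function:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// routeToPartitionRing decides, per series, whether it goes to the new
// partition ring, based on a configured percentage. Hashing the series
// labels (rather than picking randomly) keeps the decision stable, so a
// given series always lands on the same ring for a given percentage.
func routeToPartitionRing(seriesLabels string, partitionRingPercent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(seriesLabels))
	return h.Sum32()%100 < partitionRingPercent
}

func main() {
	series := []string{
		`{__name__="http_request_latency",status="2xx"}`,
		`{__name__="http_request_latency",status="4xx"}`,
		`{__name__="http_request_latency",status="5xx"}`,
	}
	const percent = 20 // e.g. a phase routing 20% of series to the partition ring
	for _, s := range series {
		fmt.Printf("%-50s partition ring: %v\n", s, routeToPartitionRing(s, percent))
	}
}
```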

Migration phases for production clusters:
1. **Phase 1**: Deploy partition ring alongside existing token ring (0% traffic)
2. **Phase 2**: Route 10% traffic to partition ring
3. **Phase 3**: Gradually increase to 50% traffic
4. **Phase 4**: Route 90% traffic to partition ring
5. **Phase 5**: Complete migration (100% partition ring)

This multi-ring approach solves the migration problem for existing production deployments that cannot afford downtime during the transition from token-based to partition-based rings. It provides **zero downtime migration** with **rollback capability** and **incremental validation** at each step. However, it requires **dual ring participation** (ingesters must participate in both rings during migration), incurs **increased memory usage**, and demands **migration coordination** with careful percentage management and monitoring.

#### Read Path Considerations

During migration, the read path (queriers and rulers) must have visibility into both rings to ensure all functionality works correctly:

- **Queriers** must check both token and partition rings to locate series data, as data may be distributed across both ring types during migration
- **Rulers** must evaluate rules against data from both rings to ensure complete rule evaluation
- **Ring-aware components** (like shuffle sharding) must operate correctly across both ring types
- **Metadata operations** (like label queries) must aggregate results from both rings

All existing Cortex functionality must continue to work seamlessly during the migration period, requiring components to transparently handle the dual-ring architecture.