POC: Cost Attribution Proposal 2 #9733

Draft: wants to merge 36 commits into base: main

Commits:
4e75934
Poc: cost attribution proposal 1.2
ying-jeanne Oct 24, 2024
0456d94
test update
ying-jeanne Oct 24, 2024
f4ebc07
address review comments and fix lint and test
ying-jeanne Oct 25, 2024
cb99a3f
fix lint and ci
ying-jeanne Oct 25, 2024
b0d3f0a
change max-cost-attribution-cardinality-per-user to 10k
ying-jeanne Oct 25, 2024
ebd6105
change custom registry path
ying-jeanne Oct 25, 2024
a438331
Add license for lint
ying-jeanne Oct 25, 2024
13a0b2c
add reset logics to handle overflow and recovery from overflow
ying-jeanne Oct 27, 2024
7f0b372
remove noop implementation
ying-jeanne Nov 5, 2024
3397578
Merge remote-tracking branch 'origin/main' into poc-cost-attribution-2
ying-jeanne Nov 5, 2024
50fa0ec
add new discarded sample metrics
ying-jeanne Nov 5, 2024
abdd0cc
fix test
ying-jeanne Nov 5, 2024
698a5c6
address comment to combine 2 config compare
ying-jeanne Nov 6, 2024
da6b00b
add logic for overflow
ying-jeanne Nov 7, 2024
2b5e3ff
improve tests for cost attribution service
ying-jeanne Nov 7, 2024
cb2a2b6
Don't hold labels from store-gateways in two forms, and don't convert…
grafanabot Nov 18, 2024
078e689
add per tenant cost attribution label limit
ying-jeanne Nov 18, 2024
3131a2f
Merge remote-tracking branch 'origin/main' into poc-cost-attribution-2
ying-jeanne Nov 18, 2024
4bf418a
update doc
ying-jeanne Nov 18, 2024
5e9e1c1
fix unittest
ying-jeanne Nov 18, 2024
bd3e112
fix ci
ying-jeanne Nov 18, 2024
cf16611
fix ci
ying-jeanne Nov 18, 2024
3c1f886
remove unrelated changes
ying-jeanne Nov 18, 2024
d0cb1f3
update purge logics
ying-jeanne Nov 18, 2024
5d4a2c4
fix ci
ying-jeanne Nov 18, 2024
6091493
fix ci
ying-jeanne Nov 18, 2024
7324a1d
update logic for overflow, purge other metrics than overflow
ying-jeanne Nov 18, 2024
5af48e4
add distributor benchmark test for push
ying-jeanne Nov 19, 2024
203689a
Improve logging at ha_tracker sync operation (#9958) (#9961)
grafanabot Nov 20, 2024
a2f009b
add benchmark in ingester
ying-jeanne Nov 20, 2024
791a75d
Merge remote-tracking branch 'origin/main' into poc-cost-attribution-2
ying-jeanne Nov 20, 2024
00d2092
refactory benchmark tests
ying-jeanne Nov 22, 2024
9fba531
MQE: fix issue where subqueries could return series with no points (#…
charleskorn Nov 25, 2024
12d7d79
fix service dependencies
ying-jeanne Nov 25, 2024
e0525e1
fix the distributor crashloop
ying-jeanne Nov 27, 2024
70c6099
Merge remote-tracking branch 'origin/r317' into poc-cost-attribution-2
ying-jeanne Nov 27, 2024
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -19,7 +19,7 @@
* [CHANGE] Ingester: remove experimental flags `-ingest-storage.kafka.ongoing-records-per-fetch` and `-ingest-storage.kafka.startup-records-per-fetch`. They are removed in favour of `-ingest-storage.kafka.max-buffered-bytes`. #9906
* [CHANGE] Ingester: Replace `cortex_discarded_samples_total` label from `sample-out-of-bounds` to `sample-timestamp-too-old`. #9885
* [CHANGE] Ruler: the `/prometheus/config/v1/rules` does not return an error anymore if a rule group is missing in the object storage after been successfully returned by listing the storage, because it could have been deleted in the meanwhile. #9936
-* [FEATURE] Querier: add experimental streaming PromQL engine, enabled with `-querier.query-engine=mimir`. #9367 #9368 #9398 #9399 #9403 #9417 #9418 #9419 #9420 #9482 #9504 #9505 #9507 #9518 #9531 #9532 #9533 #9553 #9558 #9588 #9589 #9639 #9641 #9642 #9651 #9664 #9681 #9717 #9719 #9724 #9874
+* [FEATURE] Querier: add experimental streaming PromQL engine, enabled with `-querier.query-engine=mimir`. #9367 #9368 #9398 #9399 #9403 #9417 #9418 #9419 #9420 #9482 #9504 #9505 #9507 #9518 #9531 #9532 #9533 #9553 #9558 #9588 #9589 #9639 #9641 #9642 #9651 #9664 #9681 #9717 #9719 #9724 #9874 #9998
* [FEATURE] Distributor: Add support for `lz4` OTLP compression. #9763
* [FEATURE] Query-frontend: added experimental configuration options `query-frontend.cache-errors` and `query-frontend.results-cache-ttl-for-errors` to allow non-transient responses to be cached. When set to `true` error responses from hitting limits or bad data are cached for a short TTL. #9028
* [FEATURE] Query-frontend: add middleware to control access to specific PromQL experimental functions on a per-tenant basis. #9798
66 changes: 66 additions & 0 deletions cmd/mimir/config-descriptor.json
@@ -4358,6 +4358,50 @@
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_labels",
"required": false,
"desc": "List of labels used to define the cost attribution. This label will be included in the specified distributor and ingester metrics for each write request, allowing them to be distinguished by the label. The label applies to the following metrics: cortex_distributor_received_samples_total, cortex_ingester_active_series and cortex_discarded_samples_attribution_total. Set to an empty string to disable cost attribution.",
"fieldValue": null,
"fieldDefaultValue": "",
"fieldFlag": "validation.cost-attribution-labels",
"fieldType": "string",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "max_cost_attribution_labels_per_user",
"required": false,
"desc": "Maximum number of cost attribution labels allowed per user. 0 to disable.",
"fieldValue": null,
"fieldDefaultValue": 2,
"fieldFlag": "validation.max-cost-attribution-labels-per-user",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "max_cost_attribution_cardinality_per_user",
"required": false,
"desc": "Maximum cardinality of cost attribution labels allowed per user.",
"fieldValue": null,
"fieldDefaultValue": 10000,
"fieldFlag": "validation.max-cost-attribution-cardinality-per-user",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_cooldown",
"required": false,
"desc": "Cooldown period for cost attribution labels. This specifies how long the cost attribution tracker remains in overflow before attempting a reset. If the tracker is still in overflow after this period, the cooldown will be extended. Set to 0 to disable the cooldown period.",
"fieldValue": null,
"fieldDefaultValue": 0,
"fieldFlag": "validation.cost-attribution-cooldown",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "ruler_evaluation_delay_duration",
@@ -19524,6 +19568,17 @@
"fieldValue": null,
"fieldDefaultValue": null
},
{
"kind": "field",
"name": "cost_attribution_registry_path",
"required": false,
"desc": "Defines a custom path for the registry. When specified, Mimir will expose cost attribution metrics through this custom path, if not specified, cost attribution metrics won't be exposed.",
"fieldValue": null,
"fieldDefaultValue": "",
"fieldFlag": "cost-attribution.registry-path",
"fieldType": "string",
"fieldCategory": "advanced"
},
{
"kind": "field",
"name": "timeseries_unmarshal_caching_optimization_enabled",
@@ -19534,6 +19589,17 @@
"fieldFlag": "timeseries-unmarshal-caching-optimization-enabled",
"fieldType": "boolean",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_eviction_interval",
"required": false,
"desc": "Time interval at which inactive cost attributions will be evicted from the counter, so it won't be counted when checking max_cost_attribution_cardinality_per_user.",
"fieldValue": null,
"fieldDefaultValue": 1800000000000,
"fieldFlag": "cost-attribution.eviction-interval",
"fieldType": "duration",
"fieldCategory": "experimental"
}
],
"fieldValue": null,
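The `max_cost_attribution_cardinality_per_user` description above implies a cap on the number of distinct attribution values tracked per tenant, and several commit messages mention overflow handling. The Go sketch below shows one way such a cap could behave. It is not the tracker from this PR: the `attributionTracker` type, the `overflowValue` bucket name, and the choice to fold new values into a single bucket are all assumptions made for illustration.

```go
package main

import "fmt"

// overflowValue is a hypothetical bucket name, not taken from this PR.
const overflowValue = "__overflow__"

// attributionTracker counts samples per cost attribution label value for one
// tenant. Once maxCardinality distinct values are tracked, samples for unseen
// values are counted under overflowValue instead (assumed behaviour).
type attributionTracker struct {
	maxCardinality int
	samples        map[string]int
}

func newAttributionTracker(maxCardinality int) *attributionTracker {
	return &attributionTracker{maxCardinality: maxCardinality, samples: map[string]int{}}
}

// distinct returns the number of tracked values, excluding the overflow bucket.
func (a *attributionTracker) distinct() int {
	n := len(a.samples)
	if _, ok := a.samples[overflowValue]; ok {
		n--
	}
	return n
}

// observe records n samples for value and returns the key they were counted under.
func (a *attributionTracker) observe(value string, n int) string {
	if _, known := a.samples[value]; !known && a.distinct() >= a.maxCardinality {
		value = overflowValue
	}
	a.samples[value] += n
	return value
}

func main() {
	tr := newAttributionTracker(2)
	for _, v := range []string{"api", "db", "cache", "api"} {
		fmt.Printf("%s -> %s\n", v, tr.observe(v, 1))
	}
}
```

With a limit of 2, the third distinct value lands in the overflow bucket while values that were already tracked keep their own counters.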
12 changes: 12 additions & 0 deletions cmd/mimir/help-all.txt.tmpl
@@ -1283,6 +1283,10 @@ Usage of ./cmd/mimir/mimir:
Expands ${var} or $var in config according to the values of the environment variables.
-config.file value
Configuration file to load.
-cost-attribution.eviction-interval duration
[experimental] Time interval at which inactive cost attributions will be evicted from the counter, so it won't be counted when checking max_cost_attribution_cardinality_per_user. (default 30m0s)
-cost-attribution.registry-path string
Defines a custom path for the registry. When specified, Mimir will expose cost attribution metrics through this custom path, if not specified, cost attribution metrics won't be exposed.
-debug.block-profile-rate int
Fraction of goroutine blocking events that are reported in the blocking profile. 1 to include every blocking event in the profile, 0 to disable.
-debug.mutex-profile-fraction int
@@ -3297,10 +3301,18 @@ Usage of ./cmd/mimir/mimir:
Enable anonymous usage reporting. (default true)
-usage-stats.installation-mode string
Installation mode. Supported values: custom, helm, jsonnet. (default "custom")
-validation.cost-attribution-cooldown duration
[experimental] Cooldown period for cost attribution labels. This specifies how long the cost attribution tracker remains in overflow before attempting a reset. If the tracker is still in overflow after this period, the cooldown will be extended. Set to 0 to disable the cooldown period.
-validation.cost-attribution-labels comma-separated-list-of-strings
[experimental] List of labels used to define the cost attribution. This label will be included in the specified distributor and ingester metrics for each write request, allowing them to be distinguished by the label. The label applies to the following metrics: cortex_distributor_received_samples_total, cortex_ingester_active_series and cortex_discarded_samples_attribution_total. Set to an empty string to disable cost attribution.
-validation.create-grace-period duration
Controls how far into the future incoming samples and exemplars are accepted compared to the wall clock. Any sample or exemplar will be rejected if its timestamp is greater than '(now + creation_grace_period)'. This configuration is enforced in the distributor and ingester. (default 10m)
-validation.enforce-metadata-metric-name
Enforce every metadata has a metric name. (default true)
-validation.max-cost-attribution-cardinality-per-user int
[experimental] Maximum cardinality of cost attribution labels allowed per user. (default 10000)
-validation.max-cost-attribution-labels-per-user int
[experimental] Maximum number of cost attribution labels allowed per user. 0 to disable. (default 2)
-validation.max-label-names-per-series int
Maximum number of label names per series. (default 30)
-validation.max-length-label-name int
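To make the flag types in the help text concrete, here is a minimal sketch that parses the five new flags with Go's standard `flag` package. The flag names and defaults are copied from the help text above; the `costAttributionFlags` struct, the `commaList` type, and the check that rejects more labels than the per-user limit are assumptions for illustration and may differ from how Mimir wires these options.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
	"time"
)

// commaList is a flag.Value that splits its argument on commas.
type commaList []string

func (c *commaList) String() string { return strings.Join(*c, ",") }

func (c *commaList) Set(s string) error {
	*c = nil
	for _, part := range strings.Split(s, ",") {
		if part = strings.TrimSpace(part); part != "" {
			*c = append(*c, part)
		}
	}
	return nil
}

// costAttributionFlags is a hypothetical holder for the flags in this hunk.
type costAttributionFlags struct {
	labels           commaList
	maxLabels        int
	maxCardinality   int
	cooldown         time.Duration
	evictionInterval time.Duration
}

func parseCostAttributionFlags(args []string) (costAttributionFlags, error) {
	var cfg costAttributionFlags
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	fs.Var(&cfg.labels, "validation.cost-attribution-labels", "comma-separated labels")
	fs.IntVar(&cfg.maxLabels, "validation.max-cost-attribution-labels-per-user", 2, "max labels, 0 to disable")
	fs.IntVar(&cfg.maxCardinality, "validation.max-cost-attribution-cardinality-per-user", 10000, "max cardinality")
	fs.DurationVar(&cfg.cooldown, "validation.cost-attribution-cooldown", 0, "overflow cooldown")
	fs.DurationVar(&cfg.evictionInterval, "cost-attribution.eviction-interval", 30*time.Minute, "eviction interval")
	if err := fs.Parse(args); err != nil {
		return cfg, err
	}
	// Assumption: a non-zero label limit rejects longer label lists.
	if cfg.maxLabels > 0 && len(cfg.labels) > cfg.maxLabels {
		return cfg, fmt.Errorf("%d cost attribution labels configured, limit is %d", len(cfg.labels), cfg.maxLabels)
	}
	return cfg, nil
}

func main() {
	cfg, err := parseCostAttributionFlags([]string{
		"-validation.cost-attribution-labels=team,service",
		"-validation.cost-attribution-cooldown=20m",
	})
	fmt.Println(cfg.labels, cfg.cooldown, cfg.evictionInterval, err)
}
```

The default of 30 minutes for the eviction interval is the same value that `config-descriptor.json` stores as `1800000000000`, because durations are serialized there in nanoseconds.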
9 changes: 8 additions & 1 deletion development/mimir-microservices-mode/config/mimir.yaml
@@ -1,4 +1,6 @@
multitenancy_enabled: false
cost_attribution_registry_path: "/usage-metrics"
cost_attribution_eviction_interval: 10m

distributor:
ha_tracker:
@@ -184,5 +186,10 @@ limits:
ha_replica_label: ha_replica
ha_max_clusters: 10

cost_attribution_labels: "container"
max_cost_attribution_labels_per_user: 2
max_cost_attribution_cardinality_per_user: 100
cost_attribution_cooldown: 20m

runtime_config:
-file: ./config/runtime.yaml
+file: ./config/runtime.yaml
@@ -1,4 +1,6 @@
multitenancy_enabled: false
cost_attribution_registry_path: "/usage-metrics"
cost_attribution_eviction_interval: 10m

distributor:
pool:
@@ -180,5 +182,11 @@ limits:
ha_replica_label: ha_replica
ha_max_clusters: 10

cost_attribution_labels: "instance"
max_cost_attribution_labels_per_user: 2
max_cost_attribution_cardinality_per_user: 100
cost_attribution_cooldown: 20m

runtime_config:
file: ./config/runtime.yaml
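Both development configs set `cost_attribution_cooldown: 20m`. The option's description says the tracker stays in overflow for the cooldown period, then attempts a reset, and that the cooldown is extended if the tracker is still in overflow. The Go sketch below models that description as a small state machine. The `overflowState` type and the `start` and `minutes` helpers are hypothetical, and treating a cooldown of 0 as "never reset automatically" is an assumption, since the diff shown here does not include the tracker code.

```go
package main

import (
	"fmt"
	"time"
)

// start and minutes are helpers for the example only.
var start = time.Date(2024, time.November, 27, 0, 0, 0, 0, time.UTC)

func minutes(n int) time.Duration { return time.Duration(n) * time.Minute }

// overflowState is a hypothetical model of the cooldown described in the docs.
type overflowState struct {
	cooldown      time.Duration
	inOverflow    bool
	cooldownUntil time.Time
}

// enterOverflow marks the tracker as overflowing and starts the cooldown.
func (s *overflowState) enterOverflow(now time.Time) {
	s.inOverflow = true
	s.cooldownUntil = now.Add(s.cooldown)
}

// tryReset is meant to be called periodically. stillOver reports whether the
// cardinality is still above the limit. It returns true when the tracker
// leaves overflow.
func (s *overflowState) tryReset(now time.Time, stillOver bool) bool {
	// Assumption: a cooldown of 0 disables automatic resets.
	if !s.inOverflow || s.cooldown <= 0 || now.Before(s.cooldownUntil) {
		return false
	}
	if stillOver {
		// Still above the limit: extend the cooldown by one more period.
		s.cooldownUntil = now.Add(s.cooldown)
		return false
	}
	s.inOverflow = false
	return true
}

func main() {
	st := &overflowState{cooldown: minutes(20)}
	st.enterOverflow(start)
	for _, step := range []struct {
		after     int
		stillOver bool
	}{{5, false}, {20, true}, {30, false}, {40, false}} {
		reset := st.tryReset(start.Add(minutes(step.after)), step.stillOver)
		fmt.Printf("t+%dm stillOver=%v reset=%v\n", step.after, step.stillOver, reset)
	}
}
```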

38 changes: 38 additions & 0 deletions docs/sources/mimir/configure/configuration-parameters/index.md
@@ -458,9 +458,21 @@ overrides_exporter:
# time.
[common: <common>]

# (advanced) Defines a custom path for the registry. When specified, Mimir will
# expose cost attribution metrics through this custom path, if not specified,
# cost attribution metrics won't be exposed.
# CLI flag: -cost-attribution.registry-path
[cost_attribution_registry_path: <string> | default = ""]

# (experimental) Enables optimized marshaling of timeseries.
# CLI flag: -timeseries-unmarshal-caching-optimization-enabled
[timeseries_unmarshal_caching_optimization_enabled: <boolean> | default = true]

# (experimental) Time interval at which inactive cost attributions will be
# evicted from the counter, so it won't be counted when checking
# max_cost_attribution_cardinality_per_user.
# CLI flag: -cost-attribution.eviction-interval
[cost_attribution_eviction_interval: <duration> | default = 30m]
```

### common
@@ -3539,6 +3551,32 @@ The `limits` block configures default and per-tenant limits imposed by component
# CLI flag: -querier.active-series-results-max-size-bytes
[active_series_results_max_size_bytes: <int> | default = 419430400]

# (experimental) List of labels used to define the cost attribution. This label
# will be included in the specified distributor and ingester metrics for each
# write request, allowing them to be distinguished by the label. The label
# applies to the following metrics: cortex_distributor_received_samples_total,
# cortex_ingester_active_series and cortex_discarded_samples_attribution_total.
# Set to an empty string to disable cost attribution.
# CLI flag: -validation.cost-attribution-labels
[cost_attribution_labels: <string> | default = ""]

# (experimental) Maximum number of cost attribution labels allowed per user. 0
# to disable.
# CLI flag: -validation.max-cost-attribution-labels-per-user
[max_cost_attribution_labels_per_user: <int> | default = 2]

# (experimental) Maximum cardinality of cost attribution labels allowed per
# user.
# CLI flag: -validation.max-cost-attribution-cardinality-per-user
[max_cost_attribution_cardinality_per_user: <int> | default = 10000]

# (experimental) Cooldown period for cost attribution labels. This specifies how
# long the cost attribution tracker remains in overflow before attempting a
# reset. If the tracker is still in overflow after this period, the cooldown
# will be extended. Set to 0 to disable the cooldown period.
# CLI flag: -validation.cost-attribution-cooldown
[cost_attribution_cooldown: <duration> | default = 0s]

# Duration to delay the evaluation of rules to ensure the underlying metrics
# have been pushed.
# CLI flag: -ruler.evaluation-delay-duration
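The `cost_attribution_eviction_interval` documentation says inactive cost attributions are evicted so that they no longer count towards `max_cost_attribution_cardinality_per_user`. A minimal Go sketch of that idea follows. The `activityTracker` type and the `start` and `minutes` helpers are hypothetical, and the sketch assumes that "inactive" means "not seen for longer than the eviction interval", which the documentation text does not spell out.

```go
package main

import (
	"fmt"
	"time"
)

// start and minutes are helpers for the example only.
var start = time.Date(2024, time.November, 27, 0, 0, 0, 0, time.UTC)

func minutes(n int) time.Duration { return time.Duration(n) * time.Minute }

// activityTracker remembers when each attribution value was last observed.
type activityTracker struct {
	lastSeen map[string]time.Time
}

func newActivityTracker() *activityTracker {
	return &activityTracker{lastSeen: map[string]time.Time{}}
}

// touch records that value was observed at now.
func (a *activityTracker) touch(value string, now time.Time) {
	a.lastSeen[value] = now
}

// evictInactive drops values not seen for longer than interval and returns
// how many were removed. Deleting from a map while ranging over it is safe in Go.
func (a *activityTracker) evictInactive(now time.Time, interval time.Duration) int {
	evicted := 0
	for value, seen := range a.lastSeen {
		if now.Sub(seen) > interval {
			delete(a.lastSeen, value)
			evicted++
		}
	}
	return evicted
}

// cardinality is the number of values that still count towards the limit.
func (a *activityTracker) cardinality() int { return len(a.lastSeen) }

func main() {
	tr := newActivityTracker()
	tr.touch("api", start)
	tr.touch("db", start.Add(minutes(25)))
	evicted := tr.evictInactive(start.Add(minutes(40)), minutes(30))
	fmt.Printf("evicted=%d remaining=%d\n", evicted, tr.cardinality())
}
```

Evicting idle values frees room under the cardinality limit, which is what allows a tenant to recover from overflow without a restart.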
8 changes: 8 additions & 0 deletions pkg/api/api.go
@@ -20,6 +20,7 @@ import (
"github.com/grafana/dskit/middleware"
"github.com/grafana/dskit/server"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"

"github.com/grafana/mimir/pkg/alertmanager"
"github.com/grafana/mimir/pkg/alertmanager/alertmanagerpb"
@@ -278,6 +279,13 @@ func (a *API) RegisterDistributor(d *distributor.Distributor, pushConfig distrib
a.RegisterRoute("/distributor/ha_tracker", d.HATracker, false, true, "GET")
}

// RegisterUsageMetricsRoute registers a GET route at customRegistryPath that
// serves the metrics of the custom registry reg.
func (a *API) RegisterUsageMetricsRoute(customRegistryPath string, reg *prometheus.Registry) {
// promhttp.HandlerFor builds the HTTP handler for the custom registry;
// RegisterRoute wires it into the API's routing system.
a.RegisterRoute(customRegistryPath, promhttp.HandlerFor(reg, promhttp.HandlerOpts{}), true, false, "GET")
}

// Ingester is defined as an interface to allow for alternative implementations
// of ingesters to be passed into the API.RegisterIngester() method.
type Ingester interface {
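The option description states that cost attribution metrics are exposed only when a registry path is configured. The sketch below illustrates that behaviour with the standard library alone. It is not Mimir's routing code: `registerUsageMetricsRoute`, `newSketchMux`, and `statusFor` are hypothetical helpers, the plain handler stands in for `promhttp.HandlerFor(reg, promhttp.HandlerOpts{})`, and placing the empty-path check inside the registration function is an assumption, since the diff does not show where that check happens.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// registerUsageMetricsRoute mounts handler at path for GET requests only.
// An empty path means the metrics are not exposed (assumed location of the check).
func registerUsageMetricsRoute(mux *http.ServeMux, path string, handler http.Handler) bool {
	if path == "" {
		return false
	}
	mux.Handle(path, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		handler.ServeHTTP(w, r)
	}))
	return true
}

// newSketchMux builds a mux with a stand-in metrics handler at path.
func newSketchMux(path string) *http.ServeMux {
	mux := http.NewServeMux()
	standIn := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		// A real setup would render the custom registry here.
		fmt.Fprintln(w, "# cost attribution metrics would be rendered here")
	})
	registerUsageMetricsRoute(mux, path, standIn)
	return mux
}

// statusFor sends one in-memory request and returns the response status code.
func statusFor(mux *http.ServeMux, method, path string) int {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}

func main() {
	fmt.Println(statusFor(newSketchMux("/usage-metrics"), "GET", "/usage-metrics"))
	fmt.Println(statusFor(newSketchMux(""), "GET", "/usage-metrics"))
}
```

Serving the metrics from a separate registry on their own path keeps them out of the regular `/metrics` output, so they can be scraped and retained independently.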
2 changes: 1 addition & 1 deletion pkg/blockbuilder/tsdb.go
@@ -48,7 +48,7 @@ type TSDBBuilder struct {
var softErrProcessor = mimir_storage.NewSoftAppendErrorProcessor(
func() {}, func(int64, []mimirpb.LabelAdapter) {}, func(int64, []mimirpb.LabelAdapter) {},
func(int64, []mimirpb.LabelAdapter) {}, func(int64, []mimirpb.LabelAdapter) {}, func(int64, []mimirpb.LabelAdapter) {},
-func() {}, func([]mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
+func([]mimirpb.LabelAdapter) {}, func([]mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
func(error, int64, []mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
func(error, int64, []mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
)