Make metrics collection optional/faster #1147

QuerthDP · 2024-12-10T13:28:48Z

This patch contains:

Implementation of lock-free histogram with hot and cold bucket pools
Introduction of crate feature "metrics" to make using them optional
Functionality to gather latency statistics via histogram snapshots
Implementation of rates of queries per second based on cpp-driver implementation.
Initial implementation of connection metrics (changes required after review)

Fixes: #330

Pre-review checklist

I have split my patch into logically separate commits.
All commit messages clearly explain what they change and why.
I added relevant tests for new features and bug fixes.
All commits compile, pass static checks and pass test.
PR description sums up the changes and reasons why they should be introduced.
I have provided docstrings for the public items that I want to introduce.
I have adjusted the documentation in ./docs/source/.
I added appropriate Fixes: annotations to PR description.

github-actions · 2024-12-10T13:35:03Z

cargo semver-checks detected some API incompatibilities in this PR.
Checked commit: 66b63c9

See the following report for details:

cargo semver-checks output

./scripts/semver-checks.sh --baseline-rev bceac6aaa852d51699b73820fc5317d75e3e21d4
+ cargo semver-checks -p scylla -p scylla-cql --baseline-rev bceac6aaa852d51699b73820fc5317d75e3e21d4
     Cloning bceac6aaa852d51699b73820fc5317d75e3e21d4
    Building scylla v0.15.0 (current)
       Built [  23.984s] (current)
     Parsing scylla v0.15.0 (current)
      Parsed [   0.048s] (current)
    Building scylla v0.15.0 (baseline)
       Built [  21.416s] (baseline)
     Parsing scylla v0.15.0 (baseline)
      Parsed [   0.045s] (baseline)
    Checking scylla v0.15.0 -> v0.15.0 (no change)
     Checked [   0.120s] 127 checks: 124 pass, 3 fail, 0 warn, 0 skip

--- failure auto_trait_impl_removed: auto trait no longer implemented ---

Description:
A public type has stopped implementing one or more auto traits. This can break downstream code that depends on the traits being implemented.
        ref: https://doc.rust-lang.org/reference/special-types-and-traits.html#auto-traits
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.39.0/src/lints/auto_trait_impl_removed.ron

Failed in:
  type MetricsError is no longer UnwindSafe, in /home/runner/work/scylla-rust-driver/scylla-rust-driver/scylla/src/observability/metrics.rs:10
  type MetricsError is no longer RefUnwindSafe, in /home/runner/work/scylla-rust-driver/scylla-rust-driver/scylla/src/observability/metrics.rs:10

--- failure inherent_method_missing: pub method removed or renamed ---

Description:
A publicly-visible method or associated fn is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.39.0/src/lints/inherent_method_missing.ron

Failed in:
  Metrics::new, previously in file /home/runner/work/scylla-rust-driver/scylla-rust-driver/target/semver-checks/git-bceac6aaa852d51699b73820fc5317d75e3e21d4/a2020bf1cde8f2c41199b73119bb64096056f0b9/scylla/src/observability/metrics.rs:35

--- failure struct_with_no_pub_fields_changed_type: public API struct with no public fields is no longer a struct ---

Description:
A struct without pub fields became an enum or union, breaking pattern matching.
        ref: https://internals.rust-lang.org/t/rest-patterns-foo-should-match-non-struct-types/21607
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.39.0/src/lints/struct_with_no_pub_fields_changed_type.ron

Failed in:
  struct scylla::observability::metrics::MetricsError became enum in file /home/runner/work/scylla-rust-driver/scylla-rust-driver/scylla/src/observability/metrics.rs:10

     Summary semver requires new major version: 3 major and 0 minor checks failed
    Finished [  46.584s] scylla
    Building scylla-cql v0.4.0 (current)
       Built [  10.637s] (current)
     Parsing scylla-cql v0.4.0 (current)
      Parsed [   0.026s] (current)
    Building scylla-cql v0.4.0 (baseline)
       Built [  11.062s] (baseline)
     Parsing scylla-cql v0.4.0 (baseline)
      Parsed [   0.026s] (baseline)
    Checking scylla-cql v0.4.0 -> v0.4.0 (no change)
     Checked [   0.111s] 127 checks: 127 pass, 0 skip
     Summary no semver update required
    Finished [  22.738s] scylla-cql
make: *** [Makefile:61: semver-rev] Error 1

QuerthDP · 2025-01-13T12:43:27Z

Changelog:

rebased on main

muzarski

I still need to review commits that introduce Meter and connection metrics. Posting this review early, because there are some matters to discuss.

muzarski · 2025-01-15T12:46:06Z

scylla/src/observability/lock_free_histogram.rs

+#[derive(Error, Debug, PartialEq)]
+pub enum LFError {
+    #[error("invalid use of histogram")]
+    HistogramErr(#[from] histogram::Error),
+    #[error("could not lock the snapshot mutex")]
+    Mutex,
+}


Please add a docstring. Even the simplest one - mentioning that it indicates some failure during LockFreeHistogram's operation

I don't like the name. Since this is a public type, the error name should be more descriptive (current version would be fine for driver's internal error type). I suggest a verbose name such as LockFreeHistogramError (WDYT @wprzytula ?)

The additional context is not displayed for the variants. You can display it via {0}, such as #[error("Invalid use of histogram: {0}")]. Also, please begin the error messages with capital letter (this is a convention we currently use in the driver).

HistogramErr -> HistogramError (variant name)

HistogramErr variant currently has a type from a pre-1.0.0 crate. We need to think what to do with this. Corresponding issue: Remove types from pre-1.0 crates from the public API, or hide them behind features #771. I see that later in this PR, you introuce a metrics feature, so I assume that this error type will be hidden behind it as well. Is it OK if feature name does not directly correspond to the unstable crate's name? In this case, the unstable crate is histogram, and feature name is metrics. I'm not sure how this interacts with API stability. cc: @piodul @Lorak-mmk

scylla/src/observability/lock_free_histogram.rs

muzarski · 2025-01-15T12:49:50Z

scylla/src/observability/lock_free_histogram.rs

+    Mutex,
+}
+
+pub struct LockFreeHistogram {


This needs a docstring as well. Not only this is a public type (does it need to be public?), but its methods contain some very complex logic, which could be briefly explained here.

Also, it would be nice to mention the motivation behind this struct. I understand the logic (after reviewing the methods), but I still don't quite get WHY we need it. Why can't we just use the AtomicHistogram from histogram crate? Are there any races that can occur without this additional layer of synchronization?

Okay, so there are a couple of things to mention here, but I'll start from the beginning.

In the issue description here it is noted that no suitable solution for the problem was found on crates.io (I believe AtomicHistogram already existed by the time of writing of that issue), so I assumed it must have had a flaw. And, as it turned out, it did.

I highly recommend reading through the issue I opened up on the crate histogram repo, where I explain all of the motivation in detail, but in short: the .load() method has no synchronisation with increments, which causes a logical race (the state of the loaded histogram is dependent on the speed at which it is loaded).

The idea of some sort of a lock-free algorithm was also proposed in the "Make metrics optional" issue, along with sharding, which I considered potentially harder to implement, thus I went with a lock-free algorithm.

However (!), the LockFreeHistogram's implementation comes with potential drawbacks in terms of performance (in comparison to AtomicHistogram, not a global mutex) due to global atomic counters accessed upon each bucket increment.

I haven't managed to run any benchmarks in this regard, nor do I have concrete examples of cases when AtomicHistogram's implementation yields a very significant error in results (though I did come up with some ideas and calculations; they can be found on my linked issue on crate histogram's repo). Therefore, the decision of which histogram implementation to incorporate into this driver is up to you. I've just provided a safe alternative and done some research.

Also, should you choose to go with AtomicHistogram, the change will be rather effortless, as I maintained the API schema used in crate histogram for my implementation.

Thank you for the explanation! I think part of this deserves to be put in the docstring.

Also, should you choose to go with AtomicHistogram, the change will be rather effortless, as I maintained the API schema used in crate histogram for my implementation.

This is great. Also, please unpub (pub(crate)) the LockFreeHistogram and its methods. Only then will we be 100% sure that if we ever decide to drop/modify LockFreeHistogram, such change will not be API breaking.

Which histogram implementation we want to incorporate to the driver? Well, this needs to be discussed. I'd wait until @Lorak-mmk and @wprzytula review the code.

I unpubbed LockFreeHistogram and its methods, but now I can see it might be difficult to make such a change completely non-API-breaking. That's because LockFreeHistogramError is propagated as MetricsError's cause and thus needs to be pub. Upon change to AtomicHistogram this error struct would be removed entirely.

I'm not yet sure how we could go around this issue.

Ah, good catch. There are some workarounds for this, however. For example, we could always hide underlying LockFreeHistogramError under Arc<dyn Error>. I wouldn't worry about this now. Let's wait for others to join the discussion.

For what it's worth, while .load() is not atomic wrt concurrent increments into the histogram. I'd still consider using the AtomicHistogram which is used in metrics for both rpc-perf and Pelikan.

As I said in the issue opened in rustcommon, I'm not sure that the potential skew here would be meaningful. But I do welcome some concrete details if such skew does prove to be meaningful.

The TLDR is I'm not sure how much it matters whether it's on one side or the other of loading a histogram. If you envision periodically snapshotting the histogram, it seems you have to accept that the latency is already being recorded at the tail end of the event. Imagine a request that takes a long time, that latency gets incremented after the fact, when the service might be back to responding quickly.

My feeling is that ultimately this is all an approximation and I've found the AtomicHistogram as in the histogram crate to be satisfactory for projects I work on.

scylla/src/observability/lock_free_histogram.rs

muzarski · 2025-01-15T12:56:43Z

scylla/src/observability/lock_free_histogram.rs

+impl Default for LockFreeHistogram {
+    fn default() -> Self {
+        // Config: 64ms error, values in range [0ms, ~262_000ms].
+        // Size: (2^13 + 5 * 2^12) * 8B * 2 ~= 450kB.
+        let grouping_power = 12;
+        let max_value_power = 18;
+        LockFreeHistogram::new(grouping_power, max_value_power)
+    }
+}


Where are these defaults taken from?

Since the histogram crate no longer provides defaults, I had to come up with some choice here. It was my best guess at what might be needed, though that is obviously to be discussed and modified if needed.

You may find our calculator useful while evaluating what parameters are appropriate: https://observablehq.com/@iopsystems/h2-histogram

❓ @NikodemGapski Have you consulted the above calculator about the defaults?

I mean, I had made my guess according to the mentioned calculator, but I can't know if it meets the needs of the users of this driver (the range and absolute error I noted in the comment for reference). It is your decision whether or not to change it.

scylla/src/observability/metrics.rs

examples/Cargo.toml

scylla/src/observability/lock_free_histogram.rs

scylla/src/observability/metrics.rs

wprzytula · 2025-01-17T13:20:50Z

scylla/src/observability/metrics.rs

+/// Snapshot is a structure that contains histogram statistics such as
+/// min, max, mean, standard deviation, median, and most common percentiles
+/// collected in a certain moment.
+#[derive(Debug)]
+pub struct Snapshot {
+    pub min: u64,
+    pub max: u64,
+    pub mean: u64,
+    pub stddev: u64,
+    pub median: u64,
+    pub percentile_75: u64,
+    pub percentile_95: u64,
+    pub percentile_98: u64,
+    pub percentile_99: u64,
+    pub percentile_99_9: u64,
+}


🔧 For future compatibility, let's either:

make this struct #[non_exhaustive] to allow adding more fields in the future without breaking the API;

or make all those fields private and expose a getter for each field.

Which one do you find a more suitable solution? @Lorak-mmk @muzarski

So far I've used the #[non_exhaustive] macro, but this case is still open for discussion.

scylla/src/observability/metrics.rs

QuerthDP · 2025-01-26T16:14:25Z

Changelog:

rebased on main

docs/source/metrics/metrics.md

examples/basic.rs

scylla/src/client/session.rs

scylla/src/network/connection.rs

wprzytula · 2025-01-27T09:00:23Z

scylla/src/policies/speculative_execution.rs

+    #[cfg(not(feature = "metrics"))]
+    fn retry_interval(&self, _: &Context) -> Duration {
+        warn!("PercentileSpeculativeExecutionPolicy requires the 'metrics' feature to work as intended, defaulting to 100 ms");
+        Duration::from_millis(100)
+    }


💭 Perhaps we should hide the PercentileSpeculativeExecutionPolicy behind the metrics feature, too? @Lorak-mmk @muzarski

Sounds reasonable. Current approach is no different from SimpleSpeculativeExecutionPolicy with retry_interval set to 100ms.

scylla/src/observability/metrics.rs

scylla/src/network/connection_pool.rs

QuerthDP · 2025-01-27T18:34:32Z

Changelog:

rebased @wprzytula's refactor on the branch and squashed his contribution into b3ae501
adjusted EWMA docstrings
fixed invalid unwraps in examplary code
added more clarity to the request timeout
resolved nitpicks

QuerthDP · 2025-01-27T19:43:14Z

Changelog:

rebased on main
bumped Cargo.lock.msrv as it was changed by other commits

I think this PR should pass all the CI now.

wprzytula

💯 Apart from the open discussion over PercentileSpeculativeExecutionPolicy, LGTM!

🎉 You've done a truly great job!

roydahan · 2025-01-29T14:35:40Z

@Lorak-mmk do we need your review here in order to merge?

muzarski · 2025-01-29T14:38:03Z

@Lorak-mmk do we need your review here in order to merge?

I'm currently reviewing it.

Lorak-mmk · 2025-01-29T14:47:01Z

@Lorak-mmk do we need your review here in order to merge?

I'll be reviewing it soon.

muzarski

I noticed that there are some potential logical races in rate metrics logic (yet another time I'm disappointed with cpp-driver...). Apart from that, LGTM. Nice job!

muzarski · 2025-01-29T15:00:47Z

scylla/src/observability/metrics.rs

+    fn tick(&self) {
+        let count = self.uncounted.swap(0, ORDER_TYPE);
+        let instant_rate = count as f64 / INTERVAL as f64;
+
+        if self.is_initialized.load(Ordering::Acquire) {
+            let rate = f64::from_bits(self.rate.load(Ordering::Acquire));
+            self.rate.store(
+                f64::to_bits(rate + self.alpha * (instant_rate - rate)),
+                Ordering::Release,
+            );
+        } else {
+            self.rate
+                .store(f64::to_bits(instant_rate), Ordering::Release);
+            self.is_initialized.store(true, Ordering::Release);
+        }
+    }


AFAIU, there can be a logical race if multiple threads execute this method concurrently, correct? Potential two-threads scenario: both threads enter the method. 1st thread reads the non-zero count, and atomically sets uncounted to 0. 2nd thread reads 0. Then they both land in the else branch. 1st thread sets rate to non-zero value, then the 2nd thread sets rate to zero. If I understand correctly, the acquire-release relationship of ordering does not prevent from such scenario.

OTOH, if this is not possible for two threads to enter this method concurrently (or there are some other safety guarantees), I think it should be documented.

I'm aware that this implementation is based on cpp-driver's implementation, and I wonder if we should ignore these issues. There is always the cpp_rust_unstable cfg if we do not want to expose this API to the standard users - we would only use it in cpp-rust-driver.

cc: @Lorak-mmk @wprzytula

AFAIU, there can be a logical race if multiple threads execute this method concurrently, correct?

Your observation is correct.

OTOH, if this is not possible for two threads to enter this method concurrently (or there are some other safety guarantees), I think it should be documented.

Agreed. From what I see in the code, there are safety guarantees in the calling code:

// Multiple threads could read the same `old_tick`... let old_tick = self.last_tick.load(ORDER_TYPE); let new_tick = self.start_time.elapsed().as_nanos() as u64; let elapsed = new_tick - old_tick; // _"Problematic"_ `if` - see a comment below. if elapsed > INTERVAL * 1_000_000_000 { let new_interval_start_tick = new_tick - elapsed % (INTERVAL * 1_000_000_000); // But then only one will succeed in the following COMPARE EXCHANGE operation. if self .last_tick .compare_exchange(old_tick, new_interval_start_tick, ORDER_TYPE, ORDER_TYPE) .is_ok() { let required_ticks = elapsed / (INTERVAL * 1_000_000_000); // So only one thread will do the following ticks. // My only concern is that this loop might take so long that another thread // enters the _"problematic"_ `if` and then we have a logical race there. // BUT this is extremely unlikely, because then the loop would have to take // 5 seconds! (INTERVAL * 1e9). So I think we are safe from those races. for _ in 0..required_ticks { self.one_minute_rate.tick(); self.five_minute_rate.tick(); self.fifteen_minute_rate.tick(); } } }

Perhaps my comments could be added to the code, along with a SHOUTING WARNING that tick must not be called concurrently?

Or, we could use advanced type system ~~magic~~ machinery and make tick require a SafetyMark instance as an argument, which would be constructible only the block guarded by that COMPARE EXCHANGE idiom, with the restriction enforced by the visibility rules?

The idea is a bit similar to the SendAttemptedProof in pager.rs. Actually, I'd happily write such a type-system-guaranteed mechanism. WDYT? @muzarski @Lorak-mmk

I did not read the code yet, but if something needs to not have concurrent access, then it should accept &mut instead of reinventing it using some type system trickery. Why is that not possible?

Well, it's not possible because of the complex logic involved. From what I described in the comments it's clear that it's not statically provable by the borrow checker that the code is sound.

Lorak-mmk · 2025-01-31T13:09:40Z

Hi, thanks for the PR!

I did not read all the code yet, but I did read the linked histogram issue.

I think that the best course for now is to:

Make sure that implementation details are not exposed, so that we can easily swap underlying histogram implementation in the future. I understand that the only think preventing this is the error type, right? In such case it makes sense to put it in Arc<dyn Error>.
For now use the AtomicHistogram, but put LockFreeHistogram somewhere it won't get lost (maybe a new repo?)

Why do I think we should use AtomicHistogram (at least for now)?

The original motivation for meddling with metrics is performance. Afaiu AtomicHistogram has better performance (but we don't have the benchmarks that could tell how much better), in exchange for being a small bit imprecise. Imo our metrics use case is not something that requires the absolute precision provided by LockFreeHistogram, so it should be ok to sacrifice it in the name of performance. If it ever turns out that we need the precision, we can change the implementation.
Histograms author provided examples of other crates successfully using AtomicHistogram for similar purposes to ours, which inspires confidence in the implementation.
LockfreeHistogram is more code to be maintained by us, so we should have a good rationale to use it.

wdyt @wprzytula @muzarski ?

muzarski · 2025-01-31T14:21:46Z

wdyt @wprzytula @muzarski ?

Yep, sounds good to me.

roydahan · 2025-02-06T14:09:54Z

@QuerthDP can you please fix the conflicts and address any comment left? (if any)

QuerthDP · 2025-02-06T20:12:25Z

Changelog:

rebased on main
added safety comments to the tick functions

TODO:

@NikodemGapski

replace LockFreeHistogram with AtomicHistogram (leaving a possible way to revert this change)

wprzytula · 2025-02-07T10:38:26Z

scylla/src/observability/metrics.rs

+    /// **WARNING: MUST NOT BE CALLED CONCURRENTLY!**
+    fn tick(&self) {


💭 This makes a valid case to employ unsafe, I believe, to force callers to prove that the contract is upheld.

Will there be a memory corruption if the contract is violated? If not, then it is not a valid use case for unsafe.

wprzytula · 2025-02-07T10:40:28Z

scylla/src/observability/metrics.rs

+    fn tick_if_necessary(&self) {
+        // Multiple threads could read the same `old_tick`...
+        let old_tick = self.last_tick.load(ORDER_TYPE);
+        let new_tick = self.start_time.elapsed().as_nanos() as u64;
+        let elapsed = new_tick - old_tick;
+
+        // _"Problematic"_ `if` - see a comment below.
+        if elapsed > INTERVAL * 1_000_000_000 {
+            let new_interval_start_tick = new_tick - elapsed % (INTERVAL * 1_000_000_000);
+            // But then only one will succeed in the following COMPARE EXCHANGE operation.
+            if self
+                .last_tick
+                .compare_exchange(old_tick, new_interval_start_tick, ORDER_TYPE, ORDER_TYPE)
+                .is_ok()
+            {
+                let required_ticks = elapsed / (INTERVAL * 1_000_000_000);
+                // So only one thread will do the following ticks.
+                // My only concern is that this loop might take so long that another thread
+                // enters the _"problematic"_ `if` and then we have a logical race there.
+                // BUT this is extremely unlikely, because then the loop would have to take
+                // 5 seconds! (INTERVAL * 1e9). So I think we are safe from those races.
+                for _ in 0..required_ticks {
+                    self.one_minute_rate.tick();
+                    self.five_minute_rate.tick();
+                    self.fifteen_minute_rate.tick();
+                }
+            }
+        }
+    }
+}


💭 I believe the body of this function could be put into unsafe block in order to warn readers and further implementers, and to allow calling tick() after making it an unsafe function.

Add atomic histogram from the histogram crate to metrics instead of a plain histogram placed under a mutex. This commit also updates crate histogram dependency from 0.6.9 to 0.11.1 for atomic functionalities and cleaner error handling.

Implement metrics making use of the histogram to measure query latencies. Added metrics are provided by the Snapshot structure containing metrics such as: min, max, mean, median, standard deviation and various percentiles. Co-authored-by: NikodemGapski <[email protected]>

Implement rates similar to those available in cpp-driver. [CPP-Driver implementation](https://github.com/scylladb/cpp-driver/blob/9d6b05c9d4ebd0a6d7006af4df3e33fcdf956eeb/src/metrics.hpp#L39C1-L252C5)

Implement gathering of connection metrics like total number of active connections, connection timeouts and request timeouts. Co-authored-by: Wojciech Przytuła <[email protected]>

Add metrics crate feature which enables usage and gathering of metrics. Therefore everyone willing to use metrics in their code is required to add metrics feature in their Cargo.toml file or compile otherwise with --features metrics flag. Additionally, add a CI step with cargo checks for this feature.

Inform that metrics may now only be used under crate feature 'metrics'. Mention new metrics in documentation and show an example how to collect them. Adjust examples to include new metrics.

As the user should not be able to create metrics instance otherwise than by `get_metrics()` function, the `Metrics::new()` method shall be set to pub(crate) visibility to support only internal usage.

NikodemGapski · 2025-02-08T16:47:37Z

Changelog:

rebased on main
replaced LockFreeHistogram with AtomicHistogram (the lock-free version can still be found on this branch of our fork)
hid the potential synchronisation metrics errors under Arc<dyn Error> (that's where the lock-free errors may go)

QuerthDP force-pushed the 330-make-metrics-collection-optional/faster branch from 74b7fa3 to d989a59 Compare December 11, 2024 10:50

QuerthDP force-pushed the 330-make-metrics-collection-optional/faster branch from d7b32d0 to f9a7153 Compare January 7, 2025 22:28

wprzytula added this to the 0.16.0 milestone Jan 9, 2025

QuerthDP force-pushed the 330-make-metrics-collection-optional/faster branch from bff846c to 88de7ff Compare January 12, 2025 18:46

QuerthDP marked this pull request as ready for review January 12, 2025 19:01

QuerthDP force-pushed the 330-make-metrics-collection-optional/faster branch from 88de7ff to 185ff57 Compare January 13, 2025 12:42

wprzytula requested review from wprzytula, Lorak-mmk and muzarski January 13, 2025 12:45

github-actions bot added the semver-checks-breaking cargo-semver-checks reports that this PR introduces breaking API changes label Jan 13, 2025

NikodemGapski force-pushed the 330-make-metrics-collection-optional/faster branch from 185ff57 to 7be0e38 Compare January 14, 2025 19:03

muzarski reviewed Jan 15, 2025

View reviewed changes

NikodemGapski force-pushed the 330-make-metrics-collection-optional/faster branch from 7be0e38 to 2f6ae32 Compare January 15, 2025 18:31

QuerthDP force-pushed the 330-make-metrics-collection-optional/faster branch from 2f6ae32 to b36ef1f Compare January 16, 2025 11:46

NikodemGapski force-pushed the 330-make-metrics-collection-optional/faster branch from 0bcde8e to cf43ed2 Compare January 16, 2025 17:20

QuerthDP force-pushed the 330-make-metrics-collection-optional/faster branch from cf43ed2 to 4392cb8 Compare January 16, 2025 19:57

wprzytula requested changes Jan 17, 2025

View reviewed changes

QuerthDP force-pushed the 330-make-metrics-collection-optional/faster branch from 4392cb8 to 6dab731 Compare January 22, 2025 11:52

QuerthDP requested a review from wprzytula January 22, 2025 12:14

wprzytula assigned QuerthDP Jan 22, 2025

wprzytula requested a review from muzarski January 22, 2025 13:36

QuerthDP force-pushed the 330-make-metrics-collection-optional/faster branch from 6dab731 to 32537f9 Compare January 26, 2025 16:13

wprzytula requested changes Jan 27, 2025

View reviewed changes

QuerthDP force-pushed the 330-make-metrics-collection-optional/faster branch from bcb3b77 to f297b75 Compare January 27, 2025 18:33

QuerthDP requested a review from wprzytula January 27, 2025 18:35

QuerthDP force-pushed the 330-make-metrics-collection-optional/faster branch from f297b75 to b31d836 Compare January 27, 2025 19:13

QuerthDP force-pushed the 330-make-metrics-collection-optional/faster branch from b31d836 to 27e4e86 Compare January 27, 2025 19:41

wprzytula previously approved these changes Jan 27, 2025

View reviewed changes

roydahan assigned Lorak-mmk Jan 29, 2025

muzarski reviewed Jan 29, 2025

View reviewed changes

Lorak-mmk modified the milestones: 0.16.0, 1.0.0 Feb 5, 2025

QuerthDP dismissed wprzytula’s stale review via 8a0ea57 February 6, 2025 20:11

QuerthDP force-pushed the 330-make-metrics-collection-optional/faster branch from 27e4e86 to 8a0ea57 Compare February 6, 2025 20:11

wprzytula reviewed Feb 7, 2025

View reviewed changes

NikodemGapski and others added 8 commits February 8, 2025 17:33

observability: add atomic histogram

c98feb8

Add atomic histogram from the histogram crate to metrics instead of a plain histogram placed under a mutex. This commit also updates crate histogram dependency from 0.6.9 to 0.11.1 for atomic functionalities and cleaner error handling.

metrics: add rates of queries per second

976e007

Implement rates similar to those available in cpp-driver. [CPP-Driver implementation](https://github.com/scylladb/cpp-driver/blob/9d6b05c9d4ebd0a6d7006af4df3e33fcdf956eeb/src/metrics.hpp#L39C1-L252C5)

metrics: add connection metrics

c92f325

Implement gathering of connection metrics like total number of active connections, connection timeouts and request timeouts. Co-authored-by: Wojciech Przytuła <[email protected]>

docs: adjust metrics documentation

136613b

Inform that metrics may now only be used under crate feature 'metrics'. Mention new metrics in documentation and show an example how to collect them. Adjust examples to include new metrics.

metrics: unpub Metrics::new()

5e8cdad

As the user should not be able to create metrics instance otherwise than by `get_metrics()` function, the `Metrics::new()` method shall be set to pub(crate) visibility to support only internal usage.

node: fix heavily outdated docstring

66b63c9

NikodemGapski force-pushed the 330-make-metrics-collection-optional/faster branch from 8a0ea57 to 66b63c9 Compare February 8, 2025 16:36

		/// WARNING: MUST NOT BE CALLED CONCURRENTLY!
		fn tick(&self) {

Make metrics collection optional/faster #1147

Are you sure you want to change the base?

Make metrics collection optional/faster #1147

Conversation

QuerthDP commented Dec 10, 2024 • edited Loading

Pre-review checklist

github-actions bot commented Dec 10, 2024 • edited Loading

QuerthDP commented Jan 13, 2025

muzarski left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

muzarski Jan 15, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

QuerthDP commented Jan 26, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

QuerthDP commented Jan 27, 2025

QuerthDP commented Jan 27, 2025 • edited Loading

wprzytula left a comment

Choose a reason for hiding this comment

roydahan commented Jan 29, 2025

muzarski commented Jan 29, 2025

Lorak-mmk commented Jan 29, 2025

muzarski left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wprzytula Jan 29, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Lorak-mmk commented Jan 31, 2025

muzarski commented Jan 31, 2025

roydahan commented Feb 6, 2025

QuerthDP commented Feb 6, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

NikodemGapski commented Feb 8, 2025

QuerthDP commented Dec 10, 2024 •

edited

Loading

github-actions bot commented Dec 10, 2024 •

edited

Loading

muzarski Jan 15, 2025 •

edited

Loading

QuerthDP commented Jan 27, 2025 •

edited

Loading

wprzytula Jan 29, 2025 •

edited

Loading

QuerthDP commented Feb 6, 2025 •

edited

Loading