Skip to content

Update insert, contains and remove methods to use Borrow trait #17

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 23, 2025

Conversation

TyberiusPrime
Copy link
Contributor

I've really enjoyed scalable_cuckoo_filter, thank you!

I need to deduplicate a (&[u8], Option<&[u8]>, Option<&[u8]>, Option<&[u8]>,) and I don't want to do cloning just to get a &[u8] (millions of entries...).

Unfortunately, declaring a ScalableCuckooFilter<MyTuple>
and MyTuple<'a> = (&'a [u8], Option<&'a [u8], ...)
makes rustc think that insert() keeps the reference around.

The problem is that rust ties the inner references to the struct's lifetime, even if you introduce a lifetime parameter and change the PhantomData<T> to a PhantomData<&'b T>.

I could instead have a ScalableCuckooFilter<T = u64>,
and pass in a hash of MyTuple, which then get's immediately hashed again, but it's hard to envision if that would still get the cuckoo's guarantees.

One could remove T from the ScaleableCuckooFilter, and put the constraints on query/insert - but that would allow you to mix types in one filter. I don't think that'd be sensible.

Or we can export crate::hash, make hasher on ScalableCuckoFilter pub, and introduce an unsafe insert_by_hash(&mut self, item_hash: u64) function.

Or, combining the ideas, adding this to ScalableCuckooFilter:
pub unsafe fn insert_reference_type<U: Hash + ?Sized>(&mut self, item: &U)
(which this PR does).
That does not require the user to know anything about the hash,
just to uphold the type invariant.

(Seems rust's been missing a type equality constraint for 10+ years now:
rust-lang/rust#20041 )

@sile
Copy link
Owner

sile commented Apr 22, 2025

Thank you for you PR and the detailed explanation.
I understand your motivation.
Your approach seems basically reasonable, but I'll review it in more detail later.
Just an idea, it might be sufficient to add a trait bound as follows to the insert method (and contains / remove methods too?).

pub fn insert<U>(&mut self, item: &U) 
    where U: PartialEq<T> + Hash + ?Sized {

@TyberiusPrime
Copy link
Contributor Author

pub fn insert<U>(&mut self, item: &U) 
    where U: PartialEq<T> + Hash + ?Sized {

Alas, that also pulls in the hidden T<'life_of_filter> lifetime.

Here's my tiny example to test compilation

    let mut f = scalable_cuckoo_filter::ScalableCuckooFilter::new(1000, 0.05);
    for _ in 0..2 {
        let a = "hello".to_string();
        let q = (&a,);

        f.insert(&q);
        dbg!(f.contains(&q));
        f.insert(&q);
    }

@sile
Copy link
Owner

sile commented Apr 23, 2025

Thank you for sharing the example code.
I modified it to work as follows:

     // Define a struct and implement `PartialEq` for the desired reference type.
     #[derive(Hash)]
     struct MyTuple(String);

     impl PartialEq<(&String,)> for MyTuple {
         fn eq(&self, rhs: &(&String,)) -> bool {
             self.0 == *rhs.0
         }
     }

     // Added `::<MyTuple>`
     let mut f = scalable_cuckoo_filter::ScalableCuckooFilter::<MyTuple>::new(1000, 0.05);

     // The remaining code is identical to yours.
     for _ in 0..2 {
         let a = "hello".to_string();
         let q = (&a,);

         f.insert(&q);
         dbg!(f.contains(&q));
         f.insert(&q);
     }

And the patch that was applied locally to this crate is shown below:

diff --git a/src/scalable_cuckoo_filter.rs b/src/scalable_cuckoo_filter.rs
index d4338de..88e9803 100644
--- a/src/scalable_cuckoo_filter.rs
+++ b/src/scalable_cuckoo_filter.rs
@@ -208,7 +208,11 @@ impl<T: Hash + ?Sized, H: Hasher + Clone, R: Rng> ScalableCuckooFilter<T, H, R>
     }

     /// Returns `true` if this filter may contain `item`, otherwise `false`.
-    pub fn contains(&self, item: &T) -> bool {
+    pub fn contains<U>(&self, item: &U) -> bool
+    where
+        T: PartialEq<U>,
+        U: Hash + ?Sized,
+    {
         let item_hash = crate::hash(&self.hasher, item);
         self.filters
             .iter()
@@ -242,7 +246,11 @@ impl<T: Hash + ?Sized, H: Hasher + Clone, R: Rng> ScalableCuckooFilter<T, H, R>
     ///     }
     /// }
     /// ```
-    pub fn insert(&mut self, item: &T) {
+    pub fn insert<U>(&mut self, item: &U)
+    where
+        T: PartialEq<U>,
+        U: Hash + ?Sized,
+    {
         let item_hash = crate::hash(&self.hasher, item);
         let last = self.filters.len() - 1;
         self.filters[last].insert(&self.hasher, &mut self.rng, item_hash);

@sile
Copy link
Owner

sile commented Apr 23, 2025

I think that my approach is not perfect (e.g., it is a breaking change introducing an additional PartialEq constraint to the existing methods).
However, I don't think I can merge your insert_reference_type approach as-is either, since I want to avoid the use of unsafe code as much as possible.
If it is needed to bypass the Rust compiler's type checking, it might be better to provide an alternative structure like UntypedScalableCuckooFilter that doesn't take the type parameter T (though this is just an idea too).

@TyberiusPrime
Copy link
Contributor Author

That essentially works, but I think (ab)using PartialEq isn't the right call.

It also works with a custom, function less trait, but it's still a breaking change having the user implement it.

Unfortunately, a blanket implementation of that trait again brings us into the situation where a SCF will accept any Hash+?Sized, not just T.

pub trait FixReferences<T: ?Sized> {}
impl <T, U> FixReferences<T> for U  where U: ?Sized + Hash {}

so baring that, the user has to

impl scalable_cuckoo_filter::FixReferences<MyTuple<'_>> for MyTuple<'_> { }

@sile
Copy link
Owner

sile commented Apr 23, 2025

I think (ab)using PartialEq isn't the right call.

I agree.

I came up with a better approach that uses Borrow trait and 'static lifetime.

 fn main() { 
     #[derive(Hash)]
     struct MyTuple((&'static String,));

     impl<'a> std::borrow::Borrow<(&'a String,)> for MyTuple {
         fn borrow(&self) -> &(&'a String,) {
             &self.0
         }
     }

     let mut f = scalable_cuckoo_filter::ScalableCuckooFilter::<MyTuple>::new(1000, 0.05);
     for _ in 0..2 {
         let a = "hello".to_string();
         let q = (&a,);

         f.insert(&q);
         dbg!(f.contains(&q));
         f.insert(&q);
     }
 }
diff --git a/src/scalable_cuckoo_filter.rs b/src/scalable_cuckoo_filter.rs
index d4338de..1bd4385 100644
--- a/src/scalable_cuckoo_filter.rs
+++ b/src/scalable_cuckoo_filter.rs
@@ -208,7 +208,11 @@ impl<T: Hash + ?Sized, H: Hasher + Clone, R: Rng> ScalableCuckooFilter<T, H, R>
     }

     /// Returns `true` if this filter may contain `item`, otherwise `false`.
-    pub fn contains(&self, item: &T) -> bool {
+    pub fn contains<U>(&self, item: &U) -> bool
+    where
+        T: std::borrow::Borrow<U>,
+        U: Hash + ?Sized,
+    {
         let item_hash = crate::hash(&self.hasher, item);
         self.filters
             .iter()
@@ -242,7 +246,11 @@ impl<T: Hash + ?Sized, H: Hasher + Clone, R: Rng> ScalableCuckooFilter<T, H, R>
     ///     }
     /// }
     /// ```
-    pub fn insert(&mut self, item: &T) {
+    pub fn insert<U>(&mut self, item: &U)
+    where
+        T: std::borrow::Borrow<U>,
+        U: Hash + ?Sized,
+    {
         let item_hash = crate::hash(&self.hasher, item);
         let last = self.filters.len() - 1;
         self.filters[last].insert(&self.hasher, &mut self.rng, item_hash);

This approach is still somewhat tricky, but it seems acceptable to me.

@TyberiusPrime
Copy link
Contributor Author

TyberiusPrime commented Apr 23, 2025

I guess we could supplement a bunch of default impls
to cover the common use cases

impl FixReferences<&str> for &str { }
impl FixReferences<str> for str { }
impl FixReferences<String> for String { }
impl FixReferences<&String> for String { }
impl FixReferences<[u8]> for [u8] { }
impl FixReferences<&[u8]> for [u8] { }
impl FixReferences<Vec<u8>> for Vec<u8> { }

So I guess we either

  1. add a marker trait and expect users to implement it
    or
  2. relax the requirement that query/insert/remove take exactly one kind of T:Hash
    or
  3. relax the requirement, but only on unsafe equivalents of the functions.
    or
  4. add insert_hash / remove_hash, and let the user hash themselves. (marked unsafe or not)

Thoughts:

  1. is a breaking change, though potentially buffered providing impls for common types. Will on occasion require the caller to introduce a newtype.
  2. is just bad api design
  3. is the 'not so bad kind of unsafe' (unsafe == caller upholds invariant, failure leads to nothing horrible. vs 'leads to undefined behaviour)
  4. is basically three, but upholding the invariant just got a whole bunch more complicate
    To my mind, there's two levels of unsafe functions.
    The first just means 'there's special invariants that the caller has to uphold' for this to be sound. The second one is 'there's special invariants that the caller has to uphold or we're in undefined behavior country.

If I could say impl FixReferencs<T> for <anything-without-an-inner-lifetime>,
no 1 would be non-breaking, I believe.

edit:
I think this might point into the right direction:

impl<T> FixReferences<T> for &T where T: ?Sized {}

But it's not quite right. Now ScalableCuckooFilter::
requires a &&str for query/insert.

@sile
Copy link
Owner

sile commented Apr 23, 2025

Thank you for your additional comment.

As I said before, I want to avoid using unsafe as much as possible (even if it is "not so bad kind of unsafe").

Among the options, I prefer option 1.
However, rather than defining a crate-specific trait, using std::borrow::Borrow (as shown in #17 (comment)) seems better.
(I think this usage of Borrow aligns with the standard library's approach in methods like HashSet::contains())

@TyberiusPrime
Copy link
Contributor Author

 struct MyTuple((&'static String,));

Thoughts:

  • This requires the user to not pass in &MyTuples, but anonymous, equivalent types, (because the MyTuple itself is already 'static)
  • The nested tuple is neat, even if I don't quite yet grok why it's necessary.
  • The usual suspects ([u8], str, Vec) already implement Borrow
  • &String implements Borrow<&str> so you can hand a &String into insert & contains on a ScalableCuckooFilter, that's neat.

You can make the anonymous type explicit:

#[derive(Hash)]
struct InnerTuple<'a>(&'a str, Option<&'a str>);
#[derive(Hash)]
struct MyTuple(InnerTuple<'static>);

impl<'a> std::borrow::Borrow<InnerTuple<'a>> for MyTuple {
    fn borrow(&self) -> &InnerTuple<'a> {
        &self.0
    }
}

I think we might have a winner.

@TyberiusPrime
Copy link
Contributor Author

It's still a breaking change though, as the test cases show.

It 'sabotages' the type interference from the first insert call.

@TyberiusPrime TyberiusPrime force-pushed the insert_reference_type branch from e095d30 to e69b67a Compare April 23, 2025 13:46
@TyberiusPrime
Copy link
Contributor Author

I've adjusted the PR for the borrow solution, fixed up the test cases and added some brief documentation.

Copy link
Owner

@sile sile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@sile sile merged commit 92ee23f into sile:master Apr 23, 2025
@sile sile changed the title Introduce insert_reference_type Update insert, contains and remove methods to use Borrow traitI Apr 23, 2025
@sile sile changed the title Update insert, contains and remove methods to use Borrow traitI Update insert, contains and remove methods to use Borrow trait Apr 23, 2025
@TyberiusPrime
Copy link
Contributor Author

Thank you!

Thank you! This felt very much like a collaboration, and not the random drive by PR that's so common.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants