Question about segcache eviction policy #5

Closed
luzhang6 opened this issue Dec 21, 2022 · 22 comments

Comments

@luzhang6

Hi @brayniac @thinkingfish @kevyang, I have recently been testing segcache and would like to ask about its eviction policy. Are TTL and segment size both factors that can trigger eviction of a cached item?
For example, I set a 500s TTL, and my test takes about 1 minute to run. If I still get 0s for a cached key, is that because the cache segment is full and the value was evicted? Thank you.

@1a1a11a

1a1a11a commented Dec 22, 2022

Both TTL and eviction will cause items to be removed from the cache. When the cache size is not large enough, items may get evicted, and when you have TTLs set, items will be removed after reaching TTL. In your case, if you have 500s TTL, and your test runs for 1 min, then no items should be removed/expired.

I do not understand the part of your question about "still getting 0s for a cached key". Can you elaborate? What do you mean by 0s?
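
To make the difference between expiry and eviction concrete, here is a minimal toy sketch. It is not segcache's implementation: `ToyCache`, its FIFO eviction rule, and the caller-supplied clock in seconds are all assumptions made for illustration. It only shows that an item can disappear for two independent reasons, its TTL running out or the cache running out of room.

```rust
use std::collections::{HashMap, VecDeque};

/// Toy cache for illustration only (hypothetical, not segcache's design).
/// Items expire after a TTL, and the oldest item is evicted (FIFO) when
/// the cache is at capacity. Assumes `capacity >= 1`.
pub struct ToyCache {
    capacity: usize,
    // key -> (value, expiry time in seconds on a caller-supplied clock)
    items: HashMap<String, (Vec<u8>, u64)>,
    // insertion order, used to pick the eviction victim
    order: VecDeque<String>,
}

impl ToyCache {
    pub fn new(capacity: usize) -> Self {
        ToyCache { capacity, items: HashMap::new(), order: VecDeque::new() }
    }

    /// Insert `key` at time `now` with `ttl` seconds to live.
    pub fn insert(&mut self, key: &str, value: &[u8], ttl: u64, now: u64) {
        if self.items.contains_key(key) {
            // Overwrite: forget the key's old position in the queue.
            self.order.retain(|k| k.as_str() != key);
        } else if self.items.len() >= self.capacity {
            // Eviction: the cache is full, so drop the oldest item even
            // though its TTL may not have elapsed yet.
            if let Some(victim) = self.order.pop_front() {
                self.items.remove(&victim);
            }
        }
        self.items.insert(key.to_string(), (value.to_vec(), now + ttl));
        self.order.push_back(key.to_string());
    }

    /// Return the value if the key is present and not yet expired at `now`.
    pub fn get(&self, key: &str, now: u64) -> Option<&[u8]> {
        match self.items.get(key) {
            Some((value, expires_at)) if now < *expires_at => Some(value.as_slice()),
            // Either never stored, evicted, or expired: all look like a miss.
            _ => None,
        }
    }
}
```

In this sketch, with a 500s TTL and a test that runs for about a minute, the only way an item can go missing is the capacity path, which matches the reasoning above.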

@luzhang6
Author

@1a1a11a Thanks for the answer! In my case, I found that the key still exists in the cache, but no value is associated with it because the value is set to 0. I guess it was evicted or expired. The TTL is set to 500s, so is the cause that the cache size is too small? Where can I check the cache size?

@thinkingfish
Member

thinkingfish commented Dec 27, 2022

@luzhang6 How are you reading this key right now? Are you using the function API or accessing it remotely via RPC? Is the value 0 if you read your own write immediately after setting the key?

Can you maybe present a minimal setup to reproduce this problem? If so we can try to reproduce independently and save some back-and-forth.

@luzhang6
Author

luzhang6 commented Jan 2, 2023

@thinkingfish Thanks! I think we use the function API in our setup. We create random keys and values, put them into the cache, and then read the keys back from the cache to validate the values. However, some values are 0s, so we think those keys somehow expired during our test.
In our setup, we built our own cache manager on top of the segcache library, and our test just does random writes and reads. The whole setup may be a little complex to explain, so we will package it up and present it to you if you need more information from us.

@brayniac
Collaborator

brayniac commented Jan 3, 2023

@luzhang6 - if the key is not present in the cache, a miss is returned. You shouldn't get an actual value response. Are you using some client to communicate with the Pelikan Segcache server? Or are you using the "seg" storage library directly?

@luzhang6
Author

luzhang6 commented Jan 5, 2023

@brayniac We use the seg storage library directly. Does the get call return an error if the key is not present?
@beinan created the segcache implementation for our use case and can provide more context.

@brayniac
Collaborator

brayniac commented Jan 6, 2023

It'll return the None variant of Option<Item> on a cache miss, as opposed to some item. This is similar to the standard library's HashMap, which returns None if the key isn't present and returns the value if it is. It's up to the caller to handle that properly. I suspect an issue in the usage of the library. However, if you still suspect a bug, I would need a simple way to reproduce the issue to be certain.
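
As a small illustration of the HashMap analogy, the sketch below uses only the standard library, not the seg crate. The helper names `lookup` and `lookup_lossy` are hypothetical. The first keeps a miss visible to the caller; the second shows how collapsing None into a default can make a miss look like a stored value of 0.

```rust
use std::collections::HashMap;

/// Look up `key` and keep a miss visible to the caller as an error.
pub fn lookup(cache: &HashMap<String, u64>, key: &str) -> Result<u64, String> {
    match cache.get(key) {
        Some(value) => Ok(*value),
        None => Err(format!("cache miss for key {}", key)),
    }
}

/// Anti-pattern: a miss is silently turned into the default value 0,
/// so the caller cannot tell "not in the cache" from "stored 0".
pub fn lookup_lossy(cache: &HashMap<String, u64>, key: &str) -> u64 {
    cache.get(key).copied().unwrap_or_default()
}
```

If a caller sees unexpected zeros, a pattern like `lookup_lossy` somewhere between the cache and the assertion is one possible place to look.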

@beinan

beinan commented Jan 6, 2023

> It'll return the None variant of Option<Item> on a cache miss, as opposed to some item. This is similar to the standard library's HashMap, which returns None if the key isn't present and returns the value if it is. It's up to the caller to handle that properly. I suspect an issue in the usage of the library. However, if you still suspect a bug, I would need a simple way to reproduce the issue to be certain.

Thank you @brayniac! I'm also wondering whether there is a mistake on the caller side. Here is the implementation of the caller (it's a Rust-Java binding):
https://github.com/beinan/data_cache/blob/master/cache_jni/rust_lib/src/lib.rs

Java side code: https://github.com/beinan/data_cache/blob/master/cache_jni/java_lib/src/main/java/alluxio/sdk/file/cache/NativeCacheManager.java

And a unit test in Java: https://github.com/beinan/data_cache/blob/master/cache_jni/java_lib/src/test/java/alluxio/sdk/file/cache/NativeCacheManagerTest.java

The unit test runs reliably, with no flakiness at all.

But it fails in the microbenchmark: https://github.com/Alluxio/alluxio/pull/16448/files#diff-465844ed4d3733bb67ded1023462180095271ba97f3cbd21c196c0427d1eeefa

Some of the keys are missing, as @luzhang6 mentioned. Any ideas or suggestions? Thanks a lot!

@brayniac
Collaborator

brayniac commented Jan 8, 2023

@beinan - if you can provide a minimal repro in pure Rust, I can take a look. I'm not going to be able to troubleshoot the bindings. It looks like you're handling the "None" case properly, but I'm not sure whether I'm missing some subtle issue in the bindings or whether the issue is in the calling code.

To reduce the debug area, please write some basic Rust program that demonstrates the issue.

@thinkingfish
Member

thinkingfish commented Jan 8, 2023 via email

@brayniac
Collaborator

brayniac commented Jan 8, 2023

I guess the other basic question here would be "what is the expected value"?

@thinkingfish
Member

thinkingfish commented Jan 8, 2023 via email

@brayniac
Collaborator

brayniac commented Jan 8, 2023 via email

@luzhang6
Author

luzhang6 commented Jan 9, 2023

@thinkingfish @brayniac I adjusted the hashpower from 16 to 32, and the errors almost disappeared. We have roughly 1 to 10 million keys. What is a reasonable value for the hashpower? Also, can you please give more details on what hashpower is and how we should tune it? Thanks! cc @beinan

@brayniac
Collaborator

brayniac commented Jan 9, 2023

@luzhang6 - hashpower controls the number of item slots in the hashtable. If there are too few, then items might not be stored, because the hashtable has a fixed size set at initialization. You'd get an error Result for the insert if there is no space in the hashtable to insert the item.

See the docs here for the meaning of the value: https://github.com/pelikan-io/pelikan/blob/main/src/storage/seg/src/builder.rs#L29
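
For a rough feel of the numbers, here is a back-of-the-envelope sketch. It assumes the simplest model, in which a hash power of N gives 2^N item slots; the helper names `slots` and `min_hash_power` are hypothetical and are not part of the seg crate. The linked docs are authoritative for the real capacity, which may be lower than 2^N if part of the table is reserved for other uses.

```rust
/// Number of slots under the simplifying assumption slots = 2^power.
/// `power` must be below 64.
pub fn slots(power: u32) -> u64 {
    1u64 << power
}

/// Smallest hash power N (capped at 63) such that 2^N >= `keys`,
/// under the same simplifying assumption.
pub fn min_hash_power(keys: u64) -> u32 {
    let mut power = 0;
    while power < 63 && slots(power) < keys {
        power += 1;
    }
    power
}
```

Under this model a hash power of 16 gives 65,536 slots, far fewer than 1 to 10 million keys, while 1 million keys need at least 20 and 10 million keys need at least 24. Leaving headroom above these minimums seems prudent.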

@thinkingfish
Member

@luzhang6 Oh! When you get the key in the microbench, you don't check the return status, so you can't differentiate a miss (which returns -1) from a corrupted value.
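
A minimal sketch of what checking the status could look like, written in plain Rust against a standard HashMap rather than the actual binding. The functions `get_into` and `read_checked`, and the -2 code for an undersized buffer, are hypothetical; only the "miss returns -1" convention comes from the comment above. It also shows one way a miss can surface as zeros: the caller's zero-initialized buffer is never written.

```rust
use std::collections::HashMap;

/// Hypothetical binding-style getter: copies the value into `buf` and
/// returns its length. Returns -1 on a miss and -2 if `buf` is too small;
/// in both cases `buf` is left untouched.
pub fn get_into(cache: &HashMap<String, Vec<u8>>, key: &str, buf: &mut [u8]) -> i32 {
    match cache.get(key) {
        None => -1,
        Some(value) if value.len() > buf.len() => -2,
        Some(value) => {
            buf[..value.len()].copy_from_slice(value);
            value.len() as i32
        }
    }
}

/// Caller that checks the status before trusting the buffer contents.
pub fn read_checked(
    cache: &HashMap<String, Vec<u8>>,
    key: &str,
    buf_len: usize,
) -> Option<Vec<u8>> {
    let mut buf = vec![0u8; buf_len];
    let status = get_into(cache, key, &mut buf);
    if status < 0 {
        // Miss or error: the zero-filled buffer is not a real value.
        return None;
    }
    buf.truncate(status as usize);
    Some(buf)
}
```

A benchmark that compares the buffer to the expected value without looking at the status would report a miss as a wrong value of all zeros.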

@luzhang6
Author

@brayniac @thinkingfish Thanks! Will take a look at the reference doc. I see, currently we just check whether the value equals the expected value in our test.

@luzhang6
Author

Hi @thinkingfish @brayniac, I'd like to ask another question. Currently we use segcache in memory. How can we switch it to a disk-backed mode?

@brayniac
Collaborator

@luzhang6 - the current support is really focused around using Intel Persistent Memory (PMEM) and can be enabled with the "datapool path" configuration: https://github.com/pelikan-io/pelikan/blob/main/src/storage/seg/src/builder.rs#L138

It's important to note that this is based on memory-mapped file access. If the filesystem is not mounted with the DAX option, access will go through the page cache, so there will be little benefit, and colocated processes could have a significant impact.

I would not currently recommend trying to use this with a standard block device.

@thinkingfish
Member

Closing this issue. The question around SSD is a great one, and we have plans to improve support for SSD. We will add that to the roadmap when we publish it; please let me know if anybody wants to collaborate on it.

@beinan

beinan commented Jan 29, 2023

> Closing this issue. The question around SSD is a great one, and we have plans to improve support for SSD. We will add that to the roadmap when we publish it; please let me know if anybody wants to collaborate on it.

Thank you @thinkingfish! Is it possible to pull me into the SSD project? I think this would be a very decent replacement for Presto/Trino's built-in cache store, since we're currently suffering from GC issues with the Java implementation. Thanks!

@thinkingfish
Member

I just created a Discord server, and you are welcome to join! https://discord.gg/EuSQn6TQh7 It was just created, so it is still quiet right now, but hopefully it will become more active soon.
