support lock-free hashmap backend #436
base: main
Conversation
@@ -115,6 +116,35 @@ class HashMapBackend final : public VolatileBackend<Key, HashMapBackendParams> {
    uint64_t access_count;
  };
  ValuePtr value;
  std::atomic<int8_t> lck;
Technically speaking, I have nothing against this change in general. But we need to think carefully about whether it will break the MultiProcessHashmap. Maybe it is better to split the code.
What is more relevant: even though you use an int8_t here, the memory usage should still increase by 8 bytes per payload due to alignment. Can we un-align the data structure without loss of performance?
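A minimal sketch of the alignment point, using hypothetical stand-in types rather than the PR's real Payload:

```cpp
#include <atomic>
#include <cstdint>
#include <iostream>

// Hypothetical stand-ins for the real HugeCTR types; layout shown for illustration only.
using ValuePtr = char*;

struct PayloadBefore {
  uint64_t access_count;  // 8 bytes
  ValuePtr value;         // 8 bytes, 8-byte aligned
};

struct PayloadAfter {
  uint64_t access_count;    // 8 bytes
  ValuePtr value;           // 8 bytes
  std::atomic<int8_t> lck;  // 1 byte of data ...
};                          // ... plus 7 bytes of tail padding to preserve 8-byte alignment

int main() {
  // On a typical 64-bit platform this prints 16 and 24: the 1-byte flag
  // costs a full 8 bytes per payload because of padding.
  std::cout << sizeof(PayloadBefore) << ' ' << sizeof(PayloadAfter) << '\n';
}
```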
phmap::parallel_flat_hash_map<Key, Payload, phmap::priv::hash_default_hash<Key>,
                              phmap::priv::hash_default_eq<Key>,
                              phmap::priv::Allocator<phmap::priv::Pair<const Key, Payload>>,
                              4, std::shared_mutex> entries;
Yes, it will avoid breaking the updates, but partly for reasons that I wanted to avoid.
What parallel_flat_hash_map does with this configuration is split the hash table into 4 independent pieces. The submap is this thing in their code:

struct Inner : public Lockable

Upon query, each call ends up with a structure like this,

typename Lockable::UniqueLock m;
auto res = this->find_or_prepare_insert(k, m);

which in turn executes:

Inner& inner = sets_[subidx(hashval)];
auto& set = inner.set_;
mutexlock = std::move(typename Lockable::UniqueLock(inner));

So in other words: it will select one of the 4 independent submaps and FULLY lock it, so there is no parallelism possible within that submap. In the very BEST case, when you have 4 keys that subidx hashes to different sub-hashmaps, you can accommodate up to 4 threads. Arguably, that is better than now, but you need to be aware that the worst case still exists. That is the reason why we didn't use their parallelized hashmap implementation yet.
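A simplified, hypothetical sketch of the sharding behavior described above (not phmap's actual code): every access exclusively locks the whole shard its key hashes to, so concurrency is bounded by the shard count and collapses when keys collide on one shard.

```cpp
#include <array>
#include <functional>
#include <mutex>
#include <unordered_map>

// Toy sharded map with assumed names; illustrates only the locking pattern.
template <typename K, typename V, size_t NumShards = 4>
class ShardedMap {
  struct Shard {
    std::mutex mtx;                // one lock per submap
    std::unordered_map<K, V> map;
  };
  std::array<Shard, NumShards> shards_;

  Shard& shard_for(const K& key) {
    return shards_[std::hash<K>{}(key) % NumShards];  // analogous to subidx(hashval)
  }

 public:
  void insert(const K& key, const V& value) {
    Shard& s = shard_for(key);
    std::lock_guard<std::mutex> lock(s.mtx);  // whole shard locked, like Lockable::UniqueLock(inner)
    s.map[key] = value;
  }

  bool find(const K& key, V& out) {
    Shard& s = shard_for(key);
    std::lock_guard<std::mutex> lock(s.mtx);  // readers also serialize within a shard
    auto it = s.map.find(key);
    if (it == s.map.end()) return false;
    out = it->second;
    return true;
  }
};
// Best case: up to NumShards threads proceed in parallel (keys on distinct shards).
// Worst case: all keys hash to the same shard and every access serializes.
```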
@yingcanw We can take over this part of the change without much issue. In my new streamlined hashmap we will use a similar method to speed things up.
@@ -55,7 +56,9 @@ namespace HugeCTR {
  const Payload& payload{it->second}; \
  \
  /* Stash pointer and reference in map. */ \
  std::unique_lock lock(part.read_write_lck); \
@yingcanw What worries me about this is mostly the strain under heavy load. We would need to measure whether it kills insert speed. The new hashmap will fix this one.
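One rough way to quantify that concern (a hypothetical micro-benchmark, not part of the PR): measure insert throughput on a shared_mutex while reader threads repeatedly take the same lock, once with unique_lock (as on the changed fetch path) and once with shared_lock, and compare.

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <unordered_map>
#include <vector>

int main() {
  std::shared_mutex read_write_lck;           // stands in for the partition lock
  std::unordered_map<uint64_t, uint64_t> part;
  std::atomic<bool> stop{false};

  // Reader threads: with the PR, the fetch path takes a unique_lock here.
  std::vector<std::thread> readers;
  for (int i = 0; i < 8; ++i) {
    readers.emplace_back([&] {
      while (!stop.load(std::memory_order_relaxed)) {
        std::unique_lock lock(read_write_lck);  // swap for std::shared_lock to compare
        (void)part.count(42);
      }
    });
  }

  // Writer: measure insert throughput under that contention.
  const auto t0 = std::chrono::steady_clock::now();
  for (uint64_t k = 0; k < 1'000'000; ++k) {
    std::unique_lock lock(read_write_lck);
    part.emplace(k, k);
  }
  const auto t1 = std::chrono::steady_clock::now();
  stop.store(true);
  for (auto& t : readers) t.join();

  std::cout << "inserts/s: "
            << 1e6 / std::chrono::duration<double>(t1 - t0).count() << "\n";
}
```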
@@ -36,7 +36,7 @@ HashMapBackend<Key>::HashMapBackend(const HashMapBackendParams& params) : Base(p

template <typename Key>
size_t HashMapBackend<Key>::size(const std::string& table_name) const {
  const std::shared_lock lock(read_write_guard_);
The changes below are fine under the assumption that we take over the above changes.
@@ -91,6 +94,7 @@ namespace HugeCTR {
  \
  /* Race-conditions here are deliberately ignored because insignificant in practice. */ \
  __VA_ARGS__; \
  while (payload.lck.load(std::memory_order_relaxed) != 0); \
  std::copy_n(payload.value, part.value_size, &values[(k - keys) * value_stride]); \
I think there is a logical error here. What if payload.value is changed rapidly? The lck doesn't guarantee anything about that: during the copy_n, the payload value might change or become invalid.
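A hedged illustration of that gap (a sketch, not the PR's code): a relaxed spin-wait followed by an unguarded copy can read data that a writer modifies mid-copy. A seqlock-style version counter with acquire/release ordering is one way to detect and retry a torn read; all names below are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical payload guarded by a version counter (seqlock pattern).
struct GuardedPayload {
  std::atomic<uint32_t> version{0};  // even = stable, odd = writer in progress
  float value[16];
};

// Writer: bump to odd, mutate, bump back to even.
void write_payload(GuardedPayload& p, const float* src, size_t n) {
  p.version.fetch_add(1, std::memory_order_acq_rel);   // now odd
  std::copy_n(src, n, p.value);
  p.version.fetch_add(1, std::memory_order_release);   // even again
}

// Reader: retry until a copy completes with an unchanged, even version.
void read_payload(const GuardedPayload& p, float* dst, size_t n) {
  uint32_t v0, v1;
  do {
    v0 = p.version.load(std::memory_order_acquire);
    std::copy_n(p.value, n, dst);                       // possibly torn; validated below
    v1 = p.version.load(std::memory_order_acquire);
  } while (v0 != v1 || (v0 & 1u));                      // retry if a writer interleaved
}
// Caveat: a strictly conforming seqlock needs atomic element accesses to avoid a
// formal data race; this sketch only shows the control flow.
```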
} \
\
if (payload.lck.load(std::memory_order_relaxed) != -1) { \
This looks like it could work, but it is a little bit fuzzy. Can you clarify the intended lifecycle of lck? Which value should follow which?
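For reference, one plausible lifecycle that would be consistent with the checks visible in this diff (spin while nonzero on read, skip entries marked -1). This is an assumption, not something confirmed by the PR author.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical state encoding for Payload::lck, inferred from the `!= 0` and
// `!= -1` checks in the hunks above; the real semantics must come from the author.
enum PayloadState : int8_t {
  kErased = -1,  // payload removed / no longer valid
  kStable = 0,   // safe to read
  kWriting = 1,  // a writer is updating the value in place
};

inline void begin_update(std::atomic<int8_t>& lck) {
  lck.store(kWriting, std::memory_order_release);
}

inline void end_update(std::atomic<int8_t>& lck) {
  lck.store(kStable, std::memory_order_release);
}

inline void mark_erased(std::atomic<int8_t>& lck) {
  lck.store(kErased, std::memory_order_release);
}

inline bool readable(const std::atomic<int8_t>& lck) {
  return lck.load(std::memory_order_acquire) == kStable;
}
```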
This is a draft implementation to remove the global mutex in HashMapBackend. This PR tries to avoid locking the whole hashmap backend while updating or inserting new payloads, because our online service performs writing and reading actions at the same time. We need to ensure reading performance while writing actions exist.

The following are my change ideas:
- Add an atomic lock (lck) to Payload to implement a row lock.
- Add a read-write lock (read_write_lck) to Partition to protect the "insert a new payload" action. Besides, parallel_flat_hash_map ensures the atomicity of adding/removing/updating an entry in the flat hash map.
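Pulling the diff hunks above together, the proposed data-structure shape looks roughly like this. It is a sketch reconstructed from the hunks; members and names not shown in the hunks are assumptions.

```cpp
#include <atomic>
#include <cstdint>
#include <shared_mutex>
#include <parallel_hashmap/phmap.h>

// Sketch of the proposed layout; only the members visible in the diff hunks are
// taken from the PR, the rest is illustrative filler.
template <typename Key>
struct PartitionSketch {
  using ValuePtr = char*;  // assumed alias

  struct Payload {
    uint64_t access_count;
    ValuePtr value;
    std::atomic<int8_t> lck;  // per-row lock / state flag added by the PR
    // Note: std::atomic makes Payload non-copyable/non-movable by default, so the
    // real code presumably defines copy/move semantics for it.
  };

  size_t value_size{};  // assumed field, used by the fetch macro's copy_n

  // Sharded hash map with per-submap std::shared_mutex, as configured in the PR.
  phmap::parallel_flat_hash_map<Key, Payload, phmap::priv::hash_default_hash<Key>,
                                phmap::priv::hash_default_eq<Key>,
                                phmap::priv::Allocator<phmap::priv::Pair<const Key, Payload>>,
                                4, std::shared_mutex>
      entries;

  // Partition-level read/write lock used to guard "insert a new payload".
  std::shared_mutex read_write_lck;
};
```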