support lock-free hashmap backend #436
base: main
Conversation
@@ -115,6 +116,35 @@ class HashMapBackend final : public VolatileBackend<Key, HashMapBackendParams> {
    uint64_t access_count;
  };
  ValuePtr value;
  std::atomic<int8_t> lck;
Technically speaking, I have nothing against this change in general. But we need to think carefully about whether it will break the MultiProcessHashmap. Maybe it is better to split the code.
What is more relevant: even though you use an int8_t here, the memory usage should still increase by 8 bytes per payload due to alignment. Can we un-align the data structure without loss of performance?
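A minimal sketch of the alignment point, using hypothetical stand-in types rather than the PR's real Payload:

```cpp
#include <atomic>
#include <cstdint>
#include <iostream>

// Hypothetical stand-ins for the real HugeCTR types; layout shown for illustration only.
using ValuePtr = char*;

struct PayloadBefore {
  uint64_t access_count;  // 8 bytes
  ValuePtr value;         // 8 bytes, 8-byte aligned
};

struct PayloadAfter {
  uint64_t access_count;    // 8 bytes
  ValuePtr value;           // 8 bytes
  std::atomic<int8_t> lck;  // 1 byte of data ...
};                          // ... plus 7 bytes of tail padding to preserve 8-byte alignment

int main() {
  // On a typical 64-bit platform this prints 16 and 24: the 1-byte flag
  // costs a full 8 bytes per payload because of padding.
  std::cout << sizeof(PayloadBefore) << ' ' << sizeof(PayloadAfter) << '\n';
}
```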
phmap::parallel_flat_hash_map<Key, Payload, phmap::priv::hash_default_hash<Key>,
                              phmap::priv::hash_default_eq<Key>,
                              phmap::priv::Allocator<phmap::priv::Pair<const Key, Payload>>,
                              4, std::shared_mutex> entries;
Yes, it will avoid breaking the updates, but partly for reasons that I wanted to avoid.
What parallel_flat_hash_map does with this configuration is split the hash table into 4 independent pieces. The submap is this thing in their code:

struct Inner : public Lockable

Upon query, each call ends up with a structure like this,

typename Lockable::UniqueLock m;
auto res = this->find_or_prepare_insert(k, m);

which in turn executes:

Inner& inner = sets_[subidx(hashval)];
auto& set = inner.set_;
mutexlock = std::move(typename Lockable::UniqueLock(inner));

So in other words: it will select one of the 4 independent submaps and FULLY lock it, so there is no parallelism possible within that submap. In the very BEST case, when you have 4 keys that subidx hashes to different sub-hashmaps, you can accommodate up to 4 threads. Arguably, that is better than now, but you need to be aware that the worst case still exists. That is the reason why we didn't use their parallelized hashmap implementation yet.
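A simplified, hypothetical sketch of the sharding behavior described above (not phmap's actual code): every access exclusively locks the whole shard its key hashes to, so concurrency is bounded by the shard count and collapses when keys collide on one shard.

```cpp
#include <array>
#include <functional>
#include <mutex>
#include <unordered_map>

// Toy sharded map with assumed names; illustrates only the locking pattern.
template <typename K, typename V, size_t NumShards = 4>
class ShardedMap {
  struct Shard {
    std::mutex mtx;                // one lock per submap
    std::unordered_map<K, V> map;
  };
  std::array<Shard, NumShards> shards_;

  Shard& shard_for(const K& key) {
    return shards_[std::hash<K>{}(key) % NumShards];  // analogous to subidx(hashval)
  }

 public:
  void insert(const K& key, const V& value) {
    Shard& s = shard_for(key);
    std::lock_guard<std::mutex> lock(s.mtx);  // whole shard locked, like Lockable::UniqueLock(inner)
    s.map[key] = value;
  }

  bool find(const K& key, V& out) {
    Shard& s = shard_for(key);
    std::lock_guard<std::mutex> lock(s.mtx);  // readers also serialize within a shard
    auto it = s.map.find(key);
    if (it == s.map.end()) return false;
    out = it->second;
    return true;
  }
};
// Best case: up to NumShards threads proceed in parallel (keys on distinct shards).
// Worst case: all keys hash to the same shard and every access serializes.
```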
@yingcanw We can take over this part of the change without much issue. In my new streamlined hashmap we will use a similar method to speed things up.
@@ -55,7 +56,9 @@ namespace HugeCTR {
  const Payload& payload{it->second}; \
  \
  /* Stash pointer and reference in map. */ \
  std::unique_lock lock(part.read_write_lck); \
@yingcanw What worries me about this is mostly the strain under heavy load. We would need to measure whether it kills insert speed. The new hashmap will fix this one.
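One rough way to quantify that concern (a hypothetical micro-benchmark, not part of the PR): measure insert throughput on a shared_mutex while reader threads repeatedly take the same lock, once with unique_lock (as on the changed fetch path) and once with shared_lock, and compare.

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <unordered_map>
#include <vector>

int main() {
  std::shared_mutex read_write_lck;           // stands in for the partition lock
  std::unordered_map<uint64_t, uint64_t> part;
  std::atomic<bool> stop{false};

  // Reader threads: with the PR, the fetch path takes a unique_lock here.
  std::vector<std::thread> readers;
  for (int i = 0; i < 8; ++i) {
    readers.emplace_back([&] {
      while (!stop.load(std::memory_order_relaxed)) {
        std::unique_lock lock(read_write_lck);  // swap for std::shared_lock to compare
        (void)part.count(42);
      }
    });
  }

  // Writer: measure insert throughput under that contention.
  const auto t0 = std::chrono::steady_clock::now();
  for (uint64_t k = 0; k < 1'000'000; ++k) {
    std::unique_lock lock(read_write_lck);
    part.emplace(k, k);
  }
  const auto t1 = std::chrono::steady_clock::now();
  stop.store(true);
  for (auto& t : readers) t.join();

  std::cout << "inserts/s: "
            << 1e6 / std::chrono::duration<double>(t1 - t0).count() << "\n";
}
```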
@@ -36,7 +36,7 @@ HashMapBackend<Key>::HashMapBackend(const HashMapBackendParams& params) : Base(p

template <typename Key>
size_t HashMapBackend<Key>::size(const std::string& table_name) const {
  const std::shared_lock lock(read_write_guard_);
The changes below are fine under the assumption that we take over the above changes.
@@ -91,6 +94,7 @@ namespace HugeCTR {
  \
  /* Race-conditions here are deliberately ignored because insignificant in practice. */ \
  __VA_ARGS__; \
  while (payload.lck.load(std::memory_order_relaxed) != 0); \
  std::copy_n(payload.value, part.value_size, &values[(k - keys) * value_stride]); \
I think there is a logical error here. What if payload.value is changed rapidly? The lck doesn't guarantee anything about that: during the copy_n, the payload value might change or become invalid.
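A hedged illustration of that gap (a sketch, not the PR's code): a relaxed spin-wait followed by an unguarded copy can read data that a writer modifies mid-copy. A seqlock-style version counter with acquire/release ordering is one way to detect and retry a torn read; all names below are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical payload guarded by a version counter (seqlock pattern).
struct GuardedPayload {
  std::atomic<uint32_t> version{0};  // even = stable, odd = writer in progress
  float value[16];
};

// Writer: bump to odd, mutate, bump back to even.
void write_payload(GuardedPayload& p, const float* src, size_t n) {
  p.version.fetch_add(1, std::memory_order_acq_rel);   // now odd
  std::copy_n(src, n, p.value);
  p.version.fetch_add(1, std::memory_order_release);   // even again
}

// Reader: retry until a copy completes with an unchanged, even version.
void read_payload(const GuardedPayload& p, float* dst, size_t n) {
  uint32_t v0, v1;
  do {
    v0 = p.version.load(std::memory_order_acquire);
    std::copy_n(p.value, n, dst);                       // possibly torn; validated below
    v1 = p.version.load(std::memory_order_acquire);
  } while (v0 != v1 || (v0 & 1u));                      // retry if a writer interleaved
}
// Caveat: a strictly conforming seqlock needs atomic element accesses to avoid a
// formal data race; this sketch only shows the control flow.
```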
} \
\
if (payload.lck.load(std::memory_order_relaxed) != -1) { \
This looks like it could work, but it is a little bit fuzzy. Can you clarify the intended lifecycle of lck? Which value should follow which?
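For reference, one plausible lifecycle that would be consistent with the checks visible in this diff (spin while nonzero on read, skip entries marked -1). This is an assumption, not something confirmed by the PR author.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical state encoding for Payload::lck, inferred from the `!= 0` and
// `!= -1` checks in the hunks above; the real semantics must come from the author.
enum PayloadState : int8_t {
  kErased = -1,  // payload removed / no longer valid
  kStable = 0,   // safe to read
  kWriting = 1,  // a writer is updating the value in place
};

inline void begin_update(std::atomic<int8_t>& lck) {
  lck.store(kWriting, std::memory_order_release);
}

inline void end_update(std::atomic<int8_t>& lck) {
  lck.store(kStable, std::memory_order_release);
}

inline void mark_erased(std::atomic<int8_t>& lck) {
  lck.store(kErased, std::memory_order_release);
}

inline bool readable(const std::atomic<int8_t>& lck) {
  return lck.load(std::memory_order_acquire) == kStable;
}
```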
This is a draft implementation to remove the global mutex in HashMapBackend. This PR tries to avoid locking the whole hashmap backend while updating or inserting new payloads, because our online service performs writing and reading actions at the same time. We need to ensure reading performance while writing actions exist.

The following are my change ideas:
- Add an atomic lock (lck) to Payload to implement a row lock.
- Add a read-write lock (read_write_lck) to Partition to protect the "insert a new payload" action. Besides, parallel_flat_hash_map ensures the atomicity of adding/removing/updating an entry in the flat hash map.
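Pulling the diff hunks above together, the proposed data-structure shape looks roughly like this. It is a sketch reconstructed from the hunks; members and names not shown in the hunks are assumptions.

```cpp
#include <atomic>
#include <cstdint>
#include <shared_mutex>
#include <parallel_hashmap/phmap.h>

// Sketch of the proposed layout; only the members visible in the diff hunks are
// taken from the PR, the rest is illustrative filler.
template <typename Key>
struct PartitionSketch {
  using ValuePtr = char*;  // assumed alias

  struct Payload {
    uint64_t access_count;
    ValuePtr value;
    std::atomic<int8_t> lck;  // per-row lock / state flag added by the PR
    // Note: std::atomic makes Payload non-copyable/non-movable by default, so the
    // real code presumably defines copy/move semantics for it.
  };

  size_t value_size{};  // assumed field, used by the fetch macro's copy_n

  // Sharded hash map with per-submap std::shared_mutex, as configured in the PR.
  phmap::parallel_flat_hash_map<Key, Payload, phmap::priv::hash_default_hash<Key>,
                                phmap::priv::hash_default_eq<Key>,
                                phmap::priv::Allocator<phmap::priv::Pair<const Key, Payload>>,
                                4, std::shared_mutex>
      entries;

  // Partition-level read/write lock used to guard "insert a new payload".
  std::shared_mutex read_write_lck;
};
```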