Don't generate a mutex for each Locked<T>, share one per object #10117

Al2Klimov · 2024-08-20T15:11:54Z

This reduces RAM usage per object by sizeof(mutex)*(FIELDS-1).

The atomic_load(const std::shared_ptr<T>*) idea would, without doubt, also consume less RAM. However, with a fixed number of mutexes, it would scale worse the more objects you have. Neither one SpinMutex per property nor one std::mutex per object suffers from this limitation.
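
As a rough illustration of the arithmetic behind the sizeof(mutex)*(FIELDS-1) claim, here is a minimal sketch. The function names are hypothetical and not part of Icinga 2; padding and alignment inside the real generated classes are ignored.

```cpp
#include <cstddef>
#include <mutex>

// Bytes spent on mutexes in one object that has `fields` Locked<T>-style fields.
// `sharedPerObject` selects one mutex per object instead of one per field.
constexpr std::size_t MutexBytesPerObject(std::size_t fields, bool sharedPerObject)
{
	return sizeof(std::mutex) * (sharedPerObject ? 1 : fields);
}

// Saving per object: sizeof(mutex) * (FIELDS - 1), as stated in the description.
constexpr std::size_t SavedBytesPerObject(std::size_t fields)
{
	return MutexBytesPerObject(fields, false) - MutexBytesPerObject(fields, true);
}

static_assert(SavedBytesPerObject(10) == sizeof(std::mutex) * 9, "nine mutexes saved for ten fields");
static_assert(SavedBytesPerObject(1) == 0, "nothing to save with a single field");
```

How much this matters for the whole process depends on the number of objects and on how many of their fields are actually Locked<T>, which this sketch does not measure.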

julianbrost · 2024-08-26T12:24:47Z

This misses part of what #10113 asked for:

Figure out how much of an effect this has on the total memory use of Icinga 2.

We have a vague "Icinga 2 uses more memory in newer versions", so how much does this affect the total memory use of the whole process? Might this be the answer or is it just a small puzzle piece?

Also, what made you choose adding that m_FieldsMutex and passing it around over what I suggested in #10113? I'm not a big fan of how random code (see the changes to lib/icingadb/) now has to access some internal attribute generated by mkclass. A nice aspect of the current implementation is that it's very easy to reason that it won't deadlock, all the magic happens here:

icinga2/lib/base/atomic.hpp

Lines 52 to 73 in 585b357

    
    template<typename T>
    class Locked
    {
    public:
    	inline T load() const
    	{
    		std::unique_lock<std::mutex> lock(m_Mutex);

    		return m_Value;
    	}

    	inline void store(T desired)
    	{
    		std::unique_lock<std::mutex> lock(m_Mutex);

    		m_Value = std::move(desired);
    	}

    private:
    	mutable std::mutex m_Mutex;
    	T m_Value;
    };

Whereas with this PR, exposing the underlying mutex, this gets more spread out in the code base.

Al2Klimov · 2024-08-26T15:44:52Z

Also, what made you choose adding that m_FieldsMutex and passing it around over what I suggested in #10113?

sharing the mutex between objects, this would reduce the memory requirements

The more fields, and especially the more objects, share one mutex, the slower they can proceed. I prefer at least one mutex per object.

I'm not a big fan of how random code (see the changes to lib/icingadb/) now has to access some internal attribute generated by mkclass. A nice aspect of the current implementation is that it's very easy to reason that it won't deadlock, all the magic happens here:

Whereas with this PR, exposing the underlying mutex, this gets more spread out in the code base.

You're right. One wrong move on the mutex and... 🙈

lib/base/atomic.hpp

julianbrost · 2024-09-26T08:03:02Z

Please use a search engine of your choice to search for the keywords "userspace" and "spinlock". Seems like the majority isn't too positive about that idea. And as already said, Icinga 2 already had a negative experience with spinlocks itself. What makes you believe that now is a good time to use a spinlock in userspace?

By the way, Linus Torvalds' rant post (the one linked from cppreference.com) even has a follow-up from him on yield.

Apart from that, the current version of the PR seems to restrict the use of that suggested spinlock to just intrusive_ptr, so each string attribute would still use its own std::mutex. So the memory use will be somewhere in the middle (I haven't looked into the distribution of attribute types), but why do it that way if it can be done better?

The atomic_load(const std::shared_ptr<T>*) idea would, without doubt, also consume less RAM. However, with a fixed number of mutexes, it would scale worse the more objects you have.

However, while there may be a large number of objects, the number of CPU cores and thereby threads we start is typically much smaller. And that actually limits the total contention. There won't be a thread for each single object trying to lock it.

Al2Klimov

However, while there may be a large number of objects, the number of CPU cores and thereby threads we start is typically much smaller. And that actually limits the total contention. There won't be a thread for each single object trying to lock it.

This sounds nice in theory, but I'd prefer a much more predictable solution, especially for large setups. I mean my one-mutex-per-object idea.

Al2Klimov · 2024-09-26T10:55:36Z

lib/base/atomic.hpp

+/**
+ * Wraps std::mutex, so that only Locked<T> can (un)lock it.
+ *
+ * The latter tiny lock scope is enforced this way to prevent deadlocks while passing around mutexes.
+ *
+ * @ingroup base
+ */
+class LockedMutex
+{
+	template<class T>
+	friend class Locked;
+
+private:
+	std::mutex m_Mutex;
+};
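
To make the intent of LockedMutex concrete, here is a minimal sketch of how a Locked<T> could lock a mutex owned by the surrounding object. This is not the code from the PR: the load/store signatures taking a LockedMutex& and the DemoObject class are assumptions made for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <utility>

// Same shape as in the diff above: only Locked<T> may touch the std::mutex.
class LockedMutex
{
	template<class T>
	friend class Locked;

private:
	std::mutex m_Mutex;
};

// Hypothetical Locked<T> variant: the mutex lives in the owner and is passed per call.
template<class T>
class Locked
{
public:
	T load(LockedMutex& mtx) const
	{
		std::unique_lock<std::mutex> lock(mtx.m_Mutex);
		return m_Value;
	}

	void store(LockedMutex& mtx, T desired)
	{
		std::unique_lock<std::mutex> lock(mtx.m_Mutex);
		m_Value = std::move(desired);
	}

private:
	T m_Value;
};

// Hypothetical object whose two fields share a single mutex.
class DemoObject
{
public:
	std::string GetName() const { return m_Name.load(m_FieldsMutex); }
	void SetName(std::string v) { m_Name.store(m_FieldsMutex, std::move(v)); }
	std::string GetZone() const { return m_Zone.load(m_FieldsMutex); }
	void SetZone(std::string v) { m_Zone.store(m_FieldsMutex, std::move(v)); }

private:
	mutable LockedMutex m_FieldsMutex;
	Locked<std::string> m_Name;
	Locked<std::string> m_Zone;
};

// Usage: code outside Locked<T> can pass the mutex around but cannot lock it.
inline void Demo()
{
	DemoObject o;
	o.SetName("host1");
	o.SetZone("master");
	assert(o.GetName() == "host1" && o.GetZone() == "master");
}
```

Because the lock is only ever held inside load() and store(), the lock scope stays as small as in the original Locked<T>, which is the property the wrapper is meant to preserve.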


I'm not a big fan of how random code (see the changes to lib/icingadb/) now has to access some internal attribute generated by mkclass. A nice aspect of the current implementation is that it's very easy to reason that it won't deadlock, all the magic happens here:

Whereas with this PR, exposing the underlying mutex, this gets more spread out in the code base.

I've fixed it. :)

Al2Klimov · 2024-10-23T14:05:04Z

tools/mkclass/classcompiler.cpp

+
+	if (klass.Parent.empty()) {
+		m_Header << "protected:" << std::endl
+			<< "\tmutable LockedMutex m_FieldsMutex;" << std::endl << std::endl;


And the cool "side" effect here is that the number of mutexes scales linearly with the number of objects. So, in contrast to a central array of mutexes created at reload time, you're guaranteed never to have too few mutexes.

julianbrost · 2024-11-04

Why do you think there would be a problem with "too few mutexes"? Like you'll have a more or less fixed number of threads started by Icinga 2, so that limits the concurrency and would be a good candidate for sizing the number of mutexes in a std::atomic/libatomic-like implementation.

I'm not opposed to the idea of using one mutex per object, however, the current PR introduces a quite strange interface with AtomicPseudoLocked where you have to specify a mutex which then is ignored and has to be specified in the custom get {{{ }}} implementations. What if mkclass would just emit the required locking code in the relevant getter methods instead? Like depending on the type, it will either be an atomic load or a lock + non-atomic load + unlock. (And similar for the setters of course.)
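
A minimal sketch of what such generated accessors could look like, assuming one private mutex per object. The class and member names are made up for illustration and this is not actual mkclass output.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical shape of a generated class; all names are illustrative only.
class DemoObject
{
public:
	// Trivially copyable field: plain atomic load/store, no mutex involved.
	int GetInterval() const { return m_Interval.load(); }
	void SetInterval(int v) { m_Interval.store(v); }

	// Non-trivial field: lock + non-atomic load + unlock (via the guard's destructor).
	std::string GetName() const
	{
		std::lock_guard<std::mutex> lock(m_FieldsMutex);
		return m_Name;
	}

	void SetName(std::string v)
	{
		std::lock_guard<std::mutex> lock(m_FieldsMutex);
		m_Name = std::move(v);
	}

private:
	mutable std::mutex m_FieldsMutex; // one mutex per object, never exposed
	std::atomic<int> m_Interval{0};
	std::string m_Name;
};

// Usage: callers see ordinary getters and setters and never touch the mutex.
inline void Demo()
{
	DemoObject o;
	o.SetInterval(60);
	o.SetName("host1");
	assert(o.GetInterval() == 60 && o.GetName() == "host1");
}
```

With this shape, the choice between atomic and locked access is made per field at generation time, so no caller has to pass a mutex that is then ignored.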

Nonetheless, I still don't think that would be necessary and having a global array of mutexes would suffice and keep the implementation simpler.

Al2Klimov · 2024-11-04

Why do you think there would be a problem with "too few mutexes"?

No. I don't. I don't know.

I don't know how many mutexes are enough – O(?).
I don't know the hashing algorithm mapping tons of Locked<> in memory to the O(?) mutexes in your scenario and how fair it "schedules".
And beyond the things which I know and the (above) ones I know that I don't know, there are things I don't know that I don't know.

The more radically you change a running system in an amount of time, the more likely you hit a (delayed) black swan. Like JSON-RPC crash one (v2.11.3), JSON-RPC crash two (v2.11.4), IcingaDB(?) OOM (ref/NC/820479) (v2.?) ...

I'm not opposed to the idea of using one mutex per object,

👍

Btw., I'm not opposed to yours either.
I just lack the knowledge to implement it by myself properly enough to sleep well afterwards.

AtomicPseudoLocked where you have to specify a mutex which then is ignored and has to be specified in the custom get {{{ }}} implementations

No, yes and no.
AtomicPseudoLocked<T> just extends std::atomic<T> with two additional LockedMutex-compliant methods you can call if you wish. They indeed ignore the mutex (God bless compiler optimization!) and call their (public) equivalents which don't take one. Custom get {{{ }}} implementations can also call the latter, not specifying any mutex, but all .ti files in the current diff override get {{{ }}} for strings. So they need the mutex and they actually lock it.
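
For readers who have not opened the diff, the following sketch shows one way such a wrapper could be written. It is reconstructed from the description above only; the exact signatures in the PR may differ, and the LockedMutex here is a simplified stand-in.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Simplified stand-in for the LockedMutex from the diff (friend declaration omitted).
class LockedMutex
{
	std::mutex m_Mutex;
};

// std::atomic<T> plus two overloads that accept a mutex and ignore it,
// so callers can treat atomic and locked fields uniformly.
template<class T>
class AtomicPseudoLocked : public std::atomic<T>
{
public:
	using std::atomic<T>::atomic; // inherit constructors
	using std::atomic<T>::load;   // keep the mutex-free overloads visible
	using std::atomic<T>::store;

	T load(LockedMutex&) const { return this->load(); }
	void store(LockedMutex&, T desired) { this->store(desired); }
};

// Usage: both call styles end up in the same atomic operation.
inline void Demo()
{
	AtomicPseudoLocked<int> v(1);
	LockedMutex unused;
	v.store(unused, 5);
	assert(v.load(unused) == 5 && v.load() == 5);
}
```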

What if mkclass would just emit the required locking code in the relevant getter methods instead? Like depending on the type, it will either be an atomic load or a lock + non-atomic load + unlock.

mkclass is surely a good tool for stuff that would be significantly(?) more difficult in vanilla C++.
But from my experience I'd prefer "native solutions" (e.g. vanilla C++) if they're easy enough to apply.

julianbrost · 2024-11-13

And I went down the rabbit hole quite a bit. The GCC implementation for the std::atomic_load(std::shared_ptr<T>*) and related is actually quite trivial:

These functions use a RAII-style locker called _Sp_locker (source), which uses some hash function (I didn't dig deeper into the hash function itself) modulo __gnu_internal::mask to get a mutex index (source), which then selects a mutex from a static array of mutexes (source). That mask is actually quite small (0xf) and results in a pool size of 16 (source).
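
The general pattern described here can be sketched as follows. This mirrors the idea only and is not libstdc++'s code: the names PoolLocker, PoolIndex, PooledLoad and PooledStore are hypothetical, and std::hash stands in for whatever hash GCC actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>

// Fixed pool of 16 mutexes, matching the pool size described above.
constexpr std::size_t kPoolSize = 16;
std::mutex g_Pool[kPoolSize];

// Maps an address to one of the pool slots.
inline std::size_t PoolIndex(const void* addr)
{
	return std::hash<const void*>()(addr) % kPoolSize;
}

// RAII locker: holds the pool mutex belonging to `addr` for its lifetime.
class PoolLocker
{
public:
	explicit PoolLocker(const void* addr) : m_Lock(g_Pool[PoolIndex(addr)]) {}

private:
	std::lock_guard<std::mutex> m_Lock;
};

template<class T>
T PooledLoad(const T* p)
{
	PoolLocker lock(p);
	return *p;
}

template<class T>
void PooledStore(T* p, T desired)
{
	PoolLocker lock(p);
	*p = std::move(desired);
}

// Usage: the protected value itself carries no mutex at all.
inline void Demo()
{
	int x = 1;
	PooledStore(&x, 7);
	assert(PooledLoad(&x) == 7);
}
```

Unrelated values that hash to the same slot contend for the same mutex, which is the trade-off debated in this thread.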

To be honest, that wasn't really what I was expecting, and I knew I had found something different when initially suggesting this. Luckily I found it again: it was this answer on Stack Overflow to a somewhat related but different question, namely how std::atomic<T> works when T is too large for atomic instructions. The part of the answer relating to clang++ (in particular the lock_for_pointer(void*) function) looks much more like what I was expecting.

For completeness: std::atomic<std::shared_ptr<T>> is a new thing in C++20 and GCC also has an implementation for it which is way different from the one for std::atomic_load(), though I didn't look into it in detail as it doesn't seem to do what we would need for our implementation.

Al2Klimov · 2024-11-13

the current PR introduces a quite strange interface with AtomicPseudoLocked where you have to specify a mutex which then is ignored

I've fixed it* – everything's Locked<>. (Involuntarily*, however, because I consider this solution development, not a beauty contest. Especially since you admitted yourself you can't summarize what's so "ugly" and why this "ugliness" seems to be a problem to you.) Anyway:

I've got why that std::atomic<std::shared_ptr<T>> implementation isn't problematic: because it only affects std::atomic<std::shared_ptr<T>>. Well-implemented copying of std::shared_ptr (or at least boost::intrusive_ptr) IMAO involves few enough instructions to protect it via just a spinlock (with yield). So copying one std::shared_ptr doesn't block the next one for a noticeable time, and a considerable number of pointers may share one mutex. String plays in another league with its malloc(3) and memcpy(3).
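
For reference, a spinlock with yield of the kind mentioned here could look like the sketch below. SpinMutex and CopyUnderLock are hypothetical names, and whether a userspace spinlock is a good idea at all is exactly what is disputed earlier in this thread, so this only shows the mechanism.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical SpinMutex; provides lock()/unlock() so std::lock_guard can use it.
class SpinMutex
{
public:
	void lock()
	{
		// Spin until the flag was previously clear, yielding between attempts.
		while (m_Flag.test_and_set(std::memory_order_acquire)) {
			std::this_thread::yield();
		}
	}

	void unlock()
	{
		m_Flag.clear(std::memory_order_release);
	}

private:
	std::atomic_flag m_Flag = ATOMIC_FLAG_INIT;
};

// Copies a shared_ptr while holding the spinlock; the critical section is
// just the reference count increment done by the copy.
template<class T>
std::shared_ptr<T> CopyUnderLock(SpinMutex& mtx, const std::shared_ptr<T>& src)
{
	std::lock_guard<SpinMutex> lock(mtx);
	return src;
}

// Usage: single-threaded demonstration of the locking protocol.
inline void Demo()
{
	SpinMutex mtx;
	auto p = std::make_shared<int>(42);
	auto copy = CopyUnderLock(mtx, p);
	assert(*copy == 42 && p.use_count() == 2);
}
```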

*) If you still consider your approach better, here is my "counter"-suggestion:

AtomicOrLocked as such stays (in contrast to current PR state)

Locked<> uses the general idea of yours/C++, BUT:

(Don't fall from your chair!) There's a pool of std::thread::hardware_concurrency() * 64 mutexes. This is enough for sure (and just 4 MB on an M3 Mac with 1024 cores)

Which mutex is used for x is calculated this way: std::hash<decltype(x)*>()(&x) % (std::thread::hardware_concurrency() * 64)

As strings are stored in the heap anyway, and to reduce lock time, Locked<String> stores internally not a String, but a Shared<struct{size_t,char[1]}>::Ptr

julianbrost · 2024-11-27

I've fixed it* – everything's Locked<>.

So you're not using any atomic operations for attribute accesses anymore but always lock a mutex instead? That sounds like a strange thing to do without any performance considerations, especially given your worries about "not enough mutexes".

*) If you still consider your approach better, here is my "counter"-suggestion:

I mean basically it's the fixed-size shared mutex pool suggestion from the beginning, just with some details filled in.

There's a pool of std::thread::hardware_concurrency() * 64 mutexes. This is enough for sure

Probably enough, that even sounds a bit excessive. Was 64 chosen by a fair dice roll or is there more consideration behind that number?

on an M3 Mac with 1024 cores

Sounds like an amazing machine! Where did you get that?

Shared<struct{size_t,char[1]}>::Ptr

I'm not really sure what that's supposed to do. Especially that struct inside instead of simply a string.

Al2Klimov · 2024-11-27

I've fixed it* – everything's Locked<>.

So you're not using any atomic operations for attribute accesses anymore but always lock a mutex instead? That sounds like a strange thing to do without any performance considerations, especially given your worries about "not enough mutexes".

I prefer atomic where possible. But you considered ignoring the mutex in case of atomic too ugly.

That I prefer functionality doesn't mean I don't make compromises to meet your code beauty standards.

*) If you still consider your approach better, here is my "counter"-suggestion:

I mean basically it's the fixed-size shared mutex pool suggestion from the beginning, just with some details filled in.

Sure. "Locked<> uses the general idea of yours (...)"

There's a pool of std::thread::hardware_concurrency() * 64 mutexes. This is enough for sure

Probably enough, that even sounds a bit excessive. Was 64 chosen by a fair dice roll or is there more consideration behind that number?

Yes, 64 was primarily chosen because it's a cool number, but also:

I'm glad to hear your feedback because "a bit excessive" is definitely enough.

on an M3 Mac with 1024 cores

Sounds like an amazing machine! Where did you get that?

(I'm an Apple shareholder, you know... 😎 /s)

Our senior consultant @lbetz confirmed that he hasn't seen any customer machine with 1024 cores, yet. And M3 has the larger std::mutex. Now, if you combine those maximums, you'll get my 4 MB from above. For a machine of that (theoretical) size I consider that not excessive.

But I'm open to a less cool sounding power of two at your option.
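
The 4 MB figure and the index formula from the counter-suggestion can be checked with a small sketch. The mutex sizes used here are assumptions (64 bytes for std::mutex on macOS, 40 bytes with glibc on x86-64), and PoolBytes, PoolSize and MutexIndexFor are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>

// Total bytes for a pool of cores * perCore mutexes of the given size.
constexpr std::size_t PoolBytes(std::size_t cores, std::size_t perCore, std::size_t mutexSize)
{
	return cores * perCore * mutexSize;
}

// 1024 cores * 64 mutexes per core * 64 bytes per mutex = 4 MiB.
static_assert(PoolBytes(1024, 64, 64) == 4u * 1024 * 1024, "matches the 4 MB quoted above");
// The same pool with an assumed 40-byte mutex: 2.5 MiB.
static_assert(PoolBytes(1024, 64, 40) == 2621440, "2.5 MiB");

inline std::size_t PoolSize()
{
	// hardware_concurrency() may return 0 if the value is unknown;
	// clamp to 1 so the modulo below never divides by zero.
	std::size_t cores = std::max<std::size_t>(1, std::thread::hardware_concurrency());
	return cores * 64;
}

// Index formula from the suggestion: hash of the address modulo the pool size.
template<class T>
std::size_t MutexIndexFor(const T& x)
{
	return std::hash<const T*>()(&x) % PoolSize();
}

// Usage: any object maps to a valid slot of the pool.
inline void Demo()
{
	int x = 0;
	assert(MutexIndexFor(x) < PoolSize());
}
```

The clamp is an addition of this sketch: the formula as written in the thread would divide by zero on a platform where hardware_concurrency() reports 0.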

Shared<struct{size_t,char[1]}>::Ptr

I'm not really sure what that's supposed to do. Especially that struct inside instead of simply a string.

It saves you one malloc(3), especially in this frequently used code path. Any kind of string is fixed size with its payload outsourced to the heap. This struct would unite everything of a string in one "data block".

Al2Klimov requested a review from julianbrost August 20, 2024 15:11

cla-bot bot added the cla/signed label Aug 20, 2024

icinga-probot bot added the core/evaluate Analyse/Evaluate features and problems label Aug 20, 2024

Al2Klimov force-pushed the AtomicOrLocked-mutexes branch from 250635f to b7611ed Compare August 26, 2024 15:45

Al2Klimov changed the title ~~Don't generate a mutex for each Locked<T>, share one per object~~ Locked<T>: optimistically use SpinMutex to consume less RAM Aug 26, 2024

julianbrost requested changes Aug 29, 2024

Al2Klimov force-pushed the AtomicOrLocked-mutexes branch from b7611ed to 37b4d10 Compare September 3, 2024 14:03

Al2Klimov changed the title ~~Locked<T>: optimistically use SpinMutex to consume less RAM~~ Locked<T>: optimistically use SpinMutex for intrusive_ptr to consume less RAM Sep 24, 2024

Al2Klimov force-pushed the AtomicOrLocked-mutexes branch from 37b4d10 to be761c1 Compare September 24, 2024 10:23

Al2Klimov changed the title ~~Locked<T>: optimistically use SpinMutex for intrusive_ptr to consume less RAM~~ Don't generate a mutex for each Locked<T>, share one per object Sep 26, 2024

Al2Klimov force-pushed the AtomicOrLocked-mutexes branch from be761c1 to d33f5ef Compare September 26, 2024 10:54

Al2Klimov mentioned this pull request Sep 26, 2024

CpuBoundWork#CpuBoundWork(): don't spin on atomic int to acquire slot #9990

Open

Don't generate a mutex for each Locked<T>, share one per object

98cfe49

This reduces RAM usage per object by sizeof(mutex)*(FIELDS-1).

Al2Klimov force-pushed the AtomicOrLocked-mutexes branch from d33f5ef to 98cfe49 Compare November 13, 2024 16:58

Al2Klimov mentioned this pull request Nov 14, 2024

Evaluate memory overhead of icinga::Locked<...> in object attributes #10113

Open

Al2Klimov mentioned this pull request Nov 27, 2024

Add a dedicated method for disconnecting TLS connections #10005

Open

Al2Klimov commented Aug 20, 2024 • edited

julianbrost commented Aug 26, 2024

Al2Klimov commented Aug 26, 2024

julianbrost commented Sep 26, 2024

Al2Klimov left a comment

Al2Klimov Sep 26, 2024

Al2Klimov Oct 23, 2024

julianbrost Nov 4, 2024

Al2Klimov Nov 4, 2024

julianbrost Nov 13, 2024 • edited

Al2Klimov Nov 13, 2024

julianbrost Nov 27, 2024

Al2Klimov Nov 27, 2024
