Memory leak in synchrotron worker #17791
Comments
Hi @kosmos, thanks for filing an issue. Can I just double-check that you're using |
@anoadragon453 I'm using the official docker image, and this feature ( |
@kosmos Did you upgrade from 1.113.0 before being on 1.114.0? I don't see any changes that are particularly related to caches in 1.114.0. Around the time of the memory ballooning, are you seeing lots of initial syncs at once? Those requests are known to be memory intensive, especially for users with a large number of rooms. Do you have monitoring set up for your Synapse instance? If so, could you have a look at the |
@anoadragon453 Our update history was as follows: 113->114->116. I will write to you in private messages. |
@kosmos and I exchanged some DMs to try and get to the bottom of this. The conclusion was that @kosmos does have jemalloc enabled in their deployment, yet the memory-based cache eviction doesn't seem to be kicking in. They had no mention of this log line in their logs: synapse/synapse/util/caches/lrucache.py Line 151 in 05576f0
This is odd, as the memory of the process ( However, only part of the total memory allocations of the application are being carried out by This resulted in the cache autotuning not kicking in. The real question is what is actually taking up all that memory, and why isn't it being allocated through jemalloc? Native code (Synapse contains Rust code) won't use jemalloc to allocate memory, so this could be one explanation. Native code in imports that use C/C++ extensions could also be a contender. Short of getting out a Python memory profiler, however, it's hard to say. As a workaround, I'd recommend reducing your |
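As a rough illustration of the "Python memory profiler" route mentioned above, here is a minimal sketch using the standard-library `tracemalloc` module. It is not taken from Synapse; the helper name `top_growth` and the stand-in workload are assumptions made for the example. Note that `tracemalloc` only sees allocations that go through Python's own allocator APIs, so memory allocated directly by Rust or C/C++ code would not show up here. If these numbers stay flat while the process RSS keeps growing, that would point towards native allocations.

```python
import tracemalloc


def top_growth(before, after, limit=10):
    """Return the source lines whose tracked allocations grew between two snapshots.

    `top_growth` is a hypothetical helper for this sketch, not a Synapse function.
    """
    stats = after.compare_to(before, "lineno")
    return [s for s in stats if s.size_diff > 0][:limit]


tracemalloc.start()
before = tracemalloc.take_snapshot()

# Stand-in for real work; in practice the snapshots would be taken some
# time apart while the worker is serving traffic.
leaky = [bytes(1024) for _ in range(1000)]

after = tracemalloc.take_snapshot()
for stat in top_growth(before, after, limit=3):
    print(stat)
tracemalloc.stop()
```

Taking snapshots a few hours apart and comparing them is usually more informative than a single snapshot, since steady-state cache contents cancel out in the diff.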
This approach won't work because the jemalloc memory graph is not at its maximum at the moment the physical memory grows. That is, there is no criterion by which we can tell when to clear the memory. In addition, if it's not the memory allocated by jemalloc that is growing, then clearing the memory allocated by jemalloc won't save me. |
Good point, yes. I think the next step is tracking down what exactly is taking up the memory in your homeserver when it OOMs, and from there figuring out why it's not being allocated by jemalloc. Alternatively, having the ability to discard caches based on total application memory rather than only jemalloc-allocated memory could work. |
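To make the "total application memory" idea concrete, here is a minimal sketch of how such a check might look on Linux. This is not how Synapse's cache autotuning is implemented; the function names and the threshold policy are hypothetical. It reads the resident set size from `/proc/self/statm` instead of asking jemalloc for its statistics.

```python
import os


def parse_statm_rss(statm_text: str, page_size: int) -> int:
    """Resident set size in bytes, given the contents of /proc/<pid>/statm."""
    # The second whitespace-separated field is the number of resident pages.
    return int(statm_text.split()[1]) * page_size


def current_rss_bytes() -> int:
    """RSS of the current process (Linux only, relies on procfs)."""
    with open("/proc/self/statm") as f:
        return parse_statm_rss(f.read(), os.sysconf("SC_PAGE_SIZE"))


def should_evict(rss_bytes: int, max_bytes: int) -> bool:
    """Hypothetical policy: start evicting once total process memory hits the limit."""
    return rss_bytes >= max_bytes
```

One caveat with this design: evicting cache entries only helps if the freed memory is actually returned to the operating system, and it does nothing if the growth comes from native code that the caches do not own.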
I still feel that there is a memory leak somewhere in the native libs, and it is important to find it. |
Hi, I think we may suffer from the same problem. We run Synapse 1.113 on a deployment with 2000 daily active users and memory usage grows indefinitely (?). We use a single process, no workers, with the docker image and the following configuration:
We had the same behavior without the cache_autotuning part (we added it as part of our experiments to solve this, without improvement). The memory drop on 04/10 is a reboot of the container. I attach a few graphs. We observe that RAM grows and the number of objects stays roughly the same day after day, but GC time becomes longer and longer. At some point, Synapse becomes quite unresponsive, probably because of this, and we need to restart it. Cheers |
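For anyone who wants to check the "GC time becomes longer" observation outside of a metrics dashboard, the following is a small standalone sketch that times collections with the standard-library `gc.callbacks` hook. The `GcTimer` class is an assumption made for this example and is not part of Synapse.

```python
import gc
import time


class GcTimer:
    """Accumulates wall-clock time spent in garbage collections."""

    def __init__(self):
        self.total_seconds = 0.0
        self.collections = 0
        self._started = None

    def __call__(self, phase, info):
        # CPython calls each registered callback with phase "start" before a
        # collection and "stop" after it.
        if phase == "start":
            self._started = time.perf_counter()
        elif phase == "stop" and self._started is not None:
            self.total_seconds += time.perf_counter() - self._started
            self.collections += 1
            self._started = None


timer = GcTimer()
gc.callbacks.append(timer)
gc.collect()  # force one collection so there is something to measure
gc.callbacks.remove(timer)
print(timer.collections, timer.total_seconds)
```

If the time per collection keeps rising while the object count stays flat, that is at least consistent with the graphs described above, though it does not by itself say where the memory is going.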
Description
I have the following synchrotron worker configuration:
And the following cache settings for this worker:
All other worker types work without problems, but there is a memory leak in the synchrotrons, which eventually exhausts all available memory.
It seems that the cache_autotuning settings are not working. The environment variable PYTHONMALLOC=malloc is set at the operating system level.
As far as I can tell, the problem appeared after updating Synapse to 1.114 and is still present in 1.116.
Steps to reproduce
To reproduce the problem, you need a homeserver with a heavy load and dedicated synchrotron workers.
Homeserver
Synapse 1.116.0
Synapse Version
Synapse 1.116.0
Installation Method
Docker (matrixdotorg/synapse)
Database
PostgreSQL
Workers
Multiple workers
Platform
Configuration
No response
Relevant log output
Anything else that would be useful to know?
No response