MySQL connection timeouts and pod restarts #74
Comments
We just tried testing a hypothesis that some connections are leaked and are perhaps blocking the connection pool. We set our MySQL instance to auto-terminate idle connections after 30s; while the number of concurrent connections dropped, the problem is still there.
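For reference, a minimal sketch (not from the thread) of how the idle-connection timeout mentioned above can be inspected and tightened, assuming a plain MySQL server and an account allowed to change global variables; on Aurora this would normally be done through the DB parameter group instead:

```python
# Minimal sketch: inspect and tighten MySQL idle-connection timeouts.
# Host and credentials are placeholders; SET GLOBAL needs admin privileges.
import pymysql

conn = pymysql.connect(host="mysql.example.internal",  # hypothetical host
                       user="admin", password="...", autocommit=True)
with conn.cursor() as cur:
    cur.execute("SHOW VARIABLES LIKE '%timeout%'")
    for name, value in cur.fetchall():
        print(name, value)
    # Terminate idle (sleeping) connections after 30 seconds, as described above.
    cur.execute("SET GLOBAL wait_timeout = 30")
    cur.execute("SET GLOBAL interactive_timeout = 30")
conn.close()
```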
Yes, exactly
…On Tue, Apr 11, 2023 at 2:51 PM Paul Volavsek ***@***.*** wrote:
Hey, we'll look into these three issues (#74, #75 and #76) this week.
Can you confirm that you are running into these with 1.3.45-rc1-bullseye?
Hey, we were not yet able to reproduce this issue. Do you also experience this issue when using only one instance?
Hi Gregzip,
Hey @gregzip, if this problem still persists: we've just released a new version. Can you please update and check if the problems prevail?
Hi, it still persists. We were planning to switch from Aurora to regular MySQL to make sure these are not some exotic driver issues again. We'll have a go with the new version. Btw, what happened to …?
I apologize for that. There was an issue in our CI/CD pipeline yesterday. We've identified and fixed the problem, so it will not occur again :)
Hey! There was a slight delay in our verification due to limited load on Monday and Tuesday, and after some further checks on ….
We're continuing with our attempt to migrate the DB, but it has its own challenges (it's 600+ GB at the moment). Is there a way to make it smaller somehow? Rebuilding some cashboxes or removing some unnecessary historical data, perhaps? 😅
Hey, thanks for the update 😁 Can you check how often you see the worker created message? Also, can you do a memory dump on a pod that is currently experiencing the problem (we have a guide here) and upload the dump and logs over a whole day from multiple pods here? Regarding the DB migration: if you provide us with a list of all queues in the database, we can check if all data is already uploaded (or up to which entry it is uploaded), and you can then do a backup and remove the already uploaded entries. A size analysis of the database would also be interesting (size of each table for a representative queue). I suspect that the largest table is the … table.
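To produce the per-table size figures asked for here, something along these lines could be run against the middleware database; host, credentials and schema name below are placeholders:

```python
# Sketch: report the largest tables in the current schema by data + index size.
import pymysql

conn = pymysql.connect(host="mysql.example.internal", user="report",
                       password="...", database="middleware")
with conn.cursor() as cur:
    cur.execute("""
        SELECT table_name,
               ROUND((data_length + index_length) / 1024 / 1024, 1) AS size_mb,
               table_rows
        FROM information_schema.tables
        WHERE table_schema = DATABASE()
        ORDER BY (data_length + index_length) DESC
        LIMIT 20
    """)
    for table_name, size_mb, table_rows in cur.fetchall():
        print(f"{table_name}: {size_mb} MB (~{table_rows} rows)")
conn.close()
```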
Hi! Regarding the worker created message, I think it might not be representative, because we were restarting all the pods every 30 mins to avoid the DB connection issues. That said, those issues seem to be gone after we moved to MySQL Community and increased the number of pods to 15. I think it's the latter that helped. My current hypothesis is that we were running into MySQLConnector's default pool size limit of 100 - we had 10 pods running for ~1300 active cashboxes. Would that make sense from your perspective? As I said, we managed to migrate to a regular MySQL server with all of that data in place. I had a quick look at the biggest tables and indeed it seems these are …
I checked that there are indeed open transactions for them, and I tried to close them for a couple of cashboxes (and waited for the daily closing, and even re-created the cashboxes today), but the size is still the same. 🤔 Some questions about this: …
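As an editorial aside (not part of the original thread), a quick back-of-envelope check of the pool-size hypothesis above, using only numbers quoted in this issue; whether each busy cashbox holds its own pooled connection is an assumption:

```python
# Back-of-envelope sketch; figures come from the thread, the per-cashbox
# connection behaviour is illustrative only.
active_cashboxes = 1300
pods = 10
default_max_pool_size = 100      # MySqlConnector's documented default

cashboxes_per_pod = active_cashboxes / pods
print(f"~{cashboxes_per_pod:.0f} cashboxes per pod vs. a pool limit of {default_max_pool_size}")
# A single pod can exhaust its 100-connection pool long before the server-wide
# limit of 2000 connections is reached, which would explain failures isolated
# to individual pods while the overall connection count stays low.
```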
Hey, we could try to add a config parameter for the Maximum Pool Size and see if increasing that value and reducing the replicas to 10 again also solves the issue.
Sorry, I was not clear on that. Closing the open transactions prevents rapid growth of the database in the future, but it will not shrink the database. At the moment we don't have an automated way to delete historical data. What we can do, though, is check on our side which data was already uploaded (please provide us with a list of queues for that), and then you can remove the old data.
Yes, that is perfectly fine.
That sounds great 🙏
Okay, we'll do that soon-ish. I already closed all of the currently open ones via an ad-hoc script, so I hope this gives us some time.
Hi! Any news on this? We'd be excited to check if the max pool size fixes our issues, as this could potentially allow us to use fewer compute resources and save quite a bit on infra costs 💸
Hey, we have not forgotten about you 😅 I hope we'll be able to tackle that in one of the next sprints.
Hey, we've just released 1.3.47-rc1-bullseye, which introduces the new configuration parameter for the maximum pool size.
We tried the new version and it seems the issue still persists 😬 It was mostly fine for a couple of days with fewer instances, so I disabled our restarting cron, and today things exploded again with the same errors as before. For now we're back to lots of instances and a cron that restarts them every night.
I did some investigation today and it seems that the pod started failing at around 200-220 connections. I'm going to re-create the scenario tomorrow and watch the CPU/memory usage more closely to be absolutely sure it's not about that. I should've done it before, but I somehow assumed that if the entire node's CPU usage is very low, then it's not an issue. But it seems that it's always an isolated pod that's failing, so that one pod might not be doing great there. I'll post my findings.
Quick update:
It seems that the issue is not about connection pooling (anymore?) and it's not about computing resources either 😬 Edit: …
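One way to attribute the ~200-220 connections to individual pods is to group the server's process list by client host; a sketch (not from the thread), assuming each pod connects from its own IP and placeholder credentials:

```python
# Sketch: count open MySQL connections per client host to map them to pods.
from collections import Counter
import pymysql

conn = pymysql.connect(host="mysql.example.internal", user="report", password="...")
with conn.cursor() as cur:
    cur.execute("SELECT host FROM information_schema.processlist")
    per_client = Counter(h.split(":")[0] for (h,) in cur.fetchall() if h)
for client, count in per_client.most_common():
    print(f"{client}: {count} connections")
conn.close()
```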
Hey, thanks for the detailed investigation. We'll investigate whether we have a thread leak. Could you also please get us one day's worth of logs from all pods (with log level debug)? This would help us a lot in analyzing the situation 🙏
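A sketch of how the requested day of logs could be collected from every pod; the namespace is a placeholder, and the middleware's own log level would still need to be set to debug separately:

```python
# Sketch: dump ~24h of logs from every pod in a namespace via kubectl.
import subprocess
import pathlib

namespace = "fiskaltrust"          # hypothetical namespace
out_dir = pathlib.Path("pod-logs")
out_dir.mkdir(exist_ok=True)

pods = subprocess.run(
    ["kubectl", "get", "pods", "-n", namespace, "-o", "name"],
    capture_output=True, text=True, check=True,
).stdout.split()

for pod in pods:                    # entries look like "pod/<name>"
    name = pod.split("/", 1)[1]
    logs = subprocess.run(
        ["kubectl", "logs", "-n", namespace, "--since=24h", name],
        capture_output=True, text=True, check=True,
    ).stdout
    (out_dir / f"{name}.log").write_text(logs)
```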
Hey! Any news on this?
Hey there! I know we owe you a memory dump, but this is not fixed yet, even with a larger pool size.
Sorry, this was autoclosed by mistake.
Describe the bug
Some time after the pods are started, requests to them start failing and MySQL connection issues show up in the logs. These pods frequently fail their health checks and are eventually restarted by K8s.
To Reproduce
Let pods process requests for some time.
Expected behavior
Requests are processed correctly.
Screenshots
N/A
STDOUT/STDERR
POSSystem (please complete the following information):
Cashbox Information (please complete the following information):
The problem doesn't seem cashbox-dependent at all. Some examples:
Additional context
We have 1k+ active cashboxes.
The fewer instances we start, the sooner we start seeing these errors. With 3 instances, errors started showing up within minutes of starting them. With 20 instances, it took a few hours.
After adding new instances, old ones are still experiencing the errors until they are restarted.
The database connection limit is 2000 and the number of active connections is always below 700. The errors were also showing up when we temporarily spun up a 96-CPU database instance, so a database issue is very unlikely.
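For completeness, a small sketch (not part of the original report) showing how the quoted connection limit and connection counts can be verified on the server; host and credentials are placeholders:

```python
# Sketch: compare the server's connection limit with current and peak usage.
import pymysql

conn = pymysql.connect(host="mysql.example.internal", user="report", password="...")
with conn.cursor() as cur:
    cur.execute("SHOW VARIABLES LIKE 'max_connections'")
    print(cur.fetchone())                       # e.g. ('max_connections', '2000')
    cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
    print(cur.fetchone())                       # currently open connections
    cur.execute("SHOW GLOBAL STATUS LIKE 'Max_used_connections'")
    print(cur.fetchone())                       # high-water mark since startup
conn.close()
```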