Startup of DBOS fails when database is unavailable at startup time #150
Comments
Thanks again - an interesting observation! If something like this happens in DBOS Cloud, we simply keep restarting your DBOS process until it returns 200 on the health check. This is what we recommend for self-hosting right now: check the health endpoint regularly and restart the process whenever it does not return 200. Depending on your infrastructure, this gives you some control over how many times to retry and when to alert. If the database is not available, there is not much we can do beyond this kind of check, and we think there are good reasons why this check-restart loop should live outside the DBOS executable rather than inside it. Does that make sense? Do you think this is enough to provide what you're looking for, or would you prefer a different behavior?
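For illustration, here is a minimal sketch of such an external check-restart loop. It is not part of DBOS: the names `http_health_check` and `supervise`, the restart callback, and the retry budget are all assumptions made for this example, and the health URL would be whatever endpoint your deployment exposes.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_check(url: str, timeout: float = 2.0) -> bool:
    """Return True only if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response: treat as unhealthy.
        return False


def supervise(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    max_restarts: int = 5,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Restart the process until it reports healthy.

    Returns the number of restarts that were needed. Raises RuntimeError
    once the restart budget is used up, which is the point to alert.
    """
    restarts = 0
    while not is_healthy():
        if restarts >= max_restarts:
            raise RuntimeError(f"still unhealthy after {restarts} restarts")
        restart()          # hypothetical hook, e.g. restart a container or service
        restarts += 1
        sleep(interval)    # give the process time to come up before re-checking
    return restarts
```

In practice `is_healthy` could be `lambda: http_health_check(url)` for your own health URL, and `restart` would call into whatever manages the process (systemd, a container runtime, an orchestrator). Most orchestrators already offer this loop as a liveness probe, so a hand-written supervisor is only needed where no such facility exists.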
@apoliakov thanks for your response! After #139 was implemented by @chuck-dbos I shut down my database during workflow processing and I observed two basic behaviors:
This is great: DBOS, when executing, recognizes and handles the unavailability of the database. I don't need any "external" mechanism to deal with this type of dependency interruption and recovery. In my view it would be ideal if the same polling for a database connection took place during DBOS launch as well as DBOS destroy (I don't know how the polling for connections is implemented, so I don't know whether the same code would work in all cases of database unavailability). If this were the case, DBOS would deal with the unavailability of the database in every stage of its life cycle (launch, processing, destroy). I also think this would be a great point of assurance for development and production, aside from being quite an elegant implementation because of its uniformity. Now, there might be reasons other than database unavailability for DBOS failing to launch. I have had no time to study the code (I am itching to). In those cases, throwing an exception and stopping the launch might make very good sense, and the failure would have to be handled "outside", as you describe in your earlier response. However, I believe the database is an important case for DBOS to handle throughout its entire life cycle.
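To make the idea concrete, here is a rough sketch of what polling during launch could look like. It does not reflect how DBOS is implemented: `launch_with_retry`, the `launch` callable, and the choice of `ConnectionError` as the "database unavailable" signal are assumptions for this example only. Connection failures are retried with exponential backoff, while any other exception still stops the launch and propagates to the caller.

```python
import time
from typing import Callable, Tuple, Type


def launch_with_retry(
    launch: Callable[[], None],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `launch` until it succeeds; return the number of attempts made.

    Only exceptions listed in `retry_on` (assumed here to mean "database
    unavailable") are retried. Anything else propagates immediately.
    """
    delay = initial_delay
    attempts = 0
    while True:
        attempts += 1
        try:
            launch()  # hypothetical: whatever call starts the application
            return attempts
        except retry_on:
            sleep(delay)
            # Exponential backoff, capped so polling never gets too sparse.
            delay = min(delay * 2, max_delay)
```

This version retries forever, which matches the expectation that the life cycles of the application and the database are independent; a production variant would probably add a deadline or a maximum attempt count so that a permanently misconfigured database is still reported.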
Thank you for addressing recovery from a database outage in #139.
I admit that I was secretly hoping that this would also support starting up DBOS when the database is unavailable. However, currently (Python 0.12) DBOS exits when started with the database unavailable (see reproduction below; start the code with the database down, log attached).
My expectation was that I could start DBOS without the database being available as well. Whether or not the database is available, starting, running, and stopping DBOS should be possible (the life cycles of DBOS and the database are independent of each other).
The underlying rationale is that DBOS, the database (and possibly other components) are deployed on VMs or containers whose life cycles are independent of each other. The database might be restarted, or moved in an HA/DR (high availability/disaster recovery) case; #139 addresses this. The database might be running fine while the container/VM that runs DBOS crashes or is moved due to HA/DR or other reasons. Both might happen concurrently in an HA/DR case. If all systems are restarted, then ideally I do not want to impose a restart order on the VMs/containers, in order to reduce complexity and avoid deadlock. Ideally, the infrastructure restart of VMs/containers is not dictated by the applications they are running.
Thank you.
Log: