Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Startup of DBOS fails when database is unavailable at startup time #150

Open
chbussler opened this issue Nov 10, 2024 · 2 comments
Open

Comments

@chbussler
Copy link
Contributor

Thank you for addressing recovering from a database outage in #139

I admit that I was secretly hoping that this will also support starting up DBOS when the database is unavailable. However, currently (python 0.12) DBOS exits when started with the database unavailable (see reproduction below, start the code with database down, log attached).

My expectation was that I can start DBOS without the database being available as well. So if the database is available or not, start, execution, and stop of DBOS shall be possible (the life cycle of both, DBOS and database are independent of each other).

The underlying rationale is that DBOS, database (and possibly other components) are deployed on VMs or containers that are independent of each other in terms of their life cycle. It might be possible that the database is restarted (or moved in an HA/DR - high availability/disaster recovery - case, #139 addresses this). It might be possible that the database is running fine, but the container/VM that runs DBOS is crashing or moved due to HA/DR or other reasons. Both might be happening concurrently in an HA/DR case. If all systems are restarted, then ideally I do not want to introduce an order of restart between the VMs/containers in order to reduce complexity and avoid deadlock. Ideally the infrastructure restart of VMs/containers is not dictated by the applications those are running.

Thank you.

import sys
import time
import traceback
from datetime import datetime

from dbos import DBOS

DBOS()


@DBOS.scheduled("*/5 * * * * *")
@DBOS.workflow()
def print_log(scheduled_time: datetime,
              actual_time: datetime) -> None:
    DBOS.logger.info("Workflow ran")


def main(arguments: list[str]) -> None:
    DBOS().launch()

    while True:
        time.sleep(5)
        DBOS.logger.info('alive')


if __name__ == "__main__":
    try:
        main(sys.argv)
    except KeyboardInterrupt:
        DBOS.logger.warning('KeyboardInterrupt received')
    except Exception as e:
        DBOS.logger.error(traceback.format_exc())
    finally:
        DBOS.destroy()

Log:

"C:\Users\miso\VirtualEnvironments\drag\Scripts\python.exe" "C:\Users\miso\Workspaces\drag\z_research\database_unavailable_at_startup\db_unavailable.py" 
06:52:03 [    INFO] (dbos:_dbos.py:266) Initializing DBOS
06:52:07 [   ERROR] (dbos:_dbos.py:402) DBOS failed to launch: Traceback (most recent call last):
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\base.py", line 146, in __init__
    self._dbapi_connection = engine.raw_connection()
                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\base.py", line 3302, in raw_connection
    return self.pool.connect()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 449, in connect
    return _ConnectionFairy._checkout(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 1263, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 712, in checkout
    rec = pool._do_get()
          ^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\impl.py", line 179, in _do_get
    with util.safe_reraise():
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\util\langhelpers.py", line 146, in __exit__
    raise exc_value.with_traceback(exc_tb)
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\impl.py", line 177, in _do_get
    return self._create_connection()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 390, in _create_connection
    return _ConnectionRecord(self)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 674, in __init__
    self.__connect()
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 900, in __connect
    with util.safe_reraise():
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\util\langhelpers.py", line 146, in __exit__
    raise exc_value.with_traceback(exc_tb)
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 896, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\create.py", line 643, in connect
    return dialect.connect(*cargs, **cparams)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\default.py", line 621, in connect
    return self.loaded_dbapi.connect(*cargs, **cparams)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\psycopg\connection.py", line 119, in connect
    raise last_ex.with_traceback(None)
psycopg.OperationalError: connection failed: connection to server at "127.0.0.1", port 5432 failed: could not receive data from server: Socket is not connected (0x00002749/10057)
could not send SSL negotiation packet: Socket is not connected (0x00002749/10057)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\dbos\_dbos.py", line 348, in _launch
    self._sys_db_field = SystemDatabase(self.config)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\dbos\_sys_db.py", line 177, in __init__
    with engine.connect() as conn:
         ^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\base.py", line 3278, in connect
    return self._connection_cls(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\base.py", line 148, in __init__
    Connection._handle_dbapi_exception_noconnection(
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\base.py", line 2442, in _handle_dbapi_exception_noconnection
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\base.py", line 146, in __init__
    self._dbapi_connection = engine.raw_connection()
                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\base.py", line 3302, in raw_connection
    return self.pool.connect()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 449, in connect
    return _ConnectionFairy._checkout(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 1263, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 712, in checkout
    rec = pool._do_get()
          ^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\impl.py", line 179, in _do_get
    with util.safe_reraise():
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\util\langhelpers.py", line 146, in __exit__
    raise exc_value.with_traceback(exc_tb)
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\impl.py", line 177, in _do_get
    return self._create_connection()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 390, in _create_connection
    return _ConnectionRecord(self)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 674, in __init__
    self.__connect()
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 900, in __connect
    with util.safe_reraise():
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\util\langhelpers.py", line 146, in __exit__
    raise exc_value.with_traceback(exc_tb)
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 896, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\create.py", line 643, in connect
    return dialect.connect(*cargs, **cparams)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\default.py", line 621, in connect
    return self.loaded_dbapi.connect(*cargs, **cparams)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\psycopg\connection.py", line 119, in connect
    raise last_ex.with_traceback(None)
sqlalchemy.exc.OperationalError: (psycopg.OperationalError) connection failed: connection to server at "127.0.0.1", port 5432 failed: could not receive data from server: Socket is not connected (0x00002749/10057)
could not send SSL negotiation packet: Socket is not connected (0x00002749/10057)
(Background on this error at: https://sqlalche.me/e/20/e3q8)

06:52:07 [   ERROR] (dbos:db_unavailable.py:32) Traceback (most recent call last):
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\base.py", line 146, in __init__
    self._dbapi_connection = engine.raw_connection()
                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\base.py", line 3302, in raw_connection
    return self.pool.connect()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 449, in connect
    return _ConnectionFairy._checkout(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 1263, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 712, in checkout
    rec = pool._do_get()
          ^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\impl.py", line 179, in _do_get
    with util.safe_reraise():
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\util\langhelpers.py", line 146, in __exit__
    raise exc_value.with_traceback(exc_tb)
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\impl.py", line 177, in _do_get
    return self._create_connection()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 390, in _create_connection
    return _ConnectionRecord(self)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 674, in __init__
    self.__connect()
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 900, in __connect
    with util.safe_reraise():
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\util\langhelpers.py", line 146, in __exit__
    raise exc_value.with_traceback(exc_tb)
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 896, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\create.py", line 643, in connect
    return dialect.connect(*cargs, **cparams)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\default.py", line 621, in connect
    return self.loaded_dbapi.connect(*cargs, **cparams)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\psycopg\connection.py", line 119, in connect
    raise last_ex.with_traceback(None)
psycopg.OperationalError: connection failed: connection to server at "127.0.0.1", port 5432 failed: could not receive data from server: Socket is not connected (0x00002749/10057)
could not send SSL negotiation packet: Socket is not connected (0x00002749/10057)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\miso\Workspaces\drag\z_research\database_unavailable_at_startup\db_unavailable.py", line 28, in <module>
    main(sys.argv)
  File "C:\Users\miso\Workspaces\drag\z_research\database_unavailable_at_startup\db_unavailable.py", line 19, in main
    DBOS().launch()
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\dbos\_dbos.py", line 339, in launch
    _dbos_global_instance._launch()
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\dbos\_dbos.py", line 348, in _launch
    self._sys_db_field = SystemDatabase(self.config)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\dbos\_sys_db.py", line 177, in __init__
    with engine.connect() as conn:
         ^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\base.py", line 3278, in connect
    return self._connection_cls(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\base.py", line 148, in __init__
    Connection._handle_dbapi_exception_noconnection(
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\base.py", line 2442, in _handle_dbapi_exception_noconnection
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\base.py", line 146, in __init__
    self._dbapi_connection = engine.raw_connection()
                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\base.py", line 3302, in raw_connection
    return self.pool.connect()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 449, in connect
    return _ConnectionFairy._checkout(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 1263, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 712, in checkout
    rec = pool._do_get()
          ^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\impl.py", line 179, in _do_get
    with util.safe_reraise():
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\util\langhelpers.py", line 146, in __exit__
    raise exc_value.with_traceback(exc_tb)
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\impl.py", line 177, in _do_get
    return self._create_connection()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 390, in _create_connection
    return _ConnectionRecord(self)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 674, in __init__
    self.__connect()
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 900, in __connect
    with util.safe_reraise():
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\util\langhelpers.py", line 146, in __exit__
    raise exc_value.with_traceback(exc_tb)
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\pool\base.py", line 896, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\create.py", line 643, in connect
    return dialect.connect(*cargs, **cparams)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\sqlalchemy\engine\default.py", line 621, in connect
    return self.loaded_dbapi.connect(*cargs, **cparams)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\miso\VirtualEnvironments\drag\Lib\site-packages\psycopg\connection.py", line 119, in connect
    raise last_ex.with_traceback(None)
sqlalchemy.exc.OperationalError: (psycopg.OperationalError) connection failed: connection to server at "127.0.0.1", port 5432 failed: could not receive data from server: Socket is not connected (0x00002749/10057)
could not send SSL negotiation packet: Socket is not connected (0x00002749/10057)
(Background on this error at: https://sqlalche.me/e/20/e3q8)


Process finished with exit code 0
@apoliakov
Copy link
Member

apoliakov commented Nov 10, 2024

Thanks again - an interesting observation!

If something like this happens in DBOS Cloud, we simply continue restarting your DBOS process until it returns 200 on health check. This is what we recommend for self-hosting right now. Check this endpoint regularly and restart if not. Depending on your infrastructure, this gives you some control over how many times to retry and when to alert.

If the database is not available, we cannot really take any other actions other than this kind of check. And we think there are good reasons why this check-restart loop should be outside of DBOS executable rather than part of it.

Does that make sense? Do you think this is enough to provide what you're looking for or would you prefer a different behavior?

@chbussler
Copy link
Contributor Author

@apoliakov thanks for your response!

After #139 was implemented by @chuck-dbos I shut down my database during workflow processing and I observed two basic behaviors:

  • Transactions were aborted. This is correct as chances are that the database shutdown took place during transaction processing.
  • DBOS was polling for a database connection. When I restarted the database, as soon as it was ready to accept connections, DBOS resumed processing.

This is great: DBOS, when executing, recognizes and handles the unavailability of the database. I don't need any "external" mechanism to deal with this type of dependency interrupt and recovery.

In my view it would be ideal if the same polling for a database connection were to take place during DBOS launch as well as DBOS destroy (I don't know how the polling for connections is implemented, so I don't know if the same code would work in all cases of database unavailability). If this were the case, in any stage of the DBOS life cycle (launch, processing, destroy) DBOS would deal with the unavailability of the database.

I would also think this would make a great point of assurance for development and production aside from it being quite an elegant implementation based on its uniformity.

Now, there might be other reasons than database unavailability for DBOS failing to launch. I have had no time to study the code (I am itching for it). In those cases throwing an exception and stop the launch might make very much sense and would have to be handled "outside", as you describe in your earlier response.

However, I believe the database to be an important case for DBOS to handle throughout its entire life cycle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants