🐛 Scalability issue with multiple simultaneous DIDExchange requests #3492

Open
ff137 opened this issue Feb 4, 2025 · 0 comments

When multiple tenants simultaneously request a DIDExchange connection with an issuer's public DID, several unhandled exceptions are raised, causing all requested connections to fail.

The handling logic and auto-complete flows associated with the DIDExchange request do not report any error to the clients that made the request, leaving their connection records stuck in the request-sent state.

The issuer never gets the expected request-received records, not even for one of the many requests.

Note: this is running the latest ACA-Py release, with askar 0.4.3

Steps to Reproduce

Reproducing this in ACA-Py alone takes many steps, so the simplest way is to check out our acapy-cloud repo (previously aries-cloudapi-python), where a simple test script does all the setup and replicates the issue for you: https://github.com/didx-xyz/acapy-cloud

In summary, besides all the steps for onboarding an issuer and registering their public DID, here's how to replicate this issue:

  1. Create multiple tenants (reliably fails for me with 10)
  2. For each one, initiate a DIDExchange connection request (POST /didexchange/create-request) using use_public_did to set the issuer's public DID for the request.
  3. Observe the unhandled exceptions raised in the system logs.
  4. Check the state of the connection records for the tenants and the issuer.
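
For anyone who wants to try steps 2 and 3 without the acapy-cloud test suite, here is a minimal sketch of firing the requests at the same time. It is not the test script from the repo. The admin URL, the per-tenant bearer tokens, and the helper names are all hypothetical, and the query-parameter name `their_public_did` is an assumption here (the description above refers to `use_public_did`), so it is left configurable.

```python
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def build_create_request_url(
    admin_url: str,
    issuer_public_did: str,
    did_param: str = "their_public_did",  # assumption: adjust to your API version
) -> str:
    """Build the URL for POST /didexchange/create-request."""
    query = urllib.parse.urlencode({did_param: issuer_public_did})
    return f"{admin_url.rstrip('/')}/didexchange/create-request?{query}"


def create_request(admin_url: str, tenant_token: str, issuer_public_did: str) -> dict:
    """Send one DIDExchange request on behalf of a single tenant."""
    req = urllib.request.Request(
        build_create_request_url(admin_url, issuer_public_did),
        method="POST",
        # Assumption: tenants authenticate with a bearer token.
        headers={"Authorization": f"Bearer {tenant_token}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def fire_simultaneously(
    admin_url: str, tenant_tokens: list, issuer_public_did: str
) -> list:
    """Run one request per tenant in parallel threads so they overlap.

    Returns one entry per tenant: the response body, or the exception raised.
    """

    def attempt(token: str):
        try:
            return create_request(admin_url, token, issuer_public_did)
        except Exception as exc:  # keep going so every tenant is attempted
            return exc

    with ThreadPoolExecutor(max_workers=max(1, len(tenant_tokens))) as pool:
        return list(pool.map(attempt, tenant_tokens))
```

Calling `fire_simultaneously(admin_url, tokens, issuer_did)` with 10 tenant tokens should be enough to overlap the requests. Since the clients get no error back, the responses alone will not show the failure. The agent logs and the connection records still need to be checked afterwards.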

The above steps can be achieved as follows:

  1. Follow the quick start guide in acapy-cloud (clone the repo, install prerequisites, spin up the stack)
  2. Cherry-pick the following commit to get my test script:
    git cherry-pick 8153a47eb62bea6de75ef132d0600bec8e76cab6
    This adds a test file at app/tests/e2e/test_many_connections.py
  3. Spin up the stack: mise run tilt:up, and wait for services to be up and running (visit localhost:10350)
  4. Run the test: pytest app/tests/e2e/test_many_connections.py
  5. Click on the Multitenant-Agent tab in the Tilt UI (localhost:10350) to view logs

The test should fail with "Connection 0 failed with exception" and then "expected webhook not received".

Under Multitenant-Agent logs, you'll see many exceptions being raised, one for each request.

The stack trace suggests the cause is a timeout while waiting to open an Askar session:

2025-02-04 11:14:49,084 acapy_agent.core.dispatcher ERROR Handler error: Dispatcher.handle_v1_message
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/home/aries/.local/lib/python3.12/site-packages/aries_askar/store.py", line 773, in _open
    await bindings.session_start(self._store, self._profile, self._is_txn),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aries/.local/lib/python3.12/site-packages/aries_askar/bindings/__init__.py", line 266, in session_start
    handle = await invoke_async(
             ^^^^^^^^^^^^^^^^^^^
  File "/home/aries/.local/lib/python3.12/site-packages/aries_askar/bindings/lib.py", line 393, in invoke_async
    return await self.loaded.invoke_async(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/asyncio/futures.py", line 289, in __await__
    yield self  # This tells Task to wait for completion.
    ^^^^^^^^^^
  File "/usr/local/lib/python3.12/asyncio/tasks.py", line 385, in __wakeup
    future.result()
  File "/usr/local/lib/python3.12/asyncio/futures.py", line 197, in result
    raise self._make_cancelled_error()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/asyncio/tasks.py", line 314, in __step_run_and_handle_result
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/home/aries/.local/lib/python3.12/site-packages/acapy_agent/core/dispatcher.py", line 257, in handle_v1_message
    await handler(context, responder)
  File "/home/aries/.local/lib/python3.12/site-packages/acapy_agent/protocols/didexchange/v1_0/handlers/request_handler.py", line 36, in handle
    conn_rec = await mgr.receive_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aries/.local/lib/python3.12/site-packages/acapy_agent/protocols/didexchange/v1_0/manager.py", line 558, in receive_request
    conn_rec = await self._receive_request_public_did(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aries/.local/lib/python3.12/site-packages/acapy_agent/protocols/didexchange/v1_0/manager.py", line 704, in _receive_request_public_did
    await self._extract_and_record_did_doc_info(request)
  File "/home/aries/.local/lib/python3.12/site-packages/acapy_agent/protocols/didexchange/v1_0/manager.py", line 725, in _extract_and_record_did_doc_info
    await self.store_did_document(conn_did_doc)
  File "/home/aries/.local/lib/python3.12/site-packages/acapy_agent/connections/base_manager.py", line 380, in store_did_document
    async with self._profile.session() as session:
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aries/.local/lib/python3.12/site-packages/acapy_agent/core/profile.py", line 197, in __aenter__
    await self._setup()
  File "/home/aries/.local/lib/python3.12/site-packages/acapy_agent/askar/profile.py", line 252, in _setup
    self._handle = await asyncio.wait_for(self._opener, 10)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/asyncio/tasks.py", line 519, in wait_for
    async with timeouts.timeout(timeout):
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/asyncio/timeouts.py", line 115, in __aexit__
    raise TimeoutError from exc_val
TimeoutError
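
To illustrate the pattern in the trace, here is a small, self-contained simulation of what `asyncio.wait_for(self._opener, 10)` does when the thing being awaited does not finish in time. It is only a sketch: the semaphore stands in for a limited pool of sessions, and the pool size, hold time, and function names are hypothetical, not taken from Askar or ACA-Py. It does not claim to reproduce the root cause. In this toy model the first few requests succeed, whereas in the report above every request fails.

```python
import asyncio


async def open_session(pool: asyncio.Semaphore, timeout: float) -> None:
    """Wait for a free slot, giving up after `timeout` seconds.

    Mirrors the shape of `await asyncio.wait_for(self._opener, 10)`:
    on timeout the inner wait is cancelled and TimeoutError is raised.
    """
    await asyncio.wait_for(pool.acquire(), timeout)


async def handle_request(pool: asyncio.Semaphore, hold: float, timeout: float) -> str:
    """Simulate a handler that needs a session to store a DID document."""
    try:
        await open_session(pool, timeout)
    except asyncio.TimeoutError:
        return "timeout"
    try:
        await asyncio.sleep(hold)  # pretend to do work while holding the session
        return "ok"
    finally:
        pool.release()


async def simulate(
    n_requests: int = 10,
    pool_size: int = 2,
    hold: float = 0.2,
    timeout: float = 0.1,
) -> list:
    """Run `n_requests` handlers concurrently against `pool_size` slots."""
    pool = asyncio.Semaphore(pool_size)
    return await asyncio.gather(
        *(handle_request(pool, hold, timeout) for _ in range(n_requests))
    )
```

Running `asyncio.run(simulate())` shows that any request still waiting when the timeout expires fails, regardless of how close it was to getting a slot. Whether the real contention is in a connection pool, a lock, or something else is not established by the trace alone.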

PS: Log levels can be modified in helm/acapy-cloud/conf/local/multitenant-agent.yaml, e.g. set ACAPY_LOG_LEVEL to debug


Please let me know if the replication steps are successful or not, or whether you need help with the acapy-cloud mise setup.
