Dagster+ Agent request timeouts #21597
-
Symptoms
Requests from the Kubernetes Dagster+ agent are often timing out. This problem is indicated by timeout error messages appearing frequently in the agent logs.
Possible causes
This is largely encountered in Kubernetes hybrid deployments where network egress must traverse a NAT gateway or other proxy that may impose an idle timeout on connections, or where connections between the agent and the Dagster+ API endpoints are otherwise frequently dropped. It can also happen due to known issues such as port or IP exhaustion on platforms such as GCP Cloud NAT.
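If you want to check whether your egress path silently drops idle connections, a minimal sketch along these lines can help. It is not part of the original guide; the hostname and the idle window are placeholders to adjust for your deployment.

# Sketch: open an HTTPS connection to the endpoint, let it sit idle longer
# than the suspected NAT/proxy idle timeout, then try to reuse it.
import socket
import ssl
import time

HOST = "your-org.agent.dagster.cloud"  # placeholder: any Dagster+ endpoint the agent talks to
IDLE_SECONDS = 400  # placeholder: longer than the idle timeout you suspect

request = b"HEAD / HTTP/1.1\r\nHost: " + HOST.encode() + b"\r\nConnection: keep-alive\r\n\r\n"

ctx = ssl.create_default_context()
with socket.create_connection((HOST, 443), timeout=30) as raw_sock:
    with ctx.wrap_socket(raw_sock, server_hostname=HOST) as tls:
        tls.sendall(request)
        tls.recv(4096)  # first response arrives while the connection is fresh
        time.sleep(IDLE_SECONDS)  # the connection now looks idle to any NAT in the path
        tls.settimeout(30)
        try:
            tls.sendall(request)
            reply = tls.recv(4096)
            # A clean server-side close returns empty data quickly; a silent
            # NAT drop usually surfaces as a timeout or reset instead.
            print("Reply after idle period:", reply[:60] or "connection closed by peer")
        except OSError as exc:
            print("Idle connection appears to have been dropped:", exc)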
Replies: 3 comments 4 replies
-
Solutions

TCP KEEPALIVE
When using the dagster-cloud-agent helm charts at version 1.7.3 or later, you can use the following options to enable probing of TCP connections, which preemptively detects broken connections and reduces or eliminates the problem. Note that setting these options also requires the dagster version for all of your code locations to be 1.7.3 or later.

dagsterCloud:
  socketOptions:
    - ["SOL_SOCKET", "SO_KEEPALIVE", 1]
    - ["IPPROTO_TCP", "TCP_KEEPIDLE", 11]
    - ["IPPROTO_TCP", "TCP_KEEPINTVL", 7]
    - ["IPPROTO_TCP", "TCP_KEEPCNT", 5]

These values are relatively aggressive and will cause additional network packets to be sent to probe and maintain the connection. You may want to experiment with these values to see what works for your needs.

How this works: SO_KEEPALIVE turns on TCP keepalive probes for otherwise idle connections. TCP_KEEPIDLE is the number of seconds a connection may sit idle before the first probe is sent, TCP_KEEPINTVL is the number of seconds between probes, and TCP_KEEPCNT is the number of unanswered probes after which the kernel drops the connection. With the values above, a silently broken connection is detected after roughly 11 + 7 × 5 = 46 seconds, and the periodic probes also keep NAT table entries from expiring on an otherwise idle connection.

Upgrade the Dagster+ agent and user code to 1.5.9+
Since 1.5.9, most of the calls made by the agent are retried, which makes the agent more resilient to networking errors.

Script to reproduce and test networking changes
You can run the following script in the agent pod to attempt to reproduce the issue (and to test networking changes):
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

from dagster import DagsterInstance

try:
    from dagster_cloud_cli.core.graphql_client import (
        create_agent_graphql_client,
    )
except ImportError:
    from dagster_cloud_cli.core.graphql_client import (
        create_proxy_client as create_agent_graphql_client,
    )

# Script that attempts to reproduce network issues connecting to Dagster Cloud
# servers under load
NUM_TRIALS = 1000
CONCURRENCY = 5

di = DagsterInstance.get()
session = requests.Session()

# Disable retries for the purpose of the test
modified_client = create_agent_graphql_client(
    session,
    di.dagster_cloud_graphql_url,
    {**di._dagster_cloud_api_config, "retries": 0},
)


def _fetch():
    """Execute GraphQL query in a threadpool"""
    return modified_client.execute("query TestScriptQuery {__typename}")


with ThreadPoolExecutor(max_workers=CONCURRENCY) as executor:
    count = 0
    finished = 0
    for f in as_completed(executor.submit(_fetch) for i in range(NUM_TRIALS)):
        finished += 1
        print("Finished trial " + str(finished))
        if f.result():
            count += 1

print(f"{count} successful results")
-
Another way to work around NAT issues is to add a proxy, e.g. Azure Front Door CDN, to route traffic from the (Kubernetes) Dagster agents to Dagster Cloud. In our case this significantly reduced connection aborts. Note that such a proxy can be costly; for reducing timeouts, the TCP KEEPALIVE options were the primary (and cheapest) solution.
-
@mlarose is this confirmed to work on Google Cloud? I tried applying these options, and I'm getting a crashloop when GKE tries to launch the agent pod.