Dagster+ Agent request timeouts #21597
-
Symptoms
Requests from the Kubernetes Dagster+ agent are often timing out. This problem is indicated by timeout error messages appearing frequently in the agent logs.
Possible causes
This is largely encountered in Kubernetes hybrid deployments where network egress must traverse a NAT gateway or other proxy that may impose an idle timeout on connections, or where connections between the agent and the Dagster+ API endpoints are otherwise frequently dropped. It can also happen due to known issues such as port or IP exhaustion on platforms such as GCP Cloud NAT.
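If you want to check whether your egress path silently drops idle connections, a minimal sketch along these lines can help. It is not part of the original guide; the hostname and the idle window are placeholders to adjust for your deployment.

# Sketch: open an HTTPS connection to the endpoint, let it sit idle longer
# than the suspected NAT/proxy idle timeout, then try to reuse it.
import socket
import ssl
import time

HOST = "your-org.agent.dagster.cloud"  # placeholder: any Dagster+ endpoint the agent talks to
IDLE_SECONDS = 400  # placeholder: longer than the idle timeout you suspect

request = b"HEAD / HTTP/1.1\r\nHost: " + HOST.encode() + b"\r\nConnection: keep-alive\r\n\r\n"

ctx = ssl.create_default_context()
with socket.create_connection((HOST, 443), timeout=30) as raw_sock:
    with ctx.wrap_socket(raw_sock, server_hostname=HOST) as tls:
        tls.sendall(request)
        tls.recv(4096)  # first response arrives while the connection is fresh
        time.sleep(IDLE_SECONDS)  # the connection now looks idle to any NAT in the path
        tls.settimeout(30)
        try:
            tls.sendall(request)
            reply = tls.recv(4096)
            # A clean server-side close returns empty data quickly; a silent
            # NAT drop usually surfaces as a timeout or reset instead.
            print("Reply after idle period:", reply[:60] or "connection closed by peer")
        except OSError as exc:
            print("Idle connection appears to have been dropped:", exc)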
Replies: 3 comments 4 replies
-
Solutions

TCP KEEPALIVE
When using the dagster-cloud-agent helm charts at version 1.7.3 or later, you can use the following options to enable probing of TCP connections, which preemptively detects broken connections and reduces or eliminates the problem. Note that setting these options also requires the dagster version for all of your code locations to be 1.7.3 or later.

dagsterCloud:
  socketOptions:
    - ["SOL_SOCKET", "SO_KEEPALIVE", 1]
    - ["IPPROTO_TCP", "TCP_KEEPIDLE", 11]
    - ["IPPROTO_TCP", "TCP_KEEPINTVL", 7]
    - ["IPPROTO_TCP", "TCP_KEEPCNT", 5]

These values are relatively aggressive and will cause additional network packets to be sent to probe and maintain the connection. You may want to experiment with these values to see what works for your needs.

How this works: SO_KEEPALIVE turns on TCP keepalive probes for otherwise idle connections. TCP_KEEPIDLE is the number of seconds a connection may sit idle before the first probe is sent, TCP_KEEPINTVL is the number of seconds between probes, and TCP_KEEPCNT is the number of unanswered probes after which the kernel drops the connection. With the values above, a silently broken connection is detected after roughly 11 + 7 × 5 = 46 seconds, and the periodic probes also keep NAT table entries from expiring on an otherwise idle connection.

Upgrade the Dagster+ agent and user code to 1.5.9+
Since 1.5.9, most of the calls made by the agent are retried, which makes the agent more resilient to networking errors.

Script to reproduce and test networking changes
You can run the following script in the agent pod to attempt to reproduce the issue (and to test networking changes):
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

from dagster import DagsterInstance

try:
    from dagster_cloud_cli.core.graphql_client import (
        create_agent_graphql_client,
    )
except ImportError:
    from dagster_cloud_cli.core.graphql_client import (
        create_proxy_client as create_agent_graphql_client,
    )

# Script that attempts to reproduce network issues connecting to Dagster Cloud
# servers under load
NUM_TRIALS = 1000
CONCURRENCY = 5

di = DagsterInstance.get()
session = requests.Session()

# Disable retries for the purpose of the test
modified_client = create_agent_graphql_client(
    session,
    di.dagster_cloud_graphql_url,
    {**di._dagster_cloud_api_config, "retries": 0},
)


def _fetch():
    """Execute GraphQL query in a threadpool"""
    return modified_client.execute("query TestScriptQuery {__typename}")


with ThreadPoolExecutor(max_workers=CONCURRENCY) as executor:
    count = 0
    finished = 0
    for f in as_completed(executor.submit(_fetch) for i in range(NUM_TRIALS)):
        finished += 1
        print("Finished trial " + str(finished))
        if f.result():
            count += 1

print(f"{count} successful results")
-
Another way to work around NAT issues is to add a proxy, e.g. Azure Front Door CDN, to route traffic from the (Kubernetes) Dagster agents to Dagster Cloud. In our case this significantly reduced connection aborts. Note that such a proxy can be costly; for reducing timeouts, the TCP KEEPALIVE options were the primary (and cheapest) solution.
-
@mlarose is this confirmed to work on Google Cloud? I tried applying these options, and I'm getting a crashloop when GKE tries to launch the agent pod.