
[Bug]: Investigate why DAG logs disappear #387


Open
LucaCinquini opened this issue Apr 9, 2025 · 3 comments
Labels
bug Something isn't working U-SPS

Comments

@LucaCinquini
Collaborator

Often we see Airflow tasks failing with this error:

airflow-worker-1.airflow-worker.sps.svc.cluster.local
*** No logs found on s3 for ti=<TaskInstance: cwl_dag.cwl_task manual__2025-04-09T13:49:56+00:00 [running]>
*** Could not read served logs: HTTPConnectionPool(host='airflow-worker-1.airflow-worker.sps.svc.cluster.local', port=8793): Max retries exceeded with url: /log/dag_id=cwl_dag/run_id=manual__2025-04-09T13:49:56+00:00/task_id=cwl_task/attempt=2.log (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f4f747b2cd0>: Failed to resolve 'airflow-worker-1.airflow-worker.sps.svc.cluster.local' ([Errno -2] Name or service not known)"))

Must investigate the cause and fix the problem.
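
A possible first step is to check whether the worker hostname from the error resolves at all from inside the cluster. The snippet below is only a minimal sketch: `can_resolve` is a hypothetical helper (not part of Airflow), and it assumes it is run from a pod in the same namespace as the webserver. Port 8793 is taken from the error message above.

```python
import socket


def can_resolve(hostname: str, port: int = 8793) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, port)) > 0
    except socket.gaierror:
        # Same class of failure as "[Errno -2] Name or service not known".
        return False


if __name__ == "__main__":
    # Hostname copied from the error message in this issue.
    host = "airflow-worker-1.airflow-worker.sps.svc.cluster.local"
    print(host, "resolves" if can_resolve(host) else "does not resolve")
```

If the name does not resolve while the task is still running, that would point at DNS records for the worker pod rather than at the log server itself.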

@LucaCinquini LucaCinquini added the bug Something isn't working label Apr 9, 2025
@LucaCinquini
Collaborator Author

Note, though, that the problem seems to be temporary: the task eventually finds its log stream again and completes successfully.

@LucaCinquini
Collaborator Author

Other times this error occurs:

  File "/home/airflow/.local/lib/python3.11/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': '3ce8c2fd-cb94-458a-85cc-6298c0471521', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Wed, 09 Apr 2025 15:23:32 GMT', 'Content-Length': '266'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Get \"https://10.6.38.224:10250/containerLogs/sps/cwl-task-pod-m6664pve/base?follow=true\\u0026sinceSeconds=41\\u0026timestamps=true\\": dial tcp 10.6.38.224:10250: i/o timeout","code":500}\n'
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 767, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
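
Since the 500 above comes from a timeout while the API server contacts the kubelet, it may be transient, and a retry around the log-reading call might be enough to ride it out. The following is a rough sketch of that pattern, not how Airflow itself reads logs: `retry_on_server_error` is a hypothetical helper, and it relies only on the exception carrying an integer `status` attribute, as `kubernetes.client.exceptions.ApiException` does.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_server_error(
    fn: Callable[[], T],
    attempts: int = 3,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying when it raises an exception whose `status` is 5xx."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            status = getattr(exc, "status", None)
            transient = isinstance(status, int) and 500 <= status < 600
            # Re-raise anything that is not a server error, or the last failure.
            if not transient or attempt == attempts:
                raise
            sleep(delay * attempt)  # linear backoff between attempts
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the helper can be exercised without real waiting. Client errors such as 404 are re-raised immediately, because retrying them would not help.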

@LucaCinquini
Collaborator Author

See reference from Gerald: apache/airflow#43912

Projects
Status: Todo