
Phantom/Orphaned runs causing false alerts on fail run sensor #26722

Open
RamiApos opened this issue Dec 25, 2024 · 0 comments
Labels
area: sensor (Related to Sensors), type: bug (Something isn't working)

Comments

@RamiApos

What's the issue?

Hi,
I've encountered a very strange issue. I pushed new code, but for some odd reason the docker-compose down step suddenly failed with an error about the volumes (this had never happened before). I decided to delete the volumes and rerun the CI/CD.

It worked, but I hadn't realized that my S3 bucket sensor would queue 600+ jobs (I guess one per file). I panicked a bit, terminated and deleted all the runs, and I assume some of them were not canceled properly, because I'm now getting false alerts.
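
For reference, the S3 sensor follows roughly the standard cursor pattern below (a simplified sketch; the job, bucket name, and module paths are placeholders, not my exact production code). Since both the cursor and the run-key de-duplication live in the Postgres storage, deleting the volume would make every existing file look new again, which is my best guess for where the 600+ queued runs came from.

import boto3
from dagster import RunRequest, SensorEvaluationContext, sensor

from my_project.jobs import process_file_job  # placeholder job


@sensor(job=process_file_job)
def s3_file_sensor(context: SensorEvaluationContext):
    # The cursor (last processed key) is persisted in Dagster's storage,
    # so wiping the Postgres volume resets it.
    last_key = context.cursor or ""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    new_keys = []
    for page in paginator.paginate(Bucket="my-bucket"):  # placeholder bucket
        for obj in page.get("Contents", []):
            if obj["Key"] > last_key:
                new_keys.append(obj["Key"])
    for key in sorted(new_keys):
        # run_key normally de-duplicates requests, but only against run storage,
        # which was wiped along with the volume
        yield RunRequest(run_key=key)
    if new_keys:
        context.update_cursor(max(new_keys))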

I have an email alert wired to a failure sensor, and for most jobs it now sends an alert even though the job succeeded (it still works normally for genuinely failed jobs). The false alerts carry the error "This run has been marked as failed from outside the execution context."
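
The failure sensor is essentially the standard run_failure_sensor pattern (sketch below; send_email stands in for my real alerting helper). As a stopgap I'm considering filtering on that error message, but I'd rather understand where the phantom runs come from.

from dagster import RunFailureSensorContext, run_failure_sensor

from my_project.alerts import send_email  # placeholder for the real email helper


@run_failure_sensor
def email_on_run_failure(context: RunFailureSensorContext):
    message = context.failure_event.message or ""
    # Possible stopgap: skip runs that were failed from outside the execution context
    if "marked as failed from outside the execution context" in message:
        return
    send_email(
        subject=f"Run {context.dagster_run.run_id} failed",
        body=message,
    )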

The run IDs in the false alerts do not exist in my Postgres storage. I verified that several times and even purged the volume again; purging the sensor and schedule ticks did not help either.
I also tried renaming the sensor, but that didn't solve the issue.
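
This is roughly how I checked (a minimal sketch, run inside the webserver container so DAGSTER_HOME points at the Postgres-backed instance; the run ID is a placeholder copied from one of the alert emails):

from dagster import DagsterInstance

instance = DagsterInstance.get()
run = instance.get_run_by_id("00000000-0000-0000-0000-000000000000")
print(run)  # prints None -> the run does not exist in run storage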

I'm all out of tricks... please help.

What did you expect to happen?

No response

How to reproduce?

No response

Dagster version

1.8.0

Deployment type

Docker Compose

Deployment details

version: "3.8"

services:
dagster_postgresql:
image: image_url:latest
container_name: dagster_postgres_storage
restart: unless-stopped
ports:
- 5432:5432
volumes:
- dagster_postgresql_data:/var/lib/postgresql/data
networks:
- docker_dagster_network

dagster_webserver:
image: dagster-webserver-service
container_name: dagster_webserver
env_file:
- .env
restart: unless-stopped
ports:
- "3000:3000"
environment:
DAGSTER_CURRENT_IMAGE: "dagster-webserver-service"
PRODUCTION: "Y"
SYSTEM_DEFAULT_STATUS: "RUNNING"
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- /tmp/io_manager_storage:/tmp/io_manager_storage
networks:
- docker_dagster_network
depends_on:
- dagster_postgresql

dagster_daemon:
image: dagster-webserver-service
container_name: dagster_daemon
env_file:
- .env
command: "dagster-daemon run"
restart: on-failure
environment:
DAGSTER_CURRENT_IMAGE: "dagster-webserver-service"
PRODUCTION: "Y"
SYSTEM_DEFAULT_STATUS: "RUNNING"
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- /tmp/io_manager_storage:/tmp/io_manager_storage
networks:
- docker_dagster_network

networks:
docker_dagster_network:
driver: bridge
name: docker_dagster_network

volumes:
dagster_postgresql_data:
driver: local

-- dagster.yaml:
scheduler:
  module: dagster.core.scheduler
  class: DagsterDaemonScheduler

run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 5

run_launcher:
  module: dagster_docker
  class: DockerRunLauncher
  config:
    env_vars:
      - env_1
      - env_2
    network: docker_dagster_network
    container_kwargs:
      volumes: # Make docker client accessible to any launched containers as well
        - /var/run/docker.sock:/var/run/docker.sock
        - /tmp/io_manager_storage:/tmp/io_manager_storage
      auto_remove: true

run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      hostname: dagster_postgresql
      username:
        env: POSTGRES_USER
      password:
        env: POSTGRES_PASSWORD
      db_name:
        env: POSTGRES_DB
      port: 5432

schedule_storage:
  module: dagster_postgres.schedule_storage
  class: PostgresScheduleStorage
  config:
    postgres_db:
      hostname: dagster_postgresql
      username:
        env: POSTGRES_USER
      password:
        env: POSTGRES_PASSWORD
      db_name:
        env: POSTGRES_DB
      port: 5432

event_log_storage:
  module: dagster_postgres.event_log
  class: PostgresEventLogStorage
  config:
    postgres_db:
      hostname: dagster_postgresql
      username:
        env: POSTGRES_USER
      password:
        env: POSTGRES_PASSWORD
      db_name:
        env: POSTGRES_DB
      port: 5432

telemetry:
  enabled: false

nux:
  enabled: false

sensors:
  use_threads: true
  num_workers: 8

retention:
  schedule:
    purge_after_days: 2
  sensor:
    purge_after_days:
      skipped: 1
      failure: 1
      success: 1

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.
By submitting this issue, you agree to follow Dagster's Code of Conduct.

@RamiApos added the type: bug label on Dec 25, 2024
@garethbrickman added the area: sensor label on Dec 26, 2024