
Phantom/Orphaned runs causing false alerts on fail run sensor #26722

Open
RamiApos opened this issue Dec 25, 2024 · 0 comments
Labels
area: sensor (Related to Sensors), type: bug (Something isn't working)

Comments

@RamiApos

What's the issue?

Hi,
I've encountered a very strange issue. I pushed new code, but for some odd reason the docker-compose down step suddenly failed with an error about the volumes (this had never happened before). I decided to delete the volumes and rerun the CI/CD.

It worked, but I hadn't realized that my S3 bucket sensor would queue 600+ jobs (I guess one per file). I panicked a bit, terminated and deleted all the runs, and I assume some of them were not canceled properly, because I'm now getting false alerts.
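
For reference, the S3 sensor follows roughly the standard cursor pattern below (a simplified sketch; the job, bucket name, and module paths are placeholders, not my exact production code). Since both the cursor and the run-key de-duplication live in the Postgres storage, deleting the volume would make every existing file look new again, which is my best guess for where the 600+ queued runs came from.

import boto3
from dagster import RunRequest, SensorEvaluationContext, sensor

from my_project.jobs import process_file_job  # placeholder job


@sensor(job=process_file_job)
def s3_file_sensor(context: SensorEvaluationContext):
    # The cursor (last processed key) is persisted in Dagster's storage,
    # so wiping the Postgres volume resets it.
    last_key = context.cursor or ""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    new_keys = []
    for page in paginator.paginate(Bucket="my-bucket"):  # placeholder bucket
        for obj in page.get("Contents", []):
            if obj["Key"] > last_key:
                new_keys.append(obj["Key"])
    for key in sorted(new_keys):
        # run_key normally de-duplicates requests, but only against run storage,
        # which was wiped along with the volume
        yield RunRequest(run_key=key)
    if new_keys:
        context.update_cursor(max(new_keys))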

I have an email alert wired to a failure sensor, and for most jobs it now sends an alert even though the job succeeded (it still works normally for genuinely failed jobs). The false alerts carry the error "This run has been marked as failed from outside the execution context."
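
The failure sensor is essentially the standard run_failure_sensor pattern (sketch below; send_email stands in for my real alerting helper). As a stopgap I'm considering filtering on that error message, but I'd rather understand where the phantom runs come from.

from dagster import RunFailureSensorContext, run_failure_sensor

from my_project.alerts import send_email  # placeholder for the real email helper


@run_failure_sensor
def email_on_run_failure(context: RunFailureSensorContext):
    message = context.failure_event.message or ""
    # Possible stopgap: skip runs that were failed from outside the execution context
    if "marked as failed from outside the execution context" in message:
        return
    send_email(
        subject=f"Run {context.dagster_run.run_id} failed",
        body=message,
    )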

The run IDs in the false alerts do not exist in my Postgres storage. I verified that several times and even purged the volume again; purging the sensor and schedule ticks did not help either.
I also tried renaming the sensor, but that didn't solve the issue.
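
This is roughly how I checked (a minimal sketch, run inside the webserver container so DAGSTER_HOME points at the Postgres-backed instance; the run ID is a placeholder copied from one of the alert emails):

from dagster import DagsterInstance

instance = DagsterInstance.get()
run = instance.get_run_by_id("00000000-0000-0000-0000-000000000000")
print(run)  # prints None -> the run does not exist in run storage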

I'm all out of tricks... please help.

What did you expect to happen?

No response

How to reproduce?

No response

Dagster version

1.8.0

Deployment type

Docker Compose

Deployment details

version: "3.8"

services:
dagster_postgresql:
image: image_url:latest
container_name: dagster_postgres_storage
restart: unless-stopped
ports:
- 5432:5432
volumes:
- dagster_postgresql_data:/var/lib/postgresql/data
networks:
- docker_dagster_network

dagster_webserver:
image: dagster-webserver-service
container_name: dagster_webserver
env_file:
- .env
restart: unless-stopped
ports:
- "3000:3000"
environment:
DAGSTER_CURRENT_IMAGE: "dagster-webserver-service"
PRODUCTION: "Y"
SYSTEM_DEFAULT_STATUS: "RUNNING"
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- /tmp/io_manager_storage:/tmp/io_manager_storage
networks:
- docker_dagster_network
depends_on:
- dagster_postgresql

dagster_daemon:
image: dagster-webserver-service
container_name: dagster_daemon
env_file:
- .env
command: "dagster-daemon run"
restart: on-failure
environment:
DAGSTER_CURRENT_IMAGE: "dagster-webserver-service"
PRODUCTION: "Y"
SYSTEM_DEFAULT_STATUS: "RUNNING"
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- /tmp/io_manager_storage:/tmp/io_manager_storage
networks:
- docker_dagster_network

networks:
docker_dagster_network:
driver: bridge
name: docker_dagster_network

volumes:
dagster_postgresql_data:
driver: local

-- dagster.yaml:
scheduler:
  module: dagster.core.scheduler
  class: DagsterDaemonScheduler

run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 5

run_launcher:
  module: dagster_docker
  class: DockerRunLauncher
  config:
    env_vars:
      - env_1
      - env_2
    network: docker_dagster_network
    container_kwargs:
      volumes: # Make docker client accessible to any launched containers as well
        - /var/run/docker.sock:/var/run/docker.sock
        - /tmp/io_manager_storage:/tmp/io_manager_storage
      auto_remove: true

run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      hostname: dagster_postgresql
      username:
        env: POSTGRES_USER
      password:
        env: POSTGRES_PASSWORD
      db_name:
        env: POSTGRES_DB
      port: 5432

schedule_storage:
  module: dagster_postgres.schedule_storage
  class: PostgresScheduleStorage
  config:
    postgres_db:
      hostname: dagster_postgresql
      username:
        env: POSTGRES_USER
      password:
        env: POSTGRES_PASSWORD
      db_name:
        env: POSTGRES_DB
      port: 5432

event_log_storage:
  module: dagster_postgres.event_log
  class: PostgresEventLogStorage
  config:
    postgres_db:
      hostname: dagster_postgresql
      username:
        env: POSTGRES_USER
      password:
        env: POSTGRES_PASSWORD
      db_name:
        env: POSTGRES_DB
      port: 5432

telemetry:
  enabled: false

nux:
  enabled: false

sensors:
  use_threads: true
  num_workers: 8

retention:
  schedule:
    purge_after_days: 2
  sensor:
    purge_after_days:
      skipped: 1
      failure: 1
      success: 1

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.
By submitting this issue, you agree to follow Dagster's Code of Conduct.

@RamiApos added the type: bug label on Dec 25, 2024
@garethbrickman added the area: sensor label on Dec 26, 2024