Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issues with data query. #43

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tyger
Copy link

@tyger tyger commented Jun 29, 2021

  • Fix issue with cross-join in DB in get_dag_duration_info function.
    ** Current version was using extra table for filtering that caused SQLAlchemy to generate query that creates cross-join in Postgres creating enormous amount of data. Fixing it by using values from query itself.
    This is the PostgreSQL syntax query that is generated by current version
SELECT anon_1.dag_id, anon_1.start_date, dag_run.end_date
FROM task_instance,
     (SELECT anon_2.dag_id AS dag_id, anon_2.max_execution_dt AS execution_date, min(task_instance.start_date) AS start_date
      FROM (SELECT dag_run.dag_id AS dag_id, max(dag_run.execution_date) AS max_execution_dt
            FROM dag_run
            JOIN dag ON dag.dag_id = dag_run.dag_id
            WHERE dag.is_active = true
              AND dag.is_paused = false
              AND dag_run.state = %(state_1)s AND dag_run.end_date IS NOT NULL
            GROUP BY dag_run.dag_id
           ) AS anon_2
            JOIN task_instance ON task_instance.dag_id = anon_2.dag_id AND task_instance.execution_date = anon_2.max_execution_dt
      GROUP BY anon_2.dag_id, anon_2.max_execution_dt
     ) AS anon_1
      JOIN dag_run ON dag_run.dag_id = anon_1.dag_id AND dag_run.execution_date = anon_1.execution_date
WHERE task_instance.start_date IS NOT NULL
  AND task_instance.end_date IS NOT NULL

New version of the query

SELECT anon_1.dag_id, anon_1.start_date, dag_run.end_date
FROM (SELECT anon_2.dag_id AS dag_id, anon_2.max_execution_dt AS execution_date, min(task_instance.start_date) AS start_date
      FROM (SELECT dag_run.dag_id AS dag_id, max(dag_run.execution_date) AS max_execution_dt
            FROM dag_run
            JOIN dag ON dag.dag_id = dag_run.dag_id
            WHERE dag.is_active = true
              AND dag.is_paused = false
              AND dag_run.state = %(state_1)s AND dag_run.end_date IS NOT NULL
            GROUP BY dag_run.dag_id
           ) AS anon_2
            JOIN task_instance ON task_instance.dag_id = anon_2.dag_id AND task_instance.execution_date = anon_2.max_execution_dt
      GROUP BY anon_2.dag_id, anon_2.max_execution_dt
     ) AS anon_1
      JOIN dag_run ON dag_run.dag_id = anon_1.dag_id AND dag_run.execution_date = anon_1.execution_date
WHERE anon_1.start_date IS NOT NULL
  AND dag_run.end_date IS NOT NULL;

As you can see in the first variant there is task_instance table is added to query FROM part causing DB to do cross join on data.

  • Fix issue with xcom value is not returned as an object.
    ** The expected type of xcom_value is to be dict which is not true in all cases. So taking care of that as well.

Fix issue with xcom value is not returned as an object
@tyger tyger force-pushed the fix-cross-join-issue branch from 08c650b to e7afc8f Compare June 29, 2021 21:40
@tyger
Copy link
Author

tyger commented Jun 30, 2021

Hi @abhishekray07, will you be able to review and pull this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant