Why When Running Airflow, the Simplest DAG Seems to be Run Twice?

Have you ever encountered an issue where your simple Airflow DAG appears to be running twice? You’re not alone! This is a common phenomenon that can usually be attributed to how Airflow parses DAG files and creates DAG runs.

Understanding Airflow’s DAG Run Mechanism

Airflow’s core functionality revolves around running DAGs (Directed Acyclic Graphs), which are collections of tasks organized in a specific order. When you trigger a DAG, Airflow creates exactly one DAG run. What makes a simple DAG look like it ran twice is that two different things happen to it: its tasks are executed as part of a DAG run, and its file is parsed, repeatedly, by the scheduler.

DAG Run vs. DAG File Parsing

  • DAG Run: This is the actual execution of your DAG, where tasks are performed and data is processed. Each run has its own `run_id`.
  • DAG File Parsing: The scheduler re-imports your DAG file at regular intervals to discover DAGs and work out when the next run is due. Every parse executes the file’s top-level code, but it does not execute any tasks.

Because parsing happens independently of DAG runs, anything at the top level of the DAG file (a `print`, a log call, a database query) fires again on every parse, which creates the illusion that your DAG is running twice.
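A common source of the “ran twice” impression is that top-level code in a DAG file executes each time the file is parsed, while a task body executes only inside a DAG run. The following is a minimal sketch of that difference in plain Python, not Airflow’s own parsing code: `DAG_FILE_SOURCE`, `parse_dag_file`, and `my_task` are hypothetical names, and `exec` stands in for the scheduler importing the file.

```python
import textwrap

# Hypothetical DAG file contents. A real DAG file would import from airflow;
# plain Python is used here so the sketch runs without Airflow installed.
DAG_FILE_SOURCE = textwrap.dedent(
    """
    events.append("top-level code ran")    # executes on every parse

    def my_task():
        events.append("task body ran")      # executes only inside a DAG run
    """
)


def parse_dag_file(source, events):
    """Stand-in for one scheduler parse: execute the file's top-level code."""
    namespace = {"events": events}
    exec(compile(source, "example_dag.py", "exec"), namespace)
    return namespace


events = []
namespace = {}
for _ in range(3):  # three parse cycles
    namespace = parse_dag_file(DAG_FILE_SOURCE, events)
namespace["my_task"]()  # a single DAG run executes the task once

print(events.count("top-level code ran"))  # 3
print(events.count("task body ran"))       # 1
```

If the duplicated output you see comes from a line like the first one, moving it inside a task is usually enough to make it appear once per run.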

Why Does the DAG Appear to Run Twice?

Airflow encourages idempotent tasks so that runs are repeatable and do not cause unintended side effects, but idempotence is not what produces the apparent duplicate. It comes from the order in which Airflow processes a DAG:

  1. The scheduler parses the DAG file, executing its top-level code, and registers the DAG and its schedule.
  2. When a run is due or is triggered, Airflow creates one DAG run with its task instances, and the tasks are executed.
  3. The scheduler keeps re-parsing the DAG file at a fixed interval to pick up changes and to schedule the next run (if applicable).

Steps 1 and 3 repeat whether or not a run is in progress, so top-level code in the file runs many times. A second cause is a genuine second run: if the DAG has a schedule and a `start_date` in the past, unpausing it lets the scheduler create a scheduled run, so a manual trigger at about the same time leaves you with two runs. In both cases Airflow is not executing the same DAG run twice.
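A second run can also be a real one: a scheduled run created next to a manual trigger. The sketch below is a simplified model of which scheduled runs become due for a DAG with a fixed interval, under the assumption that a run is created once its interval has ended and that disabling catchup keeps only the most recent completed interval. `scheduled_run_dates` is a hypothetical helper, not an Airflow function.

```python
from datetime import datetime, timedelta


def scheduled_run_dates(start_date, interval, now, catchup):
    """Return the logical dates of the scheduled runs in this simplified model.

    A run for the interval [d, d + interval) becomes due once that interval
    has ended. With catchup, every completed interval gets a run; without it,
    only the most recent completed interval does.
    """
    due = []
    logical_date = start_date
    while logical_date + interval <= now:
        due.append(logical_date)
        logical_date += interval
    return due if catchup else due[-1:]


start = datetime(2024, 1, 1)
now = datetime(2024, 1, 4, 12)
daily = timedelta(days=1)

with_catchup = scheduled_run_dates(start, daily, now, catchup=True)
without_catchup = scheduled_run_dates(start, daily, now, catchup=False)

print(len(with_catchup))     # 3
print(len(without_catchup))  # 1

# Adding one manual trigger on top of the scheduled run gives two DAG runs.
print(len(without_catchup) + 1)  # 2
```

Under these assumptions, a DAG whose `start_date` is in the past produces at least one scheduled run as soon as it is unpaused, even with catchup turned off.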

Conclusion

In conclusion, the apparent duplicate run of your simple Airflow DAG is usually a result of how Airflow works internally: the scheduler re-parses the DAG file independently of DAG runs, and a scheduled run can be created alongside a manual one. By understanding the difference between a DAG run and DAG file parsing, you can better comprehend Airflow’s behavior and optimize your DAGs for efficient execution.

Frequently Asked Questions

Are you tired of seeing your simplest DAG run twice in Airflow and wondering why? You’re not alone! We’ve got the answers to your queries.

Why does my DAG run twice when I trigger it manually in Airflow?

A manual trigger creates exactly one DAG run; Airflow does not perform a “Dry Run” first. If you see two runs, the second is usually a scheduled run that the scheduler created when the DAG was unpaused, because the DAG has a schedule and a `start_date` in the past. If you see only one run but duplicated output, the output most likely comes from top-level code in the DAG file, which executes every time the file is parsed.

Is there a Dry Run feature in Airflow that I can disable?

No. Triggering a DAG has no built-in Dry Run step, so there is nothing to disable. If the problem is noise in the log files, you can raise the logging level: set the `logging_level` option in the `[logging]` section of your `airflow.cfg` file (Airflow 2.x) to `WARNING` or higher, since `INFO` is already the default. If the problem is top-level code that runs on every parse, move that code into a task.
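As a rough illustration of the logging setting, the sketch below reads a hypothetical `airflow.cfg` fragment with the standard library and checks that the configured level sits above `INFO`. It assumes the option lives in the `[logging]` section, as in Airflow 2.x, and that the matching environment variable follows Airflow’s `AIRFLOW__<SECTION>__<KEY>` naming convention; it does not call Airflow itself.

```python
import configparser
import logging

# Fragment of a hypothetical airflow.cfg; only the relevant option is shown.
AIRFLOW_CFG_FRAGMENT = """
[logging]
logging_level = WARNING
"""

parser = configparser.ConfigParser()
parser.read_string(AIRFLOW_CFG_FRAGMENT)
configured = parser.get("logging", "logging_level")

# Standard-library numeric values: INFO is 20, WARNING is 30.
configured_value = getattr(logging, configured)
suppresses_info = configured_value > logging.INFO

# Equivalent environment variable, following AIRFLOW__<SECTION>__<KEY>.
env_var = "AIRFLOW__{}__{}".format("logging".upper(), "logging_level".upper())

print(configured, configured_value, suppresses_info)  # WARNING 30 True
print(env_var)  # AIRFLOW__LOGGING__LOGGING_LEVEL
```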

Will the DAG run twice if I use the `Trigger DAG` button in the Airflow UI?

No. The `Trigger DAG` button creates a single manual DAG run, the same as any other manual trigger. If a second run shows up next to it, check its `run_id`: by default a scheduled run starts with `scheduled__`, while the run created by the button starts with `manual__`.

Does the DAG appear to run twice only when I trigger it manually, or are there other scenarios where this happens?

The trigger method makes no difference. The UI button, the `airflow dags trigger` command in the CLI, and the Airflow REST API each create one DAG run per call. Repeated top-level output can appear with any of them, or with no trigger at all, because the scheduler parses the DAG file on its own timer.

How can I avoid confusion when debugging my DAG?

To avoid confusion when debugging your DAG, check the `run_id` of each DAG run (the REST API exposes the same value as `dag_run_id`). By default, a manual run gets a `run_id` starting with `manual__` and a scheduled run gets one starting with `scheduled__`. You can list a DAG’s runs and their IDs with the `airflow dags list-runs` command; `airflow dags list` shows only the DAGs themselves, not their runs.
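When two runs do show up, grouping them by `run_id` prefix is a quick way to see whether one was manual and the other scheduled. The helper below is a small sketch that assumes Airflow’s default `run_id` format, where the run type comes before a double underscore; `run_type` and the sample IDs are hypothetical, and a custom `run_id` passed at trigger time falls into the `custom` bucket.

```python
from collections import Counter


def run_type(run_id):
    """Classify a run by the prefix of its run_id (text before the first '__')."""
    prefix, separator, _ = run_id.partition("__")
    if separator and prefix in {"manual", "scheduled", "backfill"}:
        return prefix
    return "custom"


# Hypothetical run_ids in the default format, for illustration only.
run_ids = [
    "scheduled__2024-01-03T00:00:00+00:00",
    "manual__2024-01-04T12:00:00+00:00",
]

counts = Counter(run_type(r) for r in run_ids)
print(counts["scheduled"], counts["manual"])  # 1 1
```

One scheduled and one manual entry for the same DAG means two separate runs were created, not one run executed twice.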