Apache Airflow Guide
DAGs
A Directed Acyclic Graph (DAG) is a graph representation of a data pipeline: nodes typically represent tasks, and directed edges represent dependencies between tasks (an edge points from the task that must be run before the other).
Directed: Dependencies between tasks go one way. Acyclic: The graph contains no loops or cycles, which prevents circular dependencies.
The general flow of DAGs
- For each uncompleted task, the upstream dependencies at the other end of its incoming directed edges are checked for completion; if they are all complete, the task is added to a queue for execution
- Tasks in the execution queue are run and marked as completed as soon as they finish
- Steps 1 and 2 repeat until all tasks are marked as completed
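The loop above can be sketched in plain Python. This is a simplified simulation of the idea, not Airflow's actual scheduler; the task names are made up for illustration:

```python
# Simplified simulation of the DAG execution loop described above.
# Each task maps to the set of upstream tasks it depends on.
deps = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load", "transform"},
}

def run_dag(deps):
    """Run tasks in dependency order; returns the completion order."""
    completed = []
    pending = set(deps)
    while pending:
        # Step 1: queue every task whose upstream dependencies are all complete.
        ready = [t for t in pending if deps[t] <= set(completed)]
        if not ready:
            raise RuntimeError("Cycle detected: no runnable tasks remain")
        # Step 2: run the queued tasks and mark them completed.
        for task in sorted(ready):
            completed.append(task)
            pending.remove(task)
        # Step 3: loop until everything is complete.
    return completed

print(run_dag(deps))  # ['extract', 'transform', 'load', 'report']
```

Note that the acyclic property is what guarantees this loop terminates: in a graph with a cycle, at some point no task would ever become ready.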
Why use Airflow?
Using Apache Airflow can be advantageous for several reasons:
- Workflow Automation: Airflow allows you to automate complex workflows, making it easier to schedule and manage tasks in a systematic manner.
- Flexibility: It provides flexibility in defining workflows as Python code, allowing you to create custom logic and handle various data processing scenarios.
- Dependency Management: Airflow lets you define dependencies between tasks, ensuring that tasks are executed in the correct order.
- Monitoring and Logging: You can monitor and log the execution of tasks, making it easier to troubleshoot issues and track progress.
- Scalability: Airflow can handle both small and large workflows, making it suitable for a wide range of data processing needs.
- Extensibility: You can extend Airflow's functionality by integrating it with external systems and tools.
What is Airflow?
Apache Airflow is an open-source platform used for orchestrating, scheduling, and automating complex data workflows. It allows users to define, schedule, and monitor workflows as directed acyclic graphs (DAGs), making it easier to manage and automate data processing, ETL (Extract, Transform, Load) tasks, and other workflow-related activities. Airflow is highly customizable, extensible, and widely used for data pipeline automation and management.
Airflow's Core Components
- Scheduler:
  - The Scheduler is responsible for scheduling when and how often a DAG (Directed Acyclic Graph) should run. It continuously checks the status of DAGs and triggers the execution of tasks based on predefined schedules.
- Work Queue:
  - Airflow uses a message queuing system (such as Celery backed by RabbitMQ or Redis) as a central communication hub between the Scheduler and Workers. It enqueues tasks for Workers to execute and monitors their progress.
- Metadata Database:
  - Airflow uses a database (commonly PostgreSQL) to store metadata related to DAGs, tasks, schedules, and job status. This database is crucial for tracking the state and history of workflow executions.
- Executor:
  - The Executor is a critical component responsible for executing tasks on worker nodes.
  - Airflow supports various Executors, including the LocalExecutor (for single-machine deployments), the CeleryExecutor (for distributed execution using Celery), and more.
  - The Executor orchestrates the parallel execution of tasks across worker nodes, ensuring efficient task execution based on dependencies and scheduling.
- Worker:
  - Workers are responsible for executing the tasks defined within DAGs. They pull tasks from the work queue, execute them, and report the results back to the Scheduler.
- DAGs (Directed Acyclic Graphs):
  - DAGs are at the heart of Airflow. They define the structure and dependencies of your workflow. Each DAG represents a workflow composed of tasks, with a defined execution order and dependencies.
- Web Interface (UI):
  - The web-based UI provides a user-friendly way to interact with Airflow. Users can monitor the status of DAGs, view logs, trigger manual runs, and visualize the structure of their workflows using the DAG visualization tools.
- CLI (Command Line Interface):
  - Airflow offers a command-line interface for interacting with and managing Airflow components, such as triggering DAGs, testing tasks, and performing administrative tasks.
- Triggering:
  - Triggering allows users to manually start a DAG run, even if it is not scheduled to run at that moment. It is often used for ad-hoc or on-demand execution of workflows.
  - To trigger a DAG, users can use the Airflow web interface (UI) or the command-line interface (CLI) to initiate a run.
  - Once triggered, Airflow creates a new DAG run instance and executes the tasks defined within the DAG based on their dependencies and order.
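Triggering and inspecting DAGs from the CLI looks roughly like this, assuming an installed Airflow 2.x deployment (my_pipeline and extract are illustrative DAG and task ids):

```shell
# List the DAGs Airflow has loaded from the DAGs folder.
airflow dags list

# Manually trigger a new run of a DAG (ad-hoc execution).
airflow dags trigger my_pipeline

# Run a single task in isolation for debugging, without
# recording its state in the metadata database.
airflow tasks test my_pipeline extract 2024-01-01
```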
Core Concepts
Let's explore the core concepts of Apache Airflow: DAGs, Operators (Action Operators, Transfer Operators, and Sensor Operators), and Tasks/Task Instances, along with the relationships between them.
- DAG (Directed Acyclic Graph):
  - A DAG in Apache Airflow represents a workflow, a series of tasks with dependencies.
  - It is a collection of tasks and the order in which they should be executed.
  - DAGs are directed (tasks have a defined order) and acyclic (no circular dependencies).
- Operator:
  - Operators define what gets done by a task in a DAG.
  - There are different types of operators, including Action Operators, Transfer Operators, and Sensor Operators, each serving a specific purpose.
- Action Operators:
  - Action Operators perform actions or computations.
  - Examples include the PythonOperator for running Python functions, the BashOperator for executing shell commands, and the DummyOperator for creating placeholder tasks.
- Transfer Operators:
  - Transfer Operators move data between systems or tasks.
  - An example is the S3ToRedshiftOperator, which transfers data from Amazon S3 to Redshift.
- Sensor Operators:
  - Sensor Operators wait for a certain condition to be met before downstream tasks execute.
  - Examples include the SqlSensor for waiting on the availability of data in a database, the ExternalTaskSensor for waiting on the status of external tasks, and the HttpSensor for checking the availability of a web service.
- Task/Task Instance:
  - A task represents a unit of work to be done in a DAG.
  - Each task is an instance of an operator and is associated with a specific execution date.
  - Task Instances are created when a DAG run is triggered and represent the individual execution of a task at a particular point in time.
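These concepts come together in a DAG file. Below is a minimal sketch assuming Airflow 2.x; the DAG id, task ids, and the Python callable are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def extract():
    # Placeholder for real extraction logic.
    print("extracting...")

# The DAG object defines the workflow: its schedule and start date.
with DAG(
    dag_id="example_etl",            # illustrative id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",      # cron expressions also work here
    catchup=False,
) as dag:
    # Each operator instance below becomes a task in the DAG.
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = BashOperator(task_id="load", bash_command="echo loading")

    # The >> operator declares the dependency: extract runs before load.
    extract_task >> load_task
```

When this file is placed in the DAGs folder, each scheduled run creates one DAG run, and each task in it becomes a Task Instance tied to that run's execution date.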
Single-Node and Multi-Node Architectures
Single-Node Architecture:
- Components on a Single Machine:
  - In a single-node architecture, all Airflow components (Scheduler, Workers, Database, and Web Interface) are hosted on a single machine.
- Simplicity and Ease of Setup:
  - Single-node setups are straightforward to install and suitable for smaller workflows, development, and testing purposes.
  - They require minimal configuration and are easier to manage.
- Limited Scalability:
  - Single-node setups have limited scalability because they are constrained by the resources (CPU, memory, and storage) of a single machine.
  - They may not be suitable for running large or resource-intensive workflows.
- Low Fault Tolerance:
  - Single-node architectures lack built-in high availability and fault tolerance.
  - If the single machine fails, the entire Airflow instance becomes unavailable.
Multi-Node Architecture:
- Distributed Setup:
  - In a multi-node architecture, Airflow components are distributed across multiple machines or nodes.
  - Common components are the Scheduler, Workers, Database (often hosted on a separate server), and Web Interface (which may also run on a separate machine).
- Scalability and Resource Isolation:
  - Multi-node architectures provide better scalability by allowing you to add more worker nodes to handle increased task loads.
  - Resources can be allocated more flexibly, enabling resource isolation for different workflows.
- High Availability:
  - Multi-node setups can be configured for high availability, ensuring that the Airflow system remains accessible even if one node fails.
  - Redundant components and load balancing can be used to achieve this.
- Complexity and Maintenance:
  - Multi-node setups are more complex to configure and maintain than single-node setups.
  - They often involve setting up a distributed message queue (e.g., Celery with RabbitMQ or Redis) and configuring external databases for better performance and reliability.
- Suitable for Production:
  - Multi-node architectures are well suited for production environments where reliability, scalability, and fault tolerance are crucial.
  - They can handle large, mission-critical workflows effectively.
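Switching to a multi-node setup mostly comes down to the executor and connection settings in airflow.cfg. A sketch of the relevant excerpt, with illustrative hostnames and credentials:

```ini
; airflow.cfg excerpt for a multi-node deployment (values are illustrative)
[core]
executor = CeleryExecutor

[database]
; Metadata database shared by all nodes (PostgreSQL is a common choice)
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@db-host:5432/airflow

[celery]
; Message broker used as the work queue between Scheduler and Workers
broker_url = redis://broker-host:6379/0
; Where Celery stores task results
result_backend = db+postgresql://airflow:airflow@db-host:5432/airflow
```

Every node (Scheduler, Workers, Web Server) must share the same configuration and have access to the same DAGs folder.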
How does it work?
For the purposes of this course, a single-node architecture will be used, containing the Web Server, the DAGs folder, the Scheduler, the Metastore, and the Executor.
This is how Apache Airflow works: when a DAG file is placed in the DAGs folder, the Scheduler picks it up; it scans the folder for new Python files every 5 minutes and re-parses existing DAG files for changes or errors every 30 seconds.
When the Scheduler runs the DAG, it creates a DagRun object. With the DagRun in the "running" status, the first task to be executed becomes a TaskInstance object, whose status moves from "none" to "scheduled."
The Scheduler then sends the TaskInstance to the Executor, where it enters the Executor's queue and receives the "queued" status. The Executor creates a subprocess to run the task, which ends in success or failure.
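The status transitions described above can be sketched as a small state machine in plain Python. This is a simplified illustration of the lifecycle, not Airflow's internal model:

```python
# Simplified model of the TaskInstance status lifecycle described above:
# none -> scheduled -> queued -> running -> success | failed
ALLOWED = {
    "none": {"scheduled"},
    "scheduled": {"queued"},
    "queued": {"running"},
    "running": {"success", "failed"},
}

class TaskInstance:
    def __init__(self, task_id):
        self.task_id = task_id
        self.status = "none"  # initial status before scheduling

    def advance(self, new_status):
        """Move to new_status, rejecting transitions the lifecycle forbids."""
        if new_status not in ALLOWED.get(self.status, set()):
            raise ValueError(f"Illegal transition: {self.status} -> {new_status}")
        self.status = new_status
        return self.status

# Walk one task through a successful run.
ti = TaskInstance("extract")
for status in ["scheduled", "queued", "running", "success"]:
    ti.advance(status)
print(ti.status)  # success
```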
Finally, the Web Server updates its UI with the execution-status data from the Metastore.
DAG Views
DAG View
The DAG view displays all DAGs that have been recognized and loaded successfully in Airflow. Multiple example DAGs are often loaded automatically. To activate a DAG, toggle the switch to the left of the DAG name.
The Owner field identifies the owner of the DAG. The RUNS column shows run statuses, and SCHEDULE shows the schedule (e.g., daily, monthly), with support for CRON expressions. LAST RUN shows when the DAG last ran; NEXT RUN shows when the next execution is scheduled. RECENT TASKS acts as a short history of the most recent task executions of the DAG in question. Finally, there is the RUN button, which allows us to trigger the DAG manually; the trash bin, representing the DELETE option (it does not delete the DAG file, only its metadata in the interface); and links to other useful views.
Grid View
A useful visualization for quickly seeing the status of task histories. There is a diagram with some bars and a right-hand side view with more detailed information about the execution.
Graph View
A view that visually displays the structure of the DAG. Tasks are shown as rectangles arranged in execution order, and the colored outlines of the rectangles indicate the execution status of each task.
Landing Times View
After some time of execution, as a history is generated, this graph becomes useful for monitoring the execution time, which helps to consider future optimizations.
Calendar View
A calendar view with colors marking the proportion of successes and failures. This gives a clear picture of how the DAG has behaved over time, and marks on the calendar also indicate the days scheduled for future executions.
Gantt View
A Gantt chart that helps spot bottlenecks. The wider a task's bar, the longer its execution time, and the more attention it deserves, since a long-running task may indicate a bottleneck. Overlapping bars indicate tasks running in parallel.
Code View
Allows viewing the code of the DAG. This view is read-only; any changes must be made in the DAG file itself.
