Airflow provides BaseOperator as the base class for all operators. To create a custom operator, you extend BaseOperator and implement its abstract methods; in particular, the execute method must be defined for the new operator class.
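As a minimal sketch, a custom operator could look like the following (the HelloOperator class and its name argument are made up for illustration and are not part of Airflow):

```python
from airflow.models.baseoperator import BaseOperator


class HelloOperator(BaseOperator):
    """Toy operator that logs a greeting when its task runs."""

    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        # execute() is the method every operator must implement;
        # Airflow calls it when the task instance actually runs.
        message = f"Hello, {self.name}!"
        self.log.info(message)
        return message  # the return value is pushed to XCom by default
```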
Like most workflow tools, Airflow needs database storage to persist its state and results, so it maintains a metadata database. SQLite is used by default, and the database can be switched to MySQL or Postgres for better performance and concurrency.
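Switching to Postgres, for example, is essentially a connection-string change in airflow.cfg. The sketch below uses placeholder credentials; note that the option lives in the [database] section in recent Airflow releases and in [core] in older ones:

```
[database]
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow
```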
Generally, big data ETL jobs involve data migration: pulling data from MySQL or another relational database, performing some transformations on it, and then moving the data into Hadoop tables such as Hive.
Airflow provides Sqoop, Spark, and Hive operators, so it can invoke each of these big data tasks, and it can also sequence and monitor the jobs. In this way, Airflow is useful for big data ETL pipelines.
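A rough sketch of such a pipeline is shown below. The import paths assume the apache-airflow-providers-apache-sqoop, -spark, and -hive provider packages are installed, and the connection IDs, table names, and file paths are placeholders; exact parameter names can vary between Airflow versions:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.sqoop.operators.sqoop import SqoopOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.apache.hive.operators.hive import HiveOperator

with DAG(
    dag_id="mysql_to_hive_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Pull the source table from MySQL into HDFS.
    extract = SqoopOperator(
        task_id="extract_orders",
        conn_id="sqoop_default",
        cmd_type="import",
        table="orders",
    )

    # Transform the raw files with a Spark job.
    transform = SparkSubmitOperator(
        task_id="transform_orders",
        application="/jobs/clean_orders.py",
        conn_id="spark_default",
    )

    # Load the cleaned data into a Hive table.
    load = HiveOperator(
        task_id="load_orders",
        hql="LOAD DATA INPATH '/staging/orders' INTO TABLE dwh.orders",
        hive_cli_conn_id="hive_cli_default",
    )

    # Airflow sequences and monitors the whole pipeline.
    extract >> transform >> load
```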
Airflow is a workflow management tool. A workflow consists of various tasks whose processing happens in other systems, such as MySQL, S3, Hive, or shell scripts. Airflow does not try to be a single platform that hosts all processing engines; instead, it provides a way to integrate and connect them. Operators are the generic mechanism for connecting to those processing engines and carrying out the tasks.
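Because every operator exposes the same interface, very different engines are wired into a DAG in the same way. The short sketch below mixes a shell step and a Python step; the task names and command are illustrative only:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

with DAG(dag_id="mixed_engines", start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    # Each operator, whatever engine it talks to, becomes just another task node.
    run_script = BashOperator(task_id="run_script", bash_command="./export_report.sh")
    summarize = PythonOperator(task_id="summarize", python_callable=lambda: print("done"))

    run_script >> summarize
```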
To set task dependencies in a DAG, all of the tasks must belong to the same DAG. Relations can then be declared with the set_upstream() and set_downstream() methods. Airflow also overloads the bitshift operators >> and << for the same purpose; they are concise and make the task relations easy to read.
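Both styles are shown in the small sketch below (the task names are placeholders, and the EmptyOperator import path assumes Airflow 2.3 or later; older versions use DummyOperator instead):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="dependency_demo", start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    extract = EmptyOperator(task_id="extract")
    transform = EmptyOperator(task_id="transform")
    load = EmptyOperator(task_id="load")

    # Method style: extract must run before transform.
    transform.set_upstream(extract)

    # Equivalent bitshift style: transform must run before load.
    transform >> load
```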
A DAG, or Directed Acyclic Graph, is a data structure that captures task dependencies. Airflow uses a DAG to maintain the relations between tasks and ensure that they execute in the expected order.