Airflow DAG basics
A DAG defines tasks and dependencies so a workflow can run in the right order with retries and scheduling.
- Directed acyclic graph
- Tasks model units of work
- Scheduler triggers runs
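The ordering idea behind a DAG can be sketched without Airflow itself: given task dependencies, a scheduler runs tasks in a topological order. This is a stdlib sketch (the task names are made up for illustration), not real Airflow code, where dependencies are declared on operators instead.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: extract -> transform -> load -> notify.
# Keys depend on the tasks in their value set.
deps = {
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

# A valid execution order: every task runs after its upstream tasks.
order = list(TopologicalSorter(deps).static_order())
```

Because the example graph is a single chain, the order is unique; a real DAG with parallel branches has many valid orders, and the scheduler may run independent tasks concurrently.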
Quick recall
Batch processes chunks on a schedule, while streaming processes events continuously with low latency.
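The batch/streaming contrast can be shown with plain Python: a batch job waits for a chunk and processes it whole, while a streaming consumer updates state per event. Both functions here are illustrative sketches, not library APIs.

```python
events = list(range(10))

# Batch: process whole chunks on a "schedule".
def batch_sums(items, chunk_size):
    return [sum(items[i:i + chunk_size]) for i in range(0, len(items), chunk_size)]

# Streaming: handle each event as it arrives, keeping a running total.
def stream_sums(items):
    total = 0
    for x in items:
        total += x
        yield total

batch = batch_sums(events, 5)       # one result per chunk
stream = list(stream_sums(events))  # one result per event, low latency
```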
ETL transforms before loading, while ELT loads raw data first and transforms inside the warehouse later.
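A minimal ELT sketch, using sqlite3 as a stand-in "warehouse": raw rows are loaded first untouched, then the transform (casting, filtering) happens inside the database with SQL. Table and column names are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Load step: land the raw data as-is, strings and all.
con.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents TEXT)")
con.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                [(1, "1000"), (2, "250"), (3, "nope")])

# Transform step: runs inside the warehouse, after loading.
con.execute("""
    CREATE TABLE orders AS
    SELECT id, CAST(amount_cents AS INTEGER) / 100.0 AS amount
    FROM raw_orders
    WHERE amount_cents GLOB '[0-9]*'
""")
rows = con.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
```

In ETL the cast-and-filter logic would run in an external tool before the insert; in ELT the raw table stays queryable for reprocessing.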
Kafka is a distributed log used for durable event streaming and decoupled producers and consumers.
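The core Kafka idea, an append-only log where each consumer tracks its own offset, can be mimicked with a toy in-memory class. This is a conceptual sketch only, not the real Kafka client API, and it ignores partitions, brokers, and retention.

```python
# Toy append-only log: producers append, consumers read from an offset.
class Log:
    def __init__(self):
        self.records = []

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1   # offset of the new record

    def read(self, offset):
        return self.records[offset:]

log = Log()
for v in ("a", "b", "c"):
    log.append(v)

# Producers and consumers are decoupled: a slow consumer just holds a
# smaller offset and catches up later; the log retains the records.
caught_up = log.read(3)   # nothing new for this consumer
behind = log.read(1)      # this one still has records to process
```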
Orchestration decides when and in what order work runs; transformation changes the data itself.
CSV is simple but weakly typed, JSON is flexible but verbose, and Parquet is compressed columnar storage optimized for analytics.
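The weak typing of CSV versus JSON shows up directly in a stdlib round-trip: CSV hands every value back as a string, while JSON preserves numbers. Parquet is not shown because reading it needs a third-party library such as pyarrow.

```python
import csv
import io
import json

row = {"id": 1, "price": 9.99}

# CSV round-trip: all values come back as strings (weak typing).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "price"])
writer.writeheader()
writer.writerow(row)
buf.seek(0)
csv_row = next(csv.DictReader(buf))

# JSON round-trip: types survive, at the cost of repeating keys per record.
json_row = json.loads(json.dumps(row))
```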
Retries help recover transient failures, while dependencies prevent downstream work from running on incomplete upstream data.
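A hypothetical retry helper makes the transient-failure point concrete: reattempt with a short backoff, and only re-raise once every attempt has failed. The `flaky` function simulates a task that succeeds on its third call.

```python
import time

# Sketch of a retry wrapper with fixed backoff (names are illustrative).
def with_retries(fn, attempts=3, delay=0.01):
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise          # exhausted: surface the failure
            time.sleep(delay)  # back off before the next attempt

calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

result = with_retries(flaky)
```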
Data quality checks validate assumptions like null rates, uniqueness, schema shape, and accepted value ranges.
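The listed check types can be sketched over a small batch of rows; the thresholds and column names here are invented for illustration, and a real pipeline would fail or quarantine the batch when checks trip.

```python
# Hypothetical batch with one null and one out-of-range value.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 150},
]

null_rate = sum(r["age"] is None for r in rows) / len(rows)
ids_unique = len({r["id"] for r in rows}) == len(rows)
ages_in_range = all(r["age"] is None or 0 <= r["age"] <= 120 for r in rows)

failures = []
if null_rate > 0.1:          # accepted null rate: at most 10%
    failures.append("age null rate too high")
if not ids_unique:           # uniqueness check on the key column
    failures.append("duplicate ids")
if not ages_in_range:        # accepted value range check
    failures.append("age out of range")
```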
Incremental loads process only new or changed data instead of reprocessing the full dataset every run.
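One common way to implement incremental loads is a high watermark: remember the largest timestamp processed so far and only pick up newer rows on the next run. The column names below are assumptions for the sketch.

```python
source = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 200},
    {"id": 3, "updated_at": 300},
]

def incremental_load(rows, watermark):
    # Only rows strictly newer than the watermark are processed.
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

first, wm = incremental_load(source, watermark=0)    # initial backfill
second, wm = incremental_load(source, watermark=wm)  # nothing new yet
```

In production the watermark must be persisted between runs, and late-arriving data (rows updated with an old timestamp) needs its own handling.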
Partitioning splits data into logical chunks so reads and writes can target less data and scale better.
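A sketch of the partition-pruning idea: bucket rows by a partition key (here an event date, chosen for illustration) so that a date-filtered read touches one bucket instead of scanning everything.

```python
from collections import defaultdict

rows = [
    {"event_date": "2024-01-01", "value": 1},
    {"event_date": "2024-01-01", "value": 2},
    {"event_date": "2024-01-02", "value": 3},
]

# Write path: route each row to its partition.
partitions = defaultdict(list)
for r in rows:
    partitions[r["event_date"]].append(r)

# Read path: a query for one day scans only that day's partition.
jan1 = partitions["2024-01-01"]
```

Table formats apply the same idea as directory layouts (e.g. one directory per date) so the query engine can skip whole partitions.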
A data lake stores large volumes of raw or semi-structured data cheaply for later processing.
A data warehouse is a system optimized for analytical queries over integrated, historical business data.