ETL / Data Engineering cards
A DAG defines tasks and dependencies so a workflow can run in the right order with retries and scheduling.
- Directed acyclic graph
- Tasks model units of work
- Scheduler triggers runs
Airflow DAG basics
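The ordering idea behind a DAG can be sketched without Airflow itself. This is a toy dependency graph with made-up task names, using Python's stdlib `graphlib` rather than Airflow's own API:

```python
from graphlib import TopologicalSorter  # stdlib topological sort (Python 3.9+)

# Hypothetical task graph: each task maps to the set of its upstream dependencies.
deps = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

def run_order(graph):
    """Return an execution order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())

print(run_order(deps))  # extract runs first, notify last
```

A real scheduler does the same ordering, plus retries, scheduling, and parallelism for tasks whose dependencies are already met.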
Batch processes chunks on a schedule, while streaming processes events continuously with low latency.
Batch vs streaming
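A minimal sketch of the difference, with a made-up event list: batch accumulates a chunk and processes it all at once, streaming handles each event as it arrives.

```python
events = [{"ts": i, "value": i * 10} for i in range(10)]  # toy event feed

def batch_process(rows, chunk_size=4):
    """Batch: collect a chunk on a schedule, then process it in one go."""
    totals = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        totals.append(sum(r["value"] for r in chunk))
    return totals

def stream_process(rows):
    """Streaming: react to every event immediately, emitting a running total."""
    running = 0
    for r in rows:
        running += r["value"]
        yield running

print(batch_process(events))
print(list(stream_process(events)))
```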
ETL transforms before loading, while ELT loads raw data first and transforms inside the warehouse later.
ETL vs ELT
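A small sketch of both patterns with an in-memory SQLite database standing in for the warehouse (table and column names are made up):

```python
import sqlite3

raw = [("alice", "  ADMIN "), ("bob", "viewer")]

# ETL: transform in the pipeline first, then load only the clean rows.
etl_rows = [(name, role.strip().lower()) for name, role in raw]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users_etl (name TEXT, role TEXT)")
db.executemany("INSERT INTO users_etl VALUES (?, ?)", etl_rows)

# ELT: load the raw rows as-is, then transform later with SQL in the warehouse.
db.execute("CREATE TABLE users_raw (name TEXT, role TEXT)")
db.executemany("INSERT INTO users_raw VALUES (?, ?)", raw)
db.execute("""
    CREATE TABLE users_elt AS
    SELECT name, lower(trim(role)) AS role FROM users_raw
""")

print(db.execute("SELECT role FROM users_elt ORDER BY name").fetchall())
```

Note that ELT keeps the untouched raw table around, which makes re-running the transformation cheap.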
Kafka is a distributed log used for durable event streaming and decoupled producers and consumers.
Kafka basics
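The core idea can be sketched as a toy append-only log; this is an illustration of offsets and decoupled consumers, not the real Kafka protocol or client API:

```python
class MiniLog:
    """Toy single-partition log: producers append, consumers read by offset."""

    def __init__(self):
        self.records = []

    def produce(self, value):
        self.records.append(value)      # append-only, like a Kafka partition
        return len(self.records) - 1    # offset of the new record

    def consume(self, offset):
        """Each consumer tracks its own offset and reads at its own pace."""
        return self.records[offset:]

log = MiniLog()
log.produce("order_created")
log.produce("order_paid")
print(log.consume(0))  # a slow consumer replays from the beginning
print(log.consume(1))  # a caught-up consumer sees only newer records
```

Because the log is durable and offsets belong to consumers, producers never need to know who reads the data or when.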
Orchestration decides when and in what order work runs; transformation changes the data itself.
Orchestration vs transformation
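The split can be shown in a few lines (the step functions here are made-up examples):

```python
def clean(rows):   # transformation: changes the data itself
    return [r.strip().lower() for r in rows]

def dedupe(rows):  # transformation: changes the data itself
    return sorted(set(rows))

def orchestrate(rows, steps):
    """Orchestration: decides which steps run and in what order,
    without knowing what any step does to the data."""
    for step in steps:
        rows = step(rows)
    return rows

print(orchestrate(["  A ", "b", "a"], [clean, dedupe]))
```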
CSV is simple but weakly typed, JSON is flexible but verbose, and Parquet is compressed columnar storage optimized for analytics.
Parquet vs CSV vs JSON
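The typing difference is easy to demonstrate with the stdlib alone (Parquet needs a third-party library such as pyarrow, so it is only described in a comment here):

```python
import csv
import io
import json

rows = [{"id": 1, "price": 9.99}]

# CSV round-trip: every value comes back as a string (weakly typed).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "price"])
writer.writeheader()
writer.writerows(rows)
buf.seek(0)
csv_row = next(csv.DictReader(buf))

# JSON round-trip: numeric types survive, at the cost of verbose text.
json_row = json.loads(json.dumps(rows))[0]

print(type(csv_row["id"]), type(json_row["id"]))
# Parquet (written via e.g. pyarrow) instead stores a binary columnar layout
# with an explicit schema and compression, which is why it wins for analytics.
```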
Retries help recover from transient failures, while dependencies prevent downstream work from running on incomplete upstream data.
Retries and dependencies in pipelines
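Both ideas in one sketch, with a made-up flaky extract task that succeeds on its third attempt:

```python
def with_retries(task, attempts=3):
    """Retry a task a few times to ride out transient failures."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except ConnectionError:
            if attempt == attempts:
                raise  # exhausted retries: fail loudly, don't run downstream

calls = {"n": 0}

def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network blip")
    return ["row1", "row2"]

data = with_retries(flaky_extract)

# Dependency: load runs only after extract succeeded, so downstream
# work never sees an incomplete upstream result.
def load(rows):
    assert rows, "upstream produced nothing; refusing to load"
    return len(rows)

print(load(data))
```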
Data quality checks validate assumptions like null rates, uniqueness, schema shape, and accepted value ranges.
What are data quality checks?
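Each of those checks is a one-liner over a sample batch (the rows and thresholds below are made up):

```python
rows = [
    {"id": 1, "email": "a@x.com", "age": 34},
    {"id": 2, "email": None,      "age": 29},
    {"id": 3, "email": "c@x.com", "age": 41},
]

def null_rate(rows, col):
    return sum(r[col] is None for r in rows) / len(rows)

def is_unique(rows, col):
    values = [r[col] for r in rows]
    return len(values) == len(set(values))

def in_range(rows, col, lo, hi):
    return all(lo <= r[col] <= hi for r in rows)

checks = {
    "email_null_rate_ok": null_rate(rows, "email") <= 0.5,
    "id_unique": is_unique(rows, "id"),
    "age_in_range": in_range(rows, "age", 0, 120),
}
print(checks)  # a failing check would typically block the load or alert
```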
Incremental loads process only new or changed data instead of reprocessing the full dataset every run.
What are incremental loads?
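A common way to do this is a watermark: remember the newest `updated_at` seen, and next run pick up only rows past it. A minimal sketch with made-up source rows:

```python
source = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-05"},
    {"id": 3, "updated_at": "2024-01-09"},
]

def incremental_load(rows, watermark):
    """Pick up only rows changed since the last successful run,
    then advance the watermark for the next run."""
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

batch, wm = incremental_load(source, watermark="2024-01-03")
print([r["id"] for r in batch], wm)  # only ids 2 and 3; watermark advances
```

A second run with the advanced watermark returns nothing, which is exactly the point: no reprocessing of the full dataset.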
Partitioning splits data into logical chunks so reads and writes can target less data and scale better.
What does partitioning mean in data engineering?
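A sketch of day-based partitioning with made-up events: writes route each row to its partition, and a read for one day touches only that partition instead of scanning everything.

```python
from collections import defaultdict

events = [
    {"day": "2024-01-01", "value": 10},
    {"day": "2024-01-02", "value": 20},
    {"day": "2024-01-01", "value": 30},
]

# Write side: route each row to its partition (here, keyed by day).
partitions = defaultdict(list)
for e in events:
    partitions[e["day"]].append(e)

# Read side: a query for one day only looks at that day's bucket.
def read_day(day):
    return partitions.get(day, [])

print(sum(e["value"] for e in read_day("2024-01-01")))
```

Real systems apply the same idea to files and directories (e.g. one folder per day), so query engines can skip whole partitions.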