ETL / Data Engineering

Pipelines, orchestration, and data platform topics.


ETL / Data Engineering cards

16 cards

ETL / Data Engineering Easy Theory

Airflow DAG basics

A DAG defines tasks and dependencies so a workflow can run in the right order with retries and scheduling.

  • Directed acyclic graph
  • Tasks model units of work
  • Scheduler triggers runs
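The ordering idea behind a DAG can be sketched in pure Python (this is not Airflow code; the task names are hypothetical, and the standard-library `graphlib` stands in for Airflow's scheduler):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: one extract feeds two transforms, which feed a load.
# Keys are tasks; values are the tasks they depend on.
deps = {
    "transform_orders": {"extract"},
    "transform_users": {"extract"},
    "load": {"transform_orders", "transform_users"},
}

# A valid run order respects every dependency edge.
order = list(TopologicalSorter(deps).static_order())
print(order)  # "extract" comes first, "load" comes last
```

Airflow does the same dependency resolution at scale, adding scheduling, retries, and per-task state on top.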


ETL / Data Engineering Easy Theory

Batch vs streaming

Batch processing handles data in chunks on a schedule, while streaming processes events continuously with low latency.

  • Batch is simpler to build and operate
  • Streaming keeps data fresher
  • Lower latency comes at the cost of complexity
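A minimal sketch of the difference, using a running sum as a stand-in workload (the event values are made up):

```python
events = [3, 1, 4, 1, 5, 9, 2, 6]

# Batch: wait for the whole chunk, then process it in one pass.
def batch_sum(all_events):
    return sum(all_events)

# Streaming: keep running state and emit a result after every event.
def streaming_sums(event_iter):
    total = 0
    for e in event_iter:
        total += e
        yield total  # an up-to-date answer is available immediately

batch_result = batch_sum(events)
stream_results = list(streaming_sums(iter(events)))

print(batch_result)        # 31
print(stream_results[-1])  # 31 -- same final answer, produced incrementally
```

Both approaches reach the same total; streaming pays for its freshness with state that must be managed across events.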


ETL / Data Engineering Easy Theory

ETL vs ELT

ETL transforms before loading, while ELT loads raw data first and transforms inside the warehouse later.

  • ETL fits strict downstream schemas
  • ELT fits scalable warehouses
  • Tooling and cost model differ
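A small sketch of both patterns, using in-memory SQLite as a stand-in warehouse (table and column names are hypothetical):

```python
import sqlite3

raw_rows = [("alice", "  NY "), ("bob", "ca")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE etl_users (name TEXT, state TEXT)")
conn.execute("CREATE TABLE raw_users (name TEXT, state TEXT)")

# ETL: transform in application code BEFORE loading.
for name, state in raw_rows:
    conn.execute("INSERT INTO etl_users VALUES (?, ?)",
                 (name, state.strip().upper()))

# ELT: load the raw data first, transform later with SQL in the warehouse.
conn.executemany("INSERT INTO raw_users VALUES (?, ?)", raw_rows)
conn.execute("""
    CREATE TABLE elt_users AS
    SELECT name, UPPER(TRIM(state)) AS state FROM raw_users
""")

etl = conn.execute("SELECT state FROM etl_users ORDER BY name").fetchall()
elt = conn.execute("SELECT state FROM elt_users ORDER BY name").fetchall()
print(etl == elt)  # True: same result, transformation in different places
```

ELT's advantage shows up when the warehouse is far better at bulk transforms than the loading application, and when keeping the raw table allows re-running transforms later.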


ETL / Data Engineering Medium Theory

Kafka basics

Kafka is a distributed log used for durable event streaming and for decoupling producers from consumers.

  • Topics are split into ordered partitions
  • Consumers track offsets
  • Great for event-driven pipelines
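The core mechanics can be sketched with a toy in-memory log (this is an illustration of the offset model, not the real Kafka client API, and it models a single partition only):

```python
from collections import defaultdict

class ToyTopic:
    """In-memory stand-in for one Kafka partition: an append-only ordered log."""
    def __init__(self):
        self.log = []                     # offset -> message
        self.offsets = defaultdict(int)   # consumer group -> next offset to read

    def produce(self, message):
        self.log.append(message)          # producers only ever append

    def consume(self, group):
        start = self.offsets[group]
        batch = self.log[start:]          # read past our committed offset
        self.offsets[group] = len(self.log)  # commit the new offset
        return batch

topic = ToyTopic()
topic.produce({"order_id": 1})
topic.produce({"order_id": 2})

first = topic.consume("billing")   # billing group reads both messages
topic.produce({"order_id": 3})
second = topic.consume("billing")  # only what arrived since the last offset
fresh = topic.consume("audit")     # a new group replays the whole log
```

Because each group tracks its own offset against the same durable log, consumers are independent: one group falling behind or replaying history does not affect another.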


ETL / Data Engineering Easy Theory

Parquet vs CSV vs JSON

CSV is simple but weakly typed, JSON is flexible but verbose, and Parquet is compressed columnar storage optimized for analytics.

  • CSV is easy to inspect
  • JSON handles nested structure
  • Parquet is best for warehouse scans
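The typing difference is easy to show with the standard library alone (Parquet itself needs an extra library such as pyarrow, so it is only noted in a comment):

```python
import csv, io, json

rows = [{"user_id": 7, "tags": ["new", "vip"]}]

# JSON keeps types and nesting through a round trip.
decoded = json.loads(json.dumps(rows))
print(type(decoded[0]["user_id"]))  # int survives

# CSV flattens everything to strings; nested values need manual encoding.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user_id", "tags"])
writer.writeheader()
writer.writerow({"user_id": 7, "tags": json.dumps(["new", "vip"])})
buf.seek(0)
row = next(csv.DictReader(buf))
print(type(row["user_id"]))  # str -- weak typing in action
# Parquet (via pyarrow, not shown) would keep a typed, compressed,
# columnar layout, so scans touch only the columns a query needs.
```

This is why CSV ingestion usually needs an explicit schema-casting step, while JSON and Parquet carry more type information with the data itself.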


ETL / Data Engineering Medium Theory

What are data quality checks?

Data quality checks validate assumptions like null rates, uniqueness, schema shape, and accepted value ranges.

  • Catch bad data early
  • Automate checks in pipelines
  • Alert instead of silently passing bad data downstream
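A minimal sketch of such checks on a batch of records (the field names, thresholds, and sample rows are all hypothetical; tools like Great Expectations or dbt tests package the same idea):

```python
def check_batch(rows, max_null_rate=0.1, id_field="user_id", age_range=(0, 130)):
    """Hypothetical checks: null rate, uniqueness, accepted value range."""
    failures = []

    # Null-rate check on a required field.
    nulls = sum(1 for r in rows if r.get("email") is None)
    if nulls / len(rows) > max_null_rate:
        failures.append(f"email null rate {nulls / len(rows):.0%} too high")

    # Uniqueness check on the primary key.
    ids = [r[id_field] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append(f"duplicate {id_field} values")

    # Accepted-range check.
    if any(not (age_range[0] <= r["age"] <= age_range[1]) for r in rows):
        failures.append("age outside accepted range")

    return failures  # empty means the batch passed; non-empty should alert

good = [{"user_id": 1, "email": "a@x.com", "age": 30},
        {"user_id": 2, "email": "b@x.com", "age": 41}]
bad = good + [{"user_id": 2, "email": None, "age": 200}]

print(check_batch(good))  # []
print(check_batch(bad))   # three failures: nulls, duplicate id, bad range
```

Wiring a function like this between pipeline stages, and failing or alerting on a non-empty result, is what keeps bad data from propagating silently.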
