ETL / Data Engineering cards
A DAG defines tasks and dependencies so a workflow can run in the right order with retries and scheduling.
- Directed acyclic graph
- Tasks model units of work
- Scheduler triggers runs
Airflow DAG basics
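The ordering idea behind a DAG can be sketched without Airflow itself. This is a toy dependency graph with made-up task names, using Python's stdlib `graphlib` rather than Airflow's own API:

```python
from graphlib import TopologicalSorter  # stdlib topological sort (Python 3.9+)

# Hypothetical task graph: each task maps to the set of its upstream dependencies.
deps = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

def run_order(graph):
    """Return an execution order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())

print(run_order(deps))  # extract runs first, notify last
```

A real scheduler does the same ordering, plus retries, scheduling, and parallelism for tasks whose dependencies are already met.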
Batch processes chunks on a schedule, while streaming processes events continuously with low latency.
Batch vs streaming
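A minimal sketch of the difference, with a made-up event list: batch accumulates a chunk and processes it all at once, streaming handles each event as it arrives.

```python
events = [{"ts": i, "value": i * 10} for i in range(10)]  # toy event feed

def batch_process(rows, chunk_size=4):
    """Batch: collect a chunk on a schedule, then process it in one go."""
    totals = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        totals.append(sum(r["value"] for r in chunk))
    return totals

def stream_process(rows):
    """Streaming: react to every event immediately, emitting a running total."""
    running = 0
    for r in rows:
        running += r["value"]
        yield running

print(batch_process(events))
print(list(stream_process(events)))
```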
ETL transforms before loading, while ELT loads raw data first and transforms inside the warehouse later.
ETL vs ELT
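A small sketch of both patterns with an in-memory SQLite database standing in for the warehouse (table and column names are made up):

```python
import sqlite3

raw = [("alice", "  ADMIN "), ("bob", "viewer")]

# ETL: transform in the pipeline first, then load only the clean rows.
etl_rows = [(name, role.strip().lower()) for name, role in raw]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users_etl (name TEXT, role TEXT)")
db.executemany("INSERT INTO users_etl VALUES (?, ?)", etl_rows)

# ELT: load the raw rows as-is, then transform later with SQL in the warehouse.
db.execute("CREATE TABLE users_raw (name TEXT, role TEXT)")
db.executemany("INSERT INTO users_raw VALUES (?, ?)", raw)
db.execute("""
    CREATE TABLE users_elt AS
    SELECT name, lower(trim(role)) AS role FROM users_raw
""")

print(db.execute("SELECT role FROM users_elt ORDER BY name").fetchall())
```

Note that ELT keeps the untouched raw table around, which makes re-running the transformation cheap.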
Kafka is a distributed log used for durable event streaming and decoupled producers and consumers.
Kafka basics
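The core idea can be sketched as a toy append-only log; this is an illustration of offsets and decoupled consumers, not the real Kafka protocol or client API:

```python
class MiniLog:
    """Toy single-partition log: producers append, consumers read by offset."""

    def __init__(self):
        self.records = []

    def produce(self, value):
        self.records.append(value)      # append-only, like a Kafka partition
        return len(self.records) - 1    # offset of the new record

    def consume(self, offset):
        """Each consumer tracks its own offset and reads at its own pace."""
        return self.records[offset:]

log = MiniLog()
log.produce("order_created")
log.produce("order_paid")
print(log.consume(0))  # a slow consumer replays from the beginning
print(log.consume(1))  # a caught-up consumer sees only newer records
```

Because the log is durable and offsets belong to consumers, producers never need to know who reads the data or when.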
Orchestration decides when and in what order work runs; transformation changes the data itself.
Orchestration vs transformation
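The split can be shown in a few lines (the step functions here are made-up examples):

```python
def clean(rows):   # transformation: changes the data itself
    return [r.strip().lower() for r in rows]

def dedupe(rows):  # transformation: changes the data itself
    return sorted(set(rows))

def orchestrate(rows, steps):
    """Orchestration: decides which steps run and in what order,
    without knowing what any step does to the data."""
    for step in steps:
        rows = step(rows)
    return rows

print(orchestrate(["  A ", "b", "a"], [clean, dedupe]))
```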
CSV is simple but weakly typed, JSON is flexible but verbose, and Parquet is compressed columnar storage optimized for analytics.
Parquet vs CSV vs JSON
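The typing difference is easy to demonstrate with the stdlib alone (Parquet needs a third-party library such as pyarrow, so it is only described in a comment here):

```python
import csv
import io
import json

rows = [{"id": 1, "price": 9.99}]

# CSV round-trip: every value comes back as a string (weakly typed).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "price"])
writer.writeheader()
writer.writerows(rows)
buf.seek(0)
csv_row = next(csv.DictReader(buf))

# JSON round-trip: numeric types survive, at the cost of verbose text.
json_row = json.loads(json.dumps(rows))[0]

print(type(csv_row["id"]), type(json_row["id"]))
# Parquet (written via e.g. pyarrow) instead stores a binary columnar layout
# with an explicit schema and compression, which is why it wins for analytics.
```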
Retries help recover from transient failures, while dependencies prevent downstream work from running on incomplete upstream data.
Retries and dependencies in pipelines
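Both ideas in one sketch, with a made-up flaky extract task that succeeds on its third attempt:

```python
def with_retries(task, attempts=3):
    """Retry a task a few times to ride out transient failures."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except ConnectionError:
            if attempt == attempts:
                raise  # exhausted retries: fail loudly, don't run downstream

calls = {"n": 0}

def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network blip")
    return ["row1", "row2"]

data = with_retries(flaky_extract)

# Dependency: load runs only after extract succeeded, so downstream
# work never sees an incomplete upstream result.
def load(rows):
    assert rows, "upstream produced nothing; refusing to load"
    return len(rows)

print(load(data))
```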
Data quality checks validate assumptions like null rates, uniqueness, schema shape, and accepted value ranges.
What are data quality checks?
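Each of those checks is a one-liner over a sample batch (the rows and thresholds below are made up):

```python
rows = [
    {"id": 1, "email": "a@x.com", "age": 34},
    {"id": 2, "email": None,      "age": 29},
    {"id": 3, "email": "c@x.com", "age": 41},
]

def null_rate(rows, col):
    return sum(r[col] is None for r in rows) / len(rows)

def is_unique(rows, col):
    values = [r[col] for r in rows]
    return len(values) == len(set(values))

def in_range(rows, col, lo, hi):
    return all(lo <= r[col] <= hi for r in rows)

checks = {
    "email_null_rate_ok": null_rate(rows, "email") <= 0.5,
    "id_unique": is_unique(rows, "id"),
    "age_in_range": in_range(rows, "age", 0, 120),
}
print(checks)  # a failing check would typically block the load or alert
```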
Incremental loads process only new or changed data instead of reprocessing the full dataset every run.
What are incremental loads?
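A common way to do this is a watermark: remember the newest `updated_at` seen, and next run pick up only rows past it. A minimal sketch with made-up source rows:

```python
source = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-05"},
    {"id": 3, "updated_at": "2024-01-09"},
]

def incremental_load(rows, watermark):
    """Pick up only rows changed since the last successful run,
    then advance the watermark for the next run."""
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

batch, wm = incremental_load(source, watermark="2024-01-03")
print([r["id"] for r in batch], wm)  # only ids 2 and 3; watermark advances
```

A second run with the advanced watermark returns nothing, which is exactly the point: no reprocessing of the full dataset.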
Partitioning splits data into logical chunks so reads and writes can target less data and scale better.
What does partitioning mean in data engineering?
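A sketch of day-based partitioning with made-up events: writes route each row to its partition, and a read for one day touches only that partition instead of scanning everything.

```python
from collections import defaultdict

events = [
    {"day": "2024-01-01", "value": 10},
    {"day": "2024-01-02", "value": 20},
    {"day": "2024-01-01", "value": 30},
]

# Write side: route each row to its partition (here, keyed by day).
partitions = defaultdict(list)
for e in events:
    partitions[e["day"]].append(e)

# Read side: a query for one day only looks at that day's bucket.
def read_day(day):
    return partitions.get(day, [])

print(sum(e["value"] for e in read_day("2024-01-01")))
```

Real systems apply the same idea to files and directories (e.g. one folder per day), so query engines can skip whole partitions.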