Airflow DAG basics
A DAG defines tasks and dependencies so a workflow can run in the right order with retries and scheduling.
- Directed acyclic graph
- Tasks model units of work
- Scheduler triggers runs
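The ordering idea behind a DAG can be sketched without Airflow itself: given task dependencies, a scheduler runs tasks in a topological order. This is a stdlib sketch (the task names are made up for illustration), not real Airflow code, where dependencies are declared on operators instead.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: extract -> transform -> load -> notify.
# Keys depend on the tasks in their value set.
deps = {
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

# A valid execution order: every task runs after its upstream tasks.
order = list(TopologicalSorter(deps).static_order())
```

Because the example graph is a single chain, the order is unique; a real DAG with parallel branches has many valid orders, and the scheduler may run independent tasks concurrently.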
Quick recall
Batch processes chunks on a schedule, while streaming processes events continuously with low latency.
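The batch/streaming contrast can be shown with plain Python: a batch job waits for a chunk and processes it whole, while a streaming consumer updates state per event. Both functions here are illustrative sketches, not library APIs.

```python
events = list(range(10))

# Batch: process whole chunks on a "schedule".
def batch_sums(items, chunk_size):
    return [sum(items[i:i + chunk_size]) for i in range(0, len(items), chunk_size)]

# Streaming: handle each event as it arrives, keeping a running total.
def stream_sums(items):
    total = 0
    for x in items:
        total += x
        yield total

batch = batch_sums(events, 5)       # one result per chunk
stream = list(stream_sums(events))  # one result per event, low latency
```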
ETL transforms before loading, while ELT loads raw data first and transforms inside the warehouse later.
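A minimal ELT sketch, using sqlite3 as a stand-in "warehouse": raw rows are loaded first untouched, then the transform (casting, filtering) happens inside the database with SQL. Table and column names are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Load step: land the raw data as-is, strings and all.
con.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents TEXT)")
con.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                [(1, "1000"), (2, "250"), (3, "nope")])

# Transform step: runs inside the warehouse, after loading.
con.execute("""
    CREATE TABLE orders AS
    SELECT id, CAST(amount_cents AS INTEGER) / 100.0 AS amount
    FROM raw_orders
    WHERE amount_cents GLOB '[0-9]*'
""")
rows = con.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
```

In ETL the cast-and-filter logic would run in an external tool before the insert; in ELT the raw table stays queryable for reprocessing.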
Kafka is a distributed log used for durable event streaming and decoupled producers and consumers.
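The core Kafka idea, an append-only log where each consumer tracks its own offset, can be mimicked with a toy in-memory class. This is a conceptual sketch only, not the real Kafka client API, and it ignores partitions, brokers, and retention.

```python
# Toy append-only log: producers append, consumers read from an offset.
class Log:
    def __init__(self):
        self.records = []

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1   # offset of the new record

    def read(self, offset):
        return self.records[offset:]

log = Log()
for v in ("a", "b", "c"):
    log.append(v)

# Producers and consumers are decoupled: a slow consumer just holds a
# smaller offset and catches up later; the log retains the records.
caught_up = log.read(3)   # nothing new for this consumer
behind = log.read(1)      # this one still has records to process
```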
Orchestration decides when and in what order work runs; transformation changes the data itself.
CSV is simple but weakly typed, JSON is flexible but verbose, and Parquet is compressed columnar storage optimized for analytics.
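The weak typing of CSV versus JSON shows up directly in a stdlib round-trip: CSV hands every value back as a string, while JSON preserves numbers. Parquet is not shown because reading it needs a third-party library such as pyarrow.

```python
import csv
import io
import json

row = {"id": 1, "price": 9.99}

# CSV round-trip: all values come back as strings (weak typing).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "price"])
writer.writeheader()
writer.writerow(row)
buf.seek(0)
csv_row = next(csv.DictReader(buf))

# JSON round-trip: types survive, at the cost of repeating keys per record.
json_row = json.loads(json.dumps(row))
```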
Retries help recover transient failures, while dependencies prevent downstream work from running on incomplete upstream data.
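A hypothetical retry helper makes the transient-failure point concrete: reattempt with a short backoff, and only re-raise once every attempt has failed. The `flaky` function simulates a task that succeeds on its third call.

```python
import time

# Sketch of a retry wrapper with fixed backoff (names are illustrative).
def with_retries(fn, attempts=3, delay=0.01):
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise          # exhausted: surface the failure
            time.sleep(delay)  # back off before the next attempt

calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

result = with_retries(flaky)
```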
Data quality checks validate assumptions like null rates, uniqueness, schema shape, and accepted value ranges.
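The listed check types can be sketched over a small batch of rows; the thresholds and column names here are invented for illustration, and a real pipeline would fail or quarantine the batch when checks trip.

```python
# Hypothetical batch with one null and one out-of-range value.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 150},
]

null_rate = sum(r["age"] is None for r in rows) / len(rows)
ids_unique = len({r["id"] for r in rows}) == len(rows)
ages_in_range = all(r["age"] is None or 0 <= r["age"] <= 120 for r in rows)

failures = []
if null_rate > 0.1:          # accepted null rate: at most 10%
    failures.append("age null rate too high")
if not ids_unique:           # uniqueness check on the key column
    failures.append("duplicate ids")
if not ages_in_range:        # accepted value range check
    failures.append("age out of range")
```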
Incremental loads process only new or changed data instead of reprocessing the full dataset every run.
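One common way to implement incremental loads is a high watermark: remember the largest timestamp processed so far and only pick up newer rows on the next run. The column names below are assumptions for the sketch.

```python
source = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 200},
    {"id": 3, "updated_at": 300},
]

def incremental_load(rows, watermark):
    # Only rows strictly newer than the watermark are processed.
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

first, wm = incremental_load(source, watermark=0)    # initial backfill
second, wm = incremental_load(source, watermark=wm)  # nothing new yet
```

In production the watermark must be persisted between runs, and late-arriving data (rows updated with an old timestamp) needs its own handling.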
Partitioning splits data into logical chunks so reads and writes can target less data and scale better.
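A sketch of the partition-pruning idea: bucket rows by a partition key (here an event date, chosen for illustration) so that a date-filtered read touches one bucket instead of scanning everything.

```python
from collections import defaultdict

rows = [
    {"event_date": "2024-01-01", "value": 1},
    {"event_date": "2024-01-01", "value": 2},
    {"event_date": "2024-01-02", "value": 3},
]

# Write path: route each row to its partition.
partitions = defaultdict(list)
for r in rows:
    partitions[r["event_date"]].append(r)

# Read path: a query for one day scans only that day's partition.
jan1 = partitions["2024-01-01"]
```

Table formats apply the same idea as directory layouts (e.g. one directory per date) so the query engine can skip whole partitions.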
A data lake stores large volumes of raw or semi-structured data cheaply for later processing.
A data warehouse is a system optimized for analytical queries over integrated, historical business data.