Batch processes chunks on a schedule, while streaming processes events continuously with low latency.
- Batch is simpler to build and operate
- Streaming reduces freshness delay
- The trade-off is latency versus complexity
Batch vs streaming
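A minimal sketch of the contrast, using a running sum over made-up event data: the batch function processes whole chunks per "scheduled" run, while the streaming generator emits a fresh result after every event.

```python
def process_batch(events, chunk_size=3):
    """Batch: accumulate events, then process whole chunks on a schedule."""
    results = []
    for i in range(0, len(events), chunk_size):
        chunk = events[i:i + chunk_size]
        results.append(sum(chunk))  # one aggregate per scheduled run
    return results

def process_stream(events):
    """Streaming: handle each event as it arrives, updating continuously."""
    running_total = 0
    for event in events:
        running_total += event
        yield running_total  # fresh, low-latency result after every event

events = [1, 2, 3, 4, 5, 6]
batch_results = process_batch(events)          # [6, 15]
stream_results = list(process_stream(events))  # [1, 3, 6, 10, 15, 21]
```

The batch version is simpler, but its results are only as fresh as the last scheduled run; the streaming version stays current at the cost of managing state per event.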
ETL transforms before loading, while ELT loads raw data first and transforms inside the warehouse later.
ETL vs ELT
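A hypothetical ELT sketch: raw, untyped rows are loaded first, and the transformation happens later inside the "warehouse" (an in-memory SQLite database standing in for a real one; the table and column names are made up).

```python
import sqlite3

# Raw source rows: note the untrimmed, string-typed amounts.
raw_rows = [("2024-01-01", " 10 "), ("2024-01-01", "5"), ("2024-01-02", "7")]

conn = sqlite3.connect(":memory:")

# L: load the raw data as-is, before any transformation.
conn.execute("CREATE TABLE raw_sales (day TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw_rows)

# T: transform later, in SQL, inside the warehouse.
conn.execute("""
    CREATE TABLE daily_sales AS
    SELECT day, SUM(CAST(TRIM(amount) AS INTEGER)) AS total
    FROM raw_sales
    GROUP BY day
""")
daily = conn.execute("SELECT day, total FROM daily_sales ORDER BY day").fetchall()
print(daily)  # [('2024-01-01', 15), ('2024-01-02', 7)]
```

In ETL the trimming and casting would happen before the insert; in ELT the raw table is kept, so the transformation can be rerun or revised without re-extracting.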
Orchestration decides when and in what order work runs; transformation changes the data itself.
Orchestration vs transformation
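A small sketch of the split, with hypothetical task names: the DAG and topological sort are the orchestration (when and in what order things run), while the task functions are the transformations (what actually changes the data).

```python
from graphlib import TopologicalSorter

# Orchestration layer: task dependencies, not data logic.
dag = {"clean": {"extract"}, "aggregate": {"clean"}, "report": {"aggregate"}}

# Transformation layer: each function changes the data itself.
def extract(data): return data
def clean(data): return [x for x in data if x is not None]
def aggregate(data): return [sum(data)]
def report(data): return data

tasks = {"extract": extract, "clean": clean,
         "aggregate": aggregate, "report": report}

data = [1, None, 2, 3]
for name in TopologicalSorter(dag).static_order():  # orchestration decides order
    data = tasks[name](data)                        # transformation changes data
print(data)  # [6]
```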
CSV is simple but weakly typed, JSON is flexible but verbose, and Parquet is compressed columnar storage optimized for analytics.
Parquet vs CSV vs JSON
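Round-tripping the same record through CSV and JSON with the standard library shows the typing difference directly. (Parquet is not shown here because reading and writing it requires an external library such as pyarrow.)

```python
import csv
import io
import json

record = {"id": 1, "price": 9.5, "in_stock": True}

# CSV: every value comes back as a string (weakly typed).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=record)
writer.writeheader()
writer.writerow(record)
buf.seek(0)
csv_row = next(csv.DictReader(buf))
print(csv_row)   # {'id': '1', 'price': '9.5', 'in_stock': 'True'}

# JSON: types survive the round trip, but every record repeats its keys (verbose).
json_row = json.loads(json.dumps(record))
print(json_row)  # {'id': 1, 'price': 9.5, 'in_stock': True}
```

Parquet avoids both problems for analytics: it stores a typed schema once and lays values out by column, which compresses well and lets queries read only the columns they need.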
Retries help recover transient failures, while dependencies prevent downstream work from running on incomplete upstream data.
Retries and dependencies in pipelines
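A minimal sketch of both ideas, with a made-up flaky extract task: retries absorb a transient failure, and the downstream load runs only after the upstream extract has succeeded.

```python
import time

def run_with_retries(task, attempts=3, delay=0.0):
    """Retry a task a fixed number of times to recover transient failures."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise            # transient failure turned out to be permanent
            time.sleep(delay)    # back off before the next attempt

calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return [1, 2, 3]

rows = run_with_retries(flaky_extract)  # succeeds on the third attempt

def load(rows):
    return f"loaded {len(rows)} rows"

# Dependency: load only runs once extract has produced complete output.
print(load(rows))  # loaded 3 rows
```

A real orchestrator expresses the dependency declaratively rather than by call order, but the guarantee is the same: downstream work never sees incomplete upstream data.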
Data quality checks validate assumptions like null rates, uniqueness, schema shape, and accepted value ranges.
What are data quality checks?
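A sketch of such checks over a toy dataset; the column names and thresholds are made up for illustration.

```python
rows = [
    {"id": 1, "country": "US", "age": 34},
    {"id": 2, "country": "DE", "age": None},
    {"id": 3, "country": "FR", "age": 29},
]

def check_quality(rows):
    """Return a list of failed checks (empty list means all passed)."""
    failures = []
    ids = [r["id"] for r in rows]
    if len(ids) != len(set(ids)):                       # uniqueness
        failures.append("id values are not unique")
    null_rate = sum(r["age"] is None for r in rows) / len(rows)
    if null_rate > 0.5:                                 # null rate threshold
        failures.append(f"age null rate too high: {null_rate:.0%}")
    if not all(set(r) == {"id", "country", "age"} for r in rows):
        failures.append("unexpected schema shape")      # schema shape
    if not all(r["age"] is None or 0 <= r["age"] <= 120 for r in rows):
        failures.append("age outside accepted range")   # accepted values
    return failures

print(check_quality(rows))  # [] means all checks passed
```

In practice these assertions usually run as a pipeline step, failing the run (or alerting) before bad data reaches consumers.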
Incremental loads process only new or changed data instead of reprocessing the full dataset every run.
What are incremental loads?
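A common way to implement this is a high-water mark: remember the newest `updated_at` seen so far, and on each run process only rows beyond it. The schema here is hypothetical.

```python
source = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-02"},
    {"id": 3, "updated_at": "2024-01-03"},
]

def incremental_load(source, watermark):
    """Process only rows newer than the last run's high-water mark."""
    new_rows = [r for r in source if r["updated_at"] > watermark]
    # Advance the watermark so the next run skips what we just processed.
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

# Last run finished at 2024-01-01, so only rows 2 and 3 are processed now.
new_rows, watermark = incremental_load(source, "2024-01-01")
print(len(new_rows), watermark)  # 2 2024-01-03
```

(ISO-8601 date strings compare correctly as plain strings, which is why no date parsing is needed here.)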
Partitioning splits data into logical chunks so reads and writes can target less data and scale better.
What does partitioning mean in data engineering?
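A sketch of the idea using an in-memory dict; in a real system each key would map to a directory or file chunk (e.g. a `day=2024-01-01/` path), so a query for one day scans only that partition.

```python
from collections import defaultdict

events = [
    {"day": "2024-01-01", "value": 1},
    {"day": "2024-01-02", "value": 2},
    {"day": "2024-01-01", "value": 3},
]

def partition_by(events, key):
    """Split records into logical chunks keyed by a partition column."""
    partitions = defaultdict(list)
    for event in events:
        partitions[event[key]].append(event)
    return dict(partitions)

parts = partition_by(events, "day")
# A read targeting one day touches only that chunk, not the full dataset.
print(sorted(parts))             # ['2024-01-01', '2024-01-02']
print(len(parts["2024-01-01"]))  # 2
```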
A data lake stores large volumes of raw or semi-structured data cheaply for later processing.
What is a data lake?
A data warehouse is a system optimized for analytical queries over integrated, historical business data.
What is a data warehouse?