Tag view

#etl

Cross-subject tag search for related interview cards.

Tagged with etl

14 cards

ETL / Data Engineering · Easy · Theory

Batch vs streaming

Batch processes chunks on a schedule, while streaming processes events continuously with low latency.

  • Batch is simpler
  • Streaming lowers freshness delay
  • Lower latency comes at the cost of complexity
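
The contrast above can be sketched in a few lines of Python (the function names and chunk size are illustrative, not from any specific framework): batch work aggregates whole chunks on a schedule, while a streaming consumer emits a fresh result per event.

```python
from typing import Iterable, Iterator, List


def batch_process(events: List[int], chunk_size: int = 3) -> List[int]:
    """Batch: collect events, then process whole chunks on a schedule."""
    results = []
    for i in range(0, len(events), chunk_size):
        chunk = events[i:i + chunk_size]
        results.append(sum(chunk))  # one aggregate per scheduled chunk
    return results


def stream_process(events: Iterable[int]) -> Iterator[int]:
    """Streaming: update state per event and emit immediately (low latency)."""
    total = 0
    for event in events:
        total += event
        yield total  # fresh running result after every event
```

For five events, `batch_process` yields one value per chunk, while `stream_process` yields a value for every event — the freshness/complexity trade-off in miniature.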
ETL / Data Engineering · Easy · Theory

ETL vs ELT

ETL transforms before loading, while ELT loads raw data first and transforms inside the warehouse later.

  • ETL fits strict downstream schemas
  • ELT fits scalable warehouses
  • Tooling and cost model differ
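A minimal sketch of both patterns, using an in-memory SQLite database as a stand-in warehouse (table and column names are made up for the example). ETL cleans rows in application code before loading; ELT loads the raw strings and does the same cleanup later in SQL, inside the "warehouse".

```python
import sqlite3

# Messy source rows: padded names, scores as strings.
raw = [{"name": " Ada ", "score": "91"}, {"name": "Linus", "score": "88"}]


def etl(rows, conn):
    """ETL: transform in application code, then load clean typed rows."""
    conn.execute("CREATE TABLE etl_scores (name TEXT, score INTEGER)")
    clean = [(r["name"].strip(), int(r["score"])) for r in rows]
    conn.executemany("INSERT INTO etl_scores VALUES (?, ?)", clean)


def elt(rows, conn):
    """ELT: load raw data first, transform later with SQL in the warehouse."""
    conn.execute("CREATE TABLE raw_scores (name TEXT, score TEXT)")
    conn.executemany("INSERT INTO raw_scores VALUES (?, ?)",
                     [(r["name"], r["score"]) for r in rows])
    conn.execute("""CREATE TABLE elt_scores AS
                    SELECT trim(name) AS name,
                           CAST(score AS INTEGER) AS score
                    FROM raw_scores""")
```

Both end with the same clean table; the difference is where the transform runs and what the warehouse has to store and compute.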
ETL / Data Engineering · Easy · Theory

Parquet vs CSV vs JSON

CSV is simple but weakly typed, JSON is flexible but verbose, and Parquet is compressed columnar storage optimized for analytics.

  • CSV is easy to inspect
  • JSON handles nested structure
  • Parquet is best for warehouse scans
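
The typing difference is easy to see with a standard-library round trip (example data is invented). CSV hands every value back as a string; JSON preserves numbers and nesting. Parquet is omitted here because it needs a third-party library such as pyarrow, but the payoff is columnar layout and compression for analytical scans.

```python
import csv
import io
import json

rows = [{"id": 1, "price": 9.5}, {"id": 2, "price": 3.0}]

# CSV: easy to inspect, but weakly typed — everything comes back as str.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "price"])
writer.writeheader()
writer.writerows(rows)
buf.seek(0)
csv_rows = list(csv.DictReader(buf))      # values are all strings now

# JSON: more verbose on disk, but the round trip keeps int/float types.
json_rows = json.loads(json.dumps(rows))  # values keep their types
```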
ETL / Data Engineering · Medium · Theory

What are data quality checks?

Data quality checks validate assumptions like null rates, uniqueness, schema shape, and accepted value ranges.

  • Catch bad data early
  • Automate checks in pipelines
  • Alert instead of silently passing errors
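
A toy checker along these lines (column names and the accepted range are invented for the example) returns failure messages rather than raising, so a pipeline step can collect them and alert instead of silently passing bad data downstream.

```python
def run_checks(rows):
    """Toy data quality checks: null rate, uniqueness, accepted range.
    Returns a list of failure messages; an empty list means the batch passes."""
    failures = []

    # Null check: no missing amounts allowed.
    if any(r["amount"] is None for r in rows):
        failures.append("amount: null values found")

    # Uniqueness check: id must be a unique key.
    ids = [r["id"] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("id: duplicate values found")

    # Range check: amounts must fall in an accepted window.
    if any(r["amount"] is not None and not (0 <= r["amount"] <= 1000)
           for r in rows):
        failures.append("amount: value outside accepted range 0-1000")

    return failures
```

A pipeline would run this right after extract or load and page someone (or quarantine the batch) whenever the list is non-empty.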
ETL / Data Engineering · Easy · Theory

What is a data lake?

A data lake stores large volumes of raw or semi-structured data cheaply for later processing.

  • Raw and flexible storage
  • Schema can be applied later
  • Needs governance to avoid becoming messy
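
"Schema applied later" is the key idea, and a list of raw JSON lines is enough to sketch it (the field names are invented). The lake keeps records exactly as they arrived, messy types included; each reader imposes its own schema at read time.

```python
import json

# "Lake": raw JSON lines stored as-is — cheap, schemaless, a bit messy.
lake = [
    '{"user": "ada", "clicks": 3}',
    '{"user": "lin", "clicks": "7", "ref": "ad"}',  # clicks arrived as a string
]


def read_clicks(lines):
    """Schema-on-read: coerce types only when a job actually needs them."""
    for line in lines:
        record = json.loads(line)
        yield record["user"], int(record["clicks"])  # apply the schema here
```

Without that read-time discipline (and some governance over what lands in the lake), the store degrades into the proverbial data swamp.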
ETL / Data Engineering · Easy · Theory

What is a data warehouse?

A data warehouse is a system optimized for analytical queries over integrated, historical business data.

  • Read-heavy analytics store
  • Often columnar
  • Different goals from OLTP databases
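
The row-store vs column-store contrast behind the "often columnar" point can be sketched with plain Python containers (the table and column names are illustrative). An OLTP row store keeps each record together; a warehouse-style column store keeps one array per column, so an analytical aggregate touches only the column it needs.

```python
# Row store (OLTP-style): each record kept together, good for point lookups.
rows = [
    {"id": 1, "region": "EU", "revenue": 120},
    {"id": 2, "region": "US", "revenue": 300},
    {"id": 3, "region": "EU", "revenue": 80},
]

# Column store (warehouse-style): one array per column, so an analytical
# scan reads only the columns the query mentions.
columns = {
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "revenue": [120, 300, 80],
}

# Analytical query: total revenue — a single-column scan.
total = sum(columns["revenue"])
```

Same data, same answer — but the columnar layout is what lets real warehouses compress aggressively and skip unread columns entirely.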
