Batch processes chunks on a schedule, while streaming processes events continuously with low latency.
- Batch is simpler to build and operate
- Streaming reduces freshness delay
- The trade-off is latency versus complexity
Batch vs streaming
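A minimal sketch of the contrast, using a running sum over made-up event data: the batch function processes whole chunks per "scheduled" run, while the streaming generator emits a fresh result after every event.

```python
def process_batch(events, chunk_size=3):
    """Batch: accumulate events, then process whole chunks on a schedule."""
    results = []
    for i in range(0, len(events), chunk_size):
        chunk = events[i:i + chunk_size]
        results.append(sum(chunk))  # one aggregate per scheduled run
    return results

def process_stream(events):
    """Streaming: handle each event as it arrives, updating continuously."""
    running_total = 0
    for event in events:
        running_total += event
        yield running_total  # fresh, low-latency result after every event

events = [1, 2, 3, 4, 5, 6]
batch_results = process_batch(events)          # [6, 15]
stream_results = list(process_stream(events))  # [1, 3, 6, 10, 15, 21]
```

The batch version is simpler, but its results are only as fresh as the last scheduled run; the streaming version stays current at the cost of managing state per event.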
ETL transforms before loading, while ELT loads raw data first and transforms inside the warehouse later.
ETL vs ELT
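A hypothetical ELT sketch: raw, untyped rows are loaded first, and the transformation happens later inside the "warehouse" (an in-memory SQLite database standing in for a real one; the table and column names are made up).

```python
import sqlite3

# Raw source rows: note the untrimmed, string-typed amounts.
raw_rows = [("2024-01-01", " 10 "), ("2024-01-01", "5"), ("2024-01-02", "7")]

conn = sqlite3.connect(":memory:")

# L: load the raw data as-is, before any transformation.
conn.execute("CREATE TABLE raw_sales (day TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw_rows)

# T: transform later, in SQL, inside the warehouse.
conn.execute("""
    CREATE TABLE daily_sales AS
    SELECT day, SUM(CAST(TRIM(amount) AS INTEGER)) AS total
    FROM raw_sales
    GROUP BY day
""")
daily = conn.execute("SELECT day, total FROM daily_sales ORDER BY day").fetchall()
print(daily)  # [('2024-01-01', 15), ('2024-01-02', 7)]
```

In ETL the trimming and casting would happen before the insert; in ELT the raw table is kept, so the transformation can be rerun or revised without re-extracting.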
Orchestration decides when and in what order work runs; transformation changes the data itself.
Orchestration vs transformation
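A small sketch of the split, with hypothetical task names: the DAG and topological sort are the orchestration (when and in what order things run), while the task functions are the transformations (what actually changes the data).

```python
from graphlib import TopologicalSorter

# Orchestration layer: task dependencies, not data logic.
dag = {"clean": {"extract"}, "aggregate": {"clean"}, "report": {"aggregate"}}

# Transformation layer: each function changes the data itself.
def extract(data): return data
def clean(data): return [x for x in data if x is not None]
def aggregate(data): return [sum(data)]
def report(data): return data

tasks = {"extract": extract, "clean": clean,
         "aggregate": aggregate, "report": report}

data = [1, None, 2, 3]
for name in TopologicalSorter(dag).static_order():  # orchestration decides order
    data = tasks[name](data)                        # transformation changes data
print(data)  # [6]
```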
CSV is simple but weakly typed, JSON is flexible but verbose, and Parquet is compressed columnar storage optimized for analytics.
Parquet vs CSV vs JSON
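Round-tripping the same record through CSV and JSON with the standard library shows the typing difference directly. (Parquet is not shown here because reading and writing it requires an external library such as pyarrow.)

```python
import csv
import io
import json

record = {"id": 1, "price": 9.5, "in_stock": True}

# CSV: every value comes back as a string (weakly typed).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=record)
writer.writeheader()
writer.writerow(record)
buf.seek(0)
csv_row = next(csv.DictReader(buf))
print(csv_row)   # {'id': '1', 'price': '9.5', 'in_stock': 'True'}

# JSON: types survive the round trip, but every record repeats its keys (verbose).
json_row = json.loads(json.dumps(record))
print(json_row)  # {'id': 1, 'price': 9.5, 'in_stock': True}
```

Parquet avoids both problems for analytics: it stores a typed schema once and lays values out by column, which compresses well and lets queries read only the columns they need.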
Retries help recover transient failures, while dependencies prevent downstream work from running on incomplete upstream data.
Retries and dependencies in pipelines
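A minimal sketch of both ideas, with a made-up flaky extract task: retries absorb a transient failure, and the downstream load runs only after the upstream extract has succeeded.

```python
import time

def run_with_retries(task, attempts=3, delay=0.0):
    """Retry a task a fixed number of times to recover transient failures."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise            # transient failure turned out to be permanent
            time.sleep(delay)    # back off before the next attempt

calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return [1, 2, 3]

rows = run_with_retries(flaky_extract)  # succeeds on the third attempt

def load(rows):
    return f"loaded {len(rows)} rows"

# Dependency: load only runs once extract has produced complete output.
print(load(rows))  # loaded 3 rows
```

A real orchestrator expresses the dependency declaratively rather than by call order, but the guarantee is the same: downstream work never sees incomplete upstream data.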
Data quality checks validate assumptions like null rates, uniqueness, schema shape, and accepted value ranges.
What are data quality checks?
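A sketch of such checks over a toy dataset; the column names and thresholds are made up for illustration.

```python
rows = [
    {"id": 1, "country": "US", "age": 34},
    {"id": 2, "country": "DE", "age": None},
    {"id": 3, "country": "FR", "age": 29},
]

def check_quality(rows):
    """Return a list of failed checks (empty list means all passed)."""
    failures = []
    ids = [r["id"] for r in rows]
    if len(ids) != len(set(ids)):                       # uniqueness
        failures.append("id values are not unique")
    null_rate = sum(r["age"] is None for r in rows) / len(rows)
    if null_rate > 0.5:                                 # null rate threshold
        failures.append(f"age null rate too high: {null_rate:.0%}")
    if not all(set(r) == {"id", "country", "age"} for r in rows):
        failures.append("unexpected schema shape")      # schema shape
    if not all(r["age"] is None or 0 <= r["age"] <= 120 for r in rows):
        failures.append("age outside accepted range")   # accepted values
    return failures

print(check_quality(rows))  # [] means all checks passed
```

In practice these assertions usually run as a pipeline step, failing the run (or alerting) before bad data reaches consumers.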
Incremental loads process only new or changed data instead of reprocessing the full dataset every run.
What are incremental loads?
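A common way to implement this is a high-water mark: remember the newest `updated_at` seen so far, and on each run process only rows beyond it. The schema here is hypothetical.

```python
source = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-02"},
    {"id": 3, "updated_at": "2024-01-03"},
]

def incremental_load(source, watermark):
    """Process only rows newer than the last run's high-water mark."""
    new_rows = [r for r in source if r["updated_at"] > watermark]
    # Advance the watermark so the next run skips what we just processed.
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

# Last run finished at 2024-01-01, so only rows 2 and 3 are processed now.
new_rows, watermark = incremental_load(source, "2024-01-01")
print(len(new_rows), watermark)  # 2 2024-01-03
```

(ISO-8601 date strings compare correctly as plain strings, which is why no date parsing is needed here.)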
Partitioning splits data into logical chunks so reads and writes can target less data and scale better.
What does partitioning mean in data engineering?
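A sketch of the idea using an in-memory dict; in a real system each key would map to a directory or file chunk (e.g. a `day=2024-01-01/` path), so a query for one day scans only that partition.

```python
from collections import defaultdict

events = [
    {"day": "2024-01-01", "value": 1},
    {"day": "2024-01-02", "value": 2},
    {"day": "2024-01-01", "value": 3},
]

def partition_by(events, key):
    """Split records into logical chunks keyed by a partition column."""
    partitions = defaultdict(list)
    for event in events:
        partitions[event[key]].append(event)
    return dict(partitions)

parts = partition_by(events, "day")
# A read targeting one day touches only that chunk, not the full dataset.
print(sorted(parts))             # ['2024-01-01', '2024-01-02']
print(len(parts["2024-01-01"]))  # 2
```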
A data lake stores large volumes of raw or semi-structured data cheaply for later processing.
What is a data lake?
A data warehouse is a system optimized for analytical queries over integrated, historical business data.
What is a data warehouse?