Explain the difference between batch processing and stream processing. When would you choose each?
Model answer
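Batch processing: processes large, bounded volumes of data at scheduled intervals (hourly, daily). Tools: Apache Spark, AWS Glue, dbt (for SQL transformations inside the warehouse). Characteristics: high latency (data is only as fresh as the last completed batch), simpler error handling (a failed batch can usually be re-run, provided the job is idempotent), and typically lower infrastructure cost because compute runs only while the job runs. Use cases: daily reporting, end-of-day reconciliation, model training data preparation, historical analysis.

Stream processing: processes an unbounded flow of events continuously, as they arrive. Tools: Apache Kafka (event transport) with Apache Flink (processing), Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics), Spark Structured Streaming. Characteristics: low latency (milliseconds to seconds), more complex state management and fault tolerance (late or out-of-order events, checkpointing, delivery guarantees), and usually higher cost because the infrastructure is always on. Use cases: fraud detection, real-time dashboards, user activity feeds, IoT sensor alerting.

Choosing between them: pick batch when results that are hours old are acceptable and simplicity and cost matter most; pick streaming when the value of the data decays within seconds or minutes. The Lambda architecture combines both, running a batch layer for complete, accurate results alongside a streaming layer for fresh, approximate ones. In practice (India, 2026): most data teams start with batch and add streaming only for specific high-value use cases (fraud, recommendations). Common stacks are dbt + Airflow for batch and Kafka + Flink for streaming.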
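
The contrast can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not how Spark or Flink work internally: the event data, the `ALERT_THRESHOLD` value, and the function names `batch_totals` and `stream_alerts` are all hypothetical. The batch function needs the complete dataset before it produces anything; the streaming function keeps running state and reacts as each event arrives.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, int]  # (user_id, amount); a made-up event shape

# Hypothetical sample data, in arrival order.
EVENTS = [("u1", 400), ("u2", 150), ("u1", 700), ("u2", 100), ("u1", 50)]
ALERT_THRESHOLD = 1000  # assumed value, for illustration only


def batch_totals(events: Iterable[Event]) -> Dict[str, int]:
    """Batch style: scan the whole collected dataset, then return results."""
    totals: Dict[str, int] = defaultdict(int)
    for user, amount in events:
        totals[user] += amount
    return dict(totals)


def stream_alerts(events: Iterable[Event]) -> Iterator[Event]:
    """Stream style: update per-user state on every event and emit an
    alert the moment a user's running total first crosses the threshold."""
    running: Dict[str, int] = defaultdict(int)
    for user, amount in events:
        before = running[user]
        running[user] += amount
        if before <= ALERT_THRESHOLD < running[user]:
            yield (user, running[user])


if __name__ == "__main__":
    print(batch_totals(EVENTS))         # {'u1': 1150, 'u2': 250}
    print(list(stream_alerts(EVENTS)))  # [('u1', 1100)]
```

The batch result is only available once every event has been collected, whereas the alert for `u1` is emitted on the third event. The cost of that responsiveness is the `running` dictionary: in a real streaming system that state has to be checkpointed and recovered after failures, which is where most of the extra complexity comes from.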