Data Engineer interview questions & answers — India 2026

The most commonly asked data engineer interview questions in India, with detailed model answers. Covers technical, behavioural, and situational questions asked by Indian recruiters.

Question type tags used throughout this page: Technical, Behavioural, Situational
01

Explain the difference between batch processing and stream processing. When would you choose each?

Technical

Model answer

Batch processing handles large volumes of data at scheduled intervals (hourly, daily). Tools: Apache Spark, AWS Glue, dbt. Characteristics: high latency (data is not fresh until the next batch completes), simpler error handling (re-run the batch), and lower infrastructure cost. Use cases: daily reporting, end-of-day reconciliation, model training data preparation, historical analysis.

Stream processing handles events continuously as they arrive. Tools: Apache Kafka with Flink, Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics), Spark Structured Streaming. Characteristics: low latency (milliseconds to seconds), more complex state management and fault tolerance, and higher cost. Use cases: fraud detection, real-time dashboards, user activity feeds, IoT sensor alerting.

The Lambda architecture combines both: a batch layer for complete historical views and a speed layer for low-latency results. In practice (India 2026), many data teams start with batch and add streaming only for specific high-value use cases such as fraud detection and recommendations. A common stack is dbt with Airflow for batch and Kafka with Flink for streaming.
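If the interviewer asks you to make the contrast concrete, a small sketch helps. The example below is a toy illustration in plain Python, not Spark or Flink code: the event tuples, `batch_total`, and `stream_window_totals` are hypothetical names, and the streaming function assumes events arrive in timestamp order (real engines handle late and out-of-order events with watermarks).

```python
# Hypothetical events: (epoch_seconds, user_id, amount)
EVENTS = [
    (0, "u1", 100.0),
    (20, "u2", 50.0),
    (65, "u1", 25.0),
    (70, "u2", 75.0),
    (130, "u1", 10.0),
]


def batch_total(events):
    """Batch: the full dataset is already available, so aggregate it in one pass."""
    return sum(amount for _, _, amount in events)


def stream_window_totals(events, window_seconds=60):
    """Streaming: consume events one at a time and emit a total
    each time a tumbling window closes. Assumes in-order arrival."""
    current_window = None
    running = 0.0
    for ts, _, amount in events:
        window = ts // window_seconds
        if current_window is not None and window != current_window:
            # The previous window is complete: emit (window_start, total).
            yield current_window * window_seconds, running
            running = 0.0
        current_window = window
        running += amount
    if current_window is not None:
        # Flush the last, still-open window when the input ends.
        yield current_window * window_seconds, running


if __name__ == "__main__":
    print(batch_total(EVENTS))
    for window_start, total in stream_window_totals(EVENTS):
        print(window_start, total)
```

The batch function produces one answer after all data has landed; the streaming function produces a partial answer per window as data flows through, which is where the extra state management comes from.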

02

Walk me through how you would design a data pipeline to ingest 500GB of daily transaction data from 10 source systems into a data warehouse.

Technical

Model answer

Design the pipeline in three layers: ingestion, processing, and serving. Ingestion: use CDC (Change Data Capture) tools such as Debezium for databases and Kafka Connect connectors for event sources; land raw data in S3 (the data lake) in Parquet format, partitioned by date. Processing: Apache Airflow for orchestration, with one DAG per source system; PySpark or AWS Glue for heavy transformation; dbt for business-logic transformation inside the warehouse. Serving: Snowflake or BigQuery as the warehouse. In BigQuery, partition tables by transaction_date and cluster by merchant_id or user_id; in Snowflake, which partitions automatically, define clustering keys on the same columns for query performance.

Cross-cutting concerns apply to every layer. Data quality: Great Expectations or dbt tests for null checks, referential integrity, and volume anomaly detection. Monitoring: Airflow alerts for DAG failures, plus data freshness SLA monitoring. Error handling: a DLQ (Dead Letter Queue) for failed records and automated retry with exponential backoff. Sizing: 500GB per day is a moderate volume for Spark on EMR or Glue; with adequate cluster sizing, the daily run can typically finish in under an hour.
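The error-handling part of this answer (retry with exponential backoff, then a dead letter queue) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production pattern from any specific tool: `process_with_retry`, `parse_amount`, and the record fields are hypothetical, and the DLQ is just an in-memory list standing in for a real queue or S3 prefix.

```python
import time


def process_with_retry(records, handler, max_attempts=3, base_delay=0.01,
                       sleep=time.sleep):
    """Apply handler to each record. On failure, wait base_delay * 2**attempt
    and retry; after max_attempts failures, send the record to the DLQ."""
    processed, dead_letter = [], []
    for record in records:
        for attempt in range(max_attempts):
            try:
                processed.append(handler(record))
                break
            except Exception as exc:
                if attempt == max_attempts - 1:
                    # Retries exhausted: keep the record and the error for later replay.
                    dead_letter.append({"record": record, "error": str(exc)})
                else:
                    sleep(base_delay * (2 ** attempt))
    return processed, dead_letter


def parse_amount(record):
    """Hypothetical transformation: cast the amount field to float."""
    return {"txn_id": record["txn_id"], "amount": float(record["amount"])}


if __name__ == "__main__":
    rows = [
        {"txn_id": "t1", "amount": "10.5"},
        {"txn_id": "t2", "amount": "bad"},  # fails on every attempt
    ]
    ok, dlq = process_with_retry(rows, parse_amount)
    print(ok)
    print(dlq)
```

Passing `sleep` as a parameter keeps the function easy to test without real delays. In an interview, it is worth adding that retries only help with transient failures; a malformed record like `t2` will always end up in the DLQ, which is exactly what the DLQ is for.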

03

What is dbt and how does it improve the data transformation workflow?

Technical

Model answer

dbt (data build tool) is a transformation framework that lets data analysts and engineers write transformation logic as SQL SELECT statements, while dbt handles the materialisation (views, tables, incremental models), dependency management, testing, and documentation. Key improvements: (1) Version control: all transformation SQL lives in Git, enabling code review, rollbacks, and CI/CD; (2) Testing: built-in data quality tests (not_null, unique, accepted_values, relationships) run as part of every dbt test or dbt build invocation; (3) Documentation: auto-generated data lineage and column-level docs; (4) Dependency management: dbt resolves model dependencies from ref() calls and executes models in the right order; (5) Incremental models: only new or changed records are processed, instead of a full table refresh. In practice, dbt is the T in ELT pipelines: after data is loaded into the warehouse, dbt transforms it into clean, tested, documented data models. It is used with Snowflake, BigQuery, Redshift, or Databricks.
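The dependency-management point is easier to explain with a small picture of what "resolves model dependencies" means. The sketch below is not dbt's implementation; it only illustrates the idea using Python's standard `graphlib` (Python 3.9+). The model names, SQL strings, and the `build_order` helper are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical dbt-style models: name -> SQL that references parents via ref().
MODELS = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "fct_orders": (
        "select * from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_customers') }} c using (customer_id)"
    ),
    "daily_revenue": (
        "select order_date, sum(amount) as revenue "
        "from {{ ref('fct_orders') }} group by 1"
    ),
}

# Matches {{ ref('model_name') }} and captures the model name.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")


def build_order(models):
    """Return model names ordered so every model comes after its parents."""
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(build_order(MODELS))
```

The two staging models have no parents and can run in either order (or in parallel), `fct_orders` must wait for both, and `daily_revenue` runs last. A circular reference would raise `graphlib.CycleError`, which mirrors the fact that dbt projects must form a DAG.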

04

Tell me about a time a data pipeline you owned went down in production. How did you respond?

Behavioural

Model answer

Use the STAR format (Situation, Task, Action, Result). Key elements: (1) What the pipeline was and what it supported (downstream dashboards, ML models, business reports); (2) How you detected the failure (monitoring alert, stakeholder complaint, scheduled check); (3) Diagnosis process: what you checked first (recent code changes, source system issues, resource constraints, upstream data anomalies); (4) Remediation: hotfix, backfill strategy, communication to stakeholders; (5) Root cause and prevention. Good answer characteristics: proactive monitoring (the failure was not discovered by a stakeholder), systematic diagnosis, clear communication to affected business users, and a post-mortem that led to a durable fix (improved tests, alerts, runbooks). Demonstrates: technical problem-solving, ownership mindset, communication under pressure.

05

What is the difference between a data warehouse, data lake, and data lakehouse?

Technical

Model answer

Data Warehouse: structured, schema-on-write, optimised for SQL query performance, and typically holding curated, business-ready data. Examples: Snowflake, BigQuery, Redshift. Best for business analysts running complex aggregation queries.

Data Lake: stores raw data in any format (Parquet, JSON, CSV, binary) at low cost in object storage (S3, GCS). Schema-on-read. Scales to petabytes cheaply, but requires data engineering work to make the data usable. Best for data science (raw feature extraction) and archival.

Data Lakehouse: combines the cost advantages of data lakes with the performance and governance of data warehouses by adding an open table format on top of object storage: Delta Lake (Databricks), Apache Iceberg (supported by Snowflake, BigQuery, and EMR), or Apache Hudi. These formats support ACID transactions, time travel, and schema enforcement. In India 2026, many modern data platforms are moving toward lakehouse architectures, with Databricks (Delta Lake) and Snowflake (Iceberg support) among the leading implementations.
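The schema-on-write versus schema-on-read distinction is the part interviewers most often probe. The toy sketch below shows it with in-memory lists in place of a real warehouse table and object store; `SCHEMA`, `write_to_warehouse`, `write_to_lake`, and `read_from_lake` are hypothetical names chosen for illustration only.

```python
import json

# Hypothetical table schema: column name -> required Python type.
SCHEMA = {"txn_id": str, "amount": float}


def write_to_warehouse(table, row):
    """Schema-on-write: validate before storing and reject anything that does not fit."""
    if set(row) != set(SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(row)}")
    for column, expected_type in SCHEMA.items():
        if not isinstance(row[column], expected_type):
            raise TypeError(f"{column} must be {expected_type.__name__}")
    table.append(row)


def write_to_lake(files, payload):
    """Data lake write: store the raw payload as-is, with no validation."""
    files.append(payload)


def read_from_lake(files):
    """Schema-on-read: apply the schema at query time, skipping unusable payloads."""
    for raw in files:
        try:
            doc = json.loads(raw)
            yield {"txn_id": str(doc["txn_id"]), "amount": float(doc["amount"])}
        except (ValueError, KeyError, TypeError):
            continue


if __name__ == "__main__":
    table, files = [], []
    write_to_warehouse(table, {"txn_id": "t1", "amount": 10.0})
    write_to_lake(files, '{"txn_id": "t1", "amount": "10.5"}')
    write_to_lake(files, "not json at all")
    print(table)
    print(list(read_from_lake(files)))
```

The warehouse pays the validation cost once at write time and every reader gets clean rows; the lake accepts everything cheaply and pushes the cleaning cost onto each reader. Table formats such as Delta Lake and Iceberg move schema enforcement back to write time while keeping the data in object storage.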

06

The analytics team reports that your dashboard numbers do not match the finance team's numbers for the same metric. How do you investigate?

Situational

Model answer

Data discrepancy investigation process: (1) Identify the specific metric, the time period, and the magnitude of the discrepancy; (2) Trace both numbers to their source: what query or report generates each figure? (3) Compare data sources: do the analytics and finance teams use the same underlying data source? (4) Check business logic differences: is "revenue" defined the same way in both places? Is it gross or net? Does it include or exclude cancelled orders, refunds, GST? (5) Check the time zone and cut-off time: analytics may use UTC while finance uses IST, which shifts the daily boundary by 5 hours 30 minutes; (6) Check for double-counting or data quality issues: duplicate transaction IDs, unprocessed records. In practice, many data discrepancies come from differing business logic definitions rather than technical errors. Resolution: document the agreed definition and create a single source-of-truth data model that all teams use, governed by a data dictionary.
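Steps (4) and (5) are worth demonstrating with numbers, because the same transactions can legitimately produce different daily totals. The sketch below uses made-up transactions and a hypothetical `daily_revenue` helper; IST is modelled as a fixed UTC+05:30 offset, which is safe because India does not observe daylight saving time.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

IST = timezone(timedelta(hours=5, minutes=30))

# Hypothetical transactions, all stored with UTC timestamps.
TXNS = [
    {"id": "t1", "ts": datetime(2026, 1, 1, 17, 0, tzinfo=timezone.utc),
     "amount": 100.0, "refunded": False},
    {"id": "t2", "ts": datetime(2026, 1, 1, 19, 0, tzinfo=timezone.utc),
     "amount": 200.0, "refunded": False},
    {"id": "t3", "ts": datetime(2026, 1, 2, 5, 0, tzinfo=timezone.utc),
     "amount": 50.0, "refunded": True},
]


def daily_revenue(txns, tz, net=False):
    """Sum amounts per calendar day in the given time zone.
    With net=True, refunded transactions are excluded."""
    totals = defaultdict(float)
    for txn in txns:
        if net and txn["refunded"]:
            continue
        day = txn["ts"].astimezone(tz).date().isoformat()
        totals[day] += txn["amount"]
    return dict(totals)


if __name__ == "__main__":
    print("analytics (UTC, gross):", daily_revenue(TXNS, timezone.utc))
    print("finance   (IST, net):  ", daily_revenue(TXNS, IST, net=True))
```

Here the analytics view reports 300.0 for 1 January and 50.0 for 2 January, while the finance view reports 100.0 and 200.0. Transaction `t2` at 19:00 UTC falls on 2 January in IST, and the refunded `t3` is excluded from net revenue. Neither number is wrong; the definitions differ, which is why the resolution is an agreed definition rather than a bug fix.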

Interview tips for Data Engineer roles in India

  • Know Airflow concepts deeply — DAG design, XComs, task dependencies, and backfill strategy are common interview topics
  • Be able to write SQL window functions, CTEs, and incremental refresh logic on a whiteboard — SQL depth is tested at every level
  • Understand the difference between Spark RDDs, DataFrames, and Datasets conceptually — even if you only work at the DataFrame level day-to-day
  • For Flipkart, Swiggy, Zomato, and PhonePe interviews, expect system design questions for data platforms: design a real-time fraud detection pipeline, design an event tracking system at 1M events/second
  • Know dbt architecture (models, tests, sources, macros) in detail — it is now a standard expectation at most product company data engineering roles in India

