✅ Core Data Engineering Concepts:
1️⃣ What are the key responsibilities of a Data Engineer?
2️⃣ OLTP vs OLAP – What’s the difference?
3️⃣ ETL vs ELT – When do you use each?
4️⃣ What are the most common challenges in data engineering?
5️⃣ Explain data partitioning in a data warehouse.
6️⃣ How do you optimize slow SQL queries?
7️⃣ What is indexing in databases, and how does it help performance?
8️⃣ Types of SQL joins with examples.
9️⃣ What is denormalization, and when is it useful?
🔟 How do you handle duplicate records in SQL?
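A possible starting point for the duplicate-records question, sketched with Python's built-in sqlite3 so it runs anywhere. The `customers` table and its rows are made up for illustration, and the `rowid` trick is SQLite-specific; on other engines the usual equivalent is `ROW_NUMBER() OVER (PARTITION BY ...)` and deleting rows ranked above 1.

```python
import sqlite3

# In-memory database with a hypothetical table containing duplicate rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        (1, "a@example.com"),
        (2, "b@example.com"),
        (1, "a@example.com"),
        (3, "c@example.com"),
        (2, "b@example.com"),
    ],
)

# Step 1: find the key combinations that occur more than once
dupes = conn.execute(
    "SELECT id, email, COUNT(*) AS n "
    "FROM customers GROUP BY id, email HAVING COUNT(*) > 1 ORDER BY id"
).fetchall()

# Step 2: keep the first physical row per key and delete the rest
# (rowid is an implicit SQLite column, assumed here for simplicity)
conn.execute(
    "DELETE FROM customers WHERE rowid NOT IN "
    "(SELECT MIN(rowid) FROM customers GROUP BY id, email)"
)

remaining = conn.execute("SELECT id, email FROM customers ORDER BY id").fetchall()
print(dupes)
print(remaining)
```

In an interview it usually helps to show the detection query first, then the delete, since the first is safe to run on production data and the second is not.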
✅ Apache Spark & PySpark – Real-World Use Cases:
1️⃣ What is Apache Spark? How does it compare to Hadoop MapReduce?
2️⃣ Spark RDD vs DataFrame vs Dataset – Key differences?
3️⃣ cache() vs persist() – When and why?
4️⃣ What is a shuffle operation in Spark? How do you optimize it?
5️⃣ Types of transformations in Spark.
✅ Spark Coding Questions:
1️⃣ PySpark code to remove duplicates from a DataFrame.
2️⃣ PySpark code to calculate moving average of a column.
3️⃣ Scala program to count words in a text file using Spark.
4️⃣ PySpark function to fill missing values with column mean.
5️⃣ PySpark code to group by a column and sum another column.
✅ Hive Integration & Query Optimization:
1️⃣ Partitioning vs Bucketing – What’s the difference?
2️⃣ How does Hive store data internally?
3️⃣ What are the commonly used file formats in Hive?
4️⃣ Explain dynamic partitioning in Hive with an example.
5️⃣ How does indexing work in Hive?