Section 1: SQL Questions (Coding + Scenario-Based)
Basic Queries
1. How would you find the 2nd highest salary from an employee table without using TOP or LIMIT? (See the sketch after this list.)
2. Write a query to fetch duplicate records from a table with employee name and salary.
3. How will you remove duplicate records without using DISTINCT?
4. Suppose you have a table with Employee, Salary, and Department. How would you return the top 3 highest salaries from each department? (Also covered in the sketch below.)
5. What would be your approach to increasing salary by 10% for employees earning less than 50K in a single query?
6. How would you identify the missing numbers between 1 and 100 in an employee ID column?
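A minimal sketch for questions 1 and 4, assuming a hypothetical employee table with emp_name, salary, and department columns. The first query avoids TOP/LIMIT by using a correlated subquery; the second uses DENSE_RANK() to rank salaries within each department.

```sql
-- Q1: 2nd highest salary without TOP or LIMIT, via a correlated subquery.
SELECT MAX(salary) AS second_highest_salary
FROM employee
WHERE salary < (SELECT MAX(salary) FROM employee);

-- Q4: top 3 highest salaries per department, via DENSE_RANK().
SELECT department, emp_name, salary
FROM (
    SELECT department,
           emp_name,
           salary,
           DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
    FROM employee
) ranked
WHERE rnk <= 3;
```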
Section 2: Data Engineering Tools & Frameworks
PySpark Coding Questions
1. What is the difference between DataFrame and RDD?
2. Write PySpark code to read a CSV file from an S3 bucket and convert it into Parquet format (see the sketch after this list).
3. How would you broadcast small datasets in PySpark?
4. How would you handle data skew in PySpark?
5. Write PySpark code to remove duplicates based on multiple columns (also covered in the sketch below).
6. Write PySpark code to implement SCD (Slowly Changing Dimension) logic.
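A minimal sketch for questions 2 and 5, assuming a hypothetical bucket path (s3a://my-bucket/) and hypothetical column names, and that the Hadoop AWS connector is configured for S3 access. It reads a CSV, drops duplicates on a subset of columns, and writes Parquet.

```python
from pyspark.sql import SparkSession

# Sketch only: bucket names, paths, and column names are placeholders.
spark = SparkSession.builder.appName("csv_to_parquet").getOrCreate()

# Q2: read a CSV file from S3.
df = spark.read.csv("s3a://my-bucket/input/employees.csv", header=True, inferSchema=True)

# Q5: remove duplicates based on multiple columns.
deduped = df.dropDuplicates(["emp_name", "department"])

# Q2 (continued): write the result back to S3 in Parquet format.
deduped.write.mode("overwrite").parquet("s3a://my-bucket/output/employees_parquet/")
```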
Scenario-Based PySpark Questions
- If your source file has 100 million records and only 2,000 of them are duplicates, how will you remove the duplicates?
- Your job is failing with an OOM (Out of Memory) error on the last shuffle step. How will you debug and optimize it?
- How will you repartition data to improve performance in PySpark? (See the sketch after this list.)
- What is the best way to handle null values during data transformation? (Also shown in the sketch below.)
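A minimal sketch of the repartitioning and null-handling points above, using a small hypothetical DataFrame with customer_id, country, and amount columns standing in for a large source table.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition_nulls").getOrCreate()

# Hypothetical sample data; in practice this would be a large source table.
df = spark.createDataFrame(
    [(1, "IN", 120.0), (2, None, None), (None, "US", 75.0)],
    ["customer_id", "country", "amount"],
)

# Repartition by a join/aggregation key so related rows land in the same partition;
# the partition count (200 here) is a placeholder to be tuned against data volume and cluster size.
df = df.repartition(200, "customer_id")

# Null handling: fill defaults for non-critical columns, drop rows missing a critical key.
df = df.fillna({"country": "UNKNOWN", "amount": 0})
df = df.na.drop(subset=["customer_id"])
```

For the skew and OOM scenarios, the usual levers are salting hot join keys, broadcasting the small side of a join, and tuning spark.sql.shuffle.partitions.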
Section 3: Azure Data Engineering
1. What is the difference between Azure Data Factory and Azure Databricks?
2. How would you create an ETL Pipeline in Azure Data Factory to process incremental data? (See the watermark sketch after this list.)
3. How will you monitor and debug failed ADF pipelines?
4. What is the purpose of Data Flow in Azure Data Factory?
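For the incremental-load question above, ADF pipelines commonly follow a high-watermark pattern: a Lookup activity reads the last loaded timestamp, a Copy activity selects only newer rows, and a final activity advances the watermark. A minimal sketch of the queries involved, assuming hypothetical sales_orders and etl_watermark tables and a last_modified column:

```sql
-- Delta query the Copy activity would run; table and column names are placeholders.
SELECT *
FROM sales_orders
WHERE last_modified > (SELECT watermark_value
                       FROM etl_watermark
                       WHERE table_name = 'sales_orders');

-- After a successful copy, the pipeline advances the watermark.
UPDATE etl_watermark
SET watermark_value = (SELECT MAX(last_modified) FROM sales_orders)
WHERE table_name = 'sales_orders';
```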
Scenario-Based Azure Questions
- How will you load data from Azure Blob Storage into Azure SQL Database with data validation checks?
- Your ADF pipeline is running slowly. How will you optimize it?
- How will you design a parameterized pipeline for dynamic data ingestion from multiple files?
- What is the best way to schedule a daily pipeline with retry on failure and notification alerts?
Section 4: AWS Glue + S3 + Lambda
1. What is AWS Glue? How is it different from Databricks?
2. How will you read data from an S3 bucket and write it into another bucket using a Glue job? (See the sketch after this list.)
3. How can you trigger a Glue job automatically on S3 file arrival?
4. What is the role of a Crawler in AWS Glue?
5. How would you optimize a Glue job for processing large files?
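A minimal sketch for question 2, assuming hypothetical bucket names. It uses the standard Glue job boilerplate (GlueContext with its underlying SparkSession) to copy CSV data from one bucket to another as Parquet.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job setup; JOB_NAME is passed in by the Glue service at run time.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the source bucket (placeholder paths) and write to the target bucket as Parquet.
df = spark.read.csv("s3://source-bucket-placeholder/input/", header=True, inferSchema=True)
df.write.mode("overwrite").parquet("s3://target-bucket-placeholder/output/")

job.commit()
```

For question 3, the usual approach is an S3 event notification (or EventBridge rule) that invokes a Lambda function, which in turn starts the job with the boto3 Glue client's start_job_run call.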