How do you find the second-highest salary for each department in Spark SQL and the PySpark DataFrame API?
Here’s how to find the second-highest salary in each department using:
- ✅ Spark SQL
- ✅ PySpark DataFrame API
- ✅ Python Pandas
🧱 Sample Schema
Assume the following columns:
+----+----------+--------+------------+
| id | name | salary | department |
+----+----------+--------+------------+
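The Spark examples below assume a DataFrame df with this schema. A minimal sketch that builds one from the same sample rows used in the Pandas section (the appName is an arbitrary choice):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("second-highest-salary").getOrCreate()

# Same sample rows as the Pandas example below
data = [
    (1, "Alice", 70000, "HR"),
    (2, "Bob", 50000, "HR"),
    (3, "Charlie", 60000, "HR"),
    (4, "David", 80000, "IT"),
    (5, "Eve", 70000, "IT"),
]
df = spark.createDataFrame(data, ["id", "name", "salary", "department"])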
✅ 1. Spark SQL
📌 Register Temp Table
df.createOrReplaceTempView("employees")
📌 SQL Query using DENSE_RANK()
SELECT *
FROM (
    SELECT *,
           DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank
    FROM employees
) ranked
WHERE rank = 2
📌 Run in PySpark
spark.sql("""
    SELECT *
    FROM (
        SELECT *,
               DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank
        FROM employees
    ) ranked
    WHERE rank = 2
""").show()
✅ 2. PySpark DataFrame API
from pyspark.sql.window import Window
from pyspark.sql.functions import col, dense_rank

# Rank salaries within each department, highest first
windowSpec = Window.partitionBy("department").orderBy(col("salary").desc())

# Attach the rank, then keep only the second-highest rows
df_with_rank = df.withColumn("rank", dense_rank().over(windowSpec))
second_highest_df = df_with_rank.filter(col("rank") == 2)
second_highest_df.show()
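If you only need the second-highest salary value per department rather than the full employee rows, one sketch is to rank distinct (department, salary) pairs; the output column name second_highest_salary is just an illustrative choice:

# Distinct salaries per department, then pick the one ranked second
second_vals = (
    df.select("department", "salary").distinct()
      .withColumn("rank", dense_rank().over(windowSpec))
      .filter(col("rank") == 2)
      .select("department", col("salary").alias("second_highest_salary"))
)
second_vals.show()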
✅ 3. Python Pandas
import pandas as pd
data = [
{"id": 1, "name": "Alice", "salary": 70000, "department": "HR"},
{"id": 2, "name": "Bob", "salary": 50000, "department": "HR"},
{"id": 3, "name": "Charlie", "salary": 60000, "department": "HR"},
{"id": 4, "name": "David", "salary": 80000, "department": "IT"},
{"id": 5, "name": "Eve", "salary": 70000, "department": "IT"},
]
df = pd.DataFrame(data)
df["rank"] = df.groupby("department")["salary"].rank(method="dense", ascending=False)
second_highest = df[df["rank"] == 2]
print(second_highest)
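The same "value only" variant in Pandas, using the unique salaries per department (returns None when a department has fewer than two distinct salaries):

second_vals = df.groupby("department")["salary"].apply(
    lambda s: sorted(s.unique(), reverse=True)[1] if s.nunique() >= 2 else None
)
print(second_vals)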
✅ Output Example
| id | name    | salary | department | rank |
|----|---------|--------|------------|------|
| 3  | Charlie | 60000  | HR         | 2    |
| 5  | Eve     | 70000  | IT         | 2    |