How do you find the second-highest salary for each department in Spark SQL and the PySpark DataFrame API?
Here’s how to find the second-highest salary in each department using:
- ✅ Spark SQL
- ✅ PySpark DataFrame API
- ✅ Python Pandas
🧱 Sample Schema
Assume the following columns:
+----+----------+--------+------------+
| id | name | salary | department |
+----+----------+--------+------------+
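The Spark examples below assume a DataFrame df with this schema. A minimal sketch that builds one from the same sample rows used in the Pandas section (the appName is an arbitrary choice):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("second-highest-salary").getOrCreate()

# Same sample rows as the Pandas example below
data = [
    (1, "Alice", 70000, "HR"),
    (2, "Bob", 50000, "HR"),
    (3, "Charlie", 60000, "HR"),
    (4, "David", 80000, "IT"),
    (5, "Eve", 70000, "IT"),
]
df = spark.createDataFrame(data, ["id", "name", "salary", "department"])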
✅ 1. Spark SQL
📌 Register Temp Table
df.createOrReplaceTempView("employees")
📌 SQL Query using DENSE_RANK()
SELECT *
FROM (
    SELECT *,
           DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank
    FROM employees
) ranked
WHERE rank = 2
📌 Run in PySpark
spark.sql("""
    SELECT *
    FROM (
        SELECT *,
               DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank
        FROM employees
    ) ranked
    WHERE rank = 2
""").show()
✅ 2. PySpark DataFrame API
from pyspark.sql.window import Window
from pyspark.sql.functions import col, dense_rank

# Rank salaries within each department, highest first
windowSpec = Window.partitionBy("department").orderBy(col("salary").desc())

# Attach the rank, then keep only the second-highest rows
df_with_rank = df.withColumn("rank", dense_rank().over(windowSpec))
second_highest_df = df_with_rank.filter(col("rank") == 2)
second_highest_df.show()
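If you only need the second-highest salary value per department rather than the full employee rows, one sketch is to rank distinct (department, salary) pairs; the output column name second_highest_salary is just an illustrative choice:

# Distinct salaries per department, then pick the one ranked second
second_vals = (
    df.select("department", "salary").distinct()
      .withColumn("rank", dense_rank().over(windowSpec))
      .filter(col("rank") == 2)
      .select("department", col("salary").alias("second_highest_salary"))
)
second_vals.show()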
✅ 3. Python Pandas
import pandas as pd
data = [
{"id": 1, "name": "Alice", "salary": 70000, "department": "HR"},
{"id": 2, "name": "Bob", "salary": 50000, "department": "HR"},
{"id": 3, "name": "Charlie", "salary": 60000, "department": "HR"},
{"id": 4, "name": "David", "salary": 80000, "department": "IT"},
{"id": 5, "name": "Eve", "salary": 70000, "department": "IT"},
]
df = pd.DataFrame(data)
df["rank"] = df.groupby("department")["salary"].rank(method="dense", ascending=False)
second_highest = df[df["rank"] == 2]
print(second_highest)
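The same "value only" variant in Pandas, using the unique salaries per department (returns None when a department has fewer than two distinct salaries):

second_vals = df.groupby("department")["salary"].apply(
    lambda s: sorted(s.unique(), reverse=True)[1] if s.nunique() >= 2 else None
)
print(second_vals)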
✅ Output Example
| id | name    | salary | department | rank |
|----|---------|--------|------------|------|
| 3  | Charlie | 60000  | HR         | 2    |
| 5  | Eve     | 70000  | IT         | 2    |