Coding Questions in Spark SQL, Pyspark, and Python

How do you find the second-highest salary for each department in Spark SQL and the PySpark DataFrame API?

Here’s how to find the second-highest salary in each department using:

  • ✅ Spark SQL
  • ✅ PySpark DataFrame API
  • ✅ Python Pandas

🧱 Sample Schema

Assume the following columns:

+----+----------+--------+------------+
| id | name     | salary | department |
+----+----------+--------+------------+

✅ 1. Spark SQL

📌 Register Temp Table

df.createOrReplaceTempView("employees")

📌 SQL Query using DENSE_RANK()

SELECT *
FROM (
    SELECT *,
           DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank
    FROM employees
) ranked
WHERE rank = 2
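Why DENSE_RANK() rather than RANK()? With RANK(), ties consume rank positions, so a tie for the top salary means no row ever gets rank 2; DENSE_RANK() assigns consecutive ranks to distinct salaries, so rank 2 is always the second-highest distinct salary. The plain-Python sketch below (written for illustration; both helper functions are ours, not part of any library) shows the difference on a tied list:

```python
def dense_rank(values_desc):
    # Dense rank: tied values share a rank, and the next distinct
    # value gets the next consecutive rank (no gaps).
    ranks, seen = [], {}
    for v in values_desc:
        if v not in seen:
            seen[v] = len(seen) + 1
        ranks.append(seen[v])
    return ranks

def rank(values_desc):
    # Regular rank: tied values share a rank, but the following
    # distinct value skips ahead by the number of tied rows (gaps).
    ranks = []
    for i, v in enumerate(values_desc):
        ranks.append(ranks[i - 1] if i and v == values_desc[i - 1] else i + 1)
    return ranks

salaries = [80000, 80000, 60000]   # tie for the top salary
print(dense_rank(salaries))  # [1, 1, 2] -> 60000 matches rank 2
print(rank(salaries))        # [1, 1, 3] -> no row has rank 2
```

So with RANK() the query above would silently return nothing for a department whose top salary is tied.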

📌 Run in PySpark

spark.sql("""
    SELECT *
    FROM (
        SELECT *,
               DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank
        FROM employees
    ) ranked
    WHERE rank = 2
""").show()

✅ 2. PySpark DataFrame API

from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank, col

windowSpec = Window.partitionBy("department").orderBy(col("salary").desc())

df_with_rank = df.withColumn("rank", dense_rank().over(windowSpec))

second_highest_df = df_with_rank.filter(col("rank") == 2)
second_highest_df.show()

✅ 3. Python Pandas

import pandas as pd

data = [
    {"id": 1, "name": "Alice", "salary": 70000, "department": "HR"},
    {"id": 2, "name": "Bob", "salary": 50000, "department": "HR"},
    {"id": 3, "name": "Charlie", "salary": 60000, "department": "HR"},
    {"id": 4, "name": "David", "salary": 80000, "department": "IT"},
    {"id": 5, "name": "Eve", "salary": 70000, "department": "IT"},
]

df = pd.DataFrame(data)

df["rank"] = df.groupby("department")["salary"].rank(method="dense", ascending=False)

second_highest = df[df["rank"] == 2]
print(second_highest)
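Since the post also covers plain Python, here is one way to do the same thing with only the standard library. This is a sketch of our own (the `second_highest_by_department` helper is not from any library); it mirrors the DENSE_RANK() = 2 semantics by looking at distinct salaries per department:

```python
from collections import defaultdict

def second_highest_by_department(rows):
    """Return rows whose salary is the second-highest *distinct*
    salary in their department (mirrors DENSE_RANK() = 2)."""
    salaries = defaultdict(set)
    for row in rows:
        salaries[row["department"]].add(row["salary"])
    targets = {
        dept: sorted(vals, reverse=True)[1]
        for dept, vals in salaries.items()
        # a department with a single distinct salary has no second-highest
        if len(vals) >= 2
    }
    return [r for r in rows if targets.get(r["department"]) == r["salary"]]

data = [
    {"id": 1, "name": "Alice", "salary": 70000, "department": "HR"},
    {"id": 2, "name": "Bob", "salary": 50000, "department": "HR"},
    {"id": 3, "name": "Charlie", "salary": 60000, "department": "HR"},
    {"id": 4, "name": "David", "salary": 80000, "department": "IT"},
    {"id": 5, "name": "Eve", "salary": 70000, "department": "IT"},
]
print(second_highest_by_department(data))
# Charlie (HR, 60000) and Eve (IT, 70000)
```

Like the window-function versions, this returns every row at the second-highest distinct salary, so ties are kept.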

✅ Output Example

+----+---------+--------+------------+------+
| id | name    | salary | department | rank |
+----+---------+--------+------------+------+
| 3  | Charlie | 60000  | HR         | 2    |
| 5  | Eve     | 70000  | IT         | 2    |
+----+---------+--------+------------+------+
