You have two tables: ‘response_times’ with columns (request_id, response_time_ms, device_type_id) and ‘device_types’ with columns (device_type_id, device_name, manufacturer). Write a query to calculate the 95th percentile of response times for each device manufacturer.
MySQL
MySQL does not have a built-in percentile function (there is no PERCENTILE_CONT), but you can approximate one with the PERCENT_RANK() window function. PERCENT_RANK() assigns each row a relative rank between 0 and 1 within its partition. Here’s how you can do it:
```sql
WITH ranked_responses AS (
    SELECT
        rt.device_type_id,
        dt.manufacturer,
        rt.response_time_ms,
        PERCENT_RANK() OVER (
            PARTITION BY dt.manufacturer
            ORDER BY rt.response_time_ms
        ) AS percentile_rank
    FROM response_times rt
    JOIN device_types dt
        ON rt.device_type_id = dt.device_type_id
)
SELECT
    manufacturer,
    MAX(response_time_ms) AS response_time_95th_percentile
FROM ranked_responses
WHERE percentile_rank <= 0.95
GROUP BY manufacturer;
```
Explanation:
- Common Table Expression (CTE): the `ranked_responses` CTE computes the percentile rank of each response time within each manufacturer.
- PERCENT_RANK(): ranks each response time within its manufacturer’s partition, yielding values between 0 and 1.
- Filtering and grouping: the outer query keeps rows whose percentile rank is at most 0.95, then takes the maximum remaining response time per manufacturer as the 95th percentile.
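Because SQLite (3.25+) also supports PERCENT_RANK(), the query’s logic can be sanity-checked locally without a MySQL server. This is a sketch on invented sample data (manufacturers and response times are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite 3.25+ supports window functions
conn.executescript("""
    CREATE TABLE response_times (request_id INTEGER, response_time_ms REAL, device_type_id INTEGER);
    CREATE TABLE device_types (device_type_id INTEGER, device_name TEXT, manufacturer TEXT);
    INSERT INTO device_types VALUES (1, 'PhoneX', 'Acme'), (2, 'CamZ', 'Globex');
""")
# 100 requests per manufacturer with response times 1..100 ms,
# so the 95th percentile should land at 95 ms.
rows = [(i, float(i), 1) for i in range(1, 101)] + \
       [(100 + i, float(i), 2) for i in range(1, 101)]
conn.executemany("INSERT INTO response_times VALUES (?, ?, ?)", rows)

query = """
    WITH ranked_responses AS (
        SELECT dt.manufacturer,
               rt.response_time_ms,
               PERCENT_RANK() OVER (PARTITION BY dt.manufacturer
                                    ORDER BY rt.response_time_ms) AS percentile_rank
        FROM response_times rt
        JOIN device_types dt ON rt.device_type_id = dt.device_type_id
    )
    SELECT manufacturer, MAX(response_time_ms) AS p95
    FROM ranked_responses
    WHERE percentile_rank <= 0.95
    GROUP BY manufacturer
"""
print(dict(conn.execute(query).fetchall()))  # {'Acme': 95.0, 'Globex': 95.0}
```

With 100 rows per partition, PERCENT_RANK() for the i-th fastest row is (i − 1)/99, so rows 1 through 95 pass the filter and MAX() returns 95 ms, as expected.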
Spark SQL
Spark SQL has built-in percentile functions, which simplifies things considerably. PERCENTILE_APPROX() computes an approximate percentile directly:
```sql
SELECT
    dt.manufacturer,
    PERCENTILE_APPROX(rt.response_time_ms, 0.95) AS response_time_95th_percentile
FROM response_times rt
JOIN device_types dt
    ON rt.device_type_id = dt.device_type_id
GROUP BY dt.manufacturer;
```
Explanation:
- Join: joins the `response_times` and `device_types` tables on `device_type_id`.
- PERCENTILE_APPROX(): approximates the 95th percentile of response times for each manufacturer; it is approximate by design, which keeps it fast on large datasets.
- GROUP BY: groups the results by `manufacturer`, producing one percentile per manufacturer.
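Without a Spark cluster handy, the same calculation can be cross-checked exactly with Python’s statistics module (the data below is invented; PERCENTILE_APPROX may differ slightly from the exact value because it uses an approximation algorithm):

```python
import statistics
from collections import defaultdict

# Illustrative (manufacturer, response_time_ms) pairs, standing in for the joined tables
rows = [("Acme", float(i)) for i in range(1, 101)]

by_mfr = defaultdict(list)
for mfr, ms in rows:
    by_mfr[mfr].append(ms)

# quantiles(..., n=100) returns the 99 interior cut points; index 94 is the 95th percentile
p95 = {mfr: statistics.quantiles(times, n=100, method="inclusive")[94]
       for mfr, times in by_mfr.items()}
print(p95)
```

For response times 1..100 ms this yields 95.05 ms (the inclusive method interpolates between the 95th and 96th values), a useful baseline when judging how close PERCENTILE_APPROX gets.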
These queries will give you the 95th percentile of response times for each device manufacturer in both MySQL and Spark SQL.
Given a table ‘daily_visits’ with columns (visit_date, visit_count), write a query to calculate the 7-day moving average of daily visits for each date.
To calculate the 7-day moving average of daily visits for each date, you can use window functions in both MySQL and Spark SQL. Here are the queries for each:
MySQL
In MySQL (8.0 or later, which introduced window functions), you can use the AVG() window function with a frame specification to calculate the 7-day moving average:
```sql
SELECT
    visit_date,
    visit_count,
    AVG(visit_count) OVER (
        ORDER BY visit_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_average_7_days
FROM daily_visits
ORDER BY visit_date;
```
Explanation:
- SELECT clause:
  - `visit_date`: the visit date.
  - `visit_count`: the raw daily visit count.
  - `AVG(visit_count) OVER (...)`: the average visit count over the specified window frame.
- Window frame:
  - `ORDER BY visit_date` orders the rows by date.
  - `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` includes the current row and the six preceding rows, a total of 7 days when every date is present. Note that `ROWS` counts rows, not calendar days, so gaps in `visit_date` would stretch the window beyond a week.
- ORDER BY clause: orders the result set by `visit_date`.
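SQLite accepts the same frame specification, so the query can be smoke-tested locally. This is a sketch with invented visit counts, not the production table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_visits (visit_date TEXT, visit_count INTEGER)")
conn.executemany(
    "INSERT INTO daily_visits VALUES (?, ?)",
    [(f"2024-01-{d:02d}", d * 10) for d in range(1, 11)],  # counts 10, 20, ..., 100
)

rows = conn.execute("""
    SELECT visit_date,
           visit_count,
           AVG(visit_count) OVER (ORDER BY visit_date
                                  ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
               AS moving_average_7_days
    FROM daily_visits
    ORDER BY visit_date
""").fetchall()
for r in rows:
    print(r)
# Day 1 averages only itself (10.0); day 7 averages 10..70 -> 40.0;
# day 10 averages 40..100 -> 70.0.
```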
Spark SQL
In Spark SQL, the approach is similar, using the AVG() window function with a frame specification:
```sql
SELECT
    visit_date,
    visit_count,
    AVG(visit_count) OVER (
        ORDER BY visit_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_average_7_days
FROM daily_visits
ORDER BY visit_date;
```
Explanation:
The Spark SQL query is identical to the MySQL one: both engines implement the same ANSI window-function syntax, so the 7-day moving average uses the same AVG() call and frame specification.
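One detail both engines share: for the first six dates the frame is truncated at the start of the data, so those rows average fewer than seven values. The frame logic can be mirrored in plain Python to see this (a sketch with invented counts):

```python
from collections import deque

def moving_average_7(counts):
    """Mirror ROWS BETWEEN 6 PRECEDING AND CURRENT ROW: each output value
    averages the current count and up to six preceding ones."""
    window = deque(maxlen=7)  # automatically evicts the oldest count
    averages = []
    for count in counts:
        window.append(count)
        averages.append(sum(window) / len(window))  # partial window for the first 6 rows
    return averages

print(moving_average_7([10, 20, 30, 40, 50, 60, 70, 80]))
# [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 50.0]
```

If you want the average reported only once a full week of data exists, add a `COUNT(*)` over the same frame and null out rows where it is less than 7.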