You have two tables: ‘response_times’ with columns (request_id, response_time_ms, device_type_id) and ‘device_types’ with columns (device_type_id, device_name, manufacturer). Write a query to calculate the 95th percentile of response times for each device manufacturer.
MySQL
MySQL does not have a built-in percentile function (there is no PERCENTILE_CONT), but you can approximate one with the PERCENT_RANK() window function. PERCENT_RANK() assigns each row a relative rank between 0 and 1 within its partition. Here’s how you can do it:
```sql
WITH ranked_responses AS (
    SELECT
        rt.device_type_id,
        dt.manufacturer,
        rt.response_time_ms,
        PERCENT_RANK() OVER (
            PARTITION BY dt.manufacturer
            ORDER BY rt.response_time_ms
        ) AS percentile_rank
    FROM response_times rt
    JOIN device_types dt
        ON rt.device_type_id = dt.device_type_id
)
SELECT
    manufacturer,
    MAX(response_time_ms) AS response_time_95th_percentile
FROM ranked_responses
WHERE percentile_rank <= 0.95
GROUP BY manufacturer;
```
Explanation:
- Common Table Expression (CTE): the `ranked_responses` CTE computes the percentile rank of each response time within each manufacturer.
- PERCENT_RANK(): ranks each response time within its manufacturer’s partition, yielding values between 0 and 1.
- Filtering and grouping: the outer query keeps rows whose percentile rank is at most 0.95, then takes the maximum remaining response time per manufacturer as the 95th percentile.
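Because SQLite (3.25+) also supports PERCENT_RANK(), the query’s logic can be sanity-checked locally without a MySQL server. This is a sketch on invented sample data (manufacturers and response times are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite 3.25+ supports window functions
conn.executescript("""
    CREATE TABLE response_times (request_id INTEGER, response_time_ms REAL, device_type_id INTEGER);
    CREATE TABLE device_types (device_type_id INTEGER, device_name TEXT, manufacturer TEXT);
    INSERT INTO device_types VALUES (1, 'PhoneX', 'Acme'), (2, 'CamZ', 'Globex');
""")
# 100 requests per manufacturer with response times 1..100 ms,
# so the 95th percentile should land at 95 ms.
rows = [(i, float(i), 1) for i in range(1, 101)] + \
       [(100 + i, float(i), 2) for i in range(1, 101)]
conn.executemany("INSERT INTO response_times VALUES (?, ?, ?)", rows)

query = """
    WITH ranked_responses AS (
        SELECT dt.manufacturer,
               rt.response_time_ms,
               PERCENT_RANK() OVER (PARTITION BY dt.manufacturer
                                    ORDER BY rt.response_time_ms) AS percentile_rank
        FROM response_times rt
        JOIN device_types dt ON rt.device_type_id = dt.device_type_id
    )
    SELECT manufacturer, MAX(response_time_ms) AS p95
    FROM ranked_responses
    WHERE percentile_rank <= 0.95
    GROUP BY manufacturer
"""
print(dict(conn.execute(query).fetchall()))  # {'Acme': 95.0, 'Globex': 95.0}
```

With 100 rows per partition, PERCENT_RANK() for the i-th fastest row is (i − 1)/99, so rows 1 through 95 pass the filter and MAX() returns 95 ms, as expected.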
Spark SQL
Spark SQL has built-in percentile functions, which simplifies things considerably. PERCENTILE_APPROX() computes an approximate percentile directly:
```sql
SELECT
    dt.manufacturer,
    PERCENTILE_APPROX(rt.response_time_ms, 0.95) AS response_time_95th_percentile
FROM response_times rt
JOIN device_types dt
    ON rt.device_type_id = dt.device_type_id
GROUP BY dt.manufacturer;
```
Explanation:
- Join: joins the `response_times` and `device_types` tables on `device_type_id`.
- PERCENTILE_APPROX(): approximates the 95th percentile of response times for each manufacturer; it is approximate by design, which keeps it fast on large datasets.
- GROUP BY: groups the results by `manufacturer`, producing one percentile per manufacturer.
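Without a Spark cluster handy, the same calculation can be cross-checked exactly with Python’s statistics module (the data below is invented; PERCENTILE_APPROX may differ slightly from the exact value because it uses an approximation algorithm):

```python
import statistics
from collections import defaultdict

# Illustrative (manufacturer, response_time_ms) pairs, standing in for the joined tables
rows = [("Acme", float(i)) for i in range(1, 101)]

by_mfr = defaultdict(list)
for mfr, ms in rows:
    by_mfr[mfr].append(ms)

# quantiles(..., n=100) returns the 99 interior cut points; index 94 is the 95th percentile
p95 = {mfr: statistics.quantiles(times, n=100, method="inclusive")[94]
       for mfr, times in by_mfr.items()}
print(p95)
```

For response times 1..100 ms this yields 95.05 ms (the inclusive method interpolates between the 95th and 96th values), a useful baseline when judging how close PERCENTILE_APPROX gets.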
These queries will give you the 95th percentile of response times for each device manufacturer in both MySQL and Spark SQL.
Given a table ‘daily_visits’ with columns (visit_date, visit_count), write a query to calculate the 7-day moving average of daily visits for each date.
To calculate the 7-day moving average of daily visits for each date, you can use window functions in both MySQL and Spark SQL. Here are the queries for each:
MySQL
In MySQL (8.0 or later, which introduced window functions), you can use the AVG() window function with a frame specification to calculate the 7-day moving average:
```sql
SELECT
    visit_date,
    visit_count,
    AVG(visit_count) OVER (
        ORDER BY visit_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_average_7_days
FROM daily_visits
ORDER BY visit_date;
```
Explanation:
- SELECT clause:
  - `visit_date`: the visit date.
  - `visit_count`: the raw daily visit count.
  - `AVG(visit_count) OVER (...)`: the average visit count over the specified window frame.
- Window frame:
  - `ORDER BY visit_date` orders the rows by date.
  - `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` includes the current row and the six preceding rows, a total of 7 days when every date is present. Note that `ROWS` counts rows, not calendar days, so gaps in `visit_date` would stretch the window beyond a week.
- ORDER BY clause: orders the result set by `visit_date`.
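SQLite accepts the same frame specification, so the query can be smoke-tested locally. This is a sketch with invented visit counts, not the production table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_visits (visit_date TEXT, visit_count INTEGER)")
conn.executemany(
    "INSERT INTO daily_visits VALUES (?, ?)",
    [(f"2024-01-{d:02d}", d * 10) for d in range(1, 11)],  # counts 10, 20, ..., 100
)

rows = conn.execute("""
    SELECT visit_date,
           visit_count,
           AVG(visit_count) OVER (ORDER BY visit_date
                                  ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
               AS moving_average_7_days
    FROM daily_visits
    ORDER BY visit_date
""").fetchall()
for r in rows:
    print(r)
# Day 1 averages only itself (10.0); day 7 averages 10..70 -> 40.0;
# day 10 averages 40..100 -> 70.0.
```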
Spark SQL
In Spark SQL, the approach is similar, using the AVG() window function with a frame specification:
```sql
SELECT
    visit_date,
    visit_count,
    AVG(visit_count) OVER (
        ORDER BY visit_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_average_7_days
FROM daily_visits
ORDER BY visit_date;
```
Explanation:
The Spark SQL query is identical to the MySQL one: both engines implement the same ANSI window-function syntax, so the 7-day moving average uses the same AVG() call and frame specification.
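One detail both engines share: for the first six dates the frame is truncated at the start of the data, so those rows average fewer than seven values. The frame logic can be mirrored in plain Python to see this (a sketch with invented counts):

```python
from collections import deque

def moving_average_7(counts):
    """Mirror ROWS BETWEEN 6 PRECEDING AND CURRENT ROW: each output value
    averages the current count and up to six preceding ones."""
    window = deque(maxlen=7)  # automatically evicts the oldest count
    averages = []
    for count in counts:
        window.append(count)
        averages.append(sum(window) / len(window))  # partial window for the first 6 rows
    return averages

print(moving_average_7([10, 20, 30, 40, 50, 60, 70, 80]))
# [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 50.0]
```

If you want the average reported only once a full week of data exists, add a `COUNT(*)` over the same frame and null out rows where it is less than 7.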