Common Table Expressions in Oracle PL/SQL and Hive: Explained and Compared

Jun 6, 2024 | SQL

Common Table Expressions (CTEs) are a useful feature in SQL for simplifying complex queries and improving readability. Both Oracle PL/SQL and Apache Hive support CTEs, although there may be slight differences in their syntax and usage.

Common Table Expressions in Oracle PL/SQL

In Oracle, CTEs are defined using the WITH clause (referred to in Oracle documentation as subquery factoring). They create a temporary, named result set that can be referenced one or more times within the main query.

Syntax:

WITH cte_name AS (
SELECT query
)
SELECT columns
FROM cte_name
WHERE condition;

Example Usage:

WITH department_totals AS (
    SELECT department_id, SUM(salary) AS total_salary
    FROM employees
    GROUP BY department_id
)
SELECT d.department_id, d.department_name, t.total_salary
FROM departments d
JOIN department_totals t ON d.department_id = t.department_id;

In this example:

  • The CTE department_totals calculates the total salary for each department.
  • The main query then joins the departments table with the department_totals CTE to display the department name and total salary.

Common Table Expressions in Apache Hive

In Hive, CTEs are also defined using the WITH clause. They function similarly to those in Oracle, allowing you to create temporary result sets for use within a larger query.

Syntax:

WITH cte_name AS (
SELECT query
)
SELECT columns
FROM cte_name
WHERE condition;

Example Usage:

WITH department_totals AS (
    SELECT department_id, SUM(salary) AS total_salary
    FROM employees
    GROUP BY department_id
)
SELECT d.department_id, d.department_name, t.total_salary
FROM departments d
JOIN department_totals t ON d.department_id = t.department_id;

In this example:

  • The syntax and functionality are essentially the same as in Oracle PL/SQL.
  • The CTE department_totals is used to calculate the total salary for each department.
  • The main query joins the departments table with the department_totals CTE to display the department name and total salary.
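
Hive additionally allows a CTE to feed a CREATE TABLE AS SELECT (CTAS) statement or a view definition, not only a standalone SELECT. A minimal sketch, where dept_salary_summary is a hypothetical target table:

CREATE TABLE dept_salary_summary AS
WITH department_totals AS (
    SELECT department_id, SUM(salary) AS total_salary
    FROM employees
    GROUP BY department_id
)
SELECT department_id, total_salary
FROM department_totals;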

Key Points

  • Scope: CTEs in both Oracle and Hive are limited to the scope of the query they are defined in. Once the query execution is complete, the CTEs are discarded.
  • Multiple CTEs: You can define multiple CTEs in a single query by separating them with commas.
  • Recursion: Oracle supports recursive CTEs (recursive subquery factoring, available since Oracle 11g Release 2), which allow a CTE to reference itself. Hive does not support recursive CTEs, so recursive logic has to be expressed with iterative queries or handled outside HiveQL. A minimal Oracle sketch follows this list.
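
As a minimal sketch of the Oracle recursive form (the CTE name numbers and column n are arbitrary), the following generates the integers 1 through 5:

WITH numbers (n) AS (
    SELECT 1 FROM dual           -- anchor member
    UNION ALL
    SELECT n + 1 FROM numbers    -- recursive member
    WHERE n < 5
)
SELECT n FROM numbers;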

Example of Multiple CTEs:

WITH
cte1 AS (
    SELECT department_id, SUM(salary) AS total_salary
    FROM employees
    GROUP BY department_id
),
cte2 AS (
    SELECT department_id, COUNT(*) AS employee_count
    FROM employees
    GROUP BY department_id
)
SELECT c1.department_id, c1.total_salary, c2.employee_count
FROM cte1 c1
JOIN cte2 c2 ON c1.department_id = c2.department_id;

In both Oracle and Hive, this example:

  • Defines two CTEs: cte1 and cte2.
  • The main query joins these two CTEs to display the department ID, total salary, and employee count for each department.

CTEs are a powerful tool in both Oracle PL/SQL and Hive for breaking down complex queries, making them easier to read and maintain.

Here’s an example of a complex query that chains multiple CTEs in a single WITH statement, with each layer building on the results of the previous ones. Because it uses only standard SQL and window functions, the query works in both MySQL 8.0+ and Spark SQL. It involves six layers of CTEs to illustrate a comprehensive use case.

Scenario:

  • You have two tables: products with columns (product_id, product_name, category_id, price) and categories with columns (category_id, category_name).
  • You want to perform various aggregations and calculations to ultimately determine the price difference between each product and the next most expensive product in the same category, while including additional metrics such as category-level statistics and rank.

SQL Query with Six Layers of CTEs

WITH
-- Layer 1: CTE to join products and categories
product_categories AS (
    SELECT
        p.product_id,
        p.product_name,
        p.category_id,
        p.price,
        c.category_name
    FROM
        products p
    JOIN
        categories c ON p.category_id = c.category_id
),

-- Layer 2: CTE to calculate the minimum and maximum price per category
category_min_max_prices AS (
    SELECT
        category_id,
        MIN(price) AS min_price,
        MAX(price) AS max_price
    FROM
        product_categories
    GROUP BY
        category_id
),

-- Layer 3: CTE to rank products within each category by price
ranked_products AS (
    SELECT
        pc.product_id,
        pc.product_name,
        pc.category_id,
        pc.price,
        pc.category_name,
        ROW_NUMBER() OVER (PARTITION BY pc.category_id ORDER BY pc.price DESC) AS price_rank
    FROM
        product_categories pc
),

-- Layer 4: CTE to calculate the next higher price for each product
next_higher_prices AS (
    SELECT
        rp.product_id,
        rp.product_name,
        rp.category_id,
        rp.price,
        rp.category_name,
        rp.price_rank,
        LEAD(rp.price) OVER (PARTITION BY rp.category_id ORDER BY rp.price DESC) AS next_higher_price
    FROM
        ranked_products rp
),

-- Layer 5: CTE to calculate the price difference with the next higher price
price_differences AS (
    SELECT
        nhp.product_id,
        nhp.product_name,
        nhp.category_id,
        nhp.price,
        nhp.category_name,
        nhp.price_rank,
        nhp.next_higher_price,
        nhp.price - nhp.next_higher_price AS price_difference
    FROM
        next_higher_prices nhp
),

-- Layer 6: CTE to calculate category-level statistics and combine all previous data
-- (window aggregates keep one row per product while computing per-category values)
category_statistics AS (
    SELECT
        pd.category_id,
        pd.category_name,
        AVG(pd.price) OVER (PARTITION BY pd.category_id) AS avg_price,
        COUNT(pd.product_id) OVER (PARTITION BY pd.category_id) AS total_products,
        cm.min_price,
        cm.max_price,
        pd.product_id,
        pd.product_name,
        pd.price,
        pd.price_rank,
        pd.next_higher_price,
        pd.price_difference
    FROM
        price_differences pd
    JOIN
        category_min_max_prices cm ON pd.category_id = cm.category_id
)

-- Final query to select all relevant columns
SELECT
    cs.product_id,
    cs.product_name,
    cs.category_id,
    cs.category_name,
    cs.price,
    cs.price_rank,
    cs.next_higher_price,
    cs.price_difference,
    cs.avg_price,
    cs.total_products,
    cs.min_price,
    cs.max_price
FROM
    category_statistics cs
ORDER BY
    cs.category_id, cs.price DESC;

Explanation:

  1. product_categories CTE:
    • Joins the products and categories tables to select relevant columns.
  2. category_min_max_prices CTE:
    • Calculates the minimum and maximum price for each category.
  3. ranked_products CTE:
    • Ranks products within each category by price.
  4. next_higher_prices CTE:
    • Calculates the next higher price for each product within the same category.
  5. price_differences CTE:
    • Calculates the price difference between each product and the next higher price.
  6. category_statistics CTE:
    • Calculates category-level statistics and combines data from all previous CTEs.
  7. Final Query:
    • Selects the final results, including product details, price difference, and category-level statistics.

This approach demonstrates how to use multiple, chained CTEs to build a complex query step by step, making it easier to understand and maintain.

Now suppose a scenario in which six steps of a PySpark script (mostly DataFrame transformations) have been converted into multiple CTEs under a single WITH statement in a Spark SQL query. Which approach will perform better?

When deciding whether to use multiple CTEs in a single WITH statement in Spark SQL versus writing equivalent transformations in PySpark, performance can depend on several factors:

Factors Affecting Performance:

  1. Execution Plan Optimization:
    • Spark SQL: The Catalyst optimizer in Spark SQL can optimize the query plan, potentially reducing the execution time. Spark SQL queries are usually well-optimized because of this.
    • PySpark: The DataFrame API also benefits from the Catalyst optimizer. However, PySpark scripts can be more flexible in terms of optimization strategies and custom transformations.
  2. Readability and Maintainability:
    • Spark SQL: Using multiple CTEs can make the query easier to read and maintain, especially for complex transformations. It provides a declarative approach.
    • PySpark: While PySpark scripts can be more verbose, they offer more control over the transformations and are often easier to debug step-by-step.
  3. Lazy Evaluation:
    • Both Spark SQL and PySpark follow lazy evaluation principles, meaning transformations are not executed until an action (e.g., show(), count(), collect()) is called.
  4. Resource Utilization:
    • The performance of Spark SQL and PySpark can vary depending on cluster resource configurations, data size, and complexity of transformations.

Example: Converting Multiple Steps of PySpark to CTEs in Spark SQL

PySpark Script Example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, avg, rank, lead
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("Example").getOrCreate()

# Step 1: Read data
products = spark.read.csv("products.csv", header=True, inferSchema=True)
categories = spark.read.csv("categories.csv", header=True, inferSchema=True)

# Step 2: Calculate the total number of products in each category
product_totals = products.groupBy("category_id").agg(count("*").alias("total_products"))

# Step 3: Calculate average price in each category
category_averages = products.groupBy("category_id").agg(avg("price").alias("avg_price"))

# Step 4: Identify products with prices above the average in their category
above_average_products = products.join(category_averages, "category_id").filter(col("price") > col("avg_price"))

# Step 5: Rank products within their categories by price
window_spec = Window.partitionBy("category_id").orderBy(col("price").desc())
ranked_products = products.withColumn("price_rank", rank().over(window_spec))

# Step 6: Calculate the next most expensive product price within each category
next_expensive_prices = products.withColumn("next_higher_price", lead("price").over(window_spec))

# Final Step: Combine all previous steps
# Select only the needed columns from each derived DataFrame before joining,
# so column references stay unambiguous (every DataFrame here derives from products)
final_result = products \
    .join(product_totals, "category_id") \
    .join(category_averages, "category_id") \
    .join(above_average_products.select("product_id", col("price").alias("above_avg_price")),
          "product_id", "left") \
    .join(ranked_products.select("product_id", "price_rank"), "product_id") \
    .join(next_expensive_prices.select("product_id", "next_higher_price"), "product_id") \
    .withColumn("price_difference", col("price") - col("next_higher_price"))

final_result.show()

Equivalent Spark SQL with Multiple CTEs:

WITH
product_totals AS (
    SELECT
        category_id,
        COUNT(*) AS total_products
    FROM
        products
    GROUP BY
        category_id
),
category_averages AS (
    SELECT
        category_id,
        AVG(price) AS avg_price
    FROM
        products
    GROUP BY
        category_id
),
above_average_products AS (
    SELECT
        p.product_id,
        p.product_name,
        p.category_id,
        p.price,
        ca.avg_price
    FROM
        products p
    JOIN
        category_averages ca ON p.category_id = ca.category_id
    WHERE
        p.price > ca.avg_price
),
ranked_products AS (
    SELECT
        p.product_id,
        p.product_name,
        p.category_id,
        p.price,
        RANK() OVER (PARTITION BY p.category_id ORDER BY p.price DESC) AS price_rank
    FROM
        products p
),
next_expensive_prices AS (
    SELECT
        p.product_id,
        p.product_name,
        p.category_id,
        p.price,
        LEAD(p.price) OVER (PARTITION BY p.category_id ORDER BY p.price DESC) AS next_higher_price
    FROM
        products p
),
final_result AS (
    SELECT
        p.product_id,
        p.product_name,
        p.category_id,
        c.category_name,
        p.price,
        pt.total_products,
        ca.avg_price,
        ap.price AS above_avg_price,
        rp.price_rank,
        nep.next_higher_price,
        p.price - nep.next_higher_price AS price_difference
    FROM
        products p
    JOIN
        categories c ON p.category_id = c.category_id
    LEFT JOIN
        product_totals pt ON p.category_id = pt.category_id
    LEFT JOIN
        category_averages ca ON p.category_id = ca.category_id
    LEFT JOIN
        above_average_products ap ON p.product_id = ap.product_id
    LEFT JOIN
        ranked_products rp ON p.product_id = rp.product_id
    LEFT JOIN
        next_expensive_prices nep ON p.product_id = nep.product_id
)
SELECT
    product_id,
    product_name,
    category_name,
    price,
    total_products,
    avg_price,
    above_avg_price,
    price_rank,
    next_higher_price,
    price_difference
FROM
    final_result
ORDER BY
    category_name, price DESC;

Performance Considerations:

  1. Execution Plan:
    • Run EXPLAIN or EXPLAIN EXTENDED on the Spark SQL query, and call df.explain() on the PySpark DataFrames, to analyze the execution plans. This will help you understand how Spark plans to execute each query and transformation; a short sketch follows this list.
  2. Benchmarking:
    • Benchmark both approaches with representative data sizes. Measure the execution time, resource usage (CPU, memory), and shuffle read/write operations. This will give you practical insights into which approach performs better in your specific context.
  3. Caching Intermediate Results:
    • In PySpark, you can cache intermediate DataFrames with df.cache() or df.persist() if they are reused multiple times, which can improve performance. This can be controlled more explicitly than in Spark SQL, although Spark SQL offers CACHE TABLE as a rough equivalent (see the sketch after this list).
  4. Complex Transformations:
    • If your transformations are very complex and require custom logic, PySpark may offer more flexibility and control. However, for standard SQL operations, Spark SQL might be more concise and easier to optimize automatically.
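
A rough sketch of both techniques in Spark SQL, reusing the products table from the example above (product_totals_cached is a placeholder name):

-- Inspect the optimized and physical plans Catalyst produces for a CTE-based query
EXPLAIN EXTENDED
WITH product_totals AS (
    SELECT category_id, COUNT(*) AS total_products
    FROM products
    GROUP BY category_id
)
SELECT * FROM product_totals;

-- Cache a reused intermediate result; roughly the SQL-side counterpart of df.cache()
CACHE TABLE product_totals_cached AS
SELECT category_id, COUNT(*) AS total_products
FROM products
GROUP BY category_id;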

In general, both approaches benefit from Spark’s Catalyst optimizer, but the best choice depends on your specific use case, data size, and complexity of transformations. Performing benchmarking and analyzing execution plans will provide the most accurate assessment of performance differences.

Written By HintsToday Team
