• A comprehensive, logically structured guide to SQL indexing, complete with real examples, platform-specific insights, and advanced use cases:


    🧠 Mastering Indexing in SQL: A Complete Guide

    🔍 What is Indexing in SQL?

    Indexing is a performance optimization technique that allows fast retrieval of rows from a table, similar to a book index that helps locate topics quickly.

    🎯 Core Idea

    Instead of scanning every row (full table scan), the database engine can jump directly to matching rows using a sorted data structure like a B-tree or hash table.


    ⚙️ How Does Indexing Work?

    1. Index Creation: You specify one or more columns to index.
    2. Index Structure: Data is stored in a B-tree, bitmap, hash, or other structures depending on DB type.
    3. Query Execution: On query, the DB engine checks the query plan and uses the index if available and efficient.
    4. Row Lookup: Index holds pointers (row IDs or offsets) to locate the actual data.
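
    A minimal end-to-end sketch of these four steps (assuming a hypothetical employees table and generic SQL syntax):

    -- 1. Create an index on the column you filter by
    CREATE INDEX idx_employees_email ON employees(email);

    -- 2-4. The engine can now satisfy this query with an index lookup instead of
    --      a full table scan; EXPLAIN shows whether the index was chosen.
    EXPLAIN SELECT * FROM employees WHERE email = 'alice@example.com';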

    📂 Types of Indexes (with Syntax & Use Cases)

    Type | Description | Syntax | Example Use Case
    Single-column index | Index on one column | CREATE INDEX idx_col ON table(col); | Fast lookup on a specific column
    Composite index | Index on multiple columns | CREATE INDEX idx_multi ON table(col1, col2); | Filtering by both columns
    Unique index | Ensures values are unique | CREATE UNIQUE INDEX idx_unique ON table(col); | Enforce no duplicate emails
    Clustered index | Reorders the actual rows | CREATE CLUSTERED INDEX idx_cl ON table(col); | Only one allowed per table
    Non-clustered index | Separate structure; references rows | CREATE NONCLUSTERED INDEX idx ON table(col); | Most common index type
    Function-based index | Index on a function's output | CREATE INDEX idx_lower ON users(LOWER(email)); | Case-insensitive searches
    Partial index (PostgreSQL) | Index on filtered rows | CREATE INDEX idx_active ON users(last_login) WHERE is_active = true; | Speed up selective queries
    Full-text index | For full-text search | CREATE FULLTEXT INDEX idx_text ON docs(content); | Search documents by keyword
    Spatial index | For geospatial data | CREATE SPATIAL INDEX idx_geo ON locations(geom); | Optimized for GIS queries

    🚀 When to Use Indexes (Use Cases)

    ✅ Best Use Cases for Indexes

    Query Pattern | Why Index Helps | Example
    WHERE clause | Directly locates matching rows | SELECT * FROM orders WHERE order_id = 12345;
    JOIN conditions | Fast row match between tables | JOIN orders ON customer_id
    ORDER BY / GROUP BY | Efficient sorting/grouping | ORDER BY created_at
    Foreign key columns | Used frequently in joins | customer_id in orders
    Search in expressions | If a functional index is used | LOWER(email) = 'abc@test.com'
    Columns with high cardinality | Distinct values benefit more | email, SSN, transaction_id
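
    As a hedged illustration of the JOIN and foreign-key rows above (hypothetical customers/orders schema):

    -- Index the foreign key used in the join
    CREATE INDEX idx_orders_customer_id ON orders(customer_id);

    -- The join can now match orders to customers through the index
    SELECT c.name, o.amount
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE o.order_date >= '2024-01-01';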

    ⚠️ Cases Where Indexes May Not Help

    • Small tables: full scan is often faster.
    • Columns with low cardinality: e.g., gender = ‘M/F’.
    • Queries without WHERE/JOIN/ORDER BY.
    • Filtering by a non-leading column of a composite index (see the sketch below).
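
    A minimal sketch of the composite-index case, assuming a hypothetical idx_orders_cust_date index:

    CREATE INDEX idx_orders_cust_date ON orders(customer_id, order_date);

    -- Can use the index: the filter includes the leading column
    SELECT * FROM orders WHERE customer_id = 42 AND order_date = '2024-01-01';

    -- Usually cannot use the index efficiently: the leading column is missing
    SELECT * FROM orders WHERE order_date = '2024-01-01';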

    🔨 Creating Indexes — Syntax Examples

    -- Basic Index
    CREATE INDEX idx_email ON employees(email);
    
    -- Composite Index
    CREATE INDEX idx_email_dept ON employees(email, department);
    
    -- Unique Index
    CREATE UNIQUE INDEX idx_unique_email ON users(email);
    
    -- Function-based Index (PostgreSQL)
    CREATE INDEX idx_lower_email ON users(LOWER(email));
    
    -- Partial Index (PostgreSQL)
    CREATE INDEX idx_active_users ON users(last_login) WHERE is_active = true;
    
    -- Full-text Index (MySQL, PostgreSQL)
    CREATE FULLTEXT INDEX idx_text_content ON documents(content);
    
    -- Drop Index
    DROP INDEX idx_email; -- PostgreSQL
    DROP INDEX idx_email ON employees; -- MySQL
    

    🔬 Platform-Specific Hands-On Demos

    🐘 PostgreSQL Indexing Demo

    CREATE TABLE customers (
      id SERIAL PRIMARY KEY,
      name VARCHAR(100),
      email VARCHAR(100),
      city VARCHAR(50),
      created_at DATE
    );
    
    -- Insert sample data
    INSERT INTO customers (name, email, city, created_at)
    SELECT 
      'User' || i, 
      'user' || i || '@example.com', 
      CASE WHEN i % 2 = 0 THEN 'Delhi' ELSE 'Mumbai' END,
      CURRENT_DATE - (i || ' days')::interval
    FROM generate_series(1, 100000) AS i;
    
    -- Create index
    CREATE INDEX idx_city ON customers(city);
    
    -- Analyze performance
    EXPLAIN ANALYZE SELECT * FROM customers WHERE city = 'Delhi';
    

    🐬 MySQL Indexing Demo

    CREATE TABLE orders (
      id INT AUTO_INCREMENT PRIMARY KEY,
      customer_id INT,
      amount DECIMAL(10,2),
      order_date DATE
    );
    
    -- Insert test data using procedure
    DELIMITER $$
    CREATE PROCEDURE insert_orders()
    BEGIN
      DECLARE i INT DEFAULT 1;
      WHILE i <= 100000 DO
        INSERT INTO orders (customer_id, amount, order_date)
        VALUES (
          FLOOR(RAND()*1000),
          RAND()*1000,
          DATE_SUB(CURDATE(), INTERVAL FLOOR(RAND()*365) DAY)
        );
        SET i = i + 1;
      END WHILE;
    END$$
    DELIMITER ;
    
    CALL insert_orders();
    
    -- Index and performance check
    CREATE INDEX idx_order_date ON orders(order_date);
    EXPLAIN SELECT * FROM orders WHERE order_date = '2024-01-01';
    

    🔥 PySpark + Delta Lake Z-Ordering

    # Create delta table
    from pyspark.sql.functions import expr
    df = spark.range(0, 1000000).withColumn("order_id", expr("id % 1000")) \
                                 .withColumn("country", expr("CASE WHEN id % 2 = 0 THEN 'India' ELSE 'US' END")) \
                                 .withColumn("amount", expr("rand() * 1000"))
    
    df.write.format("delta").mode("overwrite").save("/mnt/delta/orders")
    spark.sql("CREATE TABLE delta_orders USING DELTA LOCATION '/mnt/delta/orders'")
    
    # Optimize with Z-Ordering
    spark.sql("OPTIMIZE delta_orders ZORDER BY (country)")
    
    # Query performance
    spark.sql("SELECT * FROM delta_orders WHERE country = 'India'").explain(True)
    

    📈 How to Check if Index is Used

    • PostgreSQL: EXPLAIN ANALYZE SELECT ... → Look for Index Scan or Bitmap Index Scan
    • MySQL: EXPLAIN SELECT ... → Look for Using index or ref
    • Spark: Use .explain(True) or Query Execution Plan
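
    For example, in PostgreSQL the plan output indicates whether the index was chosen (a minimal sketch; the exact plan text varies by version and data):

    EXPLAIN ANALYZE SELECT * FROM customers WHERE city = 'Delhi';
    -- A plan that uses the index typically contains a line such as:
    --   Index Scan using idx_city on customers ...
    -- whereas a full scan appears as:
    --   Seq Scan on customers ...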

    📉 Disadvantages of Indexing

    Concern | Explanation
    Slower Writes | INSERT/UPDATE/DELETE are slower due to index maintenance
    Disk Space | Indexes consume additional space
    Over-indexing | Too many indexes confuse the optimizer and bloat storage
    Fragmentation | May reduce performance over time
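
    To see the disk-space cost in practice, PostgreSQL exposes size functions (a minimal sketch; the functions are PostgreSQL-specific):

    -- Compare the total size of a table's indexes with the table itself
    SELECT pg_size_pretty(pg_indexes_size('customers'))  AS index_size,
           pg_size_pretty(pg_relation_size('customers')) AS table_size;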

    🧠 Advanced Tips & Best Practices

    • Start with frequently queried columns (WHERE/JOIN).
    • Use composite indexes carefully – Always filter using left-most column.
    • Don’t index everything – Focus on high-cardinality, high-impact queries.
    • Regularly analyze and drop unused indexes – Monitor via query plans.
    • Monitor index usage – Most RDBMSs provide pg_stat_user_indexes, SHOW INDEX, or query performance tables.
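
    For the last two tips, PostgreSQL's pg_stat_user_indexes view records how often each index is scanned; a minimal sketch for spotting candidates to drop:

    -- Indexes that are rarely or never scanned may be safe to drop
    SELECT relname      AS table_name,
           indexrelname AS index_name,
           idx_scan     AS times_used
    FROM pg_stat_user_indexes
    ORDER BY idx_scan ASC;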

    🧊 Final Summary

    Feature | Purpose | Example
    🔎 Fast lookups | Single-column index | CREATE INDEX ON employees(email)
    🧩 Multiple filters | Composite index | CREATE INDEX ON orders(customer_id, order_date)
    🔐 Uniqueness | Unique index | CREATE UNIQUE INDEX ON users(email)
    🧠 Function queries | Function-based index | CREATE INDEX ON users(LOWER(email))
    🎯 Conditional | Partial index | CREATE INDEX ON users(last_login) WHERE is_active = true
    🔄 Maintain speed | Z-Order (Delta Lake) | OPTIMIZE delta_orders ZORDER BY (country)

  • Spark SQL Operators Cheatsheet

    1. Arithmetic Operators

    Operator | Syntax | Description | Example
    + | a + b | Adds two values | SELECT 5 + 3;
    - | a - b | Subtracts one value from another | SELECT 5 - 3;
    * | a * b | Multiplies two values | SELECT 5 * 3;
    / | a / b | Divides one value by another | SELECT 6 / 2;
    % | a % b | Returns the remainder of division | SELECT 5 % 2;

    2. Comparison Operators

    Operator | Syntax | Description | Example
    = | a = b | Equal to | SELECT * FROM table WHERE col = 10;
    != or <> | a != b or a <> b | Not equal to | SELECT * FROM table WHERE col != 10;
    > | a > b | Greater than | SELECT * FROM table WHERE col > 10;
    < | a < b | Less than | SELECT * FROM table WHERE col < 10;
    >= | a >= b | Greater than or equal to | SELECT * FROM table WHERE col >= 10;
    <= | a <= b | Less than or equal to | SELECT * FROM table WHERE col <= 10;

    3. Logical Operators

    Operator | Syntax | Description | Example
    AND | a AND b | Returns true if both conditions are true | SELECT * FROM table WHERE a > 10 AND b < 20;
    OR | a OR b | Returns true if any condition is true | SELECT * FROM table WHERE a > 10 OR b < 20;
    NOT | NOT a | Reverses the boolean value of a condition | SELECT * FROM table WHERE NOT a > 10;

    4. String Operators

    Operator | Syntax | Description | Example
    || | a || b | Concatenates two strings | SELECT 'Hello' || ' World';
    LIKE | a LIKE pattern | Matches a string to a pattern | SELECT * FROM table WHERE name LIKE 'A%';
    RLIKE | a RLIKE pattern | Matches a string to a regex pattern | SELECT * FROM table WHERE name RLIKE '^A';
    ILIKE | a ILIKE pattern | Case-insensitive LIKE pattern matching | SELECT * FROM table WHERE name ILIKE 'a%';

    5. Null Handling Operators

    Operator | Syntax | Description | Example
    IS NULL | a IS NULL | Checks if a value is NULL | SELECT * FROM table WHERE col IS NULL;
    IS NOT NULL | a IS NOT NULL | Checks if a value is NOT NULL | SELECT * FROM table WHERE col IS NOT NULL;
    NULLIF | NULLIF(a, b) | Returns NULL if a equals b, otherwise a | SELECT NULLIF(10, 10);
    COALESCE | COALESCE(a, b, ...) | Returns the first non-NULL value | SELECT COALESCE(NULL, 'default');

    6. Set Operators

    Operator | Syntax | Description | Example
    UNION | query1 UNION query2 | Combines results of two queries (removes duplicates) | SELECT col FROM table1 UNION SELECT col FROM table2;
    UNION ALL | query1 UNION ALL query2 | Combines results of two queries (includes duplicates) | SELECT col FROM table1 UNION ALL SELECT col FROM table2;
    INTERSECT | query1 INTERSECT query2 | Returns common rows from both queries | SELECT col FROM table1 INTERSECT SELECT col FROM table2;
    EXCEPT | query1 EXCEPT query2 | Returns rows from the first query not in the second | SELECT col FROM table1 EXCEPT SELECT col FROM table2;

    7. Aggregate Functions (Operators)

    Function | Syntax | Description | Example
    COUNT | COUNT(*) or COUNT(col) | Counts rows or non-NULL values | SELECT COUNT(*) FROM table;
    SUM | SUM(col) | Sums up numeric values | SELECT SUM(sales) FROM table;
    AVG | AVG(col) | Calculates the average | SELECT AVG(price) FROM table;
    MIN | MIN(col) | Finds the minimum value | SELECT MIN(price) FROM table;
    MAX | MAX(col) | Finds the maximum value | SELECT MAX(price) FROM table;
    GROUP BY | GROUP BY col | Groups rows based on a column value | SELECT category, SUM(sales) FROM table GROUP BY category;

    8. Window Functions

    Function | Syntax | Description | Example
    ROW_NUMBER | ROW_NUMBER() OVER (PARTITION BY col ORDER BY col2) | Assigns a unique number to each row within a partition | SELECT ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales) AS rank FROM table;
    RANK | RANK() OVER (PARTITION BY col ORDER BY col2) | Assigns rank to rows, with gaps for duplicates | SELECT RANK() OVER (PARTITION BY category ORDER BY sales DESC) AS rank FROM table;
    DENSE_RANK | DENSE_RANK() OVER (PARTITION BY col ORDER BY col2) | Assigns rank to rows, without gaps | SELECT DENSE_RANK() OVER (PARTITION BY category ORDER BY sales DESC) AS rank FROM table;
    NTILE | NTILE(n) OVER (PARTITION BY col ORDER BY col2) | Divides rows into n buckets | SELECT NTILE(4) OVER (ORDER BY sales) AS quartile FROM table;
    LEAD | LEAD(col, offset, default) OVER (PARTITION BY col ORDER BY col2) | Accesses the value of the next row | SELECT LEAD(sales, 1, 0) OVER (ORDER BY date) FROM table;
    LAG | LAG(col, offset, default) OVER (PARTITION BY col ORDER BY col2) | Accesses the value of the previous row | SELECT LAG(sales, 1, 0) OVER (ORDER BY date) FROM table;

    9. Miscellaneous Operators

    Operator | Syntax | Description | Example
    CAST | CAST(expr AS type) | Converts a value to a specified type | SELECT CAST(price AS STRING) FROM table;
    CASE | CASE WHEN condition THEN result ELSE result END | Conditional logic | SELECT CASE WHEN sales > 100 THEN 'High' ELSE 'Low' END AS category FROM table;
    DISTINCT | DISTINCT col | Returns unique values | SELECT DISTINCT category FROM table;

    This cheatsheet provides a comprehensive overview of commonly used Spark SQL operators and functions with their syntax, descriptions, and examples. Use it as a reference to efficiently write and optimize Spark SQL queries!

  • Syntax Rules for Pseudocode

    • Natural Language: Use simple and clear natural language to describe steps.
    • Keywords: Use standard control flow keywords such as:
      • IF, ELSE, ENDIF
      • FOR, WHILE, ENDWHILE
      • FUNCTION, CALL
      • INPUT, OUTPUT
    • Indentation: Indent blocks within loops or conditionals to signify nesting.
    • Capitalization: Write pseudocode keywords in UPPERCASE to distinguish them from variables and logic descriptions.
    • Variables: Use meaningful and descriptive names (e.g., customerName, totalSales).

    Standards:

    1. Use simple language: Avoid complex sentences and focus on concise, easy-to-understand statements.
    2. Be consistent: Use the same terminology and formatting throughout the pseudocode.
    3. Use indentation: Indent code blocks to show hierarchy and improve readability.

    Terms

    1. INPUT: Used to describe user input or data input.
      Example: INPUT name
    2. OUTPUT: Used to describe output or results.
      Example: OUTPUT "Hello, " + name
    3. SET: Used to assign a value to a variable.
      Example: SET count = 0
    4. IF: Used to describe conditional statements.
      Example: IF age > 18 THEN ...
    5. WHILE: Used to describe loops.
      Example: WHILE count < 10 DO ...
    6. FOR: Used to describe loops with a counter.
      Example: FOR i = 1 TO 10 DO ...
    7. REPEAT: Used to describe loops that repeat until a condition is met.
      Example: REPEAT ... UNTIL count = 10
    8. UNTIL: Used to describe loops that repeat until a condition is met.
      Example: REPEAT ... UNTIL count = 10
    9. CASE: Used to describe multi-branch conditional statements.
      Example: CASE color OF ...
    10. PROCEDURE: Used to describe a subroutine or function.
      Example: PROCEDURE greet(name) ...

    Common Terms and Constructs in Pseudocode

    Control Structures

    Conditionals:
    IF condition THEN
        // Actions
    ELSE
        // Alternative actions
    ENDIF

    Loops:
    FOR variable FROM start TO end DO
        // Actions
    ENDFOR

    WHILE condition DO
        // Actions
    ENDWHILE

    Input/Output

    Use clear and direct terms for input and output:
    INPUT "Enter a number:", userNumber
    OUTPUT "The result is:", result

    Functions and Procedures

    Define reusable components:
    FUNCTION calculateSum(a, b)
        RETURN a + b
    ENDFUNCTION

    Error Handling

    Describe how to handle errors in a readable way:
    IF error occurs THEN
        OUTPUT "Error encountered. Exiting process."
        EXIT
    ENDIF

    Data Manipulation

    Data operations like adding, updating, or deleting:
    SET total TO total + itemCost
    REMOVE item FROM shoppingCart

    Examples of Good Pseudocode

    Here’s an example of a simple algorithm to calculate the average of three numbers:

    INPUT num1, num2, num3
    
    SET sum = num1 + num2 + num3
    SET average = sum / 3
    
    OUTPUT "The average is: " + average

    Problem: Calculate the factorial of a given number.

    START
    
    INPUT "Enter a positive integer:", num
    IF num < 0 THEN
        OUTPUT "Error: Number must be positive."
        EXIT
    ENDIF
    
    SET factorial TO 1
    FOR i FROM 1 TO num DO
        SET factorial TO factorial * i
    ENDFOR
    
    OUTPUT "The factorial of", num, "is", factorial
    
    END

    By following these syntax rules, standards, and terms, you can write pseudocode that is easy to read, understand, and implement.

  • Window functions in PySpark allow you to perform operations on a subset of your data using a “window” that defines a range of rows. These functions are similar to SQL window functions and are useful for tasks like ranking, cumulative sums, and moving averages. Let’s go through various PySpark DataFrame window functions, compare them with Spark SQL window functions, and provide examples with a large sample dataset.

    PySpark’s window functions allow operations across a specified “window” of rows, such as performing aggregations, ranking, or comparisons. The functionality mimics SQL window functions but uses PySpark’s syntax.


    Syntax Structure

    Define a Window Specification: The Window object specifies how rows are partitioned and ordered for the operation.

    from pyspark.sql.window import Window 
    window_spec = Window.partitionBy("column1").orderBy("column2")

    Apply the Window Function: Use PySpark functions like row_number(), rank(), dense_rank(), etc., with the window specification.

    from pyspark.sql.functions import row_number, rank, dense_rank, sum 
    df.withColumn("row_num", row_number().over(window_spec))

    Window Specification Options

    Option | Description | Syntax
    partitionBy() | Divides the data into partitions for independent calculations. | Window.partitionBy("column1")
    orderBy() | Specifies the order of rows within each partition. | Window.orderBy("column2")
    rowsBetween() | Defines a window frame by rows relative to the current row. | .rowsBetween(-1, 1)
    rangeBetween() | Defines a window frame based on the range of values in the ordering column. | .rangeBetween(-10, 10)
    unboundedPreceding | Indicates all rows before the current row in the partition. | Window.rowsBetween(Window.unboundedPreceding, 0)
    unboundedFollowing | Indicates all rows after the current row in the partition. | Window.rowsBetween(0, Window.unboundedFollowing)
    currentRow | Refers to the current row in the partition. | Window.rowsBetween(Window.currentRow, Window.currentRow)
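
    The rowsBetween(-1, 1) frame above corresponds to ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING in Spark SQL; a minimal sketch of a centered moving average (assuming a hypothetical employees view with id, name, and salary columns):

    SELECT id,
           name,
           salary,
           AVG(salary) OVER (
             PARTITION BY name
             ORDER BY id
             ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
           ) AS centered_avg
    FROM employees;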

    Common PySpark Window Functions

    Function | Description
    row_number() | Assigns a unique number to each row in a window.
    rank() | Assigns a rank to each row, with gaps for ties.
    dense_rank() | Assigns a rank to each row, without gaps for ties.
    ntile(n) | Divides rows into n buckets and assigns a bucket number to each row.
    lead(column, n) | Returns the value of the column from n rows ahead of the current row.
    lag(column, n) | Returns the value of the column from n rows behind the current row.
    first() | Returns the first value in the window frame.
    last() | Returns the last value in the window frame.
    sum() | Computes the sum of the column over the window frame.
    avg() | Computes the average of the column over the window frame.
    max() | Returns the maximum value of the column over the window frame.
    min() | Returns the minimum value of the column over the window frame.
    count() | Returns the count of rows in the window frame.

    When using the RANK window function in Spark SQL or DataFrame API, if there are duplicates within a partition, the behavior is as follows:

    • The RANK function assigns the same rank to duplicate values.
    • The next rank value is skipped. For example, if two rows have the same value and are assigned rank 1, the next row will be assigned rank 3.

    Here’s an example:

    +--------+-------+
    |  value | rank  |
    +--------+-------+
    |     10 |     1 |
    |     10 |     1 |
    |      9 |     3 |
    |      8 |     4 |
    +--------+-------+

    In contrast, the DENSE_RANK function does not skip rank values. If there are duplicates, the next rank value will be consecutive.

    +--------+-----------+
    |  value | dense_rank|
    +--------+-----------+
    |     10 |         1 |
    |     10 |         1 |
    |      9 |         2 |
    |      8 |         3 |
    +--------+-----------+
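
    A minimal Spark SQL sketch that would produce rankings like the outputs above (assuming a hypothetical scores table with a value column):

    SELECT value,
           RANK()       OVER (ORDER BY value DESC) AS rank,
           DENSE_RANK() OVER (ORDER BY value DESC) AS dense_rank
    FROM scores;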

    Examples

    1. Ranking Employees by Salary

    from pyspark.sql.window import Window
    from pyspark.sql.functions import row_number, rank, dense_rank
    
    data = [(1, "Alice", 5000), (2, "Bob", 6000), (3, "Charlie", 4000), (4, "Alice", 7000)]
    columns = ["EmpID", "Name", "Salary"]
    
    df = spark.createDataFrame(data, columns)
    
    window_spec = Window.partitionBy("Name").orderBy("Salary")
    
    df = (df.withColumn("row_number", row_number().over(window_spec))
            .withColumn("rank", rank().over(window_spec))
            .withColumn("dense_rank", dense_rank().over(window_spec)))
    
    df.show()
    

    Output:

    EmpID | Name    | Salary | row_number | rank | dense_rank
    3     | Charlie | 4000   | 1          | 1    | 1
    1     | Alice   | 5000   | 1          | 1    | 1
    4     | Alice   | 7000   | 2          | 2    | 2
    2     | Bob     | 6000   | 1          | 1    | 1

    2. Cumulative Sum

    from pyspark.sql.functions import sum
    
    window_spec = Window.partitionBy("Name").orderBy("Salary").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    
    df = df.withColumn("cumulative_sum", sum("Salary").over(window_spec))
    df.show()
    

    Output:

    EmpID | Name    | Salary | cumulative_sum
    3     | Charlie | 4000   | 4000
    1     | Alice   | 5000   | 5000
    4     | Alice   | 7000   | 12000
    2     | Bob     | 6000   | 6000

    Options for Handling NULLs

    1. Control NULL placement in ordering: use the asc_nulls_last() / desc_nulls_last() column methods (the DataFrame API equivalents of NULLS FIRST / NULLS LAST). Window.orderBy(col("Salary").desc_nulls_last())
    2. Filter NULLs in Partition: Use .filter() before applying the window function. df.filter(col("Salary").isNotNull())

    Important Notes

    • PartitionBy: Breaks data into logical groups for independent calculations.
    • OrderBy: Determines the order within each partition.
    • Frame Specification: Allows cumulative, rolling, or specific-frame calculations using rowsBetween or rangeBetween

    Setting Up the Environment

    First, let’s set up the environment and create a sample dataset.

    from pyspark.sql import SparkSession
    from pyspark.sql.window import Window
    from pyspark.sql.functions import col, row_number, rank, dense_rank, percent_rank, ntile, lag, lead, sum, avg

    # Initialize Spark session
    spark = (SparkSession.builder
             .appName("PySpark Window Functions")
             .getOrCreate())

    # Create a sample dataset
    data = [(1, "Alice", 1000),
    (2, "Bob", 1200),
    (3, "Catherine", 1200),
    (4, "David", 800),
    (5, "Eve", 950),
    (6, "Frank", 800),
    (7, "George", 1200),
    (8, "Hannah", 1000),
    (9, "Ivy", 950),
    (10, "Jack", 1200)]
    columns = ["id", "name", "salary"]

    df = spark.createDataFrame(data, schema=columns)
    df.show()

    PySpark Window Functions

    1. Row Number

    The row_number function assigns a unique number to each row within a window partition.

    windowSpec = Window.partitionBy("salary").orderBy("id")
    df.withColumn("row_number", row_number().over(windowSpec)).show()

    2. Rank

    The rank function provides ranks to rows within a window partition, with gaps in ranking.

    df.withColumn("rank", rank().over(windowSpec)).show()

    3. Dense Rank

    The dense_rank function provides ranks to rows within a window partition, without gaps in ranking.

    df.withColumn("dense_rank", dense_rank().over(windowSpec)).show()

    4. Percent Rank

    The percent_rank function calculates the percentile rank of rows within a window partition.

    df.withColumn("percent_rank", percent_rank().over(windowSpec)).show()
    

    5. NTile

    The ntile function divides the rows within a window partition into n buckets.

    df.withColumn("ntile", ntile(4).over(windowSpec)).show()

    6. Lag

    The lag function provides access to a row at a given physical offset before the current row within a window partition.

    df.withColumn("lag", lag("salary", 1).over(windowSpec)).show()

    7. Lead

    The lead function provides access to a row at a given physical offset after the current row within a window partition.

    df.withColumn("lead", lead("salary", 1).over(windowSpec)).show()

    8. Cumulative Sum

    The sum function calculates the cumulative sum of values within a window partition.

    df.withColumn("cumulative_sum", sum("salary").over(windowSpec)).show()
    

    9. Moving Average

    The avg function calculates the moving average of values within a window partition.

    df.withColumn("moving_avg", avg("salary").over(windowSpec)).show()


    There are multiple ways to apply window functions on DataFrames in PySpark. While withColumn is the most commonly used method to add a new column with a window function, there are other approaches to apply window functions, depending on the specific use case.

    Here are different methods for applying window functions:

    1. Using select() with window functions

    Instead of withColumn(), you can use select() to directly apply a window function to the columns of the DataFrame. This is useful when you only want to return a subset of columns along with the windowed column.

    from pyspark.sql.functions import row_number
    from pyspark.sql import Window
    
    # Define the window specification
    windowSpec = Window.partitionBy("salary").orderBy("id")
    
    # Use select to apply the window function
    df.select("id", "salary", row_number().over(windowSpec).alias("row_number")).show()
    

    2. Using agg() with window functions

    Window functions can also be combined with aggregations: aggregate with agg() first, then apply a window function over the aggregated result (for example, a running total of the per-group sums).

    from pyspark.sql.functions import sum, col
    from pyspark.sql import Window
    
    # Define the window specification over the aggregated rows
    windowSpec = Window.orderBy(col("total_salary").desc())
    
    # Aggregate first, then apply the window function
    df.groupBy("salary") \
      .agg(sum("salary").alias("total_salary")) \
      .withColumn("running_total", sum("total_salary").over(windowSpec)) \
      .show()
    

    3. Using filter() or where()

    Sometimes, window functions are used in conjunction with filters to extract specific rows, such as filtering the first or last row per partition.

    from pyspark.sql.functions import row_number
    
    # Define the window specification
    windowSpec = Window.partitionBy("salary").orderBy("id")
    
    # Apply window function and filter based on the rank
    ranked_df = df.withColumn("row_number", row_number().over(windowSpec))
    ranked_df.filter(ranked_df["row_number"] == 1).show()  # Filter to get the first row per partition
    

    4. Using groupBy() with window functions

    Though groupBy() is usually used for aggregations, you can combine it with window functions. Window functions won’t replace groupBy(), but you can apply them after aggregations.

    from pyspark.sql.functions import rank, col
    
    # Define the window specification over the aggregated result
    # (the grouped DataFrame only has the "salary" and "count" columns)
    windowSpec = Window.orderBy(col("count").desc())
    
    # First, group by some column and then apply a window function
    grouped_df = df.groupBy("salary").count()
    grouped_df.withColumn("rank", rank().over(windowSpec)).show()
    

    5. Using withColumnRenamed() with window functions

    You can also rename the result of a window function when adding it as a new column.

    from pyspark.sql.functions import row_number
    
    # Define the window specification
    windowSpec = Window.partitionBy("salary").orderBy("id")
    
    # Apply the window function and rename the column
    df.withColumn("row_number", row_number().over(windowSpec)).withColumnRenamed("row_number", "rank").show()
    

    6. Combining multiple window functions in one step

    You can apply multiple window functions in a single step using either select() or withColumn().

    from pyspark.sql.functions import row_number, rank
    
    # Define the window specification
    windowSpec = Window.partitionBy("salary").orderBy("id")
    
    # Apply multiple window functions
    df.select("id", "salary",
              row_number().over(windowSpec).alias("row_number"),
              rank().over(windowSpec).alias("rank")
             ).show()

    Comparison with Spark SQL Window Functions

    All the above operations can also be performed using Spark SQL. Here are the equivalent SQL queries:

    1. Row Number

    SELECT id, name, salary,
           ROW_NUMBER() OVER (PARTITION BY salary ORDER BY id) AS row_number
    FROM df
    

    2. Rank

    SELECT id, name, salary,
           RANK() OVER (PARTITION BY salary ORDER BY id) AS rank
    FROM df
    

    3. Dense Rank

    SELECT id, name, salary,
           DENSE_RANK() OVER (PARTITION BY salary ORDER BY id) AS dense_rank
    FROM df
    

    4. Percent Rank

    SELECT id, name, salary,
           PERCENT_RANK() OVER (PARTITION BY salary ORDER BY id) AS percent_rank
    FROM df
    

    5. NTile

    SELECT id, name, salary,
           NTILE(4) OVER (PARTITION BY salary ORDER BY id) AS ntile
    FROM df
    

    6. Lag

    SELECT id, name, salary,
           LAG(salary, 1) OVER (PARTITION BY salary ORDER BY id) AS lag
    FROM df
    

    7. Lead

    SELECT id, name, salary,
           LEAD(salary, 1) OVER (PARTITION BY salary ORDER BY id) AS lead
    FROM df
    

    8. Cumulative Sum

    SELECT id, name, salary,
           SUM(salary) OVER (PARTITION BY salary ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum
    FROM df
    

    9. Moving Average

    SELECT id, name, salary,
    AVG(salary) OVER (PARTITION BY salary ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS moving_avg
    FROM df

    Large Sample Dataset Example

    Let’s create a larger dataset and apply window functions.

    import random
    
    # Create a larger dataset
    large_data = [(i, f"Name_{i}", random.choice([1000, 1200, 950, 800])) for i in range(1, 101)]
    large_df = spark.createDataFrame(large_data, schema=columns)
    large_df.show(10)
    
    # Apply window functions
    large_windowSpec = Window.partitionBy("salary").orderBy("id")
    
    large_df.withColumn("row_number", row_number().over(large_windowSpec)).show(10)
    large_df.withColumn("rank", rank().over(large_windowSpec)).show(10)
    large_df.withColumn("dense_rank", dense_rank().over(large_windowSpec)).show(10)
    large_df.withColumn("percent_rank", percent_rank().over(large_windowSpec)).show(10)
    large_df.withColumn("ntile", ntile(4).over(large_windowSpec)).show(10)
    large_df.withColumn("lag", lag("salary", 1).over(large_windowSpec)).show(10)
    large_df.withColumn("lead", lead("salary", 1).over(large_windowSpec)).show(10)
    large_df.withColumn("cumulative_sum", sum("salary").over(large_windowSpec)).show(10)
    large_df.withColumn("moving_avg", avg("salary").over(large_windowSpec)).show(10)
    

    Window functions in PySpark and Spark SQL are powerful tools for data analysis. They allow you to perform complex calculations and transformations on subsets of your data, similar to SQL window functions. By using window functions, you can easily implement features like ranking, cumulative sums, and moving averages in your PySpark applications.

    Examples:-

    1. PySpark DataFrames: remove duplicates based on specific columns, then order by different columns

    To remove duplicates from a PySpark DataFrame based on specific columns and order the remaining rows by different columns, you can use a combination of the dropDuplicates() function and the orderBy() (or sort()) function.

    Here is an example that demonstrates this process:

    1. Remove duplicates based on specific columns.
    2. Order the resulting DataFrame by different columns.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # Initialize SparkSession
    spark = SparkSession.builder.appName("RemoveDuplicatesAndOrder").getOrCreate()

    # Sample data
    data = [
    (1, "Alice", 29),
    (2, "Bob", 30),
    (3, "Alice", 29),
    (4, "David", 35),
    (5, "Alice", 25)
    ]

    # Create DataFrame
    columns = ["id", "name", "age"]
    df = spark.createDataFrame(data, columns)

    # Show the original DataFrame
    print("Original DataFrame:")
    df.show()

    # Step 1: Remove duplicates based on specific columns (e.g., "name", "age")
    df_no_duplicates = df.dropDuplicates(["name", "age"])

    # Step 2: Order the resulting DataFrame by different columns (e.g., "age" in descending order)
    df_ordered = df_no_duplicates.orderBy(col("age").desc())

    # Show the resulting DataFrame
    print("DataFrame after removing duplicates and ordering:")
    df_ordered.show()

    # Stop SparkSession
    spark.stop()

    Explanation:

    1. Initialization and Data Preparation:
      • A SparkSession is created.
      • Sample data is provided, and a DataFrame is created from this data.
    2. Removing Duplicates:
      • The dropDuplicates() function is used to remove rows that have the same values in the specified columns ("name" and "age" in this case).
    3. Ordering Data:
      • The orderBy() function is used to sort the DataFrame by the specified columns. In this case, the DataFrame is ordered by "age" in descending order.
    4. Displaying Results:
      • The original and resulting DataFrames are displayed using the show() function.

    Example Output:

    Original DataFrame:
    +---+-----+---+
    | id| name|age|
    +---+-----+---+
    |  1|Alice| 29|
    |  2|  Bob| 30|
    |  3|Alice| 29|
    |  4|David| 35|
    |  5|Alice| 25|
    +---+-----+---+

    DataFrame after removing duplicates and ordering:
    +---+-----+---+
    | id| name|age|
    +---+-----+---+
    |  4|David| 35|
    |  2|  Bob| 30|
    |  1|Alice| 29|
    |  5|Alice| 25|
    +---+-----+---+

    Additional Notes:

    • dropDuplicates(): This function removes duplicate rows based on the specified columns. If no columns are specified, it removes rows that are identical in all columns.
    • orderBy()/sort(): These functions are used to order the DataFrame. You can specify multiple columns and the sorting order (ascending or descending) for each column.
    • You can chain multiple DataFrame operations together. For example, you can combine dropDuplicates() and orderBy() in a single statement: df.dropDuplicates(["name", "age"]).orderBy(col("age").desc())

    2. PySpark DataFrames: remove duplicates based on specific columns while ordering by different columns

    To remove duplicates from a PySpark DataFrame based on specific columns while ensuring the ordering of the data based on other columns, you can use the window functions in PySpark. This approach allows you to specify how to handle duplicates and maintain the desired order.

    Below is an example to demonstrate how to remove duplicates based on specific columns (name, age) while ordering the rows by different columns (age descending and id ascending):

    from pyspark.sql import SparkSession
    from pyspark.sql.window import Window
    from pyspark.sql.functions import col, row_number
    
    # Initialize SparkSession
    spark = SparkSession.builder.appName("RemoveDuplicatesWithOrdering").getOrCreate()
    
    # Sample data
    data = [
        (1, "Alice", 29),
        (2, "Bob", 30),
        (3, "Alice", 29),
        (4, "David", 35),
        (5, "Alice", 25)
    ]
    
    # Create DataFrame
    columns = ["id", "name", "age"]
    df = spark.createDataFrame(data, columns)
    
    # Show the original DataFrame
    print("Original DataFrame:")
    df.show()
    
    # Define a window specification
    windowSpec = Window.partitionBy("name", "age").orderBy(col("age").desc(), col("id").asc())
    
    # Add a row number to each partition
    df_with_row_number = df.withColumn("row_number", row_number().over(windowSpec))
    
    # Filter the rows to keep only the first occurrence of each combination of name and age
    df_no_duplicates = df_with_row_number.filter(col("row_number") == 1).drop("row_number")
    
    # Show the resulting DataFrame
    print("DataFrame after removing duplicates and ordering:")
    df_no_duplicates.show()
    
    # Stop SparkSession
    spark.stop()
    

    Explanation:

    1. Initialization and Data Preparation:
      • A SparkSession is created.
      • Sample data is provided, and a DataFrame is created from this data.
    2. Define Window Specification:
      • A window specification is defined using Window.partitionBy("name", "age").orderBy(col("age").desc(), col("id").asc()). This specifies that the data should be partitioned by the columns name and age and ordered within each partition by age in descending order and id in ascending order.
    3. Add Row Number:
      • The row_number() function is used to add a row number to each row within the specified window. This row number helps to identify and keep only the first occurrence of each partition.
    4. Filter Rows:
      • The DataFrame is filtered to keep only the rows where the row_number is 1. This effectively removes duplicates while maintaining the specified order.
    5. Display Results:
      • The original and resulting DataFrames are displayed using the show() function.

    Example Output:

    Original DataFrame:
    +---+-----+---+
    | id| name|age|
    +---+-----+---+
    |  1|Alice| 29|
    |  2|  Bob| 30|
    |  3|Alice| 29|
    |  4|David| 35|
    |  5|Alice| 25|
    +---+-----+---+

    DataFrame after removing duplicates and ordering:
    +---+-----+---+
    | id| name|age|
    +---+-----+---+
    |  4|David| 35|
    |  2|  Bob| 30|
    |  1|Alice| 29|
    |  5|Alice| 25|
    +---+-----+---+

    Notes:

    • Window Specification: The Window specification defines how the data should be partitioned and ordered.
    • Row Number: The row_number() function assigns a unique row number within each window partition.
    • Filter and Drop: The resulting DataFrame is filtered to keep only the rows where the row_number is 1, and the temporary row_number column is dropped.

    This approach ensures that duplicates are removed based on the specified columns while maintaining the desired order of the data. You can adjust the partitioning and ordering columns according to your specific requirements.


    Codes for Revising Pyspark Windows

    
    # Sample data
    data = [
        (1, "Alice", 10, 8000, "New York"),
        (2, "Bob", 11, 9000, "New York"),
        (3, "Charlie", 10, 10000, "Chicago"),
        (4, "David", 12, 9000, "New York"),
        (6, "Eve", 13, 9000, "Chicago"),
        (7, "GEve", 13, 10000, "Chicago"),
        (8, "REve", 13, 5000, "Chicago"),
        (9, "ScEve", 14, 5600, "LA"),
        (10, "DEve", 15, 11000, "LA"),
        (11, "Ram", 14, 11000, "LA"),
        (12, "Hem", 10, 8000, "LA"),
        (13, "Hith", 11, 6000, "Chicago"),
        (14, "Pit", 15, 13000, "Chicago"),
        (15, "Evelyn", 15, 14000, "New York"),
        (16, "FteEve", 12, 9200, "New York"),
        (17, "sctuEve", 12, None, "Chicago"),
    ]
    
    # Define schema
    columns = ["EmpID", "Emp_name", "Manager_id", "Salary", "Location"]
    
    df = spark.createDataFrame(data, schema=columns)
    df.show()
    
    from pyspark.sql.functions import row_number, rank, dense_rank
    from pyspark.sql.window import Window
    from pyspark.sql.functions import row_number, rank, dense_rank, col
    
    wf=Window.partitionBy("Location").orderBy("Salary")
    
    # Calculate row_number, rank, and dense_rank separately
    row_number_col = row_number().over(wf).alias("row_number")
    rank_col = rank().over(wf).alias("rank")
    dense_rank_col = dense_rank().over(wf).alias("dense_rank")
    
    # Select columns including calculated window function results
    df.select(
        "EmpID",
        "Emp_name",
        "Manager_id",
        "salary",
        "Location",
        row_number_col,
        rank_col,
        dense_rank_col
    ).show()
    
    
    df.select(
        "EmpID",
        "Emp_name",
        "Manager_id",
        "salary",
        "Location",
        row_number().over(wf).alias("row_number"),
        rank().over(wf).alias("rank"),
        dense_rank().over(wf).alias("dense_rank")
    ).show()
    
    
    #Using withColumn with window functions
    
    
    df.withColumn("row_number", row_number().over(wf))
      .withColumn("rank", rank().over(wf))
      .withColumn("dense_rank", dense_rank().over(wf))
      .show()
    
    #Using selectExpr with window functions
    
    df.selectExpr(
        "EmpID",
        "Emp_name",
        "Manager_id",
        "salary",
        "Location",
        "row_number() OVER (PARTITION BY Location ORDER BY Salary) AS row_number",  # Define window here
        "rank() OVER (PARTITION BY Location ORDER BY Salary) AS rank",          # Define window here
        "dense_rank() OVER (PARTITION BY Location ORDER BY Salary) AS dense_rank"  # Define window here
    ).show()
    
    #Using withColumn with window functions and chaining
    
    df.withColumn("row_number", row_number().over(wf))
      .withColumn("rank", rank().over(wf))
      .withColumn("dense_rank", dense_rank().over(wf))
      .drop("salary")
      .filter(col("row_number") == 1)
      .show()
    
    df.createOrReplaceTempView("dfview")
    spark.sql(""" select EmpID,Emp_name,Manager_id,salary,Location,row_number() OVER (PARTITION BY Location ORDER BY Salary) AS row_number,
              rank() OVER (PARTITION BY Location ORDER BY Salary) AS rank,dense_rank() OVER (PARTITION BY Location ORDER BY Salary) AS dense_rank
               from dfview """ ) .show()
    
    
    spark.sql("""
    SELECT EmpID, Emp_name, Manager_id, salary, Location,
           row_number() OVER w AS row_number,
           rank() OVER w AS rank,
           dense_rank() OVER w AS dense_rank
    FROM dfview
    WINDOW w AS (PARTITION BY Location ORDER BY Salary)
    """).show()           
  • For a better understanding of Spark SQL window functions and their best use cases, refer to our post Window functions in Oracle Pl/Sql and Hive explained and compared with examples.

    Window functions in Spark SQL are powerful tools that allow you to perform calculations across a set of table rows that are somehow related to the current row. These functions are particularly useful for tasks that require a global view of the data, such as running totals, ranking, and time-series analysis.

    Overview of Window Functions

    In Spark SQL, window functions are used with the OVER clause and fall into several categories: aggregate functions (e.g., SUM, AVG, COUNT), ranking functions (e.g., ROW_NUMBER, RANK, DENSE_RANK, NTILE), and value functions (e.g., LEAD, LAG, FIRST_VALUE, LAST_VALUE).

    Window Specification

    To use a window function, you need to define a window specification that includes:

    • Partitioning: Defines the subsets of data to which the function is applied.
    • Ordering: Defines the order of rows within each partition.
    • Frame: Defines the subset of rows relative to the current row.

    Basic Syntax of a Window Function in Spark SQL

    SELECT column1, column2, 
    aggregate_function(column) OVER (PARTITION BY column_name ORDER BY column_name [ROWS/RANGE frame_clause]) AS alias
    FROM table_name;

    Components of a Window Function

    1. Aggregate or Analytical Function:
      • Examples: SUM(), AVG(), COUNT(), ROW_NUMBER(), RANK(), etc.
    2. OVER Clause:
      • Specifies the window over which the function operates.
    3. PARTITION BY Clause (optional):
      • Divides the dataset into partitions.
      • Each partition is processed independently.
    4. ORDER BY Clause (optional):
      • Specifies the ordering of rows within a partition.
    5. Frame Specification (optional):
      • Defines the range of rows within a partition for computation.
      • Options: ROWS, RANGE.


    1. PARTITION BY Clause

    • Divides the dataset into groups (partitions) for which the window function is applied independently.
    • Syntax:PARTITION BY column_name
    • Example:AVG(Salary) OVER (PARTITION BY Department) This computes the average salary separately for each department.

    2. ORDER BY Clause

    • Defines the order of rows within each partition.
    • Syntax:ORDER BY column_name [ASC | DESC] [NULLS FIRST | NULLS LAST]
      • ASC: Ascending order (default).
      • DESC: Descending order.
      • NULLS FIRST: Places NULL values at the beginning of the order.
      • NULLS LAST: Places NULL values at the end of the order.
    • Example:RANK() OVER (PARTITION BY Department ORDER BY Salary DESC NULLS LAST)

    3. Frame Specification

    • Defines the range of rows to consider for the window function.
    • Frame types:
      • ROWS: Based on row positions.
      • RANGE: Based on value ranges.

    Frame Syntax

    [ROWS | RANGE] BETWEEN frame_start AND frame_end

    Frame Options

    1. UNBOUNDED PRECEDING:
      • Includes all rows from the beginning of the partition.
      • Example: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
    2. CURRENT ROW:
      • Includes the current row only.
      • Example: ROWS BETWEEN CURRENT ROW AND CURRENT ROW.
    3. UNBOUNDED FOLLOWING:
      • Includes all rows until the end of the partition.
      • Example: ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING.
    4. Specific Number of Rows:
      • Includes a fixed number of preceding or following rows.
      • Example: ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING.

    Examples

    • Cumulative sum: SUM(Sales) OVER (PARTITION BY CustomerID ORDER BY OrderDate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    • Moving average: AVG(Salary) OVER (PARTITION BY Department ORDER BY Salary ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
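
    For a value-based frame, RANGE interprets the bounds in the units of the ORDER BY column rather than as row counts; a minimal sketch, assuming a numeric Salary ordering column:

    -- Sum of salaries within 1000 of the current row's salary
    SUM(Salary) OVER (
      PARTITION BY Department
      ORDER BY Salary
      RANGE BETWEEN 1000 PRECEDING AND CURRENT ROW
    )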

    4. Handling NULL Values in ORDER BY

    • By default:
      • NULLS FIRST for ascending order.
      • NULLS LAST for descending order.

    Example

    RANK() OVER (PARTITION BY Department ORDER BY Salary DESC NULLS FIRST)

    5. Supported Functions with OVER Clause

    Spark SQL supports many window functions, including:

    Aggregate Functions

    • SUM(), AVG(), COUNT(), MIN(), MAX()

    Ranking Functions

    • ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE(n)

    Value Functions

    • FIRST_VALUE(), LAST_VALUE(), LEAD(column, offset, default), LAG(column, offset, default)

    6. Combining PARTITION BY, ORDER BY, and Frames

    You can combine all components for more control:

    Example:

    SELECT EmpID,
           Department,
           Salary,
           AVG(Salary) OVER (PARTITION BY Department ORDER BY Salary ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS running_avg,
           RANK() OVER (PARTITION BY Department ORDER BY Salary DESC NULLS LAST) AS salary_rank
    FROM Employee;
    

    This example:

    1. Computes a running average for the last three salaries within each department.
    2. Ranks employees by salary in descending order, placing NULL salaries at the end.

    Summary of Options

    Clause | Description | Syntax Example
    PARTITION BY | Divides rows into partitions | PARTITION BY Department
    ORDER BY | Orders rows within partitions | ORDER BY Salary DESC NULLS FIRST
    ROWS | Row-based range for the window frame | ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    RANGE | Value-based range for the window frame | RANGE BETWEEN INTERVAL 1 DAY PRECEDING AND CURRENT ROW
    NULLS | Specifies how NULLs are handled in ordering | ORDER BY Salary DESC NULLS LAST

    By combining these options, Spark SQL allows flexible and efficient computation over distributed datasets!



    Examples

    1. Rank Employees by Salary in Each Department

    SELECT EmpID,
           Emp_name,
           Department,
           Salary,
           RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) AS rank
    FROM Employee;
    

    Explanation:

    • PARTITION BY Department: Separate ranking by department.
    • ORDER BY Salary DESC: Ranks employees by salary in descending order.

    2. Calculate Cumulative Sum of Sales

    SELECT CustomerID, 
    OrderDate,
    ProductID,
    SUM(Sales) OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS cumulative_sales
    FROM Orders;

    Explanation:

    • PARTITION BY CustomerID: Cumulative sales calculated per customer.
    • ORDER BY OrderDate: Sales are accumulated in chronological order.

    3. Find Running Average Salary

    SELECT EmpID,
           Emp_name,
           Department,
           AVG(Salary) OVER (PARTITION BY Department ORDER BY Salary ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS running_avg
    FROM Employee;
    

    Explanation:

    • ROWS BETWEEN 2 PRECEDING AND CURRENT ROW: Calculates the average salary of the current row and the two preceding rows.

    4. Lead and Lag Example

    SELECT EmpID,
           Emp_name,
           Department,
           Salary,
           LAG(Salary, 1, 0) OVER (PARTITION BY Department ORDER BY Salary) AS previous_salary,
           LEAD(Salary, 1, 0) OVER (PARTITION BY Department ORDER BY Salary) AS next_salary
    FROM Employee;
    

    Explanation:

    • LAG(Salary, 1, 0): Fetches the previous salary within the partition. Default is 0 if there is no previous value.
    • LEAD(Salary, 1, 0): Fetches the next salary within the partition. Default is 0 if there is no next value.

    5. Find Employees Above Average Salary in Their Department

    SELECT EmpID,
           Emp_name,
           Department,
           Salary,
           avg_salary
    FROM (
        SELECT EmpID,
               Emp_name,
               Department,
               Salary,
               AVG(Salary) OVER (PARTITION BY Department) AS avg_salary
        FROM Employee
    ) t
    WHERE Salary > avg_salary;
    

    Explanation:

    • AVG(Salary) OVER (PARTITION BY Department): Calculates the average salary for each department.
    • Window functions cannot be used directly in a WHERE clause, so the windowed average is computed in a subquery and the outer WHERE clause keeps employees whose salary is above their department average.

    Notes

    1. Spark SQL does not support DISTINCT in window functions.
      • For example, COUNT(DISTINCT column) OVER (...) is not allowed in Spark SQL (a workaround is sketched after this list).
    2. Optimized for Distributed Computing:
      • Unlike traditional SQL, Spark SQL handles large-scale datasets by distributing computations across the cluster.
    3. Frame Specification:
      • Use ROWS for row-level computations.
      • Use RANGE for value-based computations.
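
    A commonly used workaround for the DISTINCT limitation in note 1 is to collect the distinct values over the window and count them; a hedged sketch, assuming a hypothetical orders table and a Spark version that accepts collect_set as a window aggregate:

    SELECT customer_id,
           order_date,
           SIZE(COLLECT_SET(product_id) OVER (PARTITION BY customer_id)) AS distinct_products
    FROM orders;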

    Integration in PySpark

    Spark SQL window functions can also be used within PySpark via .selectExpr():

    df.selectExpr("EmpID", 
    "Salary",
    "AVG(Salary) OVER (PARTITION BY Department ORDER BY EmpID) AS avg_salary").show()

    Best Use Cases for Window Functions

    1. Running Totals and Moving Averages

    Calculate running totals or moving averages over a specified window of rows.

    SELECT
        customer_id,
        order_date,
        amount,
        SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
    FROM orders;
    

    2. Ranking

    Assign ranks to rows based on the values in a specific column.

    SELECT
        customer_id,
        order_date,
        amount,
        RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rank
    FROM orders;
    

    3. Time-Series Analysis

    Perform operations like lead and lag for time-series data.

    SELECT
        customer_id,
        order_date,
        amount,
        LAG(amount, 1) OVER (PARTITION BY customer_id ORDER BY order_date) AS previous_amount,
        LEAD(amount, 1) OVER (PARTITION BY customer_id ORDER BY order_date) AS next_amount
    FROM orders;
    

    4. Percentile and Distribution Analysis

    Compute percentiles and cumulative distributions.

    SELECT
        customer_id,
        order_date,
        amount,
        PERCENT_RANK() OVER (PARTITION BY customer_id ORDER BY amount) AS percent_rank,
        CUME_DIST() OVER (PARTITION BY customer_id ORDER BY amount) AS cumulative_distribution
    FROM orders;
    

    5. Identifying First and Last Records in Groups

    Identify the first and last records within each group.

    SELECT
        customer_id,
        order_date,
        amount,
        FIRST_VALUE(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS first_amount,
        LAST_VALUE(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS last_amount
    FROM orders;
    

    6. For SAS programmers: how to translate SAS first. and last. into Spark SQL

    In Spark SQL, there’s no direct equivalent of SAS FIRST. and LAST. variables. However, you can achieve similar functionality using window functions like row_number() or first_value() and last_value() depending on your specific needs.

    Here’s how you can translate the concept in Spark SQL:

    1. Using row_number():

    This approach assigns a unique row number within each group defined by the BY variable(s). You can then use conditional logic to identify the first and last rows based on the row number.

    SELECT *,
           CASE WHEN row_number() OVER (PARTITION BY variable1, variable2 ORDER BY sort_column) = 1 THEN 1 ELSE 0 END AS is_first,
           CASE WHEN row_number() OVER (PARTITION BY variable1, variable2 ORDER BY sort_column) = COUNT(*) OVER (PARTITION BY variable1, variable2) THEN 1 ELSE 0 END AS is_last
    FROM your_data
    ORDER BY variable1, variable2, sort_column;
    

    Explanation:

    • We use row_number() with a PARTITION BY clause based on your BY variable(s) and an ORDER BY clause to define the order within each group (replace sort_column with the actual column for sorting).
    • The CASE statement assigns 1 to the first row (where row_number is 1) and 0 to others within the group for is_first.
    • Similarly, the second CASE statement identifies the last row by checking if the row_number is equal to the total count of rows within the group (using COUNT(*) OVER (PARTITION BY...)).

    2. Using first_value() and last_value():

    These functions can be used if you only need the values of specific columns for the first and last rows in each group.

    SELECT *,
           first_value(column_to_get) OVER (PARTITION BY variable1, variable2 ORDER BY sort_column) AS first_value,
           last_value(column_to_get) OVER (PARTITION BY variable1, variable2 ORDER BY sort_column) AS last_value
    FROM your_data
    ORDER BY variable1, variable2, sort_column;
    

    Explanation:

    • first_value() and last_value() are used within the SELECT clause with a similar windowing definition as before.
    • They return the value of the specified column_to_get for the first and last rows within each group, respectively.

    Choosing the Right Approach:

    • If you need additional information beyond just identifying the first and last rows (like row numbers), use row_number().
    • If you only need the values of specific columns for the first and last rows, use first_value() and last_value().

    Remember:

    • Ensure your data is sorted by the BY variable(s) before applying these window functions.
    • Adjust the windowing specifications (partitioning and ordering) based on your actual requirements.

    By leveraging these window functions in Spark SQL, you can effectively achieve functionalities similar to SAS FIRST. and LAST. variables for identifying and processing the first and last observations within groups defined by your BY variable(s).

    Scenario

    Assume we have a dataset with columns group1, group2, order1, and order2, and we want to identify the first and last occurrence of each group based on the combined ordering of order1 and order2.

    SAS Code Example

    data example;
    set example;
    by group1 group2 order1 order2;
    if first.group1 then first_group1 = 1;
    else first_group1 = 0;
    if last.group1 then last_group1 = 1;
    else last_group1 = 0;
    if first.group2 then first_group2 = 1;
    else first_group2 = 0;
    if last.group2 then last_group2 = 1;
    else last_group2 = 0;
    run;

    Equivalent in Spark SQL

    To translate this logic into Spark SQL, you can use the ROW_NUMBER function along with window functions to identify the first and last occurrence of each group.

    Spark SQL Code Example

    1. Create the DataFrame and Register as a Temporary View:
      • This step involves creating a DataFrame and registering it as a temporary view for SQL operations.
    2. Define the Window Specification:
      • The window specification is partitioned by the grouping variables and ordered by the ordering variables.
    3. Identify the First and Last Occurrences:
      • Use ROW_NUMBER with appropriate ordering to mark the first and last occurrences.

    Here’s how you can achieve this:

    from pyspark.sql import SparkSession

    # Initialize Spark session
    spark = SparkSession.builder.appName("SAS First Last Equivalent").getOrCreate()

    # Sample data
    data = [
    ("group1", "subgroup1", "2023-01-01", "10:00", 100),
    ("group1", "subgroup1", "2023-01-01", "10:30", 200),
    ("group1", "subgroup2", "2023-01-01", "11:00", 300),
    ("group2", "subgroup1", "2023-01-01", "09:00", 150),
    ("group2", "subgroup1", "2023-01-01", "09:30", 250),
    ("group2", "subgroup2", "2023-01-02", "08:00", 130),
    ("group2", "subgroup2", "2023-01-02", "08:30", 170)
    ]

    columns = ["group1", "group2", "order1", "order2", "amount"]

    # Create DataFrame
    df = spark.createDataFrame(data, columns)

    # Register the DataFrame as a temporary view
    df.createOrReplaceTempView("example")

    # Use Spark SQL to identify first and last occurrences
    sql_query = """
    WITH RankedData AS (
    SELECT
    group1,
    group2,
    order1,
    order2,
    amount,
    ROW_NUMBER() OVER (PARTITION BY group1 ORDER BY order1, order2) AS rn_group1_asc,
    ROW_NUMBER() OVER (PARTITION BY group1 ORDER BY order1 DESC, order2 DESC) AS rn_group1_desc,
    ROW_NUMBER() OVER (PARTITION BY group1, group2 ORDER BY order1, order2) AS rn_group2_asc,
    ROW_NUMBER() OVER (PARTITION BY group1, group2 ORDER BY order1 DESC, order2 DESC) AS rn_group2_desc
    FROM example
    )
    SELECT
    group1,
    group2,
    order1,
    order2,
    amount,
    CASE WHEN rn_group1_asc = 1 THEN 1 ELSE 0 END AS first_group1,
    CASE WHEN rn_group1_desc = 1 THEN 1 ELSE 0 END AS last_group1,
    CASE WHEN rn_group2_asc = 1 THEN 1 ELSE 0 END AS first_group2,
    CASE WHEN rn_group2_desc = 1 THEN 1 ELSE 0 END AS last_group2
    FROM RankedData
    """

    # Execute the SQL query
    result_df = spark.sql(sql_query)

    # Show the result
    result_df.show()

    Explanation

    1. RankedData Common Table Expression (CTE):
      • We create a CTE called RankedData where we calculate row numbers for each group based on the specified order.
      • ROW_NUMBER() OVER (PARTITION BY group1 ORDER BY order1, order2) assigns a row number to each row within group1 ordered by order1 and order2.
      • ROW_NUMBER() OVER (PARTITION BY group1 ORDER BY order1 DESC, order2 DESC) assigns a reverse row number to identify the last occurrence within group1.
      • The same logic is applied for group2, but partitioned by both group1 and group2 so that the flags reset within each group1 value, matching SAS first.group2/last.group2 behavior.
    2. Select with CASE Statements:
      • We select from the RankedData CTE and use CASE statements to mark the first and last occurrences for each group.

    Sample Output

    The resulting DataFrame will include flags indicating the first and last occurrences for each group:

    +------+---------+----------+------+------+------------+-----------+------------+-----------+
    |group1|   group2|    order1|order2|amount|first_group1|last_group1|first_group2|last_group2|
    +------+---------+----------+------+------+------------+-----------+------------+-----------+
    |group1|subgroup1|2023-01-01| 10:00|   100|           1|          0|           1|          0|
    |group1|subgroup1|2023-01-01| 10:30|   200|           0|          0|           0|          1|
    |group1|subgroup2|2023-01-01| 11:00|   300|           0|          1|           1|          1|
    |group2|subgroup1|2023-01-01| 09:00|   150|           1|          0|           1|          0|
    |group2|subgroup1|2023-01-01| 09:30|   250|           0|          0|           0|          1|
    |group2|subgroup2|2023-01-02| 08:00|   130|           0|          0|           1|          0|
    |group2|subgroup2|2023-01-02| 08:30|   170|           0|          1|           0|          1|
    +------+---------+----------+------+------+------------+-----------+------------+-----------+
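
    As a side note (not part of the original SAS example), the descending ROW_NUMBER windows can be avoided by comparing an ascending ROW_NUMBER with COUNT(*) over the same partition. A minimal sketch against the same example view, shown for the group1 flags only (the group2 flags follow the same pattern with PARTITION BY group1, group2):

    SELECT group1, group2, order1, order2, amount,
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY group1 ORDER BY order1, order2) = 1
                THEN 1 ELSE 0 END AS first_group1,
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY group1 ORDER BY order1, order2)
                   = COUNT(*) OVER (PARTITION BY group1)
                THEN 1 ELSE 0 END AS last_group1
    FROM example;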

    Examples from the official Spark SQL documentation:
    CREATE TABLE employees (name STRING, dept STRING, salary INT, age INT);

    INSERT INTO employees VALUES ("Lisa", "Sales", 10000, 35);
    INSERT INTO employees VALUES ("Evan", "Sales", 32000, 38);
    INSERT INTO employees VALUES ("Fred", "Engineering", 21000, 28);
    INSERT INTO employees VALUES ("Alex", "Sales", 30000, 33);
    INSERT INTO employees VALUES ("Tom", "Engineering", 23000, 33);
    INSERT INTO employees VALUES ("Jane", "Marketing", 29000, 28);
    INSERT INTO employees VALUES ("Jeff", "Marketing", 35000, 38);
    INSERT INTO employees VALUES ("Paul", "Engineering", 29000, 23);
    INSERT INTO employees VALUES ("Chloe", "Engineering", 23000, 25);
    INSERT INTO employees VALUES ("Helen", "Marketing", 29000, 40);

    SELECT * FROM employees;
    +-----+-----------+------+---+
    | name|       dept|salary|age|
    +-----+-----------+------+---+
    |Chloe|Engineering| 23000| 25|
    | Fred|Engineering| 21000| 28|
    | Paul|Engineering| 29000| 23|
    |Helen|  Marketing| 29000| 40|
    |  Tom|Engineering| 23000| 33|
    | Jane|  Marketing| 29000| 28|
    | Jeff|  Marketing| 35000| 38|
    | Evan|      Sales| 32000| 38|
    | Lisa|      Sales| 10000| 35|
    | Alex|      Sales| 30000| 33|
    +-----+-----------+------+---+

    SELECT name, dept, salary, RANK() OVER (PARTITION BY dept ORDER BY salary) AS rank FROM employees;
    +-----+-----------+------+----+
    | name|       dept|salary|rank|
    +-----+-----------+------+----+
    | Lisa|      Sales| 10000|   1|
    | Alex|      Sales| 30000|   2|
    | Evan|      Sales| 32000|   3|
    | Fred|Engineering| 21000|   1|
    |  Tom|Engineering| 23000|   2|
    |Chloe|Engineering| 23000|   2|
    | Paul|Engineering| 29000|   4|
    |Helen|  Marketing| 29000|   1|
    | Jane|  Marketing| 29000|   1|
    | Jeff|  Marketing| 35000|   3|
    +-----+-----------+------+----+

    SELECT name, dept, salary, DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary ROWS BETWEEN
    UNBOUNDED PRECEDING AND CURRENT ROW) AS dense_rank FROM employees;
    +-----+-----------+------+----------+
    | name|       dept|salary|dense_rank|
    +-----+-----------+------+----------+
    | Lisa|      Sales| 10000|         1|
    | Alex|      Sales| 30000|         2|
    | Evan|      Sales| 32000|         3|
    | Fred|Engineering| 21000|         1|
    |  Tom|Engineering| 23000|         2|
    |Chloe|Engineering| 23000|         2|
    | Paul|Engineering| 29000|         3|
    |Helen|  Marketing| 29000|         1|
    | Jane|  Marketing| 29000|         1|
    | Jeff|  Marketing| 35000|         2|
    +-----+-----------+------+----------+

    SELECT name, dept, age, CUME_DIST() OVER (PARTITION BY dept ORDER BY age
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cume_dist FROM employees;
    +-----+-----------+---+------------------+
    | name|       dept|age|         cume_dist|
    +-----+-----------+---+------------------+
    | Alex|      Sales| 33|0.3333333333333333|
    | Lisa|      Sales| 35|0.6666666666666666|
    | Evan|      Sales| 38|               1.0|
    | Paul|Engineering| 23|              0.25|
    |Chloe|Engineering| 25|               0.5|
    | Fred|Engineering| 28|              0.75|
    |  Tom|Engineering| 33|               1.0|
    | Jane|  Marketing| 28|0.3333333333333333|
    | Jeff|  Marketing| 38|0.6666666666666666|
    |Helen|  Marketing| 40|               1.0|
    +-----+-----------+---+------------------+

    SELECT name, dept, salary, MIN(salary) OVER (PARTITION BY dept ORDER BY salary) AS min
    FROM employees;
    +-----+-----------+------+-----+
    | name|       dept|salary|  min|
    +-----+-----------+------+-----+
    | Lisa|      Sales| 10000|10000|
    | Alex|      Sales| 30000|10000|
    | Evan|      Sales| 32000|10000|
    |Helen|  Marketing| 29000|29000|
    | Jane|  Marketing| 29000|29000|
    | Jeff|  Marketing| 35000|29000|
    | Fred|Engineering| 21000|21000|
    |  Tom|Engineering| 23000|21000|
    |Chloe|Engineering| 23000|21000|
    | Paul|Engineering| 29000|21000|
    +-----+-----------+------+-----+
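
    The following LAG/LEAD example is not taken from the official documentation, but it runs against the same employees table and shows how to reference the neighbouring rows within each department:

    SELECT name, dept, salary,
           LAG(salary)  OVER (PARTITION BY dept ORDER BY salary) AS prev_salary,
           LEAD(salary) OVER (PARTITION BY dept ORDER BY salary) AS next_salary
    FROM employees;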

    6. Another Scenario in SAS: translating PROC SORT with BY (ordering by multiple keys) and NODUPKEY (removing duplicate observations) into Spark SQL.

    SAS Scenario:

    Imagine you have a dataset named sales_data containing columns like customer_id, product_id, date, and sales_amount. You want to:

    1. Sort the data by customer_id, product_id, and date (in ascending order).
    2. Remove any duplicate observations based on the combination of customer_id, product_id, and date.

    SAS Code:

    proc sort data=sales_data;
      by customer_id product_id date;
      nodupkey;
    run;
    

    Spark SQL Translation:

    There are two main approaches to achieve this functionality in Spark SQL:

    Approach 1: Using DISTINCT and ORDER BY:

    SELECT DISTINCT customer_id, product_id, date, sales_amount
    FROM sales_data
    ORDER BY customer_id, product_id, date;
    

    Explanation:

    • DISTINCT: This clause eliminates duplicate rows based on all columns in the SELECT list. Because sales_amount is included here, rows that share the same customer_id, product_id, and date but differ in sales_amount are all kept, so this matches NODUPKEY only when duplicates agree on every selected column.
    • ORDER BY: This clause sorts the resulting unique data based on the specified order (ascending for all three columns).

    Approach 2: Using ROW_NUMBER() and WHERE:

    SELECT customer_id, product_id, date, sales_amount
    FROM (
      SELECT customer_id, product_id, date, sales_amount,
             row_number() OVER (PARTITION BY customer_id, product_id, date ORDER BY date) AS row_num
      FROM sales_data
    ) AS ranked_data
    WHERE row_num = 1;
    

    Explanation:

    1. Window Function (row_number):
      • A subquery assigns a unique row number (row_num) within each group defined by customer_id, product_id, and date.
      • Because date is already part of the PARTITION BY key, the ORDER BY here is effectively arbitrary; order by whichever column should decide which duplicate survives (for example, a load timestamp).
    2. Filtering Unique Rows:
      • The outer query filters the ranked data to keep only rows where row_num is 1. This selects the first occurrence for each unique combination of customer_id, product_id, and date.

    Choosing the Right Approach:

    • If your primary goal is sorting and eliminating duplicates efficiently, DISTINCT with ORDER BY might be a simpler solution.
    • If you need the row number information for further processing, or if the data volume is large and DISTINCT performance becomes an issue, the window function approach (ROW_NUMBER()) might be preferable.

    Additional Considerations:

    • Both approaches achieve the desired outcome of removing duplicate observations based on the specified key combination and sorting the remaining data.
    • You can adjust the ORDER BY clause to define a different sorting order if needed.

    By using either of these approaches in Spark SQL, you can effectively translate the functionality of SAS PROC SORT with BY and NODUPKEY for handling multiple sorting keys and deduplication within your Spark DataFrame.
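
    Whichever approach you choose, a quick sanity check confirms that no key combination survives more than once. In the sketch below, deduped_sales_data is a placeholder name for the result of either query:

    SELECT customer_id, product_id, date, COUNT(*) AS cnt
    FROM deduped_sales_data   -- placeholder for the deduplicated result
    GROUP BY customer_id, product_id, date
    HAVING COUNT(*) > 1;      -- should return zero rows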

    7. SAS to Spark SQL

    data xyz;
      set abc;
      by xy;
      retain x1 y1;
      length x1 $400 y1 $100;
      if first.xy then do;
        x1 = ' ';
        y1 = ' ';
      end;
      x1 = catx('', trim(x1), x2);
      y1 = catx("", trim(y1), y2);
      if last.xy then output;
    run;

    The provided SAS code performs a group-by operation on xy: it retains values across rows within each group, concatenates them, and outputs one row per group with the final concatenated result. Note that the Spark translation below joins the collected values with '_'; adjust the CONCAT_WS delimiter if you want the space-separated behaviour of CATX.

    # Define and execute SQL query
    sql_query = """
    WITH concatenated AS (
        SELECT
            xy,
            COLLECT_LIST(x2) OVER (PARTITION BY xy) AS x2_list,
            COLLECT_LIST(y2) OVER (PARTITION BY xy) AS y2_list,
            ROW_NUMBER() OVER (PARTITION BY xy ORDER BY xy) AS row_num,
            COUNT(*) OVER (PARTITION BY xy) AS count_xy
        FROM abc
    )
    SELECT
        xy,
        CONCAT_WS('_', x2_list) AS x1,
        CONCAT_WS('_', y2_list) AS y1
    FROM concatenated
    WHERE row_num = count_xy
    """
    
    # Execute the SQL query
    result = spark.sql(sql_query)
    
    # Show the results
    result.show()
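
    If the intermediate window columns are not needed, a plain GROUP BY is usually simpler and cheaper. A minimal Spark SQL sketch against the same abc table, joining with a space (closer to CATX) instead of '_':

    SELECT xy,
           CONCAT_WS(' ', COLLECT_LIST(x2)) AS x1,
           CONCAT_WS(' ', COLLECT_LIST(y2)) AS y1
    FROM abc
    GROUP BY xy;

    Keep in mind that COLLECT_LIST does not guarantee element order, so add an explicit ordering step if the concatenation order matters.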
    

  • PySpark Architecture Cheat Sheet

    1. Core Components of PySpark

    ComponentDescriptionKey Features
    Spark CoreThe foundational Spark component for scheduling, memory management, and fault tolerance.Task scheduling, data partitioning, RDD APIs.
    Spark SQLEnables interaction with structured data via SQL, DataFrames, and Datasets.Supports SQL queries, schema inference, integration with Hive.
    Spark StreamingAllows real-time data processing through micro-batching.DStreams, integration with Kafka, Flume, etc.
    Spark MLlibProvides scalable machine learning algorithms.Algorithms for classification, clustering, regression, etc.
    Spark GraphXSupports graph processing and analysis for complex networks.Graph algorithms, graph-parallel computation.

    2. PySpark Layered Architecture

    LayerDescriptionKey Functions
    Application LayerContains user applications and custom PySpark code.Custom data workflows and business logic.
    Spark API LayerProvides PySpark APIs to interact with Spark Core and other components.High-level abstractions for data manipulation, SQL, streaming.
    Spark Core LayerProvides core functionalities including task scheduling and fault tolerance.Data locality, memory management, RDD transformations.
    Execution LayerManages the execution of Spark tasks across the cluster.Task scheduling, load balancing, error handling.
    Storage LayerManages data storage with distributed storage solutions.Supports HDFS, S3, Cassandra, and other storage integrations.

    3. Spark Cluster Architecture

    ComponentDescriptionKey Role
    Driver NodeRuns the Spark application, creates Spark Context, and coordinates tasks execution.Manages jobs, scheduling, and resource allocation.
    Executor NodesRun tasks assigned by the driver node and process data on each partition.Execute tasks, store intermediate results, and return output.
    Cluster ManagerManages the resources of the Spark cluster.Allocates resources, manages executor lifecycle.
    Distributed File SystemStores data across the cluster, ensuring high availability.HDFS, S3, or other compatible storage for data sharing.

    4. Spark Execution Process

    StepDescriptionKey Points
    1. Application SubmissionUser submits a Spark application via command line or a UI.Initializes job creation in Spark.
    2. Job CreationSpark creates jobs and splits them into stages and tasks based on transformations.Directed Acyclic Graph (DAG) creation based on data flow.
    3. Task AssignmentDriver assigns tasks to executors based on data locality and resource availability.Utilizes data partitioning for parallelism.
    4. Task ExecutionExecutors run assigned tasks on data partitions.Processes data in parallel across the cluster.
    5. Result CollectionDriver collects results from all executors, aggregates, and returns the final output.Outputs final results to the user or designated storage.

    5. Spark RDD Architecture

    ComponentDescriptionKey Characteristics
    RDD (Resilient Distributed Dataset)Immutable, distributed collection of objects across the cluster.Fault-tolerant, lineage-based recovery, in-memory processing.
    PartitionSubset of data within an RDD stored on a single node.Enables parallel processing of data.
    TaskSmallest unit of work that operates on a single partition.Executes transformations or actions on data.

    6. Spark DataFrame Architecture

    ComponentDescriptionKey Characteristics
    DataFrameDistributed collection of data organized into named columns.Schema-based data handling, optimized storage, SQL compatibility.
    DatasetStrongly-typed distributed collection of data (Java/Scala).Type safety, combines features of RDDs and DataFrames.
    EncoderConverts data between JVM objects and Spark’s internal format for optimized serialization.Efficient serialization/deserialization for faster processing.

    7. Spark SQL Architecture

    ComponentDescriptionKey Functions
    Catalyst OptimizerOptimizes Spark SQL queries for enhanced performance.Logical plan, physical plan optimization.
    Query PlannerPlans the execution of SQL queries by selecting the best execution strategy.Converts optimized logical plan into physical execution plan.
    Execution EngineExecutes SQL queries using Spark’s distributed computing framework.Leverages cluster resources for parallel query execution.

    8. Spark Streaming Architecture

    ComponentDescriptionKey Features
    DStream (Discretized Stream)Continuous data stream split into micro-batches for processing.Batch processing in near real-time.
    ReceiverIngests data from external sources like Kafka, Flume, etc.Acts as data source for streaming jobs.
    ProcessorProcesses data within DStream by applying transformations and actions.Provides transformations similar to RDDs.

    9. Spark Master-Slave Architecture

    ComponentDescriptionKey Role
    Master NodeCoordinates resource allocation, task scheduling, and overall job management.Central controller for Spark cluster.
    Worker NodesExecute tasks on assigned data partitions as directed by the Master.Run computations, store data, and handle intermediate results.
    ExecutorProcess-running unit on each worker node, responsible for executing tasks.Runs task code, caches data, and sends results to the driver.
    TaskSmallest unit of work on data partitions assigned to executors by the driver.Executes transformations or actions on data partitions.
    Driver ProgramInitiates the Spark application and coordinates overall job execution.Submits tasks to the master and receives results.
    Cluster ManagerManages resources for the entire Spark cluster (YARN, Mesos, Kubernetes).Manages the lifecycle of executors and resource distribution.

    10. Key Concepts in PySpark

    ConceptDescriptionBenefits
    Lazy EvaluationTransformations are not executed until an action is called.Optimizes query execution by grouping operations.
    Fault ToleranceSpark recovers lost RDDs using lineage information when nodes fail.Increases reliability in distributed environments.
    In-Memory ProcessingStores intermediate data in memory instead of writing to disk.Enables faster data processing by avoiding I/O overhead.

    11. Common Use Cases for PySpark

    • Batch Processing: Large-scale ETL (Extract, Transform, Load) jobs.
    • Stream Processing: Real-time analytics, monitoring systems.
    • Machine Learning: Training models at scale using Spark MLlib.
    • Graph Processing: Social network analysis, recommendation systems.
    • Data Warehousing: Leveraging Spark SQL for querying structured datasets.



  • Here’s an enhanced Spark SQL cheatsheet with additional details, covering join types, union types, and set operations like EXCEPT and INTERSECT, along with options for table management (operations such as UPDATE, INSERT, DELETE, etc.). This comprehensive sheet is designed as a quick Spark SQL reference.


    CategoryConceptSyntax / ExampleDescription
    Basic StatementsSELECTSELECT col1, col2 FROM table WHERE condition;Retrieves specific columns from a table based on a condition.
    DISTINCTSELECT DISTINCT col FROM table;Returns unique values in the specified column.
    LIMITSELECT * FROM table LIMIT 10;Restricts the number of rows returned by a query.
    JoinsINNER JOINSELECT * FROM t1 JOIN t2 ON t1.id = t2.id;Returns rows that have matching values in both tables.
    LEFT JOINSELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id;Returns all rows from the left table and matched rows from the right table; unmatched rows get NULL in columns from the right.
    RIGHT JOINSELECT * FROM t1 RIGHT JOIN t2 ON t1.id = t2.id;Returns all rows from the right table and matched rows from the left table; unmatched rows get NULL in columns from the left.
    FULL OUTER JOINSELECT * FROM t1 FULL JOIN t2 ON t1.id = t2.id;Returns rows when there is a match in either left or right table, including unmatched rows.
    CROSS JOINSELECT * FROM t1 CROSS JOIN t2;Returns the Cartesian product of the two tables.
    Set OperationsUNIONSELECT * FROM t1 UNION SELECT * FROM t2;Combines result sets from multiple queries, removing duplicates by default.
    UNION ALLSELECT * FROM t1 UNION ALL SELECT * FROM t2;Combines result sets from multiple queries without removing duplicates.
    INTERSECTSELECT * FROM t1 INTERSECT SELECT * FROM t2;Returns only the rows present in both queries.
    EXCEPTSELECT * FROM t1 EXCEPT SELECT * FROM t2;Returns rows present in the first query but not in the second query.
    EXCEPT ALLSELECT * FROM t1 EXCEPT ALL SELECT * FROM t2;Returns all rows in the first query that aren’t in the second, including duplicates.
    Table ManagementCREATE TABLECREATE TABLE table_name (id INT, name STRING);Creates a new table with specified columns and data types.
    DESCRIBEDESCRIBE TABLE table_name;Shows the structure and metadata of a table.
    ALTER TABLEALTER TABLE table_name ADD COLUMNS (age INT);Adds columns or modifies a table’s structure.
    DROP TABLEDROP TABLE IF EXISTS table_name;Deletes a table if it exists.
    TRUNCATE TABLETRUNCATE TABLE table_name;Removes all rows from a table without deleting the table structure.
    INSERT INTOINSERT INTO table_name VALUES (1, 'name');Adds new rows to a table.
    INSERT OVERWRITEINSERT OVERWRITE table_name SELECT * FROM other_table;Replaces existing data in a table with the results of a query.
    UPDATEUPDATE table_name SET col = 'value' WHERE condition;Updates specific columns based on a condition (SQL-style syntax may vary by environment).
    DELETEDELETE FROM table_name WHERE condition;Deletes specific rows based on a condition (available in Delta tables, SQL-style syntax).
    Window Functionsrow_number()ROW_NUMBER() OVER (PARTITION BY col ORDER BY col2 DESC)Assigns a unique number to each row within a partition.
    rank()RANK() OVER (PARTITION BY col ORDER BY col2 DESC)Assigns a rank to rows within a partition based on specified column(s).
    lead()lag()LEAD(col) OVER (ORDER BY col2)Accesses data from the following or preceding row.
    Data Manipulation FunctionswithColumn()df.withColumn("newCol", df.oldCol + 1)Adds or replaces a column with the specified expression.
    withColumnRenamed()df.withColumnRenamed("oldName", "newName")Renames a column.
    selectExpr()df.selectExpr("col AS newCol", "col2 + 1")Selects columns or expressions using SQL syntax.
    String Functionsconcat()SELECT concat(col1, col2) FROM table;Concatenates strings from multiple columns.
    substring()SELECT substring(col, 1, 5) FROM table;Extracts a substring from a string column.
    lower() / upper()SELECT lower(col) FROM table;Converts all characters in a string to lowercase or uppercase.
    Date and Time Functionscurrent_date()SELECT current_date();Returns the current date.
    datediff()SELECT datediff(end_date, start_date) FROM table;Returns the difference in days between two dates.
    year()month()day()SELECT year(col) FROM table;Extracts parts of a date.
    Aggregate Functionscollect_list()SELECT collect_list(col) FROM table;Aggregates values into a list for each group.
    collect_set()SELECT collect_set(col) FROM table;Aggregates values into a unique set for each group.
    avg()sum()count()SELECT sum(col), count(col) FROM table GROUP BY group_col;Performs aggregation functions like averaging, summing, or counting.
    Optimization Techniquescache()df.cache()Caches the DataFrame in memory to optimize performance on repeated actions.
    repartition()df.repartition(4, "col")Redistributes data across partitions for load balancing.
    broadcast()broadcast(df)Optimizes joins by broadcasting smaller DataFrames to all nodes.
    Predicate Pushdownspark.sql("SELECT * FROM table WHERE col = 'value'")Pushes filters down to the data source, reducing data scanned.
    UDFsRegister UDFspark.udf.register("addOne", lambda x: x + 1)Registers a custom Python function as a UDF.
    Using a UDFSELECT addOne(col) FROM table;Applies a UDF to a column in Spark SQL.
    Schema ManagementprintSchema()df.printSchema()Displays the schema of a DataFrame.
    schemadf.schemaReturns the schema as a StructType object.
    Schema Mergespark.read.option("mergeSchema", "true")Merges schemas when reading from multiple files.
    Complex TypesArraysARRAY<int>Defines an array type, e.g., ARRAY<int>.
    StructSTRUCT<name: STRING, age: INT>Defines a nested structure.
    Miscellaneousmonotonically_increasing_id()SELECT monotonically_increasing_id() AS id FROM table;Generates unique IDs for rows.
    input_file_name()SELECT input_file_name() FROM table;Retrieves the file name associated with each row.
    coalesce()SELECT coalesce(col1, col2) FROM table;Returns the first non-null value from the specified columns.
    ConceptDescriptionSyntax/Example
    Basic SelectRetrieves data from a table.SELECT column1, column2 FROM table;
    WHERE ClauseFilters records based on conditions.SELECT * FROM table WHERE condition;
    AggregationsSummarizes data (e.g., SUM, COUNT, AVG).SELECT SUM(column) FROM table GROUP BY column2;
    Window FunctionsPerforms calculations across rows, like cumulative sums, rank, or row numbers.SELECT column, SUM(value) OVER (PARTITION BY column ORDER BY date) AS running_total FROM table;
    JoinsCombines rows from two or more tables based on a related column.SELECT * FROM table1 JOIN table2 ON table1.id = table2.id;
    SubqueriesNested queries for complex operations or transformations.SELECT * FROM (SELECT column1 FROM table WHERE condition);
    CTE (WITH Clause)Temporary result set for improved readability and reuse in complex queries.WITH temp AS (SELECT column FROM table) SELECT * FROM temp WHERE condition;
    UNION/UNION ALLCombines results from multiple SELECT statements.SELECT column FROM table1 UNION SELECT column FROM table2;
    PivotConverts rows into columns for specified values (aggregate with columns).SELECT * FROM (SELECT * FROM table) PIVOT (SUM(value) FOR column IN ('value1', 'value2'));
    UnpivotConverts columns into rows, useful for restructuring wide tables.SELECT * FROM table UNPIVOT (value FOR name IN (col1, col2, col3));
    ViewsVirtual table based on a SELECT query, allows simplified access.CREATE VIEW view_name AS SELECT column FROM table;
    Temporary TablesTemporary storage for session-specific tables, deleted when the session ends.CREATE TEMPORARY TABLE temp_table AS SELECT * FROM table;
    Group By with HAVINGGroups data by specified columns and applies conditions to aggregated data.SELECT column, COUNT(*) FROM table GROUP BY column HAVING COUNT(*) > value;
    Case StatementsConditional logic within SQL queries for creating calculated columns.SELECT column, CASE WHEN condition THEN 'result1' ELSE 'result2' END AS new_column FROM table;
    Window FrameSpecifies the range for window functions, often cumulative or sliding (rows between clauses).SUM(column) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);

    Spark SQL Functions with Complex Use Cases

    FunctionExample Use Case
    collect_list()Aggregate column values into a list for each group, useful in tracking customer purchase history: SELECT customer_id, collect_list(product) FROM purchases GROUP BY customer_id;
    concat()Concatenate multiple address fields into one formatted address: SELECT concat(street, ', ', city, ', ', zip) AS address FROM addresses;
    row_number()Number rows within each group, useful for ranking: SELECT *, row_number() OVER (PARTITION BY category ORDER BY sales DESC) AS rank FROM sales_data;
    date_add()Calculate future dates, such as a payment due date: SELECT order_id, date_add(order_date, 30) AS due_date FROM orders;
    when() and coalesce()Assign risk categories while handling nulls: SELECT customer_id, CASE WHEN age > 60 THEN 'high' WHEN age > 30 THEN 'medium' ELSE 'low' END AS risk, coalesce(income, 0) AS income FROM customers;
    array_contains()Filter for specific tags in an array column: SELECT * FROM posts WHERE array_contains(tags, 'pyspark');
    explode()Expand array items into individual rows: SELECT order_id, explode(items) AS item FROM orders;

    Conditional aggregation in Spark SQL:

    Here’s an example of conditional aggregation in Spark SQL:

    SELECT
      SUM(CASE WHEN age > 30 THEN 1 ELSE 0 END) AS count_over_30,
      SUM(CASE WHEN age <= 30 THEN 1 ELSE 0 END) AS count_under_30
    FROM
      customers;

    In this example, we’re using a CASE statement to conditionally sum the values. If the age is greater than 30, we sum 1, otherwise we sum 0.

    Using IF Function

    Alternatively, you can use the IF function:

    SELECT
      SUM(IF(age > 30, 1, 0)) AS count_over_30,
      SUM(IF(age <= 30, 1, 0)) AS count_under_30
    FROM
      customers;
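
    If you are on Spark 3.0 or later, the built-in count_if() expresses the same idea even more compactly. A sketch, assuming the same customers table:

    SELECT
      count_if(age > 30)  AS count_over_30,
      count_if(age <= 30) AS count_under_30
    FROM customers;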

    Spark SQL commands to manage views

    Here are the Spark SQL commands to manage views:

    Creating Views

    1. Creating Virtual Views

    CREATE VIEW my_view AS SELECT * FROM my_table;

    2. Creating Temporary Views

    CREATE TEMPORARY VIEW my_temp_view AS SELECT * FROM my_table;

    3. Creating or Replacing Temporary Views

    CREATE OR REPLACE TEMPORARY VIEW my_temp_view AS SELECT * FROM my_table;

    Deleting Views

    1. Deleting Temporary Views if Exists

    DROP VIEW IF EXISTS my_temp_view;

    Checking Views

    1. Checking Views DDL

    DESCRIBE FORMATTED my_view;

    2. Checking Extended Query

    EXPLAIN EXTENDED SELECT * FROM my_view;

    3. Checking in Spark Catalog

    SHOW TABLES IN my_database;   -- lists all tables and views
    SHOW VIEWS IN my_database;    -- lists only views

    Note:

    • CREATE VIEW creates a virtual view, which is a stored query that doesn’t store data.
    • CREATE TEMPORARY VIEW creates a temporary view that is only available in the current Spark session.
    • CREATE OR REPLACE TEMPORARY VIEW creates or replaces a temporary view.
    • DROP VIEW IF EXISTS deletes a view if it exists.
    • DESCRIBE FORMATTED shows the DDL of a view.
    • EXPLAIN EXTENDED shows the extended query plan of a view.
    • SHOW TABLES and SHOW VIEWS list tables and views in the Spark catalog.
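
    One related option worth knowing (a sketch, not covered above): a global temporary view is shared across sessions of the same Spark application and lives in the global_temp database:

    CREATE GLOBAL TEMPORARY VIEW my_global_view AS SELECT * FROM my_table;
    SELECT * FROM global_temp.my_global_view;   -- must be qualified with global_temp
    DROP VIEW IF EXISTS global_temp.my_global_view;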

    Here are the different types of joins in Spark SQL, along with examples:

    1. Inner Join

    Returns only the rows that have a match in both tables.

    SELECT *
    FROM table1
    INNER JOIN table2
    ON table1.id = table2.id;

    2. Left Outer Join (or Left Join)

    Returns all the rows from the left table and the matching rows from the right table. If there’s no match, the result will contain NULL values.

    SELECT *
    FROM table1
    LEFT JOIN table2
    ON table1.id = table2.id;

    3. Right Outer Join (or Right Join)

    Similar to the left outer join, but returns all the rows from the right table and the matching rows from the left table.

    SELECT *
    FROM table1
    RIGHT JOIN table2
    ON table1.id = table2.id;

    4. Full Outer Join (or Full Join)

    Returns all the rows from both tables, with NULL values in the columns where there are no matches.

    SELECT *
    FROM table1
    FULL OUTER JOIN table2
    ON table1.id = table2.id;

    5. Semi Join

    Returns only the rows from the left table that have a match in the right table.

    SELECT *
    FROM table1
    LEFT SEMI JOIN table2
    ON table1.id = table2.id;

    6. Anti Join

    Returns only the rows from the left table that do not have a match in the right table.

    SELECT *
    FROM table1
    LEFT ANTI JOIN table2
    ON table1.id = table2.id;

    7. Cross Join

    Returns the Cartesian product of both tables.

    SELECT *
    FROM table1
    CROSS JOIN table2;

    Note: Spark SQL also supports using the USING clause to specify the join condition, like this:

    SELECT *
    FROM table1
    JOIN table2
    USING (id);
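
    Semi and anti joins can also be written with correlated subqueries, which some find more readable. A sketch using the same table1/table2:

    -- Semi join equivalent
    SELECT * FROM table1 t1
    WHERE EXISTS (SELECT 1 FROM table2 t2 WHERE t2.id = t1.id);

    -- Anti join equivalent
    SELECT * FROM table1 t1
    WHERE NOT EXISTS (SELECT 1 FROM table2 t2 WHERE t2.id = t1.id);
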
  • Here’s a categorized Spark SQL function reference, which organizes common Spark SQL functions by functionality. This can help with selecting the right function based on the operation you want to perform.

    1. Aggregate Functions

    FunctionDescriptionExample
    avg()Calculates the average value.SELECT avg(age) FROM table;
    count()Counts the number of rows.SELECT count(*) FROM table;
    max()Finds the maximum value.SELECT max(salary) FROM table;
    min()Finds the minimum value.SELECT min(age) FROM table;
    sum()Calculates the sum of a column.SELECT sum(salary) FROM table;
    stddev()Calculates the standard deviation.SELECT stddev(salary) FROM table;
    variance()Calculates the variance.SELECT variance(salary) FROM table;

    2. Analytic Functions

    FunctionDescriptionExample
    row_number()Assigns a unique number to each row in a window.ROW_NUMBER() OVER (PARTITION BY city)
    rank()Assigns a rank to each row in a partition.RANK() OVER (ORDER BY salary DESC)
    dense_rank()Similar to rank but without gaps.DENSE_RANK() OVER (ORDER BY age ASC)
    ntile(n)Divides rows into n buckets.NTILE(4) OVER (ORDER BY age)
    lead()Accesses a row after the current row.LEAD(salary, 1) OVER (ORDER BY age)
    lag()Accesses a row before the current row.LAG(salary, 1) OVER (ORDER BY age)

    3. String Functions

    FunctionDescriptionExample
    concat()Concatenates multiple strings.SELECT concat(first_name, last_name) FROM table;
    substring()Extracts a substring from a string.SELECT substring(name, 1, 3) FROM table;
    length()Returns the length of a string.SELECT length(name) FROM table;
    lower()Converts string to lowercase.SELECT lower(name) FROM table;
    upper()Converts string to uppercase.SELECT upper(name) FROM table;
    trim()Trims spaces from both ends of a string.SELECT trim(name) FROM table;
    replace()Replaces a substring within a string.SELECT replace(name, 'a', 'b') FROM table;
    split()Splits a string into an array.SELECT split(email, '@') FROM table;

    4. Date and Time Functions

    FunctionDescriptionExample
    current_date()Returns the current date.SELECT current_date();
    current_timestamp()Returns the current timestamp.SELECT current_timestamp();
    datediff()Returns difference in days between two dates.SELECT datediff(date1, date2) FROM table;
    year(), month(), day()Extracts year, month, day from date.SELECT year(birthdate) FROM table;
    date_add()Adds days to a date.SELECT date_add(date, 10) FROM table;
    date_sub()Subtracts days from a date.SELECT date_sub(date, 10) FROM table;
    to_date()Converts string to date.SELECT to_date(string_date) FROM table;
    to_timestamp()Converts string to timestamp.SELECT to_timestamp(string_timestamp) FROM table;

    5. Mathematical Functions

    FunctionDescriptionExample
    abs()Returns absolute value.SELECT abs(-10) FROM table;
    ceil()Rounds up to the nearest integer.SELECT ceil(salary) FROM table;
    floor()Rounds down to the nearest integer.SELECT floor(salary) FROM table;
    round()Rounds to a specified number of decimal places.SELECT round(salary, 2) FROM table;
    sqrt()Returns the square root.SELECT sqrt(age) FROM table;
    pow()Returns a number raised to a power.SELECT pow(salary, 2) FROM table;
    exp()Returns e^x (exponential).SELECT exp(age) FROM table;
    log()Returns the logarithm of a number.SELECT log(salary) FROM table;

    6. Array Functions

    FunctionDescriptionExample
    array()Creates an array from multiple values.SELECT array('a', 'b', 'c');
    size()Returns the number of elements in an array.SELECT size(array_column) FROM table;
    array_contains()Checks if an array contains a specified value.SELECT array_contains(array_column, 'value') FROM table;
    explode()Creates a new row for each element in the array.SELECT explode(array_column) FROM table;
    sort_array()Sorts the elements of an array in ascending order.SELECT sort_array(array_column) FROM table;
    array_distinct()Removes duplicate values from an array.SELECT array_distinct(array_column) FROM table;

    7. Conditional Functions

    FunctionDescriptionExample
    when()Conditional expression that works like an if-else (in SQL, use CASE WHEN).SELECT CASE WHEN age > 18 THEN 'adult' ELSE 'minor' END FROM table;
    coalesce()Returns the first non-null value.SELECT coalesce(col1, col2) FROM table;
    ifnull()Returns the second value if the first is null.SELECT ifnull(col1, 'unknown') FROM table;
    nullif()Returns NULL if the two values are equal.SELECT nullif(col1, col2) FROM table;
    nvl()Replaces NULL with a specified value.SELECT nvl(col, 'default') FROM table;

    8. Miscellaneous Functions

    FunctionDescriptionExample
    lit()Converts a literal value to a column.SELECT lit(1) FROM table;
    rand()Generates a random number between 0 and 1.SELECT rand() FROM table;
    monotonically_increasing_id()Returns a unique ID for each row.SELECT monotonically_increasing_id() FROM table;
    input_file_name()Returns the file name of the source data.SELECT input_file_name() FROM table;

    This categorized list provides a quick reference for Spark SQL functions based on what kind of operation they perform, making it useful for development and troubleshooting in Spark SQL queries.

    The collect_list() function is categorized under Aggregate Functions in Spark SQL. It aggregates data by collecting values into a list within each group, without removing duplicates. Here’s a quick example and then an exploration of some interesting, complex use cases for various functions, including collect_list().


    1. Advanced Use Cases of Aggregate Functions

    collect_list()

    FunctionExampleDescription
    collect_list()SELECT id, collect_list(name) FROM table GROUP BY id;Collects all values of a column into a list for each group, preserving duplicates.

    Complex Use Case: If you have a dataset where each customer has multiple orders, and you want to get a list of order dates for each customer:

    SELECT customer_id, collect_list(order_date) AS order_dates
    FROM orders
    GROUP BY customer_id;
    

    This is useful in generating lists of values within groups, such as viewing the product purchase history of customers or tracking all updates to a particular row over time.
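
    One caveat: collect_list() does not guarantee the order of the collected elements. If sorting by value is acceptable, sort_array() can be applied on top, as in this sketch:

    SELECT customer_id,
           sort_array(collect_list(order_date)) AS order_dates
    FROM orders
    GROUP BY customer_id;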


    2. String Functions with Complex Use Cases

    concat() and replace()

    FunctionExampleDescription
    concat()SELECT concat(city, ', ', state) AS location FROM table;Joins multiple columns or strings together.
    replace()SELECT replace(phone, '-', '') AS phone_no FROM table;Replaces parts of a string based on a pattern.

    Complex Use Case: Concatenating multiple address fields to form a single address and cleaning up data with replace():

    SELECT concat(street, ', ', city, ', ', replace(zip, '-', '')) AS full_address
    FROM addresses;
    

    This would be useful for standardizing or preparing addresses for mailing systems by merging fields and removing unnecessary characters.


    3. Analytic Functions with Interesting Examples

    row_number(), rank(), lead(), and lag()

    FunctionExampleDescription
    row_number()ROW_NUMBER() OVER (PARTITION BY city ORDER BY age DESC)Assigns a unique number to each row in a window.
    lead()LEAD(salary, 1) OVER (ORDER BY age)Retrieves the next row’s value in the current row’s column.
    lag()LAG(salary, 1) OVER (ORDER BY age)Retrieves the previous row’s value in the current row’s column.

    Complex Use Case: Track sales growth over time by calculating month-over-month difference in sales:

    SELECT month, sales,
           sales - LAG(sales, 1) OVER (ORDER BY month) AS sales_diff
    FROM sales_data;
    

    This lets you analyze trends or identify dips and peaks in performance by using the previous row’s data directly.


    4. Date and Time Functions for Advanced Operations

    date_add(), datediff(), year()

    FunctionExampleDescription
    date_add()SELECT date_add(order_date, 30) AS due_date FROM table;Adds a specific number of days to a date.
    datediff()SELECT datediff(due_date, order_date) AS days_to_ship FROM table;Calculates the difference in days between two dates.
    year(), month()SELECT year(birthdate) AS birth_year FROM table;Extracts parts of a date.

    Complex Use Case: Calculate the monthly retention rate by finding customers who ordered in consecutive months:

    WITH monthly AS (
      SELECT customer_id,
             trunc(order_date, 'MM') AS order_month,
             count(*) AS orders_this_month
      FROM orders
      GROUP BY customer_id, trunc(order_date, 'MM')
    )
    SELECT customer_id, order_month, orders_this_month,
           months_between(LEAD(order_month) OVER (PARTITION BY customer_id ORDER BY order_month),
                          order_month) = 1 AS retained
    FROM monthly;
    

    This example is powerful for retention analysis, determining if customers return month after month.


    5. Array Functions for Multi-Value Column Manipulation

    array(), array_contains(), explode()

    FunctionExampleDescription
    array()SELECT array(name, email) AS contact FROM table;Combines multiple columns into an array.
    array_contains()SELECT array_contains(tags, 'Spark') AS has_spark FROM table;Checks if an array contains a value.
    explode()SELECT id, explode(items) AS item FROM orders;Expands array elements into individual rows.

    Complex Use Case: Splitting tags into individual rows for better indexing and searching:

    SELECT post_id, explode(tags) AS tag
    FROM blog_posts;
    

    This approach can help with filtering, analytics, or search functionalities in scenarios where each tag or attribute of an item needs to be analyzed individually.


    6. Conditional Functions for Complex Case Logic

    when(), coalesce(), ifnull()

    FunctionExampleDescription
    when()SELECT CASE WHEN age > 18 THEN 'adult' ELSE 'minor' END AS category FROM table;Implements conditional logic similar to if-else (when() is the DataFrame API equivalent; in SQL use CASE WHEN).
    coalesce()SELECT coalesce(col1, col2) AS result FROM table;Returns the first non-null value from a list of columns.
    ifnull()SELECT ifnull(name, 'unknown') FROM table;Replaces NULL values with a specified value.

    Complex Use Case: Assign customers to risk categories based on age and income, while handling null values with coalesce():

    SELECT customer_id,
           CASE WHEN age >= 60 THEN 'high risk'
                WHEN age >= 30 THEN 'medium risk'
                ELSE 'low risk' END AS risk_category,
           coalesce(income, 0) AS income
    FROM customers;
    

    This is especially useful in customer segmentation, where you may need to categorize data based on multiple conditions and handle null values simultaneously.


    7. Miscellaneous Functions for Unique Use Cases

    monotonically_increasing_id(), input_file_name()

    FunctionExampleDescription
    monotonically_increasing_id()SELECT monotonically_increasing_id() AS id FROM table;Generates unique IDs for rows without requiring a primary key.
    input_file_name()SELECT input_file_name() FROM table;Returns the file name from which a row originated.

    Complex Use Case: Track records to their source file in an ETL process, useful in error tracking:

    SELECT *, input_file_name() AS source_file
    FROM data_files;
    

    This helps you trace errors back to their data origin, making it valuable in complex ETL jobs involving multiple data sources.


    These advanced use cases highlight the flexibility and power of Spark SQL functions when applied to complex data manipulation, making it easier to handle various analytical scenarios and transformations.

  • 3. SQL Command Categories Overview

    SQL commands are classified into five main categories based on their functionality:

    CategoryAcronymDescription
    DDLData Definition LanguageDefine/alter schema structure (tables, views, indexes)
    DMLData Manipulation LanguageModify data (insert, update, delete)
    DCLData Control LanguageManage user access (privileges)
    TCLTransaction Control LanguageControl transaction flow (commit, rollback)
    DQLData Query LanguageQuery and retrieve data

    3.1 Data Definition Language (DDL)

    DDL is used to define or modify schema-level structures.

    CommandPurposeExample
    CREATECreates tables/views/indexesCREATE TABLE students (id INT, name VARCHAR(50));
    ALTERModifies table structureALTER TABLE students ADD COLUMN age INT;
    DROPDeletes a table/view/index permanentlyDROP TABLE students;
    TRUNCATERemoves all data from a tableTRUNCATE TABLE students;
    RENAMERenames a table or columnALTER TABLE students RENAME TO learners;

    3.2 Data Manipulation Language (DML)

    DML is used for managing records within tables.

    CommandPurposeExample
    INSERTAdds new rowsINSERT INTO students (id, name) VALUES (1, 'John');
    UPDATEModifies existing rowsUPDATE students SET name = 'John Doe' WHERE id = 1;
    DELETERemoves rowsDELETE FROM students WHERE id = 1;
    MERGECombines INSERT and UPDATEMERGE INTO target USING source ON condition WHEN MATCHED THEN UPDATE...

    3.3 Data Control Language (DCL)

    DCL controls access and privileges for users.

    CommandPurposeExample
    GRANTAssigns privilegesGRANT SELECT ON students TO user1;
    REVOKERemoves privilegesREVOKE SELECT ON students FROM user1;

    3.4 Transaction Control Language (TCL)

    TCL manages the lifecycle of transactions.

    CommandPurposeExample
    COMMITSaves changesCOMMIT;
    ROLLBACKReverts changesROLLBACK;
    SAVEPOINTSets rollback pointsSAVEPOINT save1;
    SET TRANSACTIONSets isolation levelSET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
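
    For illustration, here is a minimal TCL flow in a traditional RDBMS (the accounts table is made up, exact syntax varies by engine, and Spark SQL itself does not support these commands):

    BEGIN;                                                     -- start a transaction
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- debit
    SAVEPOINT after_debit;                                     -- point we can roll back to
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;  -- credit
    ROLLBACK TO SAVEPOINT after_debit;                         -- undo only the credit
    COMMIT;                                                    -- persist the debit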

    3.5 Data Query Language (DQL)

    DQL focuses on querying and fetching data.

    CommandPurposeExample
    SELECTRetrieve recordsSELECT name FROM students WHERE age > 20;

    4. CRUD Operations (SQL + Spark SQL)

    CRUD stands for Create, Read, Update, Delete, the core operations of data handling.

    OperationSQL ExampleSpark SQL Example
    CreateINSERT INTO users VALUES (1, 'Alice');INSERT INTO employees VALUES (1, 'Alice', 30, 'Engg');
    ReadSELECT * FROM users;SELECT name, department FROM employees;
    UpdateUPDATE users SET name='Bob' WHERE id=1;(Delta Only) UPDATE employees SET age=31 WHERE id=1;
    DeleteDELETE FROM users WHERE id=1;(Delta Only) DELETE FROM employees WHERE age<25;

    4.1 Creating Databases and Tables

    MySQL/PostgreSQL

    CREATE DATABASE my_db;
    CREATE TABLE customers (
      id INT PRIMARY KEY,
      name VARCHAR(100),
      email VARCHAR(100) UNIQUE
    );
    

    Spark SQL

    CREATE TABLE students (
      id INT, name STRING, age INT
    );
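
    In practice, Spark SQL tables usually specify a data source format and, optionally, partition columns. A sketch extending the same example (the table name students_parquet is made up):

    CREATE TABLE students_parquet (
      id INT, name STRING, age INT
    )
    USING parquet
    PARTITIONED BY (age);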
    

    4.2 Inserting Data

    • Single Record: INSERT INTO users (name) VALUES ('Alice');
    • Multiple Records: INSERT INTO users (name) VALUES ('A'), ('B');
    • From Another Table: INSERT INTO new_users SELECT * FROM old_users;

    4.3 Reading Data

    SELECT * FROM users;
    SELECT name FROM users WHERE age > 18;
    SELECT COUNT(*) FROM users;
    

    4.4 Updating and Deleting Data (SQL & Delta Lake)

    UPDATE employees SET age = 35 WHERE id = 5;
    DELETE FROM employees WHERE department = 'HR';
    

    4.5 MERGE / UPSERT (Delta Lake)

    MERGE INTO employees AS e
    USING updates AS u
    ON e.id = u.id
    WHEN MATCHED THEN UPDATE SET age = u.age
    WHEN NOT MATCHED THEN INSERT (id, name, age) VALUES (u.id, u.name, u.age);
    

    4.6 Additional Spark SQL Features

    FeatureExample
    View CreationCREATE VIEW v1 AS SELECT * FROM t1;
    CachingCACHE TABLE my_table;
    Broadcast JoinSELECT /*+ BROADCAST(dim) */ * FROM fact JOIN dim ON fact.id = dim.id;
    Schema EvolutionALTER TABLE my_table CHANGE COLUMN col_name col_name NEW_TYPE;

    5. Summary Table of SQL Categories

    CategoryCommandsMain Purpose
    DDLCREATE, ALTER, DROP, TRUNCATEDefine/modify schema
    DMLINSERT, UPDATE, DELETEManipulate data rows
    DCLGRANT, REVOKEControl user access
    TCLCOMMIT, ROLLBACK, SAVEPOINTManage transactions
    DQLSELECTQuery data

    6. Final Notes

    • Use EXPLAIN or EXPLAIN ANALYZE to check query execution plans.
    • Leverage Delta Lake for full CRUD support with UPDATE, DELETE, and MERGE.
