Quick Spark SQL Reference: A Cheatsheet for Revising in One Go



Here’s an enhanced Spark SQL cheatsheet with additional detail, covering join types, set operations such as UNION, EXCEPT, and INTERSECT, and table management (DDL and DML operations such as CREATE, ALTER, INSERT, UPDATE, and DELETE). It is designed as a comprehensive quick reference for Spark SQL.
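
All the snippets that follow are illustrative sketches rather than output from a specific environment. They assume a local SparkSession and a small sample view named `people` (a hypothetical name), created like this:

```python
# Minimal setup sketch: a local SparkSession plus a tiny sample view that the
# later snippets query. The table name "people" is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-cheatsheet").getOrCreate()

people = spark.createDataFrame(
    [(1, "alice", 34), (2, "bob", 45), (3, "carol", 29)],
    ["id", "name", "age"],
)
people.createOrReplaceTempView("people")
```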


Basic Statements

| Concept | Syntax / Example | Description |
| --- | --- | --- |
| SELECT | `SELECT col1, col2 FROM table WHERE condition;` | Retrieves specific columns from a table based on a condition. |
| DISTINCT | `SELECT DISTINCT col FROM table;` | Returns unique values in the specified column. |
| LIMIT | `SELECT * FROM table LIMIT 10;` | Restricts the number of rows returned by a query. |
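
A quick sketch of all three against the hypothetical `people` view from the setup above:

```python
spark.sql("SELECT name, age FROM people WHERE age > 30").show()  # filtered projection
spark.sql("SELECT DISTINCT age FROM people").show()              # unique ages only
spark.sql("SELECT * FROM people LIMIT 2").show()                 # cap the row count
```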
Joins

| Concept | Syntax / Example | Description |
| --- | --- | --- |
| INNER JOIN | `SELECT * FROM t1 JOIN t2 ON t1.id = t2.id;` | Returns rows that have matching values in both tables. |
| LEFT JOIN | `SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id;` | Returns all rows from the left table and matched rows from the right table; unmatched rows get NULL in columns from the right. |
| RIGHT JOIN | `SELECT * FROM t1 RIGHT JOIN t2 ON t1.id = t2.id;` | Returns all rows from the right table and matched rows from the left table; unmatched rows get NULL in columns from the left. |
| FULL OUTER JOIN | `SELECT * FROM t1 FULL JOIN t2 ON t1.id = t2.id;` | Returns rows with a match in either table, including unmatched rows from both sides. |
| CROSS JOIN | `SELECT * FROM t1 CROSS JOIN t2;` | Returns the Cartesian product of the two tables. |
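
To see the join types side by side, here is a sketch with two tiny hypothetical views, `t1` and `t2`, that overlap on a single id:

```python
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "v1"]).createOrReplaceTempView("t1")
spark.createDataFrame([(2, "x"), (3, "y")], ["id", "v2"]).createOrReplaceTempView("t2")

spark.sql("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id").show()       # only id 2 matches
spark.sql("SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id").show()  # id 1 kept, v2 is NULL
spark.sql("SELECT * FROM t1 FULL JOIN t2 ON t1.id = t2.id").show()  # ids 1, 2, and 3 appear
spark.sql("SELECT * FROM t1 CROSS JOIN t2").show()                  # 2 x 2 = 4 rows
```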
Set Operations

| Concept | Syntax / Example | Description |
| --- | --- | --- |
| UNION | `SELECT * FROM t1 UNION SELECT * FROM t2;` | Combines result sets from multiple queries, removing duplicates. |
| UNION ALL | `SELECT * FROM t1 UNION ALL SELECT * FROM t2;` | Combines result sets from multiple queries without removing duplicates. |
| INTERSECT | `SELECT * FROM t1 INTERSECT SELECT * FROM t2;` | Returns only the rows present in both queries. |
| EXCEPT | `SELECT * FROM t1 EXCEPT SELECT * FROM t2;` | Returns rows present in the first query but not in the second. |
| EXCEPT ALL | `SELECT * FROM t1 EXCEPT ALL SELECT * FROM t2;` | Returns all rows in the first query that aren’t in the second, preserving duplicates. |
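
Reusing the `t1` and `t2` views from the join sketch, the set operations compare the `id` columns:

```python
spark.sql("SELECT id FROM t1 UNION SELECT id FROM t2").show()      # 1, 2, 3 (deduplicated)
spark.sql("SELECT id FROM t1 UNION ALL SELECT id FROM t2").show()  # 1, 2, 2, 3
spark.sql("SELECT id FROM t1 INTERSECT SELECT id FROM t2").show()  # 2
spark.sql("SELECT id FROM t1 EXCEPT SELECT id FROM t2").show()     # 1
```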
Table Management

| Concept | Syntax / Example | Description |
| --- | --- | --- |
| CREATE TABLE | `CREATE TABLE table_name (id INT, name STRING);` | Creates a new table with specified columns and data types. |
| DESCRIBE | `DESCRIBE TABLE table_name;` | Shows the structure and metadata of a table. |
| ALTER TABLE | `ALTER TABLE table_name ADD COLUMNS (age INT);` | Adds columns or modifies a table’s structure. |
| DROP TABLE | `DROP TABLE IF EXISTS table_name;` | Deletes a table if it exists. |
| TRUNCATE TABLE | `TRUNCATE TABLE table_name;` | Removes all rows from a table without deleting the table itself. |
| INSERT INTO | `INSERT INTO table_name VALUES (1, 'name');` | Adds new rows to a table. |
| INSERT OVERWRITE | `INSERT OVERWRITE table_name SELECT * FROM other_table;` | Replaces existing data in a table with the results of a query. |
| UPDATE | `UPDATE table_name SET col = 'value' WHERE condition;` | Updates specific columns based on a condition; supported only on ACID table formats such as Delta Lake, not on plain Parquet/Hive tables. |
| DELETE | `DELETE FROM table_name WHERE condition;` | Deletes specific rows based on a condition; like UPDATE, this requires an ACID table format such as Delta Lake. |
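
Because UPDATE and DELETE need an ACID table format, this sketch assumes the Delta Lake package is configured for the session; the table name is hypothetical:

```python
# Assumes Delta Lake is available; plain Parquet/Hive tables reject UPDATE/DELETE.
spark.sql("CREATE TABLE IF NOT EXISTS customers (id INT, name STRING) USING DELTA")
spark.sql("INSERT INTO customers VALUES (1, 'alice'), (2, 'bob')")
spark.sql("UPDATE customers SET name = 'alicia' WHERE id = 1")
spark.sql("DELETE FROM customers WHERE id = 2")
spark.sql("DESCRIBE TABLE customers").show(truncate=False)
```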
Window Functions

| Concept | Syntax / Example | Description |
| --- | --- | --- |
| row_number() | `ROW_NUMBER() OVER (PARTITION BY col ORDER BY col2 DESC)` | Assigns a unique sequential number to each row within a partition. |
| rank() | `RANK() OVER (PARTITION BY col ORDER BY col2 DESC)` | Assigns a rank to rows within a partition; tied rows share a rank and gaps follow. |
| lead(), lag() | `LEAD(col) OVER (ORDER BY col2)` | Accesses data from the following (LEAD) or preceding (LAG) row. |
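
A sketch contrasting the three on a small hypothetical sales set:

```python
from pyspark.sql import Window, functions as F

sales = spark.createDataFrame(
    [("books", 100), ("books", 80), ("toys", 50)], ["category", "sales"]
)
w = Window.partitionBy("category").orderBy(F.desc("sales"))

sales.select(
    "*",
    F.row_number().over(w).alias("rn"),    # 1, 2, ... within each category
    F.rank().over(w).alias("rnk"),         # like rn, but ties share a rank
    F.lag("sales").over(w).alias("prev"),  # sales value from the previous row
).show()
```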
Data Manipulation Functions (DataFrame API)

| Concept | Syntax / Example | Description |
| --- | --- | --- |
| withColumn() | `df.withColumn("newCol", df.oldCol + 1)` | Adds or replaces a column with the specified expression. |
| withColumnRenamed() | `df.withColumnRenamed("oldName", "newName")` | Renames a column. |
| selectExpr() | `df.selectExpr("col AS newCol", "col2 + 1")` | Selects columns or expressions using SQL syntax. |
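
These live on the DataFrame rather than in SQL; a sketch chaining all three on the `people` DataFrame from the setup:

```python
from pyspark.sql import functions as F

(people
    .withColumn("age_next_year", F.col("age") + 1)         # add a derived column
    .withColumnRenamed("name", "full_name")                # rename an existing column
    .selectExpr("full_name", "age_next_year - 1 AS age")   # SQL expressions inline
).show()
```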
String Functions

| Concept | Syntax / Example | Description |
| --- | --- | --- |
| concat() | `SELECT concat(col1, col2) FROM table;` | Concatenates strings from multiple columns. |
| substring() | `SELECT substring(col, 1, 5) FROM table;` | Extracts a substring from a string column. |
| lower() / upper() | `SELECT lower(col) FROM table;` | Converts all characters in a string to lowercase or uppercase. |
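
A combined sketch against the `people` view; the explicit CAST keeps concat's inputs as strings:

```python
spark.sql("""
    SELECT concat(name, ' (', CAST(age AS STRING), ')') AS label,
           substring(name, 1, 3)                        AS prefix,
           upper(name)                                  AS shouted
    FROM people
""").show()
```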
Date and Time Functions

| Concept | Syntax / Example | Description |
| --- | --- | --- |
| current_date() | `SELECT current_date();` | Returns the current date. |
| datediff() | `SELECT datediff(end_date, start_date) FROM table;` | Returns the difference in days between two dates. |
| year(), month(), day() | `SELECT year(col) FROM table;` | Extracts parts of a date. |
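
A self-contained sketch (no table needed); the DATE literal is an arbitrary anchor date:

```python
spark.sql("""
    SELECT current_date()                              AS today,
           datediff(current_date(), DATE'2024-01-01')  AS days_since,
           year(current_date())                        AS yr,
           month(current_date())                       AS mo,
           day(current_date())                         AS d
""").show()
```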
Aggregate Functions

| Concept | Syntax / Example | Description |
| --- | --- | --- |
| collect_list() | `SELECT collect_list(col) FROM table;` | Aggregates values into a list for each group. |
| collect_set() | `SELECT collect_set(col) FROM table;` | Aggregates values into a unique set for each group. |
| avg(), sum(), count() | `SELECT sum(col), count(col) FROM table GROUP BY group_col;` | Performs aggregations such as averaging, summing, or counting. |
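
Grouped aggregation over the `people` view; collect_list keeps duplicates while collect_set drops them:

```python
spark.sql("""
    SELECT age,
           collect_list(name) AS names,         -- keeps duplicates
           collect_set(name)  AS unique_names,  -- deduplicated
           count(*)           AS n,
           avg(age)           AS avg_age
    FROM people
    GROUP BY age
""").show(truncate=False)
```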
Optimization Techniques

| Concept | Syntax / Example | Description |
| --- | --- | --- |
| cache() | `df.cache()` | Caches the DataFrame in memory to speed up repeated actions. |
| repartition() | `df.repartition(4, "col")` | Redistributes data across partitions for load balancing. |
| broadcast() | `broadcast(df)` | Optimizes joins by broadcasting the smaller DataFrame to all executors. |
| Predicate pushdown | `spark.sql("SELECT * FROM table WHERE col = 'value'")` | Pushes filters down to the data source, reducing the data scanned. |
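
A sketch of the DataFrame-side techniques; note that broadcast() is a hint, and Spark may choose a broadcast join on its own for small inputs:

```python
from pyspark.sql.functions import broadcast

people.cache()   # keep in memory after the first action
people.count()   # an action; this materializes the cache

by_age = people.repartition(4, "age")  # redistribute into 4 partitions by age

bands = spark.createDataFrame([(34, "mid"), (45, "senior")], ["age", "band"])
joined = by_age.join(broadcast(bands), "age")  # hint: ship `bands` to every executor
joined.explain()  # look for BroadcastHashJoin in the physical plan
```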
UDFs

| Concept | Syntax / Example | Description |
| --- | --- | --- |
| Register UDF | `spark.udf.register("addOne", lambda x: x + 1)` | Registers a custom Python function as a UDF. |
| Use a UDF | `SELECT addOne(col) FROM table;` | Applies a registered UDF to a column in Spark SQL. |
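
Registration plus use in one sketch; passing an explicit return type avoids the default StringType:

```python
from pyspark.sql.types import IntegerType

spark.udf.register("addOne", lambda x: x + 1, IntegerType())
spark.sql("SELECT name, addOne(age) AS age_plus_one FROM people").show()
```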
Schema Management

| Concept | Syntax / Example | Description |
| --- | --- | --- |
| printSchema() | `df.printSchema()` | Displays the schema of a DataFrame. |
| schema | `df.schema` | Returns the schema as a StructType object. |
| Schema merge | `spark.read.option("mergeSchema", "true")` | Merges schemas when reading from multiple files. |
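
A sketch; the Parquet path in the last line is hypothetical, so it is left commented out:

```python
people.printSchema()   # tree view: column names, types, nullability
print(people.schema)   # the same information as a StructType object

# mergeSchema reconciles column differences across Parquet files (hypothetical path):
# merged = spark.read.option("mergeSchema", "true").parquet("/data/events/")
```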
Complex Types

| Concept | Syntax / Example | Description |
| --- | --- | --- |
| Array | `ARRAY<INT>` | Defines an array type. |
| Struct | `STRUCT<name: STRING, age: INT>` | Defines a nested structure. |
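
A sketch building a DataFrame with both types, then reaching into them:

```python
from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType)

schema = StructType([
    StructField("tags", ArrayType(StringType())),
    StructField("person", StructType([
        StructField("name", StringType()),
        StructField("age", IntegerType()),
    ])),
])
nested = spark.createDataFrame([(["a", "b"], ("alice", 34))], schema)
nested.selectExpr("tags[0] AS first_tag", "person.name AS who").show()
```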
Miscellaneous

| Concept | Syntax / Example | Description |
| --- | --- | --- |
| monotonically_increasing_id() | `SELECT monotonically_increasing_id() AS id FROM table;` | Generates unique, monotonically increasing (but not consecutive) IDs for rows. |
| input_file_name() | `SELECT input_file_name() FROM table;` | Retrieves the source file name associated with each row. |
| coalesce() | `SELECT coalesce(col1, col2) FROM table;` | Returns the first non-null value from the specified columns. |
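
All three in one query against the `people` view; the IDs are increasing but not consecutive, and input_file_name() is empty for in-memory views:

```python
spark.sql("""
    SELECT monotonically_increasing_id() AS id,   -- unique, increasing, not consecutive
           input_file_name()             AS src,  -- '' here: the view is in-memory
           coalesce(NULL, name)          AS name  -- first non-null argument wins
    FROM people
""").show(truncate=False)
```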

These categories and examples serve as a detailed reference for Spark SQL operations, covering foundational concepts, complex manipulations, and optimizations, ideal for a quick refresher.

Spark SQL Functions with Complex Use Cases

| Function | Example Use Case |
| --- | --- |
| collect_list() | Aggregate column values into a list for each group, useful for tracking customer purchase history: `SELECT customer_id, collect_list(product) FROM purchases GROUP BY customer_id;` |
| concat() | Concatenate multiple address fields into one formatted address: `SELECT concat(street, ', ', city, ', ', zip) AS address FROM addresses;` |
| row_number() | Number rows within each group, useful for ranking: `SELECT *, row_number() OVER (PARTITION BY category ORDER BY sales DESC) AS rank FROM sales_data;` |
| date_add() | Calculate future dates, such as a payment due date: `SELECT order_id, date_add(order_date, 30) AS due_date FROM orders;` |
| CASE WHEN and coalesce() | Assign risk categories while handling nulls: `SELECT customer_id, CASE WHEN age > 60 THEN 'high' WHEN age > 30 THEN 'medium' ELSE 'low' END AS risk, coalesce(income, 0) AS income FROM customers;` |
| array_contains() | Filter for specific tags in an array column: `SELECT * FROM posts WHERE array_contains(tags, 'pyspark');` |
| explode() | Expand array items into individual rows: `SELECT order_id, explode(items) AS item FROM orders;` |
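
The array functions are the least SQL-like of the group; here is a sketch with a hypothetical `posts` view:

```python
posts = spark.createDataFrame(
    [(1, ["pyspark", "sql"]), (2, ["pandas"])], ["post_id", "tags"]
)
posts.createOrReplaceTempView("posts")

spark.sql("SELECT * FROM posts WHERE array_contains(tags, 'pyspark')").show()
spark.sql("SELECT post_id, explode(tags) AS tag FROM posts").show()  # one row per tag
```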

These entries capture essential Spark SQL commands and operations for data handling, transformation, optimization, and schema management, offering a quick-reference sheet for both simple and complex use cases.

