• Certainly! Here’s the complete crisp PySpark Interview Q&A Cheat Sheet with all your questions so far, formatted consistently for flashcards, Excel, or cheat sheet use:

    Q: How do you handle schema mismatch when reading multiple JSON/Parquet files with different structures?
    A: Use .option("mergeSchema", "true") when reading Parquet files; for JSON, unify schemas by selecting common columns, or supply an explicit schema and use .select() with null filling.

    Q: You want to write a DataFrame back to Parquet but keep the original file size consistent. What options do you use?
    A: Control file size with .option("parquet.block.size", sizeInBytes) and .option("parquet.page.size", sizeInBytes); also control the number of output files via .repartition() before writing.

    Q: Why might a join operation cause executor OOM errors, and how can you avoid it?
    A: Large shuffle data, skewed keys, or huge join sides cause OOM. Avoid it by broadcasting small tables, repartitioning by the join key, filtering data early, and salting skewed keys.

    Q: But I'm joining on an id key with about 5 million distinct values — should I do df1.repartition("join_key")?
    A: Yes, repartition both DataFrames on join_key to optimize the shuffle if the distribution is even. Beware of skew; consider salting if the key is skewed.

    Q: You're reading a CSV with missing values. How would you replace nulls dynamically across all columns?
    A: Loop through df.dtypes and use .withColumn() with when(col.isNull(), default) for each type: 0 for numbers, "missing" for strings, False for booleans, etc.

    Q: How do you handle corrupt records while reading JSON/CSV?
    A: For JSON: .option("badRecordsPath", "path") (Databricks-specific) to save corrupt records; for CSV: .option("mode", "PERMISSIVE") or "DROPMALFORMED", plus .option("columnNameOfCorruptRecord", "_corrupt_record").

    Q: How do you handle duplicate rows in a DataFrame?
    A: Use .dropDuplicates() to remove exact duplicates, or .dropDuplicates([col1, col2]) to deduplicate on specific columns.

    Q: How do you handle nulls before aggregation?
    A: Use .fillna() with appropriate defaults before groupBy and aggregation.

    Q: How do you read only specific columns from a Parquet file?
    A: Use .select("col1", "col2") after .read.parquet() to load only the required columns.

    Q: How do you optimize wide transformations like joins and groupBy?
    A: Broadcast small DataFrames, repartition by join/group keys, cache reused data, filter early, and avoid unnecessary shuffles.

    Q: How do you write partitioned Parquet files with overwrite behavior?
    A: .write.mode("overwrite").partitionBy("year", "month").parquet("path")

    Q: How do you check for skew in Spark join keys?
    A: .groupBy("join_key").count().orderBy("count", ascending=False).show(10) to find skewed keys.

    Q (#6): How do you read a nested JSON and flatten it into a tabular format using PySpark?
    A:
        from pyspark.sql.functions import explode, col
        df = spark.read.json("path/to/json")
        flat_df = df.select("id", explode("nested_array").alias("element"))

    Q (#7): How would you implement slowly changing dimension Type 2 (SCD2) logic using PySpark?
    A: Use Delta Lake MERGE:
        1. Match on the business key.
        2. Update the old record with is_current = false.
        3. Insert the new row with is_current = true.

    Q (#8): You need to read from a Kafka topic where each message is a JSON. How would you parse and process it?
    A:
        from pyspark.sql.functions import from_json, col
        df = spark.readStream.format("kafka").option("subscribe", "topic").load()
        parsed_df = df.select(from_json(col("value").cast("string"), schema).alias("data")).select("data.*")

    Q (#9): How would you perform incremental data loading from a source like MySQL to Delta Lake?
    A: Read the source with last_updated > last_checkpoint, write to Delta using MERGE or append, and track the watermark using an audit column or checkpoint.

    Q (#10): What's the difference between cache() and persist()? When would you use one over the other?
    A: cache() is persist() with the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames); persist() lets you choose levels such as MEMORY_AND_DISK_SER. Use persist() when data is too large for memory alone or you need a specific level.

    Q (#11): You have skewed data during a groupBy. How would you optimize this?
    A: Add a salting key, repartition by the skewed key, use map-side combine, and filter or pre-aggregate before the shuffle.

    Q (#12): How would you write unit tests for your PySpark code?
    A: Use unittest or pytest with a local SparkSession. Validate logic by comparing actual vs. expected DataFrames.

    Q (#13): You want to add a row number to a DataFrame partitioned by a key and ordered by timestamp. How do you do it?
    A:
        from pyspark.sql.window import Window
        from pyspark.sql.functions import row_number
        w = Window.partitionBy("user_id").orderBy("timestamp")
        df = df.withColumn("row_num", row_number().over(w))

    Q (#14): You're writing to a Delta table and want to merge only new records. How do you use MERGE INTO?
    A:
        (deltaTable.alias("t")
            .merge(source_df.alias("s"), "t.id = s.id")
            .whenNotMatchedInsertAll()
            .execute())

    Q (#15): Your Spark job is taking too long. How would you go about debugging and optimizing performance?
    A: Use the Spark UI to check stages and tasks, optimize shuffles by avoiding unnecessary wide transformations, and use caching, repartitioning, and broadcasting smartly.

    Q (#16): How would you read a huge CSV file and write it as partitioned Parquet based on a date column?
    A:
        df = spark.read.csv("path", header=True, inferSchema=True)
        df.write.partitionBy("date_col").parquet("output_path")

    Q (#17): You want to broadcast a small DataFrame in a join. How do you do it and what are the caveats?
    A:
        from pyspark.sql.functions import broadcast
        joined = df1.join(broadcast(df2), "key")
    ⚠️ Caveat: df2 must fit in executor (and driver) memory.

    Q (#18): You're processing streaming data from Kafka. How would you ensure exactly-once semantics?
    A: Use Kafka as the source and Delta as the sink, and enable checkpointing with .option("checkpointLocation", "chk_path"); Delta's idempotent writes give exactly-once behavior.

    Q (#19): You have a list of dates per user and want to generate a daily activity flag for each day in a month. How do you do it?
    A: Create a full calendar using sequence() and explode, then left join the user activity and fill nulls with 0.

    Q (#20): Your PySpark script runs fine locally but fails on the cluster. What could be the possible reasons?
    A: 1. Missing dependencies or JARs; 2. Incorrect paths (local vs. HDFS/S3); 3. Memory/resource config mismatch; 4. Spark version conflicts.
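
    As a quick illustration of the dynamic null-replacement answer above, here is a minimal sketch (it assumes an existing SparkSession and DataFrame; the type-to-default mapping is just one reasonable choice):

    from pyspark.sql.functions import col, when, lit

    def fill_nulls_by_type(df):
        """Fill nulls with a type-appropriate default for every column."""
        defaults = {"int": 0, "bigint": 0, "double": 0.0, "float": 0.0,
                    "string": "missing", "boolean": False}
        for name, dtype in df.dtypes:
            if dtype in defaults:
                df = df.withColumn(
                    name,
                    when(col(name).isNull(), lit(defaults[dtype])).otherwise(col(name)),
                )
        return df

    # Usage (hypothetical path):
    # cleaned = fill_nulls_by_type(spark.read.csv("path/to/file.csv", header=True, inferSchema=True))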


    Here’s the next set of questions with crisp answers in the same clean format for your cheat sheet or flashcards:

    Q1: How can you optimize PySpark jobs for better performance? Discuss techniques like partitioning, caching, and broadcasting.
    A: Partition data to reduce shuffle, cache/persist reused DataFrames, broadcast small datasets in joins to avoid shuffle, filter early, and avoid wide transformations when possible.

    Q2: What are accumulators and broadcast variables in PySpark? How are they used?
    A: Accumulators are variables for aggregating information (like counters) across executors. Broadcast variables are read-only shared variables sent to executors to avoid data duplication, mainly for small datasets used in joins or lookups.

    Q3: Describe how PySpark handles data serialization and the impact on performance.
    A: Spark uses JVM serialization by default and optionally Kryo for faster, more compact serialization; inefficient serialization causes slow tasks and high GC overhead.

    Q4: How does PySpark manage memory, and what are some common issues related to memory management?
    A: The JVM heap is divided into execution memory (shuffle, sort) and storage memory (cached data); common issues include OOM errors due to skew, caching too much, or large shuffle spills.

    Q5: Explain the concept of checkpointing in PySpark and its importance in iterative algorithms.
    A: Checkpointing saves an RDD to reliable storage and truncates its lineage (DAG); it helps avoid recomputation and stack overflows in iterative or long-lineage jobs.

    Q6: How can you handle skewed data in PySpark to optimize performance?
    A: Use salting keys, broadcast the smaller side, repartition skewed keys separately, or filter/aggregate before the join/groupBy.

    Q7: Discuss the role of the DAG (Directed Acyclic Graph) in PySpark's execution model.
    A: The DAG represents the lineage of transformations; Spark creates stages from the DAG to optimize task execution and scheduling.

    Q8: What are some common pitfalls when joining large datasets in PySpark, and how can they be mitigated?
    A: Skewed joins causing OOM, shuffle explosion, and failing to broadcast small tables; mitigate by broadcasting, repartitioning, salting skewed keys, and filtering early.

    Q9: Describe the process of writing and running unit tests for PySpark applications.
    A: Use a local SparkSession in the test setup, write test cases using unittest or pytest, and compare expected vs. actual DataFrames using .collect() or DataFrame equality checks.

    Q10: How does PySpark handle real-time data processing, and what are the key components involved?
    A: Through the Structured Streaming API; key components are the source (Kafka, socket), the query with transformations, the sink (console, Kafka, Delta), and checkpointing for fault tolerance.

    Q11: Discuss the importance of schema enforcement in PySpark and how it can be implemented.
    A: It enforces data quality and prevents runtime errors; implement it by defining an explicit schema (StructType) when reading data.

    Q12: What is the Tungsten execution engine in PySpark, and how does it improve performance?
    A: Tungsten optimizes memory management using off-heap memory and code generation, improving CPU efficiency and reducing GC overhead.

    Q13: Explain the concept of window functions in PySpark and provide use cases where they are beneficial.
    A: They perform calculations across rows related to the current row (e.g., running totals, rankings); useful for time series, sessionization, and cumulative metrics.

    Q14: How can you implement custom partitioning in PySpark, and when would it be necessary?
    A: Use partitionBy on write, or rdd.partitionBy() with a custom partitioner function; necessary to optimize joins or shuffles on specific keys.

    Q15: Discuss the methods available in PySpark for handling missing or null values in datasets.
    A: Use .fillna(), .dropna(), or .replace() to handle nulls; do conditional filling with when() and otherwise().

    Q16: What are some strategies for debugging and troubleshooting PySpark applications?
    A: Use the Spark UI for logs and stages, enable verbose logging, test locally, isolate problem steps, and use accumulators or debug prints.

    Q17: What are some best practices for writing efficient PySpark code?
    A: Use the DataFrame API over RDDs, avoid UDFs if possible, cache smartly, minimize shuffles, broadcast small tables, filter early, and use built-in functions.

    Q18: How can you monitor and tune the performance of PySpark applications in a production environment?
    A: Use the Spark UI, Ganglia, or the Spark History Server; tune executor memory, cores, and shuffle partitions; analyze the DAG and optimize hotspots.

    Q19: How can you implement custom UDFs (User-Defined Functions) in PySpark, and what are the performance considerations?
    A: Use pyspark.sql.functions.udf, or Pandas UDFs for vectorized performance; avoid plain Python UDFs when possible due to serialization overhead.

    Q20: What are the key strategies for optimizing memory usage in PySpark applications, and how do you implement them?
    A: Tune executor memory, rely on Tungsten optimizations, cache only needed data, avoid large shuffles, and repartition data wisely.

    Q21: How does PySpark's Tungsten execution engine improve memory and CPU efficiency?
    A: By using off-heap memory management, whole-stage code generation, and cache-friendly data structures to reduce CPU cycles and GC pauses.

    Q22: What are the different persistence storage levels in PySpark, and how do they impact memory management?
    A: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, MEMORY_AND_DISK_SER, etc.; choose based on dataset size and available memory to balance speed, memory use, and recomputation cost.

    Q23: How can you identify and resolve memory bottlenecks in a PySpark application?
    A: Monitor the Spark UI for GC times and shuffle spills, adjust memory fractions, fix data skew, reduce the cached data size, and tune serialization.
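
    To make Q2 concrete, here is a small, hedged sketch of an accumulator and a broadcast variable in action (the data and lookup table are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("acc-bcast-demo").getOrCreate()
    sc = spark.sparkContext

    bad_rows = sc.accumulator(0)                        # counter aggregated across executors
    countries = sc.broadcast({"IN": "India", "US": "United States"})  # read-only lookup

    def expand(row):
        code, amount = row
        if amount is None:
            bad_rows.add(1)                             # executors only add; the driver reads the total
            amount = 0
        return (countries.value.get(code, "Unknown"), amount)

    rdd = sc.parallelize([("IN", 10), ("US", None), ("FR", 5)])
    print(rdd.map(expand).collect())                    # [('India', 10), ('United States', 0), ('Unknown', 5)]
    print("rows with missing amount:", bad_rows.value)  # 1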


  • In Python, a list is a mutable, ordered collection of items. Let’s break down how it is created, stored in memory, and how inbuilt methods work — including internal implementation details.


    🔹 1. Creating a List

    my_list = [1, 2, 3, 4]
    
    • This creates a list of 4 integers.
    • Lists can contain elements of mixed data types:
    mixed = [1, 'hello', 3.14, [10, 20]]
    

    🔹 2. How Python List is Stored in Memory

    Python lists are implemented as dynamic arrays (not linked lists like in some languages).

    ✅ Internals:

    • A list is an array of pointers (references) to objects.
    • When you create a list like [1, 2, 3], Python stores references to the integer objects, not the values directly.

    Let’s dig in and demystify how Python manages integer objects and where the actual “integer value” lives. 🚀


    🎯 The Key Idea

    When you do:

    my_list = [1, 2, 3]
    

    ✅ Python doesn’t store the integer values directly in the list.
    ✅ Instead, it stores references (pointers) to integer objects in memory.


    🔎 Where are the integer objects themselves?

    ✅ The integer objects (like 1, 2, 3) live in the heap memory (dynamically allocated memory managed by the Python runtime).
    ✅ Each of them is an instance of the int type (in CPython, they’re PyLongObject).

    ✅ The list itself is an object in memory (with its own structure), which has an array of pointers to these integer objects.


    💡 Visualizing it:

    Let’s think of it as:

    my_list →  [  ref1,  ref2,  ref3 ]
                 |       |       |
                 v       v       v
               int(1)  int(2)  int(3)
    
    • my_list has slots to store references.
    • Each reference points to an integer object (allocated in the heap).

    ⚡️ So where is the integer’s “value” itself?

    ✅ Inside the int object in the heap, there’s an internal field (like ob_digit in CPython) that holds the actual number.

    For example, in CPython:

    struct _longobject {
        PyObject_VAR_HEAD
        digit ob_digit[1]; // The actual number stored here!
    };
    

    ✅ So the actual 1, 2, 3 digits are inside the int object in memory.


    🔥 Quick Summary

    ✅ When you do [1, 2, 3], Python:
    1️⃣ Allocates a list object (array of references).
    2️⃣ Each slot in the list points to an integer object (int instance).
    3️⃣ Each int object stores the actual numeric value inside its own memory layout (like ob_digit).
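
    A tiny CPython-specific check of this (it relies on id() and CPython's small-integer cache, so it is illustrative rather than guaranteed by the language):

    a = [1, 2, 3]
    b = [3, 2, 1]

    # CPython caches small ints, so both lists point at the very same int objects,
    # which shows the list slots hold references rather than copied values.
    print(a[0] is b[2])          # True on CPython
    print(id(a[1]) == id(b[1]))  # True on CPython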


    ✅ Memory Allocation:

    import sys
    lst = []
    print(sys.getsizeof(lst))  # → 56 bytes on 64-bit Python
    lst.append(1)
    print(sys.getsizeof(lst))  # → Grows when resized
    

    Python over-allocates memory to avoid resizing every time you append. The overall strategy is based on amortized constant time for append().

    📌 C Code Insight (CPython):

    Internally, a list is a PyListObject:

    typedef struct {
        PyObject_VAR_HEAD
        PyObject **ob_item;
        Py_ssize_t allocated;
    } PyListObject;
    
    • ob_item is a pointer to an array of pointers (each pointing to a Python object).
    • allocated is the size of the allocated memory (can be more than the actual number of items).

    🔹 3. How Python Inbuilt List Methods Work

    append()

    my_list.append(10)
    
    • Adds an element to the end of the list.
    • Internally checks if capacity is enough; if not, resizes the array.
    • Resize typically increases size by about 1.125× current size.
    // Simplified from CPython's Objects/listobject.c
    static int
    list_append(PyListObject *self, PyObject *object)
    {
        Py_ssize_t n = Py_SIZE(self);
        if (n == self->allocated) {
            list_resize(self, n + 1);   // grow capacity (over-allocates)
        }
        Py_INCREF(object);              // the list now holds a reference
        self->ob_item[n] = object;
        Py_SET_SIZE(self, n + 1);
        return 0;
    }
    

    pop()

    my_list.pop()
    
    • Removes the last element.
    • Decreases size, but does not shrink the allocated space immediately.

    insert(index, value)

    my_list.insert(2, 100)
    
    • Internally shifts elements to the right to make room at the given index.
    • O(n) time complexity because of the shifting.

    remove(value)

    my_list.remove(3)
    
    • Finds the first occurrence of the value and deletes it (O(n)).
    • Internally compares each element for equality (via PyObject_RichCompareBool) and then shifts the remaining items with memmove.

    🔹 4. Example: Custom Implementation Mimicking List

    Here’s a Python version of a basic dynamic array to help understand:

    class MyList:
        def __init__(self):
            self.capacity = 4
            self.length = 0
            self.array = [None] * self.capacity
    
        def append(self, item):
            if self.length == self.capacity:
                self._resize()
            self.array[self.length] = item
            self.length += 1
    
        def _resize(self):
            self.capacity *= 2
            new_array = [None] * self.capacity
            for i in range(self.length):
                new_array[i] = self.array[i]
            self.array = new_array
    
        def __getitem__(self, index):
            if 0 <= index < self.length:
                return self.array[index]
            raise IndexError("Index out of bounds")
    
        def __str__(self):
            return str([self.array[i] for i in range(self.length)])
    
    # Usage
    ml = MyList()
    ml.append(10)
    ml.append(20)
    ml.append(30)
    print(ml)  # → [10, 20, 30]
    

    🔹 5. Time Complexity of Common List Operations

    Operation  | Average Time
    append     | O(1)
    pop (end)  | O(1)
    insert     | O(n)
    remove     | O(n)
    index      | O(n)
    iteration  | O(n)
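
    To see the O(1) vs. O(n) difference from the table above in practice, here is a small timing sketch (absolute numbers will vary by machine):

    import timeit

    n = 20_000

    def build_with_append():
        lst = []
        for i in range(n):
            lst.append(i)          # amortized O(1): appends at the end

    def build_with_insert_front():
        lst = []
        for i in range(n):
            lst.insert(0, i)       # O(n): every existing element shifts right

    print("append      :", timeit.timeit(build_with_append, number=10))
    print("insert(0, x):", timeit.timeit(build_with_insert_front, number=10))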

    How does the size/length function of a list work internally? What is the implementation code?

    Great question! Let’s explore how the len() function works for a Python list internally, especially in CPython (the standard Python implementation), and what it actually does under the hood.


    🔹 1. Using len() in Python

    my_list = [1, 2, 3, 4]
    print(len(my_list))  # Output: 4
    

    Looks simple — but internally it uses a very efficient method that does not iterate through the list.


    🔹 2. How len() Works Under the Hood

    When you call len(my_list), Python internally does something like:

    my_list.__len__()
    

    This is implemented via a C function called PyObject_Length() in CPython.


    🔹 3. CPython Source Code Path

    The actual C implementation can be summarized as:

    len(obj) calls:

    Py_ssize_t PyObject_Length(PyObject *o) {
        if (o == NULL)
            return -1;
    
        PyTypeObject *type = Py_TYPE(o);
    
        if (type->tp_as_sequence && type->tp_as_sequence->sq_length) {
            return type->tp_as_sequence->sq_length(o);
        }
    
        if (type->tp_as_mapping && type->tp_as_mapping->mp_length) {
            return type->tp_as_mapping->mp_length(o);
        }
    
        return -1;
    }
    

    ✅ For a list object, this points to:

    static Py_ssize_t
    list_length(PyListObject *a) {
        return Py_SIZE(a);  // Very fast — just returns the size field
    }
    

    🔹 4. What is Py_SIZE()?

    This is a macro defined in CPython:

    #define Py_SIZE(ob) (((PyVarObject*)(ob))->ob_size)
    

    So basically:

    • Each Python object has a header (PyVarObject) with a field ob_size that stores the number of elements.
    • For list, this ob_size is updated every time you append, remove, or change the list.

    ✅ So calling len(my_list) is O(1) — constant time.


    🔹 5. What Does This Mean for You?

    • You never need to manually count elements.
    • len() is instantaneous, no matter how big the list is.
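
    A quick way to convince yourself: time len() on a tiny list and a large one; both calls just read the stored ob_size field, so they take about the same time.

    import timeit

    small = list(range(10))
    large = list(range(1_000_000))

    print(timeit.timeit(lambda: len(small), number=1_000_000))
    print(timeit.timeit(lambda: len(large), number=1_000_000))  # roughly the same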

    🔹 6. Replicating This in a Custom Class

    If you want to mimic this in a custom class:

    class MyList:
        def __init__(self):
            self.data = []
            self.length = 0
    
        def append(self, item):
            self.data.append(item)
            self.length += 1
    
        def __len__(self):
            return self.length
    
    # Usage
    ml = MyList()
    ml.append(10)
    ml.append(20)
    print(len(ml))  # Output: 2
    

    Excellent question! Let’s dive deep into how sorting works internally on a Python list, what algorithm it uses, and walk through pseudo-code to understand it.


    🔹 1. How to Sort in Python

    arr = [5, 2, 9, 1]
    arr.sort()               # In-place sort
    sorted_arr = sorted(arr) # Returns a new sorted list
    

    Both use the same underlying C function, just with a different approach (in-place vs. copy).


    🔹 2. What Algorithm Does Python Use to Sort?

    ✅ Python uses a sorting algorithm called Timsort — a hybrid algorithm derived from:

    • Merge Sort (stable)
    • Insertion Sort (efficient on small or partially sorted data)

    🔹 3. Why Timsort?

    • Optimized for real-world data
    • Detects runs (already sorted sublists) and merges them efficiently
    • Stable: maintains relative order of equal elements
    • Time complexity:
      • Best case: O(n) (if already sorted)
      • Average/Worst case: O(n log n)

    🔹 4. How Timsort Works (Simplified Overview)

    1. Find runs (contiguous increasing or decreasing sequences)
    2. If a run is shorter than the minimum run length (minrun, chosen between 32 and 64), extend it using insertion sort
    3. Merge runs using merge sort logic while maintaining certain invariants

    🔹 5. Timsort Pseudocode (Simplified)

    function timsort(array):
        MIN_RUN = 32
        n = length(array)
    
        # Step 1: Sort small pieces with insertion sort
        for i = 0 to n in steps of MIN_RUN:
            insertion_sort(array, i, min(i + MIN_RUN - 1, n-1))
    
        # Step 2: Merge sorted runs
        size = MIN_RUN
        while size < n:
            for start = 0 to n in steps of 2*size:
                mid = start + size - 1
                end = min((start + 2*size - 1), (n-1))
    
                if mid < end:
                    merge(array, start, mid, end)
            size *= 2
    

    insertion_sort(arr, left, right)

    for i from left+1 to right:
        key = arr[i]
        j = i - 1
        while j >= left and arr[j] > key:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key
    

    merge(arr, l, m, r)

    • Standard merge function from merge sort

    🔹 6. CPython Internal File

    Python’s list sort is implemented in:

    CPython/Objects/listobject.c
    

    Main function:

    list.sort() → listsort_impl()
    

    Core sorting engine (Timsort) is in:

    CPython/Objects/listobject.c  (the Timsort implementation lives in this file; the algorithm is documented in Objects/listsort.txt)
    

    🔹 7. Python Behavior with Custom Keys

    arr = ["apple", "Banana", "cherry"]
    arr.sort(key=str.lower)
    
    • Timsort also works with a key function by (conceptually) precomputing a decorated version:
      decorated = [(key(x), x) for x in arr]
      decorated.sort()
      arr = [x for (key, x) in decorated]
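
    One observable consequence: the key function is called exactly once per element, no matter how many comparisons the sort performs. A small sketch (the counting wrapper is only for illustration):

    calls = 0

    def lower_once(s):
        global calls
        calls += 1
        return s.lower()

    arr = ["apple", "Banana", "cherry", "apple"]
    arr.sort(key=lower_once)

    print(arr)    # ['apple', 'apple', 'Banana', 'cherry']
    print(calls)  # 4, i.e. one key computation per element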

    🔹 8. Why Not Quicksort?

    • Quicksort is not stable
    • Bad worst-case performance (O(n²))
    • Timsort is more predictable and optimized for Python use cases

    Would you like to see a real working code implementation of Timsort in Python?

    How does n = length(array) happen internally in the above?

    Great follow-up! When you write:

    n = len(array)
    

    or in your Timsort pseudo-code:

    n = length(array)
    

    Python uses the built-in len() function to get the number of elements. Here’s how it works internally, especially in the context of CPython, the standard Python interpreter.


    🔹 1. User Perspective

    arr = [5, 2, 9, 1]
    n = len(arr)  # returns 4
    

    Seems simple — but under the hood, len(arr) is not counting the items one by one. Instead, it accesses a stored size field inside the list object directly.


    🔹 2. How len() Works Internally (CPython)

    Python objects (like lists) are implemented as C structs. Lists are of type PyListObject.

    Here’s the simplified C structure:

    typedef struct {
        PyObject_VAR_HEAD       // contains ob_size
        PyObject **ob_item;     // pointer to the array of items
        Py_ssize_t allocated;   // capacity (allocated space)
    } PyListObject;
    

    ob_size is the actual number of elements in the list.

    So len(arr) is equivalent to reading this field:

    #define Py_SIZE(ob) (((PyVarObject*)(ob))->ob_size)
    

    ✅ The len() function in C:

    Py_ssize_t PyObject_Length(PyObject *o) {
        if (o == NULL)
            return -1;
    
        if (PyList_Check(o))
            return Py_SIZE(o);  // O(1)
        
        // for other objects, look up tp_as_sequence, etc.
    }
    

    🔹 3. What Happens in Timsort’s n = len(array)?

    In Python:

    n = len(array)
    

    Internally, it is:

    1. Calls PyObject_Length(array)
    2. For list objects: returns Py_SIZE(array) directly (O(1))

    ✅ This means even in large lists with millions of elements, len() is instant — it never loops through the list.


    🔹 4. If You Create a Custom Object

    You must implement the __len__() method to make len(obj) work:

    class MyArray:
        def __init__(self, data):
            self.data = data
    
        def __len__(self):
            return len(self.data)
    
    arr = MyArray([1, 2, 3])
    print(len(arr))  # 3
    

    Summary:

    Aspect        | How it works
    Speed         | O(1), constant time
    Where stored  | ob_size field in the C structure
    CPython impl  | Py_SIZE() macro (direct access)
    Custom class  | Must define __len__()

    Great catch — let’s explain the full path of how len() works in Python for any object, including custom classes and built-in types.


    🔹 1. What does this line mean?

    // for other objects, look up tp_as_sequence, etc.
    

    In CPython, every object has a type (PyTypeObject), and that type defines how functions like len(), +, indexing, etc., behave.

    So when you call:

    len(obj)
    

    Python does not always know how to compute the length unless:

    • The object is a built-in like list, tuple, dict (which define tp_as_sequence or tp_as_mapping)
    • OR it defines the __len__() method in a custom class

    🔹 2. CPython Internal: PyObject_Length() (Simplified)

    Here’s the actual C logic behind len():

    Py_ssize_t PyObject_Length(PyObject *o) {
        PyTypeObject *type = Py_TYPE(o);
    
        if (type->tp_as_sequence && type->tp_as_sequence->sq_length) {
            return type->tp_as_sequence->sq_length(o);
        }
        if (type->tp_as_mapping && type->tp_as_mapping->mp_length) {
            return type->tp_as_mapping->mp_length(o);
        }
    
        // A Python class that defines __len__ gets these slots filled with a
        // wrapper automatically, so one of the branches above handles it.
        // If neither slot is set, length is unsupported:
        PyErr_SetString(PyExc_TypeError, "object has no len()");
        return -1;
    }
    

    🔹 3. What Are tp_as_sequence and tp_as_mapping?

    These are C structs in CPython that represent behavior for:

    • Sequences (like list, tuple, str)
    • Mappings (like dict)

    Sequence Type Struct (Example):

    typedef struct {
        lenfunc sq_length;       // len(obj)
        binaryfunc sq_concat;    // obj1 + obj2
        ssizeargfunc sq_item;    // obj[i]
        ...
    } PySequenceMethods;
    

    Mapping Type Struct (Example):

    typedef struct {
        lenfunc mp_length;       // len(obj)
        binaryfunc mp_subscript; // obj[key]
        ...
    } PyMappingMethods;
    

    🔹 4. Custom Object with __len__()

    When the length slots are not provided by C code, a class that defines __len__() in Python gets them filled with a wrapper that calls __len__() via attribute lookup:

    class MyCustom:
        def __len__(self):
            return 42
    
    x = MyCustom()
    print(len(x))  # 42
    

    CPython calls the equivalent of:

    PyObject_CallMethod(o, "__len__", NULL);
    

    Which internally resolves and calls the Python __len__ method.


    🔹 5. Fallback Mechanism

    If the object doesn't support length at all (it is not a sequence or mapping and does not define __len__()), Python raises:

    TypeError: object of type 'X' has no len()
    

    Summary Table:

    Object TypeHow len() is computed
    List/Tuple/Strtp_as_sequence->sq_length (fast, direct)
    Dicttp_as_mapping->mp_length
    Custom ClassCalls __len__() method
    No supportRaises TypeError: object of type 'X' has no len()

    new_array = [None] * self.capacity ... what is the significance of [None] here?

    Great question! Let’s break down this common Python pattern:

    new_array = [None] * self.capacity
    

    🔹 What this line does

    It creates a new list of size self.capacity, where every element is initialized to None.

    Example:

    capacity = 5
    new_array = [None] * capacity
    print(new_array)  # Output: [None, None, None, None, None]
    

    🔹 Significance of None

    None is a special singleton in Python that means “no value” or “empty”.

    Using [None] * capacity helps in preallocating space in a list-based data structure like:

    • A dynamic array
    • A manual implementation of a stack, queue, or buffer
    • Mimicking lower-level behavior like in C arrays

    It acts as a placeholder, so that positions are available to fill later.


    🔹 Why not just use []?

    Using [] (empty list) starts with no space:

    arr = []             # length is 0, no slots pre-allocated
    arr.append(1)        # grows on demand (occasional resizes as it fills up)
    

    Using [None] * n is useful when:

    • You know the maximum capacity ahead of time
    • You want fast random access (like setting arr[2] = 10)
    • You are implementing array resizing logic like Python’s own list does

    🔹 Example in Dynamic Array Implementation

    Here’s a simplified dynamic array class:

    class DynamicArray:
        def __init__(self):
            self.capacity = 4
            self.length = 0
            self.array = [None] * self.capacity
    
        def append(self, item):
            if self.length == self.capacity:
                self._resize()
            self.array[self.length] = item
            self.length += 1
    
        def _resize(self):
            self.capacity *= 2
            new_array = [None] * self.capacity
            for i in range(self.length):
                new_array[i] = self.array[i]
            self.array = new_array
    

    Here, [None] * self.capacity pre-allocates space efficiently.


    🔹 Summary

    Part               | Meaning
    None               | Placeholder for "no value yet"
    [None] * capacity  | Creates a fixed-size list of empty slots
    Used in            | Dynamic arrays, buffers, queues, stacks, etc.
    Advantage          | Efficient pre-allocation and random access

    self.capacity *= 2
    new_array = [None] * self.capacity ... why are we doubling the capacity even if we are only adding a single list item?

    Fantastic question — this goes into the heart of how dynamic arrays (like Python’s list) manage memory efficiently.


    🔹 Why do we double the capacity?

    Even if we’re only adding one item, we double the capacity to make future appends much faster.

    Without doubling:

    • Each .append() would require allocating a new list and copying all elements.
    • That’s O(n) per append → very inefficient if you do many appends.

    With doubling:

    • Resizing happens rarely.
    • Most appends take O(1) time — called amortized constant time.

    🔹 Real Example

    Let’s simulate it:

    capacity = 1
    arr = [None] * capacity
    length = 0
    
    # Suppose we insert items one by one:
    for i in range(10):
        if length == capacity:
            capacity *= 2
            new_arr = [None] * capacity
            # copy old elements (O(n))
            for j in range(length):
                new_arr[j] = arr[j]
            arr = new_arr
        arr[length] = i
        length += 1
    

    Even though we added 10 items, resizing happened only 4 times (capacity 1 → 2 → 4 → 8 → 16), i.e. about log₂(n) times rather than once per append.


    🔹 Visual Explanation (Capacity Doubling)

    Append # | Capacity Before | Resize Needed? | New Capacity
    1        | 1               | Yes            | 2
    2        | 2               | Yes            | 4
    3        | 4               | No             | 4
    5        | 4               | Yes            | 8
    9        | 8               | Yes            | 16

    🔹 Why not increase by 1, or 10?

    Strategy        | Problem
    Increase by 1   | Frequent resizing → O(n²) total time
    Increase by 10  | Wastes space if you add few elements
    Double (×2)     | Balance: few resizes, good space usage

    🔹 Python List Internals (CPython)

    Python lists are over-allocated behind the scenes. They follow a more nuanced strategy than just ×2, but it’s very similar in spirit.

    In CPython (source: listobject.c):

    new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6) + newsize;
    

    Which grows the allocation by roughly 12.5% plus a small constant, so small lists grow proportionally faster while large lists approach a 1.125× growth factor rather than strict doubling.
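
    A rough sketch of that growth rule in pure Python (it mirrors the formula quoted above; the exact constants differ across CPython versions):

    def cpython_like_new_allocated(newsize):
        # Mirrors: new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6) + newsize
        return (newsize >> 3) + (3 if newsize < 9 else 6) + newsize

    size = 0
    capacity = 0
    for i in range(1, 50):
        size += 1
        if size > capacity:
            capacity = cpython_like_new_allocated(size)
            print(f"append #{i}: grew capacity to {capacity}")
    # Capacity jumps roughly like 4, 8, 16, 25, 35, 46, ... instead of strict doubling.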


    🔹 Summary

    Concept             | Reason
    Double capacity     | Avoids frequent, expensive resizing
    Amortized O(1)      | Most appends are cheap; only occasional O(n) copies
    Optimized tradeoff  | Between memory usage and append performance

    Suppose I have to sort a Python list of 6000 items. How will sorting work here? Will it use both merge sort and insertion sort together?

    Excellent question — and yes, you’re right to think both merge sort and insertion sort are involved. Let’s walk through how Python sorts a list of 6000 items internally using Timsort, which is a hybrid algorithm.


    🔹 Recap: What sorting algorithm does Python use?

    Python uses Timsort, a combination of:

    • Insertion Sort for small chunks (runs)
    • Merge Sort for combining those sorted chunks

    It’s optimized for real-world data that often contains partially ordered sequences.


    🔹 What happens when you sort a 6000-item list?

    Here’s a step-by-step breakdown of how sorted(list_of_6000) or list.sort() works internally:


    ✅ Step 1: Identify Runs

    • Timsort first scans the list to find ascending or descending runs (sequences that are already sorted).
    • These runs are sorted individually (using insertion sort if they are small — typically < 64 elements).

    Example: [10, 20, 30, 25, 15, 40] → runs: [10, 20, 30], then [25, 15] (descending, reversed to [15, 25]), then [40]


    ✅ Step 2: Use Insertion Sort for Small Runs

    • For runs smaller than 64, Timsort uses binary insertion sort (efficient for short or nearly-sorted data).
    Run size ≈ 32–64 → Use Insertion Sort
    

    ✅ Step 3: Merge Runs Using Merge Sort

    • Once multiple runs are sorted individually, Timsort merges them in a smart sequence (based on stack of runs and specific rules to maintain balance).
    • Merging is stable (keeps equal elements in original order).

    🔹 So, yes — for 6000 elements:

    Phase            | Algorithm Used  | How Many Times?
    Initial sorting  | Insertion sort  | Dozens of times (on small runs)
    Final sorting    | Merge sort      | For merging the runs

    🔹 Visualization (Simplified for 6000 items)

    1. 🔍 Scan: Split list into ~100 runs of ~60 items
    2. ✏️ Sort each run with insertion sort
    3. 🔄 Merge them using merge sort logic in optimal order

    🔹 Timsort: Why Hybrid?

    Insertion Sort  | Merge Sort       | Timsort Combo
    Fast on small   | Good for big     | Best of both
    Simple logic    | Stable merging   | Superb real-world performance
    Cache friendly  | Divide & conquer | Avoids unnecessary comparisons

    🔹 Bonus: Want to See Pseudo-code?

    Here’s a stripped-down view of Timsort for 6000 items:

    def timsort(arr):
        MIN_RUN = 64
        n = len(arr)
    
        # Step 1: Break list into runs and sort them using insertion sort
        for start in range(0, n, MIN_RUN):
            end = min(start + MIN_RUN - 1, n - 1)
            insertion_sort(arr, start, end)
    
        # Step 2: Merge runs using merge sort strategy
        size = MIN_RUN
        while size < n:
            for left in range(0, n, 2 * size):
                mid = min(n - 1, left + size - 1)
                right = min((left + 2 * size - 1), (n - 1))
                merge(arr, left, mid, right)
            size *= 2
    

    ✅ Summary

    Aspect           | Explanation
    List size        | 6000 elements
    Initial sorting  | Split into small runs (32–64 elements), insertion sort
    Merging phase    | Stable merge (like merge sort)
    Sorting used     | Yes: both insertion sort and merge sort are used in Timsort
    Performance      | O(n log n) worst case, but fast on real-world data

    Would you like to run a trace on an actual Python list of 6000 items with some partially sorted regions to see how many runs it finds?
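
    If you want to try that, here is a small, hedged helper that counts the maximal ascending/descending runs Timsort's first pass would see (it ignores the minrun extension, so it is only an approximation of the real trace):

    import random

    def count_runs(arr):
        """Count maximal ascending (>=) or strictly descending runs, as in Timsort's first pass."""
        n = len(arr)
        if n < 2:
            return n
        runs = 0
        i = 0
        while i < n:
            runs += 1
            j = i + 1
            if j == n:
                break                       # a single trailing element forms its own run
            if arr[j] >= arr[i]:            # ascending run: extend while non-decreasing
                while j < n and arr[j] >= arr[j - 1]:
                    j += 1
            else:                           # descending run: extend while strictly decreasing
                while j < n and arr[j] < arr[j - 1]:
                    j += 1
            i = j
        return runs

    data = [random.randint(0, 1000) for _ in range(6000)]
    data[:3000] = sorted(data[:3000])           # make the first half partially sorted
    print("runs found:", count_runs(data))      # far fewer runs than in a fully random list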

  • Explain a scenario on schema evolution in data pipelines

    Here’s an automated Python script using PySpark that performs schema evolution between two datasets (e.g., two Parquet files or DataFrames):

    ✅ Features:

    1. Compares schemas of two DataFrames (e.g., old vs. new).
    2. Identifies added or missing fields.
    3. Adds missing fields with default values so that both DataFrames can be aligned and merged.
    4. Merges the two DataFrames safely.

    🔧 Prerequisites:

    • Apache Spark (PySpark)
    • Input: Two DataFrames (could be from old and new Parquet files)

    🧠 Script: Schema Evolution Handler

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit
    from pyspark.sql.types import StructType
    
    # Start Spark Session
    spark = SparkSession.builder \
        .appName("SchemaEvolutionHandler") \
        .getOrCreate()
    
    def align_schemas(df1, df2, default_map=None):
        """
        Aligns two DataFrames by adding missing fields with default values.
        
        Args:
            df1: First DataFrame
            df2: Second DataFrame
            default_map: Dict of default values to use for new fields
        
        Returns:
            Tuple of (aligned_df1, aligned_df2)
        """
        if default_map is None:
            default_map = {}
    
        # Get schemas
        schema1 = set(df1.schema.names)
        schema2 = set(df2.schema.names)
    
        # Find differences
        fields_only_in_df1 = schema1 - schema2
        fields_only_in_df2 = schema2 - schema1
    
        # Add missing columns with default values to df2
        for col in fields_only_in_df1:
            default_val = default_map.get(col, None)
            df2 = df2.withColumn(col, lit(default_val))
    
        # Add missing columns with default values to df1
        for col in fields_only_in_df2:
            default_val = default_map.get(col, None)
            df1 = df1.withColumn(col, lit(default_val))
    
        # Reorder columns for consistency
        final_columns = sorted(df1.columns)  # Or define your own order
        return df1.select(final_columns), df2.select(final_columns)
    
    
    # Example usage:
    # Load two DataFrames (e.g., old and new)
    df_old = spark.read.parquet("path/to/old_data.parquet")
    df_new = spark.read.parquet("path/to/new_data.parquet")
    
    # Optional: Provide default values for specific columns
    defaults = {
        "discount": 0.0,
        "promo_code": None,
        "shipping_cost": 0.0
    }
    
    # Align schemas
    df_old_aligned, df_new_aligned = align_schemas(df_old, df_new, default_map=defaults)
    
    # Merge or compare
    df_merged = df_old_aligned.unionByName(df_new_aligned)
    
    # Save or analyze
    df_merged.write.mode("overwrite").parquet("path/to/merged_output.parquet")
    

    🔍 Output:

    • A merged dataset with all fields from both versions.
    • Missing fields filled with appropriate default values.

    💡 Notes:

    • You can plug this into your ETL pipeline before merging historical and new data.
    • Extend it for nested schemas (with recursive functions) if dealing with complex structures.

    Automated script for schema evolution: first check which fields are missing or added, then assign default values for those fields so that both tables can be merged/compared.

    Yes, absolutely — you can use df1.columns and df2.columns to simplify the comparison when you’re only dealing with flat (non-nested) schemas. Here’s a cleaner, minimal version of the schema evolution logic using column lists:


    ✅ Flat Schema Evolution Script (Using df.columns)

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit
    
    # Start Spark Session
    spark = SparkSession.builder.appName("FlatSchemaEvolution").getOrCreate()
    
    # Load DataFrames
    df1 = spark.read.parquet("path/to/old_data.parquet")
    df2 = spark.read.parquet("path/to/new_data.parquet")
    
    # Optional: Default values for new fields
    default_values = {
        "discount": 0.0,
        "promo_code": None,
        "shipping_cost": 0.0
    }
    
    # Get column sets
    cols1 = set(df1.columns)
    cols2 = set(df2.columns)
    
    # Identify missing columns
    missing_in_df1 = cols2 - cols1
    missing_in_df2 = cols1 - cols2
    
    # Add missing columns to df1
    for col in missing_in_df1:
        default_val = default_values.get(col, None)
        df1 = df1.withColumn(col, lit(default_val))
    
    # Add missing columns to df2
    for col in missing_in_df2:
        default_val = default_values.get(col, None)
        df2 = df2.withColumn(col, lit(default_val))
    
    # Align column order (sorted or any desired order)
    final_columns = sorted(list(cols1.union(cols2)))
    df1_aligned = df1.select(final_columns)
    df2_aligned = df2.select(final_columns)
    
    # Merge or Compare
    df_merged = df1_aligned.unionByName(df2_aligned)
    
    # Save result
    df_merged.write.mode("overwrite").parquet("path/to/merged_output.parquet")
    

    🧠 Why This Works:

    • df.columns gives a simple list of column names.
    • You use basic set operations to find differences.
    • You fill in missing columns using withColumn() and lit().

    🛑 Limitation:

    This version won’t work with nested fields (like JSON inside a column). For that, you’d need recursive schema inspection using df.schema.

    Here’s a PySpark script that handles nested JSON structures for schema evolution. This version recursively:

    1. Compares two schemas.
    2. Identifies missing fields at all levels (even inside nested structs).
    3. Adds those missing fields with default values so both DataFrames can be aligned and merged.

    🧠 Key Concepts:

    • Uses StructType and StructField to parse nested schemas.
    • Recursively aligns nested fields by building a merged schema.
    • Fills in missing fields with default values (None, or you can customize).

    ✅ Full Script: Schema Evolution for Nested Structures

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit, struct, col
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType, BooleanType
    
    # Start Spark session
    spark = SparkSession.builder.appName("NestedSchemaEvolution").getOrCreate()
    
    def merge_schemas(schema1: StructType, schema2: StructType, default_map=None):
        """ Recursively merges two StructType schemas, returning a merged schema. """
        if default_map is None:
            default_map = {}
    
        fields = {}
        
        # Add fields from schema1
        for field in schema1.fields:
            fields[field.name] = field
    
        # Merge/compare with fields from schema2
        for field in schema2.fields:
            if field.name in fields:
                existing_field = fields[field.name]
                # If both are Structs, recurse
                if isinstance(existing_field.dataType, StructType) and isinstance(field.dataType, StructType):
                    merged_sub_schema = merge_schemas(existing_field.dataType, field.dataType, default_map)
                    fields[field.name] = StructField(field.name, merged_sub_schema, True)
            else:
                fields[field.name] = field
    
        return StructType(list(fields.values()))
    
    def fill_missing_fields(df, merged_schema, default_map=None):
        """ Adds missing top-level fields, and missing sub-fields of one-level structs, with default values """
        if default_map is None:
            default_map = {}

        for field in merged_schema.fields:
            if field.name not in df.columns:
                # Whole field is missing: add it as a typed default/null column
                default_val = default_map.get(field.name, None)
                df = df.withColumn(field.name, lit(default_val).cast(field.dataType))
            elif isinstance(field.dataType, StructType):
                # Struct field exists: rebuild it, adding any missing sub-fields
                existing_subfields = df.schema[field.name].dataType.names
                sub_cols = []
                for sub in field.dataType.fields:
                    full_name = f"{field.name}.{sub.name}"
                    if sub.name in existing_subfields:
                        sub_cols.append(col(full_name).alias(sub.name))
                    else:
                        default_val = default_map.get(full_name, None)
                        sub_cols.append(lit(default_val).cast(sub.dataType).alias(sub.name))
                df = df.withColumn(field.name, struct(*sub_cols))
        return df
    
    # Example: Load two JSON-based DataFrames
    df1 = spark.read.json("path/to/old_data.json")
    df2 = spark.read.json("path/to/new_data.json")
    
    # Optional: Set default values for specific nested fields
    default_values = {
        "discount": 0.0,
        "promo_code": None,
        "customer.email": "unknown@example.com"
    }
    
    # Merge schemas
    merged_schema = merge_schemas(df1.schema, df2.schema, default_map=default_values)
    
    # Fill missing fields
    df1_filled = fill_missing_fields(df1, merged_schema, default_map=default_values)
    df2_filled = fill_missing_fields(df2, merged_schema, default_map=default_values)
    
    # Ensure column order matches
    final_columns = df1_filled.columns  # should be the same now
    df1_aligned = df1_filled.select(final_columns)
    df2_aligned = df2_filled.select(final_columns)
    
    # Merge
    df_merged = df1_aligned.unionByName(df2_aligned)
    
    # Save output
    df_merged.write.mode("overwrite").json("path/to/merged_output.json")
    

    🔍 Example Input:

    Old Schema:

    {
      "order_id": "123",
      "customer": {
        "name": "Alice"
      }
    }
    

    New Schema:

    {
      "order_id": "456",
      "customer": {
        "name": "Bob",
        "email": "bob@example.com"
      },
      "discount": 5.0
    }
    

    🛑 Notes:

    • This version handles one level of nested structs.
    • If you have deeper levels of nesting or arrays of structs, it can be extended — let me know if you want that.

    Delta Lake questions — concepts, use cases, and commands.

    Sure! Here’s a structured breakdown of Delta Lake covering concepts, use cases, and commands:


    🔍 Delta Lake Concepts

    Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.

    Key Concepts:

    • ACID Transactions: Guarantees Atomicity, Consistency, Isolation, and Durability.
    • Schema Enforcement and Evolution: Ensures data adheres to a defined schema and supports schema changes over time.
    • Time Travel: Access previous versions of the data (using versions or timestamps).
    • Unified Batch and Streaming: Same Delta table can be used for both batch and streaming jobs.
    • Data Lineage and Audit: Built-in support for tracking data changes and version history.
    • Scalable Metadata Handling: Uses transaction logs (_delta_log) instead of listing files.

    💼 Use Cases

    1. Data Lakehouse Architecture

    • Combine the reliability of a data warehouse with the scalability of a data lake.

    2. Slowly Changing Dimensions (SCD)

    • Maintain historical changes in records.

    3. Change Data Capture (CDC)

    • Efficiently track and apply changes (inserts, updates, deletes).

    4. Streaming + Batch Pipeline

    • Process real-time data with the same Delta tables used for batch processing.

    5. Data Versioning & Rollbacks

    • Restore data to previous states in case of bad writes or failures.

    6. Data Governance

    • Enforce schema, maintain audit trails, and ensure consistency across jobs.
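
    To make the SCD/CDC use cases above concrete, here is a hedged upsert sketch using the delta-spark Python API (the table path and column names are hypothetical, and it assumes a Delta-enabled SparkSession):

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-upsert-demo").getOrCreate()

    target = DeltaTable.forPath(spark, "/delta/customers")      # existing Delta table
    updates = spark.read.parquet("/staging/customer_updates")   # incoming changes

    (target.alias("t")
        .merge(updates.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()       # apply changes to existing keys
        .whenNotMatchedInsertAll()    # insert brand-new keys
        .execute())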

    💻 Delta Lake Commands (on Apache Spark or Databricks)

    📁 Create / Write Tables

    df.write.format("delta").save("/delta/events")  # Write to path
    df.write.format("delta").saveAsTable("events")  # Save as managed table
    

    📥 Read Data

    df = spark.read.format("delta").load("/delta/events")
    df = spark.read.table("events")
    

    📌 Update Table

    UPDATE events SET eventType = 'click' WHERE eventType = 'tap'
    

    ❌ Delete Records

    DELETE FROM events WHERE eventType = 'spam'
    

    ➕ Merge (Upsert)

    MERGE INTO target_table USING source_table
    ON target_table.id = source_table.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
    

    ⏪ Time Travel

    # By version
    df = spark.read.format("delta").option("versionAsOf", 3).load("/delta/events")
    
    # By timestamp
    df = spark.read.format("delta").option("timestampAsOf", "2024-05-01").load("/delta/events")
    

    🔄 Convert Parquet to Delta

    from delta.tables import DeltaTable
    DeltaTable.convertToDelta(spark, "parquet.`/path/to/parquet`")
    

    🧾 Vacuum (Clean old files)

    VACUUM events RETAIN 168 HOURS  -- default is 7 days
    

    📜 Describe History (Versioning)

    DESCRIBE HISTORY events
    

    ✅ Summary

    Feature             | Benefit
    ACID transactions   | Reliable writes/reads
    Schema enforcement  | Data quality
    Time travel         | Data recovery & debugging
    Streaming support   | Real-time + batch pipelines
    Metadata logging    | Audit and versioning

    Memory Tuning — focused on Spark’s cluster-level memory: driver vs executor configs.

    Great — let’s walk through Spark memory tuning at the cluster level, with a focus on driver vs executor memory configs. This is a key area for improving performance, preventing OOM (Out of Memory) errors, and making the most of your cluster resources.


    🔧 Core Spark Memory Tuning Parameters

    🧠 Basics:

    • Driver: Coordinates the job, maintains metadata (RDD lineage, DAG), and collects small results.
    • Executors: Do the heavy lifting — they run the tasks and hold data in memory (e.g., for shuffle, cache).

    🔩 Key Cluster-Level Memory Settings

    Config Option                  | Applies To | Purpose
    spark.driver.memory            | Driver     | Memory allocated to the driver JVM.
    spark.executor.memory          | Executors  | Memory for each executor JVM.
    spark.executor.instances       | Cluster    | Number of executors to launch.
    spark.executor.cores           | Executors  | Number of cores per executor.
    spark.driver.memoryOverhead    | Driver     | Off-heap memory for native overhead.
    spark.executor.memoryOverhead  | Executors  | Off-heap memory (shuffle, JNI, etc.).

    🧮 Memory Breakdown Inside an Executor

    spark.executor.memory (e.g. 8G)
      |
      ├── Execution Memory (for shuffles, joins, aggregations)
      └── Storage Memory (for caching, broadcasts)
    
    spark.memory.fraction = 0.6 (default)
      └── ~60% of the executor heap is Spark's unified memory; the rest is reserved for user data structures and JVM overhead
    

    Example: If spark.executor.memory = 8g, then:

    • ~4.8 GB for Spark memory (60%)
    • Split between execution & storage

    💡 You can tweak spark.memory.fraction and spark.memory.storageFraction for finer control.


    🎯 Tuning Scenarios

    Scenario 1: Driver Out of Memory

    Symptoms:

    • Error like java.lang.OutOfMemoryError: Java heap space from driver
    • Happens often with large collect, toPandas(), or large broadcast joins

    Fixes:

    • Increase driver memory: --conf spark.driver.memory=4g --conf spark.driver.memoryOverhead=1g
    • Avoid large collect() operations

    Scenario 2: Executor OOM or Slow Tasks

    Symptoms:

    • Executors die with memory errors
    • GC overhead or spills to disk during shuffles/joins

    Fixes:

    • Increase executor memory: --conf spark.executor.memory=8g --conf spark.executor.memoryOverhead=2g
    • Balance executor.memory and executor.instances to use the cluster efficiently
    • Repartition data to avoid large shuffles
    • Avoid caching very large datasets unless needed

    Scenario 3: Poor Cluster Utilization

    Symptoms:

    • Some nodes idle, some overloaded
    • Wasted memory or under-parallelized tasks

    Fixes:

    • Adjust: --conf spark.executor.instances --conf spark.executor.cores
    • Use dynamic allocation (spark.dynamicAllocation.enabled=true) with spark.shuffle.service.enabled=true

    ✅ General Guidelines

    Cluster Size                      | Suggested Config
    Small (e.g. 16 cores, 64 GB RAM)  | 4 executors, 4 cores, 12g each
    Medium (e.g. 32 cores, 128 GB)    | 8 executors, 4 cores, 14g each
    Large                             | Tune based on workload type (ETL, ML, SQL)
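
    As an illustration, the small-cluster row above could be expressed when building the session (these values come from the table, not from any universal default; in practice they are often passed via spark-submit instead):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("tuned-etl")
        .config("spark.executor.instances", "4")
        .config("spark.executor.cores", "4")
        .config("spark.executor.memory", "12g")
        .config("spark.executor.memoryOverhead", "2g")
        .config("spark.driver.memory", "4g")
        .getOrCreate())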

    🔍 Tools for Monitoring

    • Spark UI → Executors Tab (look for memory usage, GC time)
    • Ganglia / Prometheus → Cluster metrics
    • Logs → Check stderr for OOM or GC-related errors

    🧠 Pro Tips

    • Don’t just increase memory — investigate why memory is used (wide transformations? skew? caching?)
    • Avoid very large executors (>16g heap) — GC becomes less efficient.
    • For large shuffles, increase spark.shuffle.file.buffer, spark.reducer.maxSizeInFlight

    Spark Architecture — detailed explanation of driver, executor, and cluster components

    Here’s a detailed breakdown of Apache Spark architecture, focusing on the driver, executors, and cluster components. Understanding these is essential for designing efficient Spark jobs and managing resources properly.


    🚂 1. Driver Program

    What It Is:

    The driver is the master process that runs your Spark application’s main function. It orchestrates the entire execution of the job.

    Responsibilities:

    • Converts your Spark code (RDDs, DataFrames, etc.) into a logical execution plan.
    • Builds a Directed Acyclic Graph (DAG) of stages.
    • Submits tasks to the cluster for execution.
    • Tracks task progress and retries failed tasks.
    • Maintains cluster resource coordination via the cluster manager.

    Lives In:

    • Your application’s main JVM process.
    • Runs on the client machine in client mode, or inside the cluster (e.g., in the YARN application master or on a worker node) in cluster mode.

    Important:

    • If the driver dies → the whole job fails.
    • Heavy actions like collect(), toPandas(), or large broadcasts can crash the driver.

    ⚙️ 2. Executors

    What They Are:

    Executor processes run on worker nodes and do the actual data processing.

    Responsibilities:

    • Execute the tasks assigned by the driver.
    • Store data in memory or disk (RDD/cache/broadcast variables).
    • Report task status back to the driver.

    Lives In:

    • Each executor is a separate JVM on a worker node.
    • One Spark application gets its own set of executors (not shared).

    Characteristics:

    • Have a fixed amount of CPU cores and memory.
    • Stay alive for the duration of the application unless explicitly terminated.

    🖥️ 3. Cluster Manager

    Purpose:

    Manages resources (CPU, memory, executors) across all applications running in the cluster.

    Types:

    • Standalone – Simple Spark-native cluster manager.
    • YARN – Hadoop’s resource manager.
    • Kubernetes – Container-native orchestration.
    • Mesos – Older general-purpose cluster manager.

    Role:

    • Starts and manages executors on worker nodes.
    • Handles scheduling and resource allocation.
    • Launches the driver (in cluster mode).

    🏗️ Spark Cluster Layout (Overview)

                        +-----------------------------+
                        |         Spark Driver         |
                        | (runs your Spark application)|
                        +-------------+---------------+
                                      |
                           Coordinates job execution
                                      ↓
               +-------------------------------------------+
               |                Cluster Manager             |
               |  (YARN, Kubernetes, Standalone, Mesos)     |
               +-----------------+-------------------------+
                                 ↓
                      Allocates Resources (Executors)
                                 ↓
             +------------------------+   +------------------------+
             |      Executor 1        |   |      Executor 2        |
             | - Runs Tasks           |   | - Runs Tasks           |
             | - Caches Data          |   | - Reports to Driver    |
             +------------------------+   +------------------------+
    
    

    🧠 Spark Application Lifecycle

    1. Submit Job: spark-submit sends the application to the cluster manager.
    2. Driver Starts:
      • Parses code, builds logical plan.
      • Breaks it into stages and tasks.
    3. Executors Launch:
      • Cluster manager allocates worker nodes.
      • Executors start on those nodes.
    4. Tasks Distributed:
      • Driver assigns tasks to executors.
      • Executors process data and report back.
    5. Shuffle/Data Exchange (if needed).
    6. Result Returned (or written to storage).
    7. Cleanup: Executors shut down when job completes.

    🔍 Key Concepts to Remember

    Component | Description
    Job       | Triggered by an action (e.g., collect, save)
    Stage     | A set of tasks that can run in parallel
    Task      | A unit of work (e.g., applying a function to a partition)
    DAG       | Execution graph showing dependencies

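    As a rough sketch of how these concepts map to code (the DataFrame, column names, and output path are placeholders):

    # Transformations only: Spark just extends the DAG, nothing runs yet
    agg_df = df.filter(df.status == "active").groupBy("category").count()

    agg_df.show()                        # action -> Job 1 (the groupBy shuffle typically splits it into 2 stages)
    agg_df.write.parquet("out/path/")    # action -> Job 2 (the plan is recomputed unless agg_df is cached)
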
    🧠 Tips for Working with Spark Architecture

    • Avoid too many small tasks → overhead increases.
    • Don’t overload driver with large collect() calls.
    • Use persist() or cache() wisely to save recomputation.
    • Monitor Spark UI for DAG visualization and executor stats.

    Sure! Here’s a short and effective breakdown of client vs cluster mode in Spark:


    🔹 Client Mode

    • Driver runs on your local machine (the one running spark-submit).
    • Executors run on the cluster.
    • Best for: development, testing, or small jobs.
    • Downside: If your machine disconnects or is weak, job fails or runs slow.

    🔹 Cluster Mode

    • Driver runs inside the cluster (on a worker node).
    • Fully managed by the cluster manager.
    • Best for: production jobs — more stable and scalable.
    • Works even if your submitting machine goes offline.

    | Mode | Driver Location | Use Case | Stability |
    |------|-----------------|----------|-----------|
    | Client | Local (your machine) | Dev/Test, small jobs | ❌ Depends on client |
    | Cluster | Inside cluster node | Production, large jobs | ✅ More reliable |

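    To make the difference concrete, the deploy mode is usually chosen at submit time (the master and file name below are placeholders):

    # Client mode: the driver runs on the machine that calls spark-submit
    spark-submit --master yarn --deploy-mode client my_job.py

    # Cluster mode: the driver runs inside the cluster, managed by the cluster manager
    spark-submit --master yarn --deploy-mode cluster my_job.py
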
    Projection Pruning and Predicate Pushdown — optimization techniques.

    Great topic! Here’s a clear and practical explanation of Projection Pruning and Predicate Pushdown, two key query optimization techniques in Spark (and other data processing engines like Presto, Hive, etc.).


    🎯 Goal of Both Techniques:

    Reduce the amount of data read and processed, which improves performance, lowers memory usage, and speeds up your job.


    🧾 1. Projection Pruning (a.k.a. Column Pruning)

    📌 What It Does:

    Only reads the columns required for your query — skips the rest.

    ✅ Example:

    # Only selecting 2 columns from a wide dataset
    df.select("id", "name").show()
    

    🧠 Why It Helps:

    • Avoids reading unused columns from disk.
    • Especially powerful with columnar formats like Parquet, ORC, Delta.

    🔧 Behind the Scenes:

    • Spark scans the schema and only loads id and name, skipping others like address, email, etc.
    • Works automatically if you use select() instead of df.*.

    🔍 2. Predicate Pushdown

    📌 What It Does:

    Pushes filter conditions down to the data source (e.g., Parquet, JDBC, Hive) so that only matching rows are read.

    ✅ Example:

    df.filter("age > 30").show()
    

    🧠 Why It Helps:

    • Reduces I/O: fewer rows scanned and loaded.
    • Speeds up performance, especially for large datasets.

    🔧 Works Best With:

    • Parquet, ORC, Delta Lake
    • Some JDBC sources
    • File formats and sources that support filter pushdown

    🚀 Combined Example:

    # Select only relevant columns and filter rows efficiently
    df.select("name", "age").filter("age > 30")
    

    This query benefits from:

    • Projection Pruning → reads only name and age columns.
    • Predicate Pushdown → filters age > 30 at the storage level, not in memory.

    🛠️ Tips to Maximize Effectiveness

    | Technique | Do This | Avoid This |
    |-----------|---------|------------|
    | Projection Pruning | Use .select("col1", "col2") | Don’t use df.* or select("*") |
    | Predicate Pushdown | Use .filter() or .where() early | Avoid filtering late in the pipeline |

    📈 How to Check If It’s Working

    • Use the Spark UI or explain() to see the physical plan.
    • Look for terms like PushedFilters, PushedDownFilters, or column pruning in the scan node.
    df.select("name").filter("age > 30").explain(True)
    

    HDFS Commands — hands-on usage and common command scenarios.

    Here’s a hands-on guide to HDFS (Hadoop Distributed File System) commands, focused on common real-world scenarios, use cases, and practical command-line examples.


    📁 Basic HDFS Commands

    1. Check if HDFS is running

    hdfs dfsadmin -report
    

    📂 File/Directory Management

    🔸 List Files

    hdfs dfs -ls /           # List root directory
    hdfs dfs -ls -R /data    # Recursively list /data
    

    🔸 Create Directories

    hdfs dfs -mkdir /user/data
    hdfs dfs -mkdir -p /user/data/logs  # -p makes parent dirs
    

    🔸 Upload Files to HDFS

    hdfs dfs -put localfile.txt /user/data/
    hdfs dfs -copyFromLocal localfile.txt /user/data/
    

    🔸 Download Files from HDFS

    hdfs dfs -get /user/data/file.txt ./localdir/
    hdfs dfs -copyToLocal /user/data/file.txt ./localdir/
    

    🔸 Move/Rename Files

    hdfs dfs -mv /user/data/file1.txt /user/archive/file1.txt
    

    🔸 Delete Files/Dirs

    hdfs dfs -rm /user/data/file.txt              # Delete file
    hdfs dfs -rm -r /user/data/old_logs           # Delete dir
    hdfs dfs -rm -skipTrash /user/data/file.txt   # Bypass trash
    

    📄 Viewing File Content

    🔸 View File Content

    hdfs dfs -cat /user/data/sample.txt
    

    🔸 Display Head/Tail

    hdfs dfs -tail /user/data/sample.txt   # Show last KB
    hdfs dfs -head /user/data/sample.txt   # First KB (if supported)
    

    🔸 Display File Stats

    hdfs dfs -stat /user/data/file.txt
    

    🔍 File Checks and Permissions

    🔸 Check Disk Usage

    hdfs dfs -du -h /user/data
    

    🔸 Set Permissions

    hdfs dfs -chmod 755 /user/data
    hdfs dfs -chown user:group /user/data
    

    🔸 Check File/Block Health (fsck)

    hdfs fsck /user/data -files -blocks -locations
    

    🧪 Useful Scenarios

    Verify if File Exists

    hdfs dfs -test -e /user/data/file.txt && echo "Exists"
    

    Count Files and Size

    hdfs dfs -count -q -h /user/data
    

    Copy Entire Folder

    hdfs dfs -cp /user/data/2024 /user/backup/
    

    🔧 Administrative Commands (Advanced)

    🔸 Report Cluster Health

    hdfs dfsadmin -report
    

    🔸 Safe Mode

    hdfs dfsadmin -safemode get
    hdfs dfsadmin -safemode enter
    hdfs dfsadmin -safemode leave
    

    🔸 Balancer

    hdfs balancer       # Rebalances data blocks
    

    📌 Summary: HDFS vs Linux Commands

    | Task | HDFS Command | Linux Equivalent |
    |------|--------------|------------------|
    | List files | hdfs dfs -ls /data | ls /data |
    | Upload to HDFS | hdfs dfs -put local.txt /data | cp local.txt /data |
    | View file contents | hdfs dfs -cat /data/file.txt | cat /data/file.txt |
    | Create directory | hdfs dfs -mkdir /data/logs | mkdir /data/logs |
    | Delete file | hdfs dfs -rm /data/file.txt | rm /data/file.txt |

    Great! Here’s a breakdown of what’s covered next:


    1. HDFS Command Cheat Sheet (Markdown format)

    # HDFS Command Cheat Sheet
    
    ## Basic File Operations
    | Task | Command |
    |------|---------|
    | List files | `hdfs dfs -ls /path/` |
    | Recursively list | `hdfs dfs -ls -R /path/` |
    | Create directory | `hdfs dfs -mkdir /path/` |
    | Create parent dirs | `hdfs dfs -mkdir -p /a/b/c` |
    | Upload file | `hdfs dfs -put localfile.txt /path/` |
    | Download file | `hdfs dfs -get /path/file.txt ./` |
    | Move/Rename | `hdfs dfs -mv /src /dest` |
    | Delete file | `hdfs dfs -rm /path/file.txt` |
    | Delete dir | `hdfs dfs -rm -r /path/dir/` |
    
    ## File Content
    | Task | Command |
    |------|---------|
    | View file | `hdfs dfs -cat /path/file.txt` |
    | Tail file | `hdfs dfs -tail /path/file.txt` |
    | Stat info | `hdfs dfs -stat /path/file.txt` |
    
    ## Disk Usage & Permissions
    | Task | Command |
    |------|---------|
    | Disk usage | `hdfs dfs -du -h /path/` |
    | File count | `hdfs dfs -count -q /path/` |
    | Change permissions | `hdfs dfs -chmod 755 /path/` |
    | Change owner | `hdfs dfs -chown user:group /path/` |
    
    ## Admin
    | Task | Command |
    |------|---------|
    | Cluster report | `hdfs dfsadmin -report` |
    | Safe mode check | `hdfs dfsadmin -safemode get` |
    | Enter safe mode | `hdfs dfsadmin -safemode enter` |
    | Leave safe mode | `hdfs dfsadmin -safemode leave` |
    | Balance data | `hdfs balancer` |
    



    🔄 2. Using HDFS with Spark or Hive

    Spark Example:

    # Read from HDFS
    df = spark.read.text("hdfs:///user/data/file.txt")
    
    # Save to HDFS in Parquet
    df.write.parquet("hdfs:///user/data/output/")
    
    # Use Delta if configured
    df.write.format("delta").save("hdfs:///delta/events")
    

    Hive Example:

    -- Create external table on HDFS
    CREATE EXTERNAL TABLE sales (
      id INT, product STRING, amount DOUBLE
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 'hdfs:///user/hive/warehouse/sales_data/';
    
    -- Query Hive table
    SELECT * FROM sales;
    

    🛠️ 3. Common Troubleshooting Tips

    | Error | Cause | Fix |
    |-------|-------|-----|
    | No such file or directory | Wrong path or permissions | hdfs dfs -ls /path/ to verify |
    | Permission denied | Lacking HDFS write/read rights | Use hdfs dfs -chmod or check ownership |
    | Safe mode exception | Cluster in safe mode | Exit with hdfs dfsadmin -safemode leave |
    | File already exists | Target file exists | Use -f flag or hdfs dfs -rm first |
    | Connection refused | Namenode down or config issue | Check Namenode status and config |

    Deployment — How do you deploy data pipelines in production?

    Deploying data pipelines in production involves much more than just running scripts — it’s about building reliable, scalable, and maintainable workflows that deliver data on time and correctly. Here’s a step-by-step guide to production-grade deployment.


    ✅ 1. Pipeline Design

    Before deployment, ensure your pipeline has:

    • Modular components (e.g., ingestion, transformation, load)
    • Parameterization (dates, paths, configs)
    • Error handling & logging
    • Support for retries and idempotency (safe to re-run; see the sketch below)

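    As a rough sketch of what parameterization and an idempotent write can look like in a PySpark job (the argument names, paths, and partition column are illustrative, not a fixed convention):

    import argparse
    from pyspark.sql import SparkSession

    parser = argparse.ArgumentParser()
    parser.add_argument("--run-date", required=True)     # parameterized so backfills can re-run any day
    parser.add_argument("--input-path", required=True)
    parser.add_argument("--output-path", required=True)
    args = parser.parse_args()

    spark = SparkSession.builder.appName("daily_etl").getOrCreate()

    df = spark.read.parquet(args.input_path).filter(f"event_date = '{args.run_date}'")

    # Idempotent write: re-running the same date overwrites only that partition
    (df.write
       .mode("overwrite")
       .option("partitionOverwriteMode", "dynamic")
       .partitionBy("event_date")
       .parquet(args.output_path))
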
    🚀 2. Code Packaging & Versioning

    • Use a version control system (e.g., Git)
    • Package code into deployable units:
      • Python scripts → .py modules or .whl
      • Spark jobs → .jar or .py via spark-submit
    • Manage dependencies via requirements.txt, poetry, or Docker

    🛠️ 3. Orchestration with Workflow Tools

    Use orchestration tools to manage scheduling, retries, and dependencies:

    | Tool | Type | Best For |
    |------|------|----------|
    | Airflow | Code-first DAGs | Flexible batch scheduling |
    | Dagster | Data-aware DAGs | Strong typing, dev-friendly |
    | Luigi | Python pipelines | Lightweight workflows |
    | Prefect | Python-native | Dynamic flows, cloud integration |
    | dbt Cloud | SQL-based | Analytics/ELT-focused pipelines |

    Example with Airflow:

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    
    with DAG("daily_sales_etl", schedule_interval="@daily") as dag:
        extract = BashOperator(task_id="extract", bash_command="python extract.py")
        transform = BashOperator(task_id="transform", bash_command="python transform.py")
        load = BashOperator(task_id="load", bash_command="python load.py")
        
        extract >> transform >> load
    

    📦 4. Environment Configuration

    • Separate environments: dev, staging, production
    • Use config files, env vars, or Secrets Managers (AWS Secrets Manager, Vault) to manage credentials & paths
    • Avoid hardcoding anything (see the sketch below)

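    For instance, environment-specific values can come from environment variables (or a secrets manager) rather than the code itself; a small sketch, with made-up variable names:

    import os

    ENV = os.environ.get("PIPELINE_ENV", "dev")       # dev / staging / prod
    INPUT_PATH = os.environ["SALES_INPUT_PATH"]       # injected by the scheduler or CI/CD
    DB_PASSWORD = os.environ["DB_PASSWORD"]           # better: fetch at runtime from a secrets manager

    OUTPUT_PATHS = {
        "dev": "s3://my-bucket/dev/output/",
        "prod": "s3://my-bucket/prod/output/",
    }
    output_path = OUTPUT_PATHS.get(ENV, OUTPUT_PATHS["dev"])
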
    🔄 5. Scheduling & Triggers

    • Use cron-like schedules (@daily, 0 3 * * *)
    • Support backfilling (for missed runs)
    • Trigger pipelines via:
      • APIs
      • File arrivals (S3, GCS, HDFS events)
      • Event queues (Kafka, Pub/Sub)

    🧪 6. Testing & Validation

    • Unit test functions (e.g., transformation logic; see the pytest sketch below)
    • Integration test with sample datasets
    • Validate output schema & data quality (e.g., with Great Expectations)

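    For example, transformation logic can be unit-tested with pytest and a local SparkSession; a minimal sketch, where filter_active and the column names are hypothetical:

    import pytest
    from pyspark.sql import SparkSession

    def filter_active(df):
        """Transformation under test: keep only active records."""
        return df.filter(df.status == "active")

    @pytest.fixture(scope="session")
    def spark():
        return SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()

    def test_filter_active(spark):
        df = spark.createDataFrame([(1, "active"), (2, "inactive")], ["id", "status"])
        result = filter_active(df).collect()
        assert [row.id for row in result] == [1]
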
    📊 7. Monitoring & Alerts

    • Use built-in tools (Airflow UI, Spark UI)
    • Send alerts via Slack, email, PagerDuty
    • Track:
      • Runtime
      • Failure counts
      • Data volumes
      • Schema changes

    🧰 8. Infrastructure & Deployment Tools

    • Containerize your pipeline with Docker
    • Deploy via CI/CD:
      • GitHub Actions, GitLab CI, Jenkins
    • Use infrastructure-as-code:
      • Terraform, Pulumi (to manage clusters, buckets, etc.)

    🧵 9. Execution Environments

    Depending on your stack:

    • Spark: Run on YARN, Kubernetes, EMR, Databricks
    • Python: Use Kubernetes, Docker, or serverless (AWS Lambda, Cloud Functions)
    • SQL: Trigger dbt, BigQuery jobs, or stored procedures

    ✅ 10. Post-Deployment Checks

    • Sanity-check the first few runs
    • Check log output and metrics
    • Ensure data landed correctly
    • Set up SLA checks

    Summary Cheat Sheet

    | Step | What to Do |
    |------|------------|
    | Code & Package | Git + Docker/Whl/Jar |
    | Orchestrate | Airflow, Dagster, dbt, Prefect |
    | Environment Separation | Dev / Stage / Prod with config mgmt |
    | Schedule & Trigger | Cron, Events, APIs |
    | Validate & Test | Great Expectations, pytest |
    | Monitor & Alert | Logs, metrics, Slack alerts |
    | Automate Deployments | CI/CD + IaC (Terraform) |

    Absolutely — here’s a specific example of deploying a PySpark job with Airflow on AWS EMR, which is a common, production-grade setup.


    🚀 Goal:

    Deploy and schedule a Spark job on Amazon EMR using Apache Airflow.


    🧱 Setup Components:

    | Component | Description |
    |-----------|-------------|
    | Airflow | Workflow scheduler (can run on EC2, ECS, or MWAA) |
    | AWS EMR | Managed Hadoop/Spark cluster |
    | S3 | Stores Spark scripts and input/output data |
    | Spark Script | Your PySpark job (process_data.py) |

    📁 Example Directory Structure:

    airflow/
    │
    ├── dags/
    │   └── emr_spark_job_dag.py      # Airflow DAG
    ├── spark_jobs/
    │   └── process_data.py           # Spark job
    ├── requirements.txt
    

    🧾 1. Sample PySpark Job (process_data.py)

    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.appName("ProcessDataJob").getOrCreate()
    
    df = spark.read.json("s3://my-bucket/input-data/")
    df_filtered = df.filter("age > 25")
    df_filtered.write.parquet("s3://my-bucket/output-data/", mode="overwrite")
    
    spark.stop()
    

    📜 2. Airflow DAG (emr_spark_job_dag.py)

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr_create_job_flow import EmrCreateJobFlowOperator
    from airflow.providers.amazon.aws.operators.emr_add_steps import EmrAddStepsOperator
    from airflow.providers.amazon.aws.operators.emr_terminate_job_flow import EmrTerminateJobFlowOperator
    from airflow.providers.amazon.aws.sensors.emr_step import EmrStepSensor
    from datetime import datetime
    
    JOB_FLOW_OVERRIDES = {
        "Name": "Airflow_EMR_Cluster",
        "ReleaseLabel": "emr-6.12.0",
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master nodes",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 1
                },
                {
                    "Name": "Core nodes",
                    "InstanceRole": "CORE",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 2
                }
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
            "TerminationProtected": False
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "LogUri": "s3://my-bucket/emr-logs/"
    }
    
    SPARK_STEPS = [
        {
            "Name": "Run PySpark Job",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://my-bucket/spark_jobs/process_data.py"
                ]
            }
        }
    ]
    
    with DAG(
        dag_id="emr_spark_job_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False
    ) as dag:
    
        create_cluster = EmrCreateJobFlowOperator(
            task_id="create_emr_cluster",
            job_flow_overrides=JOB_FLOW_OVERRIDES,
            aws_conn_id="aws_default"
        )
    
        add_step = EmrAddStepsOperator(
            task_id="add_emr_step",
            job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
            steps=SPARK_STEPS,
            aws_conn_id="aws_default"
        )
    
        monitor_step = EmrStepSensor(
            task_id="watch_step",
            job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
            step_id="{{ task_instance.xcom_pull(task_ids='add_emr_step', key='return_value')[0] }}",
            aws_conn_id="aws_default"
        )
    
        terminate_cluster = EmrTerminateJobFlowOperator(
            task_id="terminate_emr_cluster",
            job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
            aws_conn_id="aws_default",
            trigger_rule="all_done"  # terminate even if step fails
        )
    
        create_cluster >> add_step >> monitor_step >> terminate_cluster
    

    🧰 3. Deployment Steps

    1. Upload Spark job to S3: aws s3 cp spark_jobs/process_data.py s3://my-bucket/spark_jobs/
    2. Ensure IAM roles (EMR_DefaultRole, EMR_EC2_DefaultRole) have permissions to access S3 and run Spark.
    3. Deploy DAG to Airflow:
      • Place DAG file in airflow/dags/
      • Restart Airflow scheduler/webserver if needed
    4. Monitor Execution:
      • Airflow UI → Logs
      • EMR Console → Cluster and Step logs
      • S3 → Output directory

    ✅ Benefits of This Setup

    | Feature | Benefit |
    |---------|---------|
    | EMR | Scalable, managed Spark cluster |
    | Airflow | Declarative scheduling, retries, logging |
    | S3 | Durable storage for jobs and data |
    | Decoupled | You can scale EMR independently of Airflow |

    Perfect — let’s briefly walk through 3 alternative deployment approaches for your Spark job, using:

    1. EMR Serverless
    2. Databricks Jobs
    3. AWS Glue

    These are managed Spark execution platforms, each with different trade-offs in terms of cost, control, and complexity.


    ✅ 1. EMR Serverless + Airflow

    🔍 What It Is:

    A fully managed, serverless runtime for Spark — you don’t provision clusters. Just submit jobs, and EMR handles scaling.

    ✅ Ideal For:

    • On-demand ETL jobs
    • No cluster management
    • Usage-based billing

    📜 DAG Snippet (Using EmrServerlessStartJobOperator from the apache-airflow-providers-amazon package):

    from airflow.providers.amazon.aws.operators.emr import EmrServerlessStartJobOperator
    from airflow import DAG
    from datetime import datetime
    
    with DAG("emr_serverless_spark", start_date=datetime(2024, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    
        spark_job = EmrServerlessStartJobOperator(
            task_id="run_spark_job",
            application_id="your-emr-serverless-app-id",
            execution_role_arn="arn:aws:iam::123456789012:role/EMRServerlessExecutionRole",
            job_driver={
                "sparkSubmit": {
                    "entryPoint": "s3://my-bucket/spark_jobs/process_data.py",
                    "sparkSubmitParameters": "--conf spark.executor.memory=2G"
                }
            },
            configuration_overrides={
                "monitoringConfiguration": {
                    "s3MonitoringConfiguration": {
                        "logUri": "s3://my-bucket/emr-serverless-logs/"
                    }
                }
            },
            aws_conn_id="aws_default"
        )
    

    ✅ 2. Databricks Jobs + Airflow

    🔍 What It Is:

    Fully managed Spark platform optimized for big data and ML. Great IDE, collaboration, and performance tuning.

    ✅ Ideal For:

    • Teams needing UI + API
    • ML + SQL + Streaming workloads
    • Deep Spark integration

    📜 DAG Snippet (Using DatabricksSubmitRunOperator):

    from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
    from airflow import DAG
    from datetime import datetime
    
    with DAG("databricks_spark_job", start_date=datetime(2024, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    
        run_job = DatabricksSubmitRunOperator(
            task_id="run_databricks_spark",
            databricks_conn_id="databricks_default",
            new_cluster={
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2
            },
            notebook_task={
                "notebook_path": "/Shared/process_data"
            }
        )
    

    🔐 Needs a Databricks token and workspace configured in your Airflow connection.


    ✅ 3. AWS Glue Jobs + Airflow

    🔍 What It Is:

    A serverless ETL service from AWS that runs Spark under the hood (with PySpark or Scala support).

    ✅ Ideal For:

    • Catalog-based ETL (tied to AWS Glue Data Catalog)
    • Serverless, cost-efficient batch processing
    • Lightweight job logic

    📜 DAG Snippet (Using AwsGlueJobOperator):

    from airflow.providers.amazon.aws.operators.glue import AwsGlueJobOperator
    from airflow import DAG
    from datetime import datetime
    
    with DAG("glue_spark_job", start_date=datetime(2024, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    
        glue_job = AwsGlueJobOperator(
            task_id="run_glue_job",
            job_name="my_glue_spark_job",
            script_location="s3://my-bucket/glue-scripts/process_data.py",
            iam_role_name="GlueServiceRole",
            region_name="us-east-1",
            num_of_dpus=10,
        )
    

    ⚖️ Comparison Summary

    | Platform | Cluster Management | Best For | Cost Control | Complexity |
    |----------|--------------------|----------|--------------|------------|
    | EMR Serverless | ❌ No | Ad hoc Spark/ETL | Pay-per-second | Medium |
    | Databricks | ✅ Yes (managed) | Enterprise Spark + ML | Subscription + spot | Low |
    | AWS Glue | ❌ No | Serverless catalog-driven ETL | Pay-per-DPU-hour | Low |

    Here’s a real-world CI/CD deployment template using GitHub Actions to deploy Airflow DAGs and Spark jobs (e.g., to S3 for EMR Serverless, AWS Glue, or even Databricks).


    ✅ CI/CD Deployment Pipeline with GitHub Actions

    🎯 Goal:

    • Lint, test, and deploy Airflow DAGs
    • Upload PySpark scripts to S3
    • Optionally trigger a job run (EMR/Glue/Databricks)

    📁 Project Structure

    my-data-pipeline/
    ├── dags/
    │   └── my_airflow_dag.py
    ├── spark_jobs/
    │   └── process_data.py
    ├── tests/
    │   └── test_transformations.py
    ├── requirements.txt
    ├── .github/
    │   └── workflows/
    │       └── deploy.yml
    

    📜 .github/workflows/deploy.yml

    name: Deploy Data Pipeline
    
    on:
      push:
        branches: [main]
        paths:
          - 'dags/**'
          - 'spark_jobs/**'
          - '.github/workflows/**'
    
    env:
      AWS_REGION: us-east-1
      S3_BUCKET: my-bucket
      S3_SCRIPTS_PREFIX: spark_jobs/
    
    jobs:
      deploy:
        runs-on: ubuntu-latest
    
        steps:
        - name: Checkout Code
          uses: actions/checkout@v3
    
        - name: Set up Python
          uses: actions/setup-python@v4
          with:
            python-version: 3.9
    
        - name: Install Dependencies
          run: |
            python -m pip install --upgrade pip
            pip install -r requirements.txt
    
        - name: Run Unit Tests
          run: |
            pytest tests/
    
        - name: Upload PySpark Jobs to S3
          uses: jakejarvis/s3-sync-action@v0.5.1
          with:
            args: --acl private --follow-symlinks
          env:
            AWS_S3_BUCKET: ${{ env.S3_BUCKET }}
            AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
            AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
            AWS_REGION: ${{ env.AWS_REGION }}
            SOURCE_DIR: spark_jobs
            DEST_DIR: ${{ env.S3_SCRIPTS_PREFIX }}
    
        - name: Deploy DAGs to Airflow (via SSH)
          uses: appleboy/scp-action@v0.1.4
          with:
            host: ${{ secrets.AIRFLOW_HOST }}
            username: ${{ secrets.AIRFLOW_USER }}
            key: ${{ secrets.AIRFLOW_SSH_KEY }}
            source: "dags/"
            target: "/opt/airflow/dags/"
    
        # Optional: Trigger EMR Serverless or Glue Job
        - name: Trigger EMR Serverless Job (Optional)
          run: |
            aws emr-serverless start-job-run \
              --application-id YOUR-APP-ID \
              --execution-role-arn arn:aws:iam::123456789012:role/EMRServerlessExecutionRole \
              --job-driver '{
                "sparkSubmit": {
                  "entryPoint": "s3://${{ env.S3_BUCKET }}/${{ env.S3_SCRIPTS_PREFIX }}process_data.py"
                }
              }' \
              --region ${{ env.AWS_REGION }}
          env:
            AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
            AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    

    🔐 Secrets You’ll Need in GitHub

    Set these under GitHub → Repo Settings → Secrets and variables → Actions:

    | Secret Name | Description |
    |-------------|-------------|
    | AWS_ACCESS_KEY_ID | IAM access key with S3/EMR/Glue access |
    | AWS_SECRET_ACCESS_KEY | IAM secret key |
    | AIRFLOW_HOST | Host/IP of your Airflow instance |
    | AIRFLOW_USER | SSH username |
    | AIRFLOW_SSH_KEY | SSH private key for the Airflow server |

    ✅ Summary of Workflow Steps

    1. Checks out code
    2. Installs Python dependencies
    3. Runs unit tests
    4. Syncs Spark scripts to S3
    5. Deploys DAGs to Airflow via SSH
    6. (Optional): Triggers EMR Serverless job

    ⚙️ Need Variants?

    • Want to deploy to Databricks? Use databricks-cli with a GitHub Action like: databricks jobs run-now --job-id 12345 --profile DEFAULT
    • Deploy to AWS Glue? Use: aws glue start-job-run --job-name my_glue_job

    Decorators in Python — use cases and real-world application.


  • In PySpark, DataFrame transformations and operations can be efficiently handled using two main approaches:

    1️⃣ PySpark SQL API Programming (Temp Tables / Views)

    • Each transformation step can be written as a SQL query.
    • Intermediate results can be stored as temporary views (createOrReplaceTempView).
    • Queries can be executed using spark.sql(), avoiding direct DataFrame chaining.

    Example:

    df.createOrReplaceTempView("source_data")
    
    # Step 1: Filter Data
    filtered_df = spark.sql("""
        SELECT * FROM source_data WHERE status = 'active'
    """)
    filtered_df.createOrReplaceTempView("filtered_data")
    
    # Step 2: Aggregate Data
    aggregated_df = spark.sql("""
        SELECT category, COUNT(*) AS count
        FROM filtered_data
        GROUP BY category
    """)
    

    👉 Benefits:
    ✔️ Each transformation is saved as a temp table/view for easy debugging.
    ✔️ Queries become more readable and modular.
    ✔️ Avoids excessive DataFrame chaining, improving maintainability.


    2️⃣ Common Table Expressions (CTEs) for Multi-Step Queries

    • Instead of multiple temp tables, each transformation step can be wrapped in a CTE.
    • The entire logic is written in a single SQL query.

    Example using CTEs:

    query = """
    WITH filtered_data AS (
        SELECT * FROM source_data WHERE status = 'active'
    ),
    aggregated_data AS (
        SELECT category, COUNT(*) AS count
        FROM filtered_data
        GROUP BY category
    )
    SELECT * FROM aggregated_data
    """
    df_final = spark.sql(query)
    

    👉 Benefits:
    ✔️ Eliminates the need for multiple temp views.
    ✔️ Improves query organization by breaking steps into CTEs.
    ✔️ Executes everything as a single optimized SQL statement, so Catalyst can optimize the whole query at once.


    Which Approach is Better?

    1. Use SQL API with Temp Views when:
      • You need step-by-step debugging.
      • Your query logic is complex and needs intermediate storage.
      • You want to break down transformations into separate queries.
    2. Use CTEs when:
      • You want a single optimized query execution.
      • The logic is modular but doesn’t require intermediate views.
      • You aim for better performance by reducing redundant reads.

    Both approaches eliminate excessive DataFrame chaining and leverage PySpark’s SQL execution engine efficiently.

    # Best Practice Template for PySpark SQL API & CTE-based ETL
    
    from pyspark.sql import SparkSession
    
    # Initialize Spark Session
    spark = SparkSession.builder.appName("PySparkSQL_ETL").getOrCreate()
    
    # Sample Data (Creating a DataFrame)
    data = [(1, "A", "active", 100),
            (2, "B", "inactive", 200),
            (3, "A", "active", 150),
            (4, "C", "active", 120),
            (5, "B", "inactive", 300)]
    
    columns = ["id", "category", "status", "amount"]
    df = spark.createDataFrame(data, columns)
    
    # Approach 1: Using Temp Views for Step-by-Step ETL
    
    df.createOrReplaceTempView("source_data")
    
    # Step 1: Filter Active Records
    filtered_query = """
        SELECT * FROM source_data WHERE status = 'active'
    """
    filtered_df = spark.sql(filtered_query)
    filtered_df.createOrReplaceTempView("filtered_data")
    
    # Step 2: Aggregation
    aggregated_query = """
        SELECT category, SUM(amount) AS total_amount
        FROM filtered_data
        GROUP BY category
    """
    aggregated_df = spark.sql(aggregated_query)
    
    aggregated_df.show()
    
    # Approach 2: Using CTE for Optimized Query Execution
    cte_query = """
    WITH filtered_data AS (
        SELECT * FROM source_data WHERE status = 'active'
    ),
    aggregated_data AS (
        SELECT category, SUM(amount) AS total_amount
        FROM filtered_data
        GROUP BY category
    )
    SELECT * FROM aggregated_data
    """
    
    cte_df = spark.sql(cte_query)
    cte_df.show()
    
    # Additional Example: Using Multiple CTEs for Complex Transformations
    complex_query = """
    WITH filtered_data AS (
        SELECT * FROM source_data WHERE status = 'active'
    ),
    ranked_data AS (
        SELECT *, RANK() OVER (PARTITION BY category ORDER BY amount DESC) AS rank
        FROM filtered_data
    )
    SELECT * FROM ranked_data WHERE rank = 1
    """
    
    ranked_df = spark.sql(complex_query)
    ranked_df.show()
    
    # Closing Spark Session
    spark.stop()
    

    Optimization in PySpark SQL API programming (using spark.sql()) Vs optimization in PySpark DataFrame API programming

    Yes, the optimization in PySpark SQL API programming (using spark.sql()) differs from the optimization in PySpark DataFrame API programming (using .select(), .filter(), .groupBy(), etc.). Both approaches rely on Catalyst Optimizer and Tungsten Execution Engine, but they have some differences in how they optimize execution.


    🔹 1. PySpark SQL API Optimization (spark.sql())

    How it Works:

    • Queries written using SQL syntax (spark.sql("SELECT ... FROM table")) are parsed and optimized by Catalyst Optimizer.
    • Catalyst applies logical and physical optimizations, such as:
      • Predicate Pushdown – Filtering data early.
      • Constant Folding – Precomputing expressions.
      • Join Reordering – Choosing the best join strategy.
    • The optimized query plan is translated into DataFrame API operations before execution.

    Example (SQL API with Temp Views)

    df.createOrReplaceTempView("source_data")
    query = """
        SELECT category, SUM(amount) AS total_amount
        FROM source_data
        WHERE status = 'active'
        GROUP BY category
    """
    optimized_df = spark.sql(query)
    optimized_df.explain(True)  # Show optimized execution plan
    

    Benefits:

    • Queries are optimized before execution.
    • Easy to write & modular (especially for SQL-heavy workloads).
    • Better for analysts who are comfortable with SQL.

    🚫 Limitations:

    • May not be as flexible for complex operations like UDFs, iterative computations.
    • Harder debugging when issues occur in long SQL queries.

    🔹 2. PySpark DataFrame API Optimization (df.filter().groupBy())

    How it Works:

    • PySpark’s DataFrame API provides a lazy execution model.
    • Operations are chained together, and PySpark optimizes the execution only when an action (show(), collect(), etc.) is triggered.
    • Catalyst applies optimizations similar to SQL API:
      • Column Pruning – Only selects required columns.
      • Pushdown Filters – Applies filters at the data source level.
      • Rearranging Joins – Chooses broadcast joins when applicable.

    Example (DataFrame API Approach)

    optimized_df = df.filter(df.status == 'active') \
                     .groupBy("category") \
                     .agg({"amount": "sum"}) \
                     .alias("total_amount")
    
    optimized_df.explain(True)  # Show optimized execution plan
    

    Benefits:

    • More flexibility (easier to use UDFs, complex transformations).
    • Better debugging (each transformation step is separate).
    • Easier integration with ML & advanced functions.

    🚫 Limitations:

    • Slightly more verbose compared to SQL API.
    • Requires chaining multiple transformations.

    🔥 Key Differences Between SQL API & DataFrame API Optimization

    | Feature | PySpark SQL API (spark.sql()) | PySpark DataFrame API (df.select(), df.filter()) |
    |---------|-------------------------------|--------------------------------------------------|
    | Optimization Engine | Catalyst Optimizer + Tungsten | Catalyst Optimizer + Tungsten |
    | Execution Plan | SQL query is converted into a DataFrame plan before execution | DataFrame transformations are optimized lazily before execution |
    | Readability | Easier for SQL users | More Pythonic & readable for programmers |
    | Performance | Good for batch queries (pre-optimized execution) | Good for iterative, complex logic |
    | Debugging | Harder to debug long SQL queries | Easier debugging step by step |

    🔹 Which One Should You Use?

    1️⃣ Use SQL API (spark.sql()) when:

    • You are working with SQL-heavy transformations.
    • You need modular queries with temp views.
    • You want batch processing & pre-optimized queries.

    2️⃣ Use DataFrame API (df.filter(), df.groupBy()) when:

    • You need more flexibility (e.g., UDFs, machine learning, complex logic).
    • You want to debug transformations easily.
    • You are working in an iterative pipeline (e.g., dynamic processing).

    🚀 Best Practice: Combine Both

    For optimized ETL workflows, you can mix both approaches:

    • Preprocess with DataFrame API (better control over steps).
    • Use SQL API for heavy aggregations (better optimization).

    Example Hybrid Approach:

    # Step 1: DataFrame API - Initial Filtering
    filtered_df = df.filter(df.status == 'active')
    
    # Step 2: Register Temp View & Use SQL API for Aggregation
    filtered_df.createOrReplaceTempView("filtered_data")
    query = "SELECT category, SUM(amount) AS total_amount FROM filtered_data GROUP BY category"
    final_df = spark.sql(query)
    

    Optimizes both transformations & execution performance.


    Both PySpark SQL API and DataFrame API are optimized by Catalyst, but their execution models differ:

    • SQL API optimizes before execution (good for queries & batch processing).
    • DataFrame API optimizes lazily during execution (good for step-by-step debugging).

    Let’s compare performance using explain(True) on a sample dataset for both PySpark SQL API and PySpark DataFrame API.


    🔹 Step 1: Create a Sample DataFrame

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    
    # Initialize Spark Session
    spark = SparkSession.builder.appName("PySpark_Optimization").getOrCreate()
    
    # Sample Data
    data = [(1, "A", "active", 100),
            (2, "B", "inactive", 200),
            (3, "A", "active", 150),
            (4, "C", "active", 120),
            (5, "B", "inactive", 300),
            (6, "A", "active", 180)]
    
    columns = ["id", "category", "status", "amount"]
    df = spark.createDataFrame(data, columns)
    

    🔹 Step 2: SQL API Optimization (spark.sql())

    # Register temp view
    df.createOrReplaceTempView("source_data")
    
    # SQL Query with Filtering and Aggregation
    query = """
        SELECT category, SUM(amount) AS total_amount
        FROM source_data
        WHERE status = 'active'
        GROUP BY category
    """
    
    # Execute SQL Query
    sql_df = spark.sql(query)
    
    # Explain Execution Plan
    sql_df.explain(True)
    

    🔹 SQL API Execution Plan Output (Sample)

    == Optimized Logical Plan ==
    Aggregate [category#32], [category#32, sum(amount#34) AS total_amount#50]
    +- Filter (isnotnull(status#33) && (status#33 = active))
       +- Relation[id#31,category#32,status#33,amount#34] parquet
    

    Optimizations Applied:

    • Filter Pushdown (WHERE status = 'active' applied before aggregation)
    • Column Pruning (Only category and amount are selected)
    • Aggregate Optimization (SUM(amount) grouped efficiently)

    🔹 Step 3: DataFrame API Optimization (df.filter().groupBy())

    # DataFrame API Approach
    df_filtered = df.filter(col("status") == "active") \
                    .groupBy("category") \
                    .sum("amount") \
                    .withColumnRenamed("sum(amount)", "total_amount")
    
    # Explain Execution Plan
    df_filtered.explain(True)
    

    🔹 DataFrame API Execution Plan Output (Sample)

    == Physical Plan ==
    *(2) HashAggregate(keys=[category#32], functions=[sum(amount#34)])
    +- *(2) HashAggregate(keys=[category#32], functions=[partial_sum(amount#34)])
       +- *(1) Project [category#32, amount#34]
          +- *(1) Filter (isnotnull(status#33) && (status#33 = active))
             +- Scan parquet [id#31,category#32,status#33,amount#34]
    

    Optimizations Applied:

    • Filter Pushdown
    • Column Pruning
    • Partial Aggregation for Efficiency (partial_sum())

    🔹 Step 4: Performance Comparison

    | Feature | SQL API (spark.sql()) | DataFrame API (df.filter()) |
    |---------|------------------------|-----------------------------|
    | Execution Plan Type | Logical SQL to DataFrame plan | Direct logical DataFrame plan |
    | Optimization Applied | Predicate Pushdown, Column Pruning, Aggregation Optimization | Same optimizations, but in step-wise execution |
    | Performance | Optimized before execution (batch processing) | Optimized lazily (step-by-step execution) |
    | Use Case | Best for complex SQL transformations & analytics | Best for incremental processing & iterative transformations |

    🔹 Key Takeaways

    1️⃣ Both SQL API & DataFrame API get optimized using Catalyst Optimizer.
    2️⃣ Execution Plans are similar (both use Filter Pushdown, Column Pruning, Aggregation).
    3️⃣ SQL API pre-optimizes everything before execution, while DataFrame API optimizes lazily.
    4️⃣ SQL API is best for batch processing, while DataFrame API is better for debugging & step-by-step transformations.

    Both PySpark SQL API and DataFrame API use Catalyst Optimizer, and in the end, SQL queries are converted into DataFrame operations before execution. However, the key difference lies in how and when optimization happens in each approach.


    🔍 1. SQL API Optimization (Pre-Optimized Before Execution)

    What Happens?

    • When you write spark.sql("SELECT ... FROM table"), PySpark immediately parses the query.
    • Catalyst Optimizer applies logical optimizations (e.g., filter pushdown, constant folding).
    • The optimized query plan is created before execution.
    • Then, it is translated into DataFrame operations, and lazy execution kicks in.

    Example: SQL API Execution Flow

    df.createOrReplaceTempView("source_data")
    
    query = """
        SELECT category, SUM(amount) AS total_amount
        FROM source_data
        WHERE status = 'active'
        GROUP BY category
    """
    final_df = spark.sql(query)
    final_df.show()
    

    👉 Steps in SQL API Execution:

    1. Parsing: SQL query is parsed into an unoptimized logical plan.
    2. Optimization: Catalyst applies logical optimizations before execution.
    3. Conversion: Optimized SQL is converted into a DataFrame execution plan.
    4. Execution: Only when .show() (or another action) is called, execution happens.

    Key Insight:

    • Optimization happens before DataFrame API conversion, so SQL API sends a pre-optimized plan to execution.
    • The optimizer has a full view of the query upfront, making multi-step optimizations easier.

    🔍 2. DataFrame API Optimization (Optimized Lazily During Execution)

    What Happens?

    • When you chain DataFrame transformations (.select(), .filter(), etc.), each transformation adds to the logical execution plan.
    • No execution happens until an action (.show(), .collect()) is triggered.
    • Catalyst Optimizer optimizes the entire execution plan at the last moment before execution.

    Example: DataFrame API Execution Flow

    filtered_df = df.filter(df.status == "active")
    aggregated_df = filtered_df.groupBy("category").sum("amount")
    aggregated_df.show()
    

    👉 Steps in DataFrame API Execution:

    1. Transformation Building: Each .filter(), .groupBy() adds a step to the logical execution plan.
    2. Lazy Optimization: No optimization happens yet.
    3. Triggering Execution: When .show() is called, the entire plan is optimized just before execution.
    4. Execution: Spark runs the optimized execution plan.

    Key Insight:

    • Optimization happens at the last step before execution.
    • Spark does not have full query context until execution is triggered, which may limit certain optimizations.

    🔥 Core Differences Between SQL API & DataFrame API Optimization

    | Feature | SQL API (spark.sql()) | DataFrame API (df.select(), df.filter()) |
    |---------|------------------------|------------------------------------------|
    | When Optimization Happens | Before execution (on query parsing) | Just before execution (lazy) |
    | Execution Plan Generation | Optimized upfront before DataFrame conversion | Built step-by-step, optimized at the end |
    | Handling of Complex Queries | Full query view allows better optimizations | Step-wise transformations may limit some optimizations |
    | Best Use Case | Multi-step SQL queries, joins, batch processing | Iterative processing, ML pipelines, debugging |

    🚀 When to Use Which?

    Use SQL API When:

    • You have multi-step transformations that need global query optimization.
    • You prefer writing complex logic in SQL (better readability for SQL-heavy workloads).
    • You want Catalyst to optimize the entire query upfront before execution.

    Use DataFrame API When:

    • You need iterative, flexible transformations (easier debugging).
    • Your workflow involves dynamic logic (e.g., using variables and conditions).
    • You work with ML pipelines, UDFs, or Python-specific transformations.

    🔬 Example: SQL API vs DataFrame API Optimization Difference

    Scenario: Filtering, grouping, and aggregation on a dataset.

    🔹 SQL API Approach (Optimized Before Execution)

    df.createOrReplaceTempView("source_data")
    
    query = """
        SELECT category, SUM(amount) AS total_amount
        FROM source_data
        WHERE status = 'active'
        GROUP BY category
    """
    final_df = spark.sql(query)
    final_df.explain(True)  # Show execution plan
    

    Optimization Advantage:

    • Predicate Pushdown: WHERE status = 'active' is applied before aggregation.
    • Better Join Order (if applicable): Joins get pre-optimized.

    🔹 DataFrame API Approach (Optimized Lazily)

    filtered_df = df.filter(df.status == "active")
    aggregated_df = filtered_df.groupBy("category").agg({"amount": "sum"})
    aggregated_df.explain(True)  # Show execution plan
    

    Optimization Happens Later:

    • Transformations are built step-by-step.
    • Catalyst does not optimize until .show() is called.

    🧐 Key Takeaway

    Both SQL API and DataFrame API are optimized by Catalyst, but:

    • SQL API pre-optimizes the entire query before converting it into DataFrame operations.
    • DataFrame API builds the execution plan step-by-step and optimizes only when an action is triggered.

    🔥 Catalyst Optimizer & Tungsten Execution Engine in PySpark

    PySpark uses two main components for optimization and execution:

    1. Catalyst Optimizer → Responsible for query optimization (logical & physical plans).
    2. Tungsten Execution Engine → Handles low-level execution optimizations (CPU, memory, code generation).

    Let’s break down both optimizers in detail.


    🏗 1. Catalyst Optimizer (Logical & Physical Query Optimization)

    What it Does

    Catalyst Optimizer is a rule-based and cost-based optimizer in Spark that optimizes queries before execution. It transforms SQL queries and DataFrame operations into the most efficient execution plan.

    Catalyst Workflow (4 Steps)

    When you run a DataFrame operation or an SQL query, Catalyst goes through 4 phases:

    1️⃣ Parse SQL Query / Convert DataFrame to Logical Plan

    • If using SQL: The SQL string is parsed into an Unresolved Logical Plan.
    • If using DataFrame API: Spark directly creates an Unresolved Logical Plan.

    2️⃣ Analyze: Resolve Column Names & Types

    • Checks whether tables, columns, and functions exist.
    • Resolves column data types from the schema.

    3️⃣ Optimize: Apply Logical Optimizations (Rule-Based Optimizations)

    • Pushdown Filters: Move WHERE conditions close to data source.
    • Constant Folding: Precompute constant expressions (e.g., 2 + 3 becomes 5).
    • Column Pruning: Remove unused columns.
    • Predicate Simplification: Convert complex conditions into simpler ones.
    • Join Reordering: Choose the best join order.

    4️⃣ Generate Physical Plan (Execution Plan Selection)

    • Decides execution strategy (e.g., SortMergeJoin vs BroadcastJoin).
    • Generates RDD transformations (Resilient Distributed Datasets).
    • This optimized plan is sent to the Tungsten execution engine.

    Example: Catalyst Optimization in Action

    🔹 SQL Query

    df.createOrReplaceTempView("transactions")
    query = "SELECT category, SUM(amount) FROM transactions WHERE status = 'active' GROUP BY category"
    optimized_df = spark.sql(query)
    optimized_df.explain(True)  # Shows Catalyst Optimized Execution Plan
    

    🔹 DataFrame API

    optimized_df = df.filter(df.status == "active").groupBy("category").agg({"amount": "sum"})
    optimized_df.explain(True)  # Shows Catalyst Optimized Execution Plan
    

    🛠 What Catalyst Does Here

    • Moves WHERE status = 'active' before aggregation (Predicate Pushdown).
    • Keeps only category and amount columns (Column Pruning).
    • Selects the best join strategy if multiple tables are involved (Join Reordering).

    2. Tungsten Execution Engine (Physical Execution Optimizations)

    What it Does

    Once Catalyst generates the optimized execution plan, Tungsten takes over and optimizes at a lower level (CPU & memory management).

    Tungsten Optimizations (3 Key Areas)

    1️⃣ Memory Management & Binary Processing

    • Uses off-heap memory (bypasses JVM Garbage Collection).
    • Avoids unnecessary memory allocations.

    2️⃣ Code Generation (Whole-Stage Codegen)

    • Generates low-level Java bytecode for faster execution.
    • Converts high-level Spark plans into optimized machine code.

    3️⃣ Efficient Data Structures & Algorithms

    • Uses compressed columnar storage for faster processing.
    • Vectorized Execution: Processes multiple rows in parallel (SIMD operations).

    Example: How Tungsten Optimizes Execution

    df.groupBy("category").sum("amount").explain(True)
    

    🛠 What Tungsten Does Here

    • Uses Whole-Stage Code Generation: Converts .groupBy().sum() into optimized Java bytecode.
    • Uses Columnar Memory Layout: Reduces memory overhead.
    • Applies SIMD & Cache-Aware Execution: Processes batches of rows instead of row-by-row.

    🧐 Catalyst vs Tungsten: How They Work Together

    | Feature | Catalyst Optimizer | Tungsten Execution Engine |
    |---------|--------------------|---------------------------|
    | Optimization Level | Query-level optimization (logical & physical) | Execution-level optimization (CPU & memory) |
    | Key Responsibilities | Optimizing query plans (filter pushdown, column pruning, join reordering) | Managing memory, generating optimized bytecode, using efficient data structures |
    | When It Runs? | Before execution (query transformation) | During execution (hardware-level optimizations) |
    | Goal | Minimize data movement & optimize transformations | Maximize execution speed & minimize CPU/memory overhead |

    🚀 Final Takeaways

    • Catalyst Optimizer improves query plans before execution.
    • Tungsten Execution Engine boosts runtime performance using efficient memory management & code generation.
    • Together, they make PySpark fast & scalable 🚀.

    🚀 PySpark Optimizations, Configurations & DAG Explained

    Now that you understand Catalyst Optimizer and Tungsten Execution Engine, let’s explore other key optimizations and configurations to improve PySpark execution. We’ll also dive into DAG (Directed Acyclic Graph) and how Spark uses it for execution.


    🔥 1. Optimization Methods & Configurations in PySpark

    PySpark optimizations can be categorized into 4 main areas:

    | Category | Optimization Technique |
    |----------|------------------------|
    | Query Optimization | Predicate Pushdown, Column Pruning, Join Optimization |
    | Execution Optimization | Caching, Broadcast Joins, Data Skew Handling |
    | Shuffle Optimization | Repartitioning, Coalesce, ReduceByKey |
    | Memory & Performance | Serialization, GC Tuning, Adaptive Query Execution (AQE) |

    🏗 A. Query Optimization Techniques

    These optimizations reduce data movement and minimize processing time.

    1. Predicate Pushdown (Filter Early)

    👉 Move filters as close as possible to data source

    • Catalyst automatically does this, but explicit .filter() improves readability.
    # ✅ Best practice: Filter first, then process
    df_filtered = df.filter(df.status == "active").select("category", "amount")
    

    2. Column Pruning (Select Only Required Columns)

    👉 Avoid selecting unnecessary columns to reduce data transfer.

    # ✅ Best practice: Select only required columns
    df = df.select("category", "amount")
    

    3. Join Optimization (Broadcast Joins for Small Tables)

    👉 Use Broadcast Join when one table is small (≤ 10MB).

    from pyspark.sql.functions import broadcast
    df_large = spark.read.parquet("large_table.parquet")
    df_small = spark.read.parquet("small_lookup.parquet")
    
    # ✅ Best practice: Broadcast small table to avoid shuffle
    df_result = df_large.join(broadcast(df_small), "common_key")
    

    🔹 Why? Normal joins trigger shuffles, but Broadcast Joins send small tables to all nodes.


    B. Execution Optimizations

    These optimizations improve PySpark job execution by reducing processing overhead.

    4. Caching & Persisting

    👉 Cache DataFrames that are reused multiple times

    df.cache()  # Keeps data in memory for faster access
    df.show()
    

    🔹 Use .persist(StorageLevel.MEMORY_AND_DISK) if data is too large to fit in memory.

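    A minimal sketch of the persist variant mentioned above:

    from pyspark import StorageLevel

    df.persist(StorageLevel.MEMORY_AND_DISK)   # spills partitions to disk when memory is tight
    df.count()                                 # first action materializes the cache
    df.unpersist()                             # release it when no longer needed
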
    5. Handling Data Skew (Salting Technique)

    👉 If one key has too much data, Spark creates an imbalance

    • Add a salt column to distribute data evenly across partitions.
    from pyspark.sql.functions import monotonically_increasing_id
    
    df = df.withColumn("salt", (monotonically_increasing_id() % 5))
    df_skew_fixed = df.repartition("common_key", "salt")  # Distributes load
    

    🔹 Why? This prevents one partition from handling most of the data, avoiding slow execution.


    🔄 C. Shuffle Optimizations (Partitioning & Repartitioning)

    👉 Shuffling is expensive (network & disk I/O). Optimize with smart partitioning.

    6. Repartition vs Coalesce

    | Operation | Use Case |
    |-----------|----------|
    | repartition(n) | Increases partitions (full shuffle, used for balancing) |
    | coalesce(n) | Reduces partitions (avoids full shuffle, best for minimizing data movement) |

    df_repartitioned = df.repartition(10)  # Full shuffle
    df_coalesced = df.coalesce(2)          # Merges partitions (faster for output writing)
    

    7. Use reduceByKey() Instead of groupByKey() (RDD API)

    👉 reduceByKey() combines values on each partition before shuffling, so less intermediate data crosses the network. (DataFrame groupBy().agg() is already optimized with partial aggregation, so this distinction matters mainly when you work at the RDD level.)

    # DataFrame aggregation (Catalyst already applies partial aggregation here)
    df.groupBy("category").agg({"amount": "sum"})

    # ✅ At the RDD level, prefer reduceByKey() over groupByKey() for aggregations
    rdd = df.rdd.map(lambda x: (x["category"], x["amount"]))
    rdd_reduced = rdd.reduceByKey(lambda a, b: a + b)


    🔹 Why? reduceByKey() combines values locally before shuffling, reducing data transfer.


    🛠 D. Memory & Performance Optimizations

    8. Kryo Serialization (Faster Object Serialization)

    👉 Use Kryo instead of Java serialization for better performance.

    spark = SparkSession.builder.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").getOrCreate()
    

    🔹 Why? Kryo is 2-3x faster than default Java serialization.

    9. Adaptive Query Execution (AQE)

    👉 Dynamically optimizes joins & partitions at runtime (introduced in Spark 3.0+).

    spark.conf.set("spark.sql.adaptive.enabled", True)
    

    🔹 Why? Adjusts join strategies & shuffle partitions dynamically based on data size.

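    A few related AQE settings that are often enabled together (a sketch; availability and defaults depend on your Spark 3.x version):

    spark.conf.set("spark.sql.adaptive.enabled", True)
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", True)   # merge small shuffle partitions
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", True)             # split skewed join partitions
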

    🔄 2. What is DAG (Directed Acyclic Graph) in PySpark?

    DAG = Execution Plan for a PySpark Job

    When you execute a PySpark job, Spark creates a DAG (Directed Acyclic Graph) that represents all transformations & actions.

    How DAG Works

    1️⃣ Transforms DataFrames into a DAG

    • Each .select(), .filter(), .join() adds a node to DAG.
    • No execution happens yet (lazy evaluation).

    2️⃣ Splits DAG into Stages

    • Stages = Boundaries of Shuffles.
    • If a shuffle (e.g., .groupBy(), .join()) is needed, Spark creates a new stage.

    3️⃣ Converts DAG to RDDs & Executes in Stages

    • RDD (Resilient Distributed Dataset) execution happens in parallel across worker nodes.
    • Tasks are scheduled & executed stage-by-stage.

    🖥 Example: DAG in PySpark

    df_filtered = df.filter(df.status == "active")
    df_grouped = df_filtered.groupBy("category").sum("amount")
    df_grouped.show()
    

    👉 DAG Breakdown:

    1. Step 1: df.filter(df.status == "active")
      • Transformation (no execution yet).
    2. Step 2: groupBy().sum()
      • New stage is created (shuffle required).
    3. Step 3: .show()
      • Triggers DAG execution (final stage).

    🛠 Visualizing DAG

    df_grouped.explain(mode="formatted")
    

    🚀 Final Takeaways

    | Optimization Area | Best Practices |
    |-------------------|----------------|
    | Query Optimization | Pushdown filters, column pruning, broadcast joins |
    | Execution Optimization | Caching, handling skew, adaptive query execution (AQE) |
    | Shuffle Optimization | Reduce partitions with coalesce, prefer reduceByKey over groupByKey |
    | Memory & Performance | Kryo serialization, vectorized execution |

    Understanding DAG

    • DAG represents the execution flow of a PySpark job.
    • Splits execution into stages (shuffle boundaries).
    • Ensures parallel execution & fault tolerance.

    
    🚀 PySpark Performance Comparison Script

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast
    
    # Initialize Spark Session
    # Enable AQE (Adaptive Query Execution) and use Kryo serialization
    spark = SparkSession.builder \
        .appName("PySpark Optimization Comparison") \
        .config("spark.sql.adaptive.enabled", True) \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .getOrCreate()
    
    # Load Sample Data
    df_large = spark.read.parquet("large_dataset.parquet")
    df_small = spark.read.parquet("small_lookup.parquet")
    
    # --------------------------------------------------
    # 🏆 1. Filter Pushdown Optimization
    # --------------------------------------------------
    
    # ❌ Without Filter Pushdown (Inefficient)
    df_no_pushdown = df_large.select("*").filter(df_large.status == "active")
    df_no_pushdown.explain(mode="formatted")
    
    # ✅ With Filter Pushdown (Optimized)
    df_pushdown = df_large.filter("status = 'active'")
    df_pushdown.explain(mode="formatted")
    
    # --------------------------------------------------
    # 🏆 2. Column Pruning Optimization
    # --------------------------------------------------
    
    # ❌ Selecting All Columns (Inefficient)
    df_all_columns = df_large.select("*")
    df_all_columns.explain(mode="formatted")
    
    # ✅ Selecting Only Required Columns (Optimized)
    df_pruned_columns = df_large.select("id", "category", "amount")
    df_pruned_columns.explain(mode="formatted")
    
    # --------------------------------------------------
    # 🏆 3. Broadcast Join Optimization
    # --------------------------------------------------
    
    # ❌ Normal Join (Causes Expensive Shuffle)
    df_normal_join = df_large.join(df_small, "common_key")
    df_normal_join.explain(mode="formatted")
    
    # ✅ Broadcast Join (Optimized for Small Tables)
    df_broadcast_join = df_large.join(broadcast(df_small), "common_key")
    df_broadcast_join.explain(mode="formatted")
    
    # --------------------------------------------------
    # 🏆 4. Repartition vs Coalesce Optimization
    # --------------------------------------------------
    
    # ❌ Using Repartition (Forces Full Shuffle)
    df_repartitioned = df_large.repartition(10)
    df_repartitioned.explain(mode="formatted")
    
    # ✅ Using Coalesce (Minimizes Data Movement)
    df_coalesced = df_large.coalesce(2)
    df_coalesced.explain(mode="formatted")
    
    # --------------------------------------------------
    # 🏆 5. groupBy() vs reduceByKey() Optimization
    # --------------------------------------------------
    
    # ❌ Using groupBy() (Expensive Shuffle)
    df_grouped = df_large.groupBy("category").agg({"amount": "sum"})
    df_grouped.explain(mode="formatted")
    
    # ✅ Using reduceByKey() (Optimized for Aggregations)
    rdd = df_large.rdd.map(lambda x: (x["category"], x["amount"]))
    rdd_reduced = rdd.reduceByKey(lambda a, b: a + b)
    rdd_reduced_df = rdd_reduced.toDF(["category", "total_amount"])
    rdd_reduced_df.explain(mode="formatted")
    
    # --------------------------------------------------
    # 🏆 6. Adaptive Query Execution (AQE) Impact
    # --------------------------------------------------
    
    # ✅ Enabling AQE allows Spark to dynamically optimize joins & partitions
    spark.conf.set("spark.sql.adaptive.enabled", True)
    df_aqe_test = df_large.join(df_small, "common_key")
    df_aqe_test.explain(mode="formatted")
    ```
    
    You can generate a real DAG (Directed Acyclic Graph) visualization using Spark UI. Here’s how you can do it step by step:
    
    🚀 Steps to Generate a DAG in Spark UI
    1️⃣ Start Your PySpark Session with Spark UI Enabled
    Run the following in your PySpark environment (local or cluster):
    
    
    from pyspark.sql import SparkSession
    
    # Start Spark session with UI enabled
    spark = (
        SparkSession.builder
        .appName("DAG_Visualization")
        .config("spark.ui.port", "4040")  # Spark UI runs on port 4040 by default
        .getOrCreate()
    )
    🔹 By default, Spark UI runs on localhost:4040.
    🔹 Open http://localhost:4040 in your browser to view DAGs.
    
    2️⃣ Run a Spark Job to Generate a DAG
    Now, execute a simple transformation to create a DAG visualization:
    
    df_large = spark.read.parquet("large_dataset.parquet")
    df_small = spark.read.parquet("small_lookup.parquet")
    
    # Perform transformations
    df_filtered = df_large.filter("status = 'active'")
    df_joined = df_filtered.join(df_small, "common_key")
    df_result = df_joined.groupBy("category").agg({"amount": "sum"})
    
    # Trigger an action (forces DAG execution)
    df_result.show()
    🔹 The DAG (Directed Acyclic Graph) will appear in Spark UI under the "Jobs" tab.
    
    3️⃣ View DAG in Spark UI
    Open http://localhost:4040 in your browser.
    Navigate to the "Jobs" section.
    Click on your job to see the DAG Visualization.
    You can also check Stages → Executors → SQL → Storage tabs to analyze execution details.
    4️⃣ Save DAG as an Image (Optional)
    If you want to export the DAG, you can take a screenshot, or use:
    
    
    wget -O dag.svg http://localhost:4040/stages/stage/0/dagViz.svg
    This saves the stage’s DAG visualization as an SVG image (the exact URL can vary by Spark version; the DAG is rendered as an SVG inside the stage page).


  • I believe you have read our post https://www.hintstoday.com/i-did-python-coding-or-i-wrote-a-python-script-and-got-it-exected-so-what-it-means/. Before starting here, kindly go through that link first.

    How the Python interpreter reads and processes a Python script

    The Python interpreter processes a script through several stages, each of which involves different components of the interpreter working together to execute the code. Here’s a detailed look at how the Python interpreter reads and processes a Python script, including the handling of variables, constants, operators, and keywords:

    Stages of Python Code Execution

    1. Lexical Analysis (Tokenization)
      • Scanner (Lexer): The first stage in the compilation process is lexical analysis, where the lexer scans the source code and converts it into a stream of tokens. Tokens are the smallest units of meaning in the code, such as keywords, identifiers (variable names), operators, literals (constants), and punctuation (e.g., parentheses, commas).
      • Example: x = 10 + 20. This line would be tokenized into:
        • x: Identifier
        • =: Operator
        • 10: Integer Literal
        • +: Operator
        • 20: Integer Literal
    2. Syntax Analysis (Parsing)
      • Parser: The parser takes the stream of tokens produced by the lexer and arranges them into a syntax tree (or Abstract Syntax Tree, AST). The syntax tree represents the grammatical structure of the code according to Python’s syntax rules.
      • Example AST for x = 10 + 20:
        • Assignment Node
          • Left: Identifier x
          • Right: Binary Operation Node
            • Left: Integer Literal 10
            • Operator: +
            • Right: Integer Literal 20
    3. Semantic Analysis
      • During this stage, the interpreter checks the syntax tree for semantic correctness. This includes ensuring that operations are performed on compatible types, variables are declared before use, and functions are called with the correct number of arguments.
      • Example: Ensuring 10 + 20 is valid because both operands are integers.
    4. Intermediate Representation (IR)
      • The AST is converted into an intermediate representation, often bytecode. Bytecode is a lower-level, platform-independent representation of the source code.
      • Example Bytecode for x = 10 + 20: LOAD_CONST 10 LOAD_CONST 20 BINARY_ADD STORE_NAME x
    5. Bytecode Interpretation
      • Interpreter: The Python virtual machine (PVM) executes the bytecode. The PVM reads each bytecode instruction and performs the corresponding operation.
      • Example Execution:
        • LOAD_CONST 10: Pushes the value 10 onto the stack.
        • LOAD_CONST 20: Pushes the value 20 onto the stack.
        • BINARY_ADD: Pops the top two values from the stack, adds them, and pushes the result (30).
        • STORE_NAME x: Pops the top value from the stack and assigns it to the variable x.

    Handling of Different Code Parts

    1. Variables
      • Identifiers: Variables are identified during lexical analysis and stored in the symbol table during parsing. When a variable is referenced, the interpreter looks it up in the symbol table to retrieve its value.
      • Example: x = 5 y = x + 2
        • The lexer identifies x and y as identifiers.
        • The parser updates the symbol table with x and y.
    2. Constants
      • Literals: Constants are directly converted to tokens during lexical analysis. They are loaded onto the stack during bytecode execution.
      • Example: pi = 3.14
        • 3.14 is tokenized as a floating-point literal and stored as a constant in the bytecode.
    3. Operators
      • Tokens: Operators are tokenized during lexical analysis. During parsing, the parser determines the operation to be performed and generates the corresponding bytecode instructions.
      • Example: result = 4 * 7
        • * is tokenized as a multiplication operator.
        • The parser creates a binary operation node for multiplication.
    4. Keywords
      • Tokens: Keywords are reserved words in Python that are tokenized during lexical analysis. They dictate the structure and control flow of the program.
      • Example: if condition: print("Hello")
      • if is tokenized as a keyword.
      • The parser recognizes if and constructs a conditional branch in the AST.

    The Python interpreter processes code through several stages, including lexical analysis, syntax analysis, semantic analysis, intermediate representation, and bytecode interpretation. Each part of the code, such as variables, constants, operators, and keywords, is handled differently at each stage to ensure correct execution. Understanding these stages helps in comprehending how Python executes scripts and manages different elements within the code.
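    If you want to see the bytecode stage for yourself, the standard-library dis module can disassemble the example above. This is a minimal sketch: the exact opcodes differ between Python versions, and modern CPython may constant-fold 10 + 20 into a single LOAD_CONST 30.

    import dis

    # Compile the example statement and disassemble the resulting code object
    code = compile("x = 10 + 20", "<example>", "exec")
    dis.dis(code)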

    Step by step with an example


    Here’s a step-by-step explanation of how the Python interpreter reads and processes a Python script, along with an example:

    Step 1: Lexical Analysis

    • The Python interpreter reads the script character by character.
    • It breaks the script into tokens, such as keywords, identifiers, literals, and symbols.

    Example:

    print("Hello, World!")

    Tokens:

    • print (keyword)
    • ( (symbol)
    • "Hello, World!" (string literal)
    • ) (symbol)

    Step 2: Syntax Analysis

    • The interpreter analyzes the tokens to ensure they form a valid Python syntax.
    • It checks for syntax errors, such as mismatched brackets or incorrect indentation.

    Example:

    print("Hello, World!")

    Syntax Analysis:

    • The interpreter checks that print is a valid keyword.
    • It checks that the string literal is enclosed in quotes.
    • It checks that the parentheses are balanced.

    Step 3: Semantic Analysis

    • The interpreter analyzes the syntax tree to ensure it makes sense semantically.
    • It checks for semantic errors, such as undefined variables or incorrect data types.

    Example:

    x = 5
    print(x)

    Semantic Analysis:

    • The interpreter checks that x is defined before it’s used.
    • It checks that x is an integer and can be printed.

    Step 4: Bytecode Generation

    • The interpreter generates bytecode from the syntax tree.
    • Bytecode is platform-independent, intermediate code that can be executed by the Python virtual machine (PVM).

    Example:

    x = 5
    print(x)

    Bytecode Generation:

    • The interpreter generates bytecode for the assignment x = 5.
    • It generates bytecode for the print statement print(x).

    Step 5: Execution

    • The PVM executes the bytecode.
    • It performs the actions specified in the bytecode, such as assigning values to variables or printing output.

    Example:

    x = 5
    print(x)

    Execution:

    • The PVM executes the bytecode for the assignment x = 5, assigning the value 5 to x.
    • It executes the bytecode for the print statement print(x), printing 5 to the console.

    That’s a high-level overview of how the Python interpreter reads and processes a Python script!

    How does Python handle memory management?

    Python’s memory management is handled automatically by the Python interpreter, which uses several mechanisms to manage memory efficiently. Here’s a detailed explanation of how Python handles memory management:

    1. Automatic Memory Management

    Python handles memory management automatically through a combination of:

    1. Reference Counting: Python keeps track of the number of references to each object. When the reference count reaches zero, the object is garbage collected.
    2. Garbage Collection: Python’s garbage collector periodically identifies and frees unused objects.
    3. Memory Pooling: Python uses memory pools to allocate and deallocate memory for objects.
    4. Object Deallocation: Python deallocates memory for objects when they are no longer needed

    Reference Counting

    How it Works: Each object in Python has a reference count, which tracks the number of references to that object. When an object is created, its reference count is set to 1. Each time a reference to the object is created, the count increases. When a reference is deleted or goes out of scope, the count decreases. When the reference count drops to zero, meaning no references to the object exist, Python automatically deallocates the object and frees its memory.

    • Each object has a reference count.
    • When an object is created, its reference count is set to 1.
    • When an object is assigned to a variable, its reference count increases by 1.
    • When an object is deleted or goes out of scope, its reference count decreases by 1.
    • When the reference count reaches 0, the object is garbage collected.
    Example:
    import sys
    
    a = [1, 2, 3]
    b = a
    c = a
    
    print(sys.getrefcount(a))  # Output: 4 (including the reference count in sys.getrefcount)
    del b
    print(sys.getrefcount(a))  # Output: 3
    del c
    print(sys.getrefcount(a))  # Output: 2 ('a' itself plus the temporary reference created by getrefcount)
    

    Garbage Collection

    How it Works: Reference counting alone cannot handle cyclic references, where two or more objects reference each other, creating a cycle that keeps their reference counts non-zero even if they are no longer reachable from the program. Python uses a garbage collector to address this issue. The garbage collector periodically identifies and cleans up these cyclic references using an algorithm called “cyclic garbage collection.”

    • Python’s cyclic garbage collector runs periodically.
    • It identifies groups of objects that are unreachable from the program but keep each other alive through circular references.
    • It frees the memory allocated to these objects.
    Example:
    import gc
    
    class CircularReference:
        def __init__(self):
            self.circular_ref = None
    
    a = CircularReference()
    b = CircularReference()
    a.circular_ref = b
    b.circular_ref = a
    
    del a
    del b
    
    # Force garbage collection
    gc.collect()
    

    Memory Management with Python Interpreters

    • Python Interpreter: The CPython interpreter, the most commonly used Python interpreter, is responsible for managing memory in Python. It handles memory allocation, garbage collection, and reference counting.
    • Memory Allocation: When Python objects are created, memory is allocated from the system heap. Python maintains its own private heap space, where objects and data structures are stored.

    Memory Pools

    How it Works: To improve performance and reduce memory fragmentation, Python uses a technique called “memory pooling.” CPython, for instance, maintains different pools of memory for small objects (e.g., integers, small strings). This helps in reducing the overhead of frequent memory allocations and deallocations.

    • Python uses memory pools to allocate and deallocate memory for objects.
    • Memory pools reduce memory fragmentation.
    Example:
    import ctypes
    
    # Size of a C integer (illustration only; Python's small-object pooling via pymalloc is internal)
    int_size = ctypes.sizeof(ctypes.c_int)
    print(f"Size of a C integer: {int_size} bytes")

    Summary

    • Reference Counting: Tracks the number of references to an object and deallocates it when the count reaches zero.
    • Garbage Collection: Handles cyclic references that reference counting alone cannot manage.
    • Memory Pools: Improve efficiency by reusing memory for small objects.
    • Python Interpreter: Manages memory allocation, garbage collection, and reference counting.

    Python’s automatic memory management simplifies programming by abstracting these details away from the developer, allowing them to focus on writing code rather than managing memory manually.
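    As a small, hedged illustration of the garbage-collection component described above, the gc module lets you inspect and trigger the cyclic collector (the threshold values shown are typical defaults and can differ between Python versions):

    import gc

    print(gc.get_threshold())   # e.g. (700, 10, 10): per-generation allocation thresholds
    print(gc.get_count())       # current allocation counts per generation
    print(gc.collect())         # force a full collection; returns the number of unreachable objects found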

    Questions & Doubts:-

    How does the Python interpreter read bytecode?

    When you run a Python program, the process involves several stages, and bytecode is a crucial intermediate step. Here’s how Python handles bytecode:

    1. Source Code Compilation:

    • Step: You write Python code (source code) in a .py file.
    • Action: The Python interpreter first reads this source code and compiles it into a lower-level, platform-independent intermediate form called bytecode.
    • Tool: This is done by the compile() function in Python or automatically when you execute a Python script.

    2. Bytecode:

    • Definition: Bytecode is a set of instructions that is not specific to any particular machine. It’s a lower-level representation of your source code.
    • File Format: Bytecode is stored in .pyc files within the __pycache__ directory (for example, module.cpython-38.pyc for Python 3.8).
    • Purpose: Bytecode is designed to be executed by the Python Virtual Machine (PVM), which is part of the Python interpreter.

    3. Execution by the Python Virtual Machine (PVM):

    • Step: The PVM reads the bytecode and interprets it.
    • Action: The PVM executes each bytecode instruction inside its evaluation loop; classic CPython does not translate bytecode into standalone machine code (there is no traditional JIT compiler).
    • Function: This process involves the PVM taking each bytecode instruction, decoding it, and performing the corresponding operation (such as arithmetic, function calls, or data manipulation).

    Detailed Workflow:

    1. Parsing: The source code is parsed into an Abstract Syntax Tree (AST), which represents the structure of the code.
    2. Compilation to Bytecode:
      • The AST is compiled into bytecode, which is a low-level representation of the source code.
      • This bytecode is optimized for the Python Virtual Machine to execute efficiently.
    3. Execution:
      • The Python interpreter reads the bytecode from the .pyc file (if it exists) or compiles the .py source code to bytecode if needed.
      • The PVM executes the bytecode instructions, which involves fetching the instructions, decoding them, and performing the operations they specify.

    Example:

    Consider a simple Python code:

    # Source code: hello.py
    print("Hello, World!")
    
    • Compilation: When you run python hello.py, Python compiles this code into bytecode.
    • Bytecode File: For imported modules, this bytecode is cached in __pycache__ (e.g., module.cpython-38.pyc for Python 3.8); a script run directly is compiled to bytecode in memory and not normally cached.
    • Execution: The Python interpreter reads the bytecode from this file and executes it, resulting in “Hello, World!” being printed to the console.

    Python Bytecode Example:

    For a more technical view, let’s look at the bytecode generated by Python for a simple function:

    def add(a, b):
        return a + b
    

    When compiled, the bytecode might look something like this:

    0 LOAD_FAST                0 (a)
    3 LOAD_FAST                1 (b)
    6 BINARY_ADD
    7 RETURN_VALUE
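
    To reproduce a listing like this yourself, you can run import dis; dis.dis(add) on the function object defined above. The exact offsets and opcode names vary by Python version (newer CPython versions emit BINARY_OP instead of BINARY_ADD).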
    

    Summary:

    • Compilation: Python source code is compiled into bytecode.
    • Execution: The Python Virtual Machine (PVM) interprets the bytecode and executes it.
    • Purpose: Bytecode provides a platform-independent intermediate representation of the code, allowing Python to be cross-platform and flexible.

    Understanding this process helps in optimizing Python code and debugging issues related to performance or execution.

  • Python Lists: A Comprehensive Guide

    What is a List?

    Lists are a fundamental data structure in Python used to store collections of items. They are:

    • Ordered: Elements maintain a defined sequence.
    • Mutable: Elements can be modified after creation.
    • Defined by: Square brackets [].

    Example:

    fruits = ['apple', 'banana', 'orange']
    print(fruits)  # Output: ['apple', 'banana', 'orange']
    

    Accessing Elements in a List

    Positive Indexing

    a = [10, 20, 30, 40, 50]
    print(a[0])  # Output: 10
    print(a[4])  # Output: 50
    

    Negative Indexing (Access elements from the end)

    a = [1, 2, 3, 4, 5]
    print(a[-1])  # Output: 5
    print(a[-3])  # Output: 3
    

    Slicing

    my_list = [10, 20, 30, 40, 50]
    print(my_list[0:3])  # Output: [10, 20, 30]
    print(my_list[::2])  # Output: [10, 30, 50]
    

    List Operations

    Modifying Elements

    numbers = [1, 2, 3, 4, 5]
    numbers[2] = 10
    print(numbers)  # Output: [1, 2, 10, 4, 5]
    

    Adding Elements

    numbers.append(6)  # Adds at the end
    numbers.insert(1, 9)  # Insert at index 1
    numbers.extend([7, 8])  # Merge another list
    print(numbers)  # Output: [1, 9, 2, 10, 4, 5, 6, 7, 8]
    

    Removing Elements

    numbers.remove(10)  # Removes first occurrence
    popped = numbers.pop(2)  # Removes by index
    del numbers[0]  # Delete by index
    numbers.clear()  # Clears entire list
    

    Sorting and Reversing

    numbers = [3, 1, 4, 1, 5, 9, 2]
    numbers.sort()  # Ascending order
    numbers.reverse()  # Reverse order
    print(numbers)  # Output: [9, 5, 4, 3, 2, 1, 1]
    

    List Comprehensions

    Basic Example (Square of Numbers)

    squares = [x**2 for x in range(5)]
    print(squares)  # Output: [0, 1, 4, 9, 16]
    

    With Condition (Filtering)

    even_numbers = [x for x in range(10) if x % 2 == 0]
    print(even_numbers)  # Output: [0, 2, 4, 6, 8]
    

    With If-Else

    labels = ["Even" if x % 2 == 0 else "Odd" for x in range(5)]
    print(labels)  # Output: ['Even', 'Odd', 'Even', 'Odd', 'Even']
    

    Flatten a List of Lists

    matrix = [[1, 2, 3], [4, 5, 6]]
    flattened = [num for row in matrix for num in row]
    print(flattened)  # Output: [1, 2, 3, 4, 5, 6]
    

    Advanced Examples

    # Squares for even numbers, cubes for odd numbers
    numbers = range(1, 11)
    result = [x**2 if x % 2 == 0 else x**3 for x in numbers]
    print(result)
    
    # Keeping only multiples of 3, adding 1 to the odd ones
    numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    result = [x + 1 if x % 2 != 0 else x for x in numbers if x % 3 == 0]
    print(result)  # Output: [4, 6, 10]
    

    Taking User Input for Lists

    List of Integers from User Input

    user_input = input("Enter numbers separated by spaces: ")
    numbers = [int(num) for num in user_input.split()]
    print("List of numbers:", numbers)
    

    List of Strings from User Input

    user_input = input("Enter words separated by spaces: ")
    words = user_input.split()
    print("List of words:", words)
    

    Error Handling for Input

    def get_int_list():
        while True:
            try:
                input_string = input("Enter integers separated by spaces: ")
                return list(map(int, input_string.split()))
            except ValueError:
                print("Invalid input. Please enter integers only.")
    
    int_list = get_int_list()
    print("The list of integers is:", int_list)
    
    while True:
        user_input = input("Enter numbers separated by spaces or commas: ")
    
        # Replace commas with spaces
        cleaned_input = user_input.replace(',', ' ')
    
        # Create the list with None for invalid entries
        numbers = []
        for entry in cleaned_input.split():
            try:
                numbers.append(int(entry))
            except ValueError:
                numbers.append(None)
    
        # Check if there's at least one valid integer
        if any(num is not None for num in numbers):
            print("List of numbers (invalid entries as None):", numbers)
            break  # Exit the loop when you have at least one valid number
        else:
            print("No valid numbers entered. Try again.")
    

    Summary

    | Operation      | Function                        |
    |----------------|---------------------------------|
    | Add element    | append(), insert(), extend()    |
    | Remove element | remove(), pop(), del            |
    | Modify element | list[index] = value             |
    | Sorting        | sort()                          |
    | Reversing      | reverse()                       |
    | Slicing        | list[start:end:step]            |
    | Filtering      | [x for x in list if condition]  |

    This guide provides a structured overview of lists, including indexing, slicing, comprehensions, and user input handling. Mastering these concepts will enhance your Python programming efficiency!


    Tuples in Python


    Tuples in Python are ordered collections of items, similar to lists. However, unlike lists, tuples are immutable, meaning their elements cannot be changed after creation. Tuples are denoted by parentheses (), and items within the tuple are separated by commas. Tuples are commonly used for representing fixed collections of items, such as coordinates or records.

    Strings Vs Lists Vs Tuples

    Strings and lists are both examples of sequences. Strings are sequences of characters and are immutable. Lists are sequences of elements of any data type and are mutable. The third sequence type is the tuple. Tuples are like lists in that they can contain elements of any data type, but unlike lists, tuples are immutable. They are specified using parentheses instead of square brackets.

    Here’s a comprehensive comparison of strings, lists, and tuples in Python, highlighting their key differences and use cases:

    Strings

    • Immutable: Strings are unchangeable once created. You cannot modify the characters within a string.
    • Ordered: Characters in a string have a defined sequence and can be accessed using indexing (starting from 0).
    • Used for: Representing text data, storing names, URLs, file paths, etc.

    Example:

    name = "Alice"
    message = "Hello, world!"
    
    # Trying to modify a character in a string will result in a TypeError
    # name[0] = 'B'  # This will cause a TypeError
    

    Lists

    • Mutable: Lists can be modified after creation. You can add, remove, or change elements after the list is created.
    • Ordered: Elements in a list have a defined order and are accessed using zero-based indexing.
    • Used for: Storing collections of items of any data type, representing sequences that can change.

    Example:

    fruits = ["apple", "banana", "cherry"]
    
    # Add a new element
    fruits.append("kiwi")
    print(fruits)  # Output: ["apple", "banana", "cherry", "kiwi"]
    
    # Modify an element
    fruits[1] = "mango"
    print(fruits)  # Output: ["apple", "mango", "cherry", "kiwi"]
    

    Tuples

    • Immutable: Tuples are similar to lists but cannot be modified after creation.
    • Ordered: Elements in a tuple have a defined order and are accessed using indexing.
    • Used for: Representing fixed data sets, storing data collections that shouldn’t be changed, passing arguments to functions where the data shouldn’t be modified accidentally.

    Example:

    coordinates = (10, 20)
    
    # Trying to modify an element in a tuple will result in a TypeError
    # coordinates[0] = 15  # This will cause a TypeError
    
    # You can create tuples without parentheses for simple cases
    person = "Alice", 30, "New York"  # This is also a tuple
    

    Key Differences:

    | Feature    | String                             | List                                            | Tuple                                           |
    |------------|------------------------------------|-------------------------------------------------|-------------------------------------------------|
    | Mutability | Immutable                          | Mutable                                         | Immutable                                       |
    | Ordering   | Ordered                            | Ordered                                         | Ordered                                         |
    | Use Cases  | Text data, names, URLs, file paths | Collections of items, sequences that can change | Fixed data sets, data that shouldn’t be changed |

    Choosing the Right Data Structure:

    • Use strings when you need to store text data that shouldn’t be modified.
    • Use lists when you need to store a collection of items that you might need to change later.
    • Use tuples when you need a fixed data set that shouldn’t be modified after creation. Tuples can also be useful when you want to pass arguments to a function and ensure the data isn’t accidentally changed.

    Here’s an overview of tuples in Python:

    1. Creating Tuples:

    You can create tuples in Python using parentheses () and separating elements with commas.

    # Example 1: Tuple of Integers
    numbers = (1, 2, 3, 4, 5)

    # Example 2: Tuple of Strings
    fruits = ('apple', 'banana', 'orange', 'kiwi')

    # Example 3: Mixed Data Types
    mixed_tuple = (1, 'apple', True, 3.14)

    # Example 4: Singleton Tuple (Tuple with one element)
    singleton_tuple = (42,) # Note the comma after the single element

    2. Accessing Elements:

    You can access individual elements of a tuple using their indices, similar to lists.

    numbers = (1, 2, 3, 4, 5)
    print(numbers[0]) # Output: 1
    print(numbers[-1]) # Output: 5 (negative index counts from the end)

    3. Immutable Nature:

    Tuples are immutable, meaning you cannot modify their elements after creation. Attempts to modify a tuple will result in an error.

    numbers = (1, 2, 3)
    numbers[1] = 10 # This will raise a TypeError

    4. Tuple Operations:

    Although tuples are immutable, you can perform various operations on them, such as concatenation and repetition.

    # Concatenation
    tuple1 = (1, 2, 3)
    tuple2 = (4, 5, 6)
    combined_tuple = tuple1 + tuple2 # Output: (1, 2, 3, 4, 5, 6)

    # Repetition
    repeated_tuple = (0,) * 5 # Output: (0, 0, 0, 0, 0)

    5. Tuple Unpacking:

    You can unpack a tuple into individual variables.

    coordinates = (3, 5)
    x, y = coordinates
    print(x) # Output: 3
    print(y) # Output: 5

    6. Use Cases:

    Tuples are commonly used for:

    • Returning multiple values from functions.
    • Representing fixed collections of data (e.g., coordinates, RGB colors).
    • Immutable keys in dictionaries.
    • Namedtuples for creating lightweight data structures.
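    As a minimal sketch of the namedtuple use case above (the Point type here is just an illustrative example):

    from collections import namedtuple

    Point = namedtuple("Point", ["x", "y"])
    p = Point(10, 20)

    print(p.x, p.y)   # 10 20 (fields are accessible by name)
    x, y = p          # still unpacks like an ordinary tuple
    # p.x = 15        # AttributeError: namedtuples are immutable, like plain tuples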

    Summary

    Tuple Creation and Initialization

    | Function/Operation | Return Type | Example (Visual) | Example (Code)             |
    |--------------------|-------------|------------------|----------------------------|
    | tuple()            | Tuple       | (1, 2, 3)        | numbers = tuple((1, 2, 3)) |
    | () (empty tuple)   | Tuple       | ()               | empty_tuple = ()           |

    Accessing Elements

    | Function/Operation    | Return Type      | Example (Visual) | Example (Code)                                                         |
    |-----------------------|------------------|------------------|------------------------------------------------------------------------|
    | tuple[index]          | Element at index | (1, 2, 3)        | first_element = numbers[0]                                             |
    | tuple[start:end:step] | Subtuple         | (1, 2, 3, 4, 5)  | subtuple = numbers[1:4] (elements from index 1 to 3, not including 4)  |

    Unpacking

    | Function/Operation      | Return Type                   | Example (Visual) | Example (Code)    |
    |-------------------------|-------------------------------|------------------|-------------------|
    | var1, var2, ... = tuple | Assigns elements to variables | (1, 2, 3)        | x, y, z = numbers |

    Membership Testing

    | Function/Operation | Return Type | Example (Visual) | Example (Code)                 |
    |--------------------|-------------|------------------|--------------------------------|
    | element in tuple   | Boolean     | 1 in (1, 2, 3)   | is_one_in_tuple = 1 in numbers |

    Important Note:

    • Tuples are immutable, meaning you cannot modify their elements after creation.

    Additional Functions (though not for modifying the tuple itself):

    | Function/Operation | Return Type                                    | Example (Visual) | Example (Code)                |
    |--------------------|------------------------------------------------|------------------|-------------------------------|
    | len(tuple)         | Integer                                        | (1, 2, 3)        | tuple_length = len(numbers)   |
    | count(element)     | Number of occurrences                          | (1, 2, 2, 3)     | count_2 = numbers.count(2)    |
    | index(element)     | Index of first occurrence (error if not found) | (1, 2, 3, 2)     | index_of_2 = numbers.index(2) |
    | min(tuple)         | Minimum value                                  | (1, 2, 3)        | min_value = min(numbers)      |
    | max(tuple)         | Maximum value                                  | (1, 2, 3)        | max_value = max(numbers)      |
    | tuple + tuple      | New tuple (concatenation)                      | (1, 2) + (3, 4)  | combined = numbers + (3, 4)   |
    | tuple * n          | New tuple (repetition)                         | (1, 2) * 2       | repeated = numbers * 2        |

    Iterating over lists and tuples in Python

    Iterating over lists and tuples in Python is straightforward using loops or list comprehensions. Both lists and tuples are iterable objects, meaning you can loop through their elements one by one. Here’s how you can iterate over lists and tuples:

    1. Using a For Loop:

    You can use a for loop to iterate over each element in a list or tuple.

    Example with a List:

    numbers = [1, 2, 3, 4, 5]
    for num in numbers:
        print(num)

    Example with a Tuple:

    coordinates = (3, 5)
    for coord in coordinates:
        print(coord)

    2. Using List Comprehensions:

    List comprehensions provide a concise way to iterate over lists and tuples and perform operations on their elements.

    Example with a List:

    numbers = [1, 2, 3, 4, 5]
    squared_numbers = [num ** 2 for num in numbers]
    print(squared_numbers)

    Example with a Tuple:

    coordinates = ((1, 2), (3, 4), (5, 6))
    sum_of_coordinates = [sum(coord) for coord in coordinates]
    print(sum_of_coordinates)

    3. Using Enumerate:

    The enumerate() function can be used to iterate over both the indices and elements of a list or tuple simultaneously.

    Example with a List:

    fruits = ['apple', 'banana', 'orange']
    for index, fruit in enumerate(fruits):
        print(f"Index {index}: {fruit}")

    Example with a Tuple:

    coordinates = ((1, 2), (3, 4), (5, 6))
    for index, coord in enumerate(coordinates):
        print(f"Index {index}: {coord}")

    4. Using Zip:

    The zip() function allows you to iterate over corresponding elements of multiple lists or tuples simultaneously.

    Example with Lists:

    names = ['Alice', 'Bob', 'Charlie']
    ages = [25, 30, 35]
    for name, age in zip(names, ages):
        print(f"{name} is {age} years old")

    Example with Tuples:

    coordinates = ((1, 2), (3, 4), (5, 6))
    for x, y in coordinates:
        print(f"X: {x}, Y: {y}")


    List comprehensions in Details From Start to End

    A list comprehension is a concise way to create lists in Python. It follows the patterns below.

    ✅Pattern 1: Basic List Comprehension
    [expression for item in iterable if condition]
    Breaking it Down:
    1️⃣ Expression → What to do with each item in the list.
    2️⃣ Iterable → The source (e.g., list, range(), df.columns, etc.).
    3️⃣ Condition (Optional) → A filter to select items that meet certain criteria.
    
    
    ✅Pattern 2: List Comprehension with if-else (Ternary Expression)
    [expression_if_true if condition else expression_if_false for item in iterable]
    
    Common Mistake
    ❌ Incorrect (if placed incorrectly)
    [x**2 for x in numbers if x % 2 == 0 else x**3]  # ❌ SyntaxError
    
    
    ✅ Correct (if-else goes before for in ternary case)
    [x**2 if x % 2 == 0 else x**3 for x in numbers]  # ✅ Works fine
    
    
    
    ✅ Pattern 3: Nested List Comprehensions
    [expression for sublist in iterable for item in sublist]
    
    
    

    Here’s a comprehensive collection of list comprehension examples, including basic, advanced, and smart/tricky ones:


    🔥 Basic List Comprehension Examples

    1️⃣ Square of Numbers

    squares = [x**2 for x in range(5)]
    print(squares)
    # Output: [0, 1, 4, 9, 16]
    

    2️⃣ Filtering Even Numbers

    even_numbers = [x for x in range(10) if x % 2 == 0]
    print(even_numbers)
    # Output: [0, 2, 4, 6, 8]
    

    3️⃣ Labeling Odd and Even Numbers

    labels = ["Even" if x % 2 == 0 else "Odd" for x in range(5)]
    print(labels)
    # Output: ['Even', 'Odd', 'Even', 'Odd', 'Even']
    

    🚀 Smart List Comprehension Examples

    4️⃣ Removing _n from Column Names

    columns = ["col_1", "col_2", "name", "col_119"]
    clean_columns = [col.replace("_" + col.split("_")[-1], "") if col.split("_")[-1].isdigit() else col for col in columns]
    print(clean_columns)
    # Output: ['col', 'col', 'name', 'col']
    

    5️⃣ Flatten a List of Lists

    matrix = [[1, 2, 3], [4, 5, 6]]
    flattened = [num for row in matrix for num in row]
    print(flattened)
    # Output: [1, 2, 3, 4, 5, 6]
    

    6️⃣ Square Even Numbers, Cube Odd Numbers

    numbers = range(1, 11)
    result = [x**2 if x % 2 == 0 else x**3 for x in numbers]
    print(result)
    # Output: [1, 4, 27, 16, 125, 36, 343, 64, 729, 100]
    

    7️⃣ Filtering Multiples of 3 and Incrementing Odd Numbers

    numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    result = [x + 1 if x % 2 != 0 else x for x in numbers if x % 3 == 0]
    print(result)
    # Output: [4, 6, 10]
    

    8️⃣ Creating Labels for Word Lengths

    words = ["apple", "banana", "grape", "watermelon", "orange"]
    result = [f"{word}: long" if len(word) > 6 else f"{word}: short" for word in words]
    print(result)
    # Output: ['apple: short', 'banana: short', 'grape: short', 'watermelon: long', 'orange: short']
    

    💡 Tricky and Useful List Comprehension Examples

    9️⃣ Extracting Digits from Strings

    data = ["a12", "b3c", "45d", "xyz"]
    digits = ["".join([char for char in item if char.isdigit()]) for item in data]
    print(digits)
    # Output: ['12', '3', '45', '']
    

    🔟 Finding Common Elements in Two Lists

    list1 = [1, 2, 3, 4, 5]
    list2 = [3, 4, 5, 6, 7]
    common = [x for x in list1 if x in list2]
    print(common)
    # Output: [3, 4, 5]
    

    1️⃣1️⃣ Finding Unique Elements in One List (Not in Another)

    unique = [x for x in list1 if x not in list2]
    print(unique)
    # Output: [1, 2]
    

    1️⃣2️⃣ Generate Pairs of Numbers (Tuple Pairing)

    pairs = [(x, y) for x in range(3) for y in range(3)]
    print(pairs)
    # Output: [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
    

    1️⃣3️⃣ Creating a Dictionary Using List Comprehension

    squares_dict = {x: x**2 for x in range(5)}
    print(squares_dict)
    # Output: {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
    

    1️⃣4️⃣ Finding Duplicate Elements in a List

    nums = [1, 2, 3, 2, 4, 5, 6, 4, 7]
    duplicates = list(set([x for x in nums if nums.count(x) > 1]))
    print(duplicates)
    # Output: [2, 4]
    

    1️⃣5️⃣ Converting a List of Strings to Integers, Ignoring Errors

    data = ["10", "abc", "30", "xyz", "50"]
    numbers = [int(x) for x in data if x.isdigit()]
    print(numbers)
    # Output: [10, 30, 50]
    

    1️⃣6️⃣ Getting the ASCII Values of Characters

    ascii_values = [ord(char) for char in "Python"]
    print(ascii_values)
    # Output: [80, 121, 116, 104, 111, 110]
    

    🔥 Bonus: Nested List Comprehension

    1️⃣7️⃣ Transposing a Matrix

    matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    transposed = [[row[i] for row in matrix] for i in range(len(matrix[0]))]
    print(transposed)
    # Output: [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
    

    1️⃣8️⃣ Flattening a Nested Dictionary

    data = {"a": {"x": 1, "y": 2}, "b": {"x": 3, "y": 4}}
    flattened = [(key, subkey, value) for key, subdict in data.items() for subkey, value in subdict.items()]
    print(flattened)
    # Output: [('a', 'x', 1), ('a', 'y', 2), ('b', 'x', 3), ('b', 'y', 4)]
    

    Specials about List and Tuples–Python concepts related to tuples, list comprehensions, merging lists, and user input handling

    Q1: Can we achieve List Comprehension type functionality in case of Tuples?

    Yes, we can achieve a similar effect with tuples. However, Python has no dedicated tuple-comprehension syntax: a parenthesized comprehension actually creates a generator expression, so we wrap it in tuple() to materialize a new tuple (tuples themselves are immutable and cannot be modified in place).

    Tuple Comprehension Syntax:

    (expression for item in iterable if condition)
    

    Examples:

    1. Creating a tuple of squares from a list:
    numbers = [1, 2, 3, 4, 5]
    squares_tuple = tuple(x ** 2 for x in numbers)
    print(squares_tuple)  # Output: (1, 4, 9, 16, 25)
    
    2. Filtering even numbers from a tuple:
    mixed_tuple = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    even_numbers_tuple = tuple(x for x in mixed_tuple if x % 2 == 0)
    print(even_numbers_tuple)  # Output: (2, 4, 6, 8, 10)
    
    3. Creating a tuple of tuples from a list of lists:
    list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    tuple_of_tuples = tuple(tuple(row) for row in list_of_lists)
    print(tuple_of_tuples)  # Output: ((1, 2, 3), (4, 5, 6), (7, 8, 9))
    

    Q2: Why prefer Python list comprehensions over for-loops?

    List comprehensions offer several advantages over traditional for-loops:

    1. Readability and Conciseness:
      • For-loop Example:

        squares = []
        for x in range(10):
            squares.append(x**2)
        print(squares)

      • List Comprehension Equivalent:

        squares = [x**2 for x in range(10)]
        print(squares)
    2. Performance:
      • List comprehensions are optimized and execute faster than for-loops.
    3. Functional Programming Paradigm:
      • Supports operations like mapping and filtering more elegantly.
      squares = [x**2 for x in range(10) if x % 2 == 0]
    4. Immutability and Side-Effect Reduction:
      • Reduces the risk of unintended modifications.
    5. Nested Comprehensions:
      • For-loop Example:

        matrix = []
        for i in range(3):
            row = []
            for j in range(3):
                row.append(i * j)
            matrix.append(row)
        print(matrix)

      • Nested List Comprehension Equivalent:

        matrix = [[i * j for j in range(3)] for i in range(3)]
        print(matrix)

    When to Use For-Loops?

    • When handling complex logic.
    • When dealing with side-effects.
    • When working with large datasets (use generators instead of list comprehensions to optimize memory usage).
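    A quick, hedged illustration of that last point: a list comprehension materializes every element, while the equivalent generator expression produces values lazily, so its size stays tiny regardless of the range (exact byte counts vary by Python version and platform).

    import sys

    squares_list = [x * x for x in range(1_000_000)]   # all values stored in memory
    squares_gen = (x * x for x in range(1_000_000))    # values produced one at a time

    print(sys.getsizeof(squares_list))  # several megabytes
    print(sys.getsizeof(squares_gen))   # a few hundred bytes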

    Q3: How to Merge Two Lists?

    1. Merge Two Lists of Any Type:
    def merge_lists(list1, list2):
        return list1 + list2
    
    list1 = [1, 2, 3]
    list2 = ['a', 'b', 'c']
    print(merge_lists(list1, list2))  # Output: [1, 2, 3, 'a', 'b', 'c']
    
    2. Merge and Sort Lists of Numbers:
    def merge_and_sort_lists(list1, list2):
        return sorted(list1 + list2)
    
    list1 = [3, 1, 4]
    list2 = [2, 5, 0]
    print(merge_and_sort_lists(list1, list2))  # Output: [0, 1, 2, 3, 4, 5]
    
    3. Merge Two Sorted Lists Efficiently:
    def merge_sorted_lists(list1, list2):
        merged_list = []
        i, j = 0, 0
        while i < len(list1) and j < len(list2):
            if list1[i] < list2[j]:
                merged_list.append(list1[i])
                i += 1
            else:
                merged_list.append(list2[j])
                j += 1
        merged_list.extend(list1[i:])
        merged_list.extend(list2[j:])
        return merged_list
    
    list1 = [1, 3, 5]
    list2 = [2, 4, 6]
    print(merge_sorted_lists(list1, list2))  # Output: [1, 2, 3, 4, 5, 6]
    

    Q4: How to get a list of integers or strings from user input?

    1. List of Integers:
    int_list = list(map(int, input("Enter numbers separated by spaces: ").split()))
    print("List of integers:", int_list)
    
    2. List of Strings:
    string_list = input("Enter words separated by spaces: ").split()
    print("List of strings:", string_list)
    

    Q5: A Complete Example – Merging Two User-Input Lists and Sorting Them

    def merge_sorted_lists(l1, l2):
        i, j = 0, 0
        merged = []
        while i < len(l1) and j < len(l2):
            if l1[i] < l2[j]:
                merged.append(l1[i])
                i += 1
            else:
                merged.append(l2[j])
                j += 1
        merged.extend(l1[i:])
        merged.extend(l2[j:])
        return merged
    
    if __name__ == "__main__":
        l1 = list(map(int, input("Enter the first list of numbers: ").split()))
        l2 = list(map(int, input("Enter the second list of numbers: ").split()))
        combined = merge_sorted_lists(l1, l2)
        print("Combined sorted list:", combined)
    

  • Solving coding problems efficiently requires a structured approach. Here’s a step-by-step guide along with shortcuts and pseudocode tips.


    📌 Step 1: Understand the Problem Clearly

    1. Read the problem statement carefully
    2. Identify:
      • Input format (list, string, integer, etc.)
      • Output format (return type, expected result)
      • Constraints (limits on input size, time complexity)
      • Edge cases (empty lists, negative values, duplicates, etc.)
    3. Clarify doubts (If given in an interview, ask questions)

    Shortcut: Rephrase the problem in simple words to ensure you understand it.


    📌 Step 2: Plan Your Approach (Pseudocode)

    1. Break the problem into smaller steps
    2. Use pseudocode to design the solution logically.
    3. Identify iterables, variables, and conditions

    Shortcut: Use the “Pattern Matching” technique (compare with similar solved problems).

    🔹 Example Pseudocode Format

    1. Read input
    2. Initialize variables
    3. Loop through the input
    4. Apply conditions and logic
    5. Store or update results
    6. Return or print the final result
    

    🔹 Example: Find the sum of even numbers in a list

    1. Initialize sum = 0
    2. Loop through each number in the list
    3. If number is even:
         - Add to sum
    4. Return sum
    

    📌 Step 3: Choose the Best Data Structures

    • Lists (list) – Ordered collection, used for iteration and indexing
    • Sets (set) – Fast lookup, removes duplicates
    • Dictionaries (dict) – Key-value storage, fast access
    • Tuples (tuple) – Immutable ordered collection
    • Deque (collections.deque) – Faster than lists for appending/removing

    Shortcut: Use Counter, defaultdict, or heapq for faster solutions.
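    For example, a brief sketch of those standard-library helpers (illustrative data only):

    from collections import Counter, defaultdict
    import heapq

    words = ["spark", "python", "spark", "sql", "python", "spark"]

    counts = Counter(words)
    print(counts.most_common(1))        # [('spark', 3)]: most frequent element

    by_letter = defaultdict(list)       # missing keys default to an empty list
    for w in words:
        by_letter[w[0]].append(w)
    print(dict(by_letter))              # {'s': ['spark', 'spark', 'sql', 'spark'], 'p': ['python', 'python']}

    print(heapq.nlargest(2, [5, 1, 9, 3]))  # [9, 5]: top-k without a full sort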


    📌 Step 4: Write the Code in Python

    Example Problem: Find the sum of even numbers in a list

    def sum_of_evens(numbers):
        return sum(num for num in numbers if num % 2 == 0)
    
    # Example Usage
    nums = [1, 2, 3, 4, 5, 6]
    print(sum_of_evens(nums))  # Output: 12
    

    Shortcut: Use list comprehensions for concise code.


    📌 Step 5: Optimize Your Solution

    • Use efficient loops (for loops > while loops in most cases)
    • Avoid nested loops (use sets, dictionaries, or sorting to optimize)
    • Use mathematical shortcuts where possible
    • Use built-in functions (e.g., sum(), min(), max(), sorted())

    🔹 Example Optimization:
    Instead of:

    for i in range(len(arr)):
        for j in range(i + 1, len(arr)):
            if arr[i] == arr[j]:
                print("duplicate:", arr[i])
    

    Use set lookup (O(1) membership checks instead of the O(n^2) nested loop):

    seen = set()
    for num in arr:
        if num in seen:
            print("duplicate:", num)
        seen.add(num)
    

    📌 Step 6: Handle Edge Cases & Test

    ✅ Always check for:

    • Empty inputs
    • Single-element lists
    • Large inputs (performance testing)
    • Negative numbers
    • Duplicates
    assert sum_of_evens([]) == 0  # Edge case: Empty list
    assert sum_of_evens([2]) == 2  # Edge case: Single even number
    assert sum_of_evens([1, 3, 5]) == 0  # Edge case: No even numbers
    

    Shortcut: Use assert statements for quick testing.


    📌 Step 7: Write the Final Code Efficiently

    • Keep it readable and well-commented
    • Use meaningful variable names
    • Use functions instead of writing everything in main()

    🚀 Final Example (Using All Best Practices)

    def sum_of_evens(numbers):
        """Returns the sum of all even numbers in a list."""
        return sum(num for num in numbers if num % 2 == 0)
    
    # Test cases
    assert sum_of_evens([]) == 0
    assert sum_of_evens([2]) == 2
    assert sum_of_evens([1, 3, 5]) == 0
    assert sum_of_evens([2, 4, 6, 8]) == 20
    
    print("All test cases passed!")
    

    💡 Key Takeaways

    1. Understand the problem and constraints.
    2. Plan your solution using pseudocode.
    3. Pick the right data structures.
    4. Optimize loops & avoid redundant operations.
    5. Test with edge cases & use assertions.

    Shortcut: Identify patterns from previous problems to apply known solutions faster.



  • What are Iterables?

    An iterable is any object that can return an iterator, meaning it can be looped over using for loops or passed to functions like map(), filter(), etc.

    🔹 List of Built-in Iterables in Python

    Python provides several built-in iterable objects:

    | Iterable Type                    | Syntax Example                      | Mutable? | Ordered?             | Duplicates Allowed? |
    |----------------------------------|-------------------------------------|----------|----------------------|---------------------|
    | List                             | [1, 2, 3]                           | ✅ Yes   | ✅ Yes               | ✅ Yes              |
    | Tuple                            | (1, 2, 3)                           | ❌ No    | ✅ Yes               | ✅ Yes              |
    | String                           | "hello"                             | ❌ No    | ✅ Yes               | ✅ Yes              |
    | Set                              | {1, 2, 3}                           | ✅ Yes   | ❌ No                | ❌ No               |
    | Dictionary (keys, values, items) | {"a": 1, "b": 2}                    | ✅ Yes   | ✅ Yes (Python 3.7+) | ❌ No (keys)        |
    | Range                            | range(5)                            | ❌ No    | ✅ Yes               | ✅ Yes              |
    | File Objects                     | open("file.txt")                    | ✅ Yes   | ✅ Yes               | ✅ Yes              |
    | Enumerate                        | enumerate([10, 20, 30])             | ❌ No    | ✅ Yes               | ✅ Yes              |
    | Zip                              | zip([1, 2], ["a", "b"])             | ❌ No    | ✅ Yes               | ✅ Yes              |
    | Map                              | map(str.upper, ["a", "b"])          | ❌ No    | ✅ Yes               | ✅ Yes              |
    | Filter                           | filter(lambda x: x > 0, [1, -1, 2]) | ❌ No    | ✅ Yes               | ✅ Yes              |

    All iterators are iterables, but not all iterables are iterators!

    This is a fundamental concept in Python that often confuses beginners. Let’s break it down clearly:


    What are Iterables?

    An iterable is any object in Python that can be looped over using a for loop. It implements the __iter__() method, which returns an iterator. Examples of iterables include:

    • Lists: [1, 2, 3]
    • Tuples: (1, 2, 3)
    • Strings: "hello"
    • Dictionaries: {"a": 1, "b": 2}
    • Sets: {1, 2, 3}
    • Range objects: range(10)

    When you use a for loop, Python automatically calls the __iter__() method of the iterable to get an iterator.


    What are Iterators?

    An iterator is an object that implements two methods:

    1. __iter__(): Returns the iterator object itself.
    2. __next__(): Returns the next value in the sequence. When there are no more items, it raises the StopIteration exception.

    Iterators are stateful, meaning they keep track of where they are in the sequence during iteration.

    Examples of iterators:

    • The object returned by iter() (e.g., iter([1, 2, 3])).
    • Generator objects (created by generator functions or expressions).
    • Objects returned by itertools functions (e.g., itertools.count()).

    Why Are All Iterators Iterables?

    All iterators are iterables because:

    • They implement the __iter__() method, which returns self (the iterator itself).
    • This means you can use an iterator in a for loop or anywhere an iterable is expected.

    Example:

    my_list = [1, 2, 3]
    my_iterator = iter(my_list)  # Get an iterator from the list
    
    # Since my_iterator is an iterable, we can loop over it
    for item in my_iterator:
        print(item)

    Why Are Not All Iterables Iterators?

    Not all iterables are iterators because:

    • Iterables only need to implement the __iter__() method, which returns an iterator.
    • They do not need to implement the __next__() method, which is required for iteration.

    Example:

    my_list = [1, 2, 3]  # This is an iterable
    # my_list is not an iterator because it doesn't implement __next__()

    If you try to call next() directly on an iterable (that is not an iterator), you’ll get an error:

    my_list = [1, 2, 3]
    next(my_list)  # TypeError: 'list' object is not an iterator

    To make it work, you need to convert the iterable into an iterator using iter():

    my_iterator = iter(my_list)
    print(next(my_iterator))  # 1

    Key Differences

    | Feature           | Iterable                              | Iterator                                      |
    |-------------------|---------------------------------------|-----------------------------------------------|
    | Definition        | An object that can be iterated over.  | An object that produces values one at a time. |
    | Methods           | Implements __iter__().                | Implements __iter__() and __next__().         |
    | State             | Stateless (doesn’t track iteration).  | Stateful (tracks iteration progress).         |
    | Example           | Lists, tuples, strings, dictionaries. | Objects returned by iter(), generators.       |
    | Usage in for loop | Yes (implicitly calls iter()).        | Yes (directly usable).                        |
    | Usage with next() | No (must convert to iterator first).  | Yes (directly usable).                        |

    Example to Illustrate

    Iterable (List)

    my_list = [1, 2, 3]  # This is an iterable
    for item in my_list:  # Python calls iter(my_list) internally
        print(item)

    Iterator (Object returned by iter())

    my_list = [1, 2, 3]
    my_iterator = iter(my_list)  # This is an iterator
    
    print(next(my_iterator))  # 1
    print(next(my_iterator))  # 2
    print(next(my_iterator))  # 3
    print(next(my_iterator))  # Raises StopIteration

    Custom Example

    Custom Iterable

    class MyIterable:
        def __init__(self, data):
            self.data = data
    
        def __iter__(self):
            return iter(self.data)  # Returns an iterator
    
    my_iterable = MyIterable([1, 2, 3])
    for item in my_iterable:
        print(item)  # 1, 2, 3

    Custom Iterator

    class MyIterator:
        def __init__(self, data):
            self.data = data
            self.index = 0
    
        def __iter__(self):
            return self
    
        def __next__(self):
            if self.index >= len(self.data):
                raise StopIteration
            value = self.data[self.index]
            self.index += 1
            return value
    
    my_iterator = MyIterator([1, 2, 3])
    for item in my_iterator:
        print(item)  # 1, 2, 3

    Key Takeaways

    • Iterables are objects you can iterate over (e.g., lists, tuples).
    • Iterators are objects that produce values one at a time during iteration.
    • All iterators are iterables because they implement __iter__().
    • Not all iterables are iterators because they don’t implement __next__().
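    A quick way to verify these takeaways, using the abstract base classes in collections.abc:

    from collections.abc import Iterable, Iterator

    nums = [1, 2, 3]
    it = iter(nums)

    print(isinstance(nums, Iterable), isinstance(nums, Iterator))  # True False: a list is only an iterable
    print(isinstance(it, Iterable), isinstance(it, Iterator))      # True True: every iterator is also an iterable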

    Below is a complete guide to Python’s built-in iterables, along with use cases and challenges for each.


    1. Lists

    Lists are ordered, mutable collections of items.

    Use Cases

    • Storing and manipulating sequences of data.
    • Iterating over elements using a for loop.
    • Using list comprehensions for concise transformations.

    Best Use Case: Storing multiple values and dynamically modifying them.

    nums = [1, 2, 3, 4]
    nums.append(5)  # Add element
    nums.remove(3)  # Remove element
    print(nums)  # [1, 2, 4, 5]
    
    for num in nums:
        print(num)  # Iterate over list
    
    my_list = [1, 2, 3, 4]
    for item in my_list:
        print(item)

    🔥 Challenge: Find all unique triplets in a list that sum to zero.

    Challenges

    Flatten a Nested List:

    nested = [[1, 2], [3, 4], [5]]
    flattened = [item for sublist in nested for item in sublist]
    print(flattened)  # [1, 2, 3, 4, 5]

    Find the Second Largest Element:

    my_list = [10, 20, 4, 45, 99]
    sorted_list = sorted(my_list, reverse=True)
    print(sorted_list[1]) # 45

    2. Tuples

    Tuples are ordered, immutable collections of items.

    Use Cases

    • Storing fixed data (e.g., coordinates, database records).
    • Returning multiple values from a function.

    💡Best Use Case: Storing fixed values (e.g., coordinates, database rows).

    my_tuple = (1, 2, 3)
    for item in my_tuple:
        print(item)
    
    
    point = (10, 20)  # Immutable (cannot be modified)
    for val in point:
        print(val)  # Iterate over tuple
    

    🔥 Challenge: Convert a list of (name, age) tuples into a sorted tuple by age.

    Challenges

    Swap Two Variables Using Tuples:

    a, b = 5, 10
    a, b = b, a
    print(a, b) # 10, 5

    Find the Frequency of Elements:

    my_tuple = (1, 2, 2, 3, 3, 3)
    frequency = {item: my_tuple.count(item) for item in set(my_tuple)}
    print(frequency)  # {1: 1, 2: 2, 3: 3}

    3. Strings

    Strings are sequences of characters.

    Use Cases

    • Iterating over characters in a string.
    • Manipulating and processing text data.

    💡Best Use Case: Storing & processing text data (e.g., file processing, NLP).

    
    text = "hello"
    for char in text:
        print(char)  # Iterate over characters
    

    🔥 Challenge: Find the first non-repeating character in a string.

    Challenges

    Reverse a String:

    my_string = "hello"
    reversed_string = my_string[::-1]
    print(reversed_string)  # "olleh"

    Check if a String is a Palindrome:

    def is_palindrome(s):
        return s == s[::-1]
    
    print(is_palindrome("racecar"))  # True

    4. Dictionaries

    Dictionaries are collections of key-value pairs; since Python 3.7 they preserve insertion order.

    Use Cases

    • Storing and retrieving data using keys.
    • Iterating over keys, values, or items.


    my_dict = {"a": 1, "b": 2, "c": 3}
    for key, value in my_dict.items():
        print(key, value)
    
    data = {"name": "Alice", "age": 25}
    for key, value in data.items():
        print(key, value)  # name Alice, age 25
    

    🔥 Challenge: Find the most frequently occurring word in a text file.

    Challenges

    Merge Two Dictionaries:

    dict1 = {"a": 1, "b": 2}
    dict2 = {"c": 3, "d": 4}
    merged = {**dict1, **dict2}
    print(merged)  # {'a': 1, 'b': 2, 'c': 3, 'd': 4}

    Invert a Dictionary:

    my_dict = {"a": 1, "b": 2, "c": 3}
    inverted = {v: k for k, v in my_dict.items()}
    print(inverted)  # {1: 'a', 2: 'b', 3: 'c'}

    5. Sets

    Sets are unordered collections of unique elements.

    Use Cases

    • Removing duplicates from a list.
    • Performing set operations (e.g., union, intersection).
    my_set = {1, 2, 3, 4}
    for item in my_set:
        print(item)

    Challenges

    Find Common Elements Between Two Lists:

    list1 = [1, 2, 3, 4]
    list2 = [3, 4, 5, 6]
    common = set(list1).intersection(list2)
    print(common)  # {3, 4}

    Check if a List Contains Duplicates:

    my_list = [1, 2, 3, 2]
    has_duplicates = len(my_list) != len(set(my_list))
    print(has_duplicates)  # True

    6. Range

    range generates a sequence of numbers.

    Use Cases

    • Iterating over a sequence of numbers.
    • Generating indices for loops.
    for i in range(5):
        print(i)  # 0, 1, 2, 3, 4

    🔥 Challenge: Generate prime numbers using a range and list comprehension.

    Challenges

    Generate a List of Even Numbers:

    evens = list(range(0, 10, 2))
    print(evens)  # [0, 2, 4, 6, 8]

    Sum Numbers from 1 to 100:

    total = sum(range(1, 101))
    print(total)  # 5050

    7. Files

    File objects are iterable, allowing line-by-line iteration.

    Use Cases

    • Reading large files without loading them entirely into memory.
    • Processing log files or CSV data.
    with open("file.txt", "r") as file:
        for line in file:
            print(line.strip())

    Challenges

    Count the Number of Lines in a File:

    with open("file.txt", "r") as file:
        line_count = sum(1 for line in file)
    print(line_count)

    Find the Longest Line in a File:

    with open("file.txt", "r") as file:
        longest_line = max(file, key=len)
    print(longest_line)

    🔥 Challenge: Find the longest word in a large file efficiently.
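
    One memory-friendly sketch: stream the file word by word with a generator instead of reading it all at once ("file.txt" is just a placeholder name):

    def words_in(file):
        for line in file:            # file objects are iterated lazily, line by line
            yield from line.split()

    with open("file.txt", "r") as file:
        longest_word = max(words_in(file), key=len, default="")
    print(longest_word)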

    8. Enumerate (enumerate())

    💡 Use Case: Tracking index positions while iterating.

    names = ["Alice", "Bob", "Charlie"]
    for index, name in enumerate(names, start=1):
        print(index, name)
    

    🔥 Challenge: Find the index of all occurrences of a target value in a list.
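
    A possible solution with enumerate() in a list comprehension (the sample list and target value are invented):

    values = [1, 3, 7, 3, 9, 3]
    target = 3
    indices = [i for i, v in enumerate(values) if v == target]
    print(indices)  # [1, 3, 5]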

    9. Zip (zip())

    💡 Use Case: Merging multiple iterables together.

    names = ["Alice", "Bob"]
    ages = [25, 30]
    
    for name, age in zip(names, ages):
        print(name, age)  # Alice 25, Bob 30
    

    🔥 Challenge: Transpose a 2D matrix using zip().
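
    One way to do it: unpack the rows into zip(), which regroups the elements column by column (the matrix below is a small made-up example):

    matrix = [[1, 2, 3], [4, 5, 6]]
    transposed = [list(column) for column in zip(*matrix)]
    print(transposed)  # [[1, 4], [2, 5], [3, 6]]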

    10. Map (map())

    💡 Use Case: Applying a function to every element of an iterable.

    nums = [1, 2, 3]
    squared = map(lambda x: x ** 2, nums)
    print(list(squared))  # [1, 4, 9]
    

    🔥 Challenge: Convert a list of temperatures from Celsius to Fahrenheit using map().
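
    A minimal sketch (the temperature values are arbitrary):

    celsius = [0, 25, 100]
    fahrenheit = list(map(lambda c: c * 9 / 5 + 32, celsius))
    print(fahrenheit)  # [32.0, 77.0, 212.0]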

    11. Filter (filter())

    💡 Use Case: Selecting elements based on a condition.

    nums = [1, -2, 3, -4]
    positives = filter(lambda x: x > 0, nums)
    print(list(positives))  # [1, 3]
    

    🔥 Challenge: Filter out all words from a list that are shorter than 4 letters.
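
    One possible solution with filter() (the word list is invented for illustration):

    words = ["to", "tree", "a", "python", "sun"]
    long_words = list(filter(lambda w: len(w) >= 4, words))
    print(long_words)  # ['tree', 'python']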

    12. Generators

    Generators are iterables that produce values on-the-fly.

    Use Cases

    • Handling large datasets or infinite sequences.
    • Memory-efficient data processing.
    def my_generator():
        yield 1
        yield 2
        yield 3
    
    for item in my_generator():
        print(item)

    Challenges

    Generate Fibonacci Numbers:

    def fibonacci():
        a, b = 0, 1
        while True:
            yield a
            a, b = b, a + b
    
    fib = fibonacci()
    for _ in range(10):
        print(next(fib))

    Read Large Files in Chunks:

    def read_in_chunks(file, chunk_size=1024):
        while True:
            data = file.read(chunk_size)
            if not data:
                break
            yield data
    
    with open("large_file.txt", "r") as file:
        for chunk in read_in_chunks(file):
            print(chunk)

    13. itertools Module

    The itertools module provides tools for creating and working with iterators.

    Use Cases

    • Combining, filtering, and transforming iterables.
    • Generating permutations, combinations, and infinite sequences.
    import itertools
    
    # Infinite iterator
    counter = itertools.count(start=10, step=-1)
    for _ in range(5):
        print(next(counter))  # 10, 9, 8, 7, 6

    Challenges

    Generate All Permutations of a List:

    import itertools
    
    data = [1, 2, 3]
    permutations = itertools.permutations(data)
    print(list(permutations))

    Group Consecutive Duplicates:

    import itertools
    
    data = [1, 1, 2, 3, 3, 3]
    grouped = [list(group) for key, group in itertools.groupby(data)]
    print(grouped)  # [[1, 1], [2], [3, 3, 3]]

    🔹 Final Challenge Problems 🚀

    Here are challenging problems for mastering Python iterables:

    1. Implement a sliding window sum using iterables (see the sketch after this list).
    2. Write a generator that returns Fibonacci numbers infinitely.
    3. Use zip_longest() to merge two lists of different lengths.
    4. Group words by their first letter using dictionaries.
    5. Sort a list of tuples based on the second value dynamically.
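
    As a starting point for challenge 1, here is a possible sliding-window-sum sketch built on collections.deque (the window size and input values are arbitrary):

    from collections import deque

    def sliding_window_sums(iterable, size):
        window = deque(maxlen=size)      # keeps only the last `size` values
        for value in iterable:
            window.append(value)
            if len(window) == size:
                yield sum(window)

    print(list(sliding_window_sums([1, 2, 3, 4, 5], 3)))  # [6, 9, 12]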

    🔹 Summary: Best Iterable for Each Task

    Task                                Best Iterable
    Store & modify data                 list
    Immutable ordered data              tuple
    Unique values & set operations      set
    Fast lookups & key-value storage    dict
    Generating numeric sequences        range
    Processing large files              file
    Iterating with index                enumerate
    Merging multiple lists              zip()
    Applying a function to elements     map()
    Filtering elements                  filter()

  • What is a Dictionary in Python?

    First of all, a dictionary is not sequential like a list. It is a mutable collection of key:value pairs: keys are always unique, while values may repeat. You use the key to access the corresponding value. Where a list index is always a number, a dictionary key can be a different data type, such as a string, integer, float, or even a tuple, but never a list (a list is mutable, so it cannot be used as a key). Since Python 3.7, dictionaries preserve insertion order, but items are still looked up by key rather than by position.

    • 1. Python dictionaries are collections of key-value pairs; two dictionaries with the same pairs compare equal regardless of insertion order.
    • 2. They are mutable, meaning you can add, remove, and modify elements after creation.
    • 3. Dictionary keys must be unique and immutable (e.g., strings, numbers, tuples).
    • 4. Values can be of any data type and do not need to be unique.

    The contents of a dict can be written as a series of key:value pairs within braces { }, e.g.

    d = {key1: value1, key2: value2, ...}  # avoid naming a real variable "dict", since that shadows the built-in type

    The “empty dict” is just an empty pair of curly braces {}.

    {'a': 'abc', 'k': 'xyz'} == {'k': 'xyz', 'a': 'abc'}  # Point 1: equality ignores insertion order
    # Output: True

    # Create a dictionary (Point 2: dictionaries are mutable - you can add, remove, and modify elements)
    person = {"name": "John", "age": 30, "city": "New York"}
    
    # Print the original dictionary
    print("Original Dictionary:")
    print(person)
    
    # Add a new element
    person["country"] = "USA"
    
    # Print the updated dictionary
    print("\nUpdated Dictionary after adding a new element:")
    print(person)
    
    # Modify an existing element
    person["age"] = 31
    
    # Print the updated dictionary
    print("\nUpdated Dictionary after modifying an existing element:")
    print(person)
    
    # Remove an element
    del person["city"]
    
    # Print the updated dictionary
    print("\nUpdated Dictionary after removing an element:")
    print(person)
    
    Output:
    Original Dictionary:
    {'name': 'John', 'age': 30, 'city': 'New York'}
    
    Updated Dictionary after adding a new element:
    {'name': 'John', 'age': 30, 'city': 'New York', 'country': 'USA'}
    
    Updated Dictionary after modifying an existing element:
    {'name': 'John', 'age': 31, 'city': 'New York', 'country': 'USA'}
    
    Updated Dictionary after removing an element:
    {'name': 'John', 'age': 31, 'country': 'USA'}
    #Point 3.Dictionary keys must be unique and immutable (e.g., strings, numbers, tuples).
    
    #Point 4.while values can be of any data type.
    person = {
        "name": "John",
        "age": 30,
        "city": "New York"
    }
    numbers = {
        1: "one",
        2: "two",
        3: "three"
    }
    coordinates = {
        (1, 2): "point1",
        (3, 4): "point2",
        (5, 6): "point3"
    }
    mixed_keys = {
        "name": "John",
        1: "one",
        (2, 3): "point1"
    }
    
    # Note: dictionary keys can be of different data types, but they must still be unique and immutable.



  • Python is a popular high-level, interpreted programming language known for its readability and ease of use. Python was created by Guido van Rossum and was first released in February 1991. The name Python was inspired by Monty Python's Flying Circus: the circus delivered powerful acts with simplicity, which is also a key feature of Python.

    Python is known for its simplicity, so a newcomer can start using it and become productive in very little time. Here are some of its main features and highlights:

    Simple and Readable Syntax:

    Python’s code is known for being clear and concise, resembling natural language. This makes it easier to learn and understand, even for beginners with no prior programming experience. For some, it may feel like reading instructions written in plain English.

    Interpreted and Interactive:

    Python code is executed line by line by the Python interpreter, allowing for quick development and testing through interactive shells. Python is an interpreted language: unlike compiled languages such as C or C++, where source code is translated into machine code before execution, Python source code is translated into intermediate bytecode instructions by the interpreter, and that bytecode is then executed by the Python virtual machine (PVM). This makes Python code flexible and portable, since it can run on any platform with a compatible interpreter without recompilation.

    Python also provides an interactive mode, often referred to as the Python shell or REPL (Read-Eval-Print Loop). In this mode, you enter commands one at a time and the interpreter executes them immediately, displaying the results. This allows rapid prototyping, experimentation, and testing of code snippets without writing a complete script or program, and it is particularly useful for learning Python and debugging code.

    High-level Language:

    Python abstracts many complex programming tasks, allowing developers to focus on solving problems rather than dealing with low-level details.

    High-level languages are characterized by their abstraction from the details of the computer’s hardware. They are designed to be easy for humans to read and write. Here are several reasons why Python is considered a high-level language:

    1. Abstraction from Hardware

    Python abstracts away most of the complex details of the computer’s hardware, allowing you to focus on solving problems and writing algorithms rather than managing memory and processor instructions.

    # Simple example of Python code
    print("Hello, World!")

    2. Easy to Read and Write

    Python’s syntax is designed to be readable and straightforward. It uses indentation to define blocks of code, which makes it visually clear and consistent.

    def greet(name):
        print(f"Hello, {name}!")

    greet("Alice")

    3. Rich Standard Library

    Python comes with a comprehensive standard library that provides modules and functions for various tasks, from file handling to web development, without needing to write everything from scratch.

    import os

    # List files in the current directory
    files = os.listdir(".")
    print(files)

    4. Dynamic Typing

    In Python, you do not need to declare the type of a variable. The type is inferred at runtime, which simplifies the coding process.

    x = 42        # Integer
    x = "Hello"   # String (the same variable now holds a different type)

    5. Built-in High-Level Data Structures

    Python includes high-level data structures like lists, dictionaries, and sets, which make it easy to store and manipulate collections of data.

    # List
    fruits = ["apple", "banana", "cherry"]
    print(fruits)

    # Dictionary
    person = {"name": "Alice", "age": 30}
    print(person)

    6. Automatic Memory Management

    Python handles memory management automatically using garbage collection, which means you do not need to manually allocate and deallocate memory.

    # Creating objects and Python handles memory management
    class Person:
        def __init__(self, name):
            self.name = name

    p = Person("Alice")

    7. Cross-Platform

    Python is a cross-platform language, meaning that you can run Python code on different operating systems, such as Windows, macOS, and Linux, with little or no modification.

    8. Extensive Ecosystem

    Python has a vast ecosystem of third-party libraries and frameworks that extend its capabilities. Whether you are working in web development, data science, machine learning, or automation, there’s likely a library that can help you.

    # Example of using a third-party library
    import requests

    response = requests.get("https://api.github.com")
    print(response.json())

    Dynamic Typing:

    Python uses dynamic typing, meaning you don’t need to declare variable types explicitly. Variables can hold values of any type, and their type can change dynamically during execution.

    Python is a dynamically typed language, which means that you don’t need to declare the type of a variable when you create one. The type is inferred at runtime based on the value assigned to the variable. This allows for more flexibility, but also requires careful handling to avoid type-related errors.

    Here are some examples to illustrate dynamic typing in Python:

    Example 1: Basic Variable Assignment

    # Assign an integer value
    x = 10
    print(x) # Output: 10
    print(type(x)) # Output: <class 'int'>

    # Reassign a string value
    x = "Hello, World!"
    print(x) # Output: Hello, World!
    print(type(x)) # Output: <class 'str'>

    In this example, the variable x is first assigned an integer value and then reassigned a string value. The type of x changes dynamically based on the value it holds.

    Example 2: Function with Dynamic Types

    def add(a, b):
        return a + b

    # Use with integers
    result = add(5, 3)
    print(result) # Output: 8
    print(type(result)) # Output: <class 'int'>

    # Use with strings
    result = add("Hello, ", "World!")
    print(result) # Output: Hello, World!
    print(type(result)) # Output: <class 'str'>

    The add function works with both integers and strings, showcasing Python’s dynamic typing. The function does not specify the types of its arguments, allowing it to operate on different types of inputs.

    Example 3: List with Mixed Types

    # Create a list with mixed types
    my_list = [1, "two", 3.0, [4, 5]]

    for item in my_list:
        print(f"Value: {item}, Type: {type(item)}")

    # Output:
    # Value: 1, Type: <class 'int'>
    # Value: two, Type: <class 'str'>
    # Value: 3.0, Type: <class 'float'>
    # Value: [4, 5], Type: <class 'list'>

    In this example, my_list contains elements of different types. Python allows this because it dynamically handles the types of elements within the list.

    Example 4: Type Checking and Type Casting

    x = 5
    print(type(x)) # Output: <class 'int'>

    # Convert integer to string
    x = str(x)
    print(type(x)) # Output: <class 'str'>
    print(x) # Output: 5

    # Convert string to float
    x = float(x)
    print(type(x)) # Output: <class 'float'>
    print(x) # Output: 5.0

    This example demonstrates type casting, where the type of variable x is changed explicitly using type conversion functions like str() and float().

    Example 5: Dynamic Typing in Function Arguments

    def process(data):
        if isinstance(data, int):
            return data * 2
        elif isinstance(data, str):
            return data.upper()
        else:
            return "Unsupported type"

    # Process different types of data
    print(process(10)) # Output: 20
    print(process("hello")) # Output: HELLO
    print(process(3.14)) # Output: Unsupported type

    The process function behaves differently based on the type of its argument, demonstrating how dynamic typing allows for flexible function definitions.

    Automatic Memory Management:

    Python uses garbage collection to automatically handle memory allocation and deallocation, relieving developers from managing memory manually.

    Extensive Standard Library:

    Python comes with a rich set of modules and libraries for tasks such as file I/O, networking, mathematics, and more, making it suitable for a wide range of applications.

    Cross-platform:

    Python code can run on various operating systems, including Windows, macOS, and Linux, with minimal or no modifications.

    Object-Oriented:

    Python supports object-oriented programming (OOP) paradigms, allowing developers to create reusable and modular code through classes and objects.

    Functional Programming Constructs:

    Python supports functional programming concepts like lambda functions, map, reduce, and filter, enabling developers to write clean and concise code.
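
    A small illustrative sketch of these constructs; note that reduce lives in functools in Python 3, and the numbers are arbitrary:

    from functools import reduce

    nums = [1, 2, 3, 4]
    total = reduce(lambda acc, n: acc + n, nums)        # 10
    evens = list(filter(lambda n: n % 2 == 0, nums))    # [2, 4]
    doubled = list(map(lambda n: n * 2, nums))          # [2, 4, 6, 8]
    print(total, evens, doubled)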

    Community and Ecosystem:

    Python has a large and active community of developers who contribute libraries, frameworks, and tools, fostering innovation and providing solutions for various domains.

    Readability and Maintainability:

    Python’s syntax emphasizes code readability, with clear and expressive code structures, making it easier to write and maintain large-scale projects.

    Versatility:

    Python is versatile and can be used for various types of programming tasks, including web development, data analysis, artificial intelligence, scientific computing, automation, and more.

    Integration Capabilities:

    Python easily integrates with other languages and platforms, allowing developers to leverage existing codebases and infrastructure.

    Free and Open-Source:

    Using Python is completely free, and its open-source nature allows anyone to contribute to its development and libraries.

    Some Not-So-Good Points of Python:

    • Speed: Python is often slower than compiled languages like C++ or Java. This is because Python code is interpreted line by line at runtime, whereas compiled languages are translated into machine code beforehand. If speed is critical for your application, Python might not be the best fit.
    • Memory Usage: Python can be less memory-efficient compared to some other languages. This is due to its dynamic typing system and garbage collection mechanism. If you’re working with large datasets or memory-constrained environments, this could be a concern.
    • Mobile Development: While there are frameworks for mobile development with Python, it’s generally not the preferred language. Native languages or frameworks like Kotlin for Android or Swift for iOS tend to be more optimized for mobile app performance.
    • Strict Indentation: Python relies on indentation to define code blocks, unlike languages using curly braces. While this promotes readability, it can also be a source of errors if not careful with whitespace.
    • Global Interpreter Lock (GIL): The GIL is a mechanism in Python that prevents multiple threads from executing Python bytecode at the same time. This can limit performance in multi-core or multi-processor environments where true parallel processing might be beneficial.

    Examples:

    1. Dynamic Typing

    # firehim() reads the global variable x, so its result depends on the current value of x
    def firehim():
        if x > 5:
            return 34
        else:
            return "war"

    x = 2
    print(firehim())  # Result in this case: war
    x = 6
    print(firehim())  # Result in this case: 34
