50+ Realistic Data Engineering Coding-Based Interview Questions (PySpark Focused)
DATAFRAME TRANSFORMATION & QUERYING
- Read a JSON file with nested schema. Flatten all levels into separate columns.
- Load a CSV file and remove duplicate rows based on specific columns.
- Extract domain names from an email column.
- Filter records where a column contains any of a list of values.
- Cast a column from string to timestamp with error handling.
- Fill missing values differently for different columns.
- Count number of nulls per column in a DataFrame.
- Replace all occurrences of “N/A”, “null”, “missing” with None.
- Identify and drop columns that are completely null.
- Select top 3 highest values per group (e.g., top 3 sales per region).
- Compare two DataFrames for schema and row-level differences.
- Pivot a DataFrame and fill missing values with zero.
- Unpivot a wide DataFrame to long format.
- Write PySpark code to extract year and month from a date column.
- Rename all columns to lowercase.
JOINS & ADVANCED WINDOWING
- Perform an inner join and remove duplicate columns post-join.
- Handle skewed joins using salting or broadcast joins.
- Rank users within a group by revenue using Window functions.
- Compute moving average over past 3 rows per group.
- Create a lagged column and use it to detect trend direction.
- Calculate cumulative sum and reset at partition boundaries.
- Fill NULLs in a column with the last non-null value using windowing.
- Perform join between two DataFrames with different column names.
- Implement full outer join and tag source of each row (left, right, both).
COMPLEX LOGIC & OPTIMIZATION
- Read multiple files from a folder and apply schema inference only once.
- Apply multiple filters dynamically using a list of conditions.
- Create a DataFrame from multiple heterogeneous JSON structures.
- Perform union on two DataFrames with differing schemas.
- Detect and log rows violating specific data quality rules.
- Handle column name collisions in chained joins.
- Debug why a join causes data duplication.
- Optimize a transformation-heavy job using cache/persist correctly.
- Show impact of wide vs narrow transformations on DAG.
- Detect skewed data distributions by key.
- Optimize slow Parquet read using predicate pushdown.
DELTA LAKE + FILE I/O
- Implement SCD Type 1 and Type 2 with Delta Lake.
- Use Delta time travel to read older version of a table.
- Merge streaming and batch data into the same Delta table.
- Write a partitioned Parquet file with compression.
- Handle schema evolution when writing to a Delta table.
- Explain and mitigate the small-file problem in Delta Lake.
- Use Delta’s OPTIMIZE and VACUUM correctly in a pipeline.
PIPELINE / REAL-WORLD FLOWS
- Design an ingestion pipeline from S3 to Hive using Spark.
- Apply business rules from metadata/config JSON to transform data.
- Parameterize file paths and job configuration dynamically.
- Write log entries at each stage of the Spark job.
- Implement retry and failure logic for tasks that intermittently fail.
- Validate data schema against an expected contract.
- Convert raw CSV data to bronze/silver/gold layers.
- Cleanse and join real-world IoT datasets from multiple devices.
DEBUGGING, TESTING, INTERVIEW TRICKS
- Identify bottlenecks using Spark UI.
- Trace shuffle operations from logical plan.
- Write a PyTest test case for Spark DataFrame comparison.
- Track lineage of a column through multiple transformations.
- Handle corrupt record detection and quarantine.
50+ Realistic Data Engineering Coding-Based Interview Questions (PySpark Focused) – With Solutions
DATAFRAME TRANSFORMATION & QUERYING
- Flatten JSON Schema
Q: Read a nested JSON file and flatten it.
from pyspark.sql.functions import col
from pyspark.sql.types import StructType
def flatten_df(df):
    # Collect all struct-typed columns that still need flattening
    complex_fields = {field.name: field.dataType
                      for field in df.schema.fields
                      if isinstance(field.dataType, StructType)}
    while complex_fields:
        col_name = list(complex_fields.keys())[0]
        # Promote each nested field to a top-level column named parent_child
        expanded = [col(col_name + '.' + k).alias(col_name + '_' + k)
                    for k in [n.name for n in complex_fields[col_name]]]
        df = df.select("*", *expanded).drop(col_name)
        # Re-scan: newly expanded columns may themselves be structs
        complex_fields = {field.name: field.dataType
                          for field in df.schema.fields
                          if isinstance(field.dataType, StructType)}
    return df
nested_df = spark.read.json("nested.json")
flat_df = flatten_df(nested_df)
flat_df.show()
- Drop Duplicates on Specific Columns
df.dropDuplicates(["id", "name"]).show()
- Extract Domain from Email
from pyspark.sql.functions import split, col
df.withColumn("domain", split(col("email"), "@")[1]).show()
- Filter Where Column in List
values = ["A", "B", "C"]
df.filter(col("category").isin(values)).show()
- Cast Column to Timestamp With Fallback
from pyspark.sql.functions import to_timestamp, when, col
df = df.withColumn("ts", to_timestamp("ts_str", "yyyy-MM-dd HH:mm:ss"))
# Fall back to a second format for rows the first pattern could not parse
df = df.withColumn("ts",
                   when(col("ts").isNull(), to_timestamp("ts_str", "MM/dd/yyyy"))
                   .otherwise(col("ts")))
- Fill Missing Values Per Column
fill_map = {"name": "Unknown", "age": 0}
df.fillna(fill_map).show()
- Count Nulls Per Column
from pyspark.sql.functions import sum, col
null_counts = df.select([sum(col(c).isNull().cast("int")).alias(c) for c in df.columns])
null_counts.show()
- Replace “N/A”, “null”, etc. with None
replace_vals = ["N/A", "null", "missing"]
df = df.replace(replace_vals, None)
- Drop Fully Null Columns
# Count non-nulls for every column in one pass instead of one count() job per column
from pyspark.sql.functions import count, col
counts = df.select([count(col(c)).alias(c) for c in df.columns]).first()
non_null_df = df.select([c for c in df.columns if counts[c] > 0])
- Top-N by Group
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
w = Window.partitionBy("region").orderBy(col("sales").desc())
df.withColumn("rnk", row_number().over(w)).filter("rnk <= 3").show()
- Compare Two DataFrames Row-by-Row
df1.exceptAll(df2).show()
df2.exceptAll(df1).show()
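exceptAll only surfaces row-level differences; the schemas can be diffed field-by-field as well. A minimal sketch using plain (name, type) tuples, the shape produced by `[(f.name, f.dataType.simpleString()) for f in df.schema.fields]` (`schema_diff` is an illustrative helper, not a Spark API):

```python
# Compare two schemas field-by-field: columns missing on either side,
# plus columns present in both but with different types.
def schema_diff(left, right):
    left_map, right_map = dict(left), dict(right)
    return {
        "only_left": sorted(set(left_map) - set(right_map)),
        "only_right": sorted(set(right_map) - set(left_map)),
        "type_mismatch": sorted(
            name for name in set(left_map) & set(right_map)
            if left_map[name] != right_map[name]
        ),
    }

# With Spark DataFrames you would feed it:
# schema_diff([(f.name, f.dataType.simpleString()) for f in df1.schema.fields],
#             [(f.name, f.dataType.simpleString()) for f in df2.schema.fields])
diff = schema_diff([("id", "int"), ("name", "string")],
                   [("id", "bigint"), ("name", "string"), ("age", "int")])
# diff == {"only_left": [], "only_right": ["age"], "type_mismatch": ["id"]}
```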
- Pivot with Fill
df.groupBy("id").pivot("month").agg(sum("sales")).fillna(0).show()
- Unpivot (Melt)
melt_cols = ["math", "physics", "chem"]
df_melted = df.selectExpr(
    "id", "name",
    f"stack({len(melt_cols)}, "
    + ", ".join([f"'{c}', {c}" for c in melt_cols])
    + ") as (subject, score)")
df_melted.show()
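The generated stack() expression is easy to get wrong, so it helps to build the string on its own and inspect it before handing it to selectExpr; for the three columns above it expands as shown in the comment:

```python
# Build the stack() expression used by selectExpr and inspect the result.
melt_cols = ["math", "physics", "chem"]
stack_expr = (f"stack({len(melt_cols)}, "
              + ", ".join(f"'{c}', {c}" for c in melt_cols)
              + ") as (subject, score)")
# stack_expr == "stack(3, 'math', math, 'physics', physics, 'chem', chem) as (subject, score)"
```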
- Extract Year & Month
from pyspark.sql.functions import year, month
df.withColumn("year", year("order_date")).withColumn("month", month("order_date")).show()
- Rename Columns to Lowercase
df = df.toDF(*[c.lower() for c in df.columns])
ADVANCED TRANSFORMATIONS & JOINS (Q16–Q30)
- Join with Null-safe Equality
df1.join(df2, df1["id"].eqNullSafe(df2["id"])).show()  # "<=>" is SQL-only syntax; the Python API uses eqNullSafe
- Anti Join (records in left not in right)
df1.join(df2, on="id", how="left_anti").show()
- Semi Join (records in left with matches in right)
df1.join(df2, on="id", how="left_semi").show()
- Join with Broadcast Hint
from pyspark.sql.functions import broadcast
df1.join(broadcast(df2), on="id").show()
- Self Join with Alias
left = df.alias("left")
right = df.alias("right")
left.join(right, left["manager_id"] == right["id"]).select("left.name", "right.name").show()
- Join with Expression Keys
from pyspark.sql.functions import expr
# expr() can only resolve columns through registered aliases, not variable names
df1.alias("a").join(df2.alias("b"),
                    expr("a.id = b.parent_id AND a.status = 'active'"), "inner")
- Detect Join Skew
df.groupBy("join_key").count().orderBy("count", ascending=False).show()
- Skew Join Handling with Salting
from pyspark.sql.functions import rand, expr
# Spread the hot keys of the large side across 10 random salt buckets
salted = df1.withColumn("salt", (rand() * 10).cast("int"))
# Replicate every row of the small side once per salt value
lookup = df2.withColumn("salt", expr("explode(array(0,1,2,3,4,5,6,7,8,9))"))
joined = salted.join(lookup, ["key", "salt"])
- Join and Aggregate on Same Column
from pyspark.sql.functions import sum
joined = df1.join(df2, "id").groupBy("category").agg(sum("amount"))
- Detect Duplicate Records Across DFs
duplicates = df1.intersect(df2)
duplicates.show()
- Chained Transformations with Aliases
# Note: .alias() inside withColumn is a no-op; the column name comes from withColumn's first argument
df.withColumn("net", col("price") - col("discount")).filter("net > 1000").show()
- Z-Score Normalization
from pyspark.sql.functions import mean, stddev
mean_val = df.select(mean("score")).collect()[0][0]
std_val = df.select(stddev("score")).collect()[0][0]
df.withColumn("zscore", (col("score") - mean_val) / std_val).show()
- Explode Array Column
from pyspark.sql.functions import explode
df.withColumn("skill", explode("skills")).show()
- Create Lag-Based Flag
from pyspark.sql.window import Window
from pyspark.sql.functions import lag
w = Window.partitionBy("user").orderBy("timestamp")
df = df.withColumn("prev_action", lag("action").over(w))
- Filter Based on Aggregated Condition
from pyspark.sql.functions import avg
agg_df = df.groupBy("region").agg(avg("sales").alias("avg_sales"))
df.join(agg_df, "region").filter("sales > avg_sales").show()
DELTA LAKE OPERATIONS (Q31–Q35)
- Create a Delta Table
df.write.format("delta").mode("overwrite").save("/mnt/delta/bronze")
- Time Travel to Specific Version
spark.read.format("delta").option("versionAsOf", 3).load("/mnt/delta/bronze").show()
- Upsert with MERGE INTO (SCD1)
spark.sql("""
MERGE INTO target USING source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
- SCD Type 2 Logic in Delta
spark.sql("""
MERGE INTO scd_table AS tgt
USING updates AS src
ON tgt.id = src.id AND tgt.current = true
WHEN MATCHED AND tgt.data <> src.data THEN
UPDATE SET current = false, end_date = current_date()
WHEN NOT MATCHED THEN
INSERT (id, data, current, start_date, end_date)
VALUES (src.id, src.data, true, current_date(), null)
""")
# Note: this MERGE only closes out changed rows; the new version of a matched,
# changed key still has to be inserted in a second step (a common pattern is to
# union those "new version" rows into the source before merging).
- OPTIMIZE + VACUUM
spark.sql("OPTIMIZE delta.`/mnt/delta/bronze`")
spark.sql("VACUUM delta.`/mnt/delta/bronze` RETAIN 168 HOURS")
PERFORMANCE + DEBUGGING + OPTIMIZATION (Q36–Q50)
- Cache & Unpersist
df.cache()
df.count()
df.unpersist()
- Persist to DISK_ONLY
from pyspark.storagelevel import StorageLevel
df.persist(StorageLevel.DISK_ONLY)
- Repartition Before Wide Join
df1 = df1.repartition(100, "join_key")
df2 = df2.repartition(100, "join_key")
- Explain Plan Debugging
df.explain(True)
- Trigger Shuffle Exception (for learning)
# Force skewed data and observe DAG
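Skew can also be reasoned about off-cluster. A minimal sketch: simulate the output of `df.groupBy("join_key").count()` with plain Python counts and quantify how dominant the heaviest key is (`skew_ratio` is an illustrative helper, not a Spark API):

```python
# The same check you would run on df.groupBy("join_key").count() output,
# simulated with plain counts so it runs without a cluster.
def skew_ratio(counts):
    """Ratio of the heaviest key's row count to the mean count; >> 1 means skew."""
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean

# 90% of all rows land on one hot key -- the classic skewed-join shape.
counts = {"hot": 9000, **{f"k{i}": 1 for i in range(1000)}}
ratio = skew_ratio(counts)  # the hot key's task gets ~900x the average work
```

A key with a ratio far above 1 is the one worth salting or broadcasting around.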
- Log Row Counts Per Stage
print("Row Count:", df.count())
- Checkpointing for Fault Recovery
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
df = df.checkpoint()  # checkpoint() returns the checkpointed DataFrame; assign it
- Coalesce to Reduce Output Files
df.coalesce(1).write.csv("/single_file")
- PartitionBy When Saving Table
df.write.partitionBy("year", "month").parquet("/partitioned")
- Custom Partition Column for Joins
df = df.withColumn("bucket", col("user_id") % 10)
df = df.repartition("bucket")  # assign the result; repartition returns a new DataFrame
- Broadcast Join Threshold Check
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
- Run Code on Specific Executor Cores
# Core placement is not set from PySpark code; configure it at submit time:
# spark-submit --executor-cores 4 --num-executors 10 job.py
- Handling Nulls in Window Rank
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col
df = df.withColumn("rank", rank().over(Window.orderBy(col("value").desc_nulls_last())))
- Retry Logic for Job Step
for i in range(3):
    try:
        df.write.mode("overwrite").save("/output")  # overwrite lets a retry rerun cleanly
        break
    except Exception as e:
        print(f"Retry {i+1}:", e)
else:
    raise RuntimeError("all 3 attempts failed")
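The inline loop above can be wrapped into a reusable helper; a minimal sketch with illustrative names (`with_retries` is not a Spark API):

```python
import time

def with_retries(fn, attempts=3, delay_s=0.0):
    """Call fn(); re-raise the last error once all attempts are exhausted."""
    last_err = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as e:
            last_err = e
            time.sleep(delay_s)  # back off between attempts
    raise last_err

# Simulate a step that fails twice before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

result = with_retries(flaky)  # succeeds on the third attempt
```

In a real job, `fn` would be something like `lambda: df.write.mode("overwrite").save("/output")`.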
- Unit Testing PySpark Jobs (Pytest)
# Use spark.createDataFrame() with sample inputs + assert expected outputs
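One way to make that comparison concrete: collect both DataFrames and compare the rows order-insensitively, since collect() order is not guaranteed. A sketch with an illustrative helper; Row objects compare like tuples, so plain tuples stand in here:

```python
def assert_rows_equal(actual_rows, expected_rows):
    """Order-insensitive row comparison for collected DataFrame output."""
    assert sorted(actual_rows) == sorted(expected_rows), (
        f"mismatch: {sorted(actual_rows)} != {sorted(expected_rows)}")

# In a real PyTest test (spark would come from a session fixture):
# def test_transform(spark):
#     result = transform(spark.createDataFrame([(1, "a")], ["id", "val"]))
#     assert_rows_equal(result.collect(), [(1, "A")])
assert_rows_equal([(2, "b"), (1, "a")], [(1, "a"), (2, "b")])
```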