Here’s a comprehensive PySpark data type casting cheat sheet for DataFrame columns using .cast(), including syntax, examples, and supported types.
CASTING CHEAT SHEET: PySpark withColumn(..., col(...).cast(...))
Syntax Options:
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType
# Method 1: Using type object
df.withColumn("new_col", col("old_col").cast(IntegerType()))
# Method 2: Using string alias
df.withColumn("new_col", col("old_col").cast("int"))
Both are valid; Method 1 is less typo-prone because type objects are checked by Python (and your editor), whereas a string alias is only validated when Spark parses it.
Common Casts with Examples
| Data Type | String Alias | Type Object | Example Usage |
|---|---|---|---|
| Integer | `"int"` | `IntegerType()` | `col("age").cast("int")` |
| Float | `"float"` | `FloatType()` | `col("salary").cast(FloatType())` |
| Double | `"double"` | `DoubleType()` | `col("score").cast("double")` |
| Long | `"bigint"` | `LongType()` | `col("id").cast(LongType())` |
| Short | `"smallint"` | `ShortType()` | `col("val").cast("smallint")` |
| Byte | `"tinyint"` | `ByteType()` | `col("val").cast("tinyint")` |
| String | `"string"` | `StringType()` | `col("number").cast("string")` |
| Boolean | `"boolean"` | `BooleanType()` | `col("flag").cast(BooleanType())` |
| Date | `"date"` | `DateType()` | `col("dob_str").cast("date")` |
| Timestamp | `"timestamp"` | `TimestampType()` | `col("dt").cast(TimestampType())` |
| Decimal | `"decimal(p,s)"` | `DecimalType(p, s)` | `col("val").cast(DecimalType(10, 2))` |
| Binary | `"binary"` | `BinaryType()` | `col("image_data").cast("binary")` |
Notes and Warnings
- `cast()` only works on compatible data. Casting "abc" to int returns `null` rather than raising an error (with the default, non-ANSI settings).
- `DecimalType` must be given both precision and scale: `DecimalType(10, 2)`.
- When casting dates/timestamps from strings, the format matters: `yyyy-MM-dd` for dates, `yyyy-MM-dd HH:mm:ss` for timestamps.
Practical Examples
from pyspark.sql.functions import col
from pyspark.sql.types import *
df = df.withColumn("age_int", col("age").cast(IntegerType()))
df = df.withColumn("price_decimal", col("price").cast(DecimalType(8, 2)))
df = df.withColumn("dob_date", col("dob").cast("date"))
df = df.withColumn("is_valid", col("flag_str").cast("boolean"))
Bonus: Create a Schema with Explicit Types
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])
Tip: Check Schema and Casting Results
df.printSchema()
df.show()
# PySpark Notebook: Casting Examples + Error Cases
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import *
# Step 1: Initialize Spark
spark = SparkSession.builder \
    .appName("PySpark Casting Examples") \
    .getOrCreate()
# Step 2: Create Sample Data
data = [
    (1, "25", "100.75", "true", "2022-05-01", "not_a_number"),
    (2, "30", "200.10", "false", "2022-12-15", "123"),
    (3, "abc", "NaN", "yes", "not_a_date", "456.78"),
]
columns = ["id", "age_str", "salary_str", "flag_str", "date_str", "invalid_num"]
# Step 3: Create DataFrame
df = spark.createDataFrame(data, columns)
df.show(truncate=False)
df.printSchema()
# Step 4: Apply Various Casts
casted_df = df \
    .withColumn("age_int", col("age_str").cast(IntegerType())) \
    .withColumn("salary_double", col("salary_str").cast("double")) \
    .withColumn("flag_bool", col("flag_str").cast(BooleanType())) \
    .withColumn("date_col", col("date_str").cast(DateType())) \
    .withColumn("decimal_val", col("invalid_num").cast(DecimalType(10, 2)))
# Step 5: Show Results
casted_df.show(truncate=False)
casted_df.printSchema()
# Step 6: Highlight Nulls from Bad Casts
from pyspark.sql.functions import isnull
casted_df.select(
    "id", "age_str", "age_int",
    isnull("age_int").cast("int").alias("age_cast_failed"),
    "date_str", "date_col",
    isnull("date_col").cast("int").alias("date_cast_failed"),
).show()
# Step 7: Stop Spark
spark.stop()