Here’s a comprehensive PySpark data type casting cheat sheet for DataFrame columns using .cast(), including syntax, examples, and supported types.
CASTING CHEAT SHEET: PySpark withColumn(..., col(...).cast(...))
Syntax Options:
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType
# Method 1: Using type object
df.withColumn("new_col", col("old_col").cast(IntegerType()))
# Method 2: Using string alias
df.withColumn("new_col", col("old_col").cast("int"))
Both are valid; Method 1 is less typo-prone because type objects are checked by Python (and your editor), whereas a string alias is only validated when Spark parses it.
Common Casts with Examples
| Data Type | String Alias | Type Object | Example Usage |
|---|---|---|---|
| Integer | `"int"` | `IntegerType()` | `col("age").cast("int")` |
| Float | `"float"` | `FloatType()` | `col("salary").cast(FloatType())` |
| Double | `"double"` | `DoubleType()` | `col("score").cast("double")` |
| Long | `"bigint"` | `LongType()` | `col("id").cast(LongType())` |
| Short | `"smallint"` | `ShortType()` | `col("val").cast("smallint")` |
| Byte | `"tinyint"` | `ByteType()` | `col("val").cast("tinyint")` |
| String | `"string"` | `StringType()` | `col("number").cast("string")` |
| Boolean | `"boolean"` | `BooleanType()` | `col("flag").cast(BooleanType())` |
| Date | `"date"` | `DateType()` | `col("dob_str").cast("date")` |
| Timestamp | `"timestamp"` | `TimestampType()` | `col("dt").cast(TimestampType())` |
| Decimal | `"decimal(p,s)"` | `DecimalType(p, s)` | `col("val").cast(DecimalType(10, 2))` |
| Binary | `"binary"` | `BinaryType()` | `col("image_data").cast("binary")` |
Notes and Warnings
- `cast()` only works on compatible data. Casting "abc" to int returns `null` rather than raising an error (with the default, non-ANSI settings).
- `DecimalType` must be given both precision and scale: `DecimalType(10, 2)`.
- When casting dates/timestamps from strings, the format matters: `yyyy-MM-dd` for dates, `yyyy-MM-dd HH:mm:ss` for timestamps.
Practical Examples
from pyspark.sql.functions import col
from pyspark.sql.types import *
df = df.withColumn("age_int", col("age").cast(IntegerType()))
df = df.withColumn("price_decimal", col("price").cast(DecimalType(8, 2)))
df = df.withColumn("dob_date", col("dob").cast("date"))
df = df.withColumn("is_valid", col("flag_str").cast("boolean"))
Bonus: Create a Schema with Explicit Types
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])
Tip: Check Schema and Casting Results
df.printSchema()
df.show()
# PySpark Notebook: Casting Examples + Error Cases
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import *
# Step 1: Initialize Spark
spark = SparkSession.builder \
    .appName("PySpark Casting Examples") \
    .getOrCreate()
# Step 2: Create Sample Data
data = [
    (1, "25", "100.75", "true", "2022-05-01", "not_a_number"),
    (2, "30", "200.10", "false", "2022-12-15", "123"),
    (3, "abc", "NaN", "yes", "not_a_date", "456.78"),
]
columns = ["id", "age_str", "salary_str", "flag_str", "date_str", "invalid_num"]
# Step 3: Create DataFrame
df = spark.createDataFrame(data, columns)
df.show(truncate=False)
df.printSchema()
# Step 4: Apply Various Casts
casted_df = df \
    .withColumn("age_int", col("age_str").cast(IntegerType())) \
    .withColumn("salary_double", col("salary_str").cast("double")) \
    .withColumn("flag_bool", col("flag_str").cast(BooleanType())) \
    .withColumn("date_col", col("date_str").cast(DateType())) \
    .withColumn("decimal_val", col("invalid_num").cast(DecimalType(10, 2)))
# Step 5: Show Results
casted_df.show(truncate=False)
casted_df.printSchema()
# Step 6: Highlight Nulls from Bad Casts
from pyspark.sql.functions import isnull
casted_df.select(
    "id", "age_str", "age_int",
    isnull("age_int").cast("int").alias("age_cast_failed"),
    "date_str", "date_col",
    isnull("date_col").cast("int").alias("date_cast_failed"),
).show()
# Step 7: Stop Spark
spark.stop()