Here’s a comprehensive PySpark Data Type Casting Cheat Sheet for DataFrame columns using .cast() – includes syntax, examples, and supported types.


🔄 CASTING CHEAT SHEET — PySpark withColumn(...cast(...))

✅ Syntax Options:

from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

# Method 1: Using type object
df.withColumn("new_col", col("old_col").cast(IntegerType()))

# Method 2: Using string alias
df.withColumn("new_col", col("old_col").cast("int"))

Both are valid. Method 1 is stricter and less typo-prone, since a misspelled type object is caught by Python itself, while a string alias is only checked when Spark parses it.


🧪 Common Casts with Examples

| Data Type | String Alias | Type Object | Example Usage |
|-----------|--------------|-------------|---------------|
| Integer | "int" | IntegerType() | col("age").cast("int") |
| Float | "float" | FloatType() | col("salary").cast(FloatType()) |
| Double | "double" | DoubleType() | col("score").cast("double") |
| Long | "bigint" | LongType() | col("id").cast(LongType()) |
| Short | "smallint" | ShortType() | col("val").cast("smallint") |
| Byte | "tinyint" | ByteType() | col("val").cast("tinyint") |
| String | "string" | StringType() | col("number").cast("string") |
| Boolean | "boolean" | BooleanType() | col("flag").cast(BooleanType()) |
| Date | "date" | DateType() | col("dob_str").cast("date") |
| Timestamp | "timestamp" | TimestampType() | col("dt").cast(TimestampType()) |
| Decimal | "decimal(p,s)" | DecimalType(p, s) | col("val").cast(DecimalType(10, 2)) |
| Binary | "binary" | BinaryType() | col("image_data").cast("binary") |

🚧 Notes and Warnings

  • ✅ cast() only works on compatible data. Casting "abc" to int returns null; it does not raise an error.
  • ⛔ DecimalType must be given both precision and scale, e.g. DecimalType(10, 2).
  • ✅ When casting strings to date/timestamp, the format matters: a plain cast expects the defaults
    yyyy-MM-dd and yyyy-MM-dd HH:mm:ss.

💡 Practical Examples

from pyspark.sql.functions import col
from pyspark.sql.types import *

df = df.withColumn("age_int", col("age").cast(IntegerType()))
df = df.withColumn("price_decimal", col("price").cast(DecimalType(8, 2)))
df = df.withColumn("dob_date", col("dob").cast("date"))
df = df.withColumn("is_valid", col("flag_str").cast("boolean"))

📋 Bonus: Create Schema with Explicit Types

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

📘 Tip: Check Schema and Casting Results

df.printSchema()
df.show()

# PySpark Notebook: Casting Examples + Error Cases

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import *

# ✅ Step 1: Initialize Spark
spark = SparkSession.builder \
    .appName("PySpark Casting Examples") \
    .getOrCreate()

# ✅ Step 2: Create Sample Data
data = [
    (1, "25", "100.75", "true", "2022-05-01", "not_a_number"),
    (2, "30", "200.10", "false", "2022-12-15", "123"),
    (3, "abc", "NaN", "yes", "not_a_date", "456.78"),
]

columns = ["id", "age_str", "salary_str", "flag_str", "date_str", "invalid_num"]

# ✅ Step 3: Create DataFrame
df = spark.createDataFrame(data, columns)
df.show(truncate=False)
df.printSchema()

# ✅ Step 4: Apply Various Casts
casted_df = df \
    .withColumn("age_int", col("age_str").cast(IntegerType())) \
    .withColumn("salary_double", col("salary_str").cast("double")) \
    .withColumn("flag_bool", col("flag_str").cast(BooleanType())) \
    .withColumn("date_col", col("date_str").cast(DateType())) \
    .withColumn("decimal_val", col("invalid_num").cast(DecimalType(10, 2)))

# ✅ Step 5: Show Results
casted_df.show(truncate=False)
casted_df.printSchema()

# ✅ Step 6: Highlight Nulls from Bad Casts
from pyspark.sql.functions import isnull

casted_df.select(
    "id", "age_str", "age_int",
    (isnull("age_int").cast("int").alias("age_cast_failed")),
    "date_str", "date_col",
    (isnull("date_col").cast("int").alias("date_cast_failed"))
).show()

# ✅ Step 7: Stop Spark
spark.stop()
