HintsToday

Hints and Answers for Everything

String Manipulation on PySpark DataFrames

df = df.withColumn("name_length", length(df.first_name)): how is the length calculated when the value has leading or trailing spaces or contains special characters?

Great question! In PySpark, the length() function counts all characters, including:

✅ Leading/trailing spaces
✅ Special characters (!@#$%^, etc.)
✅ Unicode characters
✅ Numbers

✅ Example: Count Characters with Spaces and Special Characters

from pyspark.sql.functions import col, length

df = spark.createDataFrame([
    (" Alice ",),     # Leading/trailing space
    ("A$ha",),        # Special character
    ("  Bob  ",),     # More spaces
    ("Émilie",),      # Unicode
    ("",),            # Empty string
    (None,)           # NULL value
], ["first_name"])

df_with_len = df.withColumn("name_length", length(col("first_name")))
df_with_len.show(truncate=False)

✅ Output:

+----------+-----------+
|first_name|name_length|
+----------+-----------+
| Alice    |7          |
|A$ha      |4          |
|  Bob     |7          |
|Émilie    |6          |
|          |0          |
|null      |null       |
+----------+-----------+

🧠 Key Notes:

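length() returns the character count for string columns, not the byte count (for binary columns it returns the number of bytes).
It counts every character: whitespace, special characters, and Unicode characters alike.
A NULL input produces a NULL length, not 0.
Use trim() if you want to remove leading and trailing spaces before counting.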
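
To make the character-versus-byte distinction concrete, here is a minimal plain-Python sketch that needs no Spark session. It assumes that length() on a string column counts characters the same way Python's len() does on a str; the helper name char_and_byte_len is hypothetical and exists only for this illustration.

```python
# Plain-Python illustration (not Spark): character count vs. UTF-8 byte count.
# char_and_byte_len is a hypothetical helper written for this sketch.

def char_and_byte_len(value):
    """Return (character count, UTF-8 byte count); None maps to (None, None)."""
    if value is None:
        return (None, None)
    return (len(value), len(value.encode("utf-8")))

names = [
    " Alice ",        # leading/trailing space
    "A$ha",           # special character
    "  Bob  ",        # more spaces
    "\u00c9milie",    # "Émilie" with a precomposed É (one character, two bytes)
    "",               # empty string
    None,             # stands in for a NULL value
]

for name in names:
    chars, nbytes = char_and_byte_len(name)
    print(repr(name), chars, nbytes)
```

Only the accented name differs between the two counts: it has 6 characters but occupies 7 bytes in UTF-8. If a byte count is what you need in Spark, one option is to apply length() to a binary value, for example length(encode(col("first_name"), "UTF-8")).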

⚠️ Compare With Trim:

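from pyspark.sql.functions import col, length, trim

# trim() strips leading and trailing spaces before length() counts what is left
df_trimmed = df.withColumn("trimmed_length", length(trim(col("first_name"))))
df_trimmed.show(truncate=False)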
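
As a quick way to check what the trimmed lengths should be for the sample names, the sketch below mirrors the comparison in plain Python. It assumes trim() removes only space characters from both ends, which is approximated here with str.strip(" "); trimmed_len is a hypothetical helper, not a PySpark function.

```python
# Plain-Python illustration (not Spark): length before and after trimming spaces.
# trimmed_len is a hypothetical helper written for this sketch.

def trimmed_len(value):
    """Length after removing leading/trailing spaces; None stays None."""
    if value is None:
        return None
    return len(value.strip(" "))

names = [" Alice ", "A$ha", "  Bob  ", "\u00c9milie", "", None]

for name in names:
    raw_len = None if name is None else len(name)
    print(repr(name), raw_len, trimmed_len(name))
```

Under that assumption, " Alice " drops from 7 to 5 and "  Bob  " from 7 to 3, while the names without surrounding spaces keep their original length and the NULL stand-in stays None.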
