df = df.withColumn("name_length", length(df.first_name)). How do I calculate the length when the value has leading or trailing spaces, or special characters inserted?
Great question! In PySpark, the length() function counts all characters, including:
- ✅ Leading/trailing spaces
- ✅ Special characters (!@#$%^, etc.)
- ✅ Unicode characters
- ✅ Numbers
✅ Example: Count Characters with Spaces and Specials
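from pyspark.sql import SparkSession
from pyspark.sql.functions import col, length

# Reuse the active session (e.g. in a notebook) or create one.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (" Alice ",),   # One leading and one trailing space
    ("A$ha",),      # Special character
    ("  Bob  ",),   # Two leading and two trailing spaces
    ("Émilie",),    # Unicode
    ("",),          # Empty string
    (None,),        # NULL value
], ["first_name"])

# length() counts every character, including spaces and special characters.
df_with_len = df.withColumn("name_length", length(col("first_name")))
df_with_len.show(truncate=False)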
✅ Output:
+-----------+------------+
|first_name |name_length |
+-----------+------------+
| Alice     |7           |
|A$ha       |4           |
|  Bob      |7           |
|Émilie     |6           |
|           |0           |
|null       |null        |
+-----------+------------+
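🧠 Key Notes:
- On string columns, length() returns the number of characters, not bytes (on binary columns it returns the number of bytes).
- It counts all types of characters (whitespace, special, Unicode).
- Use trim() if you want to remove leading and trailing spaces before counting.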
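To make the character-versus-byte distinction concrete, here is a minimal plain-Python sketch (no Spark session needed) that compares the two counts for the sample names. The helper name `char_and_byte_counts` is hypothetical, and the sketch assumes UTF-8 encoding and that the É in "Émilie" is stored as a single precomposed code point.

```python
# Plain-Python illustration of character count vs. UTF-8 byte count.
# "\u00c9" is the precomposed character É (an assumption about the input data).
samples = [" Alice ", "A$ha", "  Bob  ", "\u00c9milie", ""]


def char_and_byte_counts(value):
    """Return (characters, UTF-8 bytes); None stays None, like a NULL."""
    if value is None:
        return (None, None)
    return (len(value), len(value.encode("utf-8")))


counts = {s: char_and_byte_counts(s) for s in samples}
for s, (chars, nbytes) in counts.items():
    print(repr(s), "chars:", chars, "bytes:", nbytes)
```

Only "Émilie" differs between the two counts, because É takes two bytes in UTF-8 while still being one character.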
⚠️ Compare With Trim:
from pyspark.sql.functions import col, length, trim

# trim() removes leading and trailing spaces before length() counts the rest.
df_trimmed = df.withColumn("trimmed_length", length(trim(col("first_name"))))
df_trimmed.show()
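As a sanity check on what the trimmed column should contain, the following plain-Python sketch computes the raw and trimmed lengths for the same sample values without a Spark session. The helper name `expected_lengths` is hypothetical, and the sketch assumes that trimming removes only leading and trailing space characters, and that the "Bob" row has two spaces on each side (consistent with its length of 7).

```python
# Expected raw vs. trimmed lengths for the sample names, in plain Python.
names = [" Alice ", "A$ha", "  Bob  ", "\u00c9milie", "", None]


def expected_lengths(value):
    """Return (raw_length, trimmed_length); None propagates like a NULL."""
    if value is None:
        return (None, None)
    # strip(" ") removes only space characters from both ends.
    return (len(value), len(value.strip(" ")))


expected = [expected_lengths(n) for n in names]
for name, (raw, trimmed) in zip(names, expected):
    print(repr(name), "raw:", raw, "trimmed:", trimmed)
```

Only the padded names change: " Alice " drops from 7 to 5 and "  Bob  " from 7 to 3, while the special character in "A$ha" is still counted.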