df = df.withColumn("name_length", length(df.first_name)): how do I calculate the length when the value has leading or trailing spaces, or special characters inserted?

Great question! In PySpark, the length() function counts all characters, including:

  • Leading/trailing spaces
  • Special characters (!@#$%^, etc.)
  • Unicode characters
  • Numbers

✅ Example: Count Characters with Spaces and Specials

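from pyspark.sql import SparkSession
from pyspark.sql.functions import col, length

# Reuse the active session (e.g. in a notebook) or create one
spark = SparkSession.builder.getOrCreate()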

df = spark.createDataFrame([
    (" Alice ",),     # Leading/trailing space
    ("A$ha",),        # Special character
    ("  Bob  ",),     # More spaces
    ("Émilie",),      # Unicode
    ("",),            # Empty string
    (None,)           # NULL value
], ["first_name"])

df_with_len = df.withColumn("name_length", length(col("first_name")))
df_with_len.show(truncate=False)

✅ Output:

+----------+-----------+
|first_name|name_length|
+----------+-----------+
| Alice    |7          |
|A$ha      |4          |
|  Bob     |7          |
|Émilie    |6          |
|          |0          |
|null      |null       |
+----------+-----------+

🧠 Key Notes:

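  • For string columns, length() returns the number of characters, not bytes (for binary columns it returns the number of bytes).
  • Every character counts: leading and trailing spaces, special characters, digits, and non-ASCII characters such as É.
  • A NULL input gives a NULL length, and an empty string gives 0.
  • Use trim() to strip leading and trailing spaces before counting; ltrim() and rtrim() strip one side only.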
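
To make the character-versus-byte distinction concrete, here is a small plain-Python sketch that needs no Spark session. It assumes that Python's len() on a str (which counts code points) gives the same number as Spark's length() for these sample names; the helper char_and_byte_counts is a made-up name for illustration only.

```python
# Plain-Python sketch: character count vs. UTF-8 byte count.
# "\u00c9" is the precomposed letter É (one code point, two UTF-8 bytes).
names = [" Alice ", "A$ha", "  Bob  ", "\u00c9milie", ""]


def char_and_byte_counts(s):
    """Hypothetical helper: return (characters, UTF-8 bytes) for a string."""
    return len(s), len(s.encode("utf-8"))


for name in names:
    chars, utf8_bytes = char_and_byte_counts(name)
    print(repr(name), chars, utf8_bytes)
```

Only "Émilie" differs here: 6 characters but 7 bytes, because É takes two bytes in UTF-8. Spaces and symbols like $ are one character and one byte each.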

⚠️ Compare With Trim:

from pyspark.sql.functions import col, length, trim

df_trimmed = df.withColumn("trimmed_length", length(trim(col("first_name"))))
df_trimmed.show()
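
If you want to sanity-check what the trimmed lengths should look like without starting Spark, the comparison can be mirrored in plain Python. This is only a sketch under one assumption: that Spark's trim() removes just the space character from both ends, so str.strip(" ") is the closest match. The helper trimmed_length is a hypothetical name, not a PySpark function.

```python
# Plain-Python sketch (no Spark session needed) of raw vs. trimmed lengths.
# Assumption: trim() strips only spaces, so we use strip(" ") instead of
# strip(), which would also remove tabs and newlines.
from typing import Optional

names = [" Alice ", "A$ha", "  Bob  ", "\u00c9milie", "", None]


def trimmed_length(value: Optional[str]) -> Optional[int]:
    """Hypothetical helper: length after stripping spaces; None stays None."""
    if value is None:
        return None  # mirrors NULL in, NULL out
    return len(value.strip(" "))


for name in names:
    raw = None if name is None else len(name)
    print(repr(name), raw, trimmed_length(name))
```

Under that assumption, " Alice " drops from 7 to 5 and "  Bob  " from 7 to 3, while "A$ha" and "Émilie" stay the same because special and non-ASCII characters are not touched by trimming.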
