Below are three Spark‑SQL‑friendly patterns for producing all distinct, unordered pairs from a single‑column table. Pick whichever feels most readable in your environment.
1️⃣ Self‑join with an inequality (the classic)
```sql
-- any table name is fine; keep only the distinct values first
WITH base AS (
  SELECT DISTINCT a          -- x, y, w, v
  FROM your_table
)
SELECT t1.a AS val1,
       t2.a AS val2
FROM base t1
JOIN base t2
  ON t1.a < t2.a;            -- guarantees val1 ≠ val2 and no duplicates
```
Why it works
The predicate `t1.a < t2.a` eliminates (x,x) pairs and keeps only the lexicographically "smaller‑first" orientation, so (x,y) survives but (y,x) never appears.
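Want to sanity‑check it without creating a table? Here is a minimal, self‑contained sketch; the inline `VALUES` clause simply stands in for `your_table`:

```sql
-- self-contained check: VALUES builds the sample column inline
WITH base AS (
  SELECT DISTINCT a
  FROM VALUES ('x'), ('y'), ('x'), ('w'), ('v') AS t(a)
)
SELECT t1.a AS val1,
       t2.a AS val2
FROM base t1
JOIN base t2
  ON t1.a < t2.a
ORDER BY val1, val2;   -- ORDER BY only for readable output
```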
2️⃣ Row‑number window (if the data type isn’t naturally comparable)
```sql
WITH numbered AS (
  SELECT a,
         ROW_NUMBER() OVER (ORDER BY a) AS rn
  FROM (SELECT DISTINCT a FROM your_table) d
)
SELECT n1.a AS val1,
       n2.a AS val2
FROM numbered n1
JOIN numbered n2
  ON n1.rn < n2.rn;   -- rn1 < rn2 ensures uniqueness
```
Because the join condition compares integer row numbers rather than the values themselves, this avoids relying on the column's natural ordering in the join and works even if `a` is a complex type (e.g., a struct).
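A quick sketch of the struct case; the inline `VALUES` data and the `named_struct` builder are just for illustration:

```sql
-- surrogate row numbers let the join avoid comparing struct values directly
WITH numbered AS (
  SELECT a,
         ROW_NUMBER() OVER (ORDER BY a) AS rn   -- any stable expression works here
  FROM (SELECT DISTINCT a
        FROM VALUES (named_struct('k', 1)),
                    (named_struct('k', 2)),
                    (named_struct('k', 3)) AS t(a)) d
)
SELECT n1.a AS val1,
       n2.a AS val2
FROM numbered n1
JOIN numbered n2
  ON n1.rn < n2.rn;
```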
3️⃣ One‑liner with collect_set + combinations
(caveat: combinations(arr, n) is a Presto/Trino built‑in, not part of stock Apache Spark, so this pattern assumes your environment provides one; a pure‑Spark rewrite is sketched below)
```sql
SELECT c[0] AS val1,
       c[1] AS val2
FROM (
  SELECT EXPLODE( combinations(
           SORT_ARRAY( collect_set(a) ), 2)   -- 2-element combinations
         ) AS c
  FROM your_table
);
```
Highlights
- `collect_set(a)` → unique values
- `sort_array` → deterministic order
- `combinations(arr, 2)` → every 2‑item subset
- `explode` flattens the array of pairs into rows

(A stock‑Spark rewrite that avoids `combinations` follows this list.)
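If your platform has no combinations function (stock Apache Spark does not), the same one‑shot shape is achievable with two positional explodes. A sketch using only Spark built‑ins (`collect_set`, `sort_array`, `posexplode`):

```sql
-- pure-Spark rewrite: pair positions i < j instead of calling combinations
WITH arr AS (
  SELECT SORT_ARRAY(collect_set(a)) AS vals
  FROM your_table
)
SELECT v1 AS val1,
       v2 AS val2
FROM arr
LATERAL VIEW POSEXPLODE(vals) p1 AS i, v1
LATERAL VIEW POSEXPLODE(vals) p2 AS j, v2
WHERE i < j;   -- the position inequality keeps each unordered pair exactly once
```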
🚀 Result for your sample values (x,y,w,v)
Whatever method you choose returns exactly:
| val1 | val2 |
|------|------|
| v    | w    |
| v    | x    |
| v    | y    |
| w    | x    |
| w    | y    |
| x    | y    |
(Assuming simple string order v < w < x < y; actual order depends on your sort key.)
When to prefer which?
| Requirement / Context | Recommended pattern |
|---|---|
| Small, readable SQL; strings/ints are fine | #1 self‑join |
| Non‑comparable or multi‑column keys | #2 row_number |
| “One shot” query inside Databricks SQL | #3 combinations (or the posexplode rewrite) |
All three avoid (x,x) and treat (x,y) ≡ (y,x).
Let’s break this down step by step so the expression becomes crystal clear:
You’re using:
```sql
SELECT EXPLODE(combinations(SORT_ARRAY(collect_set(a)), 2)) AS c
FROM your_table;
```
We’ll explain each part inside-out, step-by-step.
🧱 Step-by-step Breakdown
✅ Step 1: collect_set(a)
This aggregates all unique values of column `a` into an array.

Input table:

| a |
|---|
| x |
| y |
| x |
| w |
| v |

`collect_set(a)` removes duplicates and gives:

```
["x", "y", "w", "v"]
```

Its element order is non‑deterministic, so we fix that next.
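Easy to verify in isolation (the inline `VALUES` table is just for the demo):

```sql
SELECT collect_set(a) AS uniques
FROM VALUES ('x'), ('y'), ('x'), ('w'), ('v') AS t(a);
-- one possible result (element order is not guaranteed): ["x","y","w","v"]
```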
✅ Step 2: SORT_ARRAY(...)
This makes the array deterministically sorted, so results like (x,y) or (y,x) are consistently ordered.

From:

```
["x", "y", "w", "v"]
```

To:

```
["v", "w", "x", "y"]
```

Now Spark always uses the same order, which avoids (x,y) and (y,x) both appearing.
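Again trivially checkable on a literal array:

```sql
SELECT sort_array(array('x', 'y', 'w', 'v')) AS sorted_vals;
-- always returns ["v","w","x","y"]
```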
✅ Step 3: combinations(sorted_array, 2)
This generates all unique 2‑element combinations, preserving order.

```
combinations(["v", "w", "x", "y"], 2)
```

Gives:

```
["v", "w"]
["v", "x"]
["v", "y"]
["w", "x"]
["w", "y"]
["x", "y"]
```

It mimics Python’s itertools.combinations(). One caveat: combinations(arr, n) is a Presto/Trino built‑in, not one that stock Apache Spark ships, so this step assumes your environment provides it (e.g., as a registered UDF); otherwise use the posexplode rewrite sketched under pattern 3️⃣ above.
✅ Step 4: EXPLODE(...)
Now that we have an array of arrays, we need to flatten it into rows.

Input:

```
[
  ["v", "w"],
  ["v", "x"],
  ["v", "y"],
  ["w", "x"],
  ["w", "y"],
  ["x", "y"]
]
```

`EXPLODE(...)` turns each element into its own row:

| c |
|---|
| [v, w] |
| [v, x] |
| [v, y] |
| [w, x] |
| [w, y] |
| [x, y] |
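`explode` is a standard Spark SQL generator, so this step needs no extra assumptions; a two‑row demo:

```sql
SELECT explode(array(array('v', 'w'), array('v', 'x'))) AS c;
-- yields one row per inner array: [v, w] and [v, x]
```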
✅ Final SELECT
You can go one step further and split each pair into separate columns:
```sql
SELECT c[0] AS val1,
       c[1] AS val2
FROM (
  SELECT EXPLODE(combinations(SORT_ARRAY(collect_set(a)), 2)) AS c
  FROM your_table
);
```
🧠 Summary
| Function | Purpose |
|---|---|
| collect_set(a) | Gathers the unique values of a into an array |
| sort_array(...) | Ensures consistent ordering (no duplicate unordered pairs) |
| combinations(..., 2) | Generates all 2‑element combinations (no repeats, no (x,x)) |
| explode(...) | Flattens the array of arrays into rows |
✅ Result: you get all unique, unordered pairs (a, b) such that a ≠ b and no (b, a) duplicates exist.