Below are three Spark‑SQL‑friendly patterns for producing all distinct, unordered pairs from a single‑column table. Pick whichever feels most readable in your environment.
1️⃣ Self‑join with an inequality (the classic)
```sql
-- any table name is fine; keep only the distinct values first
WITH base AS (
  SELECT DISTINCT a          -- x, y, w, v
  FROM your_table
)
SELECT t1.a AS val1,
       t2.a AS val2
FROM base t1
JOIN base t2
  ON t1.a < t2.a;            -- guarantees val1 ≠ val2 and no duplicates
```
Why it works
The predicate `t1.a < t2.a` eliminates (x,x) pairs and keeps only the lexicographically "smaller‑first" orientation, so (x,y) survives but (y,x) never appears.
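Want to sanity‑check it without creating a table? Here is a minimal, self‑contained sketch; the inline `VALUES` clause simply stands in for `your_table`:

```sql
-- self-contained check: VALUES builds the sample column inline
WITH base AS (
  SELECT DISTINCT a
  FROM VALUES ('x'), ('y'), ('x'), ('w'), ('v') AS t(a)
)
SELECT t1.a AS val1,
       t2.a AS val2
FROM base t1
JOIN base t2
  ON t1.a < t2.a
ORDER BY val1, val2;   -- ORDER BY only for readable output
```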
2️⃣ Row‑number window (if the data type isn’t naturally comparable)
```sql
WITH numbered AS (
  SELECT a,
         ROW_NUMBER() OVER (ORDER BY a) AS rn
  FROM (SELECT DISTINCT a FROM your_table) d
)
SELECT n1.a AS val1,
       n2.a AS val2
FROM numbered n1
JOIN numbered n2
  ON n1.rn < n2.rn;   -- rn1 < rn2 ensures uniqueness
```
Because the join condition compares integer row numbers rather than the values themselves, this avoids relying on the column's natural ordering in the join and works even if `a` is a complex type (e.g., a struct).
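A quick sketch of the struct case; the inline `VALUES` data and the `named_struct` builder are just for illustration:

```sql
-- surrogate row numbers let the join avoid comparing struct values directly
WITH numbered AS (
  SELECT a,
         ROW_NUMBER() OVER (ORDER BY a) AS rn   -- any stable expression works here
  FROM (SELECT DISTINCT a
        FROM VALUES (named_struct('k', 1)),
                    (named_struct('k', 2)),
                    (named_struct('k', 3)) AS t(a)) d
)
SELECT n1.a AS val1,
       n2.a AS val2
FROM numbered n1
JOIN numbered n2
  ON n1.rn < n2.rn;
```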
3️⃣ One‑liner with collect_set + combinations
(caveat: combinations(arr, n) is a Presto/Trino built‑in, not part of stock Apache Spark, so this pattern assumes your environment provides one; a pure‑Spark rewrite is sketched below)
```sql
SELECT c[0] AS val1,
       c[1] AS val2
FROM (
  SELECT EXPLODE( combinations(
           SORT_ARRAY( collect_set(a) ), 2)   -- 2-element combinations
         ) AS c
  FROM your_table
);
```
Highlights
- `collect_set(a)` → unique values
- `sort_array` → deterministic order
- `combinations(arr, 2)` → every 2‑item subset
- `explode` flattens the array of pairs into rows

(A stock‑Spark rewrite that avoids `combinations` follows this list.)
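If your platform has no combinations function (stock Apache Spark does not), the same one‑shot shape is achievable with two positional explodes. A sketch using only Spark built‑ins (`collect_set`, `sort_array`, `posexplode`):

```sql
-- pure-Spark rewrite: pair positions i < j instead of calling combinations
WITH arr AS (
  SELECT SORT_ARRAY(collect_set(a)) AS vals
  FROM your_table
)
SELECT v1 AS val1,
       v2 AS val2
FROM arr
LATERAL VIEW POSEXPLODE(vals) p1 AS i, v1
LATERAL VIEW POSEXPLODE(vals) p2 AS j, v2
WHERE i < j;   -- the position inequality keeps each unordered pair exactly once
```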
🚀 Result for your sample values (x,y,w,v)
Whatever method you choose returns exactly:
| val1 | val2 |
|------|------|
| v    | w    |
| v    | x    |
| v    | y    |
| w    | x    |
| w    | y    |
| x    | y    |
(Assuming simple string order v < w < x < y; actual order depends on your sort key.)
When to prefer which?
| Requirement / Context | Recommended pattern |
|---|---|
| Small, readable SQL; strings/ints are fine | #1 self‑join |
| Non‑comparable or multi‑column keys | #2 row_number |
| “One shot” query inside Databricks SQL | #3 combinations (or the posexplode rewrite) |
All three avoid (x,x) and treat (x,y) ≡ (y,x).
Let’s break this down step by step so the expression becomes crystal clear:
You’re using:
```sql
SELECT EXPLODE(combinations(SORT_ARRAY(collect_set(a)), 2)) AS c
FROM your_table;
```
We’ll explain each part inside-out, step-by-step.
🧱 Step-by-step Breakdown
✅ Step 1: collect_set(a)
This aggregates all unique values of column `a` into an array.

Input table:

| a |
|---|
| x |
| y |
| x |
| w |
| v |

`collect_set(a)` removes duplicates and gives:

```
["x", "y", "w", "v"]
```

Its element order is non‑deterministic, so we fix that next.
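Easy to verify in isolation (the inline `VALUES` table is just for the demo):

```sql
SELECT collect_set(a) AS uniques
FROM VALUES ('x'), ('y'), ('x'), ('w'), ('v') AS t(a);
-- one possible result (element order is not guaranteed): ["x","y","w","v"]
```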
✅ Step 2: SORT_ARRAY(...)
This makes the array deterministically sorted, so results like (x,y) or (y,x) are consistently ordered.

From:

```
["x", "y", "w", "v"]
```

To:

```
["v", "w", "x", "y"]
```

Now Spark always uses the same order, which avoids (x,y) and (y,x) both appearing.
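Again trivially checkable on a literal array:

```sql
SELECT sort_array(array('x', 'y', 'w', 'v')) AS sorted_vals;
-- always returns ["v","w","x","y"]
```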
✅ Step 3: combinations(sorted_array, 2)
This generates all unique 2‑element combinations, preserving order.

```
combinations(["v", "w", "x", "y"], 2)
```

Gives:

```
["v", "w"]
["v", "x"]
["v", "y"]
["w", "x"]
["w", "y"]
["x", "y"]
```

It mimics Python’s itertools.combinations(). One caveat: combinations(arr, n) is a Presto/Trino built‑in, not one that stock Apache Spark ships, so this step assumes your environment provides it (e.g., as a registered UDF); otherwise use the posexplode rewrite sketched under pattern 3️⃣ above.
✅ Step 4: EXPLODE(...)
Now that we have an array of arrays, we need to flatten it into rows.

Input:

```
[
  ["v", "w"],
  ["v", "x"],
  ["v", "y"],
  ["w", "x"],
  ["w", "y"],
  ["x", "y"]
]
```

`EXPLODE(...)` turns each element into its own row:

| c |
|---|
| [v, w] |
| [v, x] |
| [v, y] |
| [w, x] |
| [w, y] |
| [x, y] |
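`explode` is a standard Spark SQL generator, so this step needs no extra assumptions; a two‑row demo:

```sql
SELECT explode(array(array('v', 'w'), array('v', 'x'))) AS c;
-- yields one row per inner array: [v, w] and [v, x]
```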
✅ Final SELECT
You can go one step further and split each pair into separate columns:
```sql
SELECT c[0] AS val1,
       c[1] AS val2
FROM (
  SELECT EXPLODE(combinations(SORT_ARRAY(collect_set(a)), 2)) AS c
  FROM your_table
);
```
🧠 Summary
| Function | Purpose |
|---|---|
| collect_set(a) | Gathers the unique values of a into an array |
| sort_array(...) | Ensures consistent ordering (no duplicate unordered pairs) |
| combinations(..., 2) | Generates all 2‑element combinations (no repeats, no (x,x)) |
| explode(...) | Flattens the array of arrays into rows |
✅ Result: you get all unique, unordered pairs (a, b) such that a ≠ b and no (b, a) duplicates exist.