I have two separate queries that generate tables that have the following structure:
user_id | 2017_spend | 2017_decile
and
user_id | 2016_spend | 2016_decile
where the decile columns is a 10% percentile grouping, generated using Redshift’s RANK() and NTILE window functions.
Now, I want to join these two into a single table that has every unique User ID in the first column, and their spend and decile information in the following 4 columns. This seems a straightforward task, but I have not been able to get this work as desired. Every JOIN that I do (LEFT, LEFT OUTER, FULL OUTER, INNER) results in a loss of data. Because if someone was a customer in 2016 but not in 2017 (or vice versa), then there will be no match and the JOIN will exclude them. And as I understand, a UNION will simply stack the two tables on top of each other, resulting in duplicate user_id’s.
Am I missing something basic? (It feels like I am) .
My instinct is that a FULL OUTER JOIN should do what you’re trying to do, but perhaps there’s some reason I can’t think of that it doesn’t.
An inelegant but potentially effective solution could be to UNION the tables then group by user_id and use MIN/MAX to aggregate to one row per user.
This assumes you’ve only got one value per user for each of the four columns.
WITH step1 AS(
SELECT
user_id,
2016_spend,
2016_decile,
null AS 2017_spend,
null AS 2017_decile
FROM table_1
UNION
SELECT
user_id,
null AS 2016_spend,
null AS 2016_decile,
2017_spend,
2017_decile
FROM table_2
)
SELECT
user_id,
MIN(2016_spend),
MIN(2016_decile),
MIN(2017_spend),
MIN(2017_decile)
FROM step1
If the problem is duplication then you have about a million ways of dealing with it
WITH duplicates_from_various_sets (
SELECT user from 2017_spend
UNION
SELECT user from 2016_spend
) SELECT distinct user from duplicates_from_various_sets
WITH mock as (
SELECT DISTINCT 1::int2 flag, invoice_id from 2017_spend
UNION DISTINCT
SELECT DISTINCT 2::int2 flag, invoice_id from 2016_spend
)
SELECT invoice_id,
bit_or(flag) – if you need to know which sets was invoice found in 1 - first, 2 - second, 3 - both
You can use this as a skeleton to put all your left outer joins onto… i.e.
– sum(1) should be a faster de-duplicator than DISTINCT, at least in PADB it was
WITH superset_of_invoices as (
SELECT invoice_id, bit_or(flag) as appears_in from (
SELECT 1::int2 flag, invoice_id, sum(1) from 2017_spend group by 1,2
UNION
SELECT 2::int2 flag, invoice_id, sum(1) from 2016_spend group by 1,2
) pile_of_invoices
) SELECT invoice_id,
appears_in,
COALESCE(2017_spend.amount,0) as 2017_spend_amount ,
COALESCE(2017_spend.amount,0) as 2016_spend_amount
FROM superset_of_invoices,
LEFT OUTER JOIN 2017_spend using(invoice_id)
LEFT OUTER JOIN 2016_spend using(invoice_id)
… Many ways to skin this cat. Check what’s faster and cheaper, i guess?
The FULL OUTER JOIN should do what your after without any loss of data (as it will include rows from both tables).
Let’s cover the three possibilities:
Customer 1 occurs in both 2016 and 2017
Customer 2 occurs in 2016 only
Customer 3 occurs in 2017 only
we can mock this out simply with the following dummy values
INSERT INTO test.c_2016 VALUES (1, 1, 1);
INSERT INTO test.c_2016 VALUES(2, 1, 1);
INSERT INTO test.c_2017 VALUES (1, 2, 2);
INSERT INTO test.c_2017 VALUES(3, 1, 1);
we expect to yield 3 rows in a table that looks something similar to
user_id
spend_2016
decile_2016
spend_2017
decile_2017
1
1
1
2
2
2
1
1
null
null
3
null
null
1
1
To get this result we can join the two tables using a FULL OUTER JOIN on user_id which gives the expected result.
SELECT
COALESCE(c_2016.user_id, c_2017.user_id) AS user_id,
c_2016.spend_2016,
c_2016.decile_2016,
c_2017.spend_2017,
c_2017.decile_2017
FROM
test.c_2016
FULL OUTER JOIN
test.c_2017 ON
c_2016.user_id = c_2017.user_id
ORDER BY user_id ASC