I have two separate queries that generate tables that have the following structure:
user_id | 2017_spend | 2017_decile
user_id | 2016_spend | 2016_decile
where the decile columns is a 10% percentile grouping, generated using Redshift’s RANK() and NTILE window functions.
Now, I want to join these two into a single table that has every unique User ID in the first column, and their spend and decile information in the following 4 columns. This seems a straightforward task, but I have not been able to get this work as desired. Every JOIN that I do (LEFT, LEFT OUTER, FULL OUTER, INNER) results in a loss of data. Because if someone was a customer in 2016 but not in 2017 (or vice versa), then there will be no match and the JOIN will exclude them. And as I understand, a UNION will simply stack the two tables on top of each other, resulting in duplicate user_id’s.
Am I missing something basic? (It feels like I am) .
Any help let me know!
My instinct is that a
FULL OUTER JOIN should do what you’re trying to do, but perhaps there’s some reason I can’t think of that it doesn’t.
An inelegant but potentially effective solution could be to
UNION the tables then group by
user_id and use
MAX to aggregate to one row per user.
This assumes you’ve only got one value per user for each of the four columns.
WITH step1 AS(
null AS 2017_spend,
null AS 2017_decile
null AS 2016_spend,
null AS 2016_decile,
Hope this helps!
yup that’s what i do as well
If the problem is duplication then you have about a million ways of dealing with it
WITH duplicates_from_various_sets (
SELECT user from 2017_spend
SELECT user from 2016_spend
) SELECT distinct user from duplicates_from_various_sets
WITH mock as (
SELECT DISTINCT 1::int2 flag, invoice_id from 2017_spend
SELECT DISTINCT 2::int2 flag, invoice_id from 2016_spend
bit_or(flag) – if you need to know which sets was invoice found in 1 - first, 2 - second, 3 - both
You can use this as a skeleton to put all your left outer joins onto… i.e.
– sum(1) should be a faster de-duplicator than DISTINCT, at least in PADB it was
WITH superset_of_invoices as (
SELECT invoice_id, bit_or(flag) as appears_in from (
SELECT 1::int2 flag, invoice_id, sum(1) from 2017_spend group by 1,2
SELECT 2::int2 flag, invoice_id, sum(1) from 2016_spend group by 1,2
) SELECT invoice_id,
COALESCE(2017_spend.amount,0) as 2017_spend_amount ,
COALESCE(2017_spend.amount,0) as 2016_spend_amount
LEFT OUTER JOIN 2017_spend using(invoice_id)
LEFT OUTER JOIN 2016_spend using(invoice_id)
… Many ways to skin this cat. Check what’s faster and cheaper, i guess?
FULL OUTER JOIN should do what your after without any loss of data (as it will include rows from both tables).
Let’s cover the three possibilities:
- Customer 1 occurs in both 2016 and 2017
- Customer 2 occurs in 2016 only
- Customer 3 occurs in 2017 only
we can mock this out simply with the following dummy values
INSERT INTO test.c_2016 VALUES (1, 1, 1);
INSERT INTO test.c_2016 VALUES(2, 1, 1);
INSERT INTO test.c_2017 VALUES (1, 2, 2);
INSERT INTO test.c_2017 VALUES(3, 1, 1);
we expect to yield 3 rows in a table that looks something similar to
To get this result we can join the two tables using a
FULL OUTER JOIN on
user_id which gives the expected result.
COALESCE(c_2016.user_id, c_2017.user_id) AS user_id,
FULL OUTER JOIN
c_2016.user_id = c_2017.user_id
ORDER BY user_id ASC