Redshift overflow on domain_sessionidx + does column order matters?

Timmycarbone · January 10, 2017, 2:31pm

Hey there!

I have an issue with my batch pipeline today.
The COPY to Redshift fails because some values of domain_sessionidx are overflowing. The type in the database is int2 which has its maximum at 32767. I have multiple entries with a domain_sessionidx above that limit (probably a bot or something?).

I’d like to change the type of that column to something like BIGINT but since Redshift doesn’t allow altering column types, I’ll probably need to add a new column. I was wondering if the column order is important for the data pipeline (the inserting part actually) or if recreating that column (it would be the last column) is enough. Else I’ll have to completely duplicate the data in a new table in order to change the column type.

I guess another solution would be to parse the files and replace the overflowing values.

Thanks!

mike · January 11, 2017, 2:24am

The short answer is yes - unfortunately order does matter as atomic.events is loaded as a TSV so the order of these fields will matter when loading data into the atomic.events table.

Copying the data into another table and then reloading into atomic.events is probably the best bet though it’s a little odd for even a bot to have more than 32k sessions. If it’s a very unusual use case it may be easier to put some custom code in the JS to not send events for a certain useragent string (if that’s the case).

Timmycarbone · January 11, 2017, 1:43pm

Mmmh alright.
Yeah this is really unusual. First time in 6 months that I see this.

The useragent doesn’t look like a bot. But everything seems to be coming from the same user.
For now, I think I’ll just run a script to modify the shredded lines concerned. If it happens again, I’ll look into a longer-term solution.

Thanks for the answer!

vceron · January 30, 2017, 3:34pm

I has the same issue and I was lookig for a quick fix but, as our events table (several billions of lines) is huge, I think copying the data/reloading it is not that quick.

The entries come from a bot, with an average of 15k domain_sessionsidx per day for ~11 days.
Also, this is the first time in years, but another user (bot?) is approaching quickly to the maximum capacity of this field.

frankcash · January 29, 2019, 3:49pm

Are you able to share the script you came up with? Thanks.

ihor · January 29, 2019, 5:52pm

@frankcash, you need to migrate you events table to the latest version (higher than 0.8.0) where domain_sessionsidx has been extended to bigint. The migration scripts are here: https://github.com/snowplow/snowplow/tree/master/4-storage/redshift-storage/sql.

As a rule of thumb, the whole process normally consists of the following 9 steps:

Ensure sufficient disk space available for the migration
Rename existing table
Create new table (v0.10.0)
Copy data from old table to new one
Sanity check
Drop old table
Restore dependencies (views)
Recreate constraints in child tables
Downscale Redshift cluster (optional)

frankcash · January 29, 2019, 6:02pm

Will I need to change anything in my configuration or should I be able to the migration with just the SQL scripts?

Also, Redshift tables don’t have typical constraints such as readjusting PK connections in Postgres?

Thanks.

Timmycarbone · January 29, 2019, 8:07pm

In the end, I remember I went like @ihor presented and it worked just fine. IIRC, once the column type has been changed, you don’t need to update anything else.

mike · January 29, 2019, 11:35pm

Redshift doesn’t enforce constraints the way that Postgres does. Although constraints are useful for the query planner in Redshift they don’t do much beyond that.

Topic		Replies	Views
Cookie ID field types and lengths in Redshift schema Storage targets	4	1156	November 22, 2018
[IMPORTANT] November 7, 2022: Redshift table migration for yauaa_context 1-0-4 Open Source Alerts	2	935	November 10, 2022
Redshift maintenance best practices Storage targets	3	5092	April 26, 2017
Unstructured Events and the events table Data store sources	5	2080	March 7, 2019
Can I change the DISTKEY or SORTKEY of the atomic tables in Redshift? Redshift	3	3129	November 21, 2016

Redshift overflow on domain_sessionidx + does column order matters?

Related topics