I have an issue with my batch pipeline today.
The COPY to Redshift fails because some values of domain_sessionidx are overflowing: the column type in the database is int2, whose maximum is 32767, and I have multiple entries with a domain_sessionidx above that limit (probably a bot or something?).
I’d like to change the type of that column to something like BIGINT, but since Redshift doesn’t allow altering column types, I’ll probably need to add a new column. I was wondering whether the column order matters for the data pipeline (the inserting part, specifically) or whether recreating that column (it would end up as the last column) is enough. Otherwise I’ll have to completely duplicate the data into a new table in order to change the column type.
I guess another solution would be to parse the files and replace the overflowing values.
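That last approach can be sketched as a small script that rewrites the TSV files before loading. This is only a minimal illustration, not the poster's actual script: the column position of domain_sessionidx below is a made-up placeholder, so check your actual atomic.events layout before using anything like this.

```python
INT2_MAX = 32767

# Hypothetical column index of domain_sessionidx in the TSV line;
# verify against your real atomic.events column order before use.
SESSIONIDX_COL = 17

def fix_line(line, col=SESSIONIDX_COL, cap=INT2_MAX):
    """Cap an overflowing domain_sessionidx value at the int2 maximum."""
    fields = line.rstrip("\n").split("\t")
    value = fields[col]
    if value and int(value) > cap:
        fields[col] = str(cap)
    return "\t".join(fields) + "\n"

def fix_file(src, dst):
    """Rewrite an enriched/shredded TSV stream, fixing overflowing values."""
    for line in src:
        dst.write(fix_line(line))
```

Capping at 32767 keeps the row loadable without touching any other field; dropping the row or nulling the value would be equally valid choices depending on how you want to treat the bot traffic.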
The short answer is yes, unfortunately: order does matter. atomic.events is loaded from TSV files, so the fields have to be in the same order as the table's columns when loading data into atomic.events.
Copying the data into another table and then reloading it into atomic.events is probably your best bet, though it’s a little odd for even a bot to rack up more than 32k sessions. If it’s a very unusual use case, it may be easier to put some custom code in the JS tracker to not send events for a certain useragent string (if that’s the case).
Yeah this is really unusual. First time in 6 months that I see this.
The useragent doesn’t look like a bot. But everything seems to be coming from the same user.
For now, I think I’ll just run a script to fix the affected shredded lines. If it happens again, I’ll look into a longer-term solution.
Thanks for the answer!
I had the same issue and was looking for a quick fix, but as our events table is huge (several billion rows), I don’t think copying and reloading the data would be that quick.
The entries come from a bot, with an average of 15k domain_sessionidx per day for ~11 days.
Also, this is the first time in years, but another user (a bot?) is quickly approaching the maximum value of this field.
Are you able to share the script you came up with? Thanks.
@frankcash, you need to migrate your events table to the latest version (higher than 0.8.0), where domain_sessionidx has been extended to bigint. The migration scripts are here: https://github.com/snowplow/snowplow/tree/master/4-storage/redshift-storage/sql.
As a rule of thumb, the whole process normally consists of the following 9 steps:
- Ensure sufficient disk space available for the migration
- Rename existing table
- Create new table (v0.10.0)
- Copy data from old table to new one
- Sanity check
- Drop old table
- Restore dependencies (views)
- Recreate constraints in child tables
- Downscale Redshift cluster (optional)
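The rename/copy core of those steps can be sketched as the following SQL, here wrapped in Python so the script can be assembled and reviewed before running. Table and schema names are illustrative, and the CREATE TABLE is deliberately elided: for the real DDL (sort keys, encodings, the full column list) use the v0.10.0 scripts from the repo linked above.

```python
# Sketch of the deep-copy migration. Assumes the table lives in the
# "atomic" schema; names and the elided DDL are placeholders, not the
# official migration scripts.
MIGRATION_STEPS = [
    # Rename the existing table out of the way
    "ALTER TABLE atomic.events RENAME TO events_old;",
    # Create the new table with domain_sessionidx widened to BIGINT
    # (in practice, run the v0.10.0 DDL from the repo instead)
    "CREATE TABLE atomic.events ( /* ... full v0.10.0 DDL ... */ );",
    # Deep-copy the data across; int2 values cast implicitly to bigint
    "INSERT INTO atomic.events SELECT * FROM atomic.events_old;",
    # Sanity check: this difference should be zero
    "SELECT (SELECT COUNT(*) FROM atomic.events)"
    " - (SELECT COUNT(*) FROM atomic.events_old);",
    # Drop the old table once the counts agree
    "DROP TABLE atomic.events_old;",
]

def migration_script(steps=MIGRATION_STEPS):
    """Join the steps into a single SQL script for review before running."""
    return "\n".join(steps)
```

Views that reference atomic.events and constraints in child tables still have to be restored separately, as the step list above notes.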
Will I need to change anything in my configuration, or should I be able to do the migration with just the SQL scripts?
Also, do Redshift tables not have typical constraints, such as the PK relationships you’d need to readjust in Postgres?
In the end, I remember I went with the approach @ihor presented and it worked just fine. IIRC, once the column type has been changed, you don’t need to update anything else.
Redshift doesn’t enforce constraints the way Postgres does. Constraints are useful hints for the query planner in Redshift, but they don’t do much beyond that.