Is it possible to run the RDB shredder without loading events into Redshift? We are looking into using Athena for cost and scalability reasons (among others). However, all of the available documentation I have read seems to indicate that the shredder configuration requires a Redshift connection.
I came across this article: Using AWS Athena to query the shredded events, which indicates that Athena should be possible. However, the article is several years old now and doesn’t actually mention whether Redshift is still required as a target for the loader.
Yes, it is possible, and we actually run pipelines for a few customers who successfully do exactly what you describe.
If you look into RDB loader, you’ll notice that it’s two separate EMR jobs (edit: on reflection I think it might be two steps of one job; either way, looking at the EMR config should make it obvious what needs to happen). You can just run the Shred job and query its output.
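To make that a bit more concrete, here is a minimal sketch of what a shred-only EMR step definition could look like, built in Python. Treat everything specific in it as an assumption to check against the docs for your version: the main class name and the `--iglu-config` / `--config` arguments are as I remember them from the shredder docs, the helper names (`b64`, `shred_only_step`) are made up for illustration, and the jar location is left as a parameter rather than guessed.

```python
import base64

# Assumption: main class as I remember it from the RDB shredder docs.
# Verify against the docs for the version you actually run.
SHREDDER_MAIN_CLASS = "com.snowplowanalytics.snowplow.rdbloader.shredder.batch.Main"


def b64(text: str) -> str:
    """Base64-encode a config string so it can be passed as one CLI argument."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def shred_only_step(jar_uri: str, config_hocon: str, resolver_json: str) -> dict:
    """Build a single EMR step that runs only the shredder (no loader step).

    jar_uri is a placeholder for wherever the shredder jar lives in S3;
    config_hocon and resolver_json are the raw config file contents.
    """
    return {
        "Name": "RDB Shredder (no load)",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            # command-runner.jar is the standard EMR way to run spark-submit
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--class", SHREDDER_MAIN_CLASS,
                "--master", "yarn",
                "--deploy-mode", "cluster",
                jar_uri,
                # Assumed argument names; check them for your shredder version
                "--iglu-config", b64(resolver_json),
                "--config", b64(config_hocon),
            ],
        },
    }
```

The resulting dict has the shape boto3's EMR client expects, so it could be submitted with `add_job_flow_steps(JobFlowId=..., Steps=[step])`. The point is just that there is no loader step in the list, so nothing tries to connect to Redshift.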
The only word of warning I’d have is that debugging issues and navigating things can be slightly more difficult without the load into a database, simply because the data and structures are more visible and accessible there. So even when the plan is to skip Redshift, I generally suggest starting with the load job on, using it for the exploration part of developing a data model/use case, and then switching it off once the meat of the work has been in prod long enough that you’re confident in it.
Well, I’m actually running the shredder jar as a single step, without the loader, as described in Run the RDB shredder - Snowplow Docs. But I think I’ve encountered a bug: no matter what the config is, it always fails with the error: