Running the RDB Loader without Redshift

jmccartin · October 15, 2021, 11:34am

Is it possible to run the RDB shredder without loading events into Redshift? We are looking into using Athena for cost and scalability reasons (among others). However, all of the available documentation I have read seems to indicate that the shredder configuration requires a Redshift connection.

I came across the article "Using AWS Athena to query the shredded events", which indicates that Athena should be possible. However, it is several years old now and doesn’t actually say whether Redshift is still required as a target for the loader.

Colm · October 15, 2021, 12:12pm

Yes, it is possible, and we actually run pipelines for a few customers who successfully do exactly what you describe.

If you look into RDB Loader, you’ll notice that it’s two separate EMR jobs (edit: on reflection, I think it might be two steps of one job. Either way, looking into the EMR config should make it obvious what needs to happen). You can just run the Shred job and query its output.

The only word of warning I’d have is that debugging issues and navigating things can be slightly more difficult without the load to a database, simply because the data and structures are more visible and accessible in a database. So even when the plan is to skip Redshift, I generally suggest starting with the load job on, using it for the exploration part of developing a data model or use case, then switching it off once the meat of the work has been in prod long enough that you’re confident in it.

jmccartin · October 15, 2021, 1:38pm

Well, I’m actually running the shredder jar as a single step, without the loader, as described in "Run the RDB shredder" (Snowplow Docs). But I think I’ve encountered a bug: no matter what the config is, it always fails with the error:

ParsingFailure: expected " got '# Huma...' (line 2, column 3)

I’m attempting to upgrade an old pipeline that used the EmrEtlRunner, so we are looking at options that include the batch and streaming shredders.

anton · October 15, 2021, 5:06pm

Hi @jmccartin,

I think the problem is that you’re trying to pass the HOCON config to Shredder as is, without base64-encoding it first.

In the docs, the string {{base64File "/home/snowplow/configs/snowplow/config.hocon"}} is wrapped in {{base64File ...}}, which is necessary: it is what base64-encodes the file before it reaches the Shredder.
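If you are launching the step yourself rather than through a template that expands {{base64File ...}}, the encoding can be done by hand. Below is a minimal sketch in Python, assuming the Shredder only needs the file's bytes as a single-line, standard base64 string. The function names are hypothetical, and how the resulting string is passed to the step is not shown in this thread, so check the docs for that part.

```python
import base64
from pathlib import Path


def encode_config_text(text: str) -> str:
    """Base64-encode HOCON text into a single-line ASCII string."""
    # b64encode never inserts line breaks, so the result is safe to
    # pass as one command-line argument.
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def encode_config_file(path: str) -> str:
    """Read a config file and return its contents base64-encoded."""
    return encode_config_text(Path(path).read_text(encoding="utf-8"))


# Hypothetical usage, with the path taken from the docs example above:
# encoded = encode_config_file("/home/snowplow/configs/snowplow/config.hocon")
# print(encoded)
```

Decoding the output with any base64 tool should give back the original HOCON, comments included, which is a quick way to confirm that the encoded value is what you expect before submitting the step.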
