Basically, I have already set up the trackers > collectors > enricher > storage.
(I'm just not sure if I set everything up correctly.)
Our current setup uses Kinesis for the collectors and S3 for storage.
The first problem I encountered is that the “Storage” component doesn’t process anything (even though I set it up correctly and it is running, nothing gets processed), so what I did instead was handle that step on AWS with Firehose.
Now we have data in S3.
I'm also not sure if this is the correct format, but this is what some of the files look like.
@Aron_Quiray, it does look like enriched data to me. The enriched data (record) should be in TSV conforming to this structure.
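For reference, each enriched record is a single line of tab-separated values; the leading fields of the canonical event model look roughly like this (truncated and shown with spaces here for readability - the real records are tab-separated and have well over a hundred fields):

```
app_id  platform  etl_tstamp  collector_tstamp  dvce_created_tstamp  event  event_id  ...
my-app  web       2020-12-01 16:30:50.000  2020-12-01 16:30:45.000  2020-12-01 16:30:44.000  page_view  f81d4fae-7dec-11d0-a765-00a0c91e6bf6  ...
```

The values above are made up; the linked structure doc is the authoritative field list.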
To load that data into Redshift (another possibility is Snowflake DB), you need to run a batch job that consumes the data from the S3 bucket containing your enriched data, utilising EmrEtlRunner in Stream Enrich mode.
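Very roughly, that job boils down to an invocation like the sketch below (file names and paths are placeholders; in Stream Enrich mode your config.yml's enriched bucket settings should point at the S3 location where the enriched files land):

```bash
# Illustrative invocation - adjust paths to your own deployment
./snowplow-emr-etl-runner run \
  --config config/config.yml \
  --resolver config/iglu_resolver.json
# add --targets config/targets/ when loading into Redshift via RDB Loader
```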
@ihor, thanks for the reply.
Let me rephrase my question, as I overlooked one detail: the component that's not firing is the “Storage” jar file, so the alternative route we took was Firehose (to store the data in S3).
The data shown above is from S3.
What steps do we need to take in order to load it into a database (or similar) so we can start modeling it?
Sorry for all the questions; I'm really new to Snowplow.
@Aron_Quiray, I’m still not sure I follow you. Here’s the typical architecture you would build to get your data into a data store other than S3. In your case, you enrich data in real time, so you would follow the 2nd picture.
The post is a bit outdated, but the same idea is still relevant. I’m not sure Firehose fits here, as the files are expected to be in a certain format. The S3 Loader does that job for you - it prepares the files for batch processing with EmrEtlRunner (to load the data into Redshift), unless you want to analyse the data by some other means (in S3), for example with Athena.
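For orientation, the S3 Loader is driven by a small HOCON config; a rough sketch is below. The key names and values are from memory and only illustrative - the sample config shipped with the S3 Loader release is the source of truth:

```hocon
# Illustrative only - check the sample config bundled with the S3 Loader release
source = "kinesis"          # read enriched records from a Kinesis stream
sink   = "kinesis"          # failed records go to a bad Kinesis stream

aws {
  accessKey = "default"     # use the default AWS credentials chain
  secretKey = "default"
}

kinesis {
  region          = "us-east-1"
  appName         = "snowplow-s3-loader"   # used for checkpointing
  initialPosition = "TRIM_HORIZON"
  maxRecords      = 500
}

streams {
  inStreamName  = "enriched-good"          # your Stream Enrich output stream
  outStreamName = "s3-loader-bad"
  buffer {
    byteLimit   = 1048576
    recordLimit = 500
    timeLimit   = 60000
  }
}

s3 {
  region = "us-east-1"
  bucket = "my-enriched-bucket/enriched"
  format = "gzip"           # gzip for enriched TSV (lzo is for raw collector payloads)
}
```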
You should be aware, however, that the enriched data (files) have to be placed in folders like “run=2020-12-01-16-30-50” (run=YYYY-MM-DD-hh-mm-ss) for the Loader to process them.
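In other words, the bucket the Loader reads from should end up looking something like this (bucket name, timestamps and file names are just examples):

```
s3://my-enriched-bucket/enriched/
├── run=2020-12-01-16-30-50/
│   ├── part-0000.gz
│   └── part-0001.gz
└── run=2020-12-01-17-00-00/
    └── part-0000.gz
```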
@Aron_Quiray, did you mean config.json? The config.yml is used with EmrEtlRunner, but you would use a JSON configuration file for the Snowflake Loader. You can find an example of the configuration file here.
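For orientation only, the Snowflake Loader config is a self-describing JSON roughly along these lines. All values are placeholders and the field names and schema version are from memory, so treat this purely as a sketch and copy the example from the docs instead:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/snowflake_config/jsonschema/1-0-3",
  "data": {
    "name": "Snowflake target",
    "awsRegion": "us-east-1",
    "auth": { "accessKeyId": "...", "secretAccessKey": "..." },
    "manifest": "snowflake-run-manifest",
    "snowflakeRegion": "us-west-2",
    "database": "snowplow_db",
    "input": "s3://my-enriched-bucket/enriched/",
    "stage": "snowplow_stage",
    "stageUrl": "s3://my-snowflake-bucket/transformed/",
    "warehouse": "snowplow_wh",
    "schema": "atomic",
    "account": "my-account",
    "username": "loader",
    "password": "...",
    "purpose": "ENRICHED_EVENTS"
  }
}
```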
Your (erroneous) example, however, indicates you did try to run EmrEtlRunner. This contradicts your statement “we’re planning to put it into snowflake db”. You do not need to run EmrEtlRunner to load the enriched data into Snowflake DB. You would run EmrEtlRunner to shred the data, which is required to load the data into Redshift, as shown in this doc. Note, though, that that workflow is for the older version of RDB Loader; the latest releases produce TSV shredded files (as opposed to JSON), as explained here.
Are you trying to load to Redshift first before switching to Snowflake DB?