No bad data in S3

cealkate · October 13, 2023, 9:16am

Hello,

I have deployed Snowplow on AWS using Terraform. In Kinesis I can see four streams: raw, enriched, bad-1 and bad-2. However, the S3 bucket contains only one folder, transformed, with only a good folder inside it, so bad data isn’t written anywhere.
What could be the problem here?

josh · October 13, 2023, 11:31am

Hey @cealkate, the easiest way to check whether you simply have only good data is to create some bad data and validate things!

The simplest “bad” payload is:

curl -XGET <collector_endpoint>/i --output -

This sends an invalid (empty) payload to the Collector, which will land in the bad queue.

You then want to check whether a bad folder appears in the location where you have configured the S3 Loader for bad data to write.

If nothing appears within roughly 10 minutes, check the logs for that application and validate that it is indeed working.
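
A minimal sketch of that S3 check using the AWS CLI, in case it helps. The function names and the bucket name are placeholders, and the prefix name `bad/` is an assumption about how the loader is configured; use whatever path your S3 Loader actually writes to.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of any Snowplow tooling.

# Print the top-level prefixes ("folders") of a bucket, one per line.
# `aws s3 ls` shows prefixes as lines of the form "PRE name/".
list_top_level_prefixes() {
  local bucket="$1"
  aws s3 ls "s3://${bucket}/" | awk '$1 == "PRE" { print $2 }'
}

# Exit 0 if the given prefix (e.g. "bad/") exists at the top level of the bucket.
has_prefix() {
  local bucket="$1"
  local prefix="$2"
  list_top_level_prefixes "$bucket" | grep -qxF -- "$prefix"
}

# Example (bucket name is a placeholder):
#   has_prefix my-pipeline-bucket bad/ && echo "bad folder present" || echo "no bad folder yet"
```

If the prefix is still missing after the wait, the next place to look is the loader's own logs.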

Did you use the quick-start to spin up the pipeline?

cealkate · October 13, 2023, 12:15pm

After sending a bad event, I still cannot see any bad folder in S3. The logs show the following:

The record is written only to the raw stream, not the bad one.

Yes, I used the quick-start repository as the basis for the deployment.

josh · October 13, 2023, 1:56pm

Hey @cealkate, have you checked the application logs for the “bad” S3 Loader?

stanch · October 13, 2023, 2:11pm

P.S. You can also use this tool to send a few good & bad events: Tracking your first events | Snowplow Documentation. (Note: it only works with https:// collector URLs, due to the “mixed content” blocking in the browser.)

cealkate · October 13, 2023, 3:09pm

I can’t see that loader in the CloudWatch logs; here are the only ones available:
[Screenshot 2023-10-13 at 17.08.20]

josh · October 16, 2023, 4:10am

Hi @cealkate, have you enabled this service in the vars file?

github.com

snowplow/quickstart-examples/blob/main/terraform/aws/pipeline/default/terraform.tfvars#L39-L42

    # --- Target: Amazon S3
    s3_raw_enabled      = false
    s3_bad_enabled      = true
    s3_enriched_enabled = true
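
A quick way to see whether a local copy of the Terraform code defines or sets these toggles is to search for them. This is only a sketch: `find_s3_toggles` is a hypothetical helper name, and the directory you pass is wherever your modules live.

```shell
#!/usr/bin/env bash
# Sketch only: search *.tf and *.tfvars files under a directory for the
# s3_raw_enabled / s3_bad_enabled / s3_enriched_enabled toggles.
# Prints matching lines as file:line:text; exits non-zero if nothing matches.
find_s3_toggles() {
  local dir="${1:-.}"
  grep -rnE --include='*.tf' --include='*.tfvars' \
    's3_(raw|bad|enriched)_enabled' "$dir"
}

# Example:
#   find_s3_toggles ./terraform || echo "no s3_*_enabled variables found"
```

No matches would suggest the code in use differs from the default quickstart pipeline.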

cealkate · October 16, 2023, 10:07am

Hello,
I cannot find the variable s3_bad_enabled anywhere in our Terraform modules. Shouldn’t it then be enabled by default?

josh · October 16, 2023, 11:58am

So to reiterate: there won’t be any “bad” data in S3 unless that loader has been deployed. It looks increasingly likely that the module has not been deployed, as it would otherwise have an entry in CloudWatch logs.

Did you fork or customize the quickstart for your own purposes? That could be why this loader is not present.

If you had followed the default pipeline setup, the loader is deployed by default, so unless it was removed or disabled using the options I linked, it should be there.
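
To confirm from the command line whether a log group exists for the loader, something like the following could work. It is a sketch: `find_log_groups` is a hypothetical helper, and the search string `s3-loader` is an assumption about how the log group is named in your deployment.

```shell
#!/usr/bin/env bash
# Sketch only: list CloudWatch log group names containing a given substring.
# Exits non-zero if no log group name matches.
find_log_groups() {
  local pattern="$1"
  aws logs describe-log-groups \
    --query 'logGroups[].logGroupName' \
    --output text | tr '\t' '\n' | grep -F -- "$pattern"
}

# Example (the substring is an assumption about naming):
#   find_log_groups s3-loader || echo "no matching log group, loader likely not deployed"
```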
