Issue with snowflake transformer/loader

ian-dribbble · February 26, 2021, 5:51pm

I’m attempting to set up snowplow for the first time for my company, and load the data into our Snowflake setup. I’m doing everything on AWS, and I’ve got everything up to the snowflake loader working (I believe).

However, the loader is failing on the Transform step and sending the failed records to my “badOutputUrl”. The failure error is “FieldNumberMismatch”. I haven’t been able to find any information on what could be causing this in previous discourse posts or anywhere else.

I’ve created this gist with all the relevant config files. Let me know if you need any other information than what I’ve provided.

https://gist.github.com/ehlertij/5b8c47076b14d7b281d5536cb367786b

cluster.json

{
   "schema":"iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-1-0",
   "data":{
      "name":"dataflow-runner - snowflake transformer staging",
      "logUri":"s3://my-bucket/logs/",
      "region":"us-east-1",
      "credentials":{
         "accessKeyId":"env",
         "secretAccessKey":"env"
      },

This file has been truncated.

config.json

{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/snowflake_config/jsonschema/1-0-3",
  "data": {
    "name": "Snowflake config",
    "awsRegion": "us-east-1",
    "auth": {
        "integrationName": "SNOWPLOW_S3_INTEGRATION"
    },
    "manifest": "snowplow-snowflake-manifest",
    "snowflakeRegion": "us-east-1",

This file has been truncated.

enrich-config.hocon

enrich {
  streams {
    in {
      raw = ${?COLLECTOR_STREAM_GOOD}
    }

    out {
      enriched = ${?ENRICH_STREAM_GOOD}
      bad = ${?ENRICH_STREAM_BAD}
      partitionKey = "event_id"

This file has been truncated.

There are more than three files.

I’m really stuck on this one. I’m not sure what the issue could be at this point, so any ideas would be wonderful!

dilyan · March 1, 2021, 9:20am

Hi @ian-dribbble, the problem seems to be that the enriched event has more fields than expected: 391 vs the “canonical” 130. Looking at the example you provided, it looks like there are a lot of redundant tabs in the enriched event on S3.

Do you have a way to compare the enriched data on S3 with the events coming out of Enrich into Kinesis? What do you use to get them from Kinesis to S3? I wonder if that step is not adding all the extra tabs.

ian-dribbble · March 1, 2021, 2:55pm

Thanks for the reply @dilyan. I’m just using kinesis firehose to dump it from kinesis to s3. The configuration is pretty simple. I don’t have any conversions, compression, or encryption turned on.

Could it possibly be something with the iglu schema I’m getting from iteratively? I added iteratively-schema.json to my gist above, including the only event I’m attempting to trigger so far.

Also, could it be something to do with my enrich setup? I’m not actually running any enrichments yet. I’m using the snowplow/stream-enrich-kinesis:latest Docker image, and here’s the command I’m using to run it (after copying the config files over):

["--config", "/snowplow/config/config.hocon", "--resolver", "file:/snowplow/config/resolver.json"]

dilyan · March 2, 2021, 4:19pm

Hi @ian-dribbble, I don’t think the Iteratively schema is the culprit. If you look at the example you shared, there are no extra tabs inside the JSON blob that contains the PageViewed event.

I can think of two places where these extra tabs might be getting introduced:

1.) In Enrich. The only way I can imagine it could happen here is if you have the JS enrichment running and that is updating the event in place – a bug here might be unnecessarily padding fields with tabs.

However you said you’re not running any enrichments, so that leads me to:

2.) In the process that loads the enriched data from Kinesis to S3. We don’t usually use Firehose for this. Rather, there’s a tool that we maintain for it: Load data to S3 (Snowplow Docs). Would you be able to give that one a go?

Previously I asked you if you can see what the enriched data looks like in Kinesis, as a way to check whether scenario 2 above holds. I think you can use the AWS CLI tools to get records from Kinesis and inspect them, just to see if they have the extra tabs. If they do, then we’re back on scenario 1; but from the evidence so far that appears to be the less likely scenario.
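As an alternative to the CLI, the check described above could be sketched in Python roughly as follows. This is only an illustration: it assumes boto3 is installed and AWS credentials are configured, and the helper names `count_fields` and `fetch_payloads` are made up for this example. It reads a few records from one shard and counts the tab-separated fields in each payload.

```python
def count_fields(payload: bytes) -> int:
    """Number of tab-separated fields in one enriched-event payload."""
    # Only a trailing newline is stripped; trailing tabs (empty fields) count.
    return len(payload.decode("utf-8").rstrip("\n").split("\t"))


def fetch_payloads(stream_name: str, shard_id: str, limit: int = 10) -> list:
    """Read up to `limit` record payloads from the start of one shard."""
    import boto3  # imported here so count_fields works without boto3

    kinesis = boto3.client("kinesis")
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    # A single call can return no records even when the shard has data;
    # a fuller tool would keep polling with NextShardIterator.
    response = kinesis.get_records(ShardIterator=iterator, Limit=limit)
    return [record["Data"] for record in response["Records"]]
```

Calling `[count_fields(p) for p in fetch_payloads("my-enriched-stream", "shardId-000000000000")]` (both arguments are placeholders) would then give one field count per record. If those counts look normal while the files on S3 do not, that points at the Kinesis-to-S3 step.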

Colm · March 2, 2021, 4:52pm

Jumping in just to add:

I can’t find the threads where this has come up before, but I do remember an issue coming up with using Firehose instead of the S3 loader.

If memory serves, Kinesis Firehose just dumps everything it finds into S3 without any delimiter between events, whereas the S3 loader delimits events with a newline. I suspect that this may be the issue here.
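To illustrate why a missing delimiter would look like "too many fields", here is a toy sketch with 5-field events standing in for real enriched events. The names and the field count are made up for the example.

```python
FIELDS = 5  # toy value, far fewer fields than a real enriched event

event = "\t".join(f"f{i}" for i in range(FIELDS))

# Three events written back to back with nothing between them,
# versus the same three events separated by newlines.
without_delimiter = event * 3
with_newlines = "\n".join([event] * 3)


def fields_per_line(blob: str) -> list:
    """Tab-separated field count for each line in a blob of text."""
    return [len(line.split("\t")) for line in blob.split("\n")]


print(fields_per_line(with_newlines))      # [5, 5, 5]
print(fields_per_line(without_delimiter))  # [13]
```

In general, k events of n fields each, glued together with no delimiter, read as one line of k·n − (k − 1) fields, because the last field of one event fuses with the first field of the next. As an aside, if one assumes 131 fields per event (an assumption; the figure quoted earlier in the thread is 130), three concatenated events would give 3·131 − 2 = 391, which matches the count reported above. The thread does not confirm that this is what happened.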

I believe people have mentioned that they worked around the problem by adding a custom Lambda function to their Firehose setup that appends a newline delimiter to the end of each event.
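For reference, a minimal sketch of what such a Firehose transformation Lambda might look like in Python. This is not the code anyone in those threads used; it just follows the standard Firehose data-transformation contract, where each incoming record has a `recordId` and base64-encoded `data`, and each returned record carries the same `recordId`, a `result`, and the re-encoded `data`.

```python
import base64


def lambda_handler(event, context):
    """Append a newline to every Firehose record that lacks one."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"])
        # Avoid doubling the delimiter if the record already ends with one.
        if not payload.endswith(b"\n"):
            payload += b"\n"
        output.append(
            {
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(payload).decode("utf-8"),
            }
        )
    return {"records": output}
```

The function would be attached to the delivery stream as its data-transformation step, so each event lands on its own line in the S3 object.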

In my opinion Dilyan’s suggestion to use the S3 loader is the safest option, since it’s what we maintain as compatible with our other components (so any changes that happen to any of the components will be forward compatible with this setup).

Hope that extra bit of context helps!

ian-dribbble · March 3, 2021, 2:30pm

Thank you @dilyan and @Colm! I’ll give the s3 loader a try. I didn’t realize that was a potentially necessary piece with the snowflake loader.

ian-dribbble · March 4, 2021, 5:13pm

The s3 loader was the piece I was missing. Everything is working great now. Thanks guys!

alex · March 4, 2021, 5:29pm

That’s great news!
