Beam Enrich failing in GCP Dataflow with java.lang.NullPointerException

I’ve seen the other threads about this error and I’ve tried a variety of versions for the Dataflow job at this point. I think I had the most success with 1.2.3 actually, but I am getting this error still.

Regardless, even without this issue, everything is ending up in the “bad” topic and never making it to good. I’m using GCP Dataflow and BigQuery with:

snowplow-docker-registry.bintray.io/snowplow/snowplow-bigquery-loader:0.5.0

snowplow/scala-stream-collector-pubsub:1.0.1

snowplow/beam-enrich:1.2.3 (but have tried latest and 1.1.0)

I’m using the “vanilla” iglu_resolver.json with:
"schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",

I am using the ip_lookups enrichment with MaxMind as well.
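My ip_lookups enrichment config is basically the example from the docs - something like this sketch, where the bucket path is a placeholder for mine:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/2-0-0",
  "data": {
    "name": "ip_lookups",
    "vendor": "com.snowplowanalytics.snowplow",
    "enabled": true,
    "parameters": {
      "geo": {
        "database": "GeoLite2-City.mmdb",
        "uri": "gs://my-enrichments-bucket/third-party/maxmind"
      }
    }
  }
}
```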

I’m new to setting this all up and everything seems to be running and collecting and putting messages into PubSub alright…It just won’t get past the Beam Enrich job. Messages end up in the bad topic, so I don’t even know if the loader will put things into BigQuery. Though the mutator created the initial schema alright.

I’ve read official documentation, tutorials, and searched the Scala source code for days trying to piece things together. I’ve tried a variety of configurations and am just at a loss…But sooo close, I can feel it. If I can figure it out, I’m more than happy to list everything out somewhere if it helps others :slight_smile: I’m really excited about this tool and have been dying to use it for a while now.

Any suggestions or places to look? Thanks!

Ok. 1.2.3 definitely works the best. I don’t see the error right now…But the messages are still ending up in the bad topic.

Am I doing this wrong? I’m sending a test POST request (where my collector is running on localhost) to:

http://localhost:8080/com.snowplowanalytics.iglu/v1

with JSON body:
{ "schema":"iglu:com.acme/campaign/jsonschema/1-0-1", "data": { "name":"download", "publisher_name":"Organic", "user":"6353af9b-e288-4cf3-9f1c-b377a9c84dac" } }

I’ve tried a variety of other test requests that I’ve found here as well: https://github.com/snowplow/snowplow/wiki/Iglu-webhook-setup

This is the JAR I see being used in GCP cloud storage: com.snowplowanalytics.scala-maxmind-iplookups_2.12-0.6.1-_Gd4ePqfuq2JNwcpfJdh2w.jar

I saw in the README at https://github.com/snowplow/scala-maxmind-iplookups that 0.7.0 was the latest…Do I need the latest? How would I go about getting that into the enrich job? I was just following along with the docs here: https://docs.snowplowanalytics.com/docs/setup-snowplow-on-gcp/setup-validation-and-enrich-beam-enrich/add-additional-enrichments/ip-lookups/ and didn’t do anything special.

Dataflow is showing me Scala version 2.12.10 and scioVersion 0.8.1.

Oh thank God it was me. Cool. So I guess I wasn’t forming my requests correctly. Though I don’t know what they should be, I do know the JavaScript Tracker is sending out messages that make it all the way through and into BigQuery.

Glad you’ve got this working.

Your request looks fine - but is that a schema that you have published in your Iglu repository? The output in bad rows should be specific about why an event is failing the validation step.

I used the “vanilla” Iglu resolver that points to the official repository. Do I need to set up my own repository? Is there a good guide on that?

Follow up on that…Can I then define my own custom schema?
(edit: this? https://github.com/snowplow/iglu-example-schema-registry)

Thanks!

Yes - the “vanilla” Iglu registry (Iglu Central) contains shared / common vendor schemas, but if you want to use your own schemas you’ll want to set up your own registry (and add this to your enrichment resolver, where you can specify a list of registries to use).
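Concretely, the repositories array in your resolver ends up with more than one entry - a rough sketch, where the second registry’s name and URI are placeholders for wherever you host yours:

```json
"repositories": [
  {
    "name": "Iglu Central",
    "priority": 0,
    "vendorPrefixes": [ "com.snowplowanalytics" ],
    "connection": { "http": { "uri": "http://iglucentral.com" } }
  },
  {
    "name": "My Private Registry",
    "priority": 1,
    "vendorPrefixes": [ "com.acme" ],
    "connection": { "http": { "uri": "https://iglu.example.com" } }
  }
]
```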

You can find some information about setting up an Iglu registry here.
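To answer the follow-up: yes, you can then publish your own schemas to that registry. A custom schema is just a self-describing JSON Schema - a minimal sketch, reusing the com.acme/campaign vendor and name from your test request (the fields here are illustrative only):

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for an example campaign event",
  "self": {
    "vendor": "com.acme",
    "name": "campaign",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "publisher_name": { "type": "string" },
    "user": { "type": "string" }
  },
  "additionalProperties": false
}
```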


Ah ok, I see. I thought I had to host a copy of what Iglu Central had as well. I figured I’d have to define my own schema at some point, but didn’t get that far yet. I was just curious whether I had to have all schemas in one spot or whether multiple registries could be used, and also whether there are any “generic event” provisions for passing whatever data I want. I guess if I want things structured then obviously something generic like that (assuming there’s just some blob of JSON metadata in a single field) is not a good long-term solution.

Cool, thanks so much!

You can have all your schemas in the same spot but typically people opt for multiple repositories. This not only means that you can use schemas from different sources (e.g., private schemas in your own repo vs public in Iglu Central) but also means there’s an additional layer of redundancy if one of the repositories becomes unavailable.

Yes - you’ve hit the nail on the head here. In the early days of Snowplow this was just what you’ve mentioned: arbitrary JSON could be stored in a single column; however, from a validation and data modelling standpoint this was hard to work with. Over time this has evolved to the current state, which means there’s no real generic schema that works for any event (with the exception of things such as screen views), so these need to be defined up front.

Starting to define my own schema and test things out a bit more today. I tried to form requests using existing schemas I found in Iglu Central. For example, “link_click”:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/link_click/jsonschema/1-0-0",
  "data": {
    "targetUrl": "https://google.com",
    "elementId": "test"
  }
}
```

POST to: http://my-collector-domain:8080/com.snowplowanalytics.iglu/v1

It seems to come through, but to my surprise it was added as unstructured. I now have a bunch of columns in BigQuery like: unstruct_event_com_snowplowanalytics_snowplow_link_click_1_0_0.target_url

How do I disallow these unstructured fields?

Why did the mutator add the fields like that? Is there something more terse?

Of course, despite adding the fields, it still didn’t insert any records. The messages just ended up in the failed message topic. edit: I was impatient :slight_smile: they did eventually end up in BigQuery.

Hey @Tom_M,

I can explain a couple of things here - unstruct is a (rather unwisely chosen) legacy term for ‘custom events’. It’s a terrible misnomer as they’re actually highly structured. You’ve essentially sent a custom link click event, and that triggered the column’s creation.
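For context, by the time it reaches enrichment your link_click payload is treated as a self-describing (‘unstruct’) event, conceptually wrapped in an envelope roughly like this, which is where the unstruct_event_… column prefix comes from:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0",
  "data": {
    "schema": "iglu:com.snowplowanalytics.snowplow/link_click/jsonschema/1-0-0",
    "data": {
      "targetUrl": "https://google.com",
      "elementId": "test"
    }
  }
}
```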

If you’re testing, then using the full pipeline might get messy as you’ll get a proliferation of columns for those test events. It’s a good idea to use Snowplow Mini to test your tracking and schemas (or, for purely local, Snowplow Micro).

The delay you observed is also expected. When you first send events the column in which they live doesn’t exist - so the process to create it is triggered, and the events themselves go into a retry loop and are eventually inserted once the column is created.

Typically this can take 5 to 10 minutes, and only for the first few events for a new schema.

Hope that helps explain things!

Best,


Awesome, thanks!

While I am testing, I also worry about users (either internal engineers, or end users if using the JavaScript tracker on a public-facing site) sending events/messages with random fields. I’d like to ensure they don’t end up being counted unless they match an actual schema.

BigQuery seemingly has a maximum of 10,000 columns per table if I’m reading correctly, but I don’t really want to see that many, or have that maximum reached (either accidentally or maliciously).

edit: oooh, I think I understand. These are “valid”; they just sound invalid due to the prefix?

Yes - at the moment the only way a BQ column is created is if a valid event is sent through (to the good topic) and the schema exists in one or more of the repositories specified in your resolver file. In this sense it’s not possible to send through arbitrary JSON and have a column created.
