Setting up Iglu

First-time Snowplow user here, setting up Iglu. I am setting up a basic JavaScript tracker for my client.

In reference to this page:

The page says:

There are two potential steps in setting up Iglu - you may choose to either or both of them:

  1. Setup an Iglu client
  2. Setup a repository

My question is – why would I choose (1), (2), or both? The document briefly describes the two options, but I don’t understand the “either or both” assertion.

It looks to me like I definitely need (2) – the repo where my JSON files live.

But I fail to understand why I might need (1), or the relationship between (1) and (2).

Thanks.

Karl Jones

This is one of the most confusing parts of the Snowplow setup in my opinion, because the documentation is unclear and spread across several GitHub repos.

You actually don’t need to set up an Iglu client. For the batch pipeline, EmrEtlRunner has one built in (within scala-hadoop-shred). You only need to ensure your configuration file specifies the Iglu repos you require. Here is an example of that configuration file: https://github.com/snowplow/snowplow/blob/master/3-enrich/config/iglu_resolver.json

In the example config file above, only one Iglu repo is specified: Iglu Central (public repo created/maintained by Snowplow the company). That repo contains a lot of the schemas used for certain enrichments.

If you desire to have your own unstructured events and custom contexts in addition to the ones created by Snowplow Co and their partners, you should set up your own repository (we opted to do a static repo hosted on Amazon S3). You’ll then add that repository reference in your config (iglu_resolver.json) file so that the EmrEtlRunner knows where to find the schemas for your custom events/contexts.
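To make the "add that repository reference" step concrete, here is a minimal sketch of a resolver config with two repositories, written as a JavaScript object and printed as JSON so it can be saved as iglu_resolver.json. The second repository's name, the vendor prefix "com.example", and the S3 website URL are placeholders I made up for illustration; substitute your own values, and check the linked example file for the authoritative layout.

```javascript
// Sketch of an iglu_resolver.json listing Iglu Central plus one custom repo.
// ASSUMPTIONS: "Example custom repo", "com.example" and the S3 URL below are
// hypothetical placeholders, not values taken from this thread.
const resolverConfig = {
  schema: "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
  data: {
    cacheSize: 500,
    repositories: [
      {
        // Public repo maintained by Snowplow
        name: "Iglu Central",
        priority: 0,
        vendorPrefixes: ["com.snowplowanalytics"],
        connection: { http: { uri: "http://iglucentral.com" } }
      },
      {
        // Your own static repo, e.g. an S3 bucket served over HTTP
        name: "Example custom repo",
        priority: 1,
        vendorPrefixes: ["com.example"],
        connection: {
          http: { uri: "http://example-iglu-bucket.s3-website-us-east-1.amazonaws.com" }
        }
      }
    ]
  }
};

// Serialize with indentation; the output is what would go into the config file.
const resolverJson = JSON.stringify(resolverConfig, null, 2);
console.log(resolverJson);
```

Running this with Node and redirecting the output to a file is one way to produce the config; writing the JSON by hand works just as well.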

Note: to make matters more confusing, there are some default enrichments that are typically hosted locally alongside EmrEtlRunner (folder path specified as a command line flag): https://github.com/snowplow/snowplow/wiki/Configurable-enrichments.

Thanks @travisdevitt for explaining all this so thoroughly - really helpful!

The Setting up Iglu page is indeed confusing - as you say, a company (or their devops person) will set up an Iglu schema registry, but it’s only really a developer (like a Snowplow engineer working on EmrEtlRunner or Common Enrich) who will integrate an Iglu client into a product.

I’ve created a ticket to fix this confusing page: https://github.com/snowplow/iglu/issues/204

The basic idea is to re-orient the documentation around specific personas (e.g. Devops, Developer), which is a route we are taking with new product Sauna (see the in-progress Sauna wiki) and seems to be working quite well…

Thanks! This is helpful.

Karl

Thanks, getting closer. Still don’t have my head wrapped around it, but getting closer.

I have created an S3 bucket for my Iglu repo, and I have the JavaScript tracker in my web page.

What I need now is a bulletproof code sample that I can plug in for a proof-of-concept. I need both:

  1. The JSON schema file, for the Iglu repo
  2. The JSON object that my JavaScript tracker will use.

I looked at several online docs, can’t seem to figure it out. Can you point me in the right direction?

Thanks,
Karl

https://github.com/tdevitt/snowplow_examples

Thanks, this is helpful – good example of the syntax.

What I am still not grasping:

  1. Where does the JSON schema file live? (URL of static repo.)

I see “com.travis” in the code … does this refer to the domain “travis.com” …?

  2. How is this URL referenced by the code?

Thanks,
Karl

The URL for your custom repo is specified in your config file (iglu_resolver.json) which EmrEtlRunner uses to locate the event and context schemas. I uploaded an example resolver config here: https://github.com/tdevitt/snowplow_examples/blob/master/example_iglu_resolver.json

You’ll notice that the resolver config file specifies both Iglu Central and my custom example repo (which I’ve set up on Amazon S3), so that the Iglu client knows where to find ALL schemas needed during a run of EmrEtlRunner.
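As a rough illustration of how a schema reference turns into a file location, the sketch below maps an iglu: URI onto a path under a repository's base URI. The /schemas/vendor/name/format/version layout is inferred from the Iglu Central URL that appears in the $schema field of the example schema; the function names parseIgluUri and schemaUrl are hypothetical helpers, not part of any Snowplow library.

```javascript
// Hypothetical helper: split an Iglu schema URI of the form
// iglu:vendor/name/format/version into its four parts.
function parseIgluUri(uri) {
  const match = /^iglu:([^\/]+)\/([^\/]+)\/([^\/]+)\/(\d+-\d+-\d+)$/.exec(uri);
  if (!match) {
    throw new Error("Not a valid Iglu schema URI: " + uri);
  }
  const [, vendor, name, format, version] = match;
  return { vendor, name, format, version };
}

// Hypothetical helper: build the URL at which a static repo would be
// expected to serve the schema (ASSUMED layout: /schemas/vendor/name/format/version).
function schemaUrl(repoUri, igluUri) {
  const { vendor, name, format, version } = parseIgluUri(igluUri);
  const base = repoUri.replace(/\/+$/, ""); // drop any trailing slashes
  return `${base}/schemas/${vendor}/${name}/${format}/${version}`;
}

console.log(
  schemaUrl(
    "http://iglucentral.com",
    "iglu:com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0"
  )
);
```

Under that assumption, a schema with vendor "com.travis" would sit in a folder named com.travis inside the repo's schemas folder, so the vendor acts as a directory name rather than a web domain.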

UPDATE – I think maybe I get it now –

(1) Unstructured Custom Events Do Not Require JSON Schema Validation. In this sense, Unstructured Events behave similarly to Structured Events.

(2) Schema validation is only relevant at Enrichment time.

Yes?

Thanks!

Karl

==========================
(Previous notes)

EmrEtlRunner is the Enricher, which happens after Collection, is that correct?

In reference to this document:

The tracking and collection should write the Log file, with custom context, is that correct?

And the custom context requires JSON schema validation?

If this happens before the Enrichment phase, same question: how does my JavaScript tracker reference the JSON schema?

Looking at this document, it appears I can put a hard-coded reference to the Schema URL (rather than using the resolver):

https://fivetran.zendesk.com/hc/en-us/articles/208241027-Snowplow-js

“4. Use the URL that you establish in step 3 as the schema URL when you call Snowplow.” Code snippet:

window[snowplowName]('trackPageView', null, [{
    schema: 'https://raw.githubusercontent.com/fivetran/snowplow-schemas/master/hello_world.json',
    data: {
        hello: 'Hello world!',
        hello_array: ['Hello', 'world!']
    }
}]);

Thanks
Karl

Thanks.

Question about the Schema file:

{
    "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
    "description": "Schema for an example custom context",
    "self": {
        "vendor": "com.travis",
        "name": "exampleCustomContext",
        "format": "jsonschema",
        "version": "1-0-0"
    },

    "type": "object",
    "properties": {
        "userBirthday": {
            "description": "Birthday input by the user",
            "type": ["string","null"],
            "format": "date-time"
        },
        "travId": {
            "description": "Unique ID of the user assigned by Travis",
            "type": ["string","null"],
            "maxLength": 1024
        },
        "isAwesome": {
            "description": "Is the user awesome?",
            "type": ["boolean","null"]
        },
        "twitterHandle": {
            "description": "Twitter handle of the user",
            "type": ["string","null"],
            "maxLength": 50
        },
        "firstName": {
            "description": "First name of the user",
            "type": ["string","null"],
            "maxLength": 200
        },
        "lastName": {
            "description": "Last name of the user",
            "type": ["string","null"],
            "maxLength": 200
        }
    },
    "required": ["travId"],
    "additionalProperties": false
}

Question: in the “self” parameter –

    "self": {
        "vendor": "com.travis", ...

Is “com.travis” interpreted as a “travis.com” ?

Or is it an arbitrary namespace, without reference to a site named “travis.com” ?

Thanks,
Karl

The vendor name “com.travis” is an arbitrary namespace; I just followed Snowplow’s stylistic convention. It doesn’t map to a site or URL.

No, that document refers only to a forked version of the code in snowplow/snowplow that is operated by Fivetran. For Snowplow, continue to use iglu: schema URIs with an Iglu resolver file.
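For comparison with the Fivetran snippet, here is a minimal sketch of the same kind of call using an iglu: schema URI and the example custom context schema posted above. The helper name buildExampleContext, the sample user values, and the assumption that the tracker's global function is called snowplow are mine; adjust them to match your own tracker snippet.

```javascript
// Hypothetical helper: wrap user fields as a self-describing custom context
// that references the example schema by its iglu: URI.
function buildExampleContext(user) {
  return {
    schema: "iglu:com.travis/exampleCustomContext/jsonschema/1-0-0",
    data: {
      travId: user.travId, // the only required property in the schema
      firstName: user.firstName || null,
      isAwesome: typeof user.isAwesome === "boolean" ? user.isAwesome : null
    }
  };
}

// Sample values, made up for illustration.
const context = buildExampleContext({ travId: "abc-123", firstName: "Karl" });

// Only runs in a browser where the tracker snippet has defined the global
// function (ASSUMED here to be named "snowplow").
if (typeof window !== "undefined" && typeof window.snowplow === "function") {
  window.snowplow("trackPageView", null, [context]);
}

console.log(JSON.stringify(context));
```

The tracker only sends the URI and the data. The schema file itself is looked up later, through the repositories listed in the resolver config.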

Hi all,
We are doing the Iglu setup and have a question. We have created an Iglu repository and referenced the repository URL in resolver.json. But the schema files in the repository (1-0-0) refer to the public repository in their $schema property. Does this mean that schema validation is done against Iglu Central and not our own repository? Should we change $schema in all the schemas to refer to our local repository? Could you please explain the need for $schema? We would typically not want to redirect to any public repository.

Thanks,
Manju

{
    "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
    "description": "Schema for an example custom context",
    "self": {
        "vendor": "com.snowplow",
        "name": "exampleCustomContext",
        "format": "jsonschema",
        "version": "1-0-0"
    },