Setting up Iglu

karl_jones · August 22, 2016, 7:51pm

First-time Snowplow user here, setting up Iglu. I am setting up a basic JavaScript tracker for my client.

In reference to this page:

The page says:

There are two potential steps in setting up Iglu - you may choose to do either or both of them:

1. Setup an Iglu client
2. Setup a repository

My question is – why would I choose (1), (2), or both? The document briefly describes the two options, but I don’t understand the “either or both” assertion.

It looks to me like I definitely need (2) – the repo where my JSON files live.

But I fail to understand why I might need (1), or the relationship between (1) and (2).

Thanks.

Karl Jones

travisdevitt · August 22, 2016, 8:40pm

This is one of the most confusing parts of the Snowplow setup in my opinion, as the documentation is unclear (and spread across GitHub repos!).

You actually don’t need to set up an Iglu client. For the batch pipeline, EmrEtlRunner has one built in (within scala-hadoop-shred). You only need to ensure your configuration file specifies the Iglu repos you require. Here is an example of that configuration file: https://github.com/snowplow/snowplow/blob/master/3-enrich/config/iglu_resolver.json

In the example config file above, only one Iglu repo is specified: Iglu Central (public repo created/maintained by Snowplow the company). That repo contains a lot of the schemas used for certain enrichments.

If you desire to have your own unstructured events and custom contexts in addition to the ones created by Snowplow Co and their partners, you should set up your own repository (we opted to do a static repo hosted on Amazon S3). You’ll then add that repository reference in your config (iglu_resolver.json) file so that the EmrEtlRunner knows where to find the schemas for your custom events/contexts.
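
As a rough sketch of what such a config might contain, the snippet below builds a resolver configuration with two repositories and prints it as JSON. The Iglu Central entry follows the layout of the example config linked above. The second entry is purely illustrative: its name, the `com.acme` vendor prefix, and the S3 website URL are placeholder assumptions, not values from this thread.

```javascript
// Sketch of the contents of an iglu_resolver.json listing Iglu Central
// plus one custom static repository.
// ASSUMPTION: the second repository's name, vendor prefix, and URI are
// hypothetical placeholders; substitute your own values.
const resolverConfig = {
  schema: "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
  data: {
    cacheSize: 500,
    repositories: [
      {
        name: "Iglu Central",
        priority: 0,
        vendorPrefixes: ["com.snowplowanalytics"],
        connection: { http: { uri: "http://iglucentral.com" } }
      },
      {
        name: "My custom static repo (hypothetical)",
        priority: 1,
        vendorPrefixes: ["com.acme"],
        connection: {
          http: { uri: "http://my-iglu-bucket.s3-website-us-east-1.amazonaws.com" }
        }
      }
    ]
  }
};

// Print the JSON that would be saved as iglu_resolver.json.
console.log(JSON.stringify(resolverConfig, null, 2));
```

Each repository entry only tells the built-in Iglu client where to look; the schemas themselves still have to be uploaded to that location.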

Note: to make matters more confusing, the configuration files for some default enrichments are typically stored locally alongside EmrEtlRunner (the folder path is specified as a command line flag): https://github.com/snowplow/snowplow/wiki/Configurable-enrichments.

alex · August 22, 2016, 8:54pm

Thanks @travisdevitt for explaining all this so thoroughly - really helpful!

The Setting up Iglu page is indeed confusing - as you say, a company (or their devops person) will set up an Iglu schema registry, but it’s only really a developer (like a Snowplow engineer working on EmrEtlRunner or Common Enrich) who will integrate an Iglu client into a product.

I’ve created a ticket to fix this confusing page: https://github.com/snowplow/iglu/issues/204

The basic idea is to re-orient the documentation around specific personas (e.g. Devops, Developer), which is a route we are taking with our new product Sauna (see the in-progress Sauna wiki) and seems to be working quite well…

karl_jones · August 22, 2016, 9:31pm

Thanks! This is helpful.

Karl

karl_jones · August 22, 2016, 9:39pm

Thanks, getting closer. Still don’t have my head wrapped around it, but getting closer.

I have created an S3 bucket for my Iglu repo, and I have the JavaScript tracker in my web page.

What I need now is a bulletproof code sample that I can plug in for proof-of-concept. I need both:

1. The JSON schema file, for the Iglu repo
2. The JSON object that my JavaScript tracker will use

I looked at several online docs, can’t seem to figure it out. Can you point me in the right direction?

Thanks,
Karl

travisdevitt · August 22, 2016, 11:44pm

https://github.com/tdevitt/snowplow_examples

karl_jones · August 23, 2016, 3:03pm

Thanks, this is helpful – good example of the syntax.

What I am still not grasping:

Where does the JSON schema file live? (URL of static repo.)

I see “com.travis” in the code … does this refer to the domain “travis.com” …?

How is this URL referenced by the code?

Thanks,
Karl

travisdevitt · August 23, 2016, 6:50pm

The URL for your custom repo is specified in your config file (iglu_resolver.json) which EmrEtlRunner uses to locate the event and context schemas. I uploaded an example resolver config here: https://github.com/tdevitt/snowplow_examples/blob/master/example_iglu_resolver.json

You’ll notice that the resolver config file specifies both Iglu Central and my custom example repo (which I’ve set up on Amazon S3), so that the Iglu client knows where to find ALL schemas needed during a run of EmrEtlRunner.
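
To make the "where does the schema file live" part concrete, here is a minimal sketch of how an `iglu:` schema URI lines up with a file path in a static repo, assuming the same `schemas/vendor/name/format/version` layout that Iglu Central uses. The helper name `igluUriToUrl` and the bucket URL are hypothetical, made up for illustration; they are not part of any Snowplow library.

```javascript
// Sketch: map an Iglu schema URI onto the URL of the schema file in a
// static repo laid out as <base>/schemas/<vendor>/<name>/<format>/<version>.
// ASSUMPTION: igluUriToUrl is an illustrative helper and the base URL is a
// placeholder; neither comes from a Snowplow library.
function igluUriToUrl(repoBaseUrl, igluUri) {
  const match = /^iglu:([^\/]+)\/([^\/]+)\/([^\/]+)\/(\d+-\d+-\d+)$/.exec(igluUri);
  if (!match) {
    throw new Error("Not a valid Iglu URI: " + igluUri);
  }
  const [, vendor, name, format, version] = match;
  // Drop any trailing slashes so the joined path has exactly one separator.
  const base = repoBaseUrl.replace(/\/+$/, "");
  return `${base}/schemas/${vendor}/${name}/${format}/${version}`;
}

const url = igluUriToUrl(
  "http://my-iglu-bucket.s3-website-us-east-1.amazonaws.com/", // hypothetical
  "iglu:com.travis/exampleCustomContext/jsonschema/1-0-0"
);
console.log(url);
```

The tracker only ever sends the `iglu:` URI. The base URL comes from the resolver config, which is why the repo location never appears in the tracking code.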

karl_jones · August 23, 2016, 7:25pm

UPDATE – I think maybe I get it now –

(1) Unstructured Custom Events Do Not Require JSON Schema Validation. In this sense, Unstructured Events behave similarly to Structured Events.

(2) Schema validation is only relevant at Enrichment time.

Yes?

Thanks!

Karl

==========================
(Previous notes)

EmrEtlRunner is the Enricher, which happens after Collection, is that correct?

In reference to this document:

The tracking and collection should write the Log file, with custom context, is that correct?

And the custom context requires JSON schema validation?

If this happens before the Enrichment phase, same question: how does my JavaScript tracker reference the JSON schema?

Looking at this document, it appears I can put a hard-coded reference to the Schema URL (rather than using the resolver):

https://fivetran.zendesk.com/hc/en-us/articles/208241027-Snowplow-js

“4. Use the URL that you establish in step 3 as the schema URL when you call Snowplow.” Code snippet:

window[snowplowName]('trackPageView', null, [{
    schema: 'https://raw.githubusercontent.com/fivetran/snowplow-schemas/master/hello_world.json',
    data: {
        hello: 'Hello world!',
        hello_array: ['Hello', 'world!']
    }
}]);

Thanks
Karl

karl_jones · August 24, 2016, 4:13pm

Thanks.

Question about the Schema file:

{
    "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
    "description": "Schema for an example custom context",
    "self": {
        "vendor": "com.travis",
        "name": "exampleCustomContext",
        "format": "jsonschema",
        "version": "1-0-0"
    },

    "type": "object",
    "properties": {
        "userBirthday": {
            "description": "Birthday input by the user",
            "type": ["string","null"],
            "format": "date-time"
        },
        "travId": {
            "description": "Unique ID of the user assigned by Travis",
            "type": ["string","null"],
            "maxLength": 1024
        },
        "isAwesome": {
            "description": "Is the user awesome?",
            "type": ["boolean","null"]
        },
        "twitterHandle": {
            "description": "Twitter handle of the user",
            "type": ["string","null"],
            "maxLength": 50
        },
        "firstName": {
            "description": "First name of the user",
            "type": ["string","null"],
            "maxLength": 200
        },
        "lastName": {
            "description": "Last name of the user",
            "type": ["string","null"],
            "maxLength": 200
        }
    },
    "required": ["travId"],
    "additionalProperties": false
}

Question: in the “self” parameter –

    "self": {
        "vendor": "com.travis", ...

Is “com.travis” interpreted as “travis.com”?

Or is it an arbitrary namespace, without reference to a site named “travis.com” ?

Thanks,
Karl

travisdevitt · August 24, 2016, 4:27pm

The vendor name “com.travis” is an arbitrary namespace; I just followed Snowplow’s stylistic convention. It doesn’t map to a site or URL.

alex · August 25, 2016, 4:45pm

No, that document refers only to a forked version of the code in snowplow/snowplow that is operated by Fivetran. For Snowplow, continue to use iglu: schema URIs with an Iglu resolver file.
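
For illustration, a custom context that uses an `iglu:` URI might be built roughly as follows. The vendor, name, and version are taken from the example schema posted earlier in this thread. The `buildContext` helper and the field values are made up for this sketch, and the tracker call assumes the tracker's global function is named `snowplow`.

```javascript
// Sketch: build a custom context as a self-describing JSON that points at
// its schema with an iglu: URI rather than an http(s) URL.
// ASSUMPTION: buildContext is an illustrative helper and the data values
// are invented; only the vendor/name/version come from the thread's schema.
function buildContext(vendor, name, version, data) {
  return {
    schema: `iglu:${vendor}/${name}/jsonschema/${version}`,
    data: data
  };
}

const userContext = buildContext("com.travis", "exampleCustomContext", "1-0-0", {
  travId: "user-123",  // the only required property in the example schema
  isAwesome: true,
  twitterHandle: null  // the example schema allows null here
});

// In a page where the Snowplow JavaScript tracker is loaded, the context is
// passed in an array as the third argument of trackPageView.
// ASSUMPTION: the tracker's global function is named "snowplow".
if (typeof window !== "undefined" && typeof window.snowplow === "function") {
  window.snowplow("trackPageView", null, [userContext]);
}

console.log(JSON.stringify(userContext));
```

The schema is not fetched when this code runs in the browser. The `iglu:` URI travels with the event and is resolved later, using the repositories listed in the resolver file.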

manju · October 20, 2017, 6:38am

Hi all,
We are doing the Iglu setup and have a question. We have created an Iglu repository and referenced the repository URL in resolver.json. But the schema files in the repository (1-0-0) refer to the public repository in the $schema variable. Does this mean that the schema validation is taken from Iglu Central and not from our own repository? Should we change $schema in all the schemas to refer to our local repository? Sorry, can you please explain the need for $schema? We would typically not want to redirect to any public repository.

Thanks,
Manju

{
    "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
    "description": "Schema for an example custom context",
    "self": {
        "vendor": "com.snowplow",
        "name": "exampleCustomContext",
        "format": "jsonschema",
        "version": "1-0-0"
    },