Are there best-practice guidelines for maintaining a catalogue of the event models in use in a particular deployment?
The way I see it, machines and developers are reasonably happy with iglu repositories hosting jsonschemas and various derivatives (jsonpaths and Redshift DDL), groomed into a repository index. Has anyone attempted to use this as a basis for human-readable raw event model documentation: rendering each jsonschema into a document and allowing manual input from the model designer to explain the reasoning behind captured attributes, version changelogs, and other information that makes downstream data analysis more thoughtful?
I, for one, would want to explain to the analysts why certain elements are captured, why the field lengths were limited, how the raw events should be interpreted, which elements make it into the database, which fields are expected to have high cardinality, and so on.
In my previous engagements, unrelated to Snowplow, we made extensive use of Atlassian Confluence templates to standardize information gathering, event model documentation and robust cataloguing. Any pointers on how to achieve these results with Snowplow assets?
I think this is a fantastic idea.
One potentially low-friction way of doing this is to add a description to each field in the JSON schema, since the description property is supported in the draft 4 JSON Schema spec. This means you could have a schema that looks something like:
"description": "Entire schema description",
"description": "A product string not exceeding 255 characters - used as an internal SKU"
One advantage of this approach is that igluctl could be extended to generate a changelog for each schema from its successive versions. The description of each schema field could also be added as a comment on the corresponding database column (in the Redshift DDL), so that the descriptions are stored in the database as well. The caveat is that only model changes generate a new table: revisions and additions alter the existing table, so its column comments would reflect only the latest schema descriptions.
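As a rough sketch of the three ideas above (rendering field descriptions, deriving a changelog, and emitting column comments), something like the following could work. Nothing here is existing igluctl functionality: the helper names, the `com.example` / `product_view` schema, the `category` field and the table name are all hypothetical, and the changelog only compares top-level properties.

```python
# Hypothetical schema versions; vendor, name and fields are illustrative only.
V1_0_0 = {
    "description": "Entire schema description",
    "self": {"vendor": "com.example", "name": "product_view",
             "format": "jsonschema", "version": "1-0-0"},
    "type": "object",
    "properties": {
        "product": {
            "type": "string",
            "maxLength": 255,
            "description": "A product string not exceeding 255 characters"
                           " - used as an internal SKU",
        },
    },
}

# A later addition within the same model: one new optional field (invented).
V1_0_1 = {
    **V1_0_0,
    "self": {**V1_0_0["self"], "version": "1-0-1"},
    "properties": {
        **V1_0_0["properties"],
        "category": {"type": ["string", "null"],
                     "description": "Merchandising category"},
    },
}


def field_docs(schema):
    """One 'name (type): description' line per property, sorted by name."""
    lines = []
    for name, spec in sorted(schema.get("properties", {}).items()):
        types = spec.get("type", "any")
        if isinstance(types, list):  # JSON Schema allows a list of types
            types = " | ".join(types)
        text = spec.get("description", "(no description)")
        lines.append(f"{name} ({types}): {text}")
    return lines


def changelog(old, new):
    """Describe property-level differences between two schema versions."""
    before, after = old.get("properties", {}), new.get("properties", {})
    entries = [f"added {n}" for n in sorted(after.keys() - before.keys())]
    entries += [f"removed {n}" for n in sorted(before.keys() - after.keys())]
    entries += [f"changed {n}" for n in sorted(before.keys() & after.keys())
                if before[n] != after[n]]
    return entries


def column_comments(table, schema):
    """Emit COMMENT ON COLUMN statements for properties with a description."""
    statements = []
    for name, spec in sorted(schema.get("properties", {}).items()):
        if "description" in spec:
            escaped = spec["description"].replace("'", "''")  # escape quotes
            statements.append(
                f"COMMENT ON COLUMN {table}.{name} IS '{escaped}';")
    return statements


print("\n".join(field_docs(V1_0_1)))
print(changelog(V1_0_0, V1_0_1))
# The table name below is an assumed example, not taken from this thread.
print("\n".join(column_comments("atomic.com_example_product_view_1", V1_0_1)))
```

Because the comments are regenerated from whichever schema version is passed in, rerunning `column_comments` after an addition simply overwrites the earlier comments, which is the caveat described above.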
Great topic. Governance and documentation are as important as implementation. While commenting the schema as @mike describes works for developers, I have found it is not the best fit for business users, or even for analysts who were not involved in creating the schema.
Like you, I opted to rely heavily on Confluence to document each schema on a page with:
- Why the schema exists
- Who is the owner (main stakeholder)
- Where the schema is used (site, app, iOS, etc.)
- Business questions enabled by the schema
- A table with all the fields in the schema and what they represent
- Sometimes, tracking specifications related to the schema
I found documenting schemas is essential to enable analysts and business users to better understand Snowplow and to be able to perform analysis.
I gave it a little more thought and started looking around to see whether any other OSS orgs have tried to autogenerate jsonschema documentation. Apparently a few have.
Example: https://github.com/cloudflare/doca generates OK-looking documentation using, as Mike suggested, the description property, but also an “example” property.
…not exactly perfect, but an ok starting point. We could place tabs for each available version and organize the hierarchies. Thoughts? Other approaches?
Looking deeper into using Cloudflare doca as the basis for the documentation CMS, it seems that all we’d need is:
- Customizing a theme ( https://github.com/cloudflare/doca-bootstrap-theme )
- would need your ideas here
- A loader to import jsonschema files that have no “.json” filename extension
- tested, seems to work ok
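To illustrate what loading extensionless schema files involves, here is a minimal Python sketch. It is not doca's actual loader (doca is a Node tool); `load_schemas` is a hypothetical helper that walks a directory and keeps every file that parses as a JSON object, whatever its name.

```python
import json
from pathlib import Path


def load_schemas(root):
    """Load every file under root that parses as a JSON object.

    Files are selected by content rather than by a ".json" extension,
    since schema files may be named after their version, e.g. "1-0-0".
    """
    schemas = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            document = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # not JSON (README, binary file, ...): skip it
        if isinstance(document, dict):
            schemas[path.relative_to(root).as_posix()] = document
    return schemas
```

Calling `load_schemas("schemas")` on a local checkout would return a dict keyed by relative path, for example `com.example/invoice_paid/jsonschema/1-0-0`.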
The way I see the CMS so far, it is 100% autogenerated, but maybe someone with actual UI/UX skills can tell me whether we can capture user input and update the jsonschema assets in a web-authoring environment.
I think the left navigation should list vendor prefixes [ “$.schema.vendor” ] in alphanumeric order, each being an expandable menu item opening a list of event names [ “$.schema.name” ], also in alphanumeric order. The next level in the hierarchy is the schema version [ “$.schema.version” ]. Each schema version menu item should, I think, have three anchor links: JS (jsonschema), JP (jsonpaths) and RS (redshiftschema).
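The navigation hierarchy described here could be built roughly as follows. This is only a sketch: it assumes each document carries the `self` block used by Iglu self-describing schemas (pass a different `meta_key` if your files keep vendor, name and version elsewhere), and the anchor id format is invented for illustration.

```python
from collections import defaultdict


def version_key(version):
    """Sort SchemaVer numerically so that 1-0-10 comes after 1-0-2."""
    return tuple(int(part) for part in version.split("-"))


def build_nav(schemas, meta_key="self"):
    """Return {vendor: {name: [versions]}} with every level sorted."""
    tree = defaultdict(lambda: defaultdict(list))
    for schema in schemas:
        meta = schema[meta_key]
        tree[meta["vendor"]][meta["name"]].append(meta["version"])
    return {
        vendor: {name: sorted(versions, key=version_key)
                 for name, versions in sorted(names.items())}
        for vendor, names in sorted(tree.items())
    }


def anchors(vendor, name, version):
    """Three anchor links per version; the id format is hypothetical."""
    base = f"#{vendor}/{name}/{version}"
    return {"JS": f"{base}/jsonschema",
            "JP": f"{base}/jsonpaths",
            "RS": f"{base}/redshift"}


example = [
    {"self": {"vendor": "com.example", "name": "invoice_paid", "version": "1-0-10"}},
    {"self": {"vendor": "com.example", "name": "invoice_paid", "version": "1-0-2"}},
    {"self": {"vendor": "com.acme", "name": "signup", "version": "1-0-0"}},
]
print(build_nav(example))
```

Sorting versions as integer tuples rather than strings matters once a schema passes its tenth revision, since a plain string sort would place 1-0-10 before 1-0-2.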
In the content canvas, I would keep most of the current structure, with small modifications:
- Remove Show/Hide button and always display expanded.
- Where doca currently displays schema title, I’d generate it from the self describing block attributes something like
[ com.example » invoice_paid » v1.0.0 ]
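A title in that shape could be derived from the self-describing block along these lines. This is a small sketch assuming the metadata lives under a `self` key with a dash-separated version; `doc_title` is a hypothetical helper name.

```python
def doc_title(schema):
    """Build 'vendor » name » vX.Y.Z' from the self-describing block."""
    meta = schema["self"]
    version = meta["version"].replace("-", ".")  # 1-0-0 -> 1.0.0
    return f"{meta['vendor']} » {meta['name']} » v{version}"


print(doc_title({"self": {"vendor": "com.example", "name": "invoice_paid",
                          "format": "jsonschema", "version": "1-0-0"}}))
```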
Then, I would encourage Snowplowers to make use of the “description” and “example” attributes. After careful examination of the specs, however, I could not find those keywords defined as something validators should look out for or expect in a particular format. So if the attributes are present, they will be rendered nicely by the theme. Otherwise, the content will look ugly but will still be functional.
Extra points, and I really don’t know how to achieve this effectively: I’d like to add jsonpaths and redshift DDL here. Ultimately, I see this interface being slapped on top of iglu repository server to be able to
1. Programmatically store customized jsonschema, jsonpaths and redshift schema into a single repository
2. Provide version locking
3. Provide autogenerated documentation (what we’re discussing in this thread)
4. Provide additional CMS / UGC capabilities
4.1. Either wiki notes around schemas, discussion forum, etc. OR
4.2. Means of embedding autogenerated content into popular CMS platforms (e.g. Confluence) where UGC will be created and organized.
What do you guys think?
We have completed initial development of the documentation app and are ready to share the spoils. Some things worked out, some we may need the community expertise to help us out with.
- iglu-central contains schemas in jsonschema and avro formats. Documentation app currently only covers self-describing schemas in jsonschema format. Question to the community - who uses avro schemas and in what context?
- iglu-central contains self-describing schemas with no corresponding jsonpaths and sql artifacts. Those can be generated using igluctl or similar utilities, but that should be done by the schema owners. Question to the community - should repository PRs be accepted if schema vendors do not submit these artifacts? Should defaults be generated? What should the default constraints on string fields be in those cases?
- In our opinion, the documentation app would ideally be deployed on top of the iglu scala server, similar to the current co-deployment of the scala components and a swagger interface. We could not fully integrate the documentation UI with iglu during the initial iteration. Instead we developed a few scripts to (a) export public iglu server schemas to the local file system and (b) merge jsonpaths and sql artifacts into the body of the jsonschema artifacts.
- We introduced, as was suggested in previous posts on this subject, the following elements:
(a) description - to capture notes for human consumption, (b) example - to serve both as a part of documentation of each payload element and as means to generate example for the entire event. The end-results are similar to previously posted screenshots.
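To make the merge step and the example generation concrete, here is a minimal sketch. It is not the scripts from the app itself: the `jsonpaths` and `sql` keys added to the schema body, the helper names, and the `invoice_paid` fields are all assumptions made for illustration.

```python
import copy


def merge_artifacts(schema, jsonpaths=None, sql=None):
    """Return a copy of the schema with derived artifacts embedded.

    The "jsonpaths" and "sql" keys are hypothetical; the thread does not
    say which keys the real scripts use.
    """
    merged = copy.deepcopy(schema)
    if jsonpaths is not None:
        merged["jsonpaths"] = jsonpaths
    if sql is not None:
        merged["sql"] = sql
    return merged


def example_event(schema):
    """Assemble an example payload from per-property "example" values."""
    payload = {}
    for name, spec in schema.get("properties", {}).items():
        if "example" in spec:
            payload[name] = spec["example"]
        elif spec.get("type") == "object":
            nested = example_event(spec)  # recurse into nested objects
            if nested:
                payload[name] = nested
    return payload


invoice_paid = {  # illustrative schema, not taken from iglu-central
    "type": "object",
    "properties": {
        "invoice_id": {"type": "string", "example": "INV-0001",
                       "description": "Internal invoice identifier"},
        "amount": {"type": "number", "example": 9.99},
        "notes": {"type": "string"},  # no example: left out of the payload
    },
}
print(example_event(invoice_paid))
```

Properties without an `example` are simply omitted, so the generated event is only as complete as the schema annotations.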
There are two alternative implementations we can think of:
- Fully integrated with the iglu scala server: it could use the backend API to discover schemas stored on the iglu server and render documentation immediately.
- Fully compatible with an iglu static content repository: it could compute the sql file and jsonpaths file corresponding to a given schema and try to load them into a complete doc.
Our current iteration lies somewhere in-between.
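For the static-repository option, computing the companion files could look something like this. The layout is an assumption modelled on how iglu-central is organised (one jsonpaths and one sql file per model version, named `<name>_<model>`); check it against your own repository before relying on it.

```python
from pathlib import PurePosixPath


def artifact_paths(vendor, name, version, root="."):
    """Guess where the artifacts for one schema version live.

    Assumes an iglu-central-like layout; the model is the first
    component of the SchemaVer (e.g. 1 for 1-0-3).
    """
    model = version.split("-")[0]
    base = PurePosixPath(root)
    return {
        "jsonschema": base / "schemas" / vendor / name / "jsonschema" / version,
        "jsonpaths": base / "jsonpaths" / vendor / f"{name}_{model}.json",
        "sql": base / "sql" / vendor / f"{name}_{model}.sql",
    }


for kind, path in artifact_paths("com.example", "invoice_paid", "1-0-3",
                                 root="iglu-central").items():
    print(kind, path)
```

Every revision and addition within a model maps to the same jsonpaths and sql files, so a documentation page for 1-0-0 and one for 1-0-3 would show the same derived artifacts.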
Is there an interest in the community to take these ideas forward? Is this just another cool thing or will you actually benefit from the app? Are we willing to make modifications to igluctl and other utilities to support updated self-describing schema format?
This looks really lovely! Here are some answers to your questions and my thoughts on this.
We use Avro to configure sauna and dataflow-runner. The Avro ecosystem has many tools to generate language-specific classes to hold data, and we use these libraries to generate static types for our apps. But as far as I know this use case is fairly unusual for Avro; people mostly use it along with Spark/Hadoop to store serialized data, and I also think we have plans to switch from TSV to Avro to store enriched data.
These schemas (without SQL and JSONPaths) are again mostly for configuration or plain JSON validation. It doesn’t make much sense to generate SQL DDL and JSONPaths for them, as they will never be stored in relational databases; they are used only to validate input self-describing JSONs.
I think we’re leaning towards the option fully integrated with the scala server, if we’re talking about the end product. Static and Scala registries are supposed to have feature parity for machines, but the scala server is the one meant to be more featureful for humans. Integration with the static server is an absolutely fine option for now, though, as the static server is much more widely used (so far) and many things in documentation could be implemented using only static data.
We have a ticket to explore possible ways to facilitate human-readable documentation, and many questions have no answer yet. I think the end result will largely depend on apps and features that are not yet implemented, such as a schema generator UI. But this looks like a good start.
We have some code to contribute, but are not sure what the best way to do so is. How do we share the sources? Should new repositories be created? We have 3 separate components: two forked from Cloudflare’s BSD3-licensed doca and doca-bootstrap-theme, and one generated / resulting app. If the plan is to continue towards iglu scala server integration, should the app find a new home in the iglu scala app repository? Can @anton, @mike or @alex advise?
hey @dashirov - have you put this in a public repo elsewhere? I’d like to take a look, sounds very promising!
Look forward to it. Nice work with the theme!
Here’s the resulting app, stripped of proprietary code. THERE’S NO LICENSE on it yet.
Work in progress, so to speak: https://github.com/dashirov-ga/iglu-doca
clone and run:
The doca-snowplow-theme artifacts are hosted privately, so you may need to build the theme and push it to some repo
that npm install can find it in. Sorry, we haven’t gotten very far with all the official ceremonies.
In lieu of integration with iglu, there is a bin directory with a few scripts that periodically extract publicly exposed schemas from the iglu server, add SQL and JSONPaths to them, and publish them into the docs library. That is what we could not finish on our own.
I think a backend API integration with vendor-specific read keys or a super-user API key could be used instead of the scheduler-driven generator.
Sounds good, I’ll check it out!
The theme could be improved… after the 100th event version it becomes too cluttered, both in the left nav and in the content canvas. Some sort of advanced hierarchy would definitely be a better way to organize the docs. Also, there is currently no place to document which applications fire which event and version, and that would be a very good thing to expose. Last but not least, the order in which the navigation lists the items is not the order in which the actual docs are listed in the main canvas.
So yeah, if the community has a need for a trinket like this, we could spend time and effort making a better product here.
I got pretty close to having doca discover jsonschema in hosted iglu (scala server), but couldn’t drag it through the finish line. I took a wrong approach initially, trying to do it on the front-end, but I think it should have been done on the node side. Do you have it in you to add the feature?
Hi, I’m interested in this doca-snowplow-theme. I installed it from GitHub along with doca from npm, but when running
doca init -t snowplow I keep getting
doca-snowplow-theme is not in the npm registry.
Are there updates pending or do I need to install some custom commit of doca to recognise the changes?
It was a long time ago, but I remember publishing the theme to my private npm repository first (we use Nexus with a hosted npm repository exposed). After that it just worked.