Page content sentiment data in Snowplow

Hey!

We’re working on merging our Snowplow pipeline with our NLP pipeline. The goal is to pass the sentiment of the page content, along with model performance metrics, as custom event properties in a pipeline that moves 150+ mln events a month. The content of each page changes every day.

The best idea we came up with so far is this.

  1. Create a dedicated custom context/schema for each page (there can be hundreds of pages - does that make sense?).
  2. Re-run scraping and predictions every day, updating each page’s custom context values with the latest sentiments and model performance metrics.
  3. Pass the relevant custom context as an extra argument to all of Snowplow’s track…() calls, depending on which page is being viewed in the session.
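Step 3 might look roughly like the sketch below, assuming the Snowplow JavaScript tracker (v3-style syntax). The lookup table, page paths, and values are made up; in practice they would come from the daily scraping/prediction run.

```javascript
// Hypothetical per-page sentiment lookup, refreshed by the daily NLP run.
const sentimentByPage = {
  '/automotive/some-article': {
    direction: 'positive',
    positivity: 100,
    negativity: 0,
  },
};

// Build a self-describing entity for the current page, or null if we
// have no scores for it yet.
function sentimentEntity(path) {
  const data = sentimentByPage[path];
  if (!data) {
    return null;
  }
  return {
    schema: 'iglu:com.dtm/sentiment/jsonschema/1-0-0',
    data: data,
  };
}

// Attach the entity to any track...() call, e.g.:
//   const entity = sentimentEntity(window.location.pathname);
//   snowplow('trackPageView', { context: entity ? [entity] : [] });
```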

What do you think? Is this a Snowplow-ish way to do what we want to do?

Thanks!

Do you have shared properties across multiple pages that you want to capture as attributes? If so I would avoid having a separate context for each page - 100s of schemas is doable but not really optimal. If it’s possible sharing an example of what you’d like to hypothetically send might make it easier to design this structure.

Is this your own content or content on another site? If it’s your own content I’d be tempted to do the sentiment scoring / NLP analysis in the enrichment part of the pipeline if possible rather than necessarily the frontend using a lookup on content id.


Thanks for the thoughts!

Do you have shared properties across multiple pages that you want to capture as attributes? If so I would avoid having a separate context for each page - 100s of schemas is doable but not really optimal. If it’s possible sharing an example of what you’d like to hypothetically send might make it easier to design this structure.

Yes, we’ll use the same properties across all pages. The custom context we’ve created so far is below (probably not fully validated).

On the 100s of autogenerated schemas - is that suboptimal from the pipeline perspective, or just for managing them?

{
   "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
   "description": "Schema for content classification",
   "self": {
      "vendor": "com.dtm",
      "name": "sentiment",
      "format": "jsonschema",
      "version": "1-0-0"
   },
   "type": "object",
   "properties": {
      "automotive": {
         "type": "object",
         "properties": {
            "direction": { "type": "string" },
            "positivity": { "type": "number", "minimum": 0, "maximum": 100 },
            "negativity": { "type": "number", "minimum": 0, "maximum": 100 },
            "score": { "type": "number" },
            "words": { "type": "integer", "minimum": 0 },
            "sentences": { "type": "integer", "minimum": 0 },
            "precision": { "type": "number", "minimum": 0, "maximum": 1 },
            "recall": { "type": "number", "minimum": 0, "maximum": 1 },
            "f": { "type": "number", "minimum": 0, "maximum": 1 }
         },
         "additionalProperties": false
      },
      "books_literature": {
         "type": "object",
         "properties": {
            "direction": { "type": "string" },
            "positivity": { "type": "number", "minimum": 0, "maximum": 100 },
            "negativity": { "type": "number", "minimum": 0, "maximum": 100 },
            "score": { "type": "number" },
            "words": { "type": "integer", "minimum": 0 },
            "sentences": { "type": "integer", "minimum": 0 },
            "precision": { "type": "number", "minimum": 0, "maximum": 1 },
            "recall": { "type": "number", "minimum": 0, "maximum": 1 },
            "f": { "type": "number", "minimum": 0, "maximum": 1 }
         },
         "additionalProperties": false
      },
      "business_finance": {
         "type": "object",
         "properties": {
            "direction": { "type": "string" },
            "positivity": { "type": "number", "minimum": 0, "maximum": 100 },
            "negativity": { "type": "number", "minimum": 0, "maximum": 100 },
            "score": { "type": "number" },
            "words": { "type": "integer", "minimum": 0 },
            "sentences": { "type": "integer", "minimum": 0 },
            "precision": { "type": "number", "minimum": 0, "maximum": 1 },
            "recall": { "type": "number", "minimum": 0, "maximum": 1 },
            "f": { "type": "number", "minimum": 0, "maximum": 1 }
         },
         "additionalProperties": false
      },
      "travel": {
         "type": "object",
         "properties": {
            "direction": { "type": "string" },
            "positivity": { "type": "number", "minimum": 0, "maximum": 100 },
            "negativity": { "type": "number", "minimum": 0, "maximum": 100 },
            "score": { "type": "number" },
            "words": { "type": "integer", "minimum": 0 },
            "sentences": { "type": "integer", "minimum": 0 },
            "precision": { "type": "number", "minimum": 0, "maximum": 1 },
            "recall": { "type": "number", "minimum": 0, "maximum": 1 },
            "f": { "type": "number", "minimum": 0, "maximum": 1 }
         },
         "additionalProperties": false
      }
   },
   "additionalProperties": false
}

Is this your own content or content on another site? If it’s your own content I’d be tempted to do the sentiment scoring / NLP analysis in the enrichment part of the pipeline if possible rather than necessarily the frontend using a lookup on content id.

Got it. The content is from multiple sites; we have no control over it. By enrichment, do you mean the custom JavaScript enrichment?

A little bit of both. If you have common shared properties, I’d see if you can use a couple of schemas (which is generally possible) rather than one schema per piece of content.
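For example, since contexts are sent as an array, one flatter shape would be a single schema with the category as a field, attached once per category to each event - a sketch of the idea (the vendor/name here are made up):

```json
{
   "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
   "description": "One sentiment entity per category, attached N times per event",
   "self": {
      "vendor": "com.dtm",
      "name": "category_sentiment",
      "format": "jsonschema",
      "version": "1-0-0"
   },
   "type": "object",
   "properties": {
      "category": { "type": "string" },
      "direction": { "type": "string" },
      "positivity": { "type": "number" },
      "negativity": { "type": "number" },
      "score": { "type": "number" },
      "words": { "type": "integer" },
      "sentences": { "type": "integer" },
      "precision": { "type": "number" },
      "recall": { "type": "number" },
      "f": { "type": "number" }
   },
   "additionalProperties": false
}
```

Each event then carries an array of these entities, one per scored category, and adding a new category needs no schema change.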

You could use the JavaScript enrichment, but I’d lean towards the API enrichment, depending on how you’re doing the crawling/scoring - e.g., whether it happens at collection time or the content has been pre-scored.
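For reference, an API Request enrichment configuration along those lines might look roughly like the sketch below. The scoring-service URL is hypothetical and the parameter names are from memory, so double-check them against the enrichment’s configuration schema before using:

```json
{
   "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/api_request_enrichment_config/jsonschema/1-0-0",
   "data": {
      "vendor": "com.snowplowanalytics.snowplow.enrichments",
      "name": "api_request_enrichment_config",
      "enabled": true,
      "parameters": {
         "inputs": [
            {
               "key": "pageUrl",
               "pojo": { "field": "page_url" }
            }
         ],
         "api": {
            "http": {
               "method": "GET",
               "uri": "http://sentiment-api.example.com/score?url={{pageUrl}}",
               "timeout": 5000,
               "authentication": {}
            }
         },
         "outputs": [
            {
               "schema": "iglu:com.dtm/sentiment/jsonschema/1-0-0",
               "json": { "jsonPath": "$" }
            }
         ],
         "cache": { "size": 3000, "ttl": 60 }
      }
   }
}
```

The cache matters here: with pre-scored content that changes daily, a generous TTL keeps the enrichment from hammering the scoring service on every event.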