Page content sentiment data in Snowplow

Hey!

We’re working on merging our Snowplow pipeline with our NLP pipeline. The goal is to pass the sentiment of the page content, along with model performance metrics, as custom event properties in a pipeline that moves 150+ mln events a month. The content of each page changes every day.

The best idea we came up with so far is this.

  1. Create a dedicated custom context/schema for each page (there can be hundreds of pages - does that make sense?).
  2. Re-run scraping and predictions every day, updating each page’s custom context values with the latest sentiments and model performance metrics.
  3. Pass the relevant custom context as an extra argument to all of Snowplow’s track…() calls, depending on which page is being viewed in the session.
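Step 3 might look roughly like the sketch below, assuming the Snowplow JavaScript tracker (v3-style syntax). The lookup table, page paths, and values are made up; in practice they would come from the daily scraping/prediction run.

```javascript
// Hypothetical per-page sentiment lookup, refreshed by the daily NLP run.
const sentimentByPage = {
  '/automotive/some-article': {
    direction: 'positive',
    positivity: 100,
    negativity: 0,
  },
};

// Build a self-describing entity for the current page, or null if we
// have no scores for it yet.
function sentimentEntity(path) {
  const data = sentimentByPage[path];
  if (!data) {
    return null;
  }
  return {
    schema: 'iglu:com.dtm/sentiment/jsonschema/1-0-0',
    data: data,
  };
}

// Attach the entity to any track...() call, e.g.:
//   const entity = sentimentEntity(window.location.pathname);
//   snowplow('trackPageView', { context: entity ? [entity] : [] });
```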

What do you think? Is this a Snowplow-ish way to do what we want to do?

Thanks!

Do you have shared properties across multiple pages that you want to capture as attributes? If so I would avoid having a separate context for each page - 100s of schemas is doable but not really optimal. If it’s possible sharing an example of what you’d like to hypothetically send might make it easier to design this structure.

Is this your own content or content on another site? If it’s your own content I’d be tempted to do the sentiment scoring / NLP analysis in the enrichment part of the pipeline if possible rather than necessarily the frontend using a lookup on content id.


Thanks for the thoughts!

Do you have shared properties across multiple pages that you want to capture as attributes? If so I would avoid having a separate context for each page - 100s of schemas is doable but not really optimal. If it’s possible sharing an example of what you’d like to hypothetically send might make it easier to design this structure.

Yes, we’ll use the same properties across all pages. The custom context we’ve created so far is below (probably not fully validated).

On the 100s of autogenerated schemas - is that suboptimal from the pipeline perspective, or just for managing them?

{
   "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
   "description": "Schema for content classification",
   "self": {
      "vendor": "com.dtm",
      "name": "sentiment",
      "format": "jsonschema",
      "version": "1-0-0"
   },
   "type": "object",
   "properties": {
      "automotive": {
         "type": "object",
         "properties": {
            "direction": { "type": "string" },
            "positivity": { "type": "number", "minimum": 0, "maximum": 100 },
            "negativity": { "type": "number", "minimum": 0, "maximum": 100 },
            "score": { "type": "number" },
            "words": { "type": "integer", "minimum": 0 },
            "sentences": { "type": "integer", "minimum": 0 },
            "precision": { "type": "number", "minimum": 0, "maximum": 1 },
            "recall": { "type": "number", "minimum": 0, "maximum": 1 },
            "f": { "type": "number", "minimum": 0, "maximum": 1 }
         },
         "additionalProperties": false
      },
      "books_literature": {
         "type": "object",
         "properties": {
            "direction": { "type": "string" },
            "positivity": { "type": "number", "minimum": 0, "maximum": 100 },
            "negativity": { "type": "number", "minimum": 0, "maximum": 100 },
            "score": { "type": "number" },
            "words": { "type": "integer", "minimum": 0 },
            "sentences": { "type": "integer", "minimum": 0 },
            "precision": { "type": "number", "minimum": 0, "maximum": 1 },
            "recall": { "type": "number", "minimum": 0, "maximum": 1 },
            "f": { "type": "number", "minimum": 0, "maximum": 1 }
         },
         "additionalProperties": false
      },
      "business_finance": {
         "type": "object",
         "properties": {
            "direction": { "type": "string" },
            "positivity": { "type": "number", "minimum": 0, "maximum": 100 },
            "negativity": { "type": "number", "minimum": 0, "maximum": 100 },
            "score": { "type": "number" },
            "words": { "type": "integer", "minimum": 0 },
            "sentences": { "type": "integer", "minimum": 0 },
            "precision": { "type": "number", "minimum": 0, "maximum": 1 },
            "recall": { "type": "number", "minimum": 0, "maximum": 1 },
            "f": { "type": "number", "minimum": 0, "maximum": 1 }
         },
         "additionalProperties": false
      },
      "travel": {
         "type": "object",
         "properties": {
            "direction": { "type": "string" },
            "positivity": { "type": "number", "minimum": 0, "maximum": 100 },
            "negativity": { "type": "number", "minimum": 0, "maximum": 100 },
            "score": { "type": "number" },
            "words": { "type": "integer", "minimum": 0 },
            "sentences": { "type": "integer", "minimum": 0 },
            "precision": { "type": "number", "minimum": 0, "maximum": 1 },
            "recall": { "type": "number", "minimum": 0, "maximum": 1 },
            "f": { "type": "number", "minimum": 0, "maximum": 1 }
         },
         "additionalProperties": false
      }
   },
   "additionalProperties": false
}

Is this your own content or content on another site? If it’s your own content I’d be tempted to do the sentiment scoring / NLP analysis in the enrichment part of the pipeline if possible rather than necessarily the frontend using a lookup on content id.

Got it. The content is from multiple sites; we have no control over it. By enrichment, do you mean the custom JavaScript enrichment?

A little bit of both. If you have common shared properties, I’d see if you can use a couple of schemas (which is generally possible) rather than one schema per piece of content.
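For example, since contexts are sent as an array, one flatter shape would be a single schema with the category as a field, attached once per category to each event - a sketch of the idea (the vendor/name here are made up):

```json
{
   "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
   "description": "One sentiment entity per category, attached N times per event",
   "self": {
      "vendor": "com.dtm",
      "name": "category_sentiment",
      "format": "jsonschema",
      "version": "1-0-0"
   },
   "type": "object",
   "properties": {
      "category": { "type": "string" },
      "direction": { "type": "string" },
      "positivity": { "type": "number" },
      "negativity": { "type": "number" },
      "score": { "type": "number" },
      "words": { "type": "integer" },
      "sentences": { "type": "integer" },
      "precision": { "type": "number" },
      "recall": { "type": "number" },
      "f": { "type": "number" }
   },
   "additionalProperties": false
}
```

Each event then carries an array of these entities, one per scored category, and adding a new category needs no schema change.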

You could use the JavaScript enrichment, but I’d lean towards the API enrichment, depending on how you’re doing the crawling/scoring - e.g., whether it happens at collection time or the content has been pre-scored.
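For reference, an API Request enrichment configuration along those lines might look roughly like the sketch below. The scoring-service URL is hypothetical and the parameter names are from memory, so double-check them against the enrichment’s configuration schema before using:

```json
{
   "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/api_request_enrichment_config/jsonschema/1-0-0",
   "data": {
      "vendor": "com.snowplowanalytics.snowplow.enrichments",
      "name": "api_request_enrichment_config",
      "enabled": true,
      "parameters": {
         "inputs": [
            {
               "key": "pageUrl",
               "pojo": { "field": "page_url" }
            }
         ],
         "api": {
            "http": {
               "method": "GET",
               "uri": "http://sentiment-api.example.com/score?url={{pageUrl}}",
               "timeout": 5000,
               "authentication": {}
            }
         },
         "outputs": [
            {
               "schema": "iglu:com.dtm/sentiment/jsonschema/1-0-0",
               "json": { "jsonPath": "$" }
            }
         ],
         "cache": { "size": 3000, "ttl": 60 }
      }
   }
}
```

The cache matters here: with pre-scored content that changes daily, a generous TTL keeps the enrichment from hammering the scoring service on every event.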