The current pipeline allows two main methods of sending webhook data to the Snowplow collector:
- The Iglu webhook endpoint (via GET or POST)
- A custom webhook adapter written in Scala (e.g., the Mailchimp adapter)
This works well for most use cases but raises some problems when dealing with data that meets one or more of the following conditions:
- The data is not well modelled and requires transformation (such as converting epoch seconds to ISO8601 timestamps)
- The webhook producing service is only capable of sending multiple event types to a single URL
- The producing service requires authentication in the form of a challenge-response system (e.g., Xero)
- The schema for a webhook service may vary significantly between identical users of a product (e.g., a form service will only return form field values for the form created)
- The end user wishes to authenticate the origin of the payload by comparing the signature received from the webhook provider to the expected signature (e.g., Stripe HMAC signing) or by checking against a vendor-provided secret (see the sketch after this list)
- The producing service sends notifications in a non-standard format (e.g., CSV row, XML document)
- The producing service sends multiple events in batch (e.g., a subscribe webhook may send 10 user events at once rather than 10 sequential requests)
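To make the signature check concrete, here is a minimal sketch in Python of the kind of verification a processing step could perform. The header name, encoding, and how the secret is supplied are assumptions and vary by provider; Stripe, for instance, wraps its signature in a timestamped header format.

```python
import hashlib
import hmac

def is_authentic(payload: bytes, received_signature: str, secret: str) -> bool:
    """Compare the vendor-supplied signature against one computed locally.

    Assumes the vendor signs the raw request body with HMAC-SHA256 and sends
    the hex digest in a header; the exact header name and encoding differ
    between providers.
    """
    expected = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
    # compare_digest avoids timing side-channels when comparing signatures
    return hmac.compare_digest(expected, received_signature)
```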
Although it is possible to factor this into the current collector service, it would increase the cost of development and tie these features to collector/enricher releases.
I’d like to propose a possible alternative to putting this logic into the collector.
A service that sits in front of the collector could act as a proxy by receiving webhook requests, performing any processing, and finally forwarding the well-formatted request to the collector endpoint.
All three major cloud platforms have some variation of this service (AWS Lambda, Azure Functions, Google Cloud Functions), though I’ll focus specifically on AWS as Snowplow currently runs on AWS.
On AWS the service would comprise a number of API Gateway endpoints (URLs), with each URL tied to a single Lambda function. Each Lambda function would contain standalone code to:
- Accept the incoming request
- Process/authenticate/respond to the webhook request in question
- Forward the request (if required) to the collector endpoint
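As an illustration of what one of these functions might look like, here is a minimal Python sketch of a Lambda handler behind API Gateway. The collector URL, the `created` field, and the payload shape are all hypothetical; a real function would match the specific webhook provider and the collector's expected request format.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical collector endpoint; in practice this would come from configuration
COLLECTOR_URL = "https://collector.example.com/com.snowplowanalytics.iglu/v1"

def handler(event, context):
    """API Gateway -> Lambda sketch: accept, reshape, and forward a webhook."""
    body = json.loads(event.get("body") or "{}")

    # Example transformation: convert a hypothetical epoch-seconds field to ISO 8601
    if "created" in body:
        body["created"] = datetime.fromtimestamp(
            body["created"], tz=timezone.utc
        ).isoformat()

    # Forward the reshaped payload to the collector endpoint as a POST
    request = urllib.request.Request(
        COLLECTOR_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        status = response.status

    # Respond to the webhook provider so it does not retry unnecessarily
    return {"statusCode": 200, "body": json.dumps({"forwarded": status == 200})}
```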
As a core value of Snowplow is durability, it is imperative that data passing through this step, whether it succeeds or fails, is kept in semi-persistent storage (for example, one or more Kinesis streams) to guard against possible data loss.
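A minimal sketch of that durability step, assuming boto3 and hypothetical stream names, might look like this:

```python
import json
import uuid

import boto3

kinesis = boto3.client("kinesis")

# Hypothetical stream names: one for raw payloads, one for failed forwards
RAW_STREAM = "webhook-raw"
FAILED_STREAM = "webhook-failed"

def persist(payload: dict, succeeded: bool) -> None:
    """Write every payload to Kinesis so a failed forward can be replayed later."""
    kinesis.put_record(
        StreamName=RAW_STREAM if succeeded else FAILED_STREAM,
        Data=json.dumps(payload).encode("utf-8"),
        PartitionKey=str(uuid.uuid4()),
    )
```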
This approach has several advantages:
- It is easy to update and deploy functions
- It is possible to easily manipulate payloads
- It is easy to serve a response (for authentication or authorization purposes)
- Invocations of the function are “serverless” so there is less need to worry about scaling resources
- It is possible to do extensive and isolated testing for each function
- Errors and exceptions are easy to monitor and alert on (using Cloudwatch)
- The pattern is cloud-platform agnostic
- Enables decoupling of the collector logic from webhook specific processing
- Raises the possibility of allowing the webhook processor to perform schema inference (e.g., a payload sent without a schema could be matched against Iglu repositories before being sent to the collector)
However, this approach does come with some caveats:
- An additional (but optional) piece of the pipeline increases complexity
- If the system does not durably store data to Kinesis or another persistent store, the potential for data loss increases
I’d love to hear what the community thinks about this idea and if you have any other alternative approaches or methods that you’ve used!