Hi @digitaltouch - thanks for the super-detailed thoughts! I think we are very much on the same page.
> I believe that we could build off of what logstash has done with the syntax awareness of various log files (Apache Common, Apache Combined, etc) with relative ease, or we could build some sort of DSL that has predefined JSON schemas for each common log file (something similar to the csv-mapper project).
Definitely agree - it feels like we can probably do both. The fastest route would probably be to ship the Logfile Source with some standard logfile formats, but also support Lua scripting for simple transformations.
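To make that concrete, here is a rough sketch of what a predefined-format registry could look like - the format name, field names and the single Apache Combined pattern are just illustrative, and a Lua hook would then post-process the resulting JSON:

```python
import re

# Illustrative sketch of a predefined-format registry for the Logfile Source.
# Only Apache Combined is shown; the field names are ours, not a settled schema.
APACHE_COMBINED = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referer>[^"]*)" "(?P<useragent>[^"]*)"'
)

FORMATS = {"apache_combined": APACHE_COMBINED}

def parse_line(fmt, line):
    """Turn one raw logfile line into a JSON-ready dict, or None if it doesn't match."""
    match = FORMATS[fmt].match(line)
    return match.groupdict() if match else None
```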
We can take some inspiration from the [Amazon Kinesis Agent][kinesis-agent] too.
(Much) further down the line, it would also be interesting for the Snowplow pipeline to support schema inference for unknown logfile formats (similar to the [Sequence project][sequence-project]).
> We found out the hard way that it is easiest to extract from third parties, save to text files, and send through the Snowplow pipeline via trackers, because we can monitor bad rows without having to constantly re-deploy the extraction and loading scripts - or have a DBA monitor them. Going straight from third parties to Redshift turns out to cause the same problems as traditional ETL pipelines when managing more than 10 different datasources.
Completely makes sense. We made two architectural mistakes with [Huskimo][huskimo]:
- Directly integrating with Redshift rather than emitting Snowplow events and letting the Snowplow pipeline and Iglu do the heavy lifting
- Adding all of the integrations into a single codebase. Each integration is completely independent of the others - they make much more sense as individual projects
Our [Snowplow AWS Lambda Source][lambda-source] project (pre-release) is a better example of our planned post-Huskimo approach to these kinds of integrations.
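To illustrate the first point: post-Huskimo, each source integration should just emit self-describing events and leave validation and Redshift loading to Iglu and the rest of the pipeline. Roughly like this (the schema URI and fields are invented for the example):

```python
# A self-describing event as an individual source integration might emit it,
# rather than writing rows to Redshift directly. The schema URI and fields are
# made up; the real schema would live in an Iglu registry.
case_updated = {
    "schema": "iglu:com.example.desk/case_updated/jsonschema/1-0-0",
    "data": {
        "case_id": 12345,
        "status": "resolved",
        "updated_at": "2015-11-03T10:15:00Z",
    },
}
```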
On your integration dataflow:
> Huskimo → S3 → SNS → Snowplow Logfile Source (including simple transformations) → Snowplow Collector → Snowplow Pipeline
This is super-interesting. Given that most third-party APIs require pagination anyway, what’s the benefit of roundtripping the data through S3 and SNS and the Logfile Source - why not just embed the Snowplow tracker in “Huskimo”? This is what we are planning post-Huskimo and I’m confident it would work with Singular, Twilio and Desk.com at the very least. In other words:
Snowplow Foo.com Source (embeds Snowplow Tracker) → Snowplow Collector
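For what it's worth, a rough sketch of that shape - Foo.com, its endpoint and the Iglu schema URI are invented, and the tracker calls assume the snowplow-python-tracker's Emitter / Tracker / SelfDescribingJson API, whose exact signatures vary between versions:

```python
# Sketch only: a "Snowplow Foo.com Source" with the tracker embedded directly,
# rather than roundtripping through S3/SNS and a Logfile Source.
import requests
from snowplow_tracker import Emitter, Tracker, SelfDescribingJson

def run(api_key, collector="collector.example.com"):
    tracker = Tracker(Emitter(collector))
    page = 1
    while True:
        # Hypothetical paginated REST endpoint
        resp = requests.get(
            "https://api.foo.com/v1/records",
            params={"page": page},
            headers={"Authorization": "Bearer " + api_key},
        ).json()
        for record in resp["items"]:
            # Each record becomes a self-describing event; validation, enrichment
            # and loading are left to the Snowplow pipeline and Iglu
            tracker.track_self_describing_event(
                SelfDescribingJson("iglu:com.foo/record/jsonschema/1-0-0", record)
            )
        if not resp.get("next_page"):
            break
        page += 1
    tracker.flush()
```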
> As far as processing logfiles twice, we handle that with deduplication scripts with StorageLoader. We started with a system to handle cursor position (with Redis) and found that it was way easier to manage deduplication in SQL. We compress and archive the logfiles after processing. If the script fails midway through parsing, that log may get parsed twice.
This is a bit surprising to me. Do you mean that you extract the whole data source on each run? We’ve found this impossible given API rate limits - and it feels unnecessary in any situation where resources are append-only or have a lastUpdated timestamp. What were the problems you encountered with cursor positions in Redis? We were thinking of using DynamoDB for this.
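For concreteness, this is the sort of thing we had in mind with DynamoDB, sketched with boto3 (the table name, key and attribute names are just illustrative):

```python
import boto3

# Sketch of keeping per-source cursor positions in DynamoDB. The table is
# assumed to have a string hash key called "source_id".
dynamodb = boto3.resource("dynamodb")
cursors = dynamodb.Table("integration-cursors")

def load_cursor(source_id, default="1970-01-01T00:00:00Z"):
    """Return the lastUpdated timestamp we have already extracted up to."""
    item = cursors.get_item(Key={"source_id": source_id}).get("Item")
    return item["last_updated"] if item else default

def save_cursor(source_id, last_updated):
    # Only called after a batch has been tracked successfully, so the cursor is
    # pessimistic: a failed run can replay records, hence the need for dedup
    cursors.put_item(Item={"source_id": source_id, "last_updated": last_updated})
```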
In any case, I agree that you need good deduplication (though not source-specific deduplication, as that’s a maintenance nightmare), given that a source will update its cursor pessimistically and records may therefore come through twice.
Phew! Ditto apologies for the length of this reply, but it’s an interesting topic. A final question: are you open to open-sourcing/contributing any of your source integrations? It feels like it would be easier to discuss your experiences in all this with some of the code in front of us.