Stream Collector - Accept connections via Thrift or Protobuf?

We run Snowplow Collectors across two data centers, so we have to pay for outbound bandwidth from one of them, and that egress is our biggest cost.

From what I understand, Snowplow Stream Collectors accept inbound requests with JSON payloads, and then serialise the message to Thrift before passing it to the data sink.

We’d like to optimise the data connection that emits the JSON requests on the outbound platform, and an obvious optimisation is to move from JSON to a more compact format like Protobuf or Thrift.

Has anyone optimised the data collection in this way? One possibility is to move the Snowplow data collector to the outbound platform, and then the Thrift message gets sent over the wire to the sink on the inbound platform.

Any advice and feedback is most welcome! And in the short term, we unfortunately cannot consolidate our data centers.


How do the two Snowplow collectors communicate (if at all)? Is this on-prem or in GCP / AWS? Just trying to figure out at what stage of the pipeline this egress is coming from.

Yes, there are other supported MIME types as well, but if you are using the POST endpoints this is mostly JSON.

Are you able to sketch this infrastructure out? The payload is serialised to Thrift at the collector, so it should be reasonably efficient to relay post-collection.

This is doable - if you mean optimising the format of the data between the tracker and the collector. The trackers could (but don’t currently) support Protobuf and, depending on how many / which trackers you use, this wouldn’t be a huge amount of effort to support. The larger effort would likely be on the collector side, which would have to deserialise Protobuf and then serialise to Thrift. I don’t think this would need to be particularly bespoke for Snowplow, but it might be better just to consider having a gRPC endpoint exposed by Akka on the collector at that point.
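
To make the shape of that collector-side change a bit more concrete, here is a minimal Scala sketch (nothing the collector supports today). It assumes a hypothetical ScalaPB-style `TrackerPayloadPb` message, a hypothetical `toThriftCollectorPayload` converter and a made-up `/com.acme/pb2` path; the real work would be defining that schema and converter.

```scala
import akka.http.scaladsl.model.StatusCodes
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route

object ProtobufCollectorSketch {

  // Stand-ins for pieces that do not exist in Snowplow today (purely hypothetical):
  // a ScalaPB-style generated message and a converter to the existing Thrift bytes.
  final case class TrackerPayloadPb(appId: String, events: Seq[String])
  object TrackerPayloadPb {
    def parseFrom(bytes: Array[Byte]): TrackerPayloadPb = ??? // ScalaPB would generate this
  }
  def toThriftCollectorPayload(p: TrackerPayloadPb): Array[Byte] = ??? // map onto the Thrift CollectorPayload

  // A collector route accepting a Protobuf body on a hypothetical /com.acme/pb2 path,
  // re-serialising it to Thrift before handing it to the existing sink.
  def protobufRoute(sinkWrite: Array[Byte] => Unit): Route =
    path("com.acme" / "pb2") {
      post {
        entity(as[Array[Byte]]) { body =>
          val payload = TrackerPayloadPb.parseFrom(body)
          sinkWrite(toThriftCollectorPayload(payload))
          complete(StatusCodes.OK)
        }
      }
    }
}
```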

We actually did a hackathon to experiment with gRPC/Protobuf a while back. In case it’s of interest, here are the relevant branches:


We send Snowplow events from AWS to our Snowplow Collector on GCP. The Snowplow Collector then places the events in GCP PubSub. We have complete control over how the event is sent from AWS.

(event from internet) → AWS (processing and packaging) (via HTTP) → Snowplow Collector @ GCP (via JSON) → GCP PubSub (via Thrift)

I am not wedded to Protobuf; I just want to find the best balance between compression to save bandwidth and the processing power needed to compress/decompress at the endpoints. If using Thrift as the wire protocol means that we don’t need to deserialise and reserialise at the Collector level, even better.

We ran a test compressing our data via gzip and swamped our servers. There are better methods (and a lot of methods) still to test.
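
For what it’s worth, a rough sketch of how that trade-off could be measured before committing to a method: plain `java.util.zip` at different deflate levels, with a toy JSON string standing in for a real batch of tracker payloads.

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.{Deflater, DeflaterOutputStream}

object CompressionTradeOffSketch extends App {

  // Compress a sample payload at different deflate levels to compare output size
  // against the CPU time spent. Level 1 (BEST_SPEED) is often a reasonable middle
  // ground when maximum compression overwhelms the senders.
  def deflate(bytes: Array[Byte], level: Int): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val dos = new DeflaterOutputStream(out, new Deflater(level))
    dos.write(bytes)
    dos.close()
    out.toByteArray
  }

  // Toy stand-in for a real batch of tracker payloads (values are illustrative only).
  val sample: Array[Byte] =
    ("""{"schema":"iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4",""" +
      """"data":[{"e":"pv","url":"https://example.com/products"}]}""").getBytes("UTF-8")

  for (level <- Seq(Deflater.BEST_SPEED, Deflater.DEFAULT_COMPRESSION, Deflater.BEST_COMPRESSION)) {
    val start      = System.nanoTime()
    val compressed = deflate(sample, level)
    val millis     = (System.nanoTime() - start) / 1e6
    println(f"level $level%2d: ${sample.length} -> ${compressed.length} bytes in $millis%.2f ms")
  }
}
```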

If anyone has spent time in this area to optimise, I’d love to hear more!

Thanks @Colm ! We’ll check this out.

Is the processing and packaging here an AWS load balancer that then forwards the request on to the GCP collector, or something else more involved?

This is a pretty fine balance and is going to depend on your latency requirements, but given you are doing cross-cloud transfers I’d probably optimise for compression over processing power.

If you have to run this cross-cloud I’d be tempted to go:

(event) → AWS (via HTTP) → Snowplow Collector @ AWS → GCP PubSub (via Thrift).

You’ll still be incurring egress, but the Thrift payload should for the most part be smaller than your HTTP request (this will depend a little bit on what you are doing with header forwarding).

If you wanted to minimise this egress even further I think it’d be worth investigating the switch from binary protocol to compact protocol for the collector / enricher.

I’ve had a bit more of a think about this and Thrift compact actually makes minimal difference - mostly because the string bytes themselves are encoded identically between the binary and compact versions (only the length prefixes and field headers shrink), so you are only saving a few bytes here and there as the CollectorPayload structure is mostly strings. I suspect Protobuf will probably suffer from the same problem - though it does have the concept of compressing the structure itself.
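
A back-of-the-envelope sketch of why the saving is so small, counting only the length prefixes (the field values below are illustrative, not real payloads):

```scala
object ThriftProtocolSizeSketch extends App {

  // Not the real Thrift encoders: the binary protocol prefixes each string with a
  // fixed 4-byte length, the compact protocol with a varint length. The string bytes
  // themselves are identical, so for the mostly-string CollectorPayload the saving
  // is only a few bytes per field.
  def varintSize(n: Int): Int =
    if (n < 0x80) 1 else if (n < 0x4000) 2 else if (n < 0x200000) 3 else 4

  def binarySize(s: String): Int  = 4 + s.getBytes("UTF-8").length
  def compactSize(s: String): Int = varintSize(s.getBytes("UTF-8").length) + s.getBytes("UTF-8").length

  // Representative string fields of a raw collector payload (illustrative values only).
  val fields = List(
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36", // userAgent
    "/com.snowplowanalytics.snowplow/tp2",                                // path
    """{"schema":"iglu:...","data":[...]}"""                              // body
  )

  val saved = fields.map(s => binarySize(s) - compactSize(s)).sum
  println(s"compact saves ~$saved of ${fields.map(binarySize).sum} bytes across these string fields")
}
```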

In that case, possibly the easiest way to make the payload smaller is to restrict / filter the request headers coming into the collector, as this list of strings can be quite large and may not all be needed (e.g., cookie values, X- headers, trace headers etc.).

Thanks Mike - I appreciate your thoughts. The two options that we are evaluating are:

  1. Move the Snowplow Collector to AWS, so the over-the-wire communication is via Thrift and therefore carries no request headers. The downside is that it is major surgery, likely with lots of tuning, and if we add additional enrichments these would be sent over the wire as well.

  2. Reduce/filter the request headers. The downside is that it would be hard to understand/manage (depending on the implementation), but it would be a straightforward change.

If you are adding enrichments (and enrich is running in GCP already), this shouldn’t add any overhead to the raw payloads, which will just be the contents of the request itself.

I suspect there are different approaches here depending on what is happening at the AWS processing / packaging stage, but you could remove the headers at the load balancer or even at the collector itself (probably easiest to keep an ‘allow list’ of headers here).
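
To illustrate the allow-list idea, a minimal sketch that filters the raw header strings; the list itself is illustrative and which headers you keep will depend on what your enrichments need.

```scala
object HeaderAllowListSketch {

  // Illustrative allow list only: which headers you actually need depends on your
  // enrichments (cookie values, X- headers and trace headers are dropped by omission).
  val allowedHeaders: Set[String] = Set(
    "user-agent",
    "referer",
    "content-type",
    "accept-language",
    "x-forwarded-for"
  )

  // Filter the raw "Name: value" header strings before they are put into the
  // Thrift CollectorPayload (or apply the same logic at the load balancer).
  def filterHeaders(headers: List[String]): List[String] =
    headers.filter { header =>
      val name = header.takeWhile(_ != ':').trim.toLowerCase
      allowedHeaders.contains(name)
    }
}
```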