Stream Collector - Accept connections via Thrift or Protobuf?

We run Snowplow Collectors across two data centers, so we have to pay for outbound bandwidth from one of them, and that egress is our biggest cost.

From what I understand, Snowplow Stream Collectors accept inbound requests with JSON payloads, and then serialise the message to Thrift before passing it to the data sink.

We’d like to optimise the data connection that emits the JSON requests on the outbound platform, and an obvious optimisation is to move from JSON to a more compact format like Protobuf or Thrift.

Has anyone optimised the data collection in this way? One possibility is to move the Snowplow data collector to the outbound platform, and then the Thrift message gets sent over the wire to the sink on the inbound platform.

Any advice and feedback is most welcome! And in the short term, we unfortunately cannot consolidate our data centers.


How do the two Snowplow collectors communicate (if at all)? Is this on-prem or in GCP / AWS? Just trying to figure out at what stage of the pipeline this egress is coming from.

Yes, there are other supported MIME types as well, but if you are using the POST endpoints this is mostly JSON.

Are you able to sketch this infrastructure out? The payload is serialised to Thrift at the collector, so it should be reasonably efficient to relay post-collection.

This is doable - if you mean optimising the format of the data between the tracker and the collector. The trackers could (but don’t currently) support Protobuf and, depending on how many / which trackers you use, this wouldn’t be a huge amount of effort to support. The larger effort would likely be on the collector side, which would have to deserialise Protobuf and then serialise to Thrift. I don’t think this would need to be particularly bespoke for Snowplow, but it might be better just to consider having a gRPC endpoint exposed by Akka on the collector at that point.
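
To make the shape of that collector-side change a bit more concrete, here is a minimal Scala sketch (nothing the collector supports today). It assumes a hypothetical ScalaPB-style `TrackerPayloadPb` message, a hypothetical `toThriftCollectorPayload` converter and a made-up `/com.acme/pb2` path; the real work would be defining that schema and converter.

```scala
import akka.http.scaladsl.model.StatusCodes
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route

object ProtobufCollectorSketch {

  // Stand-ins for pieces that do not exist in Snowplow today (purely hypothetical):
  // a ScalaPB-style generated message and a converter to the existing Thrift bytes.
  final case class TrackerPayloadPb(appId: String, events: Seq[String])
  object TrackerPayloadPb {
    def parseFrom(bytes: Array[Byte]): TrackerPayloadPb = ??? // ScalaPB would generate this
  }
  def toThriftCollectorPayload(p: TrackerPayloadPb): Array[Byte] = ??? // map onto the Thrift CollectorPayload

  // A collector route accepting a Protobuf body on a hypothetical /com.acme/pb2 path,
  // re-serialising it to Thrift before handing it to the existing sink.
  def protobufRoute(sinkWrite: Array[Byte] => Unit): Route =
    path("com.acme" / "pb2") {
      post {
        entity(as[Array[Byte]]) { body =>
          val payload = TrackerPayloadPb.parseFrom(body)
          sinkWrite(toThriftCollectorPayload(payload))
          complete(StatusCodes.OK)
        }
      }
    }
}
```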

We actually did a hackathon to experiment with gRPC/Protobuf a while back. In case it’s of interest, here are the relevant branches:


We send Snowplow events from AWS to our Snowplow Collector on GCP. The Snowplow Collector then places the events in GCP PubSub. We have complete control over how the event is sent from AWS.

(event from internet) → AWS (processing and packaging) (via HTTP) → Snowplow Collector @ GCP (via JSON) → GCP PubSub (via Thrift)

I am not wedded to Protobuf; I just want to find the best balance between compression to save bandwidth and the processing power needed to compress/decompress at the endpoints. If using Thrift as the wire protocol means that we don’t need to deserialise and reserialise at the Collector level, even better.

We ran a test compressing our data via gzip and swamped our servers. There are better methods (and a lot of methods) still to test.
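
For what it’s worth, a rough sketch of how that trade-off could be measured before committing to a method: plain `java.util.zip` at different deflate levels, with a toy JSON string standing in for a real batch of tracker payloads.

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.{Deflater, DeflaterOutputStream}

object CompressionTradeOffSketch extends App {

  // Compress a sample payload at different deflate levels to compare output size
  // against the CPU time spent. Level 1 (BEST_SPEED) is often a reasonable middle
  // ground when maximum compression overwhelms the senders.
  def deflate(bytes: Array[Byte], level: Int): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val dos = new DeflaterOutputStream(out, new Deflater(level))
    dos.write(bytes)
    dos.close()
    out.toByteArray
  }

  // Toy stand-in for a real batch of tracker payloads (values are illustrative only).
  val sample: Array[Byte] =
    ("""{"schema":"iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4",""" +
      """"data":[{"e":"pv","url":"https://example.com/products"}]}""").getBytes("UTF-8")

  for (level <- Seq(Deflater.BEST_SPEED, Deflater.DEFAULT_COMPRESSION, Deflater.BEST_COMPRESSION)) {
    val start      = System.nanoTime()
    val compressed = deflate(sample, level)
    val millis     = (System.nanoTime() - start) / 1e6
    println(f"level $level%2d: ${sample.length} -> ${compressed.length} bytes in $millis%.2f ms")
  }
}
```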

If anyone has spent time in this area to optimise, I’d love to hear more!

Thanks @Colm ! We’ll check this out.

Is the processing and packaging here an AWS load balancer that then forwards the request on to the GCP collector, or something else more involved?

This is a pretty fine balance and is going to depend on your latency requirements, but given you are doing cross-cloud transfers I’d probably optimise for compression over processing power.

If you have to run this cross-cloud I’d be tempted to go:

(event) → AWS (via HTTP) → Snowplow Collector @ AWS → GCP PubSub (via Thrift).

You’ll still be incurring egress, but the Thrift payload should for the most part be smaller than your HTTP request (this will depend a little bit on what you are doing with header forwarding).

If you wanted to minimise this egress even further I think it’d be worth investigating the switch from binary protocol to compact protocol for the collector / enricher.

I’ve had a bit more of a think about this and Thrift compact actually makes minimal difference - mostly because the string bytes themselves are encoded identically between the binary and compact versions (only the length prefixes and field headers shrink), so you are only saving a few bytes here and there as the CollectorPayload structure is mostly strings. I suspect Protobuf will probably suffer from the same problem - though it does have the concept of compressing the structure itself.
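
A back-of-the-envelope sketch of why the saving is so small, counting only the length prefixes (the field values below are illustrative, not real payloads):

```scala
object ThriftProtocolSizeSketch extends App {

  // Not the real Thrift encoders: the binary protocol prefixes each string with a
  // fixed 4-byte length, the compact protocol with a varint length. The string bytes
  // themselves are identical, so for the mostly-string CollectorPayload the saving
  // is only a few bytes per field.
  def varintSize(n: Int): Int =
    if (n < 0x80) 1 else if (n < 0x4000) 2 else if (n < 0x200000) 3 else 4

  def binarySize(s: String): Int  = 4 + s.getBytes("UTF-8").length
  def compactSize(s: String): Int = varintSize(s.getBytes("UTF-8").length) + s.getBytes("UTF-8").length

  // Representative string fields of a raw collector payload (illustrative values only).
  val fields = List(
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36", // userAgent
    "/com.snowplowanalytics.snowplow/tp2",                                // path
    """{"schema":"iglu:...","data":[...]}"""                              // body
  )

  val saved = fields.map(s => binarySize(s) - compactSize(s)).sum
  println(s"compact saves ~$saved of ${fields.map(binarySize).sum} bytes across these string fields")
}
```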

In that case, possibly the easiest way to make the payload smaller is to restrict / filter the request headers coming into the collector, as this list of strings can be quite large and may not all be needed (e.g., cookie values, X- headers, trace headers etc.).

Thanks Mike - I appreciate your thoughts. The two options that we are evaluating are:

  1. Move the Snowplow Collector to AWS, so the over-the-wire communication is via Thrift and therefore carries no request headers. The downside is that it is major surgery, likely with lots of tuning, and if we add additional enrichments these would be sent over the wire as well.

  2. Reduce/filter the request headers. The downside is that it would be hard to understand/manage (depending on the implementation), but it would be a straightforward change.

If you are adding enrichments (and enrich is running in GCP already), this shouldn’t add any overhead to the raw payloads, which will just be the contents of the request itself.

I suspect there are different approaches here depending on what is happening at the AWS processing / packaging stage, but you could remove the headers at the load balancer or even at the collector itself (probably easiest to keep an ‘allow list’ of headers here).
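
To illustrate the allow-list idea, a minimal sketch that filters the raw header strings; the list itself is illustrative and which headers you keep will depend on what your enrichments need.

```scala
object HeaderAllowListSketch {

  // Illustrative allow list only: which headers you actually need depends on your
  // enrichments (cookie values, X- headers and trace headers are dropped by omission).
  val allowedHeaders: Set[String] = Set(
    "user-agent",
    "referer",
    "content-type",
    "accept-language",
    "x-forwarded-for"
  )

  // Filter the raw "Name: value" header strings before they are put into the
  // Thrift CollectorPayload (or apply the same logic at the load balancer).
  def filterHeaders(headers: List[String]): List[String] =
    headers.filter { header =>
      val name = header.takeWhile(_ != ':').trim.toLowerCase
      allowedHeaders.contains(name)
    }
}
```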