Support for multiple emitters in the mobile trackers

I came across this thread on the old user list regarding support for multiple emitters: https://groups.google.com/forum/#!topic/snowplow-user/6ELjJPGRPjU

Any update? I would like to do the same thing: send data to both a realtime and batch collector for one of the workflows I’m attempting to support.

Hi @dcow - no update on this currently, it somewhat fell off the radar. I’ve created a meta-ticket to track the feature:

Add support for multiple emitters #2867

So the interim solution is just to use multiple trackers?

Hi @dcow, unfortunately, due to how the Emitter objects persist data, it is not possible to run more than one Tracker at a time. They are all hardcoded to point to one database and one table within that database. There is also no metadata stored with the event to indicate which Emitter should send it.
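
To make the constraint concrete, here is a minimal sketch of the situation described above. It is not the tracker's actual code: the table layout, the `track` and `flush` helpers and the collector hostnames are all assumptions made up for illustration. It only shows why a single shared table with no per-emitter metadata cannot route events to two collectors.

```python
import sqlite3

# Hypothetical stand-in for the on-device event store: one database,
# one table, and no column recording which emitter an event belongs to.
def open_store():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    return conn

def track(conn, payload):
    """Persist an event; note that no target emitter is stored with it."""
    conn.execute("INSERT INTO events (payload) VALUES (?)", (payload,))

def flush(conn, collector_uri):
    """Send every pending event to collector_uri and clear the table."""
    rows = conn.execute("SELECT id, payload FROM events ORDER BY id").fetchall()
    conn.execute("DELETE FROM events")
    return [(collector_uri, payload) for _, payload in rows]

store = open_store()
track(store, "batch_event")      # intended for the batch collector
track(store, "realtime_event")   # intended for the real-time collector

# Both emitters read the same table, so whichever flushes first takes everything.
sent_by_batch = flush(store, "batch.example.com")
sent_by_realtime = flush(store, "realtime.example.com")
```

In this sketch the first emitter to flush sends both events to its own collector, and the second finds the table empty.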

Hi @josh,

What’s the update on this? Is there any timeline for supporting multiple emitters from the mobile trackers?

And if it is not currently supported, what would be the best way to send some events to the batch pipeline and some to the real-time pipeline?

Hi @rahul - there’s no timeline for this support at this time.

Have you tried defining a second tracker instance - what’s the roadblock that you hit doing this?

Hi @alex, how does creating 2 instances help? I need to track 2 types of events: one in real time and the other for batch processing. For this I am planning to have 2 collectors: the Clojure Collector and the Scala Stream Collector.

To do this I need to have 2 emitter objects pointing to 2 different collectors. Correct me if I am wrong.

What workaround would you suggest in this case?

Hi @alex,

Your input would be very helpful here. We use Snowplow heavily, and our use case demands sending events to the batch pipeline and the real-time pipeline separately. We will be sending ~100 million events every day to each pipeline (i.e. 200 million events/day in total).

It would be great if you could help us with a workaround. We would also love to have this feature (allowing 2 emitters to send events to 2 different collectors) implemented by Snowplow in their stack.

There is one more question you can help me with. For real-time events we are planning to use Kafka instead of Kinesis.

The documentation for Snowplow 85 Metamorphosis says that Kafka support is in beta:

  • Is it still in beta? Can we use it in production?
  • If yes, how many companies are using it in production? (It would be great if you could name a few.)

These answers will let us move ahead with confidence on this implementation.

Hi @rahul - the Snowplow Kafka support is still relatively immature compared to our AWS batch and real-time pipelines. Please do share your findings as you roll out your deployment through testing into production.

We’re not aware of a workaround - I think the most straightforward solution here is for the multiple emitters to be supported by the trackers.

@alex thank you for your reply. We will be sure to share our findings here. We will experiment with multiple emitters sending data to multiple collectors and also with the Snowplow Kafka pipeline.
Also, how can I raise a request to get this feature implemented in the Snowplow stack? I believe that sooner or later many others will also need multiple emitters in order to send events to multiple collectors.

Hi @rahul - we’ll reach out separately to discuss the multiple emitters feature with you.

@alex
Is there any update on this? We are facing an issue with 2 emitters on 1 tracker. Suppose a user’s activity triggers events on the batch pipeline with session-id ‘X’, and the user then performs some activity that triggers events on the real-time pipeline with session-id ‘Y’.
When the same user returns to the batch pipeline activity, the session-id is no longer ‘X’: Snowplow generates a new session-id for the new event.

Because of this, the number of sessions in our tracking data is hugely inflated, which leads to incorrect analysis.

Can you please help us with this?

We are facing this issue with the Android tracker specifically.

Thanks

Hey @rahul - the best people to talk to about this are @yali and @mhadam. All the best.

Thanks @alex.

@yali @mhadam

It would be great if you could help us with this.

Hi @rahul,

Many thanks for raising the request and apologies for the delay replying. We’ve had this ticket for a while. However, we haven’t had many people make the request for mobile specifically - if anything, we’ve had lots of people ask us to implement downstream technology (e.g. relays) in order to:

  1. Avoid more SDK bloat in the apps themselves
  2. Reduce the network traffic coming off the mobile app

Just to confirm - (2) doesn’t seem to be a concern in your case?

Can I be a bit nosey - can you share any more about where the requirement comes from, i.e. why do you want to send some data to a batch pipeline and some to a real-time pipeline? Would having a single real-time pipeline and then a micro-service that triages the data (e.g. that reads the enriched stream and writes out to different downstream streams) be preferable, both in terms of simplifying the mobile implementation and consolidating your data processing infrastructure?
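
As a rough illustration of the triage idea, here is a minimal sketch under stated assumptions: enriched events are represented as plain dictionaries with an `event_name` key, and the set of real-time event names is invented for the example. A real service would read from and write to actual streams rather than in-memory lists.

```python
from collections import defaultdict

# Hypothetical set of event names that need real-time handling.
REALTIME_EVENTS = {"checkout_started", "payment_failed"}

def triage(enriched_events, realtime_events=REALTIME_EVENTS):
    """Split one enriched stream into named downstream streams."""
    streams = defaultdict(list)
    for event in enriched_events:
        # Route on the event name; everything else goes to the batch stream.
        if event.get("event_name") in realtime_events:
            streams["realtime"].append(event)
        else:
            streams["batch"].append(event)
    return dict(streams)

events = [
    {"event_name": "page_view", "user": "u1"},
    {"event_name": "payment_failed", "user": "u1"},
    {"event_name": "page_view", "user": "u2"},
]
routed = triage(events)
```

With this shape the app only ever talks to one collector, and the routing rule lives server-side where it can be changed without shipping a new app release.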

I want to make sure we understand the context of the requirement as well as possible before diving into the best updates to make to the stack to support them.

Thanks!

Yali

@yali
Thanks for replying.

Where does the requirement to track data on 2 different pipelines come from?
We track a huge amount of data on our platforms every day. We have been using the Snowplow batch pipeline for all our tracking for more than 3 years now. All of this batch pipeline data is processed once a day and then used. But now we have requirements to analyse data in real time for a few activities on our platform. Some of these activities are triggered from the same place where we already have batch pipeline tracking for other activities. So, from the same place, we have built logic to track events on different pipelines depending on the type of activity that is happening.

I hope the requirement is clear.

We haven’t thought of having all tracking on realtime and then using some micro-service to push data on to the batch pipeline.
Most of our tracking (~95%) is on batch, but the remaining 5% is no less important for us. It would be little difficult to move from batch to realtime immediately. We can think about this in near future.

What would you suggest as the next step for our use case?

thanks

@yali

Please let us know what you think about our use case, shared in my previous post. Also, could you suggest a workaround for the time being, until we get this as a feature in the Snowplow stack?

Thanks

Hi @rahul - thanks for the additional information.

In your situation, I would migrate to sending all events to the real-time pipeline. You would need to make sure you’d provisioned the real-time pipeline properly to cope with your full event volume, but then you’d simply need to update the CNAME in front of your existing batch pipeline to redirect all existing data through your real-time pipeline. You could set up the real-time pipeline to load the event data into the same data store that the current batch pipeline feeds (e.g. Redshift, Snowflake) so that there’d be no gaps in your data as you do the migration.

You’d then have the ability to extend real-time use cases to your full data set.

I am conscious though that you wrote:

It would be little difficult to move from batch to realtime immediately. We can think about this in near future.

Can you elaborate a bit more there - what difficulties do you envisage?

Hey @alex @yali, though this is quite an old post, I am curious to know whether there is any update on supporting multiple tracker instances in the Android tracker.

@yali We tried doing what you suggested (creating 2 tracker instances). Here are the details:

We made 2 trackers with the following specs:

Tracker A:
 - namespace: "A"
Emitter A:
 - uri: "tracking.staging.com"

Tracker B:
 - namespace: "B"
Emitter B:
 - uri: "tracking.prod.com"

I am sending Event A on Tracker A and Event B on Tracker B.

Logs:

Hitting Event A first,
D/MainActivity: Event A { tracker namespace : A , emitter uri : http://tracking.staging.com/i }
D/MainActivity: Event B { tracker namespace : A , emitter uri : http://tracking.staging.com/i }

Hitting Event B first,
D/MainActivity: Event B { tracker namespace : B , emitter uri : http://tracking.prod.com/i }
D/MainActivity: Event A { tracker namespace : B , emitter uri : http://tracking.prod.com/i }

What’s happening here is that only one tracker is ever used: whichever is set up first. Because of the singleton implementation and the logic of the init function, all events are sent on that tracker.

As the example above shows, we are not able to send different events to different collector endpoints.
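
The behaviour in the logs can be reproduced with a minimal sketch of a first-call-wins singleton. This is an illustration only, not the Android tracker's source: the `Tracker` class, its `init` method and the log format below are assumptions modelled on the output shown above.

```python
class Tracker:
    """Hypothetical singleton: only the first init call configures it."""
    _instance = None

    def __init__(self, namespace, emitter_uri):
        self.namespace = namespace
        self.emitter_uri = emitter_uri

    @classmethod
    def init(cls, namespace, emitter_uri):
        # First call wins; later calls ignore their arguments and
        # return the instance that already exists.
        if cls._instance is None:
            cls._instance = cls(namespace, emitter_uri)
        return cls._instance

    def track(self, event_name):
        return (f"{event_name} {{ tracker namespace : {self.namespace} , "
                f"emitter uri : http://{self.emitter_uri}/i }}")

tracker_a = Tracker.init("A", "tracking.staging.com")
tracker_b = Tracker.init("B", "tracking.prod.com")  # returns tracker A

lines = [tracker_a.track("Event A"), tracker_b.track("Event B")]
```

Here `tracker_b` is the same object as `tracker_a`, so Event B is reported with namespace A and the staging URI, matching the "Hitting Event A first" case.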

@yali In your last reply to this thread you suggested moving the whole event pipeline to realtime (or batch). We can consider that, but if by any chance Snowplow is planning support for multiple trackers, we would love to go with that approach first.

Thanks

Hi @rahul,

If the problem you’re looking to solve is to send Snowplow data to a real-time and a batch pipeline, my recommendation would still be to send it to a single real-time pipeline, esp. as we’re deprecating support for batch components, and have already deprecated the Clojure Collector and Spark Enrich. On the real-time side we plan to launch new versions of RDB Loader, that we hope will be more cost effective, deliver the data at lower latency, support Parquet in S3 and be more robust and autoscaling. So the case to upgrade should get much stronger :smiley:.

Is that the primary problem you’re looking to solve? Is there another use case for multiple emitters? It is something we can look to prioritise, but need to understand the use case to make the right prioritisation call.

Many thanks,

Yali