Autoscaling Spark Structured Streaming applications

cmanrique · January 12, 2021, 7:09pm

Hi everyone!

To give you a little bit of background. I’m working on a project where all the infrastructure is on AWS. We are using Autoscaling Groups for components like Collector and Enrich and it’s working great so far. Also since the capacity of kinesis streams will limit the speed of those components we already started scaling the kinesis streams.

While working on making the overall capacity of the pipeline more elastic we decided to investigate how to autoscale our Spark applications.

We are using Spark Structure Streaming to run different aggregations on enriched events. After giving a try to Dynamic Allocation (Spark built-in autoscaling feature) we found that it doesn’t work well. Basically confirming this:

In Spark Streaming, data comes in batches, and executors run whenever data is available. If the executor idle timeout is less than the batch duration, executors are constantly added and removed. However, if the executor idle timeout is greater than the batch duration, executors are never removed. Therefore, Cloudera recommends that you disable dynamic allocation by setting spark.dynamicAllocation.enabled to false when running streaming applications.
Spark Streaming and Dynamic Allocation

I concluded that it’s better to build a custom autoscaling logic using the Spark Developer API and handle the autoscaling based on the duration of the last x number of batches and depending on that request or remove executors using requestExecutors and killExecutor methods that are available on SparkContext. Very similar to what’s explained here.

Have you faced this same problem? Can you please provide some suggestions?

Thanks in advance!

BenB · January 13, 2021, 1:21pm

Hi @cmanrique,

Welcome to the Snowplow community !

Thanks for the details. Internally we use Amazon Kinesis Data Analytics to perform aggregation on Kinesis data streams. The big advantage of doing so is that this is serverless and autoscaling.

Otherwise, have you considered computing aggregates directly from a data warehouse ?

cmanrique · January 15, 2021, 3:50pm

Hi @BenB!

I don’t think computing aggregates directly from a data warehouse would be a good option for us since we want as little latency as possible between the time we process the records to the time users see the results. Let me know if you think this assumption is incorrect.

On the other hand Kinesis Data Analytics looks pretty interesting. Are you using Flink to compute aggregates?

BenB · January 15, 2021, 5:53pm

Hi @cmanrique ,

Indeed if you care about latency then computing the aggregates directly from the stream sounds like a better approach.

Yes we are using Flink, we used these libs to write our aggregating job :

"com.amazonaws"     %  "aws-kinesisanalytics-runtime" % "1.1.0"
"com.amazonaws"     %  "aws-kinesisanalytics-flink"   % "1.1.0"
"org.apache.flink"  %% "flink-streaming-scala"        % "1.8.1" % Provided

Topic		Replies	Views
Autoscaling Kinesis in AWS stream architecture AWS real-time pipeline	2	1858	May 7, 2020
Why is Snowplow using Kinesis/Kafka for real-time pipeline? AWS real-time pipeline	4	6038	July 12, 2016
Scala Stream Collector and Kinesis Shards AWS real-time pipeline	1	1800	September 13, 2017
Resharding Kinesis and the Enricher AWS real-time pipeline	3	2241	September 23, 2016
Scala Stream Collector - scaling Collectors	7	3520	January 25, 2017

Autoscaling Spark Structured Streaming applications

Related topics