We are using the Scala Stream Collector and Scala Stream Enrich to feed into Kinesis Analytics. We analysed our setup for security vulnerabilities using ZAProxy and found that extra special characters are allowed through, which causes the data load to Redshift to fail.
I have 2 questions:
Has anyone observed this before?
Are there any known best practices we can use to protect ourselves against these?
Doesn’t seem like it.
We tried sending “\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\W” as data in one of the fields. It passes through the Scala collector and Scala Stream Enrich without complaint, but when we try to COPY it into Redshift the copy fails, complaining that the length exceeds the DDL.
To give more background on our setup: we have Scala collector -> Kinesis stream -> Scala Stream Enrich -> Kinesis stream -> Kinesis Analytics -> Kinesis Firehose -> Redshift. We are mostly tracking unstructured events with browser and mobile contexts etc. We were able to protect our custom schemas with regexes/lengths etc., but we don’t want to manage/maintain the Snowplow schemas ourselves in our repo. Hence the question!
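For reference, constraining a custom schema by regex/length as described above looks like this in an Iglu self-describing JSON Schema. This is a sketch; the vendor, event name, and field names here are made-up examples, not part of any real deployment:

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Example event with length and pattern constraints",
  "self": {
    "vendor": "com.example",
    "name": "example_event",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "account_id": {
      "type": "string",
      "maxLength": 64,
      "pattern": "^[A-Za-z0-9_-]+$"
    }
  },
  "additionalProperties": false
}
```

Events failing `maxLength` or `pattern` are rejected at validation time instead of reaching the warehouse load.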
What field are you sending this through? I’m not sure if this is a security issue so much as a value exceeding a database length (which ideally shouldn’t happen, but having a non-zero MAXERROR on a Redshift load isn’t uncommon).
That’s not a supported architecture - a Snowplow pipeline loading Redshift would use our Kinesis S3 Loader and our standard load process - so it’s difficult for us to treat this as an issue in Snowplow…
@mike all the fields were manipulated. The idea is to send garbage data, as shown in the example, and see how the pipeline behaves. For example, sending this in aid or eid passes the enricher, which then breaks the pipeline downstream. If we were to protect against this, how would we do it? I don’t think it should matter what the consumer of the processed data is.
I think this might fail with the way you are loading data via Firehose but \\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\W is a valid appid and should be loaded into Redshift or any other downstream targets correctly in the standard pipeline.
But you are not seeing how the pipeline behaves - because you are not using the Snowplow pipeline downstream. If the same issue occurs with the actual Snowplow pipeline, we can file a bug in the offending Snowplow project.
I am new to Snowplow.
I wonder if we could prevent this type of data from passing through the collector and enricher.
How do the collector and enricher handle object deserialization of untrusted data, which can lead to remote code execution?
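One pragmatic guard, outside the collector and enricher themselves, is to validate field values against your Redshift DDL limits before (or as) they reach the load step, rejecting anything that exceeds a declared length or pattern. A minimal sketch in Python; the field names, limits, and patterns below are assumptions for illustration, not Snowplow defaults:

```python
import re

# Hypothetical per-field rules mirroring the Redshift DDL limits
FIELD_RULES = {
    "app_id":   {"max_len": 255, "pattern": re.compile(r"^[\w.\-]+$")},
    "event_id": {"max_len": 36,  "pattern": re.compile(r"^[0-9a-fA-F-]+$")},
}

def validate_field(name: str, value: str) -> bool:
    """Return True if the value fits the declared length and pattern."""
    rule = FIELD_RULES.get(name)
    if rule is None:
        return True  # no rule declared for this field; let it through
    return len(value) <= rule["max_len"] and bool(rule["pattern"].match(value))
```

A payload like the `\…` string above fails both the length and the pattern check for `app_id`, so it would be quarantined (e.g. sent to a bad-rows stream) rather than breaking the Redshift COPY.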
Neither the collector nor the enricher handle data in ways that can lead to remote code execution - none of our components do.
If you have any suspicion that this isn’t the case, all of our code is open-source, and we’d welcome a report - in fact we have a disclosure program. We’ve also had extensive pen testing done across our tech estate, and we take security concerns quite seriously.
This thread is 4 years old and concerns someone trying to intentionally break the pipeline with unexpected characters (not RCE), and finding that the only non-Snowplow maintained part of their architecture was what broke.
I’m going to close the thread here because it’s so old, and just from a forum hygiene point of view it’s better for us not to have very old posts brought back up.
However please don’t take that as discouragement from opening a discussion on this (or any other) topic. If you have any questions or concerns about security at all, you’re welcome to raise them - I’ll just ask you to open a new thread for it.