We are using the Scala Stream Collector and Scala Stream Enrich to feed into Kinesis Analytics. We analysed our setup for security vulnerabilities using ZAProxy and found that extra special characters are allowed through, which causes the data load to Redshift to fail.
I have 2 questions:
Has anyone observed this before?
Are there any known best practices we can use to protect ourselves against these?
Doesn’t seem like it.
We tried sending “\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\W” as data in one of the fields. It passes through the Scala collector and Scala Stream Enrich without complaint, but when we try to COPY it into Redshift the copy fails, complaining that the length exceeds the DDL.
To give more background on our setup: we have Scala collector -> Kinesis stream -> Scala Stream Enrich -> Kinesis stream -> Kinesis Analytics -> Kinesis Firehose -> Redshift. We are mostly tracking unstructured events with browser and mobile contexts etc. We were able to protect our custom schemas with regexes/lengths etc., but we don’t want to manage/maintain the Snowplow schemas ourselves in our repo. Hence the question!
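For reference, constraining a custom schema by regex/length as described above looks like this in an Iglu self-describing JSON Schema. This is a sketch; the vendor, event name, and field names here are made-up examples, not part of any real deployment:

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Example event with length and pattern constraints",
  "self": {
    "vendor": "com.example",
    "name": "example_event",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "account_id": {
      "type": "string",
      "maxLength": 64,
      "pattern": "^[A-Za-z0-9_-]+$"
    }
  },
  "additionalProperties": false
}
```

Events failing `maxLength` or `pattern` are rejected at validation time instead of reaching the warehouse load.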
What field are you sending this through? I’m not sure if this is a security issue so much as a value exceeding a database length (which ideally shouldn’t happen, but having a non-zero MAXERROR on a Redshift load isn’t uncommon).
That’s not a supported architecture - a Snowplow pipeline loading Redshift would use our Kinesis S3 Loader and our standard load process - so it’s difficult for us to treat this as an issue in Snowplow…
@mike all the fields were manipulated. The idea is to send garbage data, as shown in the example, and see how the pipeline behaves. For example, sending this in aid or eid passes the enricher, which then breaks the pipeline downstream. If we were to protect against this, how would we do it? I don’t think it should matter what the consumer of the processed data is.
I think this might fail with the way you are loading data via Firehose but \\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\W is a valid appid and should be loaded into Redshift or any other downstream targets correctly in the standard pipeline.
But you are not seeing how the pipeline behaves - because you are not using the Snowplow pipeline downstream. If the same issue occurs with the actual Snowplow pipeline, we can file a bug in the offending Snowplow project.
I am new to Snowplow.
I wonder if we could prevent this type of data from passing through the collector and enricher.
How do the collector and enricher handle object deserialization of untrusted data, which can lead to remote code execution?
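One pragmatic guard, outside the collector and enricher themselves, is to validate field values against your Redshift DDL limits before (or as) they reach the load step, rejecting anything that exceeds a declared length or pattern. A minimal sketch in Python; the field names, limits, and patterns below are assumptions for illustration, not Snowplow defaults:

```python
import re

# Hypothetical per-field rules mirroring the Redshift DDL limits
FIELD_RULES = {
    "app_id":   {"max_len": 255, "pattern": re.compile(r"^[\w.\-]+$")},
    "event_id": {"max_len": 36,  "pattern": re.compile(r"^[0-9a-fA-F-]+$")},
}

def validate_field(name: str, value: str) -> bool:
    """Return True if the value fits the declared length and pattern."""
    rule = FIELD_RULES.get(name)
    if rule is None:
        return True  # no rule declared for this field; let it through
    return len(value) <= rule["max_len"] and bool(rule["pattern"].match(value))
```

A payload like the `\…` string above fails both the length and the pattern check for `app_id`, so it would be quarantined (e.g. sent to a bad-rows stream) rather than breaking the Redshift COPY.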
Neither the collector nor the enricher handle data in ways that can lead to remote code execution - none of our components do.
If you have any suspicion that this isn’t the case, all of our code is open-source, and we’d welcome a report - in fact we have a disclosure program. We’ve also had extensive pen testing done across our tech estate, and we take security concerns quite seriously.
This thread is 4 years old and concerns someone trying to intentionally break the pipeline with unexpected characters (not RCE), and finding that the only non-Snowplow maintained part of their architecture was what broke.
I’m going to close the thread here because it’s so old, and just from a forum hygiene point of view it’s better for us not to have very old posts brought back up.
However please don’t take that as discouragement from opening a discussion on this (or any other) topic. If you have any questions or concerns about security at all, you’re welcome to raise them - I’ll just ask you to open a new thread for it.