This will depend on how other things in our roadmap pan out, but I don’t think it will happen this month. There is a good chance it happens during the summer though
This is awesome, thanks @Colm - just what I’m looking for.
I’ve just tested it and although the following command appears to run without error, no events are being saved to file - any ideas? Appreciate any assistance you can offer.
sudo ./snowplow-event-generator --config ./snowplow-enrich-event-generator/config.hocon --output file:/snowplow-enrich-event-generator/kafka/my-events
Here are the contents of my config file:
{
  "seed": 1
  "payloadsTotal": 1000
  "withRaw": true
  "withEnrichedTsv": true
  "withEnrichedJson": true
  "compress": false
  "payloadsPerFile": 1000
  "eventPerPayloadMin": 1
  "eventPerPayloadMax": 1
  "duplicates": {
    "natProb": 0.0
    "synProb": 0.0
    "natTotal": 1
    "synTotal": 1
  }
  "timestamps": {
    "type": "Fixed"
    "at": "2022-02-01T01:01:01z"
  }
}
The only thing that jumps out to me is the output filepath - I might be wrong, but I think it expects an absolute path (tbh this project is something we chip away at for internal testing purposes rather than something we treat as a ‘product’, if that makes sense, so we haven’t had much of a focus on making it more user-friendly).
I’d try with either file:"$(pwd)/snowplow-enrich-event-generator/kafka/my-events" (assumes bash, tested on mac), or just the absolute path to your dir.
If I’m right I think it would’ve created a dir in your root folder, and the data will be in there.
(Edit: ./ might also work - I can’t remember why I used $(pwd) in a script from months ago tbh, but I did that for some reason, so that’s my guess for this case too.)
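Putting that together, here’s the same command as above with only the output URI changed to an absolute path (just a sketch - assumes bash and that you’re running it from the same directory as before):

sudo ./snowplow-event-generator --config ./snowplow-enrich-event-generator/config.hocon --output file:"$(pwd)/snowplow-enrich-event-generator/kafka/my-events"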
A few tweaks and we’re up and running. You’re a star - appreciate the help and quick response.
NP at all. Keep us posted, keen to hear how you get on
Hope you’re both well - just wondering if there has been any progress on this PR?
FYI: Within Azure Stream Analytics, you can now write an Event Hub / Kafka stream directly to Azure Data Lake Storage (Gen 2) in Delta Parquet (and Microsoft have just changed their Stream Analytics pricing model, so it is now significantly cheaper). It is currently in preview, but all being well it won’t be long until it’s GA.
Cheers,
Steve
Just wondering if there was an update on the PR above? Champing at the bit, as you may have gathered
Cheers,
Steve
Hi @stanch -
Hope all is good with you?
Wondering if there is any news / anything I can do to help on this one?
Look forward to hearing from you.
Cheers,
Steve
Hi @steve.gingell,
The Kafka source PR is from an external contributor, so we can’t really provide a timeline (unless you’d like to step in and help get it over the line, of course).
Regarding the overall Azure support, stay tuned for announcements next week! It looks like we’ll be able to release the new lake loading component this summer as planned.
Hi @stanch -
Good to hear from you, and I look forward to hearing next week’s announcements. Very exciting.
If I need to return to the PR option, I’m not sure I have the technical expertise, but if it’s a case of trying to coordinate, then I’m happy to help.
Thanks again for getting back to me and roll on next week
Cheers,
Steve
Hi @stanch -
Just saw the announcement and read your article: Announcing open source Azure support | Snowplow - very cool!!
Presumably, in my set-up, I can use just the Transformer Kafka from the RDB Loader, not worry about the Loader Snowflake implementation, and then consume the Transformer Kafka output myself in a custom application - is that right?
Thanks again for all of your help, Nick.
Cheers,
Steve
I was hoping that the Transformer Kafka output was another Kafka stream, but it looks like it’s blob storage - is that right? To switch it from blob storage to Kafka stream would, presumably, require dev work my side?
Correct. It’s an intermediate blob storage location that’s used as a staging area by the loader. If you are already reading from the enriched Kafka/EventHubs stream (which the Enrich application writes to), then you will not benefit from Transformer/Loader.
You might, however, benefit from our Terraform modules to run the rest of the pipeline. And hopefully, in a few weeks we should have a new dedicated lake loader as well.
Thanks for the quick response, @stanch, and congrats again on this Azure integration milestone
The reason I mention the Transformer is that I was looking to apply some transformations to the enriched data, so thought this would be needed …
When you mention “dedicated lake loader”, does that mean the intermediate blob storage location will no longer be needed and the data will be loaded directly into the lake? In a different format, for example delta parquet…?
You could say “Transformer” is an unfortunate name, but then again it used to be called “Shredder”. It does some pre-canned transformation that’s needed for the Loader, but it’s not meant to be used standalone.
For transformations, I think what you are looking for is Snowbridge with Kafka input (I know, I know), or Benthos, maybe.
Thanks, Nick.
So to progress the Snowbridge with Kafka input, I need to reach out to the contributor who raised the PR and take it from there, right?
@steve.gingell I’ve implemented the tests that we needed, it’s in PR review now. There’s some cleanup needed but once I manage to find time to get that done and get a final review it’ll be released.
In the meantime I’ve released a pre-release asset - you can use version 2.2.0-rc1 to experiment with. Here’s a config example for the source.
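Roughly, the source block looks something like the sketch below (illustration only, not the linked example itself - the option names brokers and topic_name and the placeholder values are assumptions here, so treat the linked example and the Snowbridge docs as authoritative):

source {
  use "kafka" {
    # Broker connection string (assumed option name, placeholder value)
    brokers = "localhost:9092"

    # Topic carrying the Snowplow enriched events (assumed option name, placeholder value)
    topic_name = "enriched-good"
  }
}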
@Colm - you’re an absolute star! Thanks for this; really appreciate it.
I’ll get cracking and let you know how I get on
Thanks again,
Steve
@steve.gingell it’s now released. I recommend using the prod asset over the rc I pointed you to as it contains vulnerability fixes. Should be no difference apart from that though.
Hi @steve.gingell, just to follow up on this in case you haven’t seen the announcement: https://snowplow.io/blog/announcing-snowplow-lake-loader/.