We are pleased to announce updated versions of our bucket loaders: version 0.3.1 of the Google Cloud Storage Loader and version 0.7.0 of the S3 Loader, which allow you to partition events by schema.
1. Partitioning by schema
At Snowplow we use the self-describing JSON format to keep data definitions well defined and type-specced. When fed self-describing JSON, the bucket loaders are now able to write each event to the directory matching its schema, producing a tidy directory structure.
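For illustration, a self-describing JSON wraps its payload in an envelope whose schema URI points at an Iglu schema, and it is this URI that determines which directory an event is routed to. The vendor, schema name and fields below are made up for the example:

```json
{
  "schema": "iglu:com.acme/checkout_started/jsonschema/1-0-0",
  "data": {
    "orderId": "1234",
    "total": 42.5
  }
}
```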
The change comes alongside the R118 release, which introduces a beta of the new bad event format for easier post-processing.
2. Upgrading
We have aligned the configuration settings of both loaders to expose the same set of options, so it is easier to deploy across GCP and AWS. In the future we will consolidate the configuration further to make it even more portable. This alignment, along with the new capabilities, required some backwards-incompatible changes to the options.
S3 Loader
If you want to make use of the new partitioning mechanism, make sure to set the additional new parameter s3.partitionedBucket=s3://[BUCKET] in your configuration file (see the configuration sketch below). The parameter points to the S3 URI where partitioned JSON files are to be stored. Otherwise no partitioning is performed and data is stored within the top-level directory, followed by s3.dateFormat if set.
We also introduced new optional configuration settings: s3.outputDirectory for a static directory prefix, s3.dateFormat={YYYY}/{MM}/{dd}/{HH} which follows the outputDirectory, and s3.filenamePrefix which prefixes the resulting files, producing object paths of the form: s3://[s3.bucket|s3.partitionedBucket]/[s3.outputDirectory]/[s3.dateFormat]/[s3.filenamePrefix]-...
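As a minimal sketch, assuming the dotted keys above map to an s3 { ... } block in the loader's HOCON configuration file; the bucket names and prefixes are placeholders and the rest of the file is omitted:

```hocon
s3 {
  # Existing non-partitioned destination
  bucket = "s3://acme-raw-bucket"

  # New: setting this enables partitioning by schema
  partitionedBucket = "s3://acme-partitioned-bucket"

  # New optional settings
  outputDirectory = "enriched"
  dateFormat = "{YYYY}/{MM}/{dd}/{HH}"
  filenamePrefix = "acme"
}

# Resulting objects would then look like:
# s3://acme-partitioned-bucket/enriched/2020/03/31/12/acme-...
```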
Google Cloud Storage Loader
Google Cloud Storage Loader deployment via Dataflow templates is no longer supported, due to an upstream limitation around optional runtime parameters. Therefore, from now on only the command-line, Docker-based deployment is supported.
We introduced a new flag, --dateFormat=YYYY/MM/dd/HH/, which takes over date formatting from the --outputDirectory flag. The output directory no longer interprets a date format string and therefore becomes --outputDirectory=gs://[BUCKET]/ instead of --outputDirectory=gs://[BUCKET]/YYYY/MM/dd/HH/.
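A before/after sketch of those two flags (the bucket name is a placeholder):

```bash
# Before: date format embedded in the output directory
--outputDirectory=gs://acme-bad-rows/YYYY/MM/dd/HH/

# After (0.3.1): static output directory plus a dedicated date format flag
--outputDirectory=gs://acme-bad-rows/
--dateFormat=YYYY/MM/dd/HH/
```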
Moreover, we now also allow setting GCP Dataflow-specific options directly as flags. An often-requested one is --labels={\"environment\": \"prod\"}, which may be used for filtering costs on your cloud deployments.
If you want to make use of the new partitioning mechanism, make sure the additional new parameter --partitionedOutputDirectory=gs://[BUCKET]/[BUCKET_DIR] is set. Otherwise no partitioning is performed and data is stored within the top-level directory, followed by --dateFormat if set.
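Putting it together, a command-line invocation could look roughly like the sketch below. The Docker image name and the --runner, --project and --inputSubscription flags are assumptions added for illustration; --outputDirectory, --dateFormat, --partitionedOutputDirectory and --labels are the flags discussed above (the label value is quoted for the shell):

```bash
# A sketch only: image name, project, subscription and runner flags are assumed placeholders.
docker run snowplow/snowplow-google-cloud-storage-loader:0.3.1 \
  --runner=DataflowRunner \
  --project=acme-prod \
  --inputSubscription=projects/acme-prod/subscriptions/bad-rows \
  --outputDirectory=gs://acme-bad-rows/ \
  --dateFormat=YYYY/MM/dd/HH/ \
  --partitionedOutputDirectory=gs://acme-bad-rows/partitioned \
  --labels='{"environment": "prod"}'
```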
3. Roadmap
The GCS Loader and S3 Loader continue to evolve at Snowplow. If you have other features in mind, feel free to log an issue in the GCS Loader GitHub repository or the S3 Loader GitHub repository.
4. Contributing
You can check out the GCS Loader repository and the S3 Loader repository if you’d like to get involved!