Ideal file size for enrichment

rgabo · May 27, 2016, 7:42am

Hello Snowplowers,

Snowplow settled on the 128MB LZO file size to deal with the small files problem quite early on (2013/05 to be precise: http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/).

How did you end up with the 128MB file size? More recent recommendations for Spark put file sizes anywhere between 64MB and 1GB, a range Snowplow fits into, especially since LZO is splittable, but I was still wondering whether 128MB is the ideal target file size.

As far as I know, HDFS stores data in 64MB blocks, so any multiple of that is ideal on HDFS, but what about S3? Do you have experience with long-term S3 storage in larger, splittable files?

Gabor

alex · June 4, 2016, 12:15am

Hey @rgabo - lots of good questions there.

No real magic - it’s just the current default block size for Hadoop (dfs.blocksize); see hdfs-default.xml.
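
To make the block arithmetic concrete, here is a minimal Python sketch of how many HDFS blocks a file of a given size would occupy. It is an illustration only, not Snowplow or Hadoop code: the function name `blocks_needed` and the sample sizes are hypothetical, and the 128MB default is simply the dfs.blocksize value mentioned above.

```python
MB = 1024 * 1024

def blocks_needed(file_size_bytes: int, block_size_bytes: int = 128 * MB) -> int:
    """Return how many blocks of block_size_bytes a file occupies (hypothetical helper)."""
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division: a partially filled last block still counts as one block.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes

# Sample sizes chosen for illustration only.
for size_mb in (1, 128, 129, 1024):
    print(f"{size_mb}MB file -> {blocks_needed(size_mb * MB)} block(s)")
```

Under these assumptions a 128MB file fits exactly one block, while a file just over the block size spills into a second one.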

We don’t have any particular experience in this, but I suspect yes, having e.g. the enriched events stored as splittable LZO in far fewer and bigger files would be highly performant. Would love to hear what you find out if you test this!
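
For anyone who wants to experiment with fewer, bigger files, the general idea of batching small files up to a target size can be sketched as below. This is a rough greedy illustration under stated assumptions, not how Snowplow or any Hadoop tool actually compacts files: `group_files`, the file names and the sizes are all hypothetical.

```python
from typing import List, Tuple

MB = 1024 * 1024

def group_files(files: List[Tuple[str, int]],
                target_bytes: int = 128 * MB) -> List[List[str]]:
    """Greedily pack (name, size_in_bytes) pairs, in order, into groups whose
    combined size stays at or below target_bytes. A single file larger than
    the target ends up in a group of its own."""
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Close the current group if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Hypothetical input: three 60MB files, one 200MB file, one 10MB file.
sample = [("a", 60 * MB), ("b", 60 * MB), ("c", 60 * MB),
          ("d", 200 * MB), ("e", 10 * MB)]
print(group_files(sample))
```

Raising `target_bytes` (for example to 1GB) yields fewer, larger groups, which is the trade-off worth measuring.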

rgabo · June 8, 2016, 7:23am

Seems like the Spark/Parquet default compression is gzip to optimize for storage of persistent data. Snappy is used for temporary data between stages in Spark by default.

The one drawback of LZO is that its licensing does not permit companies like Databricks to package it in their service, so you need to install it manually. Snappy does not have this issue and it’s very comparable.

I’ll share more experience around compression types and file sizes later on.
