No, Hadoop Shred is the harder-working job (assuming all the Hadoop Enrich enrichments are performing well), and increasingly so as it takes on more work around event de-duplication…
Ahh, makes sense. I totally forgot about the dedupe and was just thinking of it as mapping the events out for import into Redshift.
I’m trying to wrap my head around the math here… According to your post the workload size is 14GB, or within an order of magnitude of that. How does that relate to consuming 2TB of Hadoop capacity? It makes zero sense to me, but I observe similar behaviour on my own workloads.
So the 14GB is on S3, which means it’s compressed, and text compresses really well.
I’m not sure whether the records are stored uncompressed on HDFS, and I also assume there would be multiple copies for raw, enriched, shredded, etc.
But I’m not sure whether that would add up to 2TB, or whether my assumptions have any basis…
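Purely as a back-of-envelope sketch of how those assumptions could stack up (the compression ratio and the number of intermediate copies below are guesses for illustration, not measured values):

```python
# Rough sketch: how ~14GB of compressed input on S3 could approach ~2TB on HDFS.
# Both ratios below are assumptions, not measured values.

compressed_input_gb = 14   # gzipped raw events pulled from S3
compression_ratio = 10     # assumed ~10x compression for event text
hdfs_replication = 3       # default HDFS replication factor
pipeline_copies = 3        # assumed copies held during the run: raw, enriched, shredded

uncompressed_gb = compressed_input_gb * compression_ratio
hdfs_footprint_tb = uncompressed_gb * hdfs_replication * pipeline_copies / 1024

print(f"~{uncompressed_gb} GB uncompressed, ~{hdfs_footprint_tb:.1f} TB on HDFS")
# -> ~140 GB uncompressed, ~1.2 TB on HDFS: the same order of magnitude as 2TB
```

With those (admittedly hand-wavy) ratios, the 2TB figure doesn’t look unreasonable.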
Just a quick update: I am now running twice a day, processing 15-20 million events per batch and taking about 1h20m to do so.
So it looks like it was just struggling a bit with the higher volumes (72m and 58m rows) coupled with the large number of task servers.
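For reference, a quick throughput estimate from those numbers (using only the batch sizes and run time quoted above):

```python
# Quick throughput estimate from the batch sizes and run time quoted above.

events_per_batch = (15_000_000, 20_000_000)  # 15-20 million events per batch
run_time_seconds = 80 * 60                   # roughly 1h20m per run

for events in events_per_batch:
    print(f"{events / 1_000_000:.0f}M events -> ~{events / run_time_seconds:,.0f} events/sec")
# -> 15M events -> ~3,125 events/sec
# -> 20M events -> ~4,167 events/sec
```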
I am still running with just core nodes, but would like to test task nodes at some point as they would be a significant cost saving.
Cheers,
Dean