Hey @mjensen, one thing that significantly increases load time is RDB Loader’s consistency check. Basically this is a hack to overcome the infamous S3 eventual consistency, which often results in failed loads due to ghost folders on S3. RDB Loader “solves” this problem by waiting some time until S3 starts to give consistent results. This time is specific to your dataset and is calculated by the formula `((atomicFiles.length * 0.1 * shreddedTypes.length) + 5)` seconds, so if you have many shredded types it can easily reach 30 minutes. The upcoming R29 aims to solve this problem in a more elegant way, so stay tuned.
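To make the numbers concrete, here is a tiny Scala sketch of that calculation (the function name and example inputs are mine; only the formula itself comes from the loader):

```scala
// Sketch of the consistency check delay formula quoted above.
// `consistencyDelay` is a made-up name, not the actual RDB Loader function.
def consistencyDelay(atomicFiles: Int, shreddedTypes: Int): Double =
  (atomicFiles * 0.1 * shreddedTypes) + 5 // seconds

consistencyDelay(100, 2)   // 25 seconds
consistencyDelay(1200, 15) // 1805 seconds, i.e. roughly 30 minutes
```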
Also, with R97 Knossos you can add `--skip consistency_check` to skip this stage, but the chance of failure increases significantly (you’ll have to re-run the load manually in that case, so maybe that’s not critical).
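If you run the pipeline through EmrEtlRunner, the invocation looks roughly like this (the config and resolver file names are placeholders for whatever you normally pass):

```bash
./snowplow-emr-etl-runner run \
  --config config.yml \
  --resolver resolver.json \
  --skip consistency_check
```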
> I suspect it’s because the DynamoDB table could get very large.
@gareth, that’s a valid suggestion; however, from our experience (most manifests we’re working with are 20-35 GB), DynamoDB is very elastic in this sense: you only tweak the throughput, and that is the only thing that affects the time required to put data.
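To illustrate why table size barely matters: each de-duplication check is a single-item conditional write, which DynamoDB serves in roughly constant time however big the table grows. A rough Scala sketch (the table and attribute names are invented; this is not the actual manifest schema):

```scala
// Illustration only: constant-time "put if absent" against a DynamoDB manifest.
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, ConditionalCheckFailedException, PutItemRequest}
import scala.jdk.CollectionConverters._

val client = AmazonDynamoDBClientBuilder.defaultClient()

/** Returns true if the event is new, false if it is a duplicate. */
def putIfAbsent(eventId: String, fingerprint: String): Boolean = {
  val item = Map(
    "eventId"     -> new AttributeValue().withS(eventId),
    "fingerprint" -> new AttributeValue().withS(fingerprint)
  ).asJava

  val request = new PutItemRequest()
    .withTableName("event-manifest")
    .withItem(item)
    // The write succeeds only if no item with this eventId exists yet
    .withConditionExpression("attribute_not_exists(eventId)")

  try { client.putItem(request); true }
  catch { case _: ConditionalCheckFailedException => false }
}
```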
> Is it possible to reprocess CloudFront with global event de-duplication?
If I understand the question correctly, you want to de-duplicate historical data. If de-duplication wasn’t enabled when that data was originally processed, then it should be a safe thing to do, since the pipeline knows nothing about those events. It is obviously quite a big job and requires cleaning the historical data out of Redshift, but technically it is possible.
> I am not sure why the Snowplow ETL can’t just query Redshift, download the event IDs and then discard the common ones, like I do. That will save the DynamoDB lookup.
That’s an interesting approach, @bhavin; please feel free to submit an issue at snowplow/snowplow so we can explore and elaborate on it. Here’s what I think (mostly objections, but to maintain the discussion, not to reject the idea!):
- In the end we want our de-duplication to be DB-independent and environment-independent (batch/RT). While we still depend on DynamoDB, I believe having one lightweight external DB is easier to implement and maintain than many heavyweight ones.
- We could address the above point by not storing the IDs in a DB at all, but instead keeping them as an Avro file on S3 and just joining those datasets during the ETL (see the sketch after this list). That could be quite a viable option.
- However, it still makes de-duplication RT-unfriendly. We cannot afford to keep this joined dataset in Kinesis (nor can we query Redshift for a single event ID).
- Even inside the batch pipeline, I believe `JOIN`s would introduce unnecessary shuffling and an implementation burden. DynamoDB-based de-duplication, on the other hand, is something we’re planning to release as a separate testable and portable module.
- In the end, I would be very surprised if the DB-lookup/JOIN approach turned out to be significantly better, given that in our experience a sufficiently provisioned DynamoDB table adds close to zero delay.
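For completeness, here is roughly what the JOIN-based alternative from the list above could look like in Spark (the bucket paths, the `event_id` column and the Avro layout are all assumptions made for the sake of the sketch, not anything the pipeline actually does):

```scala
// Hypothetical sketch of the JOIN-based de-duplication idea discussed above.
// Needs the external spark-avro module (org.apache.spark:spark-avro).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-dedup-sketch").getOrCreate()

// Current batch of enriched events and the historical set of seen event IDs
val events  = spark.read.format("avro").load("s3://my-bucket/enriched/run=2018-01-01/")
val seenIds = spark.read.format("avro").load("s3://my-bucket/manifest/event-ids/")

// A left anti join keeps only events whose event_id has never been seen before;
// this is exactly the step that introduces the extra shuffle mentioned above.
val deduped = events.join(seenIds, Seq("event_id"), "left_anti")

deduped.write.format("avro").save("s3://my-bucket/deduped/run=2018-01-01/")
```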
Again, I would like to discuss and consider this approach; please don’t take the above items as a refusal.
Thanks.