How do you separate a batch of a certain size from enriched data? We use an Airflow task to build a list of files in S3 until their total size reaches 1.5 GB and copy those files to a separate directory; only then do we launch the Dataflow Runner. Is there a simpler way to do this with steps in EMR itself, e.g. with S3DistCp or another tool?
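For context, this is roughly the selection logic our Airflow task implements; a minimal sketch with boto3, where the bucket name, prefixes, and the 1.5 GB cap are placeholders for our actual setup:

```python
import boto3

# Hypothetical names: adjust the bucket, prefixes, and cap to your setup.
BUCKET = "my-bucket"
SRC_PREFIX = "enriched/good/"
DST_PREFIX = "enriched/batch/"
MAX_BATCH_BYTES = 1_500 * 1024 * 1024  # ~1.5 GB

def copy_batch():
    """Copy objects from SRC_PREFIX to DST_PREFIX until the
    cumulative size would exceed MAX_BATCH_BYTES."""
    s3 = boto3.client("s3")
    total = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=SRC_PREFIX):
        for obj in page.get("Contents", []):
            if total + obj["Size"] > MAX_BATCH_BYTES:
                return  # the batch is full; stop selecting files
            total += obj["Size"]
            key = obj["Key"]
            s3.copy_object(
                Bucket=BUCKET,
                Key=DST_PREFIX + key[len(SRC_PREFIX):],
                CopySource={"Bucket": BUCKET, "Key": key},
            )
```

In the DAG this runs as a plain PythonOperator callable before the step that launches the Dataflow Runner.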
I didn’t find anything similar in previous discussions.
Hi @Edward_Kim, maybe you can achieve what you want with a combination of S3DistCp options: `--groupBy` and `--targetSize`. The first lets you create batches by specifying a group-by condition, and the second limits how big those batches can get.
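A minimal sketch of such an invocation; the bucket paths and the group-by regex are hypothetical, and `--targetSize` is specified in MiB:

```bash
s3-dist-cp \
  --src  s3://my-bucket/enriched/good/ \
  --dest s3://my-bucket/enriched/batch/ \
  --groupBy '.*/(run=[0-9-]+)/.*' \
  --targetSize 1536   # ~1.5 GB per output file
```

Files whose keys match the `--groupBy` regex are concatenated per capture group, and S3DistCp splits the concatenated output into files no larger than the target size.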
Thanks for the advice @dilyan. But that only helps to combine several files, grouped in a certain way, into one file; it still takes all the files from the specified directory. I need to select from the directory only as many files as the transformer can process.