Thanks @caleb_bertsch. No, empty files are fine - they're an artifact of the Spark/Hadoop distribution model.
Just as an experiment, can we try to unzip a couple of files and run RDB Loader against them? Here's the default load statement that we use to load data, just in case:
COPY $tableName FROM STDIN
WITH CSV ESCAPE E'\x02' QUOTE E'\x01'
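To try this by hand, something like the sketch below should work. The paths, table name, and database name are placeholders (I'm fabricating one tiny gzipped "part" file so the commands run end to end); the `psql` invocation is only echoed here - run it for real against your database:

```shell
# Fabricate a sample gzipped part file, standing in for RDB Shredder output
# (real output lives wherever your shredded.good path points).
mkdir -p shredded-demo
printf 'app\tweb\n' | gzip > shredded-demo/part-00000.gz

# 1. Decompress it, as RDB Loader would need for Postgres:
gzip -d shredded-demo/part-00000.gz   # leaves shredded-demo/part-00000

# 2. Feed the plain file to Postgres with the same COPY statement
#    RDB Loader issues (table/db names are placeholders):
echo psql -d snowplow -c \
  "COPY atomic.events FROM STDIN WITH CSV ESCAPE E'\x02' QUOTE E'\x01'" \
  '< shredded-demo/part-00000'
```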
I suspect that nothing in RDB Loader takes gzip output compression into account for Postgres. If that's the case, then switching output compression to NONE should help. Let me know if it works - I'll submit a bug report.
So yes, the manual COPY command with un-gzipped data worked.
I now see in another EmrEtlRunner config that Redshift is the only database that supports gzip compression. The issue is that stream mode also only supports gzip. So as of right now, would you agree that it's impossible to use PostgreSQL and stream mode together?
If I remember correctly, it's up to EmrEtlRunner to decide whether RDB Shredder should dump data as gzip archives, and EmrEtlRunner makes that decision based on the `output_compression` setting in config.yml - so the whole process should be decoupled from Stream/Batch mode.
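If I recall the config layout correctly, that setting lives under the `enrich:` section of config.yml; a minimal fragment (version number illustrative) would look like:

```yaml
enrich:
  versions:
    spark_enrich: 1.18.0     # illustrative version
  output_compression: NONE   # GZIP is fine for Redshift; use NONE for Postgres
```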