Yes, the shredded files are gzipped. Is it normal for there to be many empty shredded files? (See the screenshot; the 20-byte objects are empty after decompression.)
Thanks @caleb_bertsch. Empty files are fine; they're an artifact of the Spark/Hadoop distribution model.
Just as an experiment, can we try to unzip a couple of files and run RDB Loader against them? Here's the default load statement that we use to load data, just in case:
```sql
COPY $tableName FROM STDIN
WITH CSV ESCAPE E'\x02' QUOTE E'\x01'
DELIMITER '\t'
NULL ''
```
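For example, something along these lines should do it for a single file. The part file name, connection details, and target table below are just placeholders for your own shredded output, and note that psql wants the `E'\t'` escape form for the tab delimiter:

```bash
# Decompress one shredded part file and pipe it into the COPY statement above via psql.
# File name, connection details and table name are placeholders - substitute your own.
gunzip -c part-00000.gz | \
  psql -h localhost -U snowplow -d snowplow \
    -c "COPY atomic.com_acme_my_event_1 FROM STDIN WITH CSV ESCAPE E'\x02' QUOTE E'\x01' DELIMITER E'\t' NULL ''"
```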
I suspect that nothing in RDB Loader takes gzip output compression into account for Postgres. If that's the case, then switching output compression to NONE should help. Let me know if it works - I'll submit a bug report.
So yes, the manual COPY command with ungzipped data worked.
I see now in another emr-etl config that Redshift is the only database that supports gzip compression. The issue is that stream mode, for its part, only supports gzip. So as of right now, would you agree that it's impossible to use PostgreSQL and stream mode together?
If I remember right, it's up to EmrEtlRunner to decide whether RDB Shredder should dump data in a gz archive, and EmrEtlRunner makes this decision based on the output_compression setting in config.yml, so the whole process should be decoupled from Stream/Batch mode.
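For reference, if I remember the sample config right, the setting lives in the enrich section of config.yml (exact nesting may differ between releases):

```yaml
enrich:
  versions:
    spark_enrich: 1.18.0        # example version - use your own
  continue_on_unexpected_error: false
  output_compression: NONE      # NONE for Postgres targets; GZIP works for Redshift
```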
I just want to update that I managed to complete all the steps by setting output_compression to NONE and re-doing the shredding process (i.e. continuing to use `--skip staging_stream_enrich`).
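For anyone else hitting this, the re-run looked roughly like the following (the config and resolver paths are just examples, and older releases invoke EmrEtlRunner without the `run` subcommand):

```bash
# Re-run EmrEtlRunner after switching output_compression to NONE;
# config.yml and iglu_resolver.json paths are examples - adjust to your setup.
./snowplow-emr-etl-runner run \
  --config config.yml \
  --resolver iglu_resolver.json \
  --skip staging_stream_enrich
```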
I can vouch that line #56, as mentioned by @caleb_bertsch, is incorrect, since using GZIP as output compression results in an `invalid byte sequence for encoding "UTF8"` error.