Yes, the shredded files are gzipped. Is it normal for there to be many empty shredded files? (See the screenshot; the 20-byte objects are empty after decompression.)
Thanks @caleb_bertsch. Empty files are fine; they're an artifact of the Spark/Hadoop distribution model.
Just as an experiment, can we try to unzip a couple of files and run RDB Loader against them? Here's the default load statement that we use to load data, just in case:
```sql
COPY $tableName FROM STDIN
WITH CSV ESCAPE E'\x02' QUOTE E'\x01'
DELIMITER '\t'
NULL ''
```
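For example, something along these lines should do it for a single file. The part file name, connection details, and target table below are just placeholders for your own shredded output, and note that psql wants the `E'\t'` escape form for the tab delimiter:

```bash
# Decompress one shredded part file and pipe it into the COPY statement above via psql.
# File name, connection details and table name are placeholders - substitute your own.
gunzip -c part-00000.gz | \
  psql -h localhost -U snowplow -d snowplow \
    -c "COPY atomic.com_acme_my_event_1 FROM STDIN WITH CSV ESCAPE E'\x02' QUOTE E'\x01' DELIMITER E'\t' NULL ''"
```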
I suspect that nothing in RDB Loader takes gzip output compression into account for Postgres. If that's the case, then switching output compression to NONE should help. Let me know if it works - I'll submit a bug report.
So yes, the manual COPY command with ungzipped data worked.
I see now in another emr-etl config that Redshift is the only database that supports gzip compression. The issue is that stream mode, for its part, only supports gzip. So as of right now, would you agree that it's impossible to use PostgreSQL and stream mode together?
If I remember right, it's up to EmrEtlRunner to decide whether RDB Shredder should dump data in a gz archive, and EmrEtlRunner makes this decision based on the output_compression setting in config.yml, so the whole process should be decoupled from Stream/Batch mode.
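For reference, if I remember the sample config right, the setting lives in the enrich section of config.yml (exact nesting may differ between releases):

```yaml
enrich:
  versions:
    spark_enrich: 1.18.0        # example version - use your own
  continue_on_unexpected_error: false
  output_compression: NONE      # NONE for Postgres targets; GZIP works for Redshift
```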
I just want to update that I managed to complete all the steps by setting output_compression to NONE and re-doing the shredding process (i.e. continuing to use `--skip staging_stream_enrich`).
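For anyone else hitting this, the re-run looked roughly like the following (the config and resolver paths are just examples, and older releases invoke EmrEtlRunner without the `run` subcommand):

```bash
# Re-run EmrEtlRunner after switching output_compression to NONE;
# config.yml and iglu_resolver.json paths are examples - adjust to your setup.
./snowplow-emr-etl-runner run \
  --config config.yml \
  --resolver iglu_resolver.json \
  --skip staging_stream_enrich
```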
I can vouch that line #56, as mentioned by @caleb_bertsch, is incorrect, since using GZIP as output compression results in an `invalid byte sequence for encoding "UTF8"` error.