I am trying to load atomic events into a Postgres database, but the atomic.events table stays empty.
I can see all the events in the shredded archive bucket, like this:
0 0 0 0 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36 Chrome Chrome 59.0.3071.115 Browser WEBKIT de 1 0 0 0 0 0 0 0 0 1 24 2400 1217 Windows 10 Windows Microsoft Corporation Europe/Berlin Computer 0 1920 1080 UTF-8 2379 4530 Europe/Vienna 7da39c9f-cb5f-4e9c-b71f-1a76404d0038 2017-07-25 15:44:57.000 com.snowplowanalytics.snowplow page_ping jsonschema 1-0-0 e05f64f64bcf95f568f6420973078a02
mhubid1 web 2017-07-25 16:34:48.956 2017-07-25 15:28:06.000 2017-07-25 15:28:06.850 page_view 1c699952-b8e8-4137-bcb4-cec52482475b cf js-2.5.3 clj-1.1.0-tom-0.2.0 spark-1.9.0-common-0.25.0 212.236.35.x 3103720193 513839fa-f693-48c8-952d-b856fd6c310d 12 bd91d2b9-3562-4c81-ac07-527474cf23ee AT 48.199997 16.3667 https://page.com/ Page Title https page.com 80 /
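A quick way to confirm that the archived runs really contain files (bucket name taken from the config below) would be something like:

aws s3 ls --recursive s3://mhubarchive/shredded/ | head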
Based on this diagram, it should work, but something is still wrong.
I am using the latest versions of EmrEtlRunner and StorageLoader.
Can someone tell me what I need to change? I can't see any error messages and all the logs look fine, and I can see that it does try to load the data:
Loading Snowplow events into PostgreSQL enriched events storage (PostgreSQL database)...
Opening database connection ...
I, [2017-07-25T19:03:20.155000 #46106] INFO -- : SnowplowTracker::Emitter initialized with endpoint http://****.eu-west-1.elasticbeanstalk.com:80/i
I, [2017-07-25T19:03:20.190000 #46106] INFO -- : Attempting to send 1 request
I, [2017-07-25T19:03:20.193000 #46106] INFO -- : Sending GET request to http:/****eu-west-1.elasticbeanstalk.com:80/i...
I, [2017-07-25T19:03:20.287000 #46106] INFO -- : GET request to http://****.eu-west-1.elasticbeanstalk.com:80/i finished with status code 200
Archiving Snowplow events...
moving files from s3://mhubout/enriched/good/ to s3://mhubarchive/enriched/
(t0) MOVE mhubout/enriched/good/run=2017-07-25-18-19-49/part-00000-5c3f...
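The quickest check I know of for the Postgres side is a plain count in psql (credentials masked here like everywhere else, so placeholders); it comes back 0 after every run:

psql -h localhost -p 5432 -U **** -d **** -c "SELECT count(*) FROM atomic.events;"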
My config file is:
aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: ****
  secret_access_key: ****
  s3:
    region: eu-west-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: #s3://mhubjsonpathassets # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3n://mhublogs/logs/
      raw:
        in: # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
          - "s3n://elasticbeanstalk-eu-west-1-896554815027/resources/environments/logs/publish/e-vsn9sdraim" # e.g. s3://my-old-collector-bucket
        processing: s3n://mhublog-processing/processing
        archive: s3://mhubarchive/raw # e.g. s3://my-archive-bucket/raw
      enriched:
        good: s3://mhubout/enriched/good # e.g. s3://my-out-bucket/enriched/good
        bad: s3://mhubout/enriched/bad # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://mhubout/enriched/errors # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://mhubarchive/enriched # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://mhubout/shredded/good # e.g. s3://my-out-bucket/shredded/good
        bad: s3://mhubout/shredded/bad # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://mhubout/shredded/errors # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://mhubarchive/shredded # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 5.5.0
    region: eu-west-1 # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole # Created using $ aws emr create-default-roles
    placement: eu-west-1b # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: rabbit
    bootstrap: [] # Set this to specify custom boostrap actions. Leave empty otherwise
    software:
      hbase: # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual: # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: snowplow # Give your job a name
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_ebs: # Optional. Attach an EBS volume to each core instance.
        volume_size: 100 # Gigabytes
        volume_type: "gp2"
        volume_iops: 400 # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    additional_info: # Optional JSON string for selecting additional features
collectors:
  format: clj-tomcat # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.9.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_shredder: 0.12.0 # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  download:
    folder: /var/www/postgres # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: mhubid1 # e.g. snowplow
    collector: ****.eu-west-1.elasticbeanstalk.com # e.g. d3rkrsqld9gmqf.cloudfront.net
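As far as I understand, with a Postgres target the loader first downloads the enriched files into the storage: download: folder above and only then loads them from local disk, so that directory is worth watching during a run (path taken from the config above; plain shell, nothing Snowplow-specific):

ls -l /var/www/postgres
df -h /var/www/postgres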
targets/postgres.json:
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/postgresql_config/jsonschema/1-0-0",
  "data": {
    "name": "PostgreSQL enriched events storage",
    "host": "****",
    "database": "****",
    "port": 5432,
    "sslMode": "DISABLE",
    "username": "****",
    "password": "****",
    "schema": "atomic",
    "purpose": "ENRICHED_EVENTS"
  }
}
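To rule out the connection settings themselves, a manual test using exactly the values from targets/postgres.json (masked here, with sslmode=disable mirroring the DISABLE above) would look roughly like:

PGPASSWORD='****' psql "host=**** port=5432 dbname=**** user=**** sslmode=disable" -c "\dt atomic.*"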
The Postgres database is running on the same host (localhost) and I can connect to it with Navicat. The atomic schema, the events table, and pg_hba.conf also look OK.