I am trying to load atomic events into a Postgres database, but the atomic.events table stays empty.
I can see all the events in the shredded archive bucket, like this:
0 0 0 0 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36 Chrome Chrome 59.0.3071.115 Browser WEBKIT de 1 0 0 0 0 0 0 0 0 1 24 2400 1217 Windows 10 Windows Microsoft Corporation Europe/Berlin Computer 0 1920 1080 UTF-8 2379 4530 Europe/Vienna 7da39c9f-cb5f-4e9c-b71f-1a76404d0038 2017-07-25 15:44:57.000 com.snowplowanalytics.snowplow page_ping jsonschema 1-0-0 e05f64f64bcf95f568f6420973078a02
mhubid1 web 2017-07-25 16:34:48.956 2017-07-25 15:28:06.000 2017-07-25 15:28:06.850 page_view 1c699952-b8e8-4137-bcb4-cec52482475b cf js-2.5.3 clj-1.1.0-tom-0.2.0 spark-1.9.0-common-0.25.0 212.236.35.x 3103720193 513839fa-f693-48c8-952d-b856fd6c310d 12 bd91d2b9-3562-4c81-ac07-527474cf23ee AT 48.199997 16.3667 https://page.com/ Page Title https page.com 80 /
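A quick way to confirm that the archived runs really contain files (bucket name taken from the config below) would be something like:

aws s3 ls --recursive s3://mhubarchive/shredded/ | head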
Based on this diagram, it should work, but something is still wrong.
I am using the latest versions of EmrEtlRunner and StorageLoader.
Can someone tell me what I need to change? I can't see any error messages and all the logs look fine, and I can see that it does try to load the data:
Loading Snowplow events into PostgreSQL enriched events storage (PostgreSQL database)...
Opening database connection ...
I, [2017-07-25T19:03:20.155000 #46106] INFO -- : SnowplowTracker::Emitter initialized with endpoint http://****.eu-west-1.elasticbeanstalk.com:80/i
I, [2017-07-25T19:03:20.190000 #46106] INFO -- : Attempting to send 1 request
I, [2017-07-25T19:03:20.193000 #46106] INFO -- : Sending GET request to http:/****eu-west-1.elasticbeanstalk.com:80/i...
I, [2017-07-25T19:03:20.287000 #46106] INFO -- : GET request to http://****.eu-west-1.elasticbeanstalk.com:80/i finished with status code 200
Archiving Snowplow events...
moving files from s3://mhubout/enriched/good/ to s3://mhubarchive/enriched/
(t0) MOVE mhubout/enriched/good/run=2017-07-25-18-19-49/part-00000-5c3f...
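The quickest check I know of for the Postgres side is a plain count in psql (credentials masked here like everywhere else, so placeholders); it comes back 0 after every run:

psql -h localhost -p 5432 -U **** -d **** -c "SELECT count(*) FROM atomic.events;"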
My config file is:
aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: ****
  secret_access_key: ****
  s3:
    region: eu-west-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: #s3://mhubjsonpathassets # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3n://mhublogs/logs/
      raw:
        in: # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
          - "s3n://elasticbeanstalk-eu-west-1-896554815027/resources/environments/logs/publish/e-vsn9sdraim" # e.g. s3://my-old-collector-bucket
        processing: s3n://mhublog-processing/processing
        archive: s3://mhubarchive/raw # e.g. s3://my-archive-bucket/raw
      enriched:
        good: s3://mhubout/enriched/good # e.g. s3://my-out-bucket/enriched/good
        bad: s3://mhubout/enriched/bad # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://mhubout/enriched/errors # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://mhubarchive/enriched # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://mhubout/shredded/good # e.g. s3://my-out-bucket/shredded/good
        bad: s3://mhubout/shredded/bad # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://mhubout/shredded/errors # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://mhubarchive/shredded # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 5.5.0
    region: eu-west-1 # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole # Created using $ aws emr create-default-roles
    placement: eu-west-1b # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: rabbit
    bootstrap: [] # Set this to specify custom boostrap actions. Leave empty otherwise
    software:
      hbase: # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual: # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: snowplow # Give your job a name
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_ebs: # Optional. Attach an EBS volume to each core instance.
        volume_size: 100 # Gigabytes
        volume_type: "gp2"
        volume_iops: 400 # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    additional_info: # Optional JSON string for selecting additional features
collectors:
  format: clj-tomcat # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.9.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_shredder: 0.12.0 # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  download:
    folder: /var/www/postgres # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: mhubid1 # e.g. snowplow
    collector: ****.eu-west-1.elasticbeanstalk.com # e.g. d3rkrsqld9gmqf.cloudfront.net
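As far as I understand, with a Postgres target the loader first downloads the enriched files into the storage: download: folder above and only then loads them from local disk, so that directory is worth watching during a run (path taken from the config above; plain shell, nothing Snowplow-specific):

ls -l /var/www/postgres
df -h /var/www/postgres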
targets/postgres.json:
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/postgresql_config/jsonschema/1-0-0",
  "data": {
    "name": "PostgreSQL enriched events storage",
    "host": "****",
    "database": "****",
    "port": 5432,
    "sslMode": "DISABLE",
    "username": "****",
    "password": "****",
    "schema": "atomic",
    "purpose": "ENRICHED_EVENTS"
  }
}
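To rule out the connection settings themselves, a manual test using exactly the values from targets/postgres.json (masked here, with sslmode=disable mirroring the DISABLE above) would look roughly like:

PGPASSWORD='****' psql "host=**** port=5432 dbname=**** user=**** sslmode=disable" -c "\dt atomic.*"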
The Postgres database is running on the same host (localhost) and I can connect to it with Navicat. The atomic schema, the events table, and pg_hba.conf also look OK.