No errors and exitCode 0 in the controller; here's the syslog:
2017-02-27 16:16:16,756 INFO cascading.flow.hadoop.util.HadoopUtil (main): resolving application jar from found main method on: com.snowplowanalytics.snowplow.enrich.hadoop.JobRunner$
2017-02-27 16:16:16,757 INFO cascading.flow.hadoop.planner.HadoopPlanner (main): using application jar: /mnt/var/lib/hadoop/steps/s-24FWFV1LAAFIK/snowplow-hadoop-enrich-1.8.0.jar
2017-02-27 16:16:16,765 INFO cascading.property.AppProps (main): using app.id: 427E8338DE404CB8BC20B216FF29B12A
2017-02-27 16:16:16,830 INFO org.apache.hadoop.conf.Configuration.deprecation (main): mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
2017-02-27 16:16:16,968 INFO org.apache.hadoop.conf.Configuration.deprecation (main): mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
2017-02-27 16:16:17,575 INFO com.amazon.ws.emr.hadoop.fs.EmrFileSystem (main): Consistency disabled, using com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem as filesystem implementation
2017-02-27 16:16:17,711 INFO amazon.emr.metrics.MetricsSaver (main): MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: false maxMemoryMb: 3072 maxInstanceCount: 500 lastModified: 1488211973639
2017-02-27 16:16:17,711 INFO amazon.emr.metrics.MetricsSaver (main): Created MetricsSaver j-FFWC39O27Z3S:i-04fa8471dbd4d00a3:RunJar:08029 period:60 /mnt/var/em/raw/i-04fa8471dbd4d00a3_20170227_RunJar_08029_raw.bin
2017-02-27 16:16:18,164 INFO cascading.util.Version (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): Concurrent, Inc - Cascading 2.6.0
2017-02-27 16:16:18,166 INFO cascading.flow.Flow (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): [com.snowplowanalytics…] starting
2017-02-27 16:16:18,166 INFO cascading.flow.Flow (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): [com.snowplowanalytics…] source: Hfs["LzoByteArrayScheme[[UNKNOWN]->[ALL]]"]["hdfs:/local/snowplow/raw-events"]
2017-02-27 16:16:18,166 INFO cascading.flow.Flow (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): [com.snowplowanalytics…] sink: Hfs["TextDelimited[['json']]"]["s3://wiw-snowplow-output2/enriched/bad/run=2017-02-27-16-06-43"]
2017-02-27 16:16:18,166 INFO cascading.flow.Flow (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): [com.snowplowanalytics…] sink: Hfs["TextDelimited[['app_id', 'platform', 'etl_tstamp', 'collector_tstamp', 'dvce_created_tstamp', 'event', 'event_id', 'txn_id', 'name_tracker', 'v_tracker', 'v_collector', 'v_etl', 'user_id', 'user_ipaddress', 'user_fingerprint', 'domain_userid', 'domain_sessionidx', 'network_userid', 'geo_country', 'geo_region', 'geo_city', 'geo_zipcode', 'geo_latitude', 'geo_longitude', 'geo_region_name', 'ip_isp', 'ip_organization', 'ip_domain', 'ip_netspeed', 'page_url', 'page_title', 'page_referrer', 'page_urlscheme', 'page_urlhost', 'page_urlport', 'page_urlpath', 'page_urlquery', 'page_urlfragment', 'refr_urlscheme', 'refr_urlhost', 'refr_urlport', 'refr_urlpath', 'refr_urlquery', 'refr_urlfragment', 'refr_medium', 'refr_source', 'refr_term', 'mkt_medium', 'mkt_source', 'mkt_term', 'mkt_content', 'mkt_campaign', 'contexts', 'se_category', 'se_action', 'se_label', 'se_property', 'se_value', 'unstruct_event', 'tr_orderid', 'tr_affiliation', 'tr_total', 'tr_tax', 'tr_shipping', 'tr_city', 'tr_state', 'tr_country', 'ti_orderid', 'ti_sku', 'ti_name', 'ti_category', 'ti_price', 'ti_quantity', 'pp_xoffset_min', 'pp_xoffset_max', 'pp_yoffset_min', 'pp_yoffset_max', 'useragent', 'br_name', 'br_family', 'br_version', 'br_type', 'br_renderengine', 'br_lang', 'br_features_pdf', 'br_features_flash', 'br_features_java', 'br_features_director', 'br_features_quicktime', 'br_features_realplayer', 'br_features_windowsmedia', 'br_features_gears', 'br_features_silverlight', 'br_cookies', 'br_colordepth', 'br_viewwidth', 'br_viewheight', 'os_name', 'os_family', 'os_manufacturer', 'os_timezone', 'dvce_type', 'dvce_ismobile', 'dvce_screenwidth', 'dvce_screenheight', 'doc_charset', 'doc_width', 'doc_height', 'tr_currency', 'tr_total_base', 'tr_tax_base', 'tr_shipping_base', 'ti_currency', 'ti_price_base', 'base_currency', 'geo_timezone', 'mkt_clickid', 'mkt_network', 'etl_tags', 'dvce_sent_tstamp', 'refr_domain_userid', 'refr_dvce_tstamp', 'derived_contexts', 'domain_sessionid', 'derived_tstamp', 'event_vendor', 'event_name', 'event_format', 'event_version', 'event_fingerprint', 'true_tstamp']]"]["hdfs:/local/snowplow/enriched-events"]
2017-02-27 16:16:18,167 INFO cascading.flow.Flow (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): [com.snowplowanalytics…] parallel execution is enabled: true
2017-02-27 16:16:18,167 INFO cascading.flow.Flow (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): [com.snowplowanalytics…] starting jobs: 3
2017-02-27 16:16:18,167 INFO cascading.flow.Flow (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): [com.snowplowanalytics…] allocating threads: 3
2017-02-27 16:16:18,167 INFO cascading.flow.FlowStep (pool-5-thread-1): [com.snowplowanalytics…] starting step: (1/3)
2017-02-27 16:16:18,233 INFO org.apache.hadoop.yarn.client.RMProxy (pool-5-thread-1): Connecting to ResourceManager at ip-10-30-38-249.ec2.internal/10.30.38.249:8032
2017-02-27 16:16:18,365 INFO org.apache.hadoop.yarn.client.RMProxy (pool-5-thread-1): Connecting to ResourceManager at ip-10-30-38-249.ec2.internal/10.30.38.249:8032
2017-02-27 16:16:19,036 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (pool-5-thread-1): Total input paths to process : 104
2017-02-27 16:16:19,036 INFO org.apache.hadoop.conf.Configuration.deprecation (pool-5-thread-1): mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
2017-02-27 16:16:19,331 INFO org.apache.hadoop.mapreduce.JobSubmitter (pool-5-thread-1): number of splits:52
2017-02-27 16:16:19,460 INFO org.apache.hadoop.mapreduce.JobSubmitter (pool-5-thread-1): Submitting tokens for job: job_1488211967557_0002
2017-02-27 16:16:19,560 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl (pool-5-thread-1): Submitted application application_1488211967557_0002
2017-02-27 16:16:19,585 INFO org.apache.hadoop.mapreduce.Job (pool-5-thread-1): The url to track the job: http://ip-10-30-38-249.ec2.internal:20888/proxy/application_1488211967557_0002/
2017-02-27 16:16:19,586 INFO cascading.flow.FlowStep (pool-5-thread-1): [com.snowplowanalytics…] submitted hadoop job: job_1488211967557_0002
2017-02-27 16:16:19,586 INFO cascading.flow.FlowStep (pool-5-thread-1): [com.snowplowanalytics…] tracking url: http://ip-10-30-38-249.ec2.internal:20888/proxy/application_1488211967557_0002/
2017-02-27 16:16:48,378 INFO cascading.util.Update (UpdateRequestTimer): newer Cascading release available: 2.6.3
2017-02-27 16:17:19,780 INFO cascading.flow.FlowStep (pool-5-thread-3): [com.snowplowanalytics…] starting step: (3/3) …d/run=2017-02-27-16-06-43
2017-02-27 16:17:19,781 INFO cascading.flow.FlowStep (pool-5-thread-2): [com.snowplowanalytics…] starting step: (2/3) …/snowplow/enriched-events
2017-02-27 16:17:19,798 INFO org.apache.hadoop.yarn.client.RMProxy (pool-5-thread-3): Connecting to ResourceManager at ip-10-30-38-249.ec2.internal/10.30.38.249:8032
2017-02-27 16:17:19,832 INFO org.apache.hadoop.yarn.client.RMProxy (pool-5-thread-2): Connecting to ResourceManager at ip-10-30-38-249.ec2.internal/10.30.38.249:8032
2017-02-27 16:17:19,857 INFO org.apache.hadoop.yarn.client.RMProxy (pool-5-thread-3): Connecting to ResourceManager at ip-10-30-38-249.ec2.internal/10.30.38.249:8032
2017-02-27 16:17:19,884 INFO org.apache.hadoop.yarn.client.RMProxy (pool-5-thread-2): Connecting to ResourceManager at ip-10-30-38-249.ec2.internal/10.30.38.249:8032
2017-02-27 16:17:20,294 INFO org.apache.hadoop.mapred.FileInputFormat (pool-5-thread-2): Total input paths to process : 52
2017-02-27 16:17:20,331 INFO org.apache.hadoop.mapreduce.JobSubmitter (pool-5-thread-2): number of splits:52
2017-02-27 16:17:20,369 INFO org.apache.hadoop.mapreduce.JobSubmitter (pool-5-thread-2): Submitting tokens for job: job_1488211967557_0003
2017-02-27 16:17:20,381 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl (pool-5-thread-2): Submitted application application_1488211967557_0003
2017-02-27 16:17:20,383 INFO org.apache.hadoop.mapreduce.Job (pool-5-thread-2): The url to track the job: http://ip-10-30-38-249.ec2.internal:20888/proxy/application_1488211967557_0003/
2017-02-27 16:17:20,383 INFO cascading.flow.FlowStep (pool-5-thread-2): [com.snowplowanalytics…] submitted hadoop job: job_1488211967557_0003
2017-02-27 16:17:20,383 INFO cascading.flow.FlowStep (pool-5-thread-2): [com.snowplowanalytics…] tracking url: http://ip-10-30-38-249.ec2.internal:20888/proxy/application_1488211967557_0003/
2017-02-27 16:17:20,414 INFO org.apache.hadoop.mapred.FileInputFormat (pool-5-thread-3): Total input paths to process : 52
2017-02-27 16:17:20,448 INFO org.apache.hadoop.mapreduce.JobSubmitter (pool-5-thread-3): number of splits:52
2017-02-27 16:17:20,470 INFO org.apache.hadoop.mapreduce.JobSubmitter (pool-5-thread-3): Submitting tokens for job: job_1488211967557_0004
2017-02-27 16:17:20,481 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl (pool-5-thread-3): Submitted application application_1488211967557_0004
2017-02-27 16:17:20,483 INFO org.apache.hadoop.mapreduce.Job (pool-5-thread-3): The url to track the job: http://ip-10-30-38-249.ec2.internal:20888/proxy/application_1488211967557_0004/
2017-02-27 16:17:20,483 INFO cascading.flow.FlowStep (pool-5-thread-3): [com.snowplowanalytics…] submitted hadoop job: job_1488211967557_0004
2017-02-27 16:17:20,483 INFO cascading.flow.FlowStep (pool-5-thread-3): [com.snowplowanalytics…] tracking url: http://ip-10-30-38-249.ec2.internal:20888/proxy/application_1488211967557_0004/
2017-02-27 16:18:55,675 INFO cascading.tap.hadoop.util.Hadoop18TapUtil (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): deleting temp path hdfs:/local/snowplow/enriched-events/_temporary
2017-02-27 16:18:55,970 INFO cascading.tap.hadoop.util.Hadoop18TapUtil (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): deleting temp path s3://wiw-snowplow-output2/enriched/bad/run=2017-02-27-16-06-43/_temporary
It’s puzzling that everything would report back as “Completed” without any error.
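To double-check what actually happened, here's a minimal diagnostic sketch I'd run, assuming the EMR cluster is still up and the AWS CLI is configured (the application IDs, bucket, and paths are copied from the log above):

```sh
# Ask YARN for the final status of the three applications the flow submitted
yarn application -status application_1488211967557_0002
yarn application -status application_1488211967557_0003
yarn application -status application_1488211967557_0004

# Check whether the enriched-events sink on HDFS actually received output
hdfs dfs -ls -R /local/snowplow/enriched-events

# Check whether the bad-rows sink on S3 actually received output
aws s3 ls --recursive s3://wiw-snowplow-output2/enriched/bad/run=2017-02-27-16-06-43/
```

If all three applications show a final state of SUCCEEDED and the sinks contain only empty part files, that would suggest the jobs genuinely completed but had nothing (or only bad rows) to write, pointing upstream at the raw events rather than at the flow itself.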