Enriched good and bad buckets are empty in the enrich

sandesh · November 24, 2017, 3:31pm

I am running the storage step in Snowplow, but the events are not ending up in the buckets.

When I check the buckets, every file is empty (0 KB).

Please suggest what I am doing wrong.

EC2 instance type: t2.medium

The config.yml file is below:

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: xxxx
  secret_access_key: xxxxxx
  #keypair: Snowplowkeypair
  #key-pair-file: /home/ubuntu/snowplow/4-storage/config/Snowplowkeypair.pem
  region: us-east-1
  s3:
    region: us-east-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://snowplowdataevents2/logs
      raw:
        in:                  # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
          - s3://snowplowdataevents2/      # e.g. s3://my-old-collector-bucket
        processing: s3://snowplowdataevents2/raw/processing
        archive: s3://snowplowdataevents2/raw/archive   # e.g. s3://my-archive-bucket/raw
      enriched:
        good: s3://snowplowdataevents2/enriched/good        # e.g. s3://my-out-bucket/enriched/good
        bad: s3://snowplowdataevents2/enriched/bad       # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://snowplowdataevents2/enriched/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://snowplowdataevents2/enriched/archive    # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://snowplowdataevents2/shredded/good        # e.g. s3://my-out-bucket/shredded/good
        bad: s3://snowplowdataevents2/shredded/bad        # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://snowplowdataevents2/shredded/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://snowplowdataevents2/shredded/archive     # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 5.5.0
    region: us-east-1       # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement: us-east-1a      # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id:  # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: Snowplowkeypair
    bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase:              # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow ETL # Give your job a name
      master_instance_type: m2.4xlarge
      core_instance_count: 2
      core_instance_type: m2.4xlarge
      core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
        volume_size: 100    # Gigabytes
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m2.4xlarge
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: thrift # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.9.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.12.0
    rdb_shredder: 0.12.0        # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  #snowplow:
    #method: get
    #app_id: unilog # e.g. snowplow
    #collector: 172.31.38.39:8082 # e.g. d3rkrsqld9gmqf.cloudfront.net

iglu_resolver.json file is as below

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
	"cacheSize": 500,
	"repositories": [
	  {
		"name": "Iglu Central",
		"priority": 0,
		"vendorPrefixes": [ "com.snowplowanalytics" ],
		"connection": {
		  "http": {
			"uri": "http://iglucentral.com"
		  }
		}
	  }
	]
  }
}

redshift.json file is as below.

{
	"schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/2-0-0",
	"data": {
		"name": "AWS Redshift enriched events storage",
		"host": "snowplow.xxxxxx.us-east-1.redshift.amazonaws.com",
		"database": "xxx",
		"port": 5439,
		"sslMode": "DISABLE",
		"username": "unilog2",
		"password": "xxxxx",
		"roleArn": "arn:aws:iam::302576851619:role/NewRedshiftRole",
		"schema": "atomic",
		"maxError": 1,
		"compRows": 20000,
		"purpose": "ENRICHED_EVENTS"
	}
}

The error in that particular step is as follows:

Exception in thread "main" java.lang.RuntimeException: Error running job
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:927)
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:705)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at com.amazon.elasticmapreduce.s3distcp.Main.main(Main.java:22)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ip-172-31-27-212.ec2.internal:8020/tmp/8e1694aa-baa4-4c6c-8862-573a58bfedb6/files
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:317)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:352)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:901)
... 10 more

I don't understand what mistake I am making; please help me out.

Below is the command I am running:

 ./snowplow-emr-etl-runner run --config snowplow/4-storage/config/emretlrunner.yml --resolver snowplow/4-storage/config/iglu_resolver.json --targets snowplow/4-storage/config/targets/ --skip analyze

Please suggest any changes needed in the configuration files.

Note: A couple of weeks ago this was working fine; now both the good and bad buckets are empty. Please help me out.

leon · November 25, 2017, 2:14pm

Hi @sandesh,

It seems like the pipeline had a failure earlier and you are now trying to run it again from the start. This will not work; you need to resume it from a certain point depending on where it failed previously.

The first thing to do is to work out what state the pipeline is in currently by examining the buckets (not only good and bad) and also the data load in Redshift.

From there on you can determine how to resume the pipeline and which steps you have to skip.

This diagram shows exactly where to look for the data and how to resume. There are different variants depending on the Snowplow version.

The EMR logs can also help. If you can find the EMR logs of the initial failed run you can see if it failed during one of those steps making it even easier to identify how to resume.

If you can’t work it out with the diagram please tell us the state (empty/full) of the different buckets and we’ll help you further.
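As a rough aid for reporting that state, the sketch below classifies each location as empty or full from a listing of (key, size) pairs. The function name `bucket_state`, the sample listing, and the rule that zero-byte objects count as empty are all assumptions made for illustration; how the listing is obtained (for example from the S3 console) is left out.

```python
from typing import Dict, List, Tuple


def bucket_state(listing: Dict[str, List[Tuple[str, int]]]) -> Dict[str, str]:
    """Classify each S3 location as 'empty' or 'full'.

    Assumption: a location is 'full' only if at least one object has a
    non-zero size, so 0 KB marker files do not count as data.
    """
    state = {}
    for location, objects in listing.items():
        has_data = any(size > 0 for _key, size in objects)
        state[location] = "full" if has_data else "empty"
    return state


# Hypothetical listing: keys and sizes are made up for illustration.
example = {
    "raw/processing": [("raw/processing/part-0.lzo", 2048)],
    "enriched/good": [("enriched/good/run=2017-11-27-06-39-50/part-00000", 0)],
    "shredded/good": [],
}

for location, status in bucket_state(example).items():
    print(f"{location}: {status}")
```

Treating 0 KB files as empty matters here because a run folder can exist and still contain no events.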

Good luck!

Leon

sandesh · November 27, 2017, 7:31am

Hi @leon, thanks for the reply.
Below is the architecture I am following.

JavaScript tracker -> Scala Stream Collector -> Kinesis S3 -> S3 -> EmrEtlRunner (shredding + enrich) -> Redshift

Below are the details:

I have added the page view tracking script to the web page.

When I perform an action on the web page, below is the output I get in the Scala Stream Collector:

06:24:25.672 [scala-stream-collector-akka.actor.default-dispatcher-8] DEBUG s.can.server.HttpServerConnection - Dispatching GET request to http://localhost:8082/i?stm=1511763945885&e=pv&url=http://localhost:8085/ChangesHTML/SampleExampleTracker.html&page=Fixed+Width+2+Blue&tv=js-2.8.0&tna=cf&aid=13&p=web&tz=Asia/Kolkata&lang=en-US&cs=UTF-8&f_pdf=1&f_qt=0&f_realp=0&f_wma=0&f_dir=0&f_fla=0&f_java=0&f_gears=0&f_ag=0&res=1366x768&cd=24&cookie=1&eid=f12f640c-ba08-48a8-96b0-f9ff57ddccdc&dtm=1511763945883&vp=1517x735&ds=1518x736&vid=1&sid=117cb3c3-a8a5-4716-921b-32d5b4dae585&duid=e727b978-3be6-4a78-aac5-42ba1f0c6e38&fp=4265106636 to handler Actor[akka://scala-stream-collector/system/IO-TCP/selectors/$a/1#1753195764]
06:24:26.310 [scala-stream-collector-akka.actor.default-dispatcher-8] DEBUG s.can.server.HttpServerConnection - Dispatching GET request to http://localhost:8082/i?stm=1511763946625&e=pv&url=http://localhost:8085/ChangesHTML/SampleExampleTracker.html&page=Fixed+Width+2+Blue&tv=js-2.8.0&tna=cf&aid=13&p=web&tz=Asia/Kolkata&lang=en-US&cs=UTF-8&f_pdf=1&f_qt=0&f_realp=0&f_wma=0&f_dir=0&f_fla=0&f_java=0&f_gears=0&f_ag=0&res=1366x768&cd=24&cookie=1&eid=74795290-9a1c-4567-85b2-3e25389ccfd8&dtm=1511763946622&vp=1517x735&ds=1499x1028&vid=1&sid=6b95a78f-7ae8-4656-8538-0ba0c9d8ce8e&duid=6567aa9b-809f-472c-a6db-e73dab17ddc1&fp=4265106636 to handler Actor[akka://scala-stream-collector/system/IO-TCP/selectors/$a/1#1753195764]
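To check what the collector is actually receiving, the query string of such a request can be decoded. The sketch below is only an illustration: `describe_request` is a hypothetical helper, and the sample URL is a shortened version of the first log line above. The parameter meanings (`e` for the event type, with `pv` as page view and `pp` as page ping, `aid` for the application ID, `tv` for the tracker version, `p` for the platform) follow the Snowplow tracker protocol.

```python
from urllib.parse import parse_qs, urlsplit

# Only the two event codes that appear in this thread are mapped here.
EVENT_TYPES = {"pv": "page view", "pp": "page ping"}


def describe_request(url: str) -> dict:
    """Pull a few tracker fields out of a collector GET request URL."""
    params = parse_qs(urlsplit(url).query)

    def first(name):
        # parse_qs returns a list per key; take the first value if present.
        return params.get(name, [None])[0]

    event_code = first("e")
    return {
        "event": EVENT_TYPES.get(event_code, event_code),
        "app_id": first("aid"),
        "tracker_version": first("tv"),
        "platform": first("p"),
        "page_title": first("page"),
    }


# Shortened copy of the first collector log line (assumption: the
# dropped parameters are not needed for this check).
sample = (
    "http://localhost:8082/i?stm=1511763945885&e=pv"
    "&url=http://localhost:8085/ChangesHTML/SampleExampleTracker.html"
    "&page=Fixed+Width+2+Blue&tv=js-2.8.0&tna=cf&aid=13&p=web"
)
print(describe_request(sample))
```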

Then I created 2 Kinesis streams to pass the events to the S3 bucket.
Below is the command I run:
(./snowplow-kinesis-s3-0.5.0 --config kinises.conf)
Below is the output when data is passed from the Kinesis stream to the S3 bucket.

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.storage.kinesis.s3.S3Emitter - Flushing buffer with 8 records.
[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.storage.kinesis.s3.S3Emitter - Successfully serialized 8 records out of 8
[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.storage.kinesis.s3.S3Emitter - Successfully emitted 8 records to S3 in s3://databaseregionevents/2017-11-27-49578891737724711875591370082515362848639313462685597698-49578891737724711875591370107949953167511509038834647042.lzo
[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.storage.kinesis.s3.S3Emitter - Successfully emitted 8 records to S3 in s3://databaseregionevents/2017-11-27-49578891737724711875591370082515362848639313462685597698-49578891737724711875591370107949953167511509038834647042.lzo.index

Below is the data I saw when I opened the .lzo file (Note: I just opened the file with Notepad++; I didn't extract anything)
(s3://databaseregionevents/2017-11-27-49578891737724711875591370082515362848639313462685597698-49578891737724711875591370107949953167511509038834647042.lzo)

‰LZO 

 €	@      ¤Ze±%     ,Ý€  j  ë   )ØÕXÍL)²¼W™!q½ÿV  byte[]É d   172.31.38.39
È  _ü+N) Ò   UTF-8 Ü   ssc-0.9.0-kinesis,   rMozilla/5.0 (Windows NT 10.0; Win64; x64) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.366   
;http://localhost:8085/ChangesHTML/SampleExampleTracker.html@   /iJ        
stm=1511764297427&e=pp&url=http%3A%2F%2Flocalhost%3A8085%2FChangesHTML%2FSampleExampl
eTracker.html&page=Fixed%20Width%202%20Blue&pp_mix=0&pp_max=0&pp_miy=0&pp_may=0&tv=js-
2.8.0&tna=cf&aid=13&p=web&tz=Asia%2FKolkata&lang=en-US&cs=9 &f_pdf=1&f_qt=0&f_realp‡wmaŸ 
dirž fl¿jav gearsˆ  Mag=0&res=1366x768&cd=24&cookie=1&eid=ac450cd0-8bd6-4b53-bf6d-
1a7a0397105e&dtm=1511764297423&vp=1517x735&ds=1499x860&vid=1&sid=4fb9e7d0-5df4-4e60-9226-
8457b36193bd&duid=e071f763-3e03-4e40-9472-6f4d56e444a0&fp=4265106636^      Host: 
localhost:8082   Connection: keep-alive   ~User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) 
AppleWebKit/537. `    2Accept: image/webp,Ìapngì 	*, */*;q=0.8   DReferer: ht P
   "´ -Encoding: gzip, deflate, br    Ô Language: en-US, en„9  %C„I  ’: 
 rxVisitor=1495864829678I083H1MM3UVIPQREIQSNDG5V1FS6344V; _sp_id.1fff=a3310903-1094-4cb8-
a179-e209cb4198f9.1500386692.22.1503474180.1503471200.6f1d47eb-6b94-408c-b0dd-6c91d546bfc0; 
loginMessage=logout; F47F4A0F30FB7A75=sandesh.p@unilogcorp.com; 750B2E0333A28C1D=test1234; 
 F30FB33A2=true   	localhostš   $7ec37777-495b-4b2b-8a11-da7d24ef9a5dzi   
Aiglu:com.snowplowanalytics.snowplow/CollectorPayload/thrift/1-0-0

Below is the data of the .lzo.index file (Note: I just opened the file with Notepad++; I didn't extract anything)
(s3://databaseregionevents/2017-11-27-49578891737724711875591370082515362848639313462685597698-49578891737724711875591370107949953167511509038834647042.lzo.index)

Then I start the enrichment process for the events using the following command.

./snowplow-emr-etl-runner run --config snowplow/4-storage/config/emretlrunner.yml --resolver snowplow/3-enrich/config/iglu_resolver.json --enrichments snowplow/3-enrich/config/enrichments/

Once I ran this command, all 12 steps completed successfully; I also verified this in EMR.
Below is the message I got after the process completed:

D, [2017-11-27T06:39:50.913000 #19422] DEBUG -- : Initializing EMR jobflow
D, [2017-11-27T06:39:55.161000 #19422] DEBUG -- : EMR jobflow j-1T9FRDP4EWWI8 started, waiting for jobflow to complete...
I, [2017-11-27T07:05:59.671000 #19422]  INFO -- : No RDB Loader logs
D, [2017-11-27T07:05:59.671000 #19422] DEBUG -- : EMR jobflow j-1T9FRDP4EWWI8 completed successfully.
I, [2017-11-27T07:05:59.671000 #19422]  INFO -- : Completed successfully

After the process completed successfully:
1. Raw section: 2 folders were created, i.e. archive and processing. Inside processing, a logs folder was created, and inside archive, the Kinesis events were copied under a run folder.
2. Enriched section: 3 folders were created, i.e. archive, good and bad. Inside archive, a run folder was created, and inside the run folder a .csv file was generated. Below is the data present in the .csv file.

   13	web	2017-11-27 06:39:50.934	2017-11-27 06:30:17.385	2017-11-27 06:31:37.423	page_ping	ac450cd0-8bd6-4b53-bf6d-1a7a0397105e		cf	js-2.8.0	ssc-0.9.0-kinesis	spark-1.9.0-common-0.25.0		172.31.38.x	4265106636	e071f763-3e03-4e40-9472-6f4d56e444a0	1	7ec37777-495b-4b2b-8a11-da7d24ef9a5d												http://localhost:8085/ChangesHTML/SampleExampleTracker.html	Fixed Width 2 Blue		http	localhost	8085	/ChangesHTML/SampleExampleTracker.html																																						0	0	0	0	Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36	Chrome	Chrome	62.0.3202.94	Browser	WEBKIT	en-US	1	0	0	0	0	0	0	0	0	1	24	1517	735	Windows 10	Windows	Microsoft Corporation	Asia/Kolkata	Computer	0	1366	768	UTF-8	1499	860												2017-11-27 06:31:37.427			{"schema":"iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1","data":[{"schema":"iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0","data":{"useragentFamily":"Chrome","useragentMajor":"62","useragentMinor":"0","useragentPatch":"3202","useragentVersion":"Chrome 62.0.3202","osFamily":"Windows","osMajor":null,"osMinor":null,"osPatch":null,"osPatchMinor":null,"osVersion":"Windows","deviceFamily":"Other"}}]}	4fb9e7d0-5df4-4e60-9226-8457b36193bd	2017-11-27 06:30:17.381	com.snowplowanalytics.snowplow	page_ping	jsonschema	1-0-0	dcfc0cffb76b37e93a54d47d3b33ef1c

Inside the good folder, a run folder was generated, containing a 0 KB file with no data.
Inside the bad folder, a run folder was generated with 2 files, i.e. success and part_0, both with no data (0 KB).
3. Shredded section: 3 folders were created, i.e. archive, good and bad. Inside archive, a run folder was created, and inside the run folder there are atomic-events and shredded-types.

Inside the good folder, a run folder was generated, and it contains shredded_types.

Inside the bad folder, a run folder was generated, and it contains many 0 KB files.

For the storage process, below is the command I used:
./snowplow-emr-etl-runner run --config snowplow/4-storage/config/emretlrunner.yml --resolver snowplow/4-storage/config/iglu_resolver.json --targets snowplow/4-storage/config/targets/ --skip analyze

It fails at the 4th step of the process. Below is the error:

Exception in thread "main" java.lang.RuntimeException: Error running job
	at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:927)
	at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:705)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at com.amazon.elasticmapreduce.s3distcp.Main.main(Main.java:22)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ip-172-31-18-175.ec2.internal:8020/tmp/d28bb1f8-0bae-420d-97ef-45046305b36e/files
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:317)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
	at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:352)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
	at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:901)
	... 10 more

I have explained everything in detail. If anything is missing from the above steps, please let me know.

Thanks,
Sandesh P

sandesh · November 28, 2017, 5:16am

@leon please help me to resolve this error…

leon · November 29, 2017, 3:03pm

Hi @sandesh,

Thank you for the detailed explanation.
To make sure that I understand everything correctly can you please confirm the state of the following buckets:

shredded good: not empty
enriched good: empty
raw processing: not empty

And can you please also tell us which version of Snowplow you are running?

At this moment I think that the pipeline originally had failed at the archive_raw step. If the connection with the EMR cluster had been interrupted the EMR process itself would have finished but the pipeline would not be able to continue after that. This does depend on what version you’re running, though.

If you can confirm the bucket state and the version then I think we are almost ready to solve this.

Leon

sandesh · November 30, 2017, 12:10pm

Your understanding is correct @leon.

shredded good: not empty
enriched good: empty
raw processing: not empty

I am using the snowplow_emr_r92_maiden_castle version.

leon · November 30, 2017, 1:36pm

OK, so if you use an ElasticSearch cluster for the bad rows, it probably failed at that step; if you don’t, then it was most likely the S3DistCp step (a.k.a. archive_raw).

It’s a bit odd that the enriched: good bucket is also empty but the most important one in this case is the shredded: good since that is what has to go into Redshift.

The best and fastest way forward is to resume from archive_raw (step 12 in the diagram), so run it with the --skip staging,enrich,shred,elasticsearch option (or --resume-from archive_raw).
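The equivalence between the two options can be sketched as follows. The step order in `STEPS` is an assumption based on the steps named in this thread and the R92 batch pipeline, and `skip_list_for_resume` is a hypothetical helper, not part of EmrEtlRunner. Check the list against the documentation for your version before relying on it.

```python
# Assumed order of EmrEtlRunner steps for R92; verify for your version.
STEPS = [
    "staging",
    "enrich",
    "shred",
    "elasticsearch",
    "archive_raw",
    "rdb_load",
    "analyze",
    "archive_enriched",
]


def skip_list_for_resume(step: str) -> str:
    """Return the --skip value that skips every step before `step`."""
    if step not in STEPS:
        raise ValueError(f"Unknown step: {step}")
    return ",".join(STEPS[: STEPS.index(step)])


print(skip_list_for_resume("archive_raw"))
# prints: staging,enrich,shred,elasticsearch
```

Resuming from a step is, in this model, the same as skipping everything that precedes it, which is why the two commands are expected to behave alike.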

Please do report back and let us know if that worked!

sandesh · December 4, 2017, 5:15am

When I followed the process you mentioned, I got the error below.
Below are the commands I used; I tried both approaches.

./snowplow-emr-etl-runner run --config snowplow/4-storage/config/emretlrunner.yml --resolver snowplow/4-storage/config/iglu_resolver.json --targets snowplow/4-storage/config/targets/ --skip analyze,staging,enrich,shred,elasticsearch

./snowplow-emr-etl-runner run --config snowplow/4-storage/config/emretlrunner.yml --resolver snowplow/4-storage/config/iglu_resolver.json --targets snowplow/4-storage/config/targets/ --resume-from archive_raw



D, [2017-12-04T05:12:43.748000 #23675] DEBUG -- : Initializing EMR jobflow
E, [2017-12-04T05:12:45.629000 #23675] ERROR -- : No run folders in [s3://sample1bucketevents/enriched/good/] found
F, [2017-12-04T05:12:45.631000 #23675] FATAL -- :

Snowplow::EmrEtlRunner::UnexpectedStateError (No run folders in [s3://sample1bucketevents/enriched/good/] found):
	uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:727:in `get_latest_run_id'
	uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:485:in `initialize'
	uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
	uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
	uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
	uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:100:in `run'
	uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
	uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
	uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
	uri:classloader:/emr-etl-runner/bin/snowplow-emr-etl-runner:41:in `<main>'
	org/jruby/RubyKernel.java:979:in `load'
	uri:classloader:/META-INF/main.rb:1:in `<main>'
	org/jruby/RubyKernel.java:961:in `require'
	uri:classloader:/META-INF/main.rb:1:in `(root)'
	uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in `<main>'

Please help me out.
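The error message suggests that the runner looks for at least one run folder under enriched/good when the earlier steps are skipped. The sketch below illustrates that kind of check. It is not the EmrEtlRunner implementation (which is Ruby, in emr_job.rb); the helper name `latest_run_id`, the sample keys, and the `run=YYYY-MM-DD-hh-mm-ss` folder naming are assumptions for illustration.

```python
import re
from typing import Iterable

# Assumed naming of run folders: run=YYYY-MM-DD-hh-mm-ss
RUN_FOLDER = re.compile(r"run=(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})/")


def latest_run_id(keys: Iterable[str], location: str) -> str:
    """Return the most recent run ID found among the given object keys."""
    run_ids = set()
    for key in keys:
        match = RUN_FOLDER.search(key)
        if match:
            run_ids.add(match.group(1))
    if not run_ids:
        # Mirrors the wording of the error seen in the log above.
        raise LookupError(f"No run folders in [{location}] found")
    # Zero-padded timestamps sort correctly as plain strings.
    return max(run_ids)


# Hypothetical keys, made up for illustration.
keys = [
    "enriched/good/run=2017-11-20-05-00-00/part-00000",
    "enriched/good/run=2017-11-27-06-39-50/part-00000",
]
print(latest_run_id(keys, "s3://example-bucket/enriched/good/"))
```

With an empty enriched/good location the function raises, which corresponds to the state reported earlier in the thread (enriched good: empty).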
