Didn’t want to hijack the thread linked above, so here’s a new one. My pipeline is dead in the water. The error logs are cryptic and make very little sense to me. What I observe: enrich finishes rather quickly, but shred takes abnormally long. At some point it stalls, drops a few core nodes, resizes, and then exits with errors. Console screenshot attached; it may be a massive coincidence, but it always breaks in the same place.
The errors I see are (I sampled a few of the repeated lines):
2017-03-28 04:37:45,247 WARN org.apache.hadoop.hdfs.DFSClient (IPC Server handler 2 on 10020): Failed to connect to /172.30.0.144:50010 for block, add to deadNodes and continue. java.net.NoRouteToHostException: No route to host
2017-03-28 03:43:16,271 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl (AsyncDispatcher event handler): Updating application attempt appattempt_1490669380627_0008_000001 with final state: FAILED, and exit status: -100
2017-03-28 03:43:16,272 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl (AsyncDispatcher event handler): appattempt_1490669380627_0008_000001 State change from FINAL_SAVING to FAILED
2017-03-28 03:43:16,272 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl (AsyncDispatcher event handler): The number of failed attempts is 0. The max attempts is 2
2017-03-28 04:35:14,286 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger (AsyncDispatcher event handler): USER=hadoop OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1490669380627_0009 failed 2 times due to AM Container for appattempt_1490669380627_0009_000003 exited with exitCode: -1000
Failing this attempt. Failing the application. APPID=application_1490669380627_0009
I doubled the compute nodes and added more juice to the master node (it seemed to me that memory and disk capacity were creeping dangerously close to the redline), but it didn’t make the slightest dent.
I’m at my wit’s end with this one. Every re-run attempt produces a different class of errors. Some are indicative of the master node losing its mind (missing HDFS blocks); some are as cryptic as the samples above.
Any ideas?
I’ve tried to use scala common enrich/shred as a dependency for a last-resort development effort: realtime Kinesis (enriched) -> [hypothetical service] -> S3 (shredded) -> Redshift. But I can’t figure out how to use the library in a Java context, and my Scala authoring skills are non-existent. Has anyone managed to develop a streaming shredder?
Any pointers to setting up a Spark beta pipeline?
We’ve recently added a few custom unstructured events, but only tests made it into the pipeline, so there’s no significant volume to speak of. I’ve checked that the assets (jsonpaths) are in the right place on S3 and that the schemas are happily congregating in the Iglu scala server. Maybe I’ve missed something, new-event-wise?
2017-03-28 07:00:19,974 INFO org.apache.hadoop.mapred.ClientServiceDelegate (flow com.snowplowanalytics.snowplow.enrich.hadoop.ShredJob): Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2017-03-28 07:00:51,623 WARN cascading.flow.FlowStep (flow com.snowplowanalytics.snowplow.enrich.hadoop.ShredJob): [com.snowplowanalytics....] unable to kill job: (3/7)
java.io.IOException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.YarnRuntimeException): Could not load history file hdfs://ip-172-30-25-193.ec2.internal:8020/tmp/hadoop-yarn/staging/history/done/2017/03/28/000000/job_1490677498829_0008-1490679714162-hadoop-%5BABC39635C9CD4B6DB5717397E3E4C9CA%2F9A0A9E5C683F4505-1490680143365-185-7-SUCCEEDED-default-1490679718905.jhist
<...>
Has anyone had any luck with the golang job flow manager for the raw pipeline processing? Should I be switching?
Switched to dataflow-runner and increased resources from 4 core nodes to 5 core + 5 task nodes
distributed enriched events from s3 to hdfs
distributed raw events from s3 to hdfs (just in case)
ran the shred job with the same parameters as the step was configured on the failed runs
Same exact behavior: it runs almost to the end, then fails the shred step, over and over again. Suggestions? Which logs would be indicative of the root cause?
BTW, hdfs /local/snowplow/shreded-events is never created
Input size is at most 10 GB — how can it blow through 200 GB of cluster capacity?
@alex it seems to me there’s a bug either in the Iglu scala server or in the igluctl code that pushes schemas to it.
We recently added a few new event models to our stack. Some of them contained UTF-8 characters (山田, specifically).
The schemas were distributed to the S3 repository and picked up by the scala hadoop enricher without issue; Enrich completed and never faltered. The scala hadoop shredder, on the other hand, consistently failed until the schemas were republished without the character sequences above. Either igluctl does not include the proper content-encoding headers, or the Iglu scala server has a bug of some sort that causes downstream effects.
Do you have any suggestions on how to triangulate the issue? Is there any test suite I can use to validate the documents Iglu serves?
It seems that corrupted schemas cause a runtime NPE.
I created a ticket to explore this issue. Just to be sure: are these unicode symbols contained in the JSON Schema content itself?
What happens if you try to fetch these schemas from the scala server via curl? This seems especially strange because the Enrich and Shred jobs use the same Iglu Client, so if one of them fails the other should fail consistently. What Enrich/Shred job versions are you using?
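To make the curl comparison concrete, something like the sketch below would flag a mismatch between the file you pushed and the copy the registry serves. The URL path and apikey header in the comment are assumptions; check your Iglu Server’s API docs for the real ones:

```python
import json

# Hypothetical fetch (the path and apikey header are assumptions):
#   curl -H "apikey: $APIKEY" \
#     http://iglu.example.com/api/schemas/com.acme/user_created/jsonschema/1-0-0

def served_matches_original(original_text, served_text):
    """True if the served schema is semantically identical to the local file.

    Comparing parsed JSON ignores whitespace differences but still flags
    character-level corruption inside string values (mangled unicode).
    """
    return json.loads(original_text) == json.loads(served_text)
```

Running this over every schema right after an igluctl push would localize the corruption to either the push side or the serve side.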
Versions:
hadoop_enrich: 1.8.0
hadoop_shred: 0.10.0 (I’ve given 11RC1 a try, but it bailed from the start, so I didn’t want to continue investigating with it)
The schema pushed to an S3 bucket with igluctl was identical to the original file.
The schema pushed to the Iglu server and then retrieved via the API and the Swagger UI was NOT identical to the original.
The exact fields where I noticed a discrepancy were:
"first_name" : {
"type": "string",
"maxLength": 255,
"example": "John",
"description" : "Romanized given name (i.e. 山田 is not welcome, should be romanized to Yamada) stripped of surrounding white space"
},
"last_name":{
"type": "string",
"maxLength": 255,
"example": "Smith",
"description" : "Romanized surname / family name (i.e 太郎 is not welcome, should be romanized to Tarō) stripped of surrounding white space"
}
The discrepancies were against the following sequences: 山田, 太郎, Tarō
I purged the potentially offending character sequences from the repository to restore the data processing pipeline, so I’ll have to set up another environment to run tests against. Can you suggest a testing strategy?
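For what it’s worth, this discrepancy pattern is exactly what UTF-8 text looks like after being re-decoded as Latin-1 somewhere in the upload/serve path, which is the usual result of a missing charset declaration and would fit the content-encoding theory. A quick sketch to reproduce the effect (a general illustration, not Iglu-specific code):

```python
def latin1_misread(text):
    """Reproduce what UTF-8 text looks like after a Latin-1 misread."""
    return text.encode("utf-8").decode("latin-1")

for sample in ["山田", "太郎", "Tarō"]:
    print("%r -> %r" % (sample, latin1_misread(sample)))

# The damage is reversible, which makes it easy to confirm against the
# corrupted copies the registry serves: re-encode as Latin-1, decode as UTF-8.
assert latin1_misread("Tarō").encode("latin-1").decode("utf-8") == "Tarō"
```

If re-encoding a served schema as Latin-1 and decoding it as UTF-8 recovers the original text, that pins the bug on a charset mix-up rather than on the schema content itself.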
@dashirov one way I see to reproduce the error would be to run the Scala Hadoop Shred tests with an injected Iglu registry (a copy of Iglu Central that contains unicode characters), but I guess that is too much clutter if you’re not familiar with Scala and the SHS code. I’ll try to reproduce it myself tomorrow and will let you know the results.
Unfortunately, I did not discover anything suspicious while running the Scala Hadoop Shred test suite with schemas containing unicode characters. Here’s what I did:
Inserted your exact description into several key Iglu schemas (particularly iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0)
Uploaded these schemas, along with all the others from Iglu Central, to a Snowplow Mini instance
Injected the Iglu Server from Snowplow Mini as Iglu Central in the test suite
Ran the test suite
Tests passed as usual, without any failures. If I make any errors in the contexts schemas, I immediately start to see bad rows, exceptions, etc.
However, I agree that these symptoms may point us to a unicode-support bug. At the same time, I’m puzzled about why it doesn’t fail on the enrich step and why it didn’t appear in my test. Also, I’m almost sure I’ve seen schemas with unicode symbols before (not on the Iglu Scala Server, though) and am not aware of any problems.
Previously we encountered bugs in the Iglu Scala Client where it silently threw an exception that didn’t appear in bad rows and didn’t short-circuit the job, but I don’t think that can happen now with igluctl lint and the Iglu Client improvements.