Hi,
we managed to set up the shredder. The EMR job completed, but we cannot find a shredding_complete.json
file in the top-level folder of the run. It seems that this file is required to trigger the RDB Loader via the SQS queue, right?
This is the content of the shredded bucket:
s3://our-shredded-bucket/good/run=2021-02-25-17-39-49/
├── _SUCCESS
├── vendor=com.myapp
│   ├── name=generic_tracking_event
│   │   └── format=json
│   │       └── model=1
│   │           ├── part-00000-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│   │           └── part-00003-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│   └── name=minimal_tracking_event
│       └── format=json
│           └── model=1
│               ├── part-00000-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00001-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00002-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00003-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00004-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00005-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00006-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               └── part-00007-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
└── vendor=com.snowplowanalytics.snowplow
    ├── name=atomic
    │   └── format=tsv
    │       └── model=1
    │           ├── part-00000-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00001-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00002-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00003-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00004-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00005-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00006-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           └── part-00007-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    └── name=duplicate
        └── format=json
            └── model=1
                ├── part-00000-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00001-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00002-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00003-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00004-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00006-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                └── part-00007-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
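To rule out that the file was simply overlooked in the listing, a check along these lines could be run against the object keys of the run folder. This is only a sketch: the function name has_completion_marker and the sample keys are made up for illustration, and in practice the keys would come from an actual S3 listing of the bucket.

```python
from typing import Iterable

# File name we expect the shredder to write at the top of the run folder.
MARKER = "shredding_complete.json"


def has_completion_marker(keys: Iterable[str], run_prefix: str) -> bool:
    """Return True if MARKER sits directly under run_prefix (not in a subfolder)."""
    prefix = run_prefix.rstrip("/") + "/"
    return any(key == prefix + MARKER for key in keys)


# Hypothetical sample of keys, shortened from the listing above.
sample_keys = [
    "good/run=2021-02-25-17-39-49/_SUCCESS",
    "good/run=2021-02-25-17-39-49/vendor=com.myapp/name=generic_tracking_event/"
    "format=json/model=1/part-00000-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz",
]

print(has_completion_marker(sample_keys, "good/run=2021-02-25-17-39-49"))
```

For our run this reports that the marker is absent, which matches what we see in the console.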
The config.hocon looks like this:
{
  "name": "myapp",
  "id": "4113ba83-2797-4436-8c92-5ced0b8ac5b6",
  "region": "eu-west-1",
  "messageQueue": "SQS_QUEUE",
  "shredder": {
    "input": "SP_ENRICHED_URI",
    "output": "SP_SHREDDED_GOOD_URI",
    "outputBad": "SP_SHREDDED_BAD_URI",
    "compression": "GZIP"
  },
  "formats": {
    "default": "JSON",
    "json": [ ],
    "tsv": [ ],
    "skip": [ ]
  },
  "storage" = {
    "type": "redshift",
    "host": "redshift.amazon.com",
    "database": "OUR_DB",
    "port": 5439,
    "roleArn": "arn:aws:iam::AWS_ACCOUNT_NUMBER:role/RedshiftLoadRole",
    "schema": "atomic",
    "username": "DB_USER",
    "password": "DB_PASSWORD",
    "jdbc": {"ssl": true},
    "maxError": 10,
    "compRows": 100000
  },
  "steps": ["analyze"],
  "monitoring": {
    "snowplow": null,
    "sentry": null
  }
}
We could not find anything in the EMR job logs indicating that the shredder job was aborted. Is shredding_complete.json only created if the output format is TSV?
Best,
M.