[SOLVED] Bad rows for schema violations are not loaded into Elasticsearch

Hi,

I’m investigating an issue with the Elasticsearch loader and would appreciate any advice from you folks.

We have set up Snowplow using Terraform. We are receiving some bad rows with “schema_violations” errors but cannot find them in Elasticsearch. Bad rows for “adapter_failures” are available in Elasticsearch, so I’m sure the Enrich server and Elasticsearch are connected.

We are also loading bad rows into S3, and we can find bad rows with “schema_violations” in the data there.

Our setup is as follows:

Collector - Kinesis - Enrich - Kinesis for Good - ES Loader - ES
                                                - S3 Loader - S3
Collector - Kinesis - Enrich - Kinesis for Bad - ES Loader - ES
                                               - S3 Loader - S3

And our Terraform config for the ES Loader that reads the Enrich bad stream is as follows:

module "es_loader_for_enricher_output_bad_stream_staging" {
  source           = "snowplow-devops/elasticsearch-loader-kinesis-ec2/aws"
  version          = "0.1.1"
  name             = "snowplow-es-loader-for-enricher-output-bad-stream-staging"
  vpc_id           = local.snowplow_conf_staging.vpc_id
  subnet_ids       = local.snowplow_conf_staging_subnets.private_subnet_ids
  ssh_key_name     = local.snowplow_conf_staging.ssh_key_name

  in_stream_type  = "bad"
  in_stream_name  = module.enricher_output_bad_stream_staging.name
  bad_stream_name = module.es_loader_shared_output_bad_stream_staging.name

  es_cluster_endpoint = aws_elasticsearch_domain.snowplow_elasticsearch_staging.endpoint
  es_cluster_port     = 443
  es_cluster_name     = aws_elasticsearch_domain.snowplow_elasticsearch_staging.domain_name

  es_cluster_index         = "snowplow-bad-enriched-index"
  es_cluster_document_type = "bad"

  aws_es_domain_name = aws_elasticsearch_domain.snowplow_elasticsearch_staging.domain_name

  telemetry_enabled = false
}

And on S3 we can find bad row data such as the following:

{"schema":"iglu:com.snowplowanalytics.snowplow.badrows/schema_violations/jsonschema/2-0-0","data":{"processor":{"artifact":"streamCommon","version":"2.0.5"},"failure":{"timestamp":"2022-03-02T11:12:29.542037Z","messages":[{"schemaKey":"iglu:com.snowplowanalytics.snowplow/link_click/jsonschema/1-0-1","error":{"error":"ValidationError","dataReports":[{"message":"$.targetUrl: is missing but it is required","path":"$","keyword":"required","targets":["targetUrl"]}]}}]},"payload": ...TRUNCATED
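To check which bad row types actually land on S3, a minimal Python sketch like the one below can summarize each line of the bad row files. It assumes one JSON object per line and relies only on the fields visible in the sample above (`schema`, `data.failure.timestamp`, `data.failure.messages[].schemaKey`). The function name `summarize_bad_row` is made up for illustration, and the `payload` field is left out of the sample because it is truncated above.

```python
import json


def summarize_bad_row(line: str) -> dict:
    """Return the bad row type, failure timestamp and failing schema keys."""
    row = json.loads(line)
    # The schema URI has the form iglu:<vendor>/<name>/jsonschema/<version>,
    # so the second path segment is the bad row type.
    bad_row_type = row["schema"].split("/")[1]
    failure = row["data"].get("failure", {})
    messages = failure.get("messages", [])
    return {
        "type": bad_row_type,
        "failure_timestamp": failure.get("timestamp"),
        # Not every bad row type carries a schemaKey, hence .get()
        "schema_keys": [m.get("schemaKey") for m in messages if isinstance(m, dict)],
    }


# Sample based on the row above, with the truncated payload omitted.
sample = json.dumps({
    "schema": "iglu:com.snowplowanalytics.snowplow.badrows/schema_violations/jsonschema/2-0-0",
    "data": {
        "processor": {"artifact": "streamCommon", "version": "2.0.5"},
        "failure": {
            "timestamp": "2022-03-02T11:12:29.542037Z",
            "messages": [
                {"schemaKey": "iglu:com.snowplowanalytics.snowplow/link_click/jsonschema/1-0-1"}
            ],
        },
    },
})

summary = summarize_bad_row(sample)
print(summary)
```

Running this over each line of a downloaded S3 file and counting the `type` values gives a quick baseline to compare against what shows up in Elasticsearch.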

I have checked the CloudWatch logs for the ES Loader but couldn’t identify any helpful output.

Resolved. I was using the device clock as the ES timestamp filter, and the device clock was badly wrong, so the bad rows were filtered out. I’ve changed the ES timestamp filter to use the server timestamp.
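For anyone hitting the same thing, here is a rough sketch of why this hides documents. It assumes the timestamp filter is a simple time-range filter over one timestamp field. The field names `device_ts` and `server_ts`, the 15-minute window, and the wrong device time are all hypothetical values chosen for illustration, not the real field names.

```python
from datetime import datetime, timedelta, timezone


def in_window(ts: datetime, now: datetime, window: timedelta) -> bool:
    """True if ts falls inside the range [now - window, now]."""
    return now - window <= ts <= now


# Hypothetical "last 15 minutes" view.
now = datetime(2022, 3, 2, 11, 15, tzinfo=timezone.utc)
window = timedelta(minutes=15)

doc = {
    # Server-side time, close to when the row was actually produced.
    "server_ts": datetime(2022, 3, 2, 11, 12, 29, tzinfo=timezone.utc),
    # Hypothetical device clock that is off by years.
    "device_ts": datetime(2019, 1, 1, 0, 0, 0, tzinfo=timezone.utc),
}

visible_by_device = in_window(doc["device_ts"], now, window)
visible_by_server = in_window(doc["server_ts"], now, window)
print(visible_by_device, visible_by_server)
```

The document is stored in both cases. Only the field used for the time range decides whether it shows up, which is why the rows looked as if they had never been loaded.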

Hey @shimpeko

Thanks for coming back and letting us know you found a solution. It’s much appreciated!