Snowplow missing data in Elasticsearch

Hello!

We use Snowplow for data analytics and Postgres for storing the data. We use Metabase for data visualization, and it worked well until queries started to slow down due to the large amount of data.

As a result, we decided to test Elasticsearch. But we ran into a problem: the amount of data in Elasticsearch is about 5 times less than in Postgres, and we do not understand why. Both receive data from the same stream, yet the difference is massive.

Also, there is a warning from the Elasticsearch stream loader: WARN com.snowplowanalytics.stream.loader.clients.ElasticsearchBulkSender - Returning 56 records as failed, but we have not found an explanation of what it means.

It also throws: [scala-execution-context-global-18] ERROR com.snowplowanalytics.stream.loader.clients.ElasticsearchBulkSender - Record

And: failed with message failed to parse

If anyone has faced a similar problem, or works with Elasticsearch, could you explain how to deal with this difference in data? I understand that Postgres and Elasticsearch are different storage services, but they consume the same stream of data. Maybe there is some problem with the schemas?
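
For reference, this is roughly how we compare the two counts (the connection string, endpoint, and index name below are placeholders for our own setup; atomic.events is the standard Snowplow events table in Postgres):

import psycopg2
import requests

# Placeholder connection string; adjust host/user/password for your database.
pg = psycopg2.connect("dbname=snowplow")
with pg, pg.cursor() as cur:
    cur.execute("SELECT count(*) FROM atomic.events")
    print("postgres:", cur.fetchone()[0])

# Placeholder endpoint and index name.
es = requests.get("http://<es-endpoint>:9200/<index>/_count").json()
print("elasticsearch:", es["count"])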

Thank you in advance

Are you able to share your config.hocon file for the ES loader?

source = "kinesis"
sink {
  good = "elasticsearch"
  bad = "kinesis"
}
enabled = "good"
aws {
  accessKey = iam
  secretKey = iam
}
queue {
  enabled = kinesis
  initialPosition = "TRIM_HORIZON"
  initialTimestamp = ""
  maxRecords = 10000
  region = "us-west-1"
  appName = "<server-app-name>"
  disableCloudWatch = true
}
streams {
  inStreamName = "wx-enriched-stream"
  outStreamName = "wx-bad-1-stream"
  buffer {
    byteLimit = 1000000
    recordLimit = 500
    timeLimit = 500
  }
}
elasticsearch {
  client {
    endpoint = "<elk-endpoint-ip>"
    port = "9200"
    maxTimeout = 10000
    maxRetries = 6
    ssl = false
  }
  aws {
    signing = false
    region = "us-west-1"
  }
  cluster {
    name = "<snowplow-cluster-name>"
    index = "snowplow-enriched-index"
    documentType = "good"
  }
}

Do you have any errors being emitted to Kinesis for the bad events that are not being inserted successfully? There should be some additional info in there.
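
If it helps, a rough sketch of pulling a few records off the bad stream with boto3 (stream name and region taken from your config; this assumes the bad rows are plain JSON payloads) would look something like this:

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-west-1")
stream = "wx-bad-1-stream"  # outStreamName from your config

# Read from the first shard only, starting at the oldest available record.
shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

# Each bad record should contain the original event plus the reason the
# loader could not index it.
for record in kinesis.get_records(ShardIterator=iterator, Limit=50)["Records"]:
    print(json.loads(record["Data"]))

The failure message in those payloads should say exactly which field Elasticsearch refused to parse.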

There are some warnings and errors in the enriched events loader’s logs.
First block is:

[RecordProcessor-0000] INFO com.snowplowanalytics.stream.loader.clients.ElasticsearchBulkSender - Emitted 97 records to Elasticseacrch
[RecordProcessor-0000] WARN com.snowplowanalytics.stream.loader.clients.ElasticsearchBulkSender - Returning 55 records as failed
[scala-execution-context-global-19] WARN com.snowplowanalytics.stream.loader.clients.ElasticsearchBulkSender - Cluster health is yellow

The second is a JSON block (or group of blocks) that starts with:

[scala-execution-context-global-19] ERROR com.snowplowanalytics.stream.loader.clients.ElasticsearchBulkSender - Record 

and finishes with:

failed with message failed to parse

with information about an event between them.
Also, I forgot to mention that we deployed the Snowplow components with Terraform, the loader version turned out to be 1.0.0, and we use Elasticsearch version 7.13. Maybe there is a discrepancy between the versions? What are the best stack versions to use?
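
In case it helps with the diagnosis, this is roughly how we inspect the cluster version and the index mapping (endpoint, port, and index name as in the config above; just an ad-hoc check, not part of the pipeline):

import json
import requests

ES = "http://<elk-endpoint-ip>:9200"  # endpoint and port from config.hocon

# The version the cluster actually reports.
print(requests.get(ES).json()["version"]["number"])

# The mapping Elasticsearch inferred for the index; a dynamically mapped field
# whose type conflicts with later events (e.g. long vs. string) is a common
# cause of "failed to parse" rejections.
print(json.dumps(requests.get(ES + "/snowplow-enriched-index/_mapping").json(), indent=2))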