Question regarding Snowplow event recovery 0.1.0 configuration format

Hello,

We have a bad row that looks like this:

{
  "line": "<base64 encoded string>",
  "errors": [
    {
      "level": "error",
      "message": "error: object instance has properties which are not allowed by the schema: [\"submitted\"]\n    level: \"error\"\n    schema: {\"loadingURI\":\"#\",\"pointer\":\"\"}\n    instance: {\"pointer\":\"\"}\n    domain: \"validation\"\n    keyword: \"additionalProperties\"\n    unwanted: [\"submitted\"]\n"
    }
  ],
  "failure_tstamp": "2021-03-21T10:03:26.280Z"
}

We are trying to use snowplow-event-recovery-spark-0.1.0.jar to correct the bad row, but we are unsure what to use as the error filter in the config. Specifically, what should go inside the "error" property of the configuration? Should we just copy the message field, as follows?

{
  "schema": "iglu:com.snowplowanalytics.snowplow/recoveries/jsonschema/1-0-0",
  "data": [
    {
      "name": "RemoveFromBody",
      "error": "error: object instance has properties which are not allowed by the schema: [\"submitted\"]\n    level: \"error\"\n    schema: {\"loadingURI\":\"#\",\"pointer\":\"\"}\n    instance: {\"pointer\":\"\"}\n    domain: \"validation\"\n    keyword: \"additionalProperties\"\n    unwanted: [\"submitted\"]\n",
      "toRemove": "\"submitted\":\".*\",?"
    }
  ]
}

@onnu_thonala_ad, yes. You can either use the whole string, with exactly the same characters as in the bad row's error message, or just a part of it that is sufficient to identify the rejected events you are after. For example, you could use just: object instance has properties which are not allowed by the schema: ["submitted"] (with the inner double quotes escaped as \" in the JSON config).

The documentation example shows the same approach:

    # Removes a field which shouldn't be there
    {
      "name": "RemoveFromBody",
      "error": "object instance has properties which are not allowed by the schema: [\"test\"]",
      "toRemove": "\"test\":\".*\",?"
    }
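
To illustrate why a partial string is enough, here is a minimal Python sketch. It assumes the job selects a bad row when the configured "error" text occurs as a substring of one of the row's error messages, which is what the answer above implies but which is not confirmed against the job's source. The helper name matches_scenario is hypothetical, and the message is shortened.

```python
import json

# Bad row shaped like the one in the question (message shortened for readability).
bad_row = json.loads(r'''
{
  "line": "<base64 encoded string>",
  "errors": [
    {
      "level": "error",
      "message": "error: object instance has properties which are not allowed by the schema: [\"submitted\"]\n    level: \"error\"\n"
    }
  ],
  "failure_tstamp": "2021-03-21T10:03:26.280Z"
}
''')


def matches_scenario(row: dict, error_filter: str) -> bool:
    """Hypothetical helper: True if any error message contains the filter text."""
    return any(error_filter in err["message"] for err in row["errors"])


# A fragment of the message, written with plain quotes as it appears after JSON decoding.
partial = 'object instance has properties which are not allowed by the schema: ["submitted"]'

print(matches_scenario(bad_row, partial))                                # True
print(matches_scenario(bad_row, 'not allowed by the schema: ["test"]'))  # False
```

Under that assumption, a shorter filter matches more rows, so it is worth keeping enough of the message (here the offending property name) to avoid picking up unrelated failures.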

Thank you @ihor

@ihor could you help us with one more thing? We have a field in the body (base64 encoded) that we need to replace. We need to replace

{
  "submitted": {
    "data": "xyz"
  }
}

with

{
  "submitted_modified": "xyz"
}

Could you help us with the regex for this? Thanks!

@onnu_thonala_ad, I think it would be something like this:

"toReplace": "\"submitted\":\{\"data\":\"(.*)\"\}",
"replacement": "\"submitted_modified\":\"$1\""

I’m not sure if the raw data already has curly brackets escaped. If so, you would have \\{ and \\} in “toReplace”.
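
The suggested pattern can be tried offline before running the job. The sketch below is only an approximation: it uses Python's re module, where the captured group is written \1 in the replacement rather than the $1 used in the config above, and it assumes the decoded body is compact JSON on one line. The sample string and variable names are made up for illustration. It also shows that the greedy (.*) can run past the intended value when other fields follow, which a narrower group such as ([^"]*) avoids.

```python
import json
import re

# Hypothetical compact payload: "submitted" is followed by another object.
decoded = '{"type":"xxxx","submitted":{"data":"xyz"},"core":{"mode":"xxxx"}}'

greedy = r'"submitted":\{"data":"(.*)"\}'      # pattern as suggested above
narrow = r'"submitted":\{"data":"([^"]*)"\}'   # stops at the first closing quote
replacement = r'"submitted_modified":"\1"'     # \1 in Python, $1 in the config


def is_valid_json(text: str) -> bool:
    """Return True if text parses as a single JSON document."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


greedy_result = re.sub(greedy, replacement, decoded)
narrow_result = re.sub(narrow, replacement, decoded)

print(greedy_result)
print(is_valid_json(greedy_result))  # False: the group swallowed part of "core"
print(narrow_result)
print(is_valid_json(narrow_result))  # True
```

The narrow group assumes the value contains no escaped double quotes; a value that does would need a more careful pattern.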


Thanks a lot for helping out @ihor

Hello @ihor, I tried the regex that you gave me, but it didn’t work. I tried testing it locally, but it threw errors. Basically, parsing the recoveryScenarios JSON with the circe parser fails for nested JSON.

val recoveryScenarios = io.circe.parser.parse(getResourceContent("/recovery_scenarios.json"))
    .flatMap(_.hcursor.get[List[RecoveryScenario]]("data"))
    .fold(f => throw new Exception(s"invalid recovery scenarios: ${f.getMessage}"), identity)

I tried 3 different regexes, but none of them worked. I have attached screenshots for your reference.

Do you know where we’re going wrong?

Thanks!
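
One possible cause of the parse failure, offered as a guess since the screenshots are not available: JSON strings only allow the escapes \" \\ \/ \b \f \n \r \t and \uXXXX, so a single backslash before a curly bracket makes the whole scenarios file invalid before any regex is evaluated. The sketch below uses Python's json module as a stand-in for the circe parser; the helper try_parse is hypothetical.

```python
import json

# As posted: a single backslash before { and }, which JSON does not allow.
as_posted = r'{"toReplace": "\"submitted\":\{\"data\":\"(.*)\"\}"}'

# Doubled backslashes: JSON decodes \\ to one backslash, so the regex receives \{ and \}.
escaped = r'{"toReplace": "\"submitted\":\\{\"data\":\"(.*)\"\\}"}'


def try_parse(text: str):
    """Hypothetical helper: return the parsed document, or a short error string."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        return f"parse error: {exc.msg}"


print(try_parse(as_posted))              # a parse error about the invalid escape
print(try_parse(escaped)["toReplace"])   # the regex as the job would receive it
```

If this is the cause, writing \\{ and \\} in "toReplace" should let the file parse, with the regex engine still seeing an escaped bracket.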

@ihor Could you kindly help with this?

@onnu_thonala_ad , could you share an example of the whole bad record?

@ihor I’m afraid I wouldn’t be able to share it on the public forum because of some company policies. Is there a way I can DM you or email you?

Hey @ihor , sorry for bothering you again. Would it be possible to contact you privately to share the whole bad record?

@onnu_thonala_ad, you do not need to share any sensitive data, only a single example of the bad row with the sensitive data masked. I’m only interested in the structure of your bad event.

I’m afraid the kind of support you are suggesting is beyond what I can offer to open-source users.

No problem, I understand @ihor. Here’s the structure of an example bad event (attaching the ue_px field decoded):

{
  "schema": "iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0",
  "data": {
    "schema": "iglu:xxxxx/yyyyy/jsonschema/1-0-0",
    "data": {
      "type": "xxxx",
      "shown_adaptive_links": [
        {
          "type": "xx",
          "name": "xx"
        },
        {
          "type": "xx",
          "name": "xx"
        }
      ],
      "submitted": {
        "data": "REQUIRED TO BE REPLACED"
      },
      "core": {
        "request_id": "xxxx",
        "session_id": "xxxx",
        "chat_id": "xxxx",
        "mode": "xxxx",
        "workspace_id": "xxxx",
        "conversation_id": "xxxx",
        "module": "xxxx",
        "language": "xxxx"
      }
    }
  }
}

I hope this helps in debugging. Thanks again for your help!

Hey @onnu_thonala_ad, this is already formatted and contains only the custom data. I meant the whole bad row, including the error message. Could you decode the encoded values, remove the sensitive data, encode them back, and post the whole bad row?

I’m also amending the title of this post: the version you need for this recovery is 0.1.0 (old bad row format), not 1.0.0 (new format).
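
The decode, mask, and re-encode round trip requested above can be sketched as follows. This is only an illustration under stated assumptions: the encoded value is taken to be standard base64 of UTF-8 JSON (as a ue_px value might be, though trackers may use the URL-safe alphabet), and the bad row's "line" field may use a different serialization that this sketch does not handle. The function names and the sample payload are hypothetical.

```python
import base64
import json


def mask_strings(value):
    """Replace every string leaf with 'xxxx', keeping keys, nesting and 'schema' values."""
    if isinstance(value, dict):
        return {k: (v if k == "schema" else mask_strings(v)) for k, v in value.items()}
    if isinstance(value, list):
        return [mask_strings(v) for v in value]
    return "xxxx" if isinstance(value, str) else value


def mask_base64_json(encoded: str) -> str:
    """Decode base64 JSON, mask its string values, and encode it back compactly."""
    decoded = json.loads(base64.b64decode(encoded))
    compact = json.dumps(mask_strings(decoded), separators=(",", ":"))
    return base64.b64encode(compact.encode("utf-8")).decode("ascii")


# Made-up payload standing in for a real encoded field.
sample = base64.b64encode(
    b'{"schema":"iglu:xxxxx/yyyyy/jsonschema/1-0-0","data":{"submitted":{"data":"secret"}}}'
).decode("ascii")

masked = mask_base64_json(sample)
print(base64.b64decode(masked).decode("utf-8"))
```

Masking values while keeping keys and nesting preserves the structure that matters for writing the regex, including whether the payload is compact or pretty-printed after decoding.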