Failure NotTSV with BigQuery Loader

I’m trying to set up a basic Snowplow pipeline on GCP. All the test events I’m sending end up in the bad rows subscription. So far I have the following components running:

  • Collector
  • Iglu server
  • BigQuery StreamLoader

I’m sending the following test event:

curl 'http://xxx.xxx.xx.xx:xxxx/com.snowplowanalytics.snowplow/tp2' \
-H 'Content-Type: application/json; charset=UTF-8' \
-H 'Cookie: _sp=305902ac-8d59-479c-ad4c-82d4a2e6bb9c' \
--data-raw '{"schema":"iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4","data":[{"e":"pv","tv":"js-3.4.0","p":"web"}]}'

Here is what I get in the bad rows pubsub subscription:

{"schema":"iglu:com.snowplowanalytics.snowplow.badrows/loader_parsing_error/jsonschema/2-0-0","data":{"processor":{"artifact":"snowplow-bigquery-streamloader","version":"1.4.0"},"failure":{"type":"NotTSV"},"payload":"\u000b\u0000d\u0000\u0000\u0000\u000b88.123.48.3\n\u0000�\u0000\u0000\u0001��ғ�\u000b\u0000�\u0000\u0000\u0000\u0005UTF-8\u000b\u0000�\u0000\u0000\u0000\u0016ssc-2.7.0-googlepubsub\u000b\u0001,\u0000\u0000\u0000\u000bcurl/7.77.0\u000b\u0001@\u0000\u0000\u0000#/com.snowplowanalytics.snowplow/tp2\u000b\u0001T\u0000\u0000\u0000|{\"schema\":\"iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4\",\"data\":[{\"e\":\"pv\",\"tv\":\"js-3.4.0\",\"p\":\"web\"}]}\u000f\u0001^\u000b\u0000\u0000\u0000\u0006\u0000\u0000\u0000\u001bTimeout-Access: <function1>\u0000\u0000\u0000\u0018Host: {ipofmycollector}\u0000\u0000\u0000\u0017User-Agent: curl/7.77.0\u0000\u0000\u0000\u000bAccept: */*\u0000\u0000\u00000Cookie: _sp=305902ac-8d59-479c-ad4c-82d4a2e6bb9c\u0000\u0000\u0000\u0010application/json\u000b\u0001h\u0000\u0000\u0000\u0010application/json\u000b\u0001�\u0000\u0000\u0000\r104.199.44.20\u000b\u0001�\u0000\u0000\u0000$9ce62856-c74b-41df-87f6-833148cf3d77\u000bzi\u0000\u0000\u0000Aiglu:com.snowplowanalytics.snowplow/CollectorPayload/thrift/1-0-0\u0000"}}

The BigQuery loader is listening directly to the collector’s output; could this be the issue? (I don’t have any enrich component.) What does NotTSV mean?

If needed, here is the config file for the BigQuery loader:

{
  "projectId": "gcp-project-id"

  "loader": {
    "input": {
      "subscription": "good-sub"
    }

    "output": {
      "good": {
        "datasetId": "snowplow"
        "tableId": "events"
      }

      "bad": {
        "topic": "loader-bad"
      }

      "types": {
        "topic": "bq-types"
      }

      "failedInserts": {
        "topic": "failed-insert"
      }
    }
  }

  "mutator": {
    "input": {
      "subscription": "bq-types-sub"
    }

    "output": {
      "good": ${loader.output.good} # will be automatically inferred
    }
  }

  "repeater": {
    "input": {
      "subscription": "loader-failed-insert-sub"
    }

    "output": {
      "good": ${loader.output.good} # will be automatically inferred

      "deadLetters": {
        "bucket": "gs://sp-dead-letter-bucket-sw"
      }
    }
  }

  "monitoring": {} # disabled
}

Thanks!

The BigQuery loader is listening directly to the collector’s output; could this be the issue? (I don’t have any enrich component.) What does NotTSV mean?

This is the issue: the loader can only handle data in the enriched TSV format, so you’ll need the enrich component in between to produce it.

The collector emits data in Thrift format, which is good for sending over the network but not amenable to working with directly. Part of enrich’s job is to turn it into the more workable enriched TSV format.
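For illustration, a minimal enrich-pubsub config could look roughly like the sketch below. The subscription and topic names are placeholders you’d swap for your own Pub/Sub resources, and enrich also needs an Iglu resolver config supplied separately (pointing at your Iglu Server):

{
  "input": {
    # subscription attached to the collector's raw (good) topic
    "subscription": "projects/gcp-project-id/subscriptions/collector-good-sub"
  }

  "output": {
    "good": {
      # enriched TSV events land here
      "topic": "projects/gcp-project-id/topics/enriched-good"
    }

    "bad": {
      "topic": "projects/gcp-project-id/topics/enriched-bad"
    }
  }
}

The loader’s good-sub subscription would then be attached to that enriched topic rather than to the collector’s raw topic, so the StreamLoader only ever sees enriched TSV.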

Incidentally, if you’re looking to get up and running on GCP and explore from there, we have published quickstart Terraform modules; using them might save you a bit of effort!
