I’m trying to set up a basic Snowplow pipeline on GCP. All the test events I’m sending end up in the bad rows subscription. So far I have the following components running:
- Collector
- Iglu server
- BigQuery StreamLoader
I’m sending the following test:
curl 'http://xxx.xxx.xx.xx:xxxx/com.snowplowanalytics.snowplow/tp2' \
-H 'Content-Type: application/json; charset=UTF-8' \
-H 'Cookie: _sp=305902ac-8d59-479c-ad4c-82d4a2e6bb9c' \
--data-raw '{"schema":"iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4","data":[{"e":"pv","tv":"js-3.4.0","p":"web"}]}'
Here is what I get in the bad rows pubsub subscription:
{"schema":"iglu:com.snowplowanalytics.snowplow.badrows/loader_parsing_error/jsonschema/2-0-0","data":{"processor":{"artifact":"snowplow-bigquery-streamloader","version":"1.4.0"},"failure":{"type":"NotTSV"},"payload":"\u000b\u0000d\u0000\u0000\u0000\u000b88.123.48.3\n\u0000�\u0000\u0000\u0001��ғ�\u000b\u0000�\u0000\u0000\u0000\u0005UTF-8\u000b\u0000�\u0000\u0000\u0000\u0016ssc-2.7.0-googlepubsub\u000b\u0001,\u0000\u0000\u0000\u000bcurl/7.77.0\u000b\u0001@\u0000\u0000\u0000#/com.snowplowanalytics.snowplow/tp2\u000b\u0001T\u0000\u0000\u0000|{\"schema\":\"iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4\",\"data\":[{\"e\":\"pv\",\"tv\":\"js-3.4.0\",\"p\":\"web\"}]}\u000f\u0001^\u000b\u0000\u0000\u0000\u0006\u0000\u0000\u0000\u001bTimeout-Access: <function1>\u0000\u0000\u0000\u0018Host: {ipofmycollector}\u0000\u0000\u0000\u0017User-Agent: curl/7.77.0\u0000\u0000\u0000\u000bAccept: */*\u0000\u0000\u00000Cookie: _sp=305902ac-8d59-479c-ad4c-82d4a2e6bb9c\u0000\u0000\u0000\u0010application/json\u000b\u0001h\u0000\u0000\u0000\u0010application/json\u000b\u0001�\u0000\u0000\u0000\r104.199.44.20\u000b\u0001�\u0000\u0000\u0000$9ce62856-c74b-41df-87f6-833148cf3d77\u000bzi\u0000\u0000\u0000Aiglu:com.snowplowanalytics.snowplow/CollectorPayload/thrift/1-0-0\u0000"}}
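(For context, this is roughly how I’m summarising the bad rows after pulling them from the subscription — a quick throwaway script; the helper name is mine and the payload in the sample is truncated:)

```python
import json

def summarise_bad_row(raw: str) -> str:
    """Extract the failure type and processor from a Snowplow bad-row JSON message."""
    row = json.loads(raw)
    schema_name = row["schema"].split("/")[1]        # e.g. loader_parsing_error
    failure = row["data"]["failure"]["type"]         # e.g. NotTSV
    processor = row["data"]["processor"]["artifact"]
    return f"{schema_name}: {failure} (from {processor})"

# The message shown above, with the binary payload truncated:
sample = (
    '{"schema":"iglu:com.snowplowanalytics.snowplow.badrows/'
    'loader_parsing_error/jsonschema/2-0-0",'
    '"data":{"processor":{"artifact":"snowplow-bigquery-streamloader",'
    '"version":"1.4.0"},"failure":{"type":"NotTSV"},"payload":"..."}}'
)
print(summarise_bad_row(sample))
# → loader_parsing_error: NotTSV (from snowplow-bigquery-streamloader)
```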
The BigQuery loader is listening directly to the collector’s output topic (I don’t have any Enrich component running) — could this be the issue? And what does NotTSV mean?
If needed, here is the config file of the BigQuery loader:
{
  "projectId": "gcp-project-id"

  "loader": {
    "input": {
      "subscription": "good-sub"
    }
    "output": {
      "good": {
        "datasetId": "snowplow"
        "tableId": "events"
      }
      "bad": {
        "topic": "loader-bad"
      }
      "types": {
        "topic": "bq-types"
      }
      "failedInserts": {
        "topic": "failed-insert"
      }
    }
  }

  "mutator": {
    "input": {
      "subscription": "bq-types-sub"
    }
    "output": {
      "good": ${loader.output.good} # will be automatically inferred
    }
  }

  "repeater": {
    "input": {
      "subscription": "loader-failed-insert-sub"
    }
    "output": {
      "good": ${loader.output.good} # will be automatically inferred
      "deadLetters": {
        "bucket": "gs://sp-dead-letter-bucket-sw"
      }
    }
  }

  "monitoring": {} # disabled
}
Thanks!