Trouble adding the BQ streamloader to the GCP quickstart implementation

The quickstart for GCP works well - events are streaming to the Postgres database. To continue my exploration of Snowplow as a solution, I need to add the BQ streamloader (plus the mutator and repeater) to my quickstart setup. To start, I am trying to deploy the streamloader Docker image in GKE via the Cloud Console, but the docs only show how to apply the config via the CLI with a docker run command. I am no Docker/GKE master, and something is definitely missing in my understanding.

This may be obvious to someone more familiar, but how do you apply the HOCON config file when deploying in GKE?

… and if GKE is not the preferred method, can someone point me in the direction of the appropriate way?


One popular way of doing it would be to deploy each application (streamloader, mutator and repeater) as a GKE Deployment. For the HOCON config and the Iglu resolver JSON, you can use another Kubernetes resource, called a ConfigMap.

In the Deployment, you specify the name and version of the container to use, as well as the arguments to be passed to the container. You would store the configuration as ConfigMap objects, which you then pass on to the Deployments via the args.
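
To give a rough idea of the shape, a minimal sketch might look like this (the names, keys and image tag are illustrative, not taken from the quickstart; the values stored in the ConfigMap should be whatever the loader expects on the command line):

apiVersion: v1
kind: ConfigMap
metadata:
  name: spconfigs
data:
  config.hocon: "<config value>"
  resolver.json: "<resolver value>"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: snowplow-bigquery-streamloader
spec:
  replicas: 1
  selector:
    matchLabels:
      app: snowplow-bigquery-streamloader
  template:
    metadata:
      labels:
        app: snowplow-bigquery-streamloader
    spec:
      containers:
        - name: snowplow-bigquery-streamloader
          image: snowplow/snowplow-bigquery-streamloader:latest
          env:
            # Surface the ConfigMap keys as environment variables ...
            - name: LOADER_CONFIG
              valueFrom:
                configMapKeyRef:
                  name: spconfigs
                  key: config.hocon
            - name: IGLU_RESOLVER
              valueFrom:
                configMapKeyRef:
                  name: spconfigs
                  key: resolver.json
          # ... and reference them in the container arguments.
          args:
            - "--config=$(LOADER_CONFIG)"
            - "--resolver=$(IGLU_RESOLVER)"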

I am basically at the same point. Dilyan, thanks for your direction, but I guess I need a bit more context. I was trying to set up a pod for the stream-loader using this config:

apiVersion: v1
kind: Pod
metadata:
  name: snowplow-bigquery-streamloader
spec:
  containers:
    - name: snowplow-bigquery-streamloader
      image: registry.hub.docker.com/snowplow/snowplow-bigquery-streamloader:latest
      volumeMounts:
      - name: config-volume
        mountPath: /etc/config
      command: ["snowplow-bigquery-streamloader --config=/etc/config/config --resolver=/etc/config/resolver.json"]
      
  volumes:
    - name: config-volume
      configMap:
        name: spconfigs
  restartPolicy: Never

which gets me this error message:
[ioapp-compute-0] ERROR com.snowplowanalytics.snowplow.storage.bigquery.streamloader.Main - Usage: snowplow-bigquery-streamloader

I guess it’s easy to resolve but my experience with pod deployment is a bit limited.

Hi @Timo_Dechau and @dilyan,

From what I can work out, the BigQuery Loader currently forces you to provide a base64-encoded config on the command line. So it is not possible to use the --config=/etc/config/config style of command, and therefore the ConfigMap idea (mounting the files and pointing to them) will not work. @dilyan please correct me if I’m wrong about any of that.

So I think your only option is to set your command like this:

command: ["--config=<PASTE CONFIG HERE>",  "--resolver=<PASTE RESOLVER HERE>"]

There is a slight variation that might help a little bit: it is possible to specify individual config parameters using Java system properties:

command: [
  "--resolver=<PASTE RESOLVER HERE>",
  "-DprojectId=com-acme",
  "-Dloader.input.subscription=enriched-sub",
  "-Dloader.output.good.datasetId=snowplow",
  # etc.
]

I think a more helpful solution would be if the BigQuery Loader accepted files on the command line. I opened this GitHub issue to add that feature, but I cannot make any promises on when we will make that change.

@Timo_Dechau , as @istreeter mentioned, you currently need to pass the whole HOCON as a base64-encoded string.

It is still possible to use a ConfigMap for that. You need to ensure you have a record like "config.hocon" = "bXlDb25maWdIb2Nvbg==" and then in the Deployment args you’ll refer to the config.hocon key from the ConfigMap.

This depends on how you create the ConfigMap, but for example with Terraform, you can have a parameterised config_hocon.tpl template file, which you would render with the inputs you provide and place as the value of the ConfigMap’s config.hocon key.

data "template_file" "config" {
  template = file("/path/to/config_hocon.tpl")

  vars = {
    PROJECT_ID = var.project_id
  }
}

resource "kubernetes_config_map" "bq_loader_config" {
  metadata {
    namespace = var.namespace
    name      = var.name
  }

  data = {
    "config.hocon" = data.template_file.config.rendered
  }
}

You can do something similar for the iglu-resolver.json. Then, when you create the Deployment, the args section would look like:

args = [
  "--resolver=${var.iglu_resolver_b64}",
  "--config=${var.config_b64}"
]
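
If your config_hocon.tpl renders plain (non-encoded) HOCON, you can produce that base64 value with Terraform’s base64encode() function, along these lines (a sketch; the local name is illustrative):

locals {
  # Base64-encode the rendered HOCON so it can be passed via --config
  config_b64 = base64encode(data.template_file.config.rendered)
}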

Hope this makes sense and gives you an idea about how to move forward.

Thanks for the feedback @dilyan and @istreeter!

I will try the base64 config first. When I was trying it locally, I always got encoding errors with the base64 string. Let’s see how it looks on the server side.

Were you passing the base64 files themselves or their contents? It looks from your yaml like you were passing the files. I am pretty sure I got the same errors in my local terminal when I tried passing the files themselves in the docker run command.

It worked when I inserted the file contents as the argument value:
--resolver $(cat /snowplow/config/iglu_resolver_b64)
--config $(cat /snowplow/config/config_b64)

I also tried it with cat and still got an error.

But I am still struggling to get the deployment done. Here is my current deployment.yml:

apiVersion: v1
kind: Pod
metadata:
  name: snowplow-bigquery-streamloader
spec:
  containers:
    - name: snowplow-bigquery-streamloader
      image: registry.hub.docker.com/snowplow/snowplow-bigquery-streamloader:latest
      command: 
        - "snowplow-bigquery-streamloader --config=ewoicHJvam...  --resolver=ewogICAgInNjaGVtY..."
        
      
  restartPolicy: Never

I also tried it with the args argument.

I am getting “StartError with exit code 128”. Could it be because I am pulling the container from Docker Hub within GKE?

Hi @Timo_Dechau, please try changing your command like this:

command:
  - "--config=ewoicHJvam..."
  - "--resolver=ewogICAgInNjaGVtY..."

Notice I made two changes: you don’t need “snowplow-bigquery-streamloader” because this command is run automatically by the Docker image. And the config and resolver arguments should be separate strings in an array, not a single string. The same is true if you use the -D syntax.

Let me know what happens!

Thanks for your reply!

I tried this:

apiVersion: v1
kind: Pod
metadata:
  name: snowplow-bigquery-streamloader
spec:
  containers:
    - name: snowplow-bigquery-streamloader
      image: registry.hub.docker.com/snowplow/snowplow-bigquery-streamloader:latest
      command:
        - "--config=ewoicHJvamVjdElkIj.."
        - "--resolver=ewogICAgInNj..."
      
  restartPolicy: Never

Unfortunately, same result: StartError with exit code 128.

@Timo_Dechau Could you please provide some more details about the error you’re getting? It might give us some clues about what’s causing it.

Also, just in case, did you try:

containers:
    - name: snowplow-bigquery-streamloader
      image: registry.hub.docker.com/snowplow/snowplow-bigquery-streamloader:latest
      args:
        - "--config=ewoicHJvamVjdElkIj.."
        - "--resolver=ewogICAgInNj..."

Thanks, @dilyan.

This looks good. I tested a bit with the args key but most likely not in that way! The pod is up and running so far.
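
For reference, the manifest that is running now looks roughly like this (base64 values truncated):

apiVersion: v1
kind: Pod
metadata:
  name: snowplow-bigquery-streamloader
spec:
  containers:
    - name: snowplow-bigquery-streamloader
      image: registry.hub.docker.com/snowplow/snowplow-bigquery-streamloader:latest
      args:
        - "--config=ewoicHJvamVjdElkIj.."
        - "--resolver=ewogICAgInNj..."
  restartPolicy: Never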

Just wanted to say thanks for the help!


Sorry for bringing this up again. I had some time now to finally get the collector working and wanted to test the stream loader.

But the service gets a PERMISSION_DENIED: User not authorized to perform this action error and can’t load the events.

The pod is using the default service account, which is the default Compute Engine service account. I granted this account Pub/Sub and BigQuery admin access, and when nothing was working, I even tested with Owner permissions. Same error.
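
For context, the role grants looked roughly like this (the project ID and service account address are placeholders):

# Grant the default compute service account Pub/Sub and BigQuery admin roles
gcloud projects add-iam-policy-binding my-gcp-project \
  --member="serviceAccount:123456789-compute@developer.gserviceaccount.com" \
  --role="roles/pubsub.admin"

gcloud projects add-iam-policy-binding my-gcp-project \
  --member="serviceAccount:123456789-compute@developer.gserviceaccount.com" \
  --role="roles/bigquery.admin"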

Am I missing something else?