Hi,
I set up the Snowplow Open Source pipeline in AWS using the quickstart examples, added a “transformer-kinesis-ec2/aws” instance to shred the enriched messages, and added an instance running snowplow/rdb-loader-redshift:5.1.1-rc2 to load the events into my Redshift cluster.
The RDB loader for Redshift copies every event I have tried so far into the database, except the events generated by the trackSiteSearch
method. For those, it fails while creating the “atomic.com_snowplowanalytics_snowplow_site_search_1” table, with the following error:
rdb-loader-redshift_1 | INFO com.snowplowanalytics.snowplow.rdbloader: RDB Loader 5.1.1-rc2 has started.
rdb-loader-redshift_1 | INFO HikariPool-1 - Starting...
rdb-loader-redshift_1 | INFO HikariPool-1 - Driver does not support get/set network timeout for connections. ([Amazon][JDBC](10220) Driver does not support this optional feature.)
rdb-loader-redshift_1 | INFO HikariPool-1 - Start completed.
rdb-loader-redshift_1 | INFO Loader: Target check is completed
rdb-loader-redshift_1 | INFO Loader: No operation prepare step is completed
rdb-loader-redshift_1 | INFO Manifest: No changes needed on the manifest table
rdb-loader-redshift_1 | INFO Loader: Manifest initialization is completed
rdb-loader-redshift_1 | INFO Loader: load_tstamp column already exists
rdb-loader-redshift_1 | INFO Loader: Adding load_tstamp column is completed
rdb-loader-redshift_1 | INFO FolderMonitoring: Configuration for monitoring.folders hasn't been provided - monitoring is disabled
rdb-loader-redshift_1 | INFO DataDiscovery: Received a new message
rdb-loader-redshift_1 | INFO DataDiscovery: Total 1 messages received, 0 loaded
rdb-loader-redshift_1 | INFO DataDiscovery: New data discovery at run=2022-11-08-22-25-00 with following shredded types:
rdb-loader-redshift_1 | * iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-*-* TSV
rdb-loader-redshift_1 | * iglu:nl.basjes/yauaa_context/jsonschema/1-*-* TSV
rdb-loader-redshift_1 | * iglu:com.google.analytics.enhanced-ecommerce/impressionFieldObject/jsonschema/1-*-* TSV
rdb-loader-redshift_1 | * iglu:com.snowplowanalytics.snowplow/site_search/jsonschema/1-*-* TSV
rdb-loader-redshift_1 | * iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-*-* TSV
rdb-loader-redshift_1 | * iglu:com.google.analytics.enhanced-ecommerce/action/jsonschema/1-*-* TSV
rdb-loader-redshift_1 | INFO Load: Loading transaction for s3://snowplow-vui-poc/transformed/good/run=2022-11-08-22-25-00/ has started
rdb-loader-redshift_1 | INFO Migration: Creating atomic.com_snowplowanalytics_snowplow_site_search_1 table for iglu:com.snowplowanalytics.snowplow/site_search/jsonschema/1-0-0
rdb-loader-redshift_1 | ERROR Loader: Loading of s3://snowplow-vui-poc/transformed/good/run=2022-11-08-22-25-00/ has failed. Not adding into retry queue. [Amazon](500310) Invalid operation: column "filters" duplicated; - SqlState: 42701
rdb-loader-redshift_1 | INFO Acking SQS message because processing is complete.
I believe the problem is caused by the transformer, which adds a duplicate “filters” column to the shredded TSV file:
com.snowplowanalytics.snowplow site_search jsonschema 1-0-0 e7e52081-1d66-4dde-9351-7f03e655cd87 2022-11-08 22:14:48.189 events ["events","site_search"] events ["bananas"] {"category":"books","sub-category":"non-fiction"} {"category":"books","sub-category":"non-fiction"} 48 48
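If only the “filters” column is duplicated, I would expect the tail of that row to contain the filters object a single time, i.e. something like (last fields only):
["bananas"] {"category":"books","sub-category":"non-fiction"} 48 48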
I don’t understand how the transformer is adding the “filters” column twice. Can anybody shed some light on this problem?
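For reference, the tracking side only passes filters once. A minimal sketch of the call that produces this event, written against the Browser Tracker v3 site-tracking plugin (the collector endpoint and appId below are placeholders, and the event values are taken from the TSV row above):

import { newTracker } from '@snowplow/browser-tracker';
import { SiteTrackingPlugin, trackSiteSearch } from '@snowplow/browser-plugin-site-tracking';

// Tracker initialisation; endpoint and appId are placeholders,
// not my real values.
newTracker('sp1', 'collector.example.com', {
  appId: 'example-app',
  plugins: [SiteTrackingPlugin()],
});

// The site-search event matching the TSV row above:
// "filters" is passed exactly once here.
trackSiteSearch({
  terms: ['bananas'],
  filters: { category: 'books', 'sub-category': 'non-fiction' },
  totalResults: 48,
  pageResults: 48,
});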
My transformer Terraform configuration is the following:
module "transformer_enriched_redshift" {
source = "snowplow-devops/transformer-kinesis-ec2/aws"
version = "0.1.0"
count = 1
name = "${var.prefix}-transformer-kinesis-enriched-server"
vpc_id = var.vpc_id
subnet_ids = var.public_subnet_ids
ssh_key_name = aws_key_pair.pipeline.key_name
ssh_ip_allowlist = var.ssh_ip_allowlist
stream_name = module.enriched_stream.name
s3_bucket_name = var.s3_bucket_name
s3_bucket_object_prefix = "${var.s3_bucket_object_prefix}transformed/good"
window_period_min = var.transformer_window_period_min
sqs_queue_name = aws_sqs_queue.message_queue_rs[0].name
transformation_type = "shred"
custom_iglu_resolvers = local.custom_iglu_resolvers
kcl_write_max_capacity = var.pipeline_kcl_write_max_capacity
iam_permissions_boundary = var.iam_permissions_boundary
telemetry_enabled = var.telemetry_enabled
user_provided_id = var.user_provided_id
tags = var.tags
cloudwatch_logs_enabled = var.cloudwatch_logs_enabled
cloudwatch_logs_retention_days = var.cloudwatch_logs_retention_days
}
resource "aws_sqs_queue" "message_queue_rs" {
count = 1
content_based_deduplication = true
name = "${var.prefix}-sf-loader.fifo"
fifo_queue = true
kms_master_key_id = "alias/aws/sqs"
}
Thanks in advance.
Danilo