Streaming transformer fails to write parquet

Hi,

I am trying to set up a stream transformer and a Databricks loader to consume the enriched events and load them into Databricks.

To deploy the stream transformer, I am using the snowplow-devops/transformer-kinesis-ec2 Terraform module with infra version 0.2.1 and app version 5.2.0. The application works as expected when the output format is set to json, but throws the following error when it is set to parquet:

org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"

I believe this error was supposed to be resolved with release 5.1.1.
Is it still an outstanding bug or is there something I have missed in my setup/config?

My Terraform module config:

module "transformer_enriched" {
  source  = "snowplow-devops/transformer-kinesis-ec2/aws"
  version = "0.2.1"

  name                           = "${var.prefix}-transformer-kinesis-enriched-server"
  vpc_id                         = var.vpc_id
  subnet_ids                     = var.private_subnet_ids
  ssh_key_name                   = aws_key_pair.pipeline.key_name
  ssh_ip_allowlist               = var.ssh_ip_allowlist
  stream_name                    = module.enriched_stream.name
  s3_bucket_name                 = var.s3_bucket_name
  s3_bucket_object_prefix        = "${var.s3_bucket_object_prefix}transformed/good"
  window_period_min              = var.transformer_window_period_min
  sqs_queue_name                 = aws_sqs_queue.message_queue[0].name
  transformation_type            = "widerow"
  widerow_file_format            = "parquet"
  custom_iglu_resolvers          = local.custom_iglu_resolvers
  kcl_write_max_capacity         = var.pipeline_kcl_write_max_capacity
  iam_permissions_boundary       = var.iam_permissions_boundary
  telemetry_enabled              = var.telemetry_enabled
  user_provided_id               = var.user_provided_id
  tags                           = var.tags
  cloudwatch_logs_enabled        = var.cloudwatch_logs_enabled
  cloudwatch_logs_retention_days = var.cloudwatch_logs_retention_days
}

Thanks!

Hi @enikov, it looks like hadoop-aws may not support the s3 scheme. Can you please try changing the scheme to s3a on this line of the config template?
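
For anyone following along, the change would look roughly like the fragment below. This is only a sketch: the key names and the template variables (`s3_bucket_name`, `s3_bucket_object_prefix`) are assumptions for illustration and are not copied from the module's actual template, so check them against the file in your version of the module.

```hocon
# Hypothetical excerpt of the transformer config template (HOCON).
# Key names and template variables are assumed, not taken from the module.
"output": {
  # Before: hadoop-aws has no FileSystem registered for the "s3" scheme,
  # which produces: No FileSystem for scheme "s3"
  # "path": "s3://${s3_bucket_name}/${s3_bucket_object_prefix}"

  # After: "s3a" is the scheme handled by hadoop-aws (S3AFileSystem)
  "path": "s3a://${s3_bucket_name}/${s3_bucket_object_prefix}"
}
```

This would also fit the symptom above: the json output works while the parquet output fails, which suggests only the parquet writer resolves the path through Hadoop's FileSystem lookup.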

Thanks @dilyan, that did the trick!