Stream Transformer failing on output Parquet wide row

ahid_002 · August 29, 2023, 3:45pm

I have a Dockerized version of the Snowplow Stream Transformer Kinesis running in EKS. Because I am going to store the data in Azure Databricks, I'm trying to write it in wide row format as Parquet, but I'm getting the following error.

Please note that it works great with wide row JSON.

org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
        at com.github.mjakubowski84.parquet4s.parquet.io$.$anonfun$validateWritePath$1(io.scala:35)
        at fromSync @ com.snowplowanalytics.snowplow.rdbloader.transformer.stream.kinesis.Main$.run(Main.scala:28)

And here is the config.hocon

{
  "input": {
    "streamName": "{{.Values.config.streams.good}}"
  },
  "output": {
    "path": "s3://{{.Values.config.storage.bucket}}/transformed/"
  },
  "windowing": "1 minutes",
  "queue": {
    "type": "sqs",
    "queueName": "{{.Values.config.sqs}}"
  },
  "formats": {
    "transformationType": "widerow",
    "fileFormat": "parquet"
  }
}

ahid_002 · August 29, 2023, 4:41pm

This was fixed by changing the scheme of the output path from s3:// to s3a://.
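
For anyone landing here with the same error, a minimal sketch of the corrected "output" block follows. It assumes only the scheme changes and the rest of the config posted above stays as it was. The stack trace shows the Parquet write path going through Hadoop's FileSystem API, which reaches S3 through the S3A connector registered under the s3a scheme.

```hocon
"output": {
  # Only the scheme differs from the original config: s3:// becomes s3a://.
  # The bucket placeholder is the same Helm template value as before.
  "path": "s3a://{{.Values.config.storage.bucket}}/transformed/"
}
```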

josh · August 29, 2023, 11:01pm

Glad you got that resolved!

As an aside, it's always worth checking the logic baked into the open-source modules first when debugging. We have hopefully handled most of those cases there, and the modules can then be used as a template for building your own implementations (https://github.com/snowplow-devops/terraform-aws-transformer-kinesis-ec2/blob/master/templates/config.json.tmpl#L30-L35).
