Hello Folks!
I am currently working on writing a Spark-based lake loader for Iceberg that can load data from Kafka to GCS. I am taking some inspiration from the existing GCS loader, but I would like to learn from your experience about the challenges I might run into and a possible roadmap.
One of the challenges I am facing is how to manage deserialization and schema evolution.
Deserialization
The Kafka messages will need to be deserialized into Spark DataFrames/Datasets before they can be loaded into Iceberg. I am considering two approaches:
Use a custom deserializer:
This would give me the most control and flexibility over the deserialization process.
- One straightforward approach would be to simply convert the TSV output from enrichment into JSON before loading it into GCS as an Iceberg table.
- The other approach would be to create custom event objects for deserialization, which would be a bit more complex and would require changes for every schema.
Use a pre-built deserializer:
I think there is an existing Event class in the Analytics SDK that I could use to deserialize events into a JSON/Event Dataset in Spark; a rough sketch of this follows below.
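To make the second option concrete, here is a minimal sketch of what I have in mind, assuming the Scala Analytics SDK's `Event.parse` / `toJson` behave as documented; the bootstrap servers, topic name, and object name are just placeholders:

```scala
import org.apache.spark.sql.SparkSession
import com.snowplowanalytics.snowplow.analytics.scalasdk.Event

object EnrichedEventReader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("iceberg-lake-loader").getOrCreate()
    import spark.implicits._

    // Enriched events arrive as TSV strings in the Kafka record value
    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
      .option("subscribe", "enriched")                      // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING) AS line")
      .as[String]

    // Event.parse returns a ValidatedNel; invalid rows are simply dropped here,
    // whereas a real loader would route them to a bad-rows / dead-letter sink
    val jsonEvents = lines.flatMap { line =>
      Event.parse(line).toOption.map(_.toJson(lossy = true).noSpaces)
    }

    // jsonEvents could then be parsed into a typed DataFrame and appended
    // to the Iceberg table (writer omitted from this sketch)
  }
}
```

If this works the way I expect, it would also cover the TSV-to-JSON conversion from the first option, since `toJson` does that conversion for me.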
Schema Evolution
Iceberg supports schema evolution, which allows the schema of a table to change over time. This is important for handling changes to the Kafka message schema.
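For example, an explicit column addition on an Iceberg table would look something like this (the catalog and table names are placeholders):

```scala
// Explicit, in-place schema evolution on an Iceberg table via Spark SQL;
// existing data files are not rewritten, and old rows read the new column as null
spark.sql("ALTER TABLE my_catalog.db.events ADD COLUMNS (new_property STRING)")
```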
I am considering two approaches for handling schema evolution:
- Use the Iceberg MergeWrite operation: This operation allows merging new data into an existing table, even if the schema of the new data differs from the schema of the existing table (a rough sketch of merging schemas on write follows this list).
- Create a new Iceberg table partition for each schema version: This would be more complex to implement, but it would give me more control over how schema evolution is handled.
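For the first option, something like this is what I have in mind, assuming the `write.spark.accept-any-schema` table property and the `mergeSchema` write option behave as described in the Iceberg Spark writes documentation; the helper and table name are only illustrative:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Illustrative helper: append a micro-batch whose schema may contain columns
// the table does not yet have, letting Iceberg add them on write
def appendWithSchemaMerge(spark: SparkSession, df: DataFrame, table: String): Unit = {
  // The table has to opt in to accepting incoming schemas that differ from its own
  spark.sql(s"ALTER TABLE $table SET TBLPROPERTIES ('write.spark.accept-any-schema'='true')")

  // mergeSchema asks Iceberg to evolve the table schema with any new columns in df
  df.writeTo(table)
    .option("mergeSchema", "true")
    .append()
}
```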
I am interested in hearing from others who have experience writing lake loaders or working with Iceberg schema evolution. What advice would you give me?
Discussion Questions:
- What are the pros and cons of using a custom deserializer vs a pre-built deserializer using the Analytics SDK? Are there any limitations with Event in the case of self-describing schemas?
- What are the pros and cons of using the Iceberg MergeWrite operation vs creating a new Iceberg table partition for each schema version?
- What other challenges have you faced when writing lake loaders or working with Iceberg schema evolution?
- What advice would you give to someone who is new to writing lake loaders or working with Iceberg schema evolution?
More discussion here → Kafka to BigQuery/GCS loader