I am currently working on a Spark-based lake loader for Iceberg that can load data from Kafka to GCS. I am taking some inspiration from the existing GCS loader, but I would like to learn from your experience about possible challenges and how to shape the roadmap.
One of the challenges I am facing is how to manage deserialization and schema evolution.
The Kafka messages will need to be deserialized into Spark DataFrames/Datasets before they can be loaded into Iceberg. I am considering two approaches:
Use a custom deserializer:
This would give me the most control and flexibility over the deserialization process.
One straightforward approach could be to simply convert the TSV output from enrichment into JSON before loading it into GCS as an Iceberg table (sketched below).
The other approach would be to create custom event objects for deserialization, which would be a bit more complex and would require changes for every schema.
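For the first approach, this is roughly what I have in mind: read the enriched TSV lines off Kafka with Spark Structured Streaming and split them into columns. The broker address, topic name, and the column names/positions are illustrative assumptions, not real config:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.split

val spark = SparkSession.builder().appName("iceberg-lake-loader").getOrCreate()
import spark.implicits._

// Each Kafka record value is one enriched event as a tab-separated line.
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // assumption: broker address
  .option("subscribe", "enriched")                  // assumption: topic name
  .load()
  .selectExpr("CAST(value AS STRING) AS line")

// Split each line into columns; the names and positions below are
// illustrative and would have to follow the enriched event TSV layout.
val events = raw
  .withColumn("fields", split($"line", "\t"))
  .select(
    $"fields".getItem(0).as("app_id"),
    $"fields".getItem(1).as("platform"),
    $"fields".getItem(2).as("etl_tstamp")
    // ... remaining fields per the enriched TSV layout
  )
```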
Use a pre-built deserializer:
I think there is an existing Event class in the Analytics SDK that I can use to deserialize events into a JSON/Event Dataset in Spark.
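If the SDK route works, I imagine it would look roughly like the sketch below, assuming the Scala Analytics SDK's Event.parse and Event.toJson; the input path is a placeholder for wherever the TSV lines come from (e.g. the Kafka read above):

```scala
import com.snowplowanalytics.snowplow.analytics.scalasdk.Event
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().appName("iceberg-lake-loader").getOrCreate()
import spark.implicits._

// Assumption: any source of enriched TSV lines; path is a placeholder.
val lines: Dataset[String] = spark.read.textFile("gs://bucket/enriched/")

// Parse one TSV line into an Event and render it as a JSON string,
// dropping lines that fail to parse (real code should dead-letter them).
def toJsonLine(line: String): Option[String] =
  Event.parse(line).toOption.map(_.toJson(lossy = true).noSpaces)

val jsonLines: Dataset[String] = lines.flatMap(line => toJsonLine(line).toList)

// In a batch context Spark can infer a schema from the JSON; self-describing
// events/entities come out as nested unstruct_event_* / contexts_* fields.
val df = spark.read.json(jsonLines)
```

One thing I am unsure about is how well schema inference behaves here, since the shape of the self-describing JSON fields varies between batches.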
Iceberg supports schema evolution, which allows the schema of a table to change over time. This is important for handling changes to the Kafka message schema.
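For example, my understanding is that column changes in Iceberg are metadata-only operations, e.g. from Spark SQL (catalog/table/column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("iceberg-lake-loader").getOrCreate()

// Adding a column does not rewrite data files; existing rows read it as null.
spark.sql("ALTER TABLE my_catalog.atomic.events ADD COLUMNS (new_ctx string)")
```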
I am considering two approaches for handling schema evolution:
- Use the Iceberg MergeWrite operation: this lets you merge new data into an existing table even when the new data's schema differs from the table's current schema (see the sketch after this list).
- Create a new Iceberg table partition for each schema version: This would be more complex to implement, but it would give me more control over how schema evolution is handled.
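From what I can tell, the closest thing to "MergeWrite" in the Spark/Iceberg integration is the mergeSchema write option combined with the write.spark.accept-any-schema table property, so the first option might look roughly like this (table name is illustrative, and `df` stands in for the deserialized DataFrame from the sketches above):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("iceberg-lake-loader").getOrCreate()

// Assumption: any DataFrame carrying the (possibly newer) event schema.
val df = spark.read.json("gs://bucket/staged-events/")

// One-time table property so appends are allowed to carry extra columns.
spark.sql(
  "ALTER TABLE my_catalog.atomic.events " +
    "SET TBLPROPERTIES ('write.spark.accept-any-schema'='true')"
)

df.writeTo("my_catalog.atomic.events")
  .option("mergeSchema", "true") // new columns in df are added to the table
  .append()
```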
I am interested in hearing from others who have experience writing lake loaders or working with Iceberg schema evolution. What advice would you give me?
- What are the pros and cons of using a custom deserializer vs a pre-built deserializer from the Analytics SDK? Are there limitations with Event in the case of self-describing schemas?
- What are the pros and cons of using the Iceberg MergeWrite operation vs creating a new Iceberg table for each schema version?
- What other challenges have you faced when writing lake loaders or working with Iceberg schema evolution?
- What advice would you give to someone who is new to writing lake loaders or working with Iceberg schema evolution?
More discussion here → Kafka to BigQuery/GCS loader