We are using a Redshift dc2.large instance and are quickly reaching its storage limit. It makes sense for us to offload the atomic schema into Redshift Spectrum, as we don't query this data often and mainly use it for data modelling once a day.
Is there an existing ETL process that can transform the data into Spectrum-ready files?
You can use your archive files in Spectrum directly, or first convert them to Parquet (using Glue). Have a look at this article for some ideas: https://snowplowanalytics.com/blog/2019/04/04/use-glue-and-athena-with-snowplow-data/
This is good as a one-off, but I guess one has to write a scheduled job to add new partitions as the data comes in.
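A scheduled job like that can be quite small. Here is a minimal sketch of the DDL-generation side, assuming daily `run=` partitions and hypothetical table and bucket names (`atomic_spectrum.events`, `my-snowplow-archive`) — in practice you would execute the statement against Redshift from your scheduler (e.g. via a database driver or a Glue crawler instead):

```python
from datetime import date

# Hypothetical names -- substitute your own external schema, table and bucket.
SPECTRUM_TABLE = "atomic_spectrum.events"
S3_PREFIX = "s3://my-snowplow-archive/enriched/parquet"

def add_partition_sql(run_date: date) -> str:
    """Build the DDL that registers one day's worth of archived data
    as a new partition on the Spectrum external table."""
    d = run_date.isoformat()
    return (
        f"ALTER TABLE {SPECTRUM_TABLE} "
        f"ADD IF NOT EXISTS PARTITION (run='{d}') "
        f"LOCATION '{S3_PREFIX}/run={d}/';"
    )

print(add_partition_sql(date(2019, 5, 1)))
```

`IF NOT EXISTS` makes the job idempotent, so re-running it for a day that is already registered does no harm.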
Also, is it possible to skip the Redshift COPY job entirely and replace it with a job that creates Spectrum partitions? Would we still get the deduplication benefits of the storage process?
Yes, and AWS Glue makes that rather easy.
In theory you could, but Spectrum will not be as performant, I would think, as having the data locally in the native format. What I would do in your case is keep only the most recent data, up to what your storage allows, and keep adding partitions for Spectrum.
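That hybrid layout boils down to a simple cutoff: partitions newer than your retention window stay in Redshift, everything older is served from S3 via Spectrum. A minimal sketch, assuming daily partitions and an illustrative 30-day window (the `keep_days` value is an assumption — pick whatever your dc2.large storage allows):

```python
from datetime import date, timedelta

def split_partitions(partition_dates, today, keep_days=30):
    """Split daily partitions into those kept locally in Redshift
    (recent) and those served from S3 via Spectrum (older).
    keep_days is illustrative; size it to your cluster's storage."""
    cutoff = today - timedelta(days=keep_days)
    local = [d for d in partition_dates if d >= cutoff]
    spectrum = [d for d in partition_dates if d < cutoff]
    return local, spectrum
```

A scheduled job could run this daily, register the `spectrum` dates as external partitions, and delete the corresponding rows from the local tables.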
I would really need to look closely into the loader's deduplication to tell you whether that would be included, but I would expect so if you are reading from the shredded archive.