S3 bucket usage

Hi,
Can anyone please clarify the use of the S3 buckets in the Snowplow pipeline?

In our setup we have data arriving in Snowflake.

Data remains in the S3 bucket in a few folders, like:

/bad/2023-05-19-091257-49640651319786124478019421013524134068410955795710083074-49640651319786124478019421013524134068410955795710083074.gz
/enriched/2023-05-19-141453-49640610199352849926609418563779391020857037214386749442-49640610199352849926609418563779391020857037214386749442.gz
/enriched/run=2023-05-31-00-00-00/output=good/sink-cb6a5bed-1fd8-46a2-9a07-07e52a175fac-0001.txt.gz
raw/2023-05-19-122351-49640651313363509860842605548777018968499093315390537730-49640651313363509860842605548777018968499093315390537730.gz
transformed/good/run=2023-05-22-14-25-00/output=good/sink-1998d5cd-606c-4071-a9dd-f5a0e1719d13-0001.txt.gz

Does the pipeline delete any of these automatically?

Are any of these required once the data is in Snowflake?

What is a normal policy for managing this leftover data to make sure it does not escalate?

Many thanks for any help!

Chris.

Hi @chris,

The pipeline does not delete these files automatically. Once the data is in Snowflake, you no longer need these staging files.

A good way to manage them is with S3's lifecycle management. For example, you can configure S3 to automatically delete files older than 7 days. Another option is S3 Intelligent-Tiering, so that old files get moved to a cheaper storage tier.
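As an illustration, here is a minimal sketch of creating such an expiration rule with boto3; the bucket name and prefixes are placeholders for your own setup, and the same rule can equally be set up in the S3 console, Terraform, or CloudFormation:

```python
import boto3

s3 = boto3.client("s3")

# Expire Snowplow staging objects 7 days after creation.
# "my-snowplow-pipeline-bucket" and the prefixes are hypothetical --
# adjust them to the folders you actually want cleaned up.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-snowplow-pipeline-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-enriched-after-7-days",
                "Filter": {"Prefix": "enriched/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            },
            {
                "ID": "expire-transformed-after-7-days",
                "Filter": {"Prefix": "transformed/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            },
        ]
    },
)
```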

If you use the RDB Loader's folder monitoring feature, you will need to set the `folders.since` config option to match your deletion policy; otherwise the loader will complain about missing folders. If you don't use folder monitoring, there is no config change to make.
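For example, if your lifecycle rule deletes files after 7 days, the monitoring block of the loader config would be kept in step roughly like this (a sketch only: the staging path and period values are illustrative, and the exact keys depend on your loader version):

```hocon
"monitoring": {
  "folders": {
    # Hypothetical staging path and check period -- adjust to your setup
    "staging": "s3://my-snowplow-pipeline-bucket/loader-monitoring/",
    "period": "8 hours",
    # Keep this aligned with the S3 lifecycle rule so the loader
    # does not alert on folders that were deleted on purpose
    "since": "7 days"
  }
}
```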
