S3 bucket usage

Hi,
Can anyone please clarify the use of the S3 buckets in the Snowplow pipeline?

In our setup we have data arriving in Snowflake.

Data remains in the S3 bucket in a few folders, like:

/bad/2023-05-19-091257-49640651319786124478019421013524134068410955795710083074-49640651319786124478019421013524134068410955795710083074.gz
/enriched/2023-05-19-141453-49640610199352849926609418563779391020857037214386749442-49640610199352849926609418563779391020857037214386749442.gz
/enriched/run=2023-05-31-00-00-00/output=good/sink-cb6a5bed-1fd8-46a2-9a07-07e52a175fac-0001.txt.gz
raw/2023-05-19-122351-49640651313363509860842605548777018968499093315390537730-49640651313363509860842605548777018968499093315390537730.gz
transformed/good/run=2023-05-22-14-25-00/output=good/sink-1998d5cd-606c-4071-a9dd-f5a0e1719d13-0001.txt.gz

Does the pipeline delete any of these automatically?

Are any of these required once the data is in Snowflake?

What is a normal policy for managing this leftover data to make sure it does not escalate?

Many thanks for any help!

Chris.

Hi @chris,

The pipeline does not delete these files automatically. Once the data is in Snowflake, you no longer need these staging files.

A good way to manage them is with S3's lifecycle management. For example, you can configure S3 to automatically delete files older than 7 days. Another option is S3 Intelligent-Tiering, so that old files get moved to a cheaper storage tier.
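As an illustration, here is a minimal sketch of creating such an expiration rule with boto3; the bucket name and prefixes are placeholders for your own setup, and the same rule can equally be set up in the S3 console, Terraform, or CloudFormation:

```python
import boto3

s3 = boto3.client("s3")

# Expire Snowplow staging objects 7 days after creation.
# "my-snowplow-pipeline-bucket" and the prefixes are hypothetical --
# adjust them to the folders you actually want cleaned up.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-snowplow-pipeline-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-enriched-after-7-days",
                "Filter": {"Prefix": "enriched/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            },
            {
                "ID": "expire-transformed-after-7-days",
                "Filter": {"Prefix": "transformed/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            },
        ]
    },
)
```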

If you use the RDB Loader's folder monitoring feature, you will need to set the `folders.since` config option to match your deletion policy; otherwise the loader will complain about missing folders. If you don't use folder monitoring, there is no config change to make.
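For example, if your lifecycle rule deletes files after 7 days, the monitoring block of the loader config would be kept in step roughly like this (a sketch only: the staging path and period values are illustrative, and the exact keys depend on your loader version):

```hocon
"monitoring": {
  "folders": {
    # Hypothetical staging path and check period -- adjust to your setup
    "staging": "s3://my-snowplow-pipeline-bucket/loader-monitoring/",
    "period": "8 hours",
    # Keep this aligned with the S3 lifecycle rule so the loader
    # does not alert on folders that were deleted on purpose
    "since": "7 days"
  }
}
```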
