Low cost stream pipeline

Hey all,
I've been working with multiple clients to provide proof-of-concept stacks with Snowplow. Most of these clients will have fewer than 250k events per month and very basic requirements. However, with all the updates happening around third-party cookies, Google's SameSite changes, ITP etc., I need to update the offering and can't really do batch pipelines anymore. Before setting anything up, I was wondering if anyone here is running a setup with fewer than 250k–500k events per month, and what that costs per month?


Can't speak for Snowplow Mini, but I've heard it should be able to handle that sort of volume. Might be worth a look for POC Snowplow RT installs.

As for low-cost RT pipelines, we've been toying with running the GCP real-time pipeline in batches by firing up Beam Enrich for a few hours each day. Haven't tested it at volume, but I imagine it'll be cost effective compared with AWS EMR/Batch. Using an n1-standard-4 instance for a couple of hours each day should keep your costs to no more than $0.50/day for enrichment/loading.

Just need crontab and a simple shell script to orchestrate this:

  1. Start Beam Enrich @ 6am
  2. Stop Beam Enrich @ 7am
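
If you go the cron route, the entries for that schedule might look something like this (the script paths are hypothetical):

```
# m h dom mon dow  command
0 6 * * * /opt/snowplow/start-beam-enrich.sh >> /var/log/snowplow/enrich.log 2>&1
0 7 * * * /opt/snowplow/drain-beam-enrich.sh >> /var/log/snowplow/enrich.log 2>&1
```

The drain script is the more interesting half, since you want to drain (finish in-flight work) rather than hard-cancel the job: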

# Drain active beam-enrich jobs
JOBID=$(gcloud dataflow jobs list --status=active --region us-central1 | grep "$JOBNAME" | awk '{ print $1 }')

if [[ -z "$JOBID" ]]; then
    echo "No active $JOBNAME jobs"
else
    echo "Stopping $JOBNAME: $JOBID"
    for JOB in $JOBID; do
        gcloud dataflow jobs drain "$JOB" --region us-central1
    done
fi

Then just rinse and repeat with BQ Loader and BQ Mutator.
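
As a quick offline sanity check of the grep/awk extraction in the drain script, you can run it against a sample of what `gcloud dataflow jobs list` prints (the job IDs and values below are illustrative, not real output):

```shell
#!/bin/sh
# Illustrative sample mimicking `gcloud dataflow jobs list` output columns:
# JOB_ID NAME TYPE CREATION_TIME STATE REGION
SAMPLE='2021-01-01_00_00_00-111 beam-enrich-job Streaming 2021-01-01 Running us-central1
2021-01-01_00_00_00-222 bq-loader-job Streaming 2021-01-01 Running us-central1'

JOBNAME="beam-enrich"
# Same extraction as the drain script: first column of any line matching the job name
JOBID=$(printf '%s\n' "$SAMPLE" | grep "$JOBNAME" | awk '{ print $1 }')
echo "$JOBID"
```

This confirms only the beam-enrich job ID is picked up, leaving other jobs (like the loader) untouched.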


Hi @fwahlqvist and @robkingston,

On using Mini for a production pipeline… it's not a recommended approach. We do not actively test Mini's ability to perform in a production capacity. You're more likely to hit issues if you do, and not necessarily ones related to scaling.

Hey @fwahlqvist,

So I’m not sure if this fits your use case, but there’s a community project that might be worth looking at, which leverages serverless functions. I also can’t vouch for it myself since I haven’t looked into using it, but perhaps there are some others who can tell you more about their experience. I certainly think it’s a cool idea.

I believe it was built by someone working with the charity sector, who has neither the budget nor the volumes to build a full-fat pipeline, which sounds a lot like what you're looking at.


I think this (spinning up Dataflow temporarily) is a really interesting approach. Dataflow tends to be the most expensive part of the pipeline at lower volumes and there’s currently no ‘autoscale’ that scales down to 0 workers so you’re running at least 1 worker for each Dataflow job.
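
For rough intuition on why that one-worker floor matters, here's a back-of-envelope calculation for a single always-on worker (the hourly rate is an assumption based on a small n1-standard-1 instance; check current GCP pricing):

```shell
#!/bin/sh
# Assumed on-demand rate for an n1-standard-1 in us-central1 (hypothetical; verify current pricing)
HOURLY=0.0475
# Cost of one always-on worker over a 30-day month
MONTHLY=$(awk -v h="$HOURLY" 'BEGIN { printf "%.2f", h * 24 * 30 }')
echo "One always-on worker: ~\$$MONTHLY/month (compute only, excluding Dataflow service fees)"
```

That's per job, so running enrich plus the BQ loader continuously doubles it, which is why scheduling short daily windows cuts costs so dramatically at low volumes.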

That said, there's nothing stopping someone from running Apache Beam on VMs directly rather than managed through Dataflow, though the management overhead generally isn't worth it.
