After reading the following:
- How to setup a Lambda architecture for Snowplow
- Error using StorageLoader to load data into Redshift
I believe I’ve followed the instructions correctly, but no logs ever end up in s3://snowplow-wg/realtime/processing.
Here is more info about my setup:
1. I’m running snowplow-stream-collector-0.9.0, configured as below to output to the Kinesis stream “snowplow-collector-good”:
collector {
  interface = "0.0.0.0"
  port = 80
  production = false

  p3p {
    policyref = "/w3c/p3p.xml"
    CP = "NOI DSP COR NID PSA OUR IND COM NAV STA"
  }

  cookie {
    enabled = true
    expiration = 365 # e.g. "365 days"
    name = "spcol"
    # The domain is optional and will make the cookie accessible to other
    # applications on the domain. Comment out this line to tie cookies to
    # the collector's full domain
    # domain = "{{collectorCookieDomain}}"
  }

  sink {
    enabled = "kinesis"

    kinesis {
      thread-pool-size: 10 # Thread pool size for Kinesis API requests
      aws {
        access-key: "iam"
        secret-key: "iam"
      }
      stream {
        region: "ap-southeast-2"
        good: "snowplow-collector-good"
        bad: "snowplow-collector-bad"
      }
      backoffPolicy: {
        minBackoff: 20
        maxBackoff: 60
      }
    }

    kafka {
      brokers: "{{collectorKafkaBrokers}}"
      topic {
        good: "{{collectorKafkaTopicGoodName}}"
        bad: "{{collectorKafkaTopicBadName}}"
      }
    }

    buffer {
      byte-limit: 4000000
      record-limit: 500 # Not supported by Kafka; will be ignored
      time-limit: 5000
    }
  }
}

akka {
  loglevel = DEBUG # 'OFF' for no logging, 'DEBUG' for all logging.
  loggers = ["akka.event.slf4j.Slf4jLogger"]
}

spray.can.server {
  remote-address-header = on
  uri-parsing-mode = relaxed
  raw-request-uri-header = on
  parsing {
    max-uri-length = 32768
  }
}
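In case it helps with diagnosing this, here is roughly how one can confirm that events are actually reaching the “snowplow-collector-good” stream (just a sketch with the AWS CLI; the shard id below is an assumption and may differ on your stream):

# Get an iterator for the start of the collector's good stream
ITERATOR=$(aws kinesis get-shard-iterator \
  --region ap-southeast-2 \
  --stream-name snowplow-collector-good \
  --shard-id shardId-000000000000 \
  --shard-iterator-type TRIM_HORIZON \
  --query 'ShardIterator' --output text)
# Pull a handful of raw records; a non-empty Records array means the collector is writing
aws kinesis get-records --region ap-southeast-2 --shard-iterator "$ITERATOR" --limit 5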
2. I’m running snowplow-kinesis-s3-0.5.0 and it is successfully putting (many) log files in s3://snowplow-kinesis-wg
An example file in that S3 bucket is 2017-08-15-49575083451241952265900828384403460541645649702371196930-49575083451241952265900828384479622868281380961104429058.gz
My sink.conf looks like this (note: I’ve tried both lzo and gzip for the s3 format below):
# Default configuration for kinesis-lzo-s3-sink
sink {
  aws {
    access-key: "iam"
    secret-key: "iam"
  }

  kinesis {
    in {
      # Kinesis input stream name
      stream-name: "snowplow-collector-good"
      initial-position: "TRIM_HORIZON"
      max-records: 10000
    }
    out {
      # Stream for events for which the storage process fails
      stream-name: "snowplow-sink-failed"
    }
    region: "ap-southeast-2"
    app-name: "SnowplowLzoS3Sink-snowplow-enriched-out"
  }

  s3 {
    region: "ap-southeast-2"
    bucket: "snowplow-kinesis-wg"
    # Format is one of lzo or gzip
    # Note, that you can use gzip only for enriched data stream.
    format: "gzip"
    max-timeout: 60000
  }

  buffer {
    byte-limit: 4000000
    record-limit: 500
    time-limit: 5000
  }

  logging {
    level: "DEBUG"
  }

  monitoring {
    snowplow {
      collector-uri: "sp.winning.com.au"
      collector-port: 80
      app-id: "spenrich"
      method: "GET"
    }
  }
}
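For reference, a quick way to sanity-check what the sink actually wrote (a sketch; the bucket name comes from the config above, and <key> is a placeholder for a real object key):

# List recent sink output and confirm files keep arriving
aws s3 ls s3://snowplow-kinesis-wg/ --recursive --human-readable | tail -20
# Download one file and check its real compression, since I've flipped between lzo and gzip
aws s3 cp s3://snowplow-kinesis-wg/<key> /tmp/sample-sink-file
file /tmp/sample-sink-file   # should report gzip (or lzop) compressed data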
Both #1 and #2 above run without errors.
3. And when I run snowplow-emr-etl-runner, it runs “successfully” but does nothing (i.e. s3://snowplow-wg/realtime/processing/ is empty):
[root@ip-10-10-22-45 storageloader]# ./snowplow-emr-etl-runner --config config.yml --resolver resolver.json --targets targets/ --enrichments enrichments/
D, [2017-08-15T07:56:52.913000 #26919] DEBUG -- : Staging raw logs...
moving files from s3://snowplow-kinesis-wg/ to s3://snowplow-wg/realtime/processing/
Here is a snippet of the config I use for snowplow-emr-etl-runner (R89, which I chose over R90 because R90’s architecture is very different):
s3:
  region: ap-southeast-2
  buckets:
    assets: s3://snowplow-hosted-assets
    jsonpath_assets: s3://snowplow-wg/jsonpaths
    log: s3://snowplow-wg/realtime/log
    raw:
      in:
        - s3://snowplow-kinesis-wg
      processing: s3://snowplow-wg/realtime/processing
      archive: s3://snowplow-wg/realtime/archive/raw
    enriched:
      good: s3://snowplow-wg/realtime/enriched/good
      bad: s3://snowplow-wg/realtime/enriched/bad
      errors:
      archive: s3://snowplow-wg/realtime/archive/enriched
    shredded:
      good: s3://snowplow-wg/realtime/shredded/good
      bad: s3://snowplow-wg/realtime/shredded/bad
      errors:
      archive: s3://snowplow-wg/realtime/archive/shredded
So the main issue here is that the s3://snowplow-wg/realtime/processing folder is empty after #3 runs, which makes me think I might have the wrong files in s3://snowplow-kinesis-wg.
After reading the two posts linked above, I set the sink’s kinesis.in.stream-name to “snowplow-collector-good”, which is the same as the collector’s output stream (collector.sink.kinesis.stream.good = “snowplow-collector-good”).
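For reference, this is the check I’m basing that on (just a sketch with the AWS CLI, using the bucket paths from the config above):

# The raw:in bucket has plenty of files...
aws s3 ls s3://snowplow-kinesis-wg/ --recursive | wc -l
# ...but the processing bucket stays empty after the EmrEtlRunner staging step
aws s3 ls s3://snowplow-wg/realtime/processing/ --recursive | wc -l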
Greatly appreciate any help! I feel I’m really close to getting both batch and real time working in parallel.
Thanks,
Tim