User_ipaddress unknown for some events

Hello everyone,

I am seeing user_ipaddress set to unknown for some web events. For other web events we do get the user_ipaddress, and the ip_lookups enrichment is able to enrich them.

This is our Snowplow setup in Google Kubernetes Engine (GKE):
JS tracker → collector (GKE) → Google Pub/Sub → enricher → Google Pub/Sub → stream loader → Google BigQuery

We have the repeater and mutator set up as well.
These are the versions used:

Collector : snowplow/scala-stream-collector-pubsub:2.4.5
Enricher  : snowplow/snowplow-enrich-pubsub:2.0.5
Stream_loader : snowplow/snowplow-bigquery-streamloader:1.2.0
Repeater : snowplow/snowplow-bigquery-streamloader:1.2.0
Mutator : snowplow/snowplow-bigquery-mutator:1.2.0

Could this be a collector config issue?
I already have the below akka.http.server settings enabled.

remote-address-header = on
remote-address-attribute = on
raw-request-uri-header = on

I can see a lot of warnings in collector logs like the one below.

scala-stream-collector-akka.actor.default-dispatcher-5] WARN akka.actor.ActorSystemImpl - Illegal header: Illegal 'user-agent' header: Invalid input '[', expected 'EOI', product-or-comment, WSP, comment or CRLF (line 1, column 112): Mozilla/5.0 (iPhone; CPU iPhone OS 15_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/19D52 [FBAN/FBIOS;FBDV/iPhone12,1;FBMD/iPhone;FBSN/iOS;FBSV/15.3.1;FBSS/2;FBID/phone;FBLC/en_GB;FBOP/5]

This is a warning just for the user agent (it can be ignored) so this shouldn’t impact anything around IP address. I would look at the raw collector payloads (in PubSub) which should have an IP address in the header - it’s possible that GKE may be manipulating this header but in general it’s quite unusual to have IP address as null.

Hello Mike,
I checked the raw events in the Pub/Sub subscription coming out of the collector. It contains events both with and without an IP address, for example:

d 107.34.56.23
d unknown

Ok - so that rules out the enrichment process doing anything odd.

At the collector level there are really only two cases where unknown is returned for the IP address:

  1. If you have SP-Anonymous enabled (and it is being sent in the header) or
  2. Snowplow can’t find a header (e.g., Remote-Address, X-Forwarded-For) to extract the IP address so it will return ‘unknown’

If it’s the first option then you probably have that enabled for a reason, and if it’s the second unfortunately there’s not too much you can do if this information hasn’t been sent in the headers. I’d be tempted to check sending from the JS tracker straight to an external load balancer (rather than GKE directly) to see if you get additional headers that might be getting removed.
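The two cases above can be sketched as a small decision function. This is a simplified model for reasoning about the behaviour, not the collector's actual code: the function name `resolve_ip` is hypothetical, and the header list only covers the two headers named above, so the real collector may consult more.

```python
from typing import Mapping

# Headers consulted for the client IP in this sketch (assumption: only the
# two named in the thread; the real collector may check others too).
IP_HEADERS = ("x-forwarded-for", "remote-address")


def resolve_ip(headers: Mapping[str, str]) -> str:
    """Simplified model of when the collector IP ends up as 'unknown'."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    # Case 1: the request carries SP-Anonymous, so the IP is withheld.
    if "sp-anonymous" in lowered:
        return "unknown"

    # Case 2: use the first header that carries an address.
    for name in IP_HEADERS:
        value = lowered.get(name, "").strip()
        if value:
            # X-Forwarded-For may hold a comma-separated chain; client is first.
            return value.split(",")[0].strip()

    # No usable header was found.
    return "unknown"
```

For example, a request with `X-Forwarded-For: 107.34.56.23` resolves to that address, while the same request with an added `SP-Anonymous` header resolves to unknown.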

Hello Mike,
I went through the GCP external load balancer logs, and this is a sample POST request to the collector.

"httpRequest": {
    "requestMethod": "POST",
    "requestUrl": "https://zxy.collector.endpoint.com/com.snowplowanalytics.snowplow/tp2",
    "requestSize": "46",
    "status": 408,
    "responseSize": "384",
    "userAgent": "Mozilla/5.0 (iPhone; CPU iPhone OS 15_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.3 Mobile/15E148 Safari/604.1",
    "remoteIp": "y.y.y.y",
    "referer": "https://www.abcd.com/",
    "serverIp": "x.x.x.x",
    "latency": "30.029126s"
}

It looks like remoteIp is part of the header information sent to the collector. I also queried all the LB logs to see whether remoteIp is empty for any POST request, but it has a value in all of them.

So basically all events that reach the collector have a remoteIp, but when the events come out of the collector, some of them have user_ipaddress = "unknown".

I hope my analysis is correct, not sure if i am missing any step here.
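The load balancer check described above can be sketched roughly as follows. It assumes the log entries have already been exported and parsed into dictionaries shaped like the sample entry (`httpRequest`, `requestMethod`, `remoteIp`); the function name `posts_missing_remote_ip` is hypothetical.

```python
from typing import Any, Dict, Iterable, List


def posts_missing_remote_ip(entries: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the POST log entries whose httpRequest has no remoteIp value."""
    missing = []
    for entry in entries:
        request = entry.get("httpRequest", {})
        # Only POST requests are of interest here.
        if request.get("requestMethod") != "POST":
            continue
        # Treat an absent or empty remoteIp as missing.
        if not request.get("remoteIp"):
            missing.append(entry)
    return missing
```

An empty result means every POST request seen by the load balancer carried a remoteIp, which matches what was observed here.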

If my memory serves me correctly, there are two ways to get an unknown user_ipaddress out of the collector.

The first is that you're using Anonymous Tracking from the trackers.

The second is that the Akka function extractClientIp can't extract an IP address, as described in the docs.

If it’s Akka failing to parse one, then that suggests maybe something is happening at the load balancer and isn’t passing the headers on to the collector as expected.

Thanks Mike and PaulBoocock for helping me pinpoint the issue. As you both mentioned, the issue was with SP-Anonymous tracking.

The tracker was sending a GET request to the collector on the first visit to a webpage, and the collector was able to get the IP address and other information. On subsequent visits the tracker was sending a POST request with the SP-Anonymous: * header, which, as you mentioned, makes the collector set user_ipaddress to unknown.

This was the reason we were seeing events with and without user_ipaddress in the good_events table.
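One way to confirm this pattern on raw collector payloads is to tally them by request method, presence of the SP-Anonymous header, and whether the IP is known. The sketch below assumes the payloads have already been decoded into dictionaries with hypothetical keys `method`, `headers` (a list of "Name: value" strings) and `ip`; the real payload layout may differ.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Tuple


def tally(events: Iterable[Dict[str, Any]]) -> "Counter[Tuple[str, bool, bool]]":
    """Count events by (method, has SP-Anonymous header, IP is known)."""
    counts: "Counter[Tuple[str, bool, bool]]" = Counter()
    for event in events:
        # Extract lower-cased header names from "Name: value" strings.
        names = {
            header.split(":", 1)[0].strip().lower()
            for header in event.get("headers", [])
        }
        key = (
            event.get("method", "?"),
            "sp-anonymous" in names,
            event.get("ip") != "unknown",
        )
        counts[key] += 1
    return counts
```

If the explanation above holds, every event with an unknown IP should fall into a bucket where the SP-Anonymous flag is true.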

Thanks a lot for helping me out, I really appreciate it.

No worries - thanks for the update and glad to hear it’s an expected behaviour!
