The documentation seems to indicate there are both recipe and cube SQL files available for download, but I only see recipes at this GitHub link. Are the cube SQL files still available and, if so, where do I find them?
@ryanrozich I’m not sure which docs you’re referencing, but those are indeed the files. Let me know if you have any questions or feedback (there’s no documentation ready yet - it will be added before we publish them).
@christophe the docs I was referring to are at the link I pasted in the original question for this thread.
Thanks for posting these. I’m opening them up now and taking a first look at the files; it looks like they create materialized views as tables in these schemas. A couple of questions:
(1) Is the purpose of materializing the views to speed up query execution as opposed to just having them be dynamic views?
(2) Do you have any benchmarks on how many additional cluster resources (disk, etc.) it would take to run all of the data cube DDL files you posted? Specifically, how much should we allocate on top of what is loaded and stored in the atomic schema?
(3) What’s the recommended way of using these? Is it (a) to drop and recreate the schema every time you want to do a study on web data, or (b) is the intent that we should try to keep these somewhat in sync with the raw data in the atomic schema? If so, should all of these get dropped and recreated in batch on a certain schedule (like every day), or would there be some sort of incremental loading of just the new data?
It shouldn’t have too much of an impact on disk usage (which will increase at most a few percentage points).
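If you want to check the actual footprint on your own cluster, something like the query below against svv_table_info gives a rough per-schema number (the schema names are just examples, so adjust them to whatever the model creates in your setup):

-- Approximate disk usage per schema; svv_table_info reports size in 1 MB blocks
SELECT "schema", SUM(size) AS mb_used, SUM(tbl_rows) AS total_rows
FROM svv_table_info
WHERE "schema" IN ('atomic', 'scratch', 'web') -- adjust to your schema names
GROUP BY 1
ORDER BY mb_used DESC;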
I’d expect most people to run them on a regular schedule. One option is to, for example, have a nightly run. Another is to run them each time a new batch of events gets loaded into Redshift. If you’re a Looker user, you can use the Looker variant of the model. If you use our SQL Runner application to schedule SQL, you can use this version instead.
Thanks @christophe for the answers! These tables are great by the way.
I believe I found a bug in one of your SQL files, 03-events-scroll-depth.sql:
LEAST(GREATEST(MIN(NVL(ev.pp_xoffset_min, 0)), 0), MAX(ev.br_viewwidth)) AS hmin, -- should be zero
LEAST(GREATEST(MAX(NVL(ev.pp_xoffset_max, 0)), 0), MAX(ev.br_viewwidth)) AS hmax,
LEAST(GREATEST(MIN(NVL(ev.pp_yoffset_min, 0)), 0), MAX(ev.br_viewheight)) AS vmin, -- should be zero (edge case: not zero because the pv event is missing - but these are not in scratch.dev_pv_01 so not an issue)
LEAST(GREATEST(MAX(NVL(ev.pp_yoffset_max, 0)), 0), MAX(ev.br_viewheight)) AS vmax,
I think you want to limit the scroll depth to the max width/height of the document, not the max width/height of the browser viewport. For example, I believe the last line should read:
LEAST(GREATEST(MAX(NVL(ev.pp_yoffset_max, 0)), 0), MAX(ev.doc_height)) AS vmax,
I noticed that for many of the page views vmax was exactly equal to the viewport height (MAX(ev.br_viewheight)), even though the raw offsets showed that users were scrolling much further down the page.
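For reference, extrapolating the same fix to all four lines would give something like this (just a sketch on my end; it assumes doc_width and doc_height are reliably populated on the page view / page ping events):

LEAST(GREATEST(MIN(NVL(ev.pp_xoffset_min, 0)), 0), MAX(ev.doc_width)) AS hmin,
LEAST(GREATEST(MAX(NVL(ev.pp_xoffset_max, 0)), 0), MAX(ev.doc_width)) AS hmax,
LEAST(GREATEST(MIN(NVL(ev.pp_yoffset_min, 0)), 0), MAX(ev.doc_height)) AS vmin,
LEAST(GREATEST(MAX(NVL(ev.pp_yoffset_max, 0)), 0), MAX(ev.doc_height)) AS vmax,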
One more question, @christophe: your data modeling queries seem to run very fast, but the additional modeling queries I’m writing to add extra context seem to take forever. I’m wondering if you have some general principles you use to construct these queries so they execute quickly.
Let me give you an example: we have a custom context that we attach to page view events in the JS tracker; right now all we put in it is the page’s canonical_link (if it exists on the page). I would like this canonical_link to end up in the web.page_views table when the data modeling is done. Following your examples, I am trying to create a temp table relating page_view_id to canonical_link, and I plan on joining it into the final web_page_tmp table in the 06-page-views.sql file.
However the query to create the temp table is taking forever. To start, here is the DDL for the custom context I am querying against:
-- AUTO-GENERATED BY igluctl DO NOT EDIT
-- Generator: igluctl 0.2.0
-- Generated: 2016-10-23 08:34
CREATE SCHEMA IF NOT EXISTS atomic;
CREATE TABLE IF NOT EXISTS atomic.com_onespot_pageview_context_2 (
"schema_vendor" VARCHAR(128) ENCODE RUNLENGTH NOT NULL,
"schema_name" VARCHAR(128) ENCODE RUNLENGTH NOT NULL,
"schema_format" VARCHAR(128) ENCODE RUNLENGTH NOT NULL,
"schema_version" VARCHAR(128) ENCODE RUNLENGTH NOT NULL,
"root_id" CHAR(36) ENCODE RAW NOT NULL,
"root_tstamp" TIMESTAMP ENCODE LZO NOT NULL,
"ref_root" VARCHAR(255) ENCODE RUNLENGTH NOT NULL,
"ref_tree" VARCHAR(1500) ENCODE RUNLENGTH NOT NULL,
"ref_parent" VARCHAR(255) ENCODE RUNLENGTH NOT NULL,
"canonical_link" VARCHAR(4096) ENCODE LZO,
FOREIGN KEY (root_id) REFERENCES atomic.events(event_id)
)
DISTSTYLE KEY
DISTKEY (root_id)
SORTKEY (root_tstamp);
COMMENT ON TABLE atomic.com_onespot_pageview_context_2 IS 'iglu:com.onespot/pageview-context/jsonschema/2-0-0';
Here is a query that I tried writing to put into the data modeling pipeline to relate page_view_id to canonical_link in this table
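(I’m rewriting it from memory here, so treat the table name and the exact join as a sketch; it assumes scratch.web_page_context exposes root_id and page_view_id as in the standard model.)

CREATE TABLE scratch.web_canonical_link_tmp -- hypothetical temp table name
  DISTKEY (page_view_id)
  SORTKEY (page_view_id)
AS (
  SELECT
    wp.page_view_id,
    MAX(cl.canonical_link) AS canonical_link -- collapse duplicates if the context is attached more than once
  FROM atomic.com_onespot_pageview_context_2 AS cl
  INNER JOIN scratch.web_page_context AS wp
    ON cl.root_id = wp.root_id
  GROUP BY 1
);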
In this case I found that joining on scratch.web_events instead of scratch.web_page_context sped the query up quite a bit. However, if you have any general tips on writing efficient data modeling queries, I’d appreciate it.