Hey, we have deployed the Streaming RDB Loader into our production environment but this is now failing on some events. The RDB Loader is running as part of the Snowplow Stream with Databricks.
Events have varying numbers of columns as some have custom fields added on our front end. This causes the COPY INTO
statement to fail with the following error:
[ERROR] com.snowplowanalytics.snowplow.rdbloader: Transaction aborted. Giving up after 3 retries. Caught exception: java.sql.SQLException: [Databricks][DatabricksJDBCDriver](500051) ERROR
processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: cannot resolve '`custom_column`' given
input columns: [app_id, base_currency, br_colordepth, br_cookies, br_family, br_features_director, br_features_flash, br_features_gears, br_features_java, br_features_pdf, br_features_quicktime,
br_features_realplayer, br_features_silverlight, br_features_windowsmedia, br_lang, br_name, br_renderengine, br_type, br_version, br_viewheight, br_viewwidth, collector_tstamp, contexts_com_snowplowanalytics_mobile_application_lifecycle_1, contexts_com_snowplowanalytics_snowplow_client_session_1, contexts_com_snowplowanalytics_snowplow_mobile_context_1,
contexts_com_snowplowanalytics_snowplow_ua_parser_context_1, contexts_com_snowplowanalytics_snowplow_web_page_1, derived_tstamp, doc_charset, doc_height, doc_width, domain_sessionid, domain_sessionidx, domain_userid, dvce_created_tstamp, dvce_ismobile,
dvce_screenheight, dvce_screenwidth, dvce_sent_tstamp, dvce_type, etl_tags, etl_tstamp, event, event_fingerprint, event_format, event_id, event_name, event_vendor, event_version, geo_city,
geo_country, geo_latitude, geo_longitude, geo_region, geo_region_name, geo_timezone, geo_zipcode, ip_domain, ip_isp, ip_netspeed, ip_organization, mkt_campaign, mkt_clickid, mkt_content, mkt_medium,
mkt_network, mkt_source, mkt_term, name_tracker, network_userid, os_family, os_manufacturer, os_name, os_timezone, page_referrer, page_title, page_url, page_urlfragment, page_urlhost,
page_urlpath, page_urlport, page_urlquery, page_urlscheme, platform, pp_xoffset_max, pp_xoffset_min, pp_yoffset_max, pp_yoffset_min, refr_domain_userid, refr_dvce_tstamp, refr_medium, refr_source,
refr_term, refr_urlfragment, refr_urlhost, refr_urlpath, refr_urlport, refr_urlquery, refr_urlscheme, se_action, se_category, se_label, se_property, se_value, ti_category, ti_currency, ti_name, ti_orderid,
ti_price, ti_price_base, ti_quantity, ti_sku, tr_affiliation, tr_city, tr_country, tr_currency, tr_orderid, tr_shipping, tr_shipping_base, tr_state, tr_tax, tr_tax_base, tr_total, tr_total_base, true_tstamp, txn_id,
custom_column_x, custom_column_y, user_fingerprint, user_id, user_ipaddress, useragent, v_collector, v_etl, v_tracker]; line 3 pos 1838;
Is there any additional spark config we can add to our table or databricks cluster to resolve this? We are not sure why the schema doesn’t evolve for this given MERGESCHEMA
is set to TRUE
.
Our table definition is:
CREATE OR REPLACE TABLE schema.events (
app_id varchar(255),
platform varchar(255),
etl_tstamp timestamp,
collector_tstamp timestamp,
dvce_created_tstamp timestamp,
event varchar(128),
event_id char(36),
txn_id int,
name_tracker varchar(128),
v_tracker varchar(100),
v_collector varchar(100),
v_etl varchar(100),
user_id varchar(255),
user_ipaddress varchar(128),
user_fingerprint varchar(128),
domain_userid varchar(128),
domain_sessionidx smallint,
network_userid varchar(128),
geo_country char(2),
geo_region char(3),
geo_city varchar(75),
geo_zipcode varchar(15),
geo_latitude double,
geo_longitude double,
geo_region_name varchar(100),
ip_isp varchar(100),
ip_organization varchar(128),
ip_domain varchar(128),
ip_netspeed varchar(100),
page_url varchar(4096),
page_title varchar(2000),
page_referrer varchar(4096),
page_urlscheme varchar(16),
page_urlhost varchar(255),
page_urlport int,
page_urlpath varchar(3000),
page_urlquery varchar(6000),
page_urlfragment varchar(3000),
refr_urlscheme varchar(16),
refr_urlhost varchar(255),
refr_urlport int,
refr_urlpath varchar(6000),
refr_urlquery varchar(6000),
refr_urlfragment varchar(3000),
refr_medium varchar(25),
refr_source varchar(50),
refr_term varchar(255),
mkt_medium varchar(255),
mkt_source varchar(255),
mkt_term varchar(255),
mkt_content varchar(500),
mkt_campaign varchar(255),
se_category varchar(1000),
se_action varchar(1000),
se_label varchar(4096),
se_property varchar(1000),
se_value double,
tr_orderid varchar(255),
tr_affiliation varchar(255),
tr_total decimal(18, 2),
tr_tax decimal(18, 2),
tr_shipping decimal(18, 2),
tr_city varchar(255),
tr_state varchar(255),
tr_country varchar(255),
ti_orderid varchar(255),
ti_sku varchar(255),
ti_name varchar(255),
ti_category varchar(255),
ti_price decimal(18, 2),
ti_quantity int,
pp_xoffset_min int,
pp_xoffset_max int,
pp_yoffset_min int,
pp_yoffset_max int,
useragent varchar(1000),
br_name varchar(50),
br_family varchar(50),
br_version varchar(50),
br_type varchar(50),
br_renderengine varchar(50),
br_lang varchar(255),
br_features_pdf boolean,
br_features_flash boolean,
br_features_java boolean,
br_features_director boolean,
br_features_quicktime boolean,
br_features_realplayer boolean,
br_features_windowsmedia boolean,
br_features_gears boolean,
br_features_silverlight boolean,
br_cookies boolean,
br_colordepth varchar(12),
br_viewwidth int,
br_viewheight int,
os_name varchar(50),
os_family varchar(50),
os_manufacturer varchar(50),
os_timezone varchar(255),
dvce_type varchar(50),
dvce_ismobile boolean,
dvce_screenwidth int,
dvce_screenheight int,
doc_charset varchar(128),
doc_width int,
doc_height int,
tr_currency char(3),
tr_total_base decimal(18, 2),
tr_tax_base decimal(18, 2),
tr_shipping_base decimal(18, 2),
ti_currency char(3),
ti_price_base decimal(18, 2),
base_currency char(3),
geo_timezone varchar(64),
mkt_clickid varchar(128),
mkt_network varchar(64),
etl_tags varchar(500),
dvce_sent_tstamp timestamp,
refr_domain_userid varchar(128),
refr_dvce_tstamp timestamp,
domain_sessionid char(128),
derived_tstamp timestamp,
event_vendor varchar(1000),
event_name varchar(1000),
event_format varchar(128),
event_version varchar(128),
event_fingerprint varchar(128),
true_tstamp timestamp,
contexts_com_snowplowanalytics_mobile_application_lifecycle_1 array<struct<index:int,is_visible:boolean>>,
contexts_com_snowplowanalytics_snowplow_client_session_1 array<struct<event_index:int,first_event_id:string,first_event_timestamp:timestamp,previous_session_id:string,session_id:string,session_index:int,storage_mechanism:string,user_id:string>>,
contexts_com_snowplowanalytics_snowplow_mobile_context_1 array<struct<android_idfa:string,app_available_memory:bigint,apple_idfa:string,apple_idfv:string,available_storage:bigint,battery_level:int,battery_state:string,carrier:string,device_manufacturer:string,device_model:string,low_power_mode:boolean,network_technology:string,network_type:string,open_idfa:string,os_type:string,os_version:string,physical_memory:bigint,system_available_memory:bigint,total_storage:bigint>>,
contexts_com_snowplowanalytics_snowplow_ua_parser_context_1 array<struct<device_family:string,os_family:string,os_major:string,os_minor:string,os_patch:string,os_patch_minor:string,os_version:string,useragent_family:string,useragent_major:string,useragent_minor:string,useragent_patch:string,useragent_version:string>>,
contexts_com_snowplowanalytics_snowplow_web_page_1 array<struct<id:string>>,
custom_column_x struct<id:string,name:string,previous_id:string,previous_name:string,previous_type:string,transition_type:string,type:string>,
custom_column_y struct<background_index:int>,
load_tstamp timestamp,
collector_tstamp_date DATE GENERATED ALWAYS AS (DATE(collector_tstamp))
)
USING DELTA LOCATION 'path'
PARTITIONED BY (collector_tstamp_date, event_name);
Thanks for any help you can offer!