Databricks RDB Loader: Table or view not found for 'SHOW COLUMNS'

We have deployed the Databricks RDB Loader (version 4.2.1) with a Databricks cluster (DBR 9.1 LTS). Both are up and running and talking to each other, and we can see that the manifest table has been created correctly. We can also see queries being submitted to the cluster in the Spark UI.

However, once the manifest has been created, the RDB Loader runs:

SHOW columns in hive_metastore.snowplow_schema.events

This causes the following error, which is what I would expect, since the events table doesn't exist yet:

24/08/2022 08:29:45.0697+0000 [ERROR] com.snowplowanalytics.snowplow.rdbloader: Transaction aborted. Sleeping for 30 seconds for the first time. Caught exception: java.sql.SQLException: [Databricks][DatabricksJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Table or view not found for 'SHOW COLUMNS': hive_metastore.snowplow_schema.events; line 1 pos 0;

This behaviour seems correct to me, and I've verified it in a Databricks notebook. What I don't understand is why the RDB Loader isn't creating the events table first.

Any ideas on how we can resolve this one?

Thanks!

Config

{
  "region": "eu-west-1",
  "messageQueue": ${?SQS_QUEUE_NAME},
  "storage" : {
    "type": "databricks",
    "host": ${DATABRICKS_HOST_NAME},
    "password": ${databricks_access_token}
    "catalog": "hive_metastore",
    "schema": ${SCHEMA_NAME},
    "port": 443,
    "httpPath": "path",
    "userAgent": "snowplow-rdbloader-oss"
    "loadAuthMethod": {
      "type": "NoCreds"
    }
  },
  "monitoring": {
    "metrics": {
      "stdout": {
      }
      "period": "5 minutes"
    },
  },
  "retries": {
    "backoff": "30 seconds"
    "strategy": "EXPONENTIAL"
    "attempts": 3,
    "cumulativeBound": "1 hour"
  },
  "readyCheck": {
    "backoff": "15 seconds"
    "strategy": "CONSTANT"
  },
  "initRetries": {
    "backoff": "30 seconds"
    "strategy": "EXPONENTIAL"
    "attempts": 3,
    "cumulativeBound": "1 hour"
  },
  "retryQueue": {
    "period": "30 minutes",
    "size": 64,
    "maxAttempts": 3,
    "interval": "5 seconds"
  },
  "timeouts": {
    "loading": "1 hour",
    "nonLoading": "10 minutes"
    "sqsVisibility": "5 minutes"
  }
}

Relevant Logs

[INFO] com.snowplowanalytics.snowplow.rdbloader: RDB Loader 4.2.1 has started. Listening queue.fifo
[INFO] HikariPool-1 - Starting...
[INFO] HikariPool-1 - Driver does not support get/set network timeout for connections. ([Databricks][JDBC](10220) Driver does not support this optional feature.)
[INFO] HikariPool-1 - Start completed.
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.databricks.client.jdbc42.internal.io.netty.util.internal.ReflectionUtil (file:/app/snowplow-databricks-loader-4.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of com.databricks.client.jdbc42.internal.io.netty.util.internal.ReflectionUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[INFO] Manifest: The manifest table has been created
[INFO] FolderMonitoring: Configuration for monitoring.folders hasn't been provided - monitoring is disabled
[INFO] DataDiscovery: Received a new message
[INFO] DataDiscovery: Total 1 messages received, 0 loaded
[INFO] DataDiscovery: New data discovery at run=2022-08-23-12-05-00-61fec39a-6165-4504-8ffb-bb68cc37537d with following shredded types:
  * iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-*-* WIDEROW
  * iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-*-* WIDEROW
[INFO] Load: Loading transaction for s3://bucket/snowplow-stream/transformed/archive/run=2022-08-23-12-05-00-61fec39a-6165-4504-8ffb-bb68cc37537d/ has started
[INFO] Load: Loading s3://bucket/snowplow-stream/transformed/archive/run=2022-08-23-12-05-00-61fec39a-6165-4504-8ffb-bb68cc37537d/
[ERROR] com.snowplowanalytics.snowplow.rdbloader: Transaction aborted. Sleeping for 30 seconds for the first time. Caught exception: java.sql.SQLException: [Databricks][DatabricksJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Table or view not found for 'SHOW COLUMNS': hive_metastore.snowplow_schema.events; line 1 pos 0;
'ShowColumns
+- 'UnresolvedTableOrView [hive_metastore, snowplow_schema, events], SHOW COLUMNS, true
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:1019)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:759)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.sql.hive.thriftserver.ThriftLocalProperties.withLocalProperties(ThriftLocalProperties.scala:112)
    at org.apache.spark.sql.hive.thriftserver.ThriftLocalProperties.withLocalProperties$(ThriftLocalProperties.scala:47)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:56)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:737)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:722)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:771)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.AnalysisException: Table or view not found for 'SHOW COLUMNS': hive_metastore.snowplow_schema.events; line 1 pos 0;
'ShowColumns
+- 'UnresolvedTableOrView [hive_metastore, snowplow_schema, events], SHOW COLUMNS, true
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$2(CheckAnalysis.scala:116)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$2$adapted(CheckAnalysis.scala:99)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:262)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:261)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:261)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:261)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:99)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:96)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:96)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:191)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:248)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:347)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:245)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:96)
    at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:134)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:180)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:854)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:180)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:97)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:94)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:86)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.$anonfun$compileQuery$2(SparkExecuteStatementOperation.scala:848)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:854)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.$anonfun$compileQuery$1(SparkExecuteStatementOperation.scala:842)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getOrCreateDF(SparkExecuteStatementOperation.scala:831)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.compileQuery(SparkExecuteStatementOperation.scala:842)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:896)
    ... 16 more
, Query: SHOW columns in hive_metastore.snowplow_schema.events.

Hey @Daniel_Baron,

At the moment, the Databricks Loader doesn't create the events table; it needs to be created beforehand. It looks like we missed mentioning this in our docs, and we will add it. In the meantime, here is a create table statement to unblock you:

CREATE TABLE IF NOT EXISTS snowplow_schema.events (
  -- App
  app_id                      VARCHAR(255),
  platform                    VARCHAR(255),
  -- Date/time
  etl_tstamp                  TIMESTAMP,
  collector_tstamp            TIMESTAMP       NOT NULL,
  dvce_created_tstamp         TIMESTAMP,
  -- Event
  event                       VARCHAR(128),
  event_id                    CHAR(36)        NOT NULL,
  txn_id                      INTEGER,
  -- Namespacing and versioning
  name_tracker                VARCHAR(128),
  v_tracker                   VARCHAR(100),
  v_collector                 VARCHAR(100)    NOT NULL,
  v_etl                       VARCHAR(100)    NOT NULL,
  -- User and visit
  user_id                     VARCHAR(255),
  user_ipaddress              VARCHAR(128),
  user_fingerprint            VARCHAR(128),
  domain_userid               VARCHAR(128),
  domain_sessionidx           SMALLINT,
  network_userid              VARCHAR(128),
  -- Location
  geo_country                 CHAR(2),
  geo_region                  CHAR(3),
  geo_city                    VARCHAR(75),
  geo_zipcode                 VARCHAR(15),
  geo_latitude                DOUBLE,
  geo_longitude               DOUBLE,
  geo_region_name             VARCHAR(100),
  -- IP lookups
  ip_isp                      VARCHAR(100),
  ip_organization             VARCHAR(128),
  ip_domain                   VARCHAR(128),
  ip_netspeed                 VARCHAR(100),
  -- Page
  page_url                    VARCHAR(4096),
  page_title                  VARCHAR(2000),
  page_referrer               VARCHAR(4096),
  -- Page URL components
  page_urlscheme              VARCHAR(16),
  page_urlhost                VARCHAR(255),
  page_urlport                INTEGER,
  page_urlpath                VARCHAR(3000),
  page_urlquery               VARCHAR(6000),
  page_urlfragment            VARCHAR(3000),
  -- Referrer URL components
  refr_urlscheme              VARCHAR(16),
  refr_urlhost                VARCHAR(255),
  refr_urlport                INTEGER,
  refr_urlpath                VARCHAR(6000),
  refr_urlquery               VARCHAR(6000),
  refr_urlfragment            VARCHAR(3000),
  -- Referrer details
  refr_medium                 VARCHAR(25),
  refr_source                 VARCHAR(50),
  refr_term                   VARCHAR(255),
  -- Marketing
  mkt_medium                  VARCHAR(255),
  mkt_source                  VARCHAR(255),
  mkt_term                    VARCHAR(255),
  mkt_content                 VARCHAR(500),
  mkt_campaign                VARCHAR(255),
  -- Custom structured event
  se_category                 VARCHAR(1000),
  se_action                   VARCHAR(1000),
  se_label                    VARCHAR(4096),
  se_property                 VARCHAR(1000),
  se_value                    DOUBLE,
  -- Ecommerce
  tr_orderid                  VARCHAR(255),
  tr_affiliation              VARCHAR(255),
  tr_total                    DECIMAL(18,2),
  tr_tax                      DECIMAL(18,2),
  tr_shipping                 DECIMAL(18,2),
  tr_city                     VARCHAR(255),
  tr_state                    VARCHAR(255),
  tr_country                  VARCHAR(255),
  ti_orderid                  VARCHAR(255),
  ti_sku                      VARCHAR(255),
  ti_name                     VARCHAR(255),
  ti_category                 VARCHAR(255),
  ti_price                    DECIMAL(18,2),
  ti_quantity                 INTEGER,
  -- Page ping
  pp_xoffset_min              INTEGER,
  pp_xoffset_max              INTEGER,
  pp_yoffset_min              INTEGER,
  pp_yoffset_max              INTEGER,
  -- User Agent
  useragent                   VARCHAR(1000),
  -- Browser
  br_name                     VARCHAR(50),
  br_family                   VARCHAR(50),
  br_version                  VARCHAR(50),
  br_type                     VARCHAR(50),
  br_renderengine             VARCHAR(50),
  br_lang                     VARCHAR(255),
  br_features_pdf             BOOLEAN,
  br_features_flash           BOOLEAN,
  br_features_java            BOOLEAN,
  br_features_director        BOOLEAN,
  br_features_quicktime       BOOLEAN,
  br_features_realplayer      BOOLEAN,
  br_features_windowsmedia    BOOLEAN,
  br_features_gears           BOOLEAN,
  br_features_silverlight     BOOLEAN,
  br_cookies                  BOOLEAN,
  br_colordepth               VARCHAR(12),
  br_viewwidth                INTEGER,
  br_viewheight               INTEGER,
  -- Operating System
  os_name                     VARCHAR(50),
  os_family                   VARCHAR(50),
  os_manufacturer             VARCHAR(50),
  os_timezone                 VARCHAR(255),
  -- Device/Hardware
  dvce_type                   VARCHAR(50),
  dvce_ismobile               BOOLEAN,
  dvce_screenwidth            INTEGER,
  dvce_screenheight           INTEGER,
  -- Document
  doc_charset                 VARCHAR(128),
  doc_width                   INTEGER,
  doc_height                  INTEGER,
  -- Currency
  tr_currency                 CHAR(3),
  tr_total_base               DECIMAL(18, 2),
  tr_tax_base                 DECIMAL(18, 2),
  tr_shipping_base            DECIMAL(18, 2),
  ti_currency                 CHAR(3),
  ti_price_base               DECIMAL(18, 2),
  base_currency               CHAR(3),
  -- Geolocation
  geo_timezone                VARCHAR(64),
  -- Click ID
  mkt_clickid                 VARCHAR(128),
  mkt_network                 VARCHAR(64),
  -- ETL tags
  etl_tags                    VARCHAR(500),
  -- Time event was sent
  dvce_sent_tstamp            TIMESTAMP,
  -- Referer
  refr_domain_userid          VARCHAR(128),
  refr_dvce_tstamp            TIMESTAMP,
  -- Session ID
  domain_sessionid            CHAR(128),
  -- Derived timestamp
  derived_tstamp              TIMESTAMP,
  -- Event schema
  event_vendor                VARCHAR(1000),
  event_name                  VARCHAR(1000),
  event_format                VARCHAR(128),
  event_version               VARCHAR(128),
  -- Event fingerprint
  event_fingerprint           VARCHAR(128),
  -- True timestamp
  true_tstamp                 TIMESTAMP
)
PARTITIONED BY (collector_tstamp);
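
Once the table exists, a quick sanity check is to run the same statement the loader uses (the one from your error log) and confirm it returns the column list:

SHOW COLUMNS IN hive_metastore.snowplow_schema.events;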

As a minor addendum to this, it might be worth using a generated column for your partition too, as described in the 4.1.0 release:

Thanks @enes_aldemir & @PaulBoocock for getting back to us!

We’ve tried your suggestion and we have progressed to another error :smiley:

It doesn't seem to work with the catalog prefix: from the logs it looks to me like COPY INTO expects schema_name.table_name FROM rather than catalog_name.schema_name.table_name FROM?
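
To illustrate what I mean (a hand-written sketch, not actual loader output):

-- Generated by the loader; the parser rejects the catalog prefix:
COPY INTO hive_metastore.snowplow_schema.events FROM ( ... ) FILEFORMAT = PARQUET

-- The two-level form I would have expected instead:
COPY INTO snowplow_schema.events FROM ( ... ) FILEFORMAT = PARQUET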

Please advise on any actions we can take to resolve this.

Logs:

[INFO] DataDiscovery: Received a new message
[INFO] DataDiscovery: Total 7 messages received, 0 loaded
[INFO] DataDiscovery: New data discovery at run=2022-08-25-13-55-00-2ffd4250-c0da-4b71-ab53-a9926bb3a389 with following shredded types:
  * iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-*-* WIDEROW
  * iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-*-* WIDEROW
[INFO] Load: Loading transaction for s3://bucket/snowplow-stream/transformed/archive/run=2022-08-25-13-55-00-2ffd4250-c0da-4b71-ab53-a9926bb3a389/ has started
[INFO] Load: Loading s3://bucket/snowplow-stream/transformed/archive/run=2022-08-25-13-55-00-2ffd4250-c0da-4b71-ab53-a9926bb3a389/
[INFO] Load: COPY events FROM s3://bucket/snowplow-stream/transformed/archive/run=2022-08-25-13-55-00-2ffd4250-c0da-4b71-ab53-a9926bb3a389/
[ERROR] com.snowplowanalytics.snowplow.rdbloader: Transaction aborted. Sleeping for 30 seconds for the first time. Caught exception: java.sql.SQLException: [Databricks][DatabricksJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.catalyst.parser.ParseException: 
mismatched input '.' expecting {'FROM', '-'}(line 1, pos 67)
== SQL ==
COPY INTO hive_metastore.snowplow_schema.events 
----------------------------------------^^^
                      FROM (
                        SELECT app_id,platform,etl_tstamp,collector_tstamp,dvce_created_tstamp,event,event_id,txn_id,name_tracker,v_tracker,v_collector,v_etl,user_id,user_ipaddress,user_fingerprint,domain_userid,domain_sessionidx,network_userid,geo_country,geo_region,geo_city,geo_zipcode,geo_latitude,geo_longitude,geo_region_name,ip_isp,ip_organization,ip_domain,ip_netspeed,page_url,page_title,page_referrer,page_urlscheme,page_urlhost,page_urlport,page_urlpath,page_urlquery,page_urlfragment,refr_urlscheme,refr_urlhost,refr_urlport,refr_urlpath,refr_urlquery,refr_urlfragment,refr_medium,refr_source,refr_term,mkt_medium,mkt_source,mkt_term,mkt_content,mkt_campaign,se_category,se_action,se_label,se_property,se_value,tr_orderid,tr_affiliation,tr_total,tr_tax,tr_shipping,tr_city,tr_state,tr_country,ti_orderid,ti_sku,ti_name,ti_category,ti_price,ti_quantity,pp_xoffset_min,pp_xoffset_max,pp_yoffset_min,pp_yoffset_max,useragent,br_name,br_family,br_version,br_type,br_renderengine,br_lang,br_features_pdf,br_features_flash,br_features_java,br_features_director,br_features_quicktime,br_features_realplayer,br_features_windowsmedia,br_features_gears,br_features_silverlight,br_cookies,br_colordepth,br_viewwidth,br_viewheight,os_name,os_family,os_manufacturer,os_timezone,dvce_type,dvce_ismobile,dvce_screenwidth,dvce_screenheight,doc_charset,doc_width,doc_height,tr_currency,tr_total_base,tr_tax_base,tr_shipping_base,ti_currency,ti_price_base,base_currency,geo_timezone,mkt_clickid,mkt_network,etl_tags,dvce_sent_tstamp,refr_domain_userid,refr_dvce_tstamp,domain_sessionid,derived_tstamp,event_vendor,event_name,event_format,event_version,event_fingerprint,true_tstamp,contexts_com_snowplowanalytics_snowplow_web_page_1,contexts_com_snowplowanalytics_snowplow_ua_parser_context_1,current_timestamp() AS load_tstamp from 's3://bucket/snowplow-stream/transformed/archive/run=2022-08-25-13-55-00-2ffd4250-c0da-4b71-ab53-a9926bb3a389/output=good/' 
                      )
                      FILEFORMAT = PARQUET
                      COPY_OPTIONS('MERGESCHEMA' = 'TRUE')
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:1019)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:759)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.sql.hive.thriftserver.ThriftLocalProperties.withLocalProperties(ThriftLocalProperties.scala:112)
    at org.apache.spark.sql.hive.thriftserver.ThriftLocalProperties.withLocalProperties$(ThriftLocalProperties.scala:47)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:56)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:737)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:722)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:771)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.catalyst.parser.ParseException: 

Our table definition in the AWS Glue metastore looks like this (with internal values changed):

{
  "Table": {
    "Name": "events",
    "DatabaseName": "snowplow_schema",
    "Description": "",
    "Owner": "",
    "CreateTime": "***",
    "UpdateTime": "***",
    "Retention": 0,
    "StorageDescriptor": {
      "Columns": [
        {
          "Name": "app_id",
          "Type": "varchar(255)"
        },
        {
          "Name": "platform",
          "Type": "varchar(255)"
        },
        {
          "Name": "etl_tstamp",
          "Type": "timestamp"
        },
        {
          "Name": "collector_tstamp",
          "Type": "timestamp"
        },
        {
          "Name": "dvce_created_tstamp",
          "Type": "timestamp"
        },
        {
          "Name": "event",
          "Type": "varchar(128)"
        },
        {
          "Name": "event_id",
          "Type": "char(36)"
        },
        {
          "Name": "txn_id",
          "Type": "int"
        },
        {
          "Name": "name_tracker",
          "Type": "varchar(128)"
        },
        {
          "Name": "v_tracker",
          "Type": "varchar(100)"
        },
        {
          "Name": "v_collector",
          "Type": "varchar(100)"
        },
        {
          "Name": "v_etl",
          "Type": "varchar(100)"
        },
        {
          "Name": "user_id",
          "Type": "varchar(255)"
        },
        {
          "Name": "user_ipaddress",
          "Type": "varchar(128)"
        },
        {
          "Name": "user_fingerprint",
          "Type": "varchar(128)"
        },
        {
          "Name": "domain_userid",
          "Type": "varchar(128)"
        },
        {
          "Name": "domain_sessionidx",
          "Type": "smallint"
        },
        {
          "Name": "network_userid",
          "Type": "varchar(128)"
        },
        {
          "Name": "geo_country",
          "Type": "char(2)"
        },
        {
          "Name": "geo_region",
          "Type": "char(3)"
        },
        {
          "Name": "geo_city",
          "Type": "varchar(75)"
        },
        {
          "Name": "geo_zipcode",
          "Type": "varchar(15)"
        },
        {
          "Name": "geo_latitude",
          "Type": "double"
        },
        {
          "Name": "geo_longitude",
          "Type": "double"
        },
        {
          "Name": "geo_region_name",
          "Type": "varchar(100)"
        },
        {
          "Name": "ip_isp",
          "Type": "varchar(100)"
        },
        {
          "Name": "ip_organization",
          "Type": "varchar(128)"
        },
        {
          "Name": "ip_domain",
          "Type": "varchar(128)"
        },
        {
          "Name": "ip_netspeed",
          "Type": "varchar(100)"
        },
        {
          "Name": "page_url",
          "Type": "varchar(4096)"
        },
        {
          "Name": "page_title",
          "Type": "varchar(2000)"
        },
        {
          "Name": "page_referrer",
          "Type": "varchar(4096)"
        },
        {
          "Name": "page_urlscheme",
          "Type": "varchar(16)"
        },
        {
          "Name": "page_urlhost",
          "Type": "varchar(255)"
        },
        {
          "Name": "page_urlport",
          "Type": "int"
        },
        {
          "Name": "page_urlpath",
          "Type": "varchar(3000)"
        },
        {
          "Name": "page_urlquery",
          "Type": "varchar(6000)"
        },
        {
          "Name": "page_urlfragment",
          "Type": "varchar(3000)"
        },
        {
          "Name": "refr_urlscheme",
          "Type": "varchar(16)"
        },
        {
          "Name": "refr_urlhost",
          "Type": "varchar(255)"
        },
        {
          "Name": "refr_urlport",
          "Type": "int"
        },
        {
          "Name": "refr_urlpath",
          "Type": "varchar(6000)"
        },
        {
          "Name": "refr_urlquery",
          "Type": "varchar(6000)"
        },
        {
          "Name": "refr_urlfragment",
          "Type": "varchar(3000)"
        },
        {
          "Name": "refr_medium",
          "Type": "varchar(25)"
        },
        {
          "Name": "refr_source",
          "Type": "varchar(50)"
        },
        {
          "Name": "refr_term",
          "Type": "varchar(255)"
        },
        {
          "Name": "mkt_medium",
          "Type": "varchar(255)"
        },
        {
          "Name": "mkt_source",
          "Type": "varchar(255)"
        },
        {
          "Name": "mkt_term",
          "Type": "varchar(255)"
        },
        {
          "Name": "mkt_content",
          "Type": "varchar(500)"
        },
        {
          "Name": "mkt_campaign",
          "Type": "varchar(255)"
        },
        {
          "Name": "se_category",
          "Type": "varchar(1000)"
        },
        {
          "Name": "se_action",
          "Type": "varchar(1000)"
        },
        {
          "Name": "se_label",
          "Type": "varchar(4096)"
        },
        {
          "Name": "se_property",
          "Type": "varchar(1000)"
        },
        {
          "Name": "se_value",
          "Type": "double"
        },
        {
          "Name": "tr_orderid",
          "Type": "varchar(255)"
        },
        {
          "Name": "tr_affiliation",
          "Type": "varchar(255)"
        },
        {
          "Name": "tr_total",
          "Type": "decimal(18,2)"
        },
        {
          "Name": "tr_tax",
          "Type": "decimal(18,2)"
        },
        {
          "Name": "tr_shipping",
          "Type": "decimal(18,2)"
        },
        {
          "Name": "tr_city",
          "Type": "varchar(255)"
        },
        {
          "Name": "tr_state",
          "Type": "varchar(255)"
        },
        {
          "Name": "tr_country",
          "Type": "varchar(255)"
        },
        {
          "Name": "ti_orderid",
          "Type": "varchar(255)"
        },
        {
          "Name": "ti_sku",
          "Type": "varchar(255)"
        },
        {
          "Name": "ti_name",
          "Type": "varchar(255)"
        },
        {
          "Name": "ti_category",
          "Type": "varchar(255)"
        },
        {
          "Name": "ti_price",
          "Type": "decimal(18,2)"
        },
        {
          "Name": "ti_quantity",
          "Type": "int"
        },
        {
          "Name": "pp_xoffset_min",
          "Type": "int"
        },
        {
          "Name": "pp_xoffset_max",
          "Type": "int"
        },
        {
          "Name": "pp_yoffset_min",
          "Type": "int"
        },
        {
          "Name": "pp_yoffset_max",
          "Type": "int"
        },
        {
          "Name": "useragent",
          "Type": "varchar(1000)"
        },
        {
          "Name": "br_name",
          "Type": "varchar(50)"
        },
        {
          "Name": "br_family",
          "Type": "varchar(50)"
        },
        {
          "Name": "br_version",
          "Type": "varchar(50)"
        },
        {
          "Name": "br_type",
          "Type": "varchar(50)"
        },
        {
          "Name": "br_renderengine",
          "Type": "varchar(50)"
        },
        {
          "Name": "br_lang",
          "Type": "varchar(255)"
        },
        {
          "Name": "br_features_pdf",
          "Type": "boolean"
        },
        {
          "Name": "br_features_flash",
          "Type": "boolean"
        },
        {
          "Name": "br_features_java",
          "Type": "boolean"
        },
        {
          "Name": "br_features_director",
          "Type": "boolean"
        },
        {
          "Name": "br_features_quicktime",
          "Type": "boolean"
        },
        {
          "Name": "br_features_realplayer",
          "Type": "boolean"
        },
        {
          "Name": "br_features_windowsmedia",
          "Type": "boolean"
        },
        {
          "Name": "br_features_gears",
          "Type": "boolean"
        },
        {
          "Name": "br_features_silverlight",
          "Type": "boolean"
        },
        {
          "Name": "br_cookies",
          "Type": "boolean"
        },
        {
          "Name": "br_colordepth",
          "Type": "varchar(12)"
        },
        {
          "Name": "br_viewwidth",
          "Type": "int"
        },
        {
          "Name": "br_viewheight",
          "Type": "int"
        },
        {
          "Name": "os_name",
          "Type": "varchar(50)"
        },
        {
          "Name": "os_family",
          "Type": "varchar(50)"
        },
        {
          "Name": "os_manufacturer",
          "Type": "varchar(50)"
        },
        {
          "Name": "os_timezone",
          "Type": "varchar(255)"
        },
        {
          "Name": "dvce_type",
          "Type": "varchar(50)"
        },
        {
          "Name": "dvce_ismobile",
          "Type": "boolean"
        },
        {
          "Name": "dvce_screenwidth",
          "Type": "int"
        },
        {
          "Name": "dvce_screenheight",
          "Type": "int"
        },
        {
          "Name": "doc_charset",
          "Type": "varchar(128)"
        },
        {
          "Name": "doc_width",
          "Type": "int"
        },
        {
          "Name": "doc_height",
          "Type": "int"
        },
        {
          "Name": "tr_currency",
          "Type": "char(3)"
        },
        {
          "Name": "tr_total_base",
          "Type": "decimal(18,2)"
        },
        {
          "Name": "tr_tax_base",
          "Type": "decimal(18,2)"
        },
        {
          "Name": "tr_shipping_base",
          "Type": "decimal(18,2)"
        },
        {
          "Name": "ti_currency",
          "Type": "char(3)"
        },
        {
          "Name": "ti_price_base",
          "Type": "decimal(18,2)"
        },
        {
          "Name": "base_currency",
          "Type": "char(3)"
        },
        {
          "Name": "geo_timezone",
          "Type": "varchar(64)"
        },
        {
          "Name": "mkt_clickid",
          "Type": "varchar(128)"
        },
        {
          "Name": "mkt_network",
          "Type": "varchar(64)"
        },
        {
          "Name": "etl_tags",
          "Type": "varchar(500)"
        },
        {
          "Name": "dvce_sent_tstamp",
          "Type": "timestamp"
        },
        {
          "Name": "refr_domain_userid",
          "Type": "varchar(128)"
        },
        {
          "Name": "refr_dvce_tstamp",
          "Type": "timestamp"
        },
        {
          "Name": "domain_sessionid",
          "Type": "char(128)"
        },
        {
          "Name": "derived_tstamp",
          "Type": "timestamp"
        },
        {
          "Name": "event_vendor",
          "Type": "varchar(1000)"
        },
        {
          "Name": "event_name",
          "Type": "varchar(1000)"
        },
        {
          "Name": "event_format",
          "Type": "varchar(128)"
        },
        {
          "Name": "event_version",
          "Type": "varchar(128)"
        },
        {
          "Name": "event_fingerprint",
          "Type": "varchar(128)"
        },
        {
          "Name": "true_tstamp",
          "Type": "timestamp"
        },
        {
          "Name": "contexts_com_snowplowanalytics_snowplow_web_page_1",
          "Type": "array<string>"
        },
        {
          "Name": "contexts_com_snowplowanalytics_snowplow_ua_parser_context_1",
          "Type": "array<string>"
        },
        {
          "Name": "load_tstamp",
          "Type": "timestamp"
        }
      ],
      "Location": "s3://bucket/snowplow_schema/events",
      "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
      "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
      "Compressed": false,
      "NumberOfBuckets": 0,
      "SerdeInfo": {
        "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
        "Parameters": {
          "path": "s3://bucket/snowplow_schema/events",
          "serialization.format": "1"
        }
      },
      "SortColumns": [],
      "StoredAsSubDirectories": false
    },
    "PartitionKeys": [
      {
        "Name": "collector_tstamp",
        "Type": "timestamp"
      }
    ],
    "TableType": "EXTERNAL_TABLE",
    "CreatedBy": "***",
    "IsRegisteredWithLakeFormation": false,
    "CatalogId": "***",
    "VersionId": "0"
  }
}

Apologies for the long post and thanks again for supporting us on this!

Let me know if any additional info can help.

It seems like you've bumped into the problem reported here. The fix will be included in the next release; however, we've created a 4.2.2-rc1 version that already includes it. Could you try that version to see if it solves your problem, please?


Hey @enes_aldemir, I can confirm that 4.2.2-rc1 does resolve the COPY INTO issue when catalog is set to null.
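
For anyone following along, the relevant part of our storage config now looks roughly like this (values anonymised):

  "storage": {
    "type": "databricks",
    "host": ${DATABRICKS_HOST_NAME},
    "password": ${databricks_access_token},
    "catalog": null,
    "schema": ${SCHEMA_NAME},
    "port": 443,
    "httpPath": "path",
    "userAgent": "snowplow-rdbloader-oss",
    "loadAuthMethod": {
      "type": "NoCreds"
    }
  }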

Can you confirm when you expect 4.2.2 to be released with this change?

A couple of additional points I'd call out that might be good improvements/additions to the roadmap:

Schema

The schema appeared to be missing a few columns:

contexts_com_snowplowanalytics_snowplow_web_page_1
contexts_com_snowplowanalytics_snowplow_ua_parser_context_1
load_tstamp

Are these correct? It might be good to include the SQL to create the Delta table in the docs/sample config if possible.

Delta Table

This requires the physical Delta table to be created on S3 prior to running the Loader, which is slightly impractical. It would be ideal if it were created in the same way as the manifest table, so it just works out of the box.
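
For anyone else setting this up, here is a minimal sketch of pre-creating it as a Delta table at an explicit S3 path (not exactly what we ran; the bucket/path are placeholders and the column list is the one from the CREATE TABLE statement earlier in this thread):

CREATE TABLE IF NOT EXISTS snowplow_schema.events (
  app_id                      VARCHAR(255),
  collector_tstamp            TIMESTAMP       NOT NULL
  -- ... remaining columns as in the statement above
)
USING DELTA
PARTITIONED BY (collector_tstamp)
LOCATION 's3://your-bucket/snowplow_schema/events';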

I appreciate these may not be trivial to resolve, but they would be great nice-to-haves!

Thanks for supporting us on this!

Sorry for the late response, @Daniel_Baron!

We’ve just released RDB Loader 4.2.2.

The schema appeared to be missing a few columns

It isn't possible to add the contexts_* columns at the beginning, because we don't know in advance exactly which context columns will be used. Therefore, they are added to the table later on.
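
For reference, these end up as ordinary columns on the events table; if you ever needed to add one manually, it would look roughly like this (using the array<string> type from your Glue definition above):

ALTER TABLE snowplow_schema.events
  ADD COLUMNS (contexts_com_snowplowanalytics_snowplow_web_page_1 ARRAY<STRING>);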

It would be ideal if it were created in the same way as the manifest table, so it just works out of the box.

There is an open issue for that at the moment. It will be implemented in one of the next versions.


@enes_aldemir thanks for continued support on this!

It isn't possible to add the contexts_* columns at the beginning, because we don't know in advance exactly which context columns will be used…

This makes sense - thanks for clarifying.

There is an open issue for that at the moment. It will be implemented in one of the next versions.

Thanks for pointing this out - we will be sure to track this issue.


I have one further question: I notice that the partition values on S3 are URL-encoded, e.g.

s3://bucket/snowplow-schema/events/collector_tstamp=2022-09-08 10%3A24%3A50.687/part-00000-uuid.snappy.parquet

We're considering handling this in our dbt models, but wondered if there's a setting we can change to fix it; if not, can this be opened as an issue? I can't see any existing issue covering this - happy to create one if I have access.

Thanks again!
Dan

@enes_aldemir one other question we have: why is this data partitioned on a timestamp? Is there a setting we can use to partition on a date instead?

My earlier comment covers what you're going to need for partitioning on a date instead, which is actually the recommended method following the 4.1.0 release: Databricks RDB Loader: Table or view not found for 'SHOW COLUMNS' - #3 by PaulBoocock

Basically, create your table as:

CREATE TABLE IF NOT EXISTS snowplow.events (
  app_id                      VARCHAR(255),
  collector_tstamp            TIMESTAMP       NOT NULL,
  event_name                  VARCHAR(1000),
  -- Lots of other fields go here

  -- Collector timestamp date for partitioning
  collector_tstamp_date       DATE GENERATED ALWAYS AS (DATE(collector_tstamp))
)
PARTITIONED BY (collector_tstamp_date, event_name);
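
A date-bounded query then only has to scan the matching partitions, for example (illustrative values):

SELECT count(*)
FROM snowplow.events
WHERE collector_tstamp_date = DATE '2022-09-08'
  AND event_name = 'page_view';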

I guess your second question has already been answered by Paul.

For the first question, I am not sure if it is something we can change on the loader side; it seems to be related to Databricks' internal representation. However, I will look into whether there is anything we can do there. It would be great if you could open an issue on GitHub with a short description of the problem.
