Query multiple tables using a wildcard table  |  BigQuery  |  Google Cloud (2023)

Wildcard tables enable you to query multiple tables using concise SQL statements. Wildcard tables are available only in Google Standard SQL. For equivalent functionality in legacy SQL, see Table wildcard functions.

A wildcard table represents a union of all the tables that match the wildcard expression. For example, the following FROM clause uses the wildcard expression gsod* to match all tables in the noaa_gsod dataset that begin with the string gsod.

FROM `bigquery-public-data.noaa_gsod.gsod*`

Each row in the wildcard table contains a special column, _TABLE_SUFFIX, which contains the value matched by the wildcard character.
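
The relationship between the prefix and _TABLE_SUFFIX can be illustrated with a small Python model. This is only a sketch of the matching rule described above, not BigQuery code: the function name match_wildcard and the sample table list are hypothetical.

```python
def match_wildcard(prefix, table_names):
    """Return {table_name: suffix} for every name that starts with prefix.

    Toy model: the suffix is whatever the trailing * would match.
    """
    return {
        name: name[len(prefix):]
        for name in table_names
        if name.startswith(prefix)
    }


# Hypothetical dataset listing; "stations" does not match the prefix.
tables = ["gsod1929", "gsod1930", "gsod1940", "stations"]

print(match_wildcard("gsod19", tables))
# {'gsod1929': '29', 'gsod1930': '30', 'gsod1940': '40'}
```

With an empty prefix, every name matches and the suffix in this model is the full table name.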

For information on wildcard table syntax, see Wildcard tables in the Google Standard SQL reference.

Limitations

Wildcard table queries are subject to the following limitations.

  • The wildcard table functionality does not support views. If the wildcard table matches any view in the dataset, the query returns an error. This is true whether or not your query contains a WHERE clause on the _TABLE_SUFFIX pseudo column to filter out the view.
  • Currently, cached results are not supported for queries against multiple tables using a wildcard, even if the Use Cached Results option is checked. If you run the same wildcard query multiple times, you are billed for each query.
  • Wildcard tables support native BigQuery storage only. You cannot use wildcards when querying an external table or a view.
  • You cannot use wildcard queries over tables with incompatible partitioning or a mix of partitioned and non-partitioned tables.
  • Queries that contain data manipulation language (DML) statements cannot use a wildcard table as the target of the query. For example, a wildcard table may be used in the FROM clause of an UPDATE query, but a wildcard table cannot be used as the target of the UPDATE operation.
  • Filters on the _TABLE_SUFFIX or _PARTITIONTIME pseudo columns that include JavaScript user-defined functions do not limit the number of tables scanned in a wildcard table.
  • Wildcard queries are not supported for tables protected by customer-managed encryption keys (CMEK).
  • When using wildcard tables, all the tables in the dataset that begin with the table name before * are scanned even if _TABLE_SUFFIX is used in combination with REGEXP_CONTAINS and is provided a regular expression, such as ^[0-9]{2}$. For example:

    SELECT *
    FROM `my_project.my_dataset.my_table_*`
    WHERE REGEXP_CONTAINS(_TABLE_SUFFIX, '^[0-9]{2}$');
  • If a single scanned table has a schema mismatch (that is, a column with the same name is of a different type), the query fails with the error Cannot read field of type X as Y Field: column_name. All tables are matched even if you are using the equality operator =. For example, in the following query, the table my_dataset.my_table_03_backup is also scanned. Thus, the query may fail due to schema mismatch. However, if there is no schema mismatch, the results come from the table my_dataset.my_table_03 only, as expected.

    SELECT *
    FROM `my_project.my_dataset.my_table_*`
    WHERE _TABLE_SUFFIX = '03'
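
The last two limitations can be pictured with a short Python model. It is a sketch under stated assumptions rather than a description of BigQuery internals: the helper matched_and_selected and the table names are hypothetical. It separates the tables matched by the prefix (which are all scanned and schema-checked, according to the limitations above) from the tables whose suffix passes the filter.

```python
import re


def matched_and_selected(prefix, suffix_filter, table_names):
    """Toy model: return (tables matched by the prefix, tables passing the filter)."""
    matched = [name for name in table_names if name.startswith(prefix)]
    selected = [name for name in matched if suffix_filter(name[len(prefix):])]
    return matched, selected


# Hypothetical table names in my_dataset.
tables = ["my_table_03", "my_table_03_backup", "my_table_2021", "other_table"]

# Same pattern as the REGEXP_CONTAINS example: exactly two digits.
two_digits = re.compile(r"^[0-9]{2}$")
matched, selected = matched_and_selected(
    "my_table_", lambda suffix: two_digits.search(suffix) is not None, tables
)

print(matched)   # ['my_table_03', 'my_table_03_backup', 'my_table_2021']
print(selected)  # ['my_table_03']
```

In this model, my_table_03_backup never contributes rows, but it is still among the matched tables, which is why a type conflict in that table can make the whole query fail.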

Before you begin

  • Ensure that you are using Google Standard SQL. For more information, see Switching SQL dialects.
  • If you are using legacy SQL, see Table wildcard functions.
  • Many of the examples on this page use a public dataset from the National Oceanic and Atmospheric Administration (NOAA). For more information about the data, see NOAA Global Surface Summary of the Day Weather Data.

When to use wildcard tables

Wildcard tables are useful when a dataset contains multiple, similarly named tables that have compatible schemas. Typically, such datasets contain tables that each represent data from a single day, month, or year. For example, a public dataset hosted by BigQuery, the NOAA Global Surface Summary of the Day Weather Data, contains a table for each year from 1929 through the present.

A query that scans all the tables from 1929 through 1940 would be very long if you had to name all 12 tables in the FROM clause (most of the tables are omitted in this sample):

#standardSQL
SELECT
  max,
  ROUND((max-32)*5/9,1) celsius,
  mo,
  da,
  year
FROM (
  SELECT * FROM `bigquery-public-data.noaa_gsod.gsod1929` UNION ALL
  SELECT * FROM `bigquery-public-data.noaa_gsod.gsod1930` UNION ALL
  SELECT * FROM `bigquery-public-data.noaa_gsod.gsod1931` UNION ALL
  # ... Tables omitted for brevity
  SELECT * FROM `bigquery-public-data.noaa_gsod.gsod1940` )
WHERE
  max != 9999.9 # code for missing data
ORDER BY
  max DESC

The same query using a wildcard table is much more concise:

#standardSQL
SELECT
  max,
  ROUND((max-32)*5/9,1) celsius,
  mo,
  da,
  year
FROM
  `bigquery-public-data.noaa_gsod.gsod19*`
WHERE
  max != 9999.9 # code for missing data
  AND _TABLE_SUFFIX BETWEEN '29' AND '40'
ORDER BY
  max DESC

Querying sets of tables using wildcard tables

Wildcard tables enable you to query several tables concisely. For example, a public dataset hosted by BigQuery, the NOAA Global Surface Summary of the Day Weather Data, contains a table for each year from 1929 through the present. These tables all share the common prefix gsod followed by the four-digit year: gsod1929, gsod1930, gsod1931, and so on.

To query a group of tables that share a common prefix, use the table wildcard symbol (*) after the table prefix in your FROM clause. For example, the following query finds the maximum temperature reported during the 1940s:

#standardSQL
SELECT
  max,
  ROUND((max-32)*5/9,1) celsius,
  mo,
  da,
  year
FROM
  `bigquery-public-data.noaa_gsod.gsod194*`
WHERE
  max != 9999.9 # code for missing data
ORDER BY
  max DESC

Filtering selected tables using _TABLE_SUFFIX

To restrict a query so that it scans only a specified set of tables, use the _TABLE_SUFFIX pseudo column in a WHERE clause with a condition that is a constant expression.

The _TABLE_SUFFIX pseudo column contains the values matched by the table wildcard. For example, the previous sample query, which scans all tables from the 1940s, uses a table wildcard to represent the last digit of the year:

FROM `bigquery-public-data.noaa_gsod.gsod194*`

The corresponding _TABLE_SUFFIX pseudo column contains values in the range 0 through 9, representing the tables gsod1940 through gsod1949. These _TABLE_SUFFIX values can be used in a WHERE clause to filter for specific tables.

For example, to filter for the maximum temperature in the years 1940 and 1944, use the values 0 and 4 for _TABLE_SUFFIX:

#standardSQL
SELECT
  max,
  ROUND((max-32)*5/9,1) celsius,
  mo,
  da,
  year
FROM
  `bigquery-public-data.noaa_gsod.gsod194*`
WHERE
  max != 9999.9 # code for missing data
  AND ( _TABLE_SUFFIX = '0'
    OR _TABLE_SUFFIX = '4' )
ORDER BY
  max DESC

Using _TABLE_SUFFIX can greatly reduce the number of bytes scanned, which helps reduce the cost of running your queries.

However, filters on _TABLE_SUFFIX that include conditions without constant expressions do not limit the number of tables scanned in a wildcard table. For example, the following query does not limit the tables scanned for the wildcard table bigquery-public-data.noaa_gsod.gsod19* because the filter uses a dynamic value that a subquery computes from the table_name column:

#standardSQL
# Scans all tables that match the prefix `gsod19`
SELECT
  ROUND((max-32)*5/9,1) celsius
FROM
  `bigquery-public-data.noaa_gsod.gsod19*`
WHERE
  _TABLE_SUFFIX = (SELECT SUBSTR(MAX(table_name), LENGTH('gsod19') + 1)
      FROM `bigquery-public-data.noaa_gsod.INFORMATION_SCHEMA.TABLES`
      WHERE table_name LIKE 'gsod194%')

As another example, the following query limits the scan based on the first filter condition, _TABLE_SUFFIX BETWEEN '40' AND '60', because it is a constant expression. However, the query does not limit the scan based on the second filter condition, _TABLE_SUFFIX = (SELECT SUBSTR(MAX(table_name), LENGTH('gsod19') + 1) FROM `bigquery-public-data.noaa_gsod.INFORMATION_SCHEMA.TABLES` WHERE table_name LIKE 'gsod194%'), because it is a dynamic expression:

#standardSQL
# Scans all tables with names that fall between `gsod1940` and `gsod1960`
SELECT
  ROUND((max-32)*5/9,1) celsius
FROM
  `bigquery-public-data.noaa_gsod.gsod19*`
WHERE
  _TABLE_SUFFIX BETWEEN '40' AND '60'
  AND _TABLE_SUFFIX = (SELECT SUBSTR(MAX(table_name), LENGTH('gsod19') + 1)
      FROM `bigquery-public-data.noaa_gsod.INFORMATION_SCHEMA.TABLES`
      WHERE table_name LIKE 'gsod194%')

As a workaround, you can perform two separate queries instead; for example:

First query:

#standardSQL
# Get the suffix of the last table name (in sort order) that matches `gsod194`
SELECT
  SUBSTR(MAX(table_name), LENGTH('gsod19') + 1)
FROM
  `bigquery-public-data.noaa_gsod.INFORMATION_SCHEMA.TABLES`
WHERE
  table_name LIKE 'gsod194%'

Second query:

#standardSQL
# Construct the second query based on the values from the first query
SELECT
  ROUND((max-32)*5/9,1) celsius
FROM
  `bigquery-public-data.noaa_gsod.gsod19*`
WHERE
  _TABLE_SUFFIX = '49'

These example queries use the INFORMATION_SCHEMA.TABLES view. For more information on the INFORMATION_SCHEMA table, see Getting table metadata using INFORMATION_SCHEMA.
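
The second step of the workaround can be scripted. The following Python sketch assumes that the suffix returned by the first query is already available as a string; the function name build_second_query is hypothetical, and running the queries through a client library is deliberately left out. It only shows how the value becomes a constant literal in the query text.

```python
import re


def build_second_query(suffix):
    """Return the second query with the suffix embedded as a string literal."""
    # Only letters, digits, and underscores are accepted here (an assumption
    # of this sketch) so that the value cannot break out of the literal.
    if not re.fullmatch(r"[A-Za-z0-9_]+", suffix):
        raise ValueError(f"unexpected table suffix: {suffix!r}")
    return (
        "SELECT ROUND((max-32)*5/9,1) celsius\n"
        "FROM `bigquery-public-data.noaa_gsod.gsod19*`\n"
        f"WHERE _TABLE_SUFFIX = '{suffix}'"
    )


# "49" stands in for the value returned by the first query.
print(build_second_query("49"))
```

Because the resulting filter compares _TABLE_SUFFIX with a literal, it is a constant expression of the kind described earlier in this section.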

Scanning a range of tables using _TABLE_SUFFIX

To scan a range of tables, use the _TABLE_SUFFIX pseudo column along with the BETWEEN clause. For example, to find the maximum temperature reported in the years between 1929 and 1935 inclusive, use the table wildcard to represent the last two digits of the year:

#standardSQL
SELECT
  max,
  ROUND((max-32)*5/9,1) celsius,
  mo,
  da,
  year
FROM
  `bigquery-public-data.noaa_gsod.gsod19*`
WHERE
  max != 9999.9 # code for missing data
  AND _TABLE_SUFFIX BETWEEN '29' AND '35'
ORDER BY
  max DESC
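
Because _TABLE_SUFFIX holds strings, the BETWEEN bounds are compared as strings rather than as numbers. The following Python sketch models that comparison for the range above; the helper tables_in_suffix_range and the table lists are hypothetical, and the model says nothing about how BigQuery executes the query.

```python
def tables_in_suffix_range(prefix, low, high, table_names):
    """Toy model: keep tables whose suffix sorts between low and high (inclusive)."""
    selected = []
    for name in table_names:
        if name.startswith(prefix):
            suffix = name[len(prefix):]
            # String comparison, as with BETWEEN on string values.
            if low <= suffix <= high:
                selected.append(name)
    return selected


# Hypothetical listing of the yearly tables from 1929 through 1940.
tables = [f"gsod{year}" for year in range(1929, 1941)]

print(tables_in_suffix_range("gsod19", "29", "35", tables))
# ['gsod1929', 'gsod1930', 'gsod1931', 'gsod1932', 'gsod1933', 'gsod1934', 'gsod1935']
```

One consequence of string ordering is that suffixes of different lengths can fall inside the range: in this model a hypothetical table named gsod193 would also be selected, because '3' sorts between '29' and '35'.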

Scanning a range of ingestion-time partitioned tables using _PARTITIONTIME

To scan a range of ingestion-time partitioned tables, use the _PARTITIONTIME pseudo column with the _TABLE_SUFFIX pseudo column. For example, the following query scans the January 1, 2017 partition in the table my_dataset.mytable_id1.

#standardSQL
SELECT
  field1,
  field2,
  field3
FROM
  `my_dataset.mytable_*`
WHERE
  _TABLE_SUFFIX = 'id1'
  AND _PARTITIONTIME = TIMESTAMP('2017-01-01')

Querying all tables in a dataset

To scan all tables in a dataset, you can use an empty prefix and the table wildcard, which means that the _TABLE_SUFFIX pseudo column contains full table names. For example, the following FROM clause scans all tables in the GSOD dataset:

FROM `bigquery-public-data.noaa_gsod.*`

For example, the following query is equivalent to the previous example that finds the maximum temperature between the years 1929 and 1935, but uses full table names in the WHERE clause:

#standardSQL
SELECT
  max,
  ROUND((max-32)*5/9,1) celsius,
  mo,
  da,
  year
FROM
  `bigquery-public-data.noaa_gsod.*`
WHERE
  max != 9999.9 # code for missing data
  AND _TABLE_SUFFIX BETWEEN 'gsod1929' AND 'gsod1935'
ORDER BY
  max DESC

Note, however, that longer prefixes generally perform better. For more information, see Best practices.

Query execution details

Schema used for query evaluation

To execute a Google Standard SQL query that uses a wildcard table, BigQuery automatically infers the schema for that table. BigQuery uses the schema for the most recently created table that matches the wildcard as the schema for the wildcard table. Even if you restrict the number of tables that you want to use from the wildcard table using the _TABLE_SUFFIX pseudo column in a WHERE clause, BigQuery uses the schema for the most recently created table that matches the wildcard.

If a column from the inferred schema doesn't exist in a matched table, then BigQuery returns NULL values for that column in the rows for the table that is missing the column.

If the schema is inconsistent across the tables matched by the wildcard query, then BigQuery returns an error. This is the case when the columns of the matched tables have different data types, or when columns that are not present in all of the matched tables cannot be assumed to have a NULL value.
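
These rules can be summarized in a toy Python model. It is a sketch under stated assumptions, not BigQuery's implementation: the function evaluate_wildcard, the dictionary layout, and the sample tables are hypothetical, the model simply ignores columns that are absent from the newest table, and it does not cover the case of columns that cannot be assumed to be NULL.

```python
def evaluate_wildcard(tables):
    """Toy model: read all rows using the schema of the newest matched table."""
    newest = max(tables, key=lambda table: table["created"])
    schema = newest["schema"]
    rows = []
    for table in tables:
        # A column that exists in both schemas must have the same type.
        for column, column_type in table["schema"].items():
            if column in schema and schema[column] != column_type:
                raise TypeError(
                    f"column {column!r}: {column_type} does not match {schema[column]}"
                )
        # Columns missing from this table are filled with None (NULL).
        for row in table["rows"]:
            rows.append({column: row.get(column) for column in schema})
    return rows


# Hypothetical tables: the newer one has an extra column.
old = {"created": 1, "schema": {"max": "FLOAT64"}, "rows": [{"max": 71.2}]}
new = {
    "created": 2,
    "schema": {"max": "FLOAT64", "wban": "STRING"},
    "rows": [{"max": 80.1, "wban": "99999"}],
}

print(evaluate_wildcard([old, new]))
# [{'max': 71.2, 'wban': None}, {'max': 80.1, 'wban': '99999'}]
```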

Best practices

  • Longer prefixes generally perform better than shorter prefixes. For example, the following query uses a long prefix (gsod200):

    #standardSQL
    SELECT
      max
    FROM
      `bigquery-public-data.noaa_gsod.gsod200*`
    WHERE
      max != 9999.9 # code for missing data
      AND _TABLE_SUFFIX BETWEEN '0' AND '1'
    ORDER BY
      max DESC

    The following query generally performs worse because it uses an empty prefix:

    #standardSQL
    SELECT
      max
    FROM
      `bigquery-public-data.noaa_gsod.*`
    WHERE
      max != 9999.9 # code for missing data
      AND _TABLE_SUFFIX BETWEEN 'gsod2000' AND 'gsod2001'
    ORDER BY
      max DESC
  • Partitioning is recommended over sharding, because partitioned tables perform better. Sharding reduces performance while creating more tables to manage. For more information, see Partitioning versus sharding.

For best practices for controlling costs in BigQuery, see Controlling costs in BigQuery.

What's next

  • For more information about Google Standard SQL, see the Google Standard SQL query reference.
Article information

Author: Arline Emard IV

Last Updated: 02/23/2023
