Work with ObjectRef values
This document describes ObjectRef values and how to create and use them in
BigQuery.
An ObjectRef value is a STRUCT type
with a predefined schema that references Cloud Storage objects for
multimodal analysis. It can be
processed by OBJ functions,
AI functions, or
Python user-defined functions.
Schema
An ObjectRef value has the following fields:
| Name | Type | Mode | Description | Example |
|---|---|---|---|---|
uri |
STRING |
REQUIRED |
The Cloud Storage object's URI. | "gs://cloud-samples-data/vision/demo-img.jpg" |
version |
STRING |
NULLABLE |
The object generation. | "1560286006357632" |
authorizer |
STRING |
NULLABLE |
A BigQuery connection ID for delegated
access or NULL for direct access.
The ID can have the following formats: "region.connection"or "project.region.connection" |
"myproject.us.myconnection" |
details |
JSON |
NULLABLE |
The object metadata or errors from processing the object. It can include the
fields content_type, md5_hash, size, and
updated for the object.
|
{"gcs_metadata":{"content_type":"image/png","md5_hash":"dfbbb5cf034af026d89f2dc16930be15","size":915052,"updated":1560286006000000}} |
The content_type field in the gcs_metadata field from the details
column is fetched from Cloud Storage. You can set an object's
content type in
Cloud Storage. If you omit it in Cloud Storage, then
BigQuery infers the content type from the suffix of the URI.
Create ObjectRef values
You can create ObjectRef values by using
object tables, the
OBJ.MAKE_REF function,
or Cloud Storage Insights datasets.
Use object tables
Use an object table if you don't have URIs stored in a table and want to list
all the objects from a Cloud Storage prefix. An object table stores
the reference to an object in each row, and has a ref column that contains
ObjectRef values. The following query uses the
CREATE EXTERNAL TABLE statement
to create an object table:
CREATE EXTERNAL TABLE mydataset.images
WITH CONNECTION `us.myconnection`
OPTIONS (uris=["gs://mybucket/images/*"], object_metadata="SIMPLE");
SELECT ref AS image_ref FROM mydataset.images;
ObjectRef values from an object table must have an authorizer for
delegated access. The authorizer connection is the same
connection that you use to create the object table.
Use the OBJ.MAKE_REF function
Use the OBJ.MAKE_REF
function if you already have URIs stored in a table and want to create
ObjectRef values from those URIs. The following queries show how
to create ObjectRef values in the image_ref column from the uri column
that contains Cloud Storage URIs:
-- Specify only the URI
SELECT *, OBJ.MAKE_REF(uri) AS image_ref FROM mydataset.images;
-- Specify the URI and the connection
SELECT *, OBJ.MAKE_REF(uri, "us.myconnection") AS image_ref FROM mydataset.images;
To modify the authorizers of an existing ObjectRef value, you can
use the OBJ.MAKE_REF function:
-- Remove the authorizer
SELECT *, OBJ.MAKE_REF(ref, authorizer=>NULL) AS image_ref FROM mydataset.images;
-- Change the authorizer
SELECT *, OBJ.MAKE_REF(ref, authorizer=>"us.myconnection2") AS image_ref FROM mydataset.images;
The OBJ.MAKE_REF function accepts a nullable authorizer to support
direct access and delegated access.
Use Cloud Storage Insights datasets
If you have a
Storage Insights dataset configured,
then the dataset already includes a
ref column
that contains ObjectRef values. Any ObjectRef values created in Storage
Insights datasets don't have an authorizer. To query these objects, you must
either have direct access to the object or add an authorizer
to the ObjectRef to use delegated access.
Authorizer and permissions
When you pass an ObjectRef value to ObjectRef functions, AI functions, or
Python UDFs, those functions need to access the object stored in
Cloud Storage. You can authorize this access based
on the value of the authorizer field in two ways: direct access and delegated
access.
Direct access
With direct access, the user who runs the query accesses the object directly
by using their own credentials. Direct access is used when the ObjectRef
value has no authorizer.
Direct access has the following restrictions:
- The user must have permission to access the objects.
- A query job using the
AI.GENERATE,AI.IF,AI.SCORE, orAI.CLASSIFYfunctions without a connection requires the user to have additional permissions. The query can only access Cloud Storage buckets and objects from the same project in which the job is executed.
For example, if you call the AI.GENERATE function on an ObjectRef value
that doesn't have an authorizer, then the function reads the object as you. If you
don't have permission to read the object, the function writes a
"permission denied" error to the status column in the result.
The following example shows a query that uses direct access:
-- Requires that the end user can read the object "gs://cloud-samples-data/vision/demo-img.jpg" and use the Vertex AI model.
SELECT AI.GENERATE(
("Describe this image:",
OBJ.GET_ACCESS_URL(OBJ.MAKE_REF("gs://cloud-samples-data/vision/demo-img.jpg"), 'r')));
Delegated access
With delegated access, the user who runs the query delegates object access
to a BigQuery Cloud resource connection,
which is specified in the authorizer field of the ObjectRef value.
Delegated access can enable cross-project data access.
To use delegated access, your data administrator must follow these steps to set up the connection and permissions:
- One-time setup. The data administrator must
set up a Cloud resource connection
to manage the Cloud Storage bucket:
- Create a new BigQuery Cloud resource connection or reuse an existing one in the project.
- Look up the service account in the connection's metadata.
- Grant the service account the
storage.objects.getpermission for reads, or thestorage.objects.createpermission for writes, in either the project or the Cloud Storage buckets. You can grant these permissions with the Storage Object Viewer or Storage Object User roles.
- Per-user setup. The data administrator must grant users the
bigquery.objectRefs.readpermission for reads, or thebigquery.objectRefs.writepermission for writes, to the BigQuery connection. You can grant these permissions with the BigQuery ObjectRef Reader or BigQuery ObjectRef Admin roles.
For example, if a user passes ObjectRef values that have an authorizer to an
AI.GENERATE function, then the function verifies that the user has the
bigquery.objectRefs.read permission, and then reads the objects by using the
connection's service account. If the user or the service account has
insufficient permissions, then the function writes a "permission denied" error
to the status column in the result.
The following example shows a query that uses delegated access. It requires the following:
- The user has the
bigquery.objectRefs.readpermission onconnection1. - The service account for
connection1has thestorage.objects.getpermission on the object. - The service account for
connection2has the Vertex AI User role.
SELECT AI.GENERATE(
("Describe this image:",
OBJ.GET_ACCESS_RUL(OBJ.MAKE_REF("gs://cloud-samples-data/vision/demo-img.jpg", "us.connection1"), 'r')),
connection_id => "us.connection2");
Best practices
Consider the following best practices when you decide whether to use direct or delegated access:
- Use direct access for a small team operating in a single project for
both data storage and analysis. The data administrator uses Identity and Access Management to
grant users access to BigQuery data and
Cloud Storage data. Users can create
ObjectRefvalues on demand without an authorizer to analyze objects by using their own credentials. - Use delegated access for a large team operating across multiple
projects, especially when data storage and analysis are decoupled. The data
administrator can set up connections and create
ObjectRefvalues for analysis ahead of time with a connection as their authorizer. This approach works with object tables or by usingOBJ.MAKE_REFon a list of URIs. Then, the data administrator can share the table storing theObjectRefvalues with analysts. The analysts don't need to access the original bucket to analyze the objects.
Errors
Functions that consume ObjectRef values report errors in two ways:
- Query failure: the query might fail with an error message and no result.
- Returned error values: the query succeeds, but the function might write errors as a part of the return value. For information about the format of the return value, see the reference page for the function you are using.
When a function returns an ObjectRef value, the details field of that
value might contain an errors field. If it does, the value of that field is
an array of errors. Each error has the following schema:
| Name | Type | Mode | Description | Example |
|---|---|---|---|---|
code |
INT64 |
REQUIRED |
Standard HTTP error code. | 400 |
message |
STRING |
REQUIRED |
A descriptive, user-friendly error message. | "Connection credential for myproject.us.nonexistent_connection cannot be used. Either the connection does not exist, or the user does not have sufficient permissions (bigquery.objectRefs.read)" |
source |
STRING |
REQUIRED |
The name of the function that triggered the error. | "OBJ.MAKE_REF" |
These are two common types of errors:
- Object error: the object URI or version provided doesn't exist.
- Authorizer error: the connection doesn't exist or the user has no permission to use it for delegated access.
The following query shows how to select ObjectRef values that contain errors
from an Objectref column:
SELECT ref
FROM mydataset.images
WHERE ref.details.errors IS NOT NULL;
What's next
- Specify
ObjectRefcolumns in table schemas. - Analyze multimodal data.
- Learn more about ObjectRef functions.