The AI.PARSE_DOCUMENT function
This document describes the AI.PARSE_DOCUMENT function, which parses
documents such as PDFs and extracts structured information, including text
chunks and page boundaries. It uses the Document AI layout parser as the
processing backend. This parser creates semantically coherent chunks that are
augmented with contextual information.
To use this function, you must enable the Document AI API and create a layout parser.
Syntax
AI.PARSE_DOCUMENT(
{ TABLE TABLE_NAME | (QUERY_STATEMENT) },
endpoint => ENDPOINT
[, chunk_size => CHUNK_SIZE ]
[, connection_id => CONNECTION_ID ]
)
Arguments
The AI.PARSE_DOCUMENT function takes the following arguments:
- TABLE_NAME: the name of the table that contains the documents to be parsed. The table must include a column named ref of type OBJECTREF.
- QUERY_STATEMENT: a GoogleSQL query that produces the documents to be parsed. The query result must include a column named ref of type OBJECTREF.
- ENDPOINT: a STRING value that specifies the Document AI layout parser to use. You can provide the endpoint in one of the following formats:
  - Resource URL: projects/{project}/locations/{location}/processors/{processor_id}
  - Full Resource URL: https://{region}-documentai.googleapis.com/v1/projects/{project}/locations/{location}/processors/{processor_id}:process
- CHUNK_SIZE: an INT64 value that specifies the chunk size for the layout parser to use. The default value is 1000.
- CONNECTION_ID: a STRING value that specifies the Cloud resource connection to use to communicate with the model, in the format [PROJECT_ID].LOCATION.CONNECTION_ID. For example, myproject.us.myconnection. You must grant the connection the Document AI Viewer role (roles/documentai.viewer) on the project in which you run the query.

  If you don't specify a connection, then the query uses your end-user credentials. If you use your end-user credentials, then you must have the Document AI Viewer role (roles/documentai.viewer) on the project in which you run the query.
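As a minimal sketch of how the optional arguments combine with the required ones, the following query passes chunk_size and connection_id by name. The table mydataset.pdf_table, the endpoint, and the connection myproject.us.myconnection are placeholders, and the chunk size of 500 is an arbitrary illustrative value rather than a recommendation:

```sql
-- Sketch only: every identifier below is a placeholder.
SELECT *
FROM AI.PARSE_DOCUMENT(
  TABLE mydataset.pdf_table,
  -- Placeholder endpoint; replace with your own layout parser.
  endpoint => 'projects/123456789/locations/us/processors/123abc456def:process',
  -- Overrides the default chunk size of 1000.
  chunk_size => 500,
  -- Uses a Cloud resource connection instead of end-user credentials.
  connection_id => 'myproject.us.myconnection'
);
```

If you omit connection_id, the same query runs with your end-user credentials, subject to the role requirement described above.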
Output
The AI.PARSE_DOCUMENT function returns all columns from the input table
in addition to the following columns:
- chunk_id: an INT64 value that contains the sequence ID of an extracted document chunk, starting from 1.
- start_page: an INT64 value that contains the page number where the chunk starts.
- end_page: an INT64 value that contains the page number where the chunk ends.
- content: a STRING value that contains the extracted text content of the chunk.
- status: a STRING value that contains the RPC status message. Returns an empty string on success, or an error message if the request fails.
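Because status is an empty string on success, one way to separate usable chunks from failed requests is to filter on that column. The following is a sketch under assumptions: mydataset.pdf_table and the endpoint are placeholders, and the column aliases in the second query are hypothetical names chosen for illustration:

```sql
-- Keep only the rows whose request succeeded (empty status).
SELECT chunk_id, start_page, end_page, content
FROM AI.PARSE_DOCUMENT(
  TABLE mydataset.pdf_table,
  endpoint => 'projects/123456789/locations/us/processors/123abc456def:process'
)
WHERE status = '';

-- Summarize how many rows succeeded and how many reported an error.
SELECT
  COUNTIF(status = '') AS succeeded_rows,
  COUNTIF(status != '') AS failed_rows
FROM AI.PARSE_DOCUMENT(
  TABLE mydataset.pdf_table,
  endpoint => 'projects/123456789/locations/us/processors/123abc456def:process'
);
```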
Limitations
An input document can have at most 130 pages.
Examples
Before using AI.PARSE_DOCUMENT, you must first
create a layout parser.
The following example parses a publicly available PDF of the first three pages of the Winnie-the-Pooh story. Replace the endpoint with your own layout parser:
SELECT * EXCEPT(ref)
FROM AI.PARSE_DOCUMENT(
  (
    SELECT
      OBJ.MAKE_REF("gs://cloud-samples-data/documentai/SampleDocuments/OCR_PROCESSOR/Winnie_the_Pooh_3_Pages.pdf") AS ref
  ),
  endpoint => "projects/123456789/locations/us/processors/123abc456def:process",
  chunk_size => 100
);
The result is similar to the following:
+----------+------------+----------+----------------------------------+--------+
| chunk_id | start_page | end_page | content                          | status |
+----------+------------+----------+----------------------------------+--------+
| 1        | 1          | 3        | # CHAPTER I                      |        |
|          |            |          |                                  |        |
|          |            |          | IN WHICH We Are Introduced to    |        |
|          |            |          | Winnie-the-Pooh and Some Bees,   |        |
|          |            |          | and the Stories Begin...         |        |
+----------+------------+----------+----------------------------------+--------+
| 2        | 1          | 2        | When I first heard his name, I   |        |
|          |            |          | said, just as you are going to   |        |
|          |            |          | say, "But I thought he was a     |        |
|          |            |          | boy?" "So did I," said           |        |
|          |            |          | Christopher Robin. "Then you...  |        |
+----------+------------+----------+----------------------------------+--------+
| 3        | 2          | 2        | Sometimes Winnie-the-Pooh likes  |        |
|          |            |          | a game of some sort when he      |        |
|          |            |          | comes downstairs, and sometimes  |        |
|          |            |          | he likes to sit quietly in front |        |
|          |            |          | of the fire and listen to a...   |        |
+----------+------------+----------+----------------------------------+--------+
| ...      | ...        | ...      | ...                              | ...    |
+----------+------------+----------+----------------------------------+--------+
The following example creates an object table from Cloud Storage, using
wildcard syntax to include multiple documents, and then passes the table
directly as an argument to the AI.PARSE_DOCUMENT function:
CREATE EXTERNAL TABLE mydataset.pdf_table
WITH CONNECTION DEFAULT
OPTIONS(
  object_metadata = 'SIMPLE',
  uris = ['gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/document_chunks/*']);

SELECT *
FROM AI.PARSE_DOCUMENT(
  TABLE mydataset.pdf_table,
  endpoint => 'projects/123456789/locations/us/processors/123abc456def:process'
);
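If you plan to query the parsed chunks more than once, one option is to store the function output in a table so that later queries read the stored rows instead of calling the layout parser again. This is a sketch, not a prescribed workflow: the destination table mydataset.parsed_chunks is a hypothetical name, and the endpoint is the same placeholder used in the preceding example:

```sql
-- Materialize the parsed chunks once (hypothetical destination table).
CREATE OR REPLACE TABLE mydataset.parsed_chunks AS
SELECT * EXCEPT(ref)
FROM AI.PARSE_DOCUMENT(
  TABLE mydataset.pdf_table,
  endpoint => 'projects/123456789/locations/us/processors/123abc456def:process'
);

-- Later queries read the stored rows without invoking the function.
SELECT chunk_id, start_page, end_page, content
FROM mydataset.parsed_chunks
WHERE status = '';
```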