PDF files, such as financial documents, can be challenging to use in RAG
pipelines because of their complex structure and mix of text, figures, and
tables. This tutorial shows you how to use the
AI.PARSE_DOCUMENT function
in combination with Document AI's layout parser to build a RAG pipeline
based on key information extracted from a PDF file.
Objectives
This tutorial covers the following tasks:
- Create a Cloud Storage bucket and upload a sample PDF file.
- Create a Document AI processor that you can use to parse the PDF file.
- Use the AI.PARSE_DOCUMENT function to parse the PDF contents into chunks and then write that content to a BigQuery table.
- Generate embeddings from the parsed PDF content, and then write those embeddings to a BigQuery table. Embeddings are numerical representations of the PDF content that enable you to perform semantic search and retrieval on the PDF content.
- Use the VECTOR_SEARCH function on the embeddings to identify semantically similar PDF content.
- Perform retrieval-augmented generation (RAG) by using the AI.GENERATE function to generate text, using vector search results to augment the prompt input and improve results.
Costs
In this document, you use the following billable components of Google Cloud:
- BigQuery: You incur costs for the data that you process in BigQuery.
- Gemini Enterprise Agent Platform: You incur costs for calls to Agent Platform models.
- Document AI: You incur costs for calls to the Document AI API.
- Cloud Storage: You incur costs for object storage in Cloud Storage.
To generate a cost estimate based on your projected usage,
use the pricing calculator.
When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.
Before you begin
Console
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
  Roles required to select or create a project
  - Select a project: Selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Google Cloud project.
- Enable the BigQuery, BigQuery Connection, Vertex AI, Document AI, and Cloud Storage APIs.
  Roles required to enable APIs
  To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
- Make sure that you have the following roles on the project: Storage Admin, Document AI Editor, BigQuery Admin, Project IAM Admin
  Check for the roles
  - In the Google Cloud console, go to the IAM page.
  - Select the project.
  - In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
  - For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
  Grant the roles
  - In the Google Cloud console, go to the IAM page.
  - Select the project.
  - Click Grant access.
  - In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
  - Click Select a role, then search for the role.
  - To grant additional roles, click Add another role and add each additional role.
  - Click Save.
gcloud
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- Install the Google Cloud CLI.
- If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
- To initialize the gcloud CLI, run the following command:
  gcloud init
- Create or select a Google Cloud project.
  Roles required to select or create a project
  - Select a project: Selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
  - Create a Google Cloud project:
    gcloud projects create PROJECT_ID
    Replace PROJECT_ID with a name for the Google Cloud project you are creating.
  - Select the Google Cloud project that you created:
    gcloud config set project PROJECT_ID
    Replace PROJECT_ID with your Google Cloud project name.
- Verify that billing is enabled for your Google Cloud project.
- Enable the BigQuery, BigQuery Connection, Vertex AI, Document AI, and Cloud Storage APIs:
  Roles required to enable APIs
  To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
  gcloud services enable bigquery.googleapis.com \
    bigqueryconnection.googleapis.com aiplatform.googleapis.com documentai.googleapis.com storage.googleapis.com
- Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/storage.admin, roles/documentai.editor, roles/bigquery.admin, roles/resourcemanager.projectIamAdmin
  gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE
  Replace the following:
  - PROJECT_ID: Your project ID.
  - USER_IDENTIFIER: The identifier for your user account. For example, myemail@example.com.
  - ROLE: The IAM role that you grant to your user account.
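Because the binding command has to be run once per role, the repetition can be scripted. The following Python sketch only builds and prints the four command lines from the gcloud command shown above; it doesn't run them. The project ID and email address are placeholder assumptions, and the helper name build_grant_commands is hypothetical.

```python
import shlex

# The four IAM roles that this tutorial asks you to grant.
ROLES = [
    "roles/storage.admin",
    "roles/documentai.editor",
    "roles/bigquery.admin",
    "roles/resourcemanager.projectIamAdmin",
]


def build_grant_commands(project_id: str, user_identifier: str) -> list[str]:
    """Return one add-iam-policy-binding command line per required role."""
    commands = []
    for role in ROLES:
        args = [
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member=user:{user_identifier}",
            f"--role={role}",
        ]
        # shlex.join quotes arguments where needed so each line is shell-safe.
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    # Placeholder values (assumptions): substitute your own project and account.
    for command in build_grant_commands("my-project", "myemail@example.com"):
        print(command)
```

Review the printed lines before pasting them into a shell, because each one changes the IAM policy of the project.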
Create a dataset
Create a BigQuery dataset to store the tables that you create in this tutorial.
Console
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click View actions > Create dataset.
On the Create dataset page, do the following:
For Dataset ID, enter bqml_tutorial.
For Location type, select Multi-region, and then select US.
Leave the remaining default settings as they are, and click Create dataset.
bq
To create a new dataset, use the bq mk --dataset command.
Create a dataset named bqml_tutorial with the data location set to US.
bq mk --dataset \
  --location=US \
  --description "BigQuery ML tutorial dataset." \
  bqml_tutorial
Confirm that the dataset was created:
bq ls
API
Call the datasets.insert method with a defined dataset resource.
{
  "datasetReference": {
    "datasetId": "bqml_tutorial"
  }
}
Upload the sample PDF to Cloud Storage
To upload the sample PDF to Cloud Storage, follow these steps:
- Download the scf23.pdf sample PDF by going to https://www.federalreserve.gov/publications/files/scf23.pdf and clicking download.
- Create a Cloud Storage bucket.
- Upload the scf23.pdf file to the bucket.
Create a document processor
Create a document processor
based on the layout parser processor
in the us multi-region. Copy the prediction endpoint from the
Processor details page to use in the next section.
Parse the PDF file into chunks
Use the document processor with the AI.PARSE_DOCUMENT function to parse the
PDF file into chunks, and then write that content to a table.
In the Google Cloud console, go to the BigQuery page.
In the query editor, run the following statement:
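CREATE OR REPLACE TABLE bqml_tutorial.parsed_pdf AS (
  SELECT *
  FROM AI.PARSE_DOCUMENT(
    (SELECT OBJ.MAKE_REF("gs://BUCKET/scf23.pdf") AS ref),
    endpoint => "PREDICTION_ENDPOINT",
    chunk_size => 250)
);
Replace the following:
BUCKET: The name of the Cloud Storage bucket that contains the scf23.pdf file.
PREDICTION_ENDPOINT: The prediction endpoint that you copied from the Processor details page.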
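To build intuition for what a chunk size limit does, here is a deliberately simplified Python sketch that splits text into consecutive chunks of at most a given number of whitespace-separated words. It is an illustration under stated assumptions, not the layout parser's algorithm: the layout parser chunks by document layout, and the unit that chunk_size counts may differ from words. The function name chunk_words and the sample text are hypothetical.

```python
def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    # Step through the word list in strides of chunk_size.
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


if __name__ == "__main__":
    sample = "one two three four five six seven"
    for number, chunk in enumerate(chunk_words(sample, chunk_size=3), start=1):
        print(number, chunk)
```

Smaller chunks give more precise retrieval targets but less context per chunk, which is the trade-off to keep in mind when choosing a value.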
Generate embeddings
Generate embeddings for the parsed PDF content and then write them to a table:
In the Google Cloud console, go to the BigQuery page.
In the query editor, run the following statement:
CREATE OR REPLACE TABLE `bqml_tutorial.embeddings` AS (
  SELECT
    *,
    AI.EMBED(content, endpoint => 'text-embedding-005') AS embedding
  FROM bqml_tutorial.parsed_pdf
);
Run a vector search
Run a vector search against the parsed PDF content.
The following query takes text input, creates an embedding for that input
using the AI.EMBED function, and then uses the VECTOR_SEARCH
function to match the input embedding with the most similar PDF content
embeddings. The results are the top three PDF chunks that are most related
to changes in family net worth.
Go to the BigQuery page.
In the query editor, run the following SQL statement:
SELECT
  distance,
  base.chunk_id,
  base.start_page,
  base.end_page,
  base.content
FROM
  VECTOR_SEARCH(
    TABLE `bqml_tutorial.embeddings`,
    'embedding',
    query_value => AI.EMBED(
      'Did the typical family net worth increase? If so, by how much?',
      endpoint => 'text-embedding-005').result,
    top_k => 3,
    OPTIONS => '{"fraction_lists_to_search": 0.01}')
ORDER BY distance DESC;
The output is similar to the following:
+----------+----------+------------+----------+-----------------------------------+
| distance | chunk_id | start_page | end_page | content                           |
+----------+----------+------------+----------+-----------------------------------+
| 0.645685 | 26       | 17         | 18       | 18 Between the first quarter of   |
|          |          |            |          | 2019 and the first quarter of...  |
+----------+----------+------------+----------+-----------------------------------+
| 0.602665 | 30       | 19         | 21       | ## Net Worth by Family            |
|          |          |            |          | Characteristics...                |
+----------+----------+------------+----------+-----------------------------------+
| 0.599438 | 24       | 17         | 21       | # Net Worth                       |
|          |          |            |          | The net improvements in...        |
+----------+----------+------------+----------+-----------------------------------+
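Conceptually, a vector search without an index compares the query embedding with every stored embedding and keeps the closest rows. The following Python sketch shows that brute-force idea on tiny made-up vectors. It assumes Euclidean distance, and the three-dimensional embeddings, the chunk IDs, and the function name top_k_nearest are all illustrative; none of the values come from the tutorial's tables, and this is not a description of how BigQuery implements the search.

```python
import math


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_nearest(query, rows, top_k):
    """Return the top_k (distance, chunk_id) pairs closest to the query."""
    scored = [(euclidean_distance(query, emb), chunk_id) for chunk_id, emb in rows]
    # A smaller distance means a closer match, so sort ascending and truncate.
    return sorted(scored)[:top_k]


if __name__ == "__main__":
    # Made-up embeddings keyed by a made-up chunk ID.
    rows = [
        (1, [1.0, 0.0, 0.0]),
        (2, [0.0, 1.0, 0.0]),
        (3, [0.9, 0.1, 0.0]),
        (4, [0.0, 0.0, 1.0]),
    ]
    for distance, chunk_id in top_k_nearest([1.0, 0.0, 0.0], rows, top_k=2):
        print(chunk_id, round(distance, 4))
```

Because a smaller distance means a closer match, the row with the lowest distance value is the most similar chunk, whatever order the rows are displayed in.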
Generate text augmented by vector search results
Perform a vector search on the embeddings to identify semantically similar
PDF content, and then use the AI.GENERATE function with the vector
search results to augment the prompt input and improve the text generation
results. In this case, the query uses information from the PDF chunks to answer
a question about the change in family net worth over the past decade.
In the Google Cloud console, go to the BigQuery page.
In the query editor, run the following statement:
SELECT
  AI.GENERATE(
    CONCAT(
      'Did the typical family net worth change? How does this compare to the SCF survey a decade earlier? Be concise and use the following context:',
      STRING_AGG(FORMAT("context: %s", base.content), ',\n')
    )
  ).result AS response
FROM
  VECTOR_SEARCH(
    TABLE `bqml_tutorial.embeddings`,
    'embedding',
    query_value => AI.EMBED(
      'Did the typical family net worth increase? If so, by how much?',
      endpoint => 'text-embedding-005').result,
    top_k => 3,
    OPTIONS => '{"fraction_lists_to_search": 0.01}');
The output is similar to the following:
+-------------------------------------------------------------------------+
| response                                                                |
+-------------------------------------------------------------------------+
| Yes, the typical family net worth changed significantly.                |
|                                                                         |
| Real median net worth surged 37% between the 2019 and 2022 SCF surveys. |
| This contrasts sharply with a decade earlier (2010-2013), when real     |
| median net worth decreased 2%.                                          |
+-------------------------------------------------------------------------+
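The prompt that the query sends to the model is plain string assembly: the instruction text, followed by each retrieved chunk prefixed with "context: " and joined by a comma and a newline. The following Python sketch reproduces that assembly locally so that you can inspect the shape of the augmented prompt. The function name build_prompt and the sample chunk texts are hypothetical, and no model is called.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Append retrieved chunks to the question as labeled context entries."""
    # Equivalent in spirit to STRING_AGG(FORMAT("context: %s", content), ',\n').
    context = ",\n".join(f"context: {chunk}" for chunk in chunks)
    # Like CONCAT in the query, no separator is added between the two parts.
    return question + context


if __name__ == "__main__":
    # Made-up chunk texts standing in for vector search results.
    retrieved = ["first retrieved chunk", "second retrieved chunk"]
    print(build_prompt("Be concise and use the following context:", retrieved))
```

Printing the assembled prompt before sending it to a model is a cheap way to confirm that the retrieved chunks are actually relevant to the question.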
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the project
Delete a Google Cloud project:
gcloud projects delete PROJECT_ID
What's next
- Learn more about the AI.PARSE_DOCUMENT function.
- Learn more about performing semantic search and RAG.