Parse PDFs in a retrieval-augmented generation pipeline

This tutorial guides you through the process of creating a retrieval-augmented generation (RAG) pipeline based on parsed PDF content.

PDF files, such as financial documents, can be challenging to use in RAG pipelines because of their complex structure and mix of text, figures, and tables. This tutorial shows you how to use the AI.PARSE_DOCUMENT function in combination with Document AI's layout parser to build a RAG pipeline based on key information extracted from a PDF file.

Objectives

This tutorial covers the following tasks:

  • Create a Cloud Storage bucket and upload a sample PDF file.
  • Create a Document AI processor that you can use to parse the PDF file.
  • Use the AI.PARSE_DOCUMENT function to parse the PDF contents into chunks and then write that content to a BigQuery table.
  • Generate embeddings from the parsed PDF content, and then write those embeddings to a BigQuery table. Embeddings are numerical representations of the PDF content that enable you to perform semantic search and retrieval on the PDF content.
  • Use the VECTOR_SEARCH function on the embeddings to identify semantically similar PDF content.
  • Perform retrieval-augmented generation (RAG) by using the AI.GENERATE function to generate text, using vector search results to augment the prompt input and improve results.

Costs

This tutorial uses billable components of Google Cloud, including BigQuery, Vertex AI, Document AI, and Cloud Storage.

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Before you begin

Console

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the BigQuery, BigQuery Connection, Vertex AI, Document AI, and Cloud Storage APIs.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the APIs

  5. Make sure that you have the following role or roles on the project: Storage Admin, Document AI Editor, BigQuery Admin, Project IAM Admin

    Check for the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.

    4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

    Grant the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. Click Grant access.
    4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.

    5. Click Select a role, then search for the role.
    6. To grant additional roles, click Add another role and add each additional role.
    7. Click Save.

gcloud

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. Install the Google Cloud CLI.

  3. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  4. To initialize the gcloud CLI, run the following command:

    gcloud init
  5. Create or select a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
    • Create a Google Cloud project:

      gcloud projects create PROJECT_ID

      Replace PROJECT_ID with a name for the Google Cloud project you are creating.

    • Select the Google Cloud project that you created:

      gcloud config set project PROJECT_ID

      Replace PROJECT_ID with your Google Cloud project name.

  6. Verify that billing is enabled for your Google Cloud project.

  7. Enable the BigQuery, BigQuery Connection, Vertex AI, Document AI, and Cloud Storage APIs:

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    gcloud services enable bigquery.googleapis.com bigqueryconnection.googleapis.com aiplatform.googleapis.com documentai.googleapis.com storage.googleapis.com
  8. Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/storage.admin, roles/documentai.editor, roles/bigquery.admin, roles/resourcemanager.projectIamAdmin

    gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE

    Replace the following:

    • PROJECT_ID: Your project ID.
    • USER_IDENTIFIER: The identifier for your user account. For example, myemail@example.com.
    • ROLE: The IAM role that you grant to your user account.
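Because the binding command has to be run once per role, it can help to generate all four command lines up front and review them before running anything. The following Python sketch only builds and prints the commands. The `grant_commands` helper and the `my-project` value are hypothetical and are not part of the gcloud CLI.

```python
import shlex

# The four roles listed in the step above.
ROLES = [
    "roles/storage.admin",
    "roles/documentai.editor",
    "roles/bigquery.admin",
    "roles/resourcemanager.projectIamAdmin",
]


def grant_commands(project_id, user_identifier, roles=ROLES):
    """Return one gcloud command line (as a string) per role."""
    commands = []
    for role in roles:
        args = [
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member=user:{user_identifier}",
            f"--role={role}",
        ]
        # shlex.join quotes any argument that the shell would misread.
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    # Hypothetical values; replace them with your own before running the output.
    for command in grant_commands("my-project", "myemail@example.com"):
        print(command)
```

You can paste the printed lines into your terminal one at a time, which keeps each role grant visible and easy to audit.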

Create a dataset

Create a BigQuery dataset to store the tables that you create in this tutorial.

Console

  1. In the Google Cloud console, go to the BigQuery page.

    Go to the BigQuery page

  2. In the Explorer pane, click your project name.

  3. Click View actions > Create dataset.

  4. On the Create dataset page, do the following:

    • For Dataset ID, enter bqml_tutorial.

    • For Location type, select Multi-region, and then select US.

    • Leave the remaining default settings as they are, and click Create dataset.

bq

To create a new dataset, use the bq mk --dataset command.

  1. Create a dataset named bqml_tutorial with the data location set to US.

    bq mk --dataset \
      --location=US \
      --description "BigQuery ML tutorial dataset." \
      bqml_tutorial
  2. Confirm that the dataset was created:

    bq ls

API

Call the datasets.insert method with a defined dataset resource.

{
  "datasetReference": {
     "datasetId": "bqml_tutorial"
  }
}
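The request body above doesn't set a location, whereas the Console and bq steps place the dataset in the US multi-region. As a sketch, the following Python builds the REST URL and a body that also sets the dataset resource's `location` field. Nothing is sent, authentication is omitted, and `dataset_insert_request` and `my-project` are hypothetical names used only for illustration.

```python
import json


def dataset_insert_request(project_id, dataset_id="bqml_tutorial", location="US"):
    """Build the URL and JSON body for a datasets.insert call (not sent)."""
    url = (
        "https://bigquery.googleapis.com/bigquery/v2/"
        f"projects/{project_id}/datasets"
    )
    body = {
        "datasetReference": {"projectId": project_id, "datasetId": dataset_id},
        # Matches the US multi-region used in the Console and bq steps.
        "location": location,
    }
    return url, body


if __name__ == "__main__":
    url, body = dataset_insert_request("my-project")  # hypothetical project ID
    print("POST", url)
    print(json.dumps(body, indent=2))
```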

Upload the sample PDF to Cloud Storage

To upload the sample PDF to Cloud Storage, follow these steps:

  1. Download the scf23.pdf sample PDF by going to https://www.federalreserve.gov/publications/files/scf23.pdf and clicking the download icon.
  2. Create a Cloud Storage bucket.
  3. Upload the scf23.pdf file to the bucket.

Create a document processor

Create a document processor based on the layout parser processor in the us multi-region. Copy the prediction endpoint from the Processor details page to use in the next section.

Parse the PDF file into chunks

Use the document processor with the AI.PARSE_DOCUMENT function to parse the PDF file into chunks, and then write that content to a table.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following statement:

    CREATE OR REPLACE TABLE bqml_tutorial.parsed_pdf
    AS (
      SELECT *
      FROM
        AI.PARSE_DOCUMENT(
          (
            SELECT
              OBJ.MAKE_REF("gs://BUCKET/scf23.pdf") AS ref
          ),
          endpoint => "PREDICTION_ENDPOINT",
          chunk_size => 250)
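    );

    Replace the following:

    • BUCKET: The name of the Cloud Storage bucket that contains the scf23.pdf file.
    • PREDICTION_ENDPOINT: The prediction endpoint that you copied from the Processor details page.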
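To build intuition for what a size cap such as `chunk_size => 250` does, here is a minimal sketch in plain Python. It is a simplified stand-in rather than the parser's actual behavior: it assumes chunks are measured in whitespace-separated words and ignores document layout, whereas the layout parser is layout-aware and may measure size differently. The `chunk_words` helper and its `chunk_id` numbering are hypothetical.

```python
def chunk_words(text, chunk_size=250):
    """Split text into consecutive chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        {
            # Assumption: simple 1-based numbering, for illustration only.
            "chunk_id": i // chunk_size + 1,
            "content": " ".join(words[i:i + chunk_size]),
        }
        for i in range(0, len(words), chunk_size)
    ]


if __name__ == "__main__":
    sample = "word " * 600
    for chunk in chunk_words(sample):
        # Print each chunk ID with the number of words it holds.
        print(chunk["chunk_id"], len(chunk["content"].split()))
```

Smaller chunks give more precise retrieval targets but carry less surrounding context, so the cap is a trade-off you might tune for your own documents.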

Generate embeddings

Generate embeddings for the parsed PDF content and then write them to a table:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following statement:

    CREATE OR REPLACE TABLE `bqml_tutorial.embeddings` AS (
      SELECT *, AI.EMBED(content, endpoint => 'text-embedding-005') AS embedding
      FROM bqml_tutorial.parsed_pdf
    );
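If embeddings are new to you, the following toy sketch shows the basic idea: each piece of text becomes a fixed-length vector of numbers, and a distance between two vectors stands in for how related the texts are. This is not how the embedding model works. `toy_embed` is a hypothetical helper that hashes word counts into 16 buckets, whereas a real model returns learned vectors that capture meaning rather than shared words.

```python
import math
import zlib


def toy_embed(text, dims=16):
    """Map text to a fixed-length unit vector using hashed word counts."""
    vector = [0.0] * dims
    for word in text.lower().split():
        # crc32 is stable across runs, unlike Python's built-in hash().
        vector[zlib.crc32(word.encode("utf-8")) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    question = toy_embed("Did the typical family net worth increase?")
    chunk = toy_embed("The net worth of the typical family increased")
    print(len(question), cosine_distance(question, chunk))
```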

Run a vector search against the parsed PDF content

The following query takes text input, creates an embedding for that input by using the AI.EMBED function, and then uses the VECTOR_SEARCH function to match the input embedding with the most similar PDF content embeddings. The results are the top three PDF chunks that are most related to changes in family net worth.

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following SQL statement:

    SELECT distance, base.chunk_id, base.start_page, base.end_page, base.content
    FROM
      VECTOR_SEARCH(
        TABLE `bqml_tutorial.embeddings`,
        'embedding',
        query_value =>
          AI.EMBED(
            'Did the typical family net worth increase? If so, by how much?',
            endpoint => 'text-embedding-005').result,
        top_k => 3,
        OPTIONS => '{"fraction_lists_to_search": 0.01}')
    ORDER BY distance DESC;

    The output is similar to the following:

    +----------+----------+------------+----------+-----------------------------------+
    | distance | chunk_id | start_page | end_page | content                           |
    +----------+----------+------------+----------+-----------------------------------+
    | 0.645685 | 26       | 17         | 18       | 18 Between the first quarter of   |
    |          |          |            |          | 2019 and the first quarter of...  |
    +----------+----------+------------+----------+-----------------------------------+
    | 0.602665 | 30       | 19         | 21       | ## Net Worth by Family            |
    |          |          |            |          | Characteristics...                |
    +----------+----------+------------+----------+-----------------------------------+
    | 0.599438 | 24       | 17         | 21       | # Net Worth                       |
    |          |          |            |          | The net improvements in...        |
    +----------+----------+------------+----------+-----------------------------------+
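Conceptually, a vector search without an index compares the query embedding with every stored embedding and keeps the closest matches. The following brute-force sketch illustrates that idea with hand-written two-dimensional vectors. It assumes Euclidean distance, so check which distance type your own query uses. The `top_k_search` helper is hypothetical, and its `distance` and `base` keys are named after the columns in the query output only for readability.

```python
import heapq
import math


def top_k_search(rows, query_embedding, top_k=3):
    """Return the top_k rows closest to query_embedding, nearest first."""
    scored = (
        {"distance": math.dist(row["embedding"], query_embedding), "base": row}
        for row in rows
    )
    # nsmallest keeps only the top_k smallest distances, sorted ascending.
    return heapq.nsmallest(top_k, scored, key=lambda match: match["distance"])


if __name__ == "__main__":
    # Hand-written 2-D vectors stand in for real embeddings.
    rows = [
        {"chunk_id": 1, "embedding": [0.0, 1.0]},
        {"chunk_id": 2, "embedding": [1.0, 0.0]},
        {"chunk_id": 3, "embedding": [0.9, 0.1]},
        {"chunk_id": 4, "embedding": [-1.0, 0.0]},
    ]
    for match in top_k_search(rows, [1.0, 0.0], top_k=3):
        print(match["base"]["chunk_id"], round(match["distance"], 3))
```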
    

Generate text augmented by vector search results

Perform a vector search on the embeddings to identify semantically similar PDF content, and then use the AI.GENERATE function with the vector search results to augment the prompt input and improve the text generation results. In this case, the query uses information from the PDF chunks to answer a question about the change in family net worth over the past decade.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following statement:

    SELECT
      AI.GENERATE(
        CONCAT('Did the typical family net worth change? How does this compare to the SCF survey a decade earlier? Be concise and use the following context:',
                STRING_AGG(FORMAT("context: %s", base.content), ',\n')
        )
      ).result AS response
    FROM
      VECTOR_SEARCH(
        TABLE `bqml_tutorial.embeddings`,
        'embedding',
        query_value =>
          AI.EMBED(
            'Did the typical family net worth increase? If so, by how much?',
            endpoint => 'text-embedding-005').result,
        top_k => 3,
        OPTIONS => '{"fraction_lists_to_search": 0.01}');

    The output is similar to the following:

    +-------------------------------------------------------------------------+
    | response                                                                |
    +-------------------------------------------------------------------------+
    | Yes, the typical family net worth changed significantly.                |
    |                                                                         |
    | Real median net worth surged 37% between the 2019 and 2022 SCF surveys. |
    | This contrasts sharply with a decade earlier (2010-2013), when real     |
    | median net worth decreased 2%.                                          |
    +-------------------------------------------------------------------------+
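The prompt assembly in the statement above can be easier to follow outside of SQL. The following Python sketch mirrors the CONCAT, STRING_AGG, and FORMAT calls: each retrieved chunk is prefixed with `context: `, the chunks are joined with a comma and a newline, and the result is appended to the question. The `build_prompt` helper, the shortened question, and the sample chunk strings are hypothetical. No model is called here.

```python
def build_prompt(question, chunks):
    """Append each retrieved chunk to the question as a labeled context line."""
    # Equivalent to STRING_AGG(FORMAT("context: %s", base.content), ',\n').
    context = ",\n".join(f"context: {chunk}" for chunk in chunks)
    # Equivalent to CONCAT(question, context).
    return question + context


if __name__ == "__main__":
    # Placeholder strings stand in for the content of the retrieved chunks.
    retrieved = ["first retrieved chunk", "second retrieved chunk"]
    print(build_prompt("Be concise and use the following context:", retrieved))
```

Printing the assembled prompt like this is a quick way to check exactly what text the model receives before you spend a generation call on it.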
    

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

    Delete a Google Cloud project:

    gcloud projects delete PROJECT_ID

What's next

  • Learn more about the AI.PARSE_DOCUMENT function.
  • Learn more about performing semantic search and RAG.