Analyze multimodal data in Python with BigQuery DataFrames

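This tutorial shows you how to analyze multimodal data in a Python notebook by using BigQuery DataFrames classes and methods.

This tutorial uses the product catalog from the public Cymbal pet store dataset.

To upload a notebook that is already populated with the tasks covered in this tutorial, see BigFrames Multimodal DataFrame.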

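Objectives

Create multimodal DataFrames.
Combine structured and unstructured data in a DataFrame.
Transform images.
Generate text and embeddings based on image data.
Chunk PDFs for further analysis.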

Costs

In this document, you use the following billable components of Google Cloud:

BigQuery: you incur costs for the data that you process in BigQuery.
BigQuery Python UDFs: you incur costs for using BigQuery DataFrames image transformation and chunk PDF methods.
Cloud Storage: you incur costs for the objects stored in Cloud Storage.
Vertex AI: you incur costs for calls to Vertex AI models.

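To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

For more information about pricing, see the pricing page of each product listed above.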

Before you begin

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

Verify that billing is enabled for your Google Cloud project.
Enable the BigQuery, BigQuery Connection, Cloud Storage, and Vertex AI APIs.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

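Required roles

To get the permissions that you need to complete this tutorial, ask your administrator to grant you the following IAM roles:

Create a connection: BigQuery Connection Admin (roles/bigquery.connectionAdmin)
Grant permissions to the connection's service account: Project IAM Admin (roles/resourcemanager.projectIamAdmin)
Create a Cloud Storage bucket: Storage Admin (roles/storage.admin)
Run BigQuery jobs: BigQuery User (roles/bigquery.user)
Create and call Python UDFs: BigQuery Data Editor (roles/bigquery.dataEditor)
Create URLs that let you read and modify Cloud Storage objects: BigQuery ObjectRef Admin (roles/bigquery.objectRefAdmin)
Use notebooks:
- BigQuery Read Session User (roles/bigquery.readSessionUser)
- Notebook Runtime User (roles/aiplatform.notebookRuntimeUser)
- Code Creator (roles/dataform.codeCreator)

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.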

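Set up

In this section, you create the Cloud Storage bucket, connection, and notebook used in this tutorial.

Create a bucket

Create a Cloud Storage bucket for storing transformed objects:

In the Google Cloud console, go to the Buckets page.
Click Create.
On the Create a bucket page, in the Get started section, enter a globally unique name that meets the bucket name requirements.
Click Create.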

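Create a connection

Create a Cloud resource connection and get the connection's service account. BigQuery uses the connection to access objects in Cloud Storage.

Go to the BigQuery page.
In the Explorer pane, click Add data. The Add data dialog opens.
In the Filter By pane, in the Data Source Type section, select Business Applications. Alternatively, you can enter Vertex AI in the Search for data sources field.
In the Featured data sources section, click Vertex AI.
Click the Vertex AI Models: BigQuery Federation solution card.
In the Connection type list, select Vertex AI remote models, remote functions, BigLake and Spanner (Cloud Resource).
In the Connection ID field, type bigframes-default-connection.
Click Create connection.
Click Go to connection.
In the Connection info pane, copy the service account ID for use in a later step.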

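Grant permissions to the connection's service account

Grant the connection's service account the roles that it needs to access Cloud Storage and Vertex AI. You must grant these roles in the same project that you created or selected in the Before you begin section.

To grant the roles, follow these steps:

Go to the IAM & Admin page.
Click Grant access.
In the New principals field, enter the service account ID that you copied earlier.
In the Select a role field, choose Cloud Storage, and then select Storage Object User.
Click Add another role.
In the Select a role field, choose Vertex AI, and then select Vertex AI User.
Click Save.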
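
The console steps above can also be scripted. The following sketch only builds the equivalent gcloud commands as strings so that you can review them before running them yourself; nothing is executed. It assumes the gcloud CLI's projects add-iam-policy-binding command and assumes that roles/storage.objectUser and roles/aiplatform.user are the role IDs behind the Storage Object User and Vertex AI User console names. The function name, project ID, and service account email are hypothetical placeholders.

```python
import shlex
from typing import List

# Role IDs assumed to match the console role names used in the steps above.
CONNECTION_ROLES = [
    "roles/storage.objectUser",  # Storage Object User
    "roles/aiplatform.user",     # Vertex AI User
]


def build_grant_commands(project_id: str, service_account: str) -> List[str]:
    """Return one gcloud command string per role; nothing is executed."""
    commands = []
    for role in CONNECTION_ROLES:
        args = [
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member=serviceAccount:{service_account}",
            f"--role={role}",
        ]
        # shlex.join quotes arguments safely for a POSIX shell.
        commands.append(shlex.join(args))
    return commands


# Hypothetical placeholder values; replace them with your own.
commands = build_grant_commands(
    "my-project", "connection-sa@example.iam.gserviceaccount.com"
)
for command in commands:
    print(command)
```

Printing the commands instead of running them keeps the sketch free of side effects; paste each line into a shell only after you have checked the project and service account values.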

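Create a notebook

Create a notebook where you can run Python code:

Go to the BigQuery page.
In the tab bar of the editor pane, click the drop-down arrow next to SQL query, and then click Notebook.
In the Start with a template pane, click Close.
Click Connect > Connect to a runtime.
If you have an existing runtime, accept the default settings and click Connect. If you don't have an existing runtime, select Create new Runtime, and then click Connect.

It might take several minutes for the runtime to get set up.

Create a multimodal DataFrame

Create a multimodal DataFrame that integrates structured and unstructured data by using the from_glob_path method of the Session class:

In the notebook, create a code cell and copy the following code into it: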

import bigframes

# Flag that controls the preview width of images and videos
bigframes.options.display.blob_display_width = 300

import bigframes.pandas as bpd

# Create blob columns from wildcard path.
df_image = bpd.from_glob_path(
    "gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/*", name="image"
)
# Other ways are: from string uri column
# df = bpd.DataFrame({"uri": ["gs://<my_bucket>/<my_file_0>", "gs://<my_bucket>/<my_file_1>"]})
# df["blob_col"] = df["uri"].str.to_blob()

# From an existing object table
# df = bpd.read_gbq_object_table("<my_object_table>", name="blob_col")

# Keep only the first 5 images. Preview the content of the multimodal DataFrame
df_image = df_image.head(5)
df_image

Click Run.

The final call to df_image returns the images that have been added to the DataFrame. Alternatively, you can call the .display method.
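
To build intuition for what the wildcard path selects, the following sketch matches a glob pattern against a hypothetical list of object URIs by using Python's standard fnmatch module. The bucket and file names are made up, and the wildcard matching that BigQuery performs may differ in its details; the point is only that each matching object corresponds to one row of the DataFrame.

```python
from fnmatch import fnmatchcase

# Hypothetical object URIs; the real bucket contents are not listed in this tutorial.
object_uris = [
    "gs://example-bucket/images/cat_toy.png",
    "gs://example-bucket/images/dog_bed.png",
    "gs://example-bucket/documents/manual.pdf",
]

pattern = "gs://example-bucket/images/*"

# Each URI that matches the pattern would correspond to one row.
matched = [uri for uri in object_uris if fnmatchcase(uri, pattern)]
print(matched)
```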

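Combine structured and unstructured data in the DataFrame

Combine text and image data in the multimodal DataFrame:

In the notebook, create a code cell and copy the following code into it: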

# Combine unstructured data with structured data
df_image["author"] = ["alice", "bob", "bob", "alice", "bob"]  # type: ignore
df_image["content_type"] = df_image["image"].blob.content_type()
df_image["size"] = df_image["image"].blob.size()
df_image["updated"] = df_image["image"].blob.updated()
df_image

Click Run.

The code returns the DataFrame data.

In the notebook, create a code cell and copy the following code into it:

# Filter the images and display them. You can also display audio and video types.
# Use the width/height parameters to constrain the window size.
df_image[df_image["author"] == "alice"]["image"].blob.display()

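Click Run.

The code returns the images from the DataFrame whose author column value is alice.

Perform image transformations

Transform image data by using the image_blur, image_resize, and image_normalize methods of the Series.BlobAccessor class.

The transformed images are written to Cloud Storage.

Transform images:

In the notebook, create a code cell and copy the following code into it: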

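# Placeholder value: replace gs://mybucket with the bucket that you created.
dst_bucket = "gs://mybucket"

df_image["blurred"] = df_image["image"].blob.image_blur(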
    (20, 20), dst=f"{dst_bucket}/image_blur_transformed/", engine="opencv"
)
df_image["resized"] = df_image["image"].blob.image_resize(
    (300, 200), dst=f"{dst_bucket}/image_resize_transformed/", engine="opencv"
)
df_image["normalized"] = df_image["image"].blob.image_normalize(
    alpha=50.0,
    beta=150.0,
    norm_type="minmax",
    dst=f"{dst_bucket}/image_normalize_transformed/",
    engine="opencv",
)

# You can also chain functions together
df_image["blur_resized"] = df_image["blurred"].blob.image_resize(
    (300, 200), dst=f"{dst_bucket}/image_blur_resize_transformed/", engine="opencv"
)
df_image

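Set every {dst_bucket} reference to the bucket that you created, in the format gs://mybucket.
Click Run.

The code returns the original images along with all of their transformations.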
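
As a rough illustration of what min-max normalization with alpha=50.0 and beta=150.0 does, the following sketch rescales a handful of pixel intensities in plain Python so that the smallest value maps to alpha and the largest maps to beta. It assumes the conventional min-max formula; the function name and sample values are hypothetical, and the actual transformation runs in BigQuery on whole images.

```python
def minmax_normalize(values, alpha, beta):
    """Linearly rescale values so that min(values) -> alpha and max(values) -> beta."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant input has no range to stretch; map everything to alpha.
        return [float(alpha) for _ in values]
    scale = (beta - alpha) / (hi - lo)
    return [alpha + (v - lo) * scale for v in values]


pixels = [0, 64, 128, 255]  # hypothetical 8-bit intensities
normalized = minmax_normalize(pixels, alpha=50.0, beta=150.0)
print(normalized)
```

Every output value falls between alpha and beta, and the relative ordering of the inputs is preserved.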

Generate text

Generate text from multimodal data by using the predict method of the GeminiTextGenerator class:

In the notebook, create a code cell and copy the following code into it:

from bigframes.ml import llm

gemini = llm.GeminiTextGenerator(model_name="gemini-2.0-flash-001")

# Deal with first 2 images as example
df_image = df_image.head(2)

# Ask the same question about each image
answer = gemini.predict(df_image, prompt=["what item is it?", df_image["image"]])
answer[["ml_generate_text_llm_result", "image"]]

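Click Run.

The code returns the first two images in df_image, along with the text generated in response to the question what item is it? for both images.

In the notebook, create a code cell and copy the following code into it: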

# Ask different questions
df_image["question"] = [  # type: ignore
    "what item is it?",
    "what color is the picture?",
]
answer_alt = gemini.predict(
    df_image, prompt=[df_image["question"], df_image["image"]]
)
answer_alt[["ml_generate_text_llm_result", "image"]]

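Click Run.

The code returns the first two images in df_image, with the text generated in response to the question what item is it? for the first image and the text generated in response to the question what color is the picture? for the second image.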
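
One way to picture how the prompt list is applied: a string literal is reused for every row, while a column contributes that row's own value. The sketch below mimics this with plain Python lists. The function name and the stand-in image URIs are hypothetical, and the code neither calls Gemini nor reproduces the library's internal prompt format.

```python
def expand_prompt(parts, row_count):
    """Build one prompt tuple per row from string literals and per-row columns."""
    rows = []
    for i in range(row_count):
        row = []
        for part in parts:
            if isinstance(part, str):
                row.append(part)     # literal: the same for every row
            else:
                row.append(part[i])  # column: the value for row i
        rows.append(tuple(row))
    return rows


# Stand-in values for the image column and the question column.
images = ["gs://example-bucket/a.png", "gs://example-bucket/b.png"]
questions = ["what item is it?", "what color is the picture?"]

same_question = expand_prompt(["what item is it?", images], row_count=2)
different_questions = expand_prompt([questions, images], row_count=2)
print(same_question)
print(different_questions)
```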

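Generate embeddings

Generate embeddings for multimodal data by using the predict method of the MultimodalEmbeddingGenerator class:

In the notebook, create a code cell and copy the following code into it: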

# Generate embeddings on images
embed_model = llm.MultimodalEmbeddingGenerator()
embeddings = embed_model.predict(df_image["image"])
embeddings

Click Run.

The code returns the embeddings generated by the call to the embedding model.
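
Embeddings are typically compared with a similarity measure. As a minimal sketch that is not part of this tutorial's workflow, the following code computes the cosine similarity between short made-up vectors. Real embeddings returned by the model are much longer, and the values here are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for image embeddings.
vec_a = [1.0, 0.0, 1.0]
vec_b = [1.0, 0.0, 1.0]
vec_c = [0.0, 1.0, 0.0]

print(cosine_similarity(vec_a, vec_b))  # identical direction
print(cosine_similarity(vec_a, vec_c))  # orthogonal vectors
```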

Chunk PDFs

Chunk PDF objects by using the pdf_chunk method of the Series.BlobAccessor class:

In the notebook, create a code cell and copy the following code into it:

# PDF chunking
df_pdf = bpd.from_glob_path(
    "gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/documents/*", name="pdf"
)
df_pdf["chunked"] = df_pdf["pdf"].blob.pdf_chunk(engine="pypdf")
chunked = df_pdf["chunked"].explode()
chunked

Click Run.

The code returns the chunked PDF data.
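
Conceptually, chunking splits a document's text into pieces, and explode() then turns each list of pieces into separate rows. The following sketch imitates both steps on plain strings. The function name, the chunk_size and overlap_size parameters, their values, and the file names are assumptions chosen for illustration, and the splitting logic may not match how pdf_chunk divides text.

```python
def chunk_text(text, chunk_size, overlap_size):
    """Split text into chunks of at most chunk_size characters,
    where consecutive chunks share overlap_size characters."""
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap_size
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# Hypothetical documents, represented by their extracted text.
documents = {"care_guide.pdf": "abcdefghij", "faq.pdf": "xyz"}

# One list of chunks per document, similar to the "chunked" column.
chunked = {
    name: chunk_text(text, chunk_size=4, overlap_size=1)
    for name, text in documents.items()
}

# Flatten to one (document, chunk) pair per row, similar to explode().
exploded = [(name, chunk) for name, chunks in chunked.items() for chunk in chunks]
print(exploded)
```

Overlapping chunks keep some shared context at each boundary, which can help when the chunks are later embedded or summarized independently.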

Clean up

In the Google Cloud console, go to the Manage resources page.
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.