使用向量搜尋功能搜尋嵌入
本教學課程說明如何使用 VECTOR_SEARCH 函式和選用的向量索引,對儲存在 BigQuery 資料表中的嵌入項目執行相似度搜尋。
搭配向量索引使用 VECTOR_SEARCH 時,VECTOR_SEARCH 會採用近似最鄰近項目方法提升向量搜尋成效,但會降低召回率,因此傳回的結果會更近似。如果沒有向量索引,VECTOR_SEARCH 會使用暴力搜尋來測量每筆記錄的距離。
所需權限
如要執行本教學課程,您需要下列 Identity and Access Management (IAM) 權限:
- 如要建立資料集,您必須具備 bigquery.datasets.create權限。
- 如要建立資料表,您必須具備下列權限: - bigquery.tables.create
- bigquery.tables.updateData
- bigquery.jobs.create
 
- 如要建立向量索引,您必須具備要建立索引的資料表 - bigquery.tables.createIndex權限。
- 如要捨棄向量索引,您必須具備要捨棄索引的資料表 - bigquery.tables.deleteIndex權限。
下列每個預先定義的 IAM 角色都包含使用向量索引所需的權限:
- BigQuery 資料擁有者 (roles/bigquery.dataOwner)
- BigQuery 資料編輯者 (roles/bigquery.dataEditor)
費用
VECTOR_SEARCH 函式會採用 BigQuery 運算定價。系統會根據以量計價或版本定價,向您收取相似性搜尋費用。
- 以量計價:系統會根據掃描的位元組數向您收費,包括基本資料表、索引和搜尋查詢。
- 方案價格:系統會根據預留方案中完成工作所需的運算單元向您收費。如果相似度計算的規模較大或較為複雜,就會產生更多費用。 
詳情請參閱 BigQuery 計價方式一文。
事前準備
- 
  
   
   
     
   
  
   
   
     
   
  
 
   
 
 
  
    
    
      In the Google Cloud console, on the project selector page, select or create a Google Cloud project. Roles required to select or create a project - Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- 
      Create a project: To create a project, you need the Project Creator
      (roles/resourcemanager.projectCreator), which contains theresourcemanager.projects.createpermission. Learn how to grant roles.
 
- 
  
   
   
  
   
   
  
 
  
    Verify that billing is enabled for your Google Cloud project. 
- 
  
   
   
  
   
   
  
 
 
  
  
    
      Enable the BigQuery API. Roles required to enable APIs To enable APIs, you need the Service Usage Admin IAM role ( roles/serviceusage.serviceUsageAdmin), which contains theserviceusage.services.enablepermission. Learn how to grant roles.
建立資料集
建立 BigQuery 資料集
- 前往 Google Cloud 控制台的「BigQuery」頁面。 
- 在「Explorer」窗格中,按一下專案名稱。 
- 依序點按 「View actions」(查看動作) >「Create dataset」(建立資料集)。  
- 在「建立資料集」頁面中,執行下列操作: - 在「Dataset ID」(資料集 ID) 中輸入 - vector_search。
- 針對「Location type」(位置類型) 選取「Multi-region」(多區域),然後選取「US (multiple regions in United States)」(us (多個美國區域))。 - 公開資料集儲存在 - US多地區。為簡單起見,請將資料集存放在相同位置。
- 其餘設定保留預設值,然後點選「建立資料集」。 
 
建立測試資料表
- 根據 Google 專利公開資料集的子集,建立包含專利嵌入的 - patents資料表:- CREATE TABLE vector_search.patents AS SELECT * FROM `patents-public-data.google_patents_research.publications` WHERE ARRAY_LENGTH(embedding_v1) > 0 AND publication_number NOT IN ('KR-20180122872-A') LIMIT 1000000; 
- 建立 - patents2資料表,其中包含專利嵌入內容,可找出最近鄰:- CREATE TABLE vector_search.patents2 AS SELECT * FROM `patents-public-data.google_patents_research.publications` WHERE publication_number = 'KR-20180122872-A'; 
建立向量索引
- 在 - patents資料表的- embeddings_v1欄上建立- my_index向量索引:- CREATE OR REPLACE VECTOR INDEX my_index ON vector_search.patents(embedding_v1) STORING(publication_number, title) OPTIONS(distance_type='COSINE', index_type='IVF'); 
- 等待幾分鐘,讓系統建立向量索引,然後執行下列查詢,並確認 - coverage_percentage值為- 100:- SELECT * FROM vector_search.INFORMATION_SCHEMA.VECTOR_INDEXES; 
使用含索引的 VECTOR_SEARCH 函式
建立並填入向量索引後,請使用 VECTOR_SEARCH 函式,在 patents2 資料表的 embedding_v1 欄中,找出嵌入項目的最鄰近項目。這項查詢會使用搜尋中的向量索引,因此 VECTOR_SEARCH 會使用近似最鄰近方法,找出嵌入項目的最鄰近項目。
使用含有索引的 VECTOR_SEARCH 函式:
SELECT query.publication_number AS query_publication_number, query.title AS query_title, base.publication_number AS base_publication_number, base.title AS base_title, distance FROM VECTOR_SEARCH( TABLE vector_search.patents, 'embedding_v1', TABLE vector_search.patents2, top_k => 5, distance_type => 'COSINE', options => '{"fraction_lists_to_search": 0.005}');
結果類似下方:
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+ | query_publication_number | query_title | base_publication_number | base_title | distance | +--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+ | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-106599080-B | A kind of rapid generation for keeping away big vast transfer figure based on GIS | 0.14471956347590609 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-114118544-A | Urban waterlogging detection method and device | 0.17472108931171348 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | KR-20200048143-A | Method and system for mornitoring dry stream using unmanned aerial vehicle | 0.17561990745619782 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | KR-101721695-B1 | Urban Climate Impact Assessment method of Reflecting Urban Planning Scenarios and Analysis System using the same | 0.17696129365559843 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-109000731-B | The experimental rig and method that research inlet for stom water chocking-up degree influences water discharged amount | 0.17902723269642917 | +--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
使用 VECTOR_SEARCH 函式進行暴力破解
使用 VECTOR_SEARCH 函式,找出 patents2 資料表中 embedding_v1 欄位內嵌入項目的最鄰近項目。這項查詢不會在搜尋中使用向量索引,因此 VECTOR_SEARCH 會找出嵌入項目的精確最鄰近項目。
SELECT query.publication_number AS query_publication_number, query.title AS query_title, base.publication_number AS base_publication_number, base.title AS base_title, distance FROM VECTOR_SEARCH( TABLE vector_search.patents, 'embedding_v1', TABLE vector_search.patents2, top_k => 5, distance_type => 'COSINE', options => '{"use_brute_force":true}');
結果類似下方:
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+ | query_publication_number | query_title | base_publication_number | base_title | distance | +--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+ | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-106599080-B | A kind of rapid generation for keeping away big vast transfer figure based on GIS | 0.1447195634759062 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-114118544-A | Urban waterlogging detection method and device | 0.1747210893117136 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | KR-20200048143-A | Method and system for mornitoring dry stream using unmanned aerial vehicle | 0.17561990745619782 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | KR-101721695-B1 | Urban Climate Impact Assessment method of Reflecting Urban Planning Scenarios and Analysis System using the same | 0.17696129365559843 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-109000731-B | The experimental rig and method that research inlet for stom water chocking-up degree influences water discharged amount | 0.17902723269642928 | +--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
評估召回
使用索引執行向量搜尋時,系統會傳回近似結果,但會降低召回率。您可以比較向量搜尋 (使用索引) 和向量搜尋 (使用暴力搜尋) 傳回的結果,藉此計算召回率。在這個資料集中,publication_number 值可做為專利的專屬 ID,因此用於比較。
WITH approx_results AS ( SELECT query.publication_number AS query_publication_number, base.publication_number AS base_publication_number FROM VECTOR_SEARCH( TABLE vector_search.patents, 'embedding_v1', TABLE vector_search.patents2, top_k => 5, distance_type => 'COSINE', options => '{"fraction_lists_to_search": 0.005}') ), exact_results AS ( SELECT query.publication_number AS query_publication_number, base.publication_number AS base_publication_number FROM VECTOR_SEARCH( TABLE vector_search.patents, 'embedding_v1', TABLE vector_search.patents2, top_k => 5, distance_type => 'COSINE', options => '{"use_brute_force":true}') ) SELECT a.query_publication_number, SUM(CASE WHEN a.base_publication_number = e.base_publication_number THEN 1 ELSE 0 END) / 5 AS recall FROM exact_results e LEFT JOIN approx_results a ON e.query_publication_number = a.query_publication_number GROUP BY a.query_publication_number
如果召回率低於預期,可以提高 fraction_lists_to_search 值,但可能會導致延遲時間和資源用量增加。如要調整向量搜尋,您可以嘗試多次執行 VECTOR_SEARCH,使用不同的引數值,將結果儲存至表格,然後比較結果。
清除所用資源
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.