Migrate from Vector Search 1.0

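To simplify the transition from Vector Search 1.0, the ImportDataObjects API can detect Vector Search 1.0 JSON data and convert it during import.

Migration involves three key steps:

  1. Create a collection with a matching schema. You must create the collection before you import. Its data schema must be structured to accommodate the converted Vector Search 1.0 data.

  2. Start the import. Call the ImportDataObjects API, specify the Cloud Storage location of your Vector Search 1.0 data, and enable the conversion flag detect_and_convert_vs1_json.

  3. Understand the data conversion. Learn how Vector Search 1.0 data fields map to the new data object structure.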

Create a collection

First, create a collection whose data schema matches the structure of your Vector Search 1.0 data.

curl -X POST \
  'https://vectorsearch.googleapis.com/v1alpha/projects/PROJECT_ID/locations/LOCATION/collections?collection_id=movies' \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H 'Content-Type: application/json' \
  -d '{
    "data_schema": {
      "$schema": "http://json-schema.org/draft-07/schema#",
      "type": "object",
      "properties": {
        "restricts": {
          "type": "object",
          "properties": {
            "genres": {
              "type": "array",
              "items": {
                "type": "string"
              }
            },
            "director": {
              "type": "array",
              "items": {
                "type": "string"
              }
            }
          }
        },
        "restricts_deny": {
          "type": "object",
          "properties": {
            "genres": {
              "type": "array",
              "items": {
                "type": "string"
              }
            }
          }
        },
        "numeric_restricts": {
          "type": "object",
          "properties": {
            "year": {
              "type": "integer"
            },
            "imdb_rating": {
              "type": "number",
              "format": "float"
            }
          }
        },
        "embedding_metadata": {
          "type": "object",
          "properties": {
            "plot": {
              "type": "string"
            },
            "customers_review_summary": {
              "type": "string"
            },
            "critics_review_summary": {
              "type": "string"
            }
          }
        }
      }
    },
    "vector_schema": {
      "embedding": {
        "dense_vector": {
          "dimensions": 768
        }
      },
      "sparse_embedding": {
        "sparse_vector": {}
      }
    }
  }'
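
If your Vector Search 1.0 data uses many namespaces, writing the data_schema by hand can be error-prone. The following Python sketch derives a data_schema with the same shape as the request above from sample records. The helper names (infer_data_schema, _string_array) are hypothetical, and the sketch assumes that every embedding_metadata value is a string and that numeric restricts use only value_int or value_float, as in the example data on this page. It does not generate the vector_schema.

```python
import json


def _string_array():
    # Each restrict namespace becomes an array-of-strings property.
    return {"type": "array", "items": {"type": "string"}}


def infer_data_schema(records):
    """Build a data_schema dict from Vector Search 1.0 records (illustrative only)."""
    allow, deny, numeric, metadata = {}, {}, {}, {}
    for rec in records:
        for r in rec.get("restricts", []):
            if r.get("allow"):
                allow[r["namespace"]] = _string_array()
            if r.get("deny"):
                deny[r["namespace"]] = _string_array()
        for n in rec.get("numeric_restricts", []):
            if "value_int" in n:
                numeric[n["namespace"]] = {"type": "integer"}
            elif "value_float" in n:
                numeric[n["namespace"]] = {"type": "number", "format": "float"}
        # Assumption: metadata values are strings, as in the example data.
        for key in rec.get("embedding_metadata", {}):
            metadata[key] = {"type": "string"}
    return {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
            "restricts": {"type": "object", "properties": allow},
            "restricts_deny": {"type": "object", "properties": deny},
            "numeric_restricts": {"type": "object", "properties": numeric},
            "embedding_metadata": {"type": "object", "properties": metadata},
        },
    }


sample = {
    "id": "movie-789",
    "restricts": [
        {"namespace": "genres", "allow": ["science-fiction", "action"], "deny": ["horror"]},
        {"namespace": "director", "allow": ["Christopher Nolan"]},
    ],
    "numeric_restricts": [
        {"namespace": "year", "value_int": 2010},
        {"namespace": "imdb_rating", "value_float": 8.8},
    ],
    "embedding_metadata": {"plot": "..."},
}

schema = infer_data_schema([sample])
print(json.dumps(schema, indent=2))
```

Pass a representative set of records rather than a single one, so that namespaces that appear only in some records are still included in the schema.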

Import Vector Search 1.0 data

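Next, call the ImportDataObjects API on the newly created collection and point it at the Cloud Storage bucket that contains your Vector Search 1.0 data. The request must also enable the detect_and_convert_vs1_json flag so that the data is converted during import; the example request here shows only the Cloud Storage locations.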

curl -X POST \
  "https://vectorsearch.googleapis.com/v1main/projects/PROJECT_ID/locations/LOCATION/collections/COLLECTION_ID:importDataObjects" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
    "gcs_import": {
      "contents_uri": "gs://your-bucket/path/to/your-data.jsonl",
      "error_uri": "gs://your-bucket/path/to/import-errors/"
    }
  }'

Data conversion

During import, Vector Search 1.0 data is converted into Vector Search 2.0 data objects. The following example shows how the fields are mapped.

Vector Search 1.0 Cloud Storage file format

{
    "id": "movie-789",
    "embedding": [-0.23, 0.88, 0.11, ...],
    "sparse_embedding": {"values": [0.1, 0.2], "dimensions": [1, 4]},
    "restricts": [
        {"namespace": "genres", "allow": ["science-fiction", "action"], "deny": ["horror"]},
        {"namespace": "director", "allow": ["Christopher Nolan"]}
    ],
    "numeric_restricts": [
        {"namespace": "year", "value_int": 2010},
        {"namespace": "imdb_rating", "value_float": 8.8}
    ],
    "embedding_metadata": {
        "plot": "...",
        "customers_review_summary": "...",
        "critics_review_summary": "..."
    }
}
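
Before you start an import, it can help to check the JSONL file locally for obvious problems. The following Python sketch checks one line at a time against the record shape shown above. The function name check_vs1_line and the specific checks are assumptions inferred from the example record; they are not the validation rules that the service applies.

```python
import json


def check_vs1_line(line, dense_dimensions=768):
    """Return a list of problems found in one JSONL line (empty if none found)."""
    try:
        rec = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(rec, dict):
        return ["record is not a JSON object"]

    problems = []
    if not isinstance(rec.get("id"), str) or not rec["id"]:
        problems.append("missing or empty 'id'")

    # The dense vector length should match the collection's vector_schema.
    dense = rec.get("embedding")
    if dense is not None and (not isinstance(dense, list) or len(dense) != dense_dimensions):
        problems.append(f"'embedding' is not a list of {dense_dimensions} values")

    # Sparse values and their dimension indices must pair up one to one.
    sparse = rec.get("sparse_embedding")
    if isinstance(sparse, dict):
        if len(sparse.get("values", [])) != len(sparse.get("dimensions", [])):
            problems.append("sparse 'values' and 'dimensions' differ in length")

    for field in ("restricts", "numeric_restricts"):
        for entry in rec.get(field, []):
            if "namespace" not in entry:
                problems.append(f"{field} entry without 'namespace'")
    return problems


good = '{"id": "movie-789", "embedding": [-0.23, 0.88, 0.11]}'
bad = '{"id": "movie-790", "sparse_embedding": {"values": [0.1, 0.2], "dimensions": [1]}}'
print(check_vs1_line(good, dense_dimensions=3))
print(check_vs1_line(bad))
```

To check a whole file, call the function on each line and report the line number with any problems it returns.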

Converted Vector Search 2.0 data object

DataObject(
    name="/.../movie-789",
    data={
        "restricts": {
            "genres": ["science-fiction", "action"],
            "director": ["Christopher Nolan"],
        },
        "restricts_deny": {
            "genres": ["horror"]
        },
        "numeric_restricts": {
            "year": 2010,
            "imdb_rating": 8.8,
        },
        "embedding_metadata": {
            "plot": "...",
            "customers_review_summary": "...",
            "critics_review_summary": "...",
        }
    },
    vectors={
        "embedding": {"dense_vector": {"values": [-0.23, 0.88, 0.11, ...]}},
        "sparse_embedding": {"sparse_vector": {"values": [0.1, 0.2], "indices": [1, 4]}},
    }
)
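
The mapping shown in the two examples above can be sketched in Python as follows. The service performs the conversion during import, so this code is only a local illustration of the field mapping, inferred from the example. The function name convert_vs1_record is hypothetical, the result is a plain dictionary rather than a DataObject, and the record ID is returned as-is because the full resource name prefix is not shown on this page.

```python
def convert_vs1_record(rec):
    """Map one Vector Search 1.0 record to the 2.0 data/vectors layout (illustrative only)."""
    allow, deny = {}, {}
    for r in rec.get("restricts", []):
        ns = r["namespace"]
        if r.get("allow"):
            allow[ns] = list(r["allow"])   # allow lists go under "restricts"
        if r.get("deny"):
            deny[ns] = list(r["deny"])     # deny lists go under "restricts_deny"

    numeric = {}
    for n in rec.get("numeric_restricts", []):
        # Assumption: only value_int and value_float occur, as in the example.
        for key in ("value_int", "value_float"):
            if key in n:
                numeric[n["namespace"]] = n[key]

    data = {
        "restricts": allow,
        "restricts_deny": deny,
        "numeric_restricts": numeric,
        "embedding_metadata": dict(rec.get("embedding_metadata", {})),
    }

    vectors = {}
    if "embedding" in rec:
        vectors["embedding"] = {"dense_vector": {"values": list(rec["embedding"])}}
    if "sparse_embedding" in rec:
        sparse = rec["sparse_embedding"]
        # The 1.0 "dimensions" list becomes "indices" in the 2.0 sparse vector.
        vectors["sparse_embedding"] = {
            "sparse_vector": {
                "values": list(sparse["values"]),
                "indices": list(sparse["dimensions"]),
            }
        }
    return {"id": rec["id"], "data": data, "vectors": vectors}


vs1_record = {
    "id": "movie-789",
    "embedding": [-0.23, 0.88, 0.11],
    "sparse_embedding": {"values": [0.1, 0.2], "dimensions": [1, 4]},
    "restricts": [
        {"namespace": "genres", "allow": ["science-fiction", "action"], "deny": ["horror"]},
        {"namespace": "director", "allow": ["Christopher Nolan"]},
    ],
    "numeric_restricts": [
        {"namespace": "year", "value_int": 2010},
        {"namespace": "imdb_rating", "value_float": 8.8},
    ],
    "embedding_metadata": {"plot": "..."},
}

converted = convert_vs1_record(vs1_record)
print(converted["data"]["restricts"])
print(converted["vectors"]["sparse_embedding"])
```

A local conversion like this can be useful for spot-checking imported data objects against what you expect from the source records.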