管理文件

本文說明如何管理 Document AI 倉儲中的文件,包括建立、擷取、更新及刪除作業。

什麼是文件?

文件是 Document AI Warehouse 使用的資料模型,用於整理實體文件 (例如 PDF 或 TXT) 及其相關聯的屬性。您可以透過文件作業與 Document AI 倉儲互動。

支援的檔案類型

雖然 Document AI 倉儲的重點是文件,但也會用於管理相關聯的圖片 (例如在保險、工程、建築和研究等產業)。

  • Ingest API 支援 PDF、TIFF、JPEG 和 PNG 圖片,以及任何屬性或預先擷取的文字。
  • 上傳 UI 支援使用 Document AI OCR 和自訂處理器擷取 PDF。
  • 檢視器 UI 支援以 PDF、文字和 Microsoft Office 檔案格式呈現。

事前準備

開始前,請先完成快速入門頁面。

如要建立文件,如果資料位於您自己的 Cloud Storage bucket,您必須授予 Document AI Warehouse 服務帳戶儲存空間物件檢視者權限,才能讀取資料。

每個文件都由結構定義指定,且屬於某個文件類型。文件結構定義會定義 Document AI 倉儲中的文件結構。 您必須先建立文件結構定義,才能建立文件。

建立文件

如要建立文件,您必須向 Document AI Warehouse 提供原始文件內容。如要提供原始文件位元組內容,可以設定 Document.inline_raw_documentDocument.raw_document_path

兩者差異如下:

  • Document.raw_document_path:建議採用這種做法。並使用要擷取的檔案的 Cloud Storage 路徑 (gs://bucket/object)。請注意,呼叫者必須具備這個物件的讀取權限,呼叫才能成功。
  • Document.inline_raw_document:檔案的位元組/文字表示法,直接提供給端點。

如要建立文件,請按照下列步驟操作:

上傳 Cloud Storage 中的文件

您必須按照必要條件一節所述,授予 Document AI Warehouse 服務帳戶 Cloud Storage bucket 的存取權。

您必須按照操作說明,將檔案上傳至 Cloud Storage bucket。

REST

要求:

curl --location --request POST --url https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documents \
--header "Authorization: Bearer $(gcloud auth print-access-token)" \
--header "Content-Type: application/json; charset=utf-8" \
--data '{
"document": {
  "display_name": "TestDoc3",
  "document_schema_name": "projects/PROJECT_NUMBER/locations/LOCATION/documentSchemas/DOCUMENT_SCHEMA_ID",
  "raw_document_path": "gs://BUCKET_URI/FILE_URI",
  "properties": [
    {
      "name": "supplier_name",
      "text_values": {
        "values": "Stanford Plumbing & Heating"
      }
    },
    {
      "name": "total_amount",
      "float_values": {
        "values": "1091.81"
      }
    },
  ]
}, 
"requestMetadata":{
  "userInfo":{
    "id": "user:USER_EMAIL_ID"
  }
}
}'

從本機上傳

REST

要求:

curl --location --request POST --url https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documents/ \
--header "Authorization: Bearer $(gcloud auth print-access-token)" \
--header "Content-Type: application/json; charset=utf-8" \
--data '{
"document": {
  "display_name": "TestDoc3",
  "document_schema_name": "projects/PROJECT_NUMBER/locations/LOCATION/documentSchemas/DOCUMENT_SCHEMA_ID",
  "inline_raw_document": "<bytes>",
  "properties": [
    {
      "name": "supplier_name",
      "text_values": {
        "values": "Stanford Plumbing & Heating"
      }
    },
    {
      "name": "total_amount",
      "float_values": {
        "values": "1091.81"
      }
    },
  ]
},
"requestMetadata": {
  "userInfo": {
    "id": "user:USER_EMAIL_ID"
  }
}
}'

取得文件

document_id 提供:

REST

curl --request POST \
--header "Authorization: Bearer $(gcloud auth print-access-token)" \
--header "Content-Type: application/json; charset=UTF-8" -d '{
  "requestMetadata":{
    "userInfo":{
      "id": "user:USER_EMAIL"
    }
  }
}' \
"https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documents/DOCUMENT_ID:get"

Python

詳情請參閱 Document AI Warehouse Python API 參考文件

如要向 Document AI Warehouse 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。


from google.cloud import contentwarehouse


def sample_get_document(document_name: str, user_id: str) -> contentwarehouse.Document:
    """Gets a document.

    Args:
        document_name: The resource name of the document.
                Format: 'projects/{project_number}/
                locations/{location}/documents/{document_id}'.
        user_id: user:YOUR_SERVICE_ACCOUNT_ID or user:USER_EMAIL.
    Returns:
        Response object
    """
    # Create a client
    client = contentwarehouse.DocumentServiceClient()

    request_metadata = contentwarehouse.RequestMetadata(
        user_info=contentwarehouse.UserInfo(id=user_id)
    )

    # Initialize request argument(s)
    request = contentwarehouse.GetDocumentRequest(
        name=document_name, request_metadata=request_metadata
    )

    # Make the request
    response = client.get_document(request=request)

    return response

Java

詳情請參閱 Document AI Warehouse Java API 參考文件

如要向 Document AI Warehouse 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。


import com.google.cloud.contentwarehouse.v1.Document;
import com.google.cloud.contentwarehouse.v1.DocumentName;
import com.google.cloud.contentwarehouse.v1.DocumentServiceClient;
import com.google.cloud.contentwarehouse.v1.DocumentServiceSettings;
import com.google.cloud.contentwarehouse.v1.GetDocumentRequest;
import com.google.cloud.contentwarehouse.v1.RequestMetadata;
import com.google.cloud.contentwarehouse.v1.UserInfo;
import com.google.cloud.resourcemanager.v3.Project;
import com.google.cloud.resourcemanager.v3.ProjectName;
import com.google.cloud.resourcemanager.v3.ProjectsClient;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeoutException;

public class GetDocument {

  public static void getDocument() throws IOException, 
        InterruptedException, ExecutionException, TimeoutException {
    String projectId = "your-project-id";
    String location = "your-region"; // Format is "us" or "eu".
    String documentId = "your-document-id";
    String userId = "your-user-id"; // Format is user:<user-id>
    getDocument(projectId, location, documentId, userId);
  }

  // Retrieves details about existing Document using the document Id
  public static void getDocument(String projectId, String location, 
        String documentId, String userId) throws IOException, 
            InterruptedException, ExecutionException, TimeoutException {
    String projectNumber = getProjectNumber(projectId);

    String endpoint = "contentwarehouse.googleapis.com:443";
    if (!"us".equals(location)) {
      endpoint = String.format("%s-%s", location, endpoint);
    }
    DocumentServiceSettings documentServiceSettings = 
         DocumentServiceSettings.newBuilder().setEndpoint(endpoint).build(); 

    // Create a Document Service client
    try (DocumentServiceClient documentServiceClient =
        DocumentServiceClient.create(documentServiceSettings)) {
      /* The full resource name of the location, e.g.: 
       projects/{project_number}/location/{location}/documents/{document_id} */
      DocumentName documentName = 
          DocumentName.of(projectNumber, location, documentId);

      // Define Request Metadata for enforcing access control
      RequestMetadata requestMetadata = RequestMetadata.newBuilder()
            .setUserInfo(
            UserInfo.newBuilder()
              .setId(userId).build()).build();

      // Define request to get details of a specific Document Schema
      GetDocumentRequest getDocumentRequest = 
          GetDocumentRequest.newBuilder()
          .setName(documentName.toString())
          .setRequestMetadata(requestMetadata).build();

      // Get details of the Document 
      Document document = documentServiceClient.getDocument(getDocumentRequest);

      System.out.println(document.getName());
    }
  }

  private static String getProjectNumber(String projectId) throws IOException { 
    /* Initialize client that will be used to send requests. 
    * This client only needs to be created once, and can be reused for multiple requests. */
    try (ProjectsClient projectsClient = ProjectsClient.create()) { 
      ProjectName projectName = ProjectName.of(projectId); 
      Project project = projectsClient.getProject(projectName);
      String projectNumber = project.getName(); // Format returned is projects/xxxxxx
      return projectNumber.substring(projectNumber.lastIndexOf("/") + 1);
    } 
  }
}

Node.js

詳情請參閱 Document AI Warehouse Node.js API 參考文件

如要向 Document AI Warehouse 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。


/**
 * TODO(developer): Uncomment these variables before running the sample.
 * const projectNumber = 'YOUR_PROJECT_NUMBER';
 * const location = 'YOUR_PROJECT_LOCATION'; // Format is 'us' or 'eu'
 * documentId = 'YOUR_DOCUMENT_ID';
 * const userId = 'user:xxx@example.com'; // Format is "user:xxx@example.com"
 */

// Import from google cloud
const {DocumentServiceClient} = require('@google-cloud/contentwarehouse').v1;

const apiEndpoint =
  location === 'us'
    ? 'contentwarehouse.googleapis.com'
    : `${location}-contentwarehouse.googleapis.com`;

// Create service client
const serviceClient = new DocumentServiceClient({
  apiEndpoint: apiEndpoint,
});

// Get Document Schema
async function getDocument() {
  // Initialize request argument(s)
  const documentRequest = {
    // The full resource name of the document, e.g.:
    // projects/{project_number}/locations/{location}/documents/{document_id}
    name: serviceClient.projectLocationDocumentPath(
      projectNumber,
      location,
      documentId
    ),
    requestMetadata: {userInfo: {id: userId}},
  };

  // Make Request
  const response = serviceClient.getDocument(documentRequest);

  // Print out response
  response.then(
    result => console.log(`Document Found: ${JSON.stringify(result)}`),
    error => console.log(`${error}`)
  );
}

reference_id 提供:

  curl --request POST \
    --header "Authorization: Bearer $(gcloud auth print-access-token)" \
    --header "Content-Type: application/json; charset=UTF-8" -d '{
      "requestMetadata":{
        "userInfo":{
          "id": "user:USER_EMAIL"
        }
      }
    }' \
    "https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documents/referenceId/REFERENCE_ID:get"

更新文件

document_id 提供:

REST

posix-terminal curl --location --request POST --url https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documents \ --header "Authorization: Bearer $(gcloud auth print-access-token)" \ --header "Content-Type: application/json; charset=utf-8" \ --data '{ "document": { "display_name": "TestDoc3", "document_schema_name": "projects/PROJECT_NUMBER/locations/LOCATION/documentSchemas/DOCUMENT_SCHEMA_ID", "raw_document_path": "gs://BUCKET_URI/FILE_URI", "properties": [ { "name": "supplier_name", "text_values": { "values": "Stanford Plumbing & Heating" } }, { "name": "total_amount", "float_values": { "values": "1091.81" } }, { "name": "invoice_id", "text_values": { "values": "invoiceid" } }, ] }, "requestMetadata": { "userInfo": { "id": "user:USER_EMAIL" } } }'

Python

詳情請參閱 Document AI Warehouse Python API 參考文件

如要向 Document AI Warehouse 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。


from google.cloud import contentwarehouse


def sample_update_document(
    document_name: str, document: contentwarehouse.Document, user_id: str
) -> contentwarehouse.CreateDocumentResponse:
    """Updates a document.

    Args:
        document_name: The resource name of the document.
                    Format: 'projects/{project_number}/
                    locations/{location}/documents/{document_id}'.
        document: Document AI Warehouse Document object..
        user_id: user_id: user:YOUR_SERVICE_ACCOUNT_ID or user:USER_EMAIL.
    Returns:
        Response object.
    """
    # Create a client
    client = contentwarehouse.DocumentServiceClient()

    # Update document fields
    # For fields which can be updated, refer
    # https://cloud.google.com/python/docs/reference/contentwarehouse/
    # latest/google.cloud.contentwarehouse_v1.types.Document
    document.display_name = "Updated Order Invoice"

    request_metadata = contentwarehouse.RequestMetadata(
        user_info=contentwarehouse.UserInfo(id=user_id)
    )

    request = contentwarehouse.UpdateDocumentRequest(
        name=document_name, document=document, request_metadata=request_metadata
    )

    # Make the request
    response = client.update_document(request=request)

    return response

Java

詳情請參閱 Document AI Warehouse Java API 參考文件

如要向 Document AI Warehouse 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

import com.google.cloud.contentwarehouse.v1.Document;
import com.google.cloud.contentwarehouse.v1.DocumentName;
import com.google.cloud.contentwarehouse.v1.DocumentServiceClient;
import com.google.cloud.contentwarehouse.v1.DocumentServiceSettings;
import com.google.cloud.contentwarehouse.v1.GetDocumentRequest;
import com.google.cloud.contentwarehouse.v1.RequestMetadata;
import com.google.cloud.contentwarehouse.v1.UpdateDocumentRequest;
import com.google.cloud.contentwarehouse.v1.UpdateDocumentResponse;
import com.google.cloud.contentwarehouse.v1.UserInfo;
import com.google.cloud.resourcemanager.v3.Project;
import com.google.cloud.resourcemanager.v3.ProjectName;
import com.google.cloud.resourcemanager.v3.ProjectsClient;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeoutException;

public class UpdateDocument {
  public static void updateDocument() throws IOException, 
        InterruptedException, ExecutionException, TimeoutException { 
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String location = "your-region"; // Format is "us" or "eu".
    String documentId = "your-document-id";
    String userId = "your-user-id"; // Format is user:<user-id>
    /* The below method call retrieves details about the document you are about to update.
    * It is important to note that some properties cannot be edited or removed. 
    * For more information on managing documents, please see the below documentation.
    * https://cloud.google.com/document-warehouse/docs/manage-documents */
    GetDocument.getDocument(projectId, location, documentId, userId);
    updateDocument(projectId, location, documentId, userId);
  }

  // Updates an existing Document
  public static void updateDocument(String projectId, String location,
        String documentId, String userId) throws IOException, InterruptedException,
          ExecutionException, TimeoutException { 
    String projectNumber = getProjectNumber(projectId);

    String endpoint = "contentwarehouse.googleapis.com:443";
    if (!"us".equals(location)) {
      endpoint = String.format("%s-%s", location, endpoint);
    }

    DocumentServiceSettings documentServiceSettings = 
             DocumentServiceSettings.newBuilder().setEndpoint(endpoint).build(); 

    /* Create the Document Service Client 
     * Initialize client that will be used to send requests. 
     * This client only needs to be created once, and can be reused for multiple requests. */
    try (DocumentServiceClient documentServiceClient = 
            DocumentServiceClient.create(documentServiceSettings)) {

      /* The full resource name of the location, e.g.: 
       projects/{project_number}/location/{location}/documentSchemas/{document_schema_id} */
      DocumentName documentName = 
          DocumentName.of(projectNumber, location, documentId);

      // Define RequestMetadata object for context of the user making the API call
      RequestMetadata requestMetadata = RequestMetadata.newBuilder()
          .setUserInfo(
          UserInfo.newBuilder()
            .setId(userId).build()).build();

      // Get the document to retreive the document schema associated with the object
      GetDocumentRequest getDocumentRequest = GetDocumentRequest.newBuilder()
          .setName(documentName.toString())
          .setRequestMetadata(requestMetadata)
          .build(); 

      // Execute the request and store response in a document object
      Document document = documentServiceClient.getDocument(getDocumentRequest);

      // Define the updates to the document that will be passed in the request
      Document updatedDocument = Document.newBuilder()
          .setDisplayName("Updated Document Display Name")
          .setDocumentSchemaName(document.getDocumentSchemaName()).build();

      // Create the request to Update the Document
      UpdateDocumentRequest updateDocumentRequest = 
            UpdateDocumentRequest.newBuilder()
              .setName(documentName.toString())
              .setDocument(updatedDocument)
              .setRequestMetadata(requestMetadata)
              .build();

      // Update Document and receive response
      UpdateDocumentResponse updateDocumentResponse = 
          documentServiceClient.updateDocument(updateDocumentRequest);

      // Read the output of Updated Document Name
      System.out.println(updateDocumentResponse.getDocument().getDisplayName());
    }
  }

  private static String getProjectNumber(String projectId) throws IOException { 
    /* Initialize client that will be used to send requests. 
    * This client only needs to be created once, and can be reused for multiple requests. */
    try (ProjectsClient projectsClient = ProjectsClient.create()) { 
      ProjectName projectName = ProjectName.of(projectId); 
      Project project = projectsClient.getProject(projectName);
      String projectNumber = project.getName(); // Format returned is projects/xxxxxx
      return projectNumber.substring(projectNumber.lastIndexOf("/") + 1);
    } 
  }
}

reference_id 提供:

  curl --location --request POST --url https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documents \
--header "Authorization: Bearer $(gcloud auth print-access-token)" \
--header "Content-Type: application/json; charset=utf-8" \
--data '{
  "document": {
    "display_name": "TestDoc3",
    "document_schema_name": "projects/PROJECT_NUMBER/locations/LOCATION/documentSchemas/referenceId/REFERENCE_ID",
    "raw_document_path": "gs://BUCKET_URI/FILE_URI",
    "properties": [
      {
        "name": "supplier_name",
        "text_values": {
          "values": "Stanford Plumbing & Heating"
        }
      },
      {
        "name": "total_amount",
        "float_values": {
          "values": "1091.81"
        }
      },
      {
        "name": "invoice_id",
        "text_values": {
          "values": "invoiceid"
        }
      },
    ]
  },
  "requestMetadata": {
    "userInfo": {
      "id": "user:USER_EMAIL"
    }
  }
}'

刪除文件

REST

document_id 提供:

curl --request POST \
  --header "Authorization: Bearer $(gcloud auth print-access-token)" \
  --header "Content-Type: application/json; charset=UTF-8" -d '{
    "requestMetadata":{
      "userInfo":{
        "id": "user:USER_EMAIL"
      }
    }
  }' \
  "https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documents/DOCUMENT_ID:delete"

reference_id 提供:

curl --request POST \
  --header "Authorization: Bearer $(gcloud auth print-access-token)" \
  --header "Content-Type: application/json; charset=UTF-8" -d '{
    "requestMetadata":{
      "userInfo":{
        "id": "user:USER_EMAIL"
      }
    }
  }' \
  "https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documents/referenceId/REFERENCE_ID":delete"

後續步驟