Gerenciar documentos

Este documento descreve como gerenciar os documentos no Document AI Warehouse, incluindo operações de criação, busca, atualização e exclusão.

O que são documentos

Um documento é o modelo de dados usado no Document AI Warehouse para organizar um documento real (por exemplo, PDF ou TXT) e as propriedades associadas a ele. Você interage com o Document AI Warehouse por operações nos documentos.

Tipos de arquivo com suporte

Embora o foco do Document AI Warehouse sejam os documentos, ele também é usado para gerenciar imagens associadas (por exemplo, em setores como seguros, engenharia, construção e pesquisa).

  • A API Ingest oferece suporte a PDFs e imagens TIFF, JPEG e PNG, além de propriedades ou texto pré-extraído.
  • A interface de upload oferece suporte à extração de PDFs usando o OCR da Document AI e processadores personalizados.
  • A interface do visualizador oferece suporte à renderização em arquivos PDF, de texto e do Microsoft Office.

Antes de começar

Antes de começar, conclua a página de início rápido.

Para a criação de documentos, se os dados estiverem no seu próprio bucket do Cloud Storage, você precisará conceder à conta de serviço do Document AI Warehouse a permissão de leitor de objetos do Storage para ler seus dados.

Cada documento é especificado por um esquema e pertence a um tipo de documento. Um esquema de documento define a estrutura do documento no Document AI Warehouse. Antes de criar documentos, você precisa criar um esquema de documento.

Criar um documento

Para criar um documento, você precisa fornecer conteúdo de documento bruto ao Document AI Warehouse. As duas maneiras de fornecer conteúdo de byte de documento bruto são definindo Document.inline_raw_document ou Document.raw_document_path.

As diferenças são as seguintes:

  • Document.raw_document_path: essa é a abordagem preferencial. Ela usa o caminho do Cloud Storage (gs://bucket/object) do arquivo a ser ingerido. Observação: o autor da chamada precisa ter permissão de leitura nesse objeto para que a chamada seja bem-sucedida.
  • Document.inline_raw_document: representação de byte/texto do arquivo, fornecida diretamente ao endpoint.

Para criar um documento, faça o seguinte:

Fazer upload de um documento do Cloud Storage

Você precisa conceder à conta de serviço do Document AI Warehouse acesso ao bucket do Cloud Storage, conforme descrito na seção de pré-requisitos.

Você precisa fazer upload do arquivo para um bucket do Cloud Storage, seguindo as instruções.

REST

Solicitação:

curl --location --request POST --url https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documents \
--header "Authorization: Bearer $(gcloud auth print-access-token)" \
--header "Content-Type: application/json; charset=utf-8" \
--data '{
"document": {
  "display_name": "TestDoc3",
  "document_schema_name": "projects/PROJECT_NUMBER/locations/LOCATION/documentSchemas/DOCUMENT_SCHEMA_ID",
  "raw_document_path": "gs://BUCKET_URI/FILE_URI",
  "properties": [
    {
      "name": "supplier_name",
      "text_values": {
        "values": "Stanford Plumbing & Heating"
      }
    },
    {
      "name": "total_amount",
      "float_values": {
        "values": "1091.81"
      }
    },
  ]
}, 
"requestMetadata":{
  "userInfo":{
    "id": "user:USER_EMAIL_ID"
  }
}
}'

Fazer upload de uma máquina local

REST

Solicitação:

curl --location --request POST --url https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documents/ \
--header "Authorization: Bearer $(gcloud auth print-access-token)" \
--header "Content-Type: application/json; charset=utf-8" \
--data '{
"document": {
  "display_name": "TestDoc3",
  "document_schema_name": "projects/PROJECT_NUMBER/locations/LOCATION/documentSchemas/DOCUMENT_SCHEMA_ID",
  "inline_raw_document": "<bytes>",
  "properties": [
    {
      "name": "supplier_name",
      "text_values": {
        "values": "Stanford Plumbing & Heating"
      }
    },
    {
      "name": "total_amount",
      "float_values": {
        "values": "1091.81"
      }
    },
  ]
},
"requestMetadata": {
  "userInfo": {
    "id": "user:USER_EMAIL_ID"
  }
}
}'

Receber um documento

Por document_id:

REST

curl --request POST \
--header "Authorization: Bearer $(gcloud auth print-access-token)" \
--header "Content-Type: application/json; charset=UTF-8" -d '{
  "requestMetadata":{
    "userInfo":{
      "id": "user:USER_EMAIL"
    }
  }
}' \
"https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documents/DOCUMENT_ID:get"

Python

Para mais informações, consulte a Document AI Warehouse Python API documentação de referência.

Para autenticar no Document AI Warehouse, configure o Application Default Credentials. Se quiser mais informações, consulte Configurar a autenticação para um ambiente de desenvolvimento local.


from google.cloud import contentwarehouse


def sample_get_document(document_name: str, user_id: str) -> contentwarehouse.Document:
    """Gets a document.

    Args:
        document_name: The resource name of the document.
                Format: 'projects/{project_number}/
                locations/{location}/documents/{document_id}'.
        user_id: user:YOUR_SERVICE_ACCOUNT_ID or user:USER_EMAIL.
    Returns:
        Response object
    """
    # Create a client
    client = contentwarehouse.DocumentServiceClient()

    request_metadata = contentwarehouse.RequestMetadata(
        user_info=contentwarehouse.UserInfo(id=user_id)
    )

    # Initialize request argument(s)
    request = contentwarehouse.GetDocumentRequest(
        name=document_name, request_metadata=request_metadata
    )

    # Make the request
    response = client.get_document(request=request)

    return response

Java

Para mais informações, consulte a documentação de referência da API Document AI Warehouse Java.

Para autenticar no Document AI Warehouse, configure o Application Default Credentials. Se quiser mais informações, consulte Configurar a autenticação para um ambiente de desenvolvimento local.


import com.google.cloud.contentwarehouse.v1.Document;
import com.google.cloud.contentwarehouse.v1.DocumentName;
import com.google.cloud.contentwarehouse.v1.DocumentServiceClient;
import com.google.cloud.contentwarehouse.v1.DocumentServiceSettings;
import com.google.cloud.contentwarehouse.v1.GetDocumentRequest;
import com.google.cloud.contentwarehouse.v1.RequestMetadata;
import com.google.cloud.contentwarehouse.v1.UserInfo;
import com.google.cloud.resourcemanager.v3.Project;
import com.google.cloud.resourcemanager.v3.ProjectName;
import com.google.cloud.resourcemanager.v3.ProjectsClient;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeoutException;

public class GetDocument {

  public static void getDocument() throws IOException, 
        InterruptedException, ExecutionException, TimeoutException {
    String projectId = "your-project-id";
    String location = "your-region"; // Format is "us" or "eu".
    String documentId = "your-document-id";
    String userId = "your-user-id"; // Format is user:<user-id>
    getDocument(projectId, location, documentId, userId);
  }

  // Retrieves details about existing Document using the document Id
  public static void getDocument(String projectId, String location, 
        String documentId, String userId) throws IOException, 
            InterruptedException, ExecutionException, TimeoutException {
    String projectNumber = getProjectNumber(projectId);

    String endpoint = "contentwarehouse.googleapis.com:443";
    if (!"us".equals(location)) {
      endpoint = String.format("%s-%s", location, endpoint);
    }
    DocumentServiceSettings documentServiceSettings = 
         DocumentServiceSettings.newBuilder().setEndpoint(endpoint).build(); 

    // Create a Document Service client
    try (DocumentServiceClient documentServiceClient =
        DocumentServiceClient.create(documentServiceSettings)) {
      /* The full resource name of the location, e.g.: 
       projects/{project_number}/location/{location}/documents/{document_id} */
      DocumentName documentName = 
          DocumentName.of(projectNumber, location, documentId);

      // Define Request Metadata for enforcing access control
      RequestMetadata requestMetadata = RequestMetadata.newBuilder()
            .setUserInfo(
            UserInfo.newBuilder()
              .setId(userId).build()).build();

      // Define request to get details of a specific Document Schema
      GetDocumentRequest getDocumentRequest = 
          GetDocumentRequest.newBuilder()
          .setName(documentName.toString())
          .setRequestMetadata(requestMetadata).build();

      // Get details of the Document 
      Document document = documentServiceClient.getDocument(getDocumentRequest);

      System.out.println(document.getName());
    }
  }

  private static String getProjectNumber(String projectId) throws IOException { 
    /* Initialize client that will be used to send requests. 
    * This client only needs to be created once, and can be reused for multiple requests. */
    try (ProjectsClient projectsClient = ProjectsClient.create()) { 
      ProjectName projectName = ProjectName.of(projectId); 
      Project project = projectsClient.getProject(projectName);
      String projectNumber = project.getName(); // Format returned is projects/xxxxxx
      return projectNumber.substring(projectNumber.lastIndexOf("/") + 1);
    } 
  }
}

Node.js

Para mais informações, consulte a documentação de referência da API do Document AI Warehouse.Node.js

Para autenticar no Document AI Warehouse, configure o Application Default Credentials. Se quiser mais informações, consulte Configurar a autenticação para um ambiente de desenvolvimento local.


/**
 * TODO(developer): Uncomment these variables before running the sample.
 * const projectNumber = 'YOUR_PROJECT_NUMBER';
 * const location = 'YOUR_PROJECT_LOCATION'; // Format is 'us' or 'eu'
 * documentId = 'YOUR_DOCUMENT_ID';
 * const userId = 'user:xxx@example.com'; // Format is "user:xxx@example.com"
 */

// Import from google cloud
const {DocumentServiceClient} = require('@google-cloud/contentwarehouse').v1;

const apiEndpoint =
  location === 'us'
    ? 'contentwarehouse.googleapis.com'
    : `${location}-contentwarehouse.googleapis.com`;

// Create service client
const serviceClient = new DocumentServiceClient({
  apiEndpoint: apiEndpoint,
});

// Get Document Schema
async function getDocument() {
  // Initialize request argument(s)
  const documentRequest = {
    // The full resource name of the document, e.g.:
    // projects/{project_number}/locations/{location}/documents/{document_id}
    name: serviceClient.projectLocationDocumentPath(
      projectNumber,
      location,
      documentId
    ),
    requestMetadata: {userInfo: {id: userId}},
  };

  // Make Request
  const response = serviceClient.getDocument(documentRequest);

  // Print out response
  response.then(
    result => console.log(`Document Found: ${JSON.stringify(result)}`),
    error => console.log(`${error}`)
  );
}

Por reference_id:

  curl --request POST \
    --header "Authorization: Bearer $(gcloud auth print-access-token)" \
    --header "Content-Type: application/json; charset=UTF-8" -d '{
      "requestMetadata":{
        "userInfo":{
          "id": "user:USER_EMAIL"
        }
      }
    }' \
    "https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documents/referenceId/REFERENCE_ID:get"

Atualizar um documento

Por document_id:

REST

posix-terminal curl --location --request POST --url https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documents \ --header "Authorization: Bearer $(gcloud auth print-access-token)" \ --header "Content-Type: application/json; charset=utf-8" \ --data '{ "document": { "display_name": "TestDoc3", "document_schema_name": "projects/PROJECT_NUMBER/locations/LOCATION/documentSchemas/DOCUMENT_SCHEMA_ID", "raw_document_path": "gs://BUCKET_URI/FILE_URI", "properties": [ { "name": "supplier_name", "text_values": { "values": "Stanford Plumbing & Heating" } }, { "name": "total_amount", "float_values": { "values": "1091.81" } }, { "name": "invoice_id", "text_values": { "values": "invoiceid" } }, ] }, "requestMetadata": { "userInfo": { "id": "user:USER_EMAIL" } } }'

Python

Para mais informações, consulte a Document AI Warehouse Python API documentação de referência.

Para autenticar no Document AI Warehouse, configure o Application Default Credentials. Se quiser mais informações, consulte Configurar a autenticação para um ambiente de desenvolvimento local.


from google.cloud import contentwarehouse


def sample_update_document(
    document_name: str, document: contentwarehouse.Document, user_id: str
) -> contentwarehouse.CreateDocumentResponse:
    """Updates a document.

    Args:
        document_name: The resource name of the document.
                    Format: 'projects/{project_number}/
                    locations/{location}/documents/{document_id}'.
        document: Document AI Warehouse Document object..
        user_id: user_id: user:YOUR_SERVICE_ACCOUNT_ID or user:USER_EMAIL.
    Returns:
        Response object.
    """
    # Create a client
    client = contentwarehouse.DocumentServiceClient()

    # Update document fields
    # For fields which can be updated, refer
    # https://cloud.google.com/python/docs/reference/contentwarehouse/
    # latest/google.cloud.contentwarehouse_v1.types.Document
    document.display_name = "Updated Order Invoice"

    request_metadata = contentwarehouse.RequestMetadata(
        user_info=contentwarehouse.UserInfo(id=user_id)
    )

    request = contentwarehouse.UpdateDocumentRequest(
        name=document_name, document=document, request_metadata=request_metadata
    )

    # Make the request
    response = client.update_document(request=request)

    return response

Java

Para mais informações, consulte a documentação de referência da API Document AI Warehouse Java.

Para autenticar no Document AI Warehouse, configure o Application Default Credentials. Se quiser mais informações, consulte Configurar a autenticação para um ambiente de desenvolvimento local.

import com.google.cloud.contentwarehouse.v1.Document;
import com.google.cloud.contentwarehouse.v1.DocumentName;
import com.google.cloud.contentwarehouse.v1.DocumentServiceClient;
import com.google.cloud.contentwarehouse.v1.DocumentServiceSettings;
import com.google.cloud.contentwarehouse.v1.GetDocumentRequest;
import com.google.cloud.contentwarehouse.v1.RequestMetadata;
import com.google.cloud.contentwarehouse.v1.UpdateDocumentRequest;
import com.google.cloud.contentwarehouse.v1.UpdateDocumentResponse;
import com.google.cloud.contentwarehouse.v1.UserInfo;
import com.google.cloud.resourcemanager.v3.Project;
import com.google.cloud.resourcemanager.v3.ProjectName;
import com.google.cloud.resourcemanager.v3.ProjectsClient;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeoutException;

public class UpdateDocument {
  public static void updateDocument() throws IOException, 
        InterruptedException, ExecutionException, TimeoutException { 
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String location = "your-region"; // Format is "us" or "eu".
    String documentId = "your-document-id";
    String userId = "your-user-id"; // Format is user:<user-id>
    /* The below method call retrieves details about the document you are about to update.
    * It is important to note that some properties cannot be edited or removed. 
    * For more information on managing documents, please see the below documentation.
    * https://cloud.google.com/document-warehouse/docs/manage-documents */
    GetDocument.getDocument(projectId, location, documentId, userId);
    updateDocument(projectId, location, documentId, userId);
  }

  // Updates an existing Document
  public static void updateDocument(String projectId, String location,
        String documentId, String userId) throws IOException, InterruptedException,
          ExecutionException, TimeoutException { 
    String projectNumber = getProjectNumber(projectId);

    String endpoint = "contentwarehouse.googleapis.com:443";
    if (!"us".equals(location)) {
      endpoint = String.format("%s-%s", location, endpoint);
    }

    DocumentServiceSettings documentServiceSettings = 
             DocumentServiceSettings.newBuilder().setEndpoint(endpoint).build(); 

    /* Create the Document Service Client 
     * Initialize client that will be used to send requests. 
     * This client only needs to be created once, and can be reused for multiple requests. */
    try (DocumentServiceClient documentServiceClient = 
            DocumentServiceClient.create(documentServiceSettings)) {

      /* The full resource name of the location, e.g.: 
       projects/{project_number}/location/{location}/documentSchemas/{document_schema_id} */
      DocumentName documentName = 
          DocumentName.of(projectNumber, location, documentId);

      // Define RequestMetadata object for context of the user making the API call
      RequestMetadata requestMetadata = RequestMetadata.newBuilder()
          .setUserInfo(
          UserInfo.newBuilder()
            .setId(userId).build()).build();

      // Get the document to retreive the document schema associated with the object
      GetDocumentRequest getDocumentRequest = GetDocumentRequest.newBuilder()
          .setName(documentName.toString())
          .setRequestMetadata(requestMetadata)
          .build(); 

      // Execute the request and store response in a document object
      Document document = documentServiceClient.getDocument(getDocumentRequest);

      // Define the updates to the document that will be passed in the request
      Document updatedDocument = Document.newBuilder()
          .setDisplayName("Updated Document Display Name")
          .setDocumentSchemaName(document.getDocumentSchemaName()).build();

      // Create the request to Update the Document
      UpdateDocumentRequest updateDocumentRequest = 
            UpdateDocumentRequest.newBuilder()
              .setName(documentName.toString())
              .setDocument(updatedDocument)
              .setRequestMetadata(requestMetadata)
              .build();

      // Update Document and receive response
      UpdateDocumentResponse updateDocumentResponse = 
          documentServiceClient.updateDocument(updateDocumentRequest);

      // Read the output of Updated Document Name
      System.out.println(updateDocumentResponse.getDocument().getDisplayName());
    }
  }

  private static String getProjectNumber(String projectId) throws IOException { 
    /* Initialize client that will be used to send requests. 
    * This client only needs to be created once, and can be reused for multiple requests. */
    try (ProjectsClient projectsClient = ProjectsClient.create()) { 
      ProjectName projectName = ProjectName.of(projectId); 
      Project project = projectsClient.getProject(projectName);
      String projectNumber = project.getName(); // Format returned is projects/xxxxxx
      return projectNumber.substring(projectNumber.lastIndexOf("/") + 1);
    } 
  }
}

Por reference_id:

  curl --location --request POST --url https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documents \
--header "Authorization: Bearer $(gcloud auth print-access-token)" \
--header "Content-Type: application/json; charset=utf-8" \
--data '{
  "document": {
    "display_name": "TestDoc3",
    "document_schema_name": "projects/PROJECT_NUMBER/locations/LOCATION/documentSchemas/referenceId/REFERENCE_ID",
    "raw_document_path": "gs://BUCKET_URI/FILE_URI",
    "properties": [
      {
        "name": "supplier_name",
        "text_values": {
          "values": "Stanford Plumbing & Heating"
        }
      },
      {
        "name": "total_amount",
        "float_values": {
          "values": "1091.81"
        }
      },
      {
        "name": "invoice_id",
        "text_values": {
          "values": "invoiceid"
        }
      },
    ]
  },
  "requestMetadata": {
    "userInfo": {
      "id": "user:USER_EMAIL"
    }
  }
}'

Exclusão de um documento

REST

Por document_id:

curl --request POST \
  --header "Authorization: Bearer $(gcloud auth print-access-token)" \
  --header "Content-Type: application/json; charset=UTF-8" -d '{
    "requestMetadata":{
      "userInfo":{
        "id": "user:USER_EMAIL"
      }
    }
  }' \
  "https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documents/DOCUMENT_ID:delete"

Por reference_id:

curl --request POST \
  --header "Authorization: Bearer $(gcloud auth print-access-token)" \
  --header "Content-Type: application/json; charset=UTF-8" -d '{
    "requestMetadata":{
      "userInfo":{
        "id": "user:USER_EMAIL"
      }
    }
  }' \
  "https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documents/referenceId/REFERENCE_ID":delete"

Próximas etapas