CEL dialect for document validation

Document AI validation and correction leverages the Common Expression Language (CEL) to allow for flexible data validation and manipulation within your document processing workflows. Document AI offers a set of custom functions, macros, and behavioral modifications tailored for document entity data.

Access entities in CEL expression

All expressions are evaluated against a root variable named doc, which consists of entities that are phrases or properties belonging to the document. These entities closely follow the structure of the entities of your extracted document.

While an extracted entity contains many properties, only three of them are available for CEL evaluation.

mention_text: Raw extracted text as present in an extracted entity. It defaults to an empty string.
normalized_value: Normalized mention text as present in an extracted entity. It defaults to null. Read more on Normalization.
bounding_poly: A special object containing a representation of the extracted entity's placement in the document and used for alignment checks. It defaults to null.

Data model

The exact structure of an extracted entity within the doc map depends on two factors. The first is whether its structure is a concrete value, such as a number or plain text, or if it's a complex object. The second factor is whether its occurrence type is single or multiple. For more information, see OccurrenceType.

A key feature of the validation data model is that any entity defined in the schema but not extracted from the document is automatically populated with default values. This design lets you skip most of the explicit null checks in your CEL expressions, significantly simplifying your validation expressions. You only need to explicitly write null checks to ensure a selected entity was indeed extracted.

Leaf entities example cases

The following sections describe how to access the entities in a leaf entity, which is one without nested child entities. Leaf entities directly hold a value.

Leaf entity with a single occurrence

This is the most basic case, using OccurrenceType of OPTIONAL_ONCE or REQUIRED_ONCE. The entity is represented as an object containing the three standard properties.

An example of how to access these values is doc.invoice_date.normalized_value.

It has the structure:

  "invoice_date": {
    "mention_text": "1",
    "normalized_value": 1.0,
    "bounding_poly": bounding_poly_object
  }

And default value:

  "invoice_date": {
    "mention_text": "",
    "normalized_value": null,
    "bounding_poly": null
  }
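The effect of these defaults can be modeled outside CEL. The following Python sketch is only an illustration: the leaf_entity helper, the due_date entity, and the dictionary layout are assumptions made for this example, not part of Document AI.

```python
# Hypothetical model of single-occurrence leaf entities in the doc map.
def leaf_entity(mention_text="", normalized_value=None, bounding_poly=None):
    """Return a leaf entity; omitted properties get the documented defaults."""
    return {
        "mention_text": mention_text,
        "normalized_value": normalized_value,
        "bounding_poly": bounding_poly,
    }

# invoice_date was extracted; due_date is in the schema but was not extracted,
# so it is populated with default values (assumed example data).
doc = {
    "invoice_date": leaf_entity("Mar 1, 2024", "2024-03-01", "bounding_poly_object"),
    "due_date": leaf_entity(),
}

# Property access succeeds even for the missing entity, so only an explicit
# check reveals whether it was actually extracted.
due_date_text = doc["due_date"]["mention_text"]
due_date_has_value = doc["due_date"]["normalized_value"] is not None
```

Because the lookup on the missing entity succeeds and yields an empty string, an expression only needs an explicit null check when it must confirm that the entity was extracted.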

Leaf entity with multiple occurrences

This case applies to leaf entities that can occur multiple times and have an OccurrenceType of OPTIONAL_MULTIPLE or REQUIRED_MULTIPLE, for example a list of payment due dates. The entity is represented as an object where each property holds a list of the corresponding values from all occurrences, so mention_text, normalized_value, and bounding_poly can each contain multiple values.

An example of how to access these values is doc.payment_due_dates.normalized_value[0].

It has the structure:

  "payment_due_dates": {
    "mention_text": ["Mar 1, 2024", "Apr 1, 2024"],
    "normalized_value": [null, proto.timestamp(2024-04-01)],
    // Note: If a value is not normalized, it is stored as a null.
    "bounding_poly": [bounding_poly_object,bounding_poly_object]
  }

And default value:

  "payment_due_dates": {
    "mention_text": [],
    "normalized_value": [],
    "bounding_poly": []
  }
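Since each property is a separate list, index i in every list describes the same occurrence. The Python sketch below models that layout and regroups it per occurrence. The occurrences helper and the placeholder values are assumptions for illustration only.

```python
# Hypothetical multi-occurrence leaf entity: one list per property,
# with index i in every list describing the same occurrence.
payment_due_dates = {
    "mention_text": ["Mar 1, 2024", "Apr 1, 2024"],
    "normalized_value": [None, "2024-04-01"],  # None: value was not normalized
    "bounding_poly": ["poly_0", "poly_1"],     # placeholder objects
}

KEYS = ("mention_text", "normalized_value", "bounding_poly")

def occurrences(entity):
    """Regroup the parallel lists into one dict per occurrence."""
    return [dict(zip(KEYS, values)) for values in zip(*(entity[k] for k in KEYS))]

rows = occurrences(payment_due_dates)

# The default value (three empty lists) regroups to an empty list.
no_rows = occurrences({"mention_text": [], "normalized_value": [], "bounding_poly": []})
```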

Nested entities

A nested entity is a container for other entities, which are its "children."

Nested entity with one occurrence

If a nested entity occurs only once, for example a single receiver_address, it's represented as an object where the keys are the names of its child entities.

An example of how to access these values is doc.receiver_address.city.mention_text.

It has the structure:

  "receiver_address": {
    "street": {
      "mention_text": "123 Main St",
      "normalized_value": "123 Main St",
      "bounding_poly": bounding_poly_object
    }
  }

And default value:

  "receiver_address": {
    "street": {
      "mention_text": "",
      "normalized_value": null,
      "bounding_poly": null
    }
  }

Nested entity with multiple occurrences

When a nested entity can occur multiple times, it's represented as a list of objects. Each object in the list represents one full instance of the nested entity and contains its children.

An example of how to access these values is doc.line_items[1].description.normalized_value.

It has the structure:

  "line_items": [
    {
      "description": { "mention_text": "Product A", ... },
      "quantity": { "mention_text": "2", ... }
    },
    {
      "description": { "mention_text": "Service B", ... },
      "quantity": { "mention_text": "5", ... }
    }
  ]

And default value:

  "line_items": []
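An empty list as the default means that iteration over a missing nested entity simply visits nothing. The Python sketch below models this. The leaf and any_quantity_above helpers are hypothetical, and the explicit None guard is an assumption of this model rather than documented CEL behavior.

```python
# Hypothetical model of doc.line_items as a list of nested entities.
def leaf(mention_text="", normalized_value=None):
    """Build a leaf entity with the documented default values."""
    return {
        "mention_text": mention_text,
        "normalized_value": normalized_value,
        "bounding_poly": None,
    }

line_items = [
    {"description": leaf("Product A"), "quantity": leaf("2", 2.0)},
    {"description": leaf("Service B"), "quantity": leaf("5", 5.0)},
]

def any_quantity_above(items, threshold):
    """Rough analogue of items.exists(item, item.quantity.normalized_value > threshold)."""
    return any(
        item["quantity"]["normalized_value"] is not None
        and item["quantity"]["normalized_value"] > threshold
        for item in items
    )

found = any_quantity_above(line_items, 4.0)
found_in_default = any_quantity_above([], 1.0)  # default value: no items to visit
```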

Conversion table of normalized value

This table shows how the selected schema entity data type translates to the CEL normalized_value data type.

Schema data type	CEL data type
Currency, Address	`string`
Number, Money	`double`
Datetime	`proto.Timestamp`
Checkbox, Signature	`bool`
Plain Text	N/A

Example expressions

Here are some example CEL expressions.

// Leaf entity with a single occurrence: Get the invoice ID string
doc.invoice_id.normalized_value == "INV-12345"

// Leaf entity with multiple occurrences: Get the first payment term from the list
doc.payments.mention_text[0].matches(r'^\d+$')

// Nested entity with one occurrence: Access a child entity of a single nested entity
doc.receiver_address.name.normalized_value.startsWith("John")

// Nested entity with multiple occurrences: Access a child of a specific item in a list of nested entities
doc.line_items[1].description.normalized_value == "Premium Gadget"

// Advanced: Sum the total of all line items
doc.line_items.map(item, item.total.normalized_value).sum() == 275.0

// Advanced: Check if any line item has a quantity greater than 1
doc.line_items.exists(item, item.quantity.normalized_value > 1.0)

Here are other examples of CEL logic you can use.

// Ensure due date is after invoice date
doc.due_date.normalized_value > doc.invoice_date.normalized_value

// Cross list validation: ensure that each employer_contribution
// has a corresponding employee deduction
doc.employer_contribution.size() == doc.employee_deduction.size() &&
  lists.range(doc.employer_contribution.size()).all(i,
    doc.employee_deduction[i].current_amount.mention_text != "" &&
    doc.employer_contribution[i].current_amount.mention_text != "")

Behavioral changes

Document AI's CEL dialect modifies some standard behaviors to better suit document processing use cases.

Epsilon-based equality

To address the lack of a decimal type in CEL and to account for common floating-point inaccuracies in financial and numerical data, the equality (==) and inequality (!=) operators are modified for numeric types (double, int). Instead of exact equality, they use an epsilon-based comparison with a tolerance of 1e-2 (0.01).

For example, consider a total_amount with a normalized value of 100.005.

With the expression doc.total_amount.normalized_value == 100.0, the result is true. This is because abs(100.005 - 100.0) is less than 0.01. Standard CEL would return false.

For the expression doc.total_amount.normalized_value == 100.02 the result is false. This is because abs(100.005 - 100.02) is greater than 0.01.
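The comparison described here can be reproduced with a few lines of Python. This is a minimal sketch of the stated rule, not the Document AI implementation. The approx_equal name is hypothetical, and treating a difference of exactly 0.01 as unequal is an assumption based on the "less than 0.01" wording.

```python
EPSILON = 1e-2  # tolerance stated in the text

def approx_equal(a, b, eps=EPSILON):
    """Model of the modified == operator: true when |a - b| is below the tolerance."""
    return abs(a - b) < eps

total_amount = 100.005

equals_100 = approx_equal(total_amount, 100.0)      # difference of about 0.005
equals_100_02 = approx_equal(total_amount, 100.02)  # difference of about 0.015
```

One consequence of any tolerance-based comparison is that equality is no longer transitive: under this model 100.0 equals 100.008 and 100.008 equals 100.016, yet 100.0 does not equal 100.016.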

String function restrictions

The standard CEL string manipulation functions are restricted to operate only on an entity's mention_text property. This restriction ensures that validation rules are consistently applied to the literal text string as it appears in the document.

The affected functions are:

contains()
endsWith()
startsWith()
matches()

Here is a valid example:

// Checks if the extracted currency symbol is a dollar sign.
doc.total_amount.mention_text.startsWith("$")

Here is an example of an invalid command:

// This will produce a validation error upon save because contains() is not
// being used on a .mention_text field.
doc.supplier_name.normalized_value.contains("Inc.")
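To catch such expressions before saving them, a rough textual pre-check can be written in Python. This sketch is an assumption-laden approximation: it uses a regular expression rather than a CEL parser, the misplaced_string_calls helper is hypothetical, and it does not reproduce the actual validation that Document AI performs.

```python
import re

RESTRICTED = ("contains", "endsWith", "startsWith", "matches")

# Matches "<receiver>.<function>(" where receiver is the last dotted name,
# optionally followed by an index such as [0].
_CALL = re.compile(r"(\w+)(?:\[[^\]]*\])?\.(" + "|".join(RESTRICTED) + r")\(")

def misplaced_string_calls(expression):
    """Return the restricted functions whose receiver is not mention_text."""
    return [
        function
        for receiver, function in _CALL.findall(expression)
        if receiver != "mention_text"
    ]

valid = misplaced_string_calls('doc.total_amount.mention_text.startsWith("$")')
invalid = misplaced_string_calls('doc.supplier_name.normalized_value.contains("Inc.")')
```

Because the check is purely textual, it can be fooled by function names that appear inside string literals, so it is useful only as a quick first filter.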

Additional functions

This section describes non-standard global and member functions available in the Document AI CEL evaluation environment.

Sum function

### sum()

Calculates the sum of a list of numeric elements. It can be called on lists of integers or doubles. Note that null values in the list are ignored. If the list contains any non-numeric elements, the function triggers an error.

Signature: <list>.sum()

Here's an example summing up the normalized value of several line items.

// Calculates the sum of all line item amounts.
doc.line_item_amounts.normalized_value.sum()
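The null handling is easiest to see in a small model. The Python sketch below mirrors the description above. The cel_sum name is hypothetical, and returning 0 for an empty list is an assumption that the text does not confirm.

```python
def cel_sum(values):
    """Model of <list>.sum(): skip nulls, reject non-numeric elements."""
    total = 0
    for value in values:
        if value is None:
            continue  # null values are ignored
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"non-numeric element: {value!r}")
        total += value
    return total

# One line item amount was not normalized and is therefore null.
line_item_amounts = [100.0, None, 175.0]
total = cel_sum(line_item_amounts)
```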

Or function

### or()

Returns the original value if it is not null; otherwise, returns the supplied default value.

Signature: <value>.or(<default_value>)

As an example, you can use it to provide a default value when an optional field might be missing.

// If the tax amount is not found, use a default of 0.0 for calculation.
doc.tax_amount.normalized_value.or(0.0) > 5.0
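A Python model makes the null-only fallback explicit. The value_or name is hypothetical and the sketch only restates the rule described above.

```python
def value_or(value, default):
    """Model of <value>.or(<default_value>): fall back only when value is null."""
    return default if value is None else value

missing_tax = value_or(None, 0.0)  # not extracted: the default is used
present_tax = value_or(7.5, 0.0)   # extracted: the original value is kept
zero_tax = value_or(0.0, 1.0)      # zero is a real value, not null
```

Note the difference from Python's own or operator, which would also replace falsy values such as 0.0. The model replaces only None, matching the "not null" wording.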

Check list vertical alignment function

### checkVerticalAlignment()

Returns true if all the bounding_poly objects are vertically aligned to a degree specified by the tolerance. Otherwise, returns an error describing which entities are misaligned.

The tolerance is an optional argument that specifies the overlap of a bounding poly with the automatically selected best aligned bounding poly. A value of 0 means that the bounding polys are non-intersecting, and 1 means that the bounding polys are identical.

Signature: checkVerticalAlignment([<bounding_poly>,...],tolerance=0.8)

As an example, you can use this to help keep all extracted entities within a single column.

// Make sure that all extracted quantities are in the same column
checkVerticalAlignment(doc.line_items.map(li,li.quantity.bounding_poly))
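The exact overlap computation is not specified here, so the following Python sketch shows only one plausible reading. It assumes each bounding poly is reduced to a horizontal extent (x_min, x_max), that overlap is measured as intersection over union, and that the reference is the extent overlapping the others most. All names are hypothetical.

```python
def overlap_ratio(a, b):
    """Intersection over union of two (x_min, x_max) extents: 0 if disjoint, 1 if identical."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return intersection / union if union > 0 else 1.0

def check_vertical_alignment(x_extents, tolerance=0.8):
    """Return True if every extent overlaps the reference by at least tolerance."""
    if not x_extents:
        return True
    # Assumed selection rule: the reference is the extent that overlaps the others most.
    reference = max(
        x_extents,
        key=lambda ref: sum(overlap_ratio(ref, other) for other in x_extents),
    )
    misaligned = [
        index
        for index, extent in enumerate(x_extents)
        if overlap_ratio(reference, extent) < tolerance
    ]
    if misaligned:
        # Mirrors the documented behavior of reporting misaligned entities as an error.
        raise ValueError(f"misaligned entities at indexes {misaligned}")
    return True

same_column = check_vertical_alignment([(10.0, 20.0), (10.5, 20.0), (10.0, 19.5)])
```

Under these assumptions, three extents that share a column pass the check, while an extent far to the right, such as (40.0, 50.0), raises an error naming its index.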

Check horizontal alignment function

### checkHorizontalAlignment()

Returns true if all the bounding_poly objects are horizontally aligned to a degree specified by the tolerance. Otherwise, returns an error describing which entities are misaligned.

The tolerance is an optional argument that specifies the overlap of a bounding poly with the automatically selected best aligned bounding poly. The meaning of 0 is that the bounding polys are non-intersecting, and 1 means bounding polys are identical.

Signature: checkHorizontalAlignment([<bounding_poly>,...],tolerance=0.6)

As an example, you can use this to help keep all extracted entities within a single row.

// For all line items make sure that each extracted line item's
// children are in the same row. For example, if line item has
// quantity and description properties, it makes sure they are in 
// a single row for each respective line item.
doc.line_items.all(li, checkHorizontalAlignment(li.map(col,li[col].bounding_poly)))

Additional macros

This section describes non-standard macros available in the Document AI CEL evaluation environment.

For more information about macros, see Language definition: Macros.

Reduce macro

### reduce()

Use the reduce macro to perform cumulative operations over a list.

Signature: <list>.reduce(<iterator_name>, <accumulator_name>, <initial_value>, <loop_expression>)

iterator_name: A variable name for the element being processed in each iteration.
accumulator_name: A variable name for the value that accumulates the result.
initial_value: The initial value of the accumulator.
loop_expression: An expression that defines how the accumulator is updated in each step.

Here is an example of macro use:

// Use reduce to sum up the quantities from all line items.
doc.line_items.reduce(item, running_total, 0.0, running_total + item.quantity.normalized_value)

// Result: 3.0
// Tip: this could also be done by concatenating map and sum:
doc.line_items.map(item,item.quantity.normalized_value).sum()
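For readers who know Python, the macro corresponds closely to functools.reduce. The sketch below uses made-up line items with quantities 1.0 and 2.0. The data is an assumption chosen for illustration.

```python
from functools import reduce

# Hypothetical line items; only the quantity child is modeled.
line_items = [
    {"quantity": {"normalized_value": 1.0}},
    {"quantity": {"normalized_value": 2.0}},
]

# The lambda plays the role of loop_expression, 0.0 is initial_value,
# running_total is the accumulator, and item is the iterator.
total_quantity = reduce(
    lambda running_total, item: running_total + item["quantity"]["normalized_value"],
    line_items,
    0.0,
)

# Equivalent of chaining map and sum.
mapped_total = sum(item["quantity"]["normalized_value"] for item in line_items)
```

The argument order differs between the two: functools.reduce takes the function first, then the list, then the initial value, whereas the macro is called on the list and names the iterator and accumulator before the initial value.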

CEL extension libraries

In addition to the custom features, the following standard CEL extension libraries are available.

Lists: Provides functions for list manipulation: range, distinct, flatten, reverse, sort, slice. For more information, see Lists.
Math: Provides extended mathematical functions like math.greatest, math.least, math.abs. For more information, see Math.

What's next

Learn about pretrained processors.
