This page explains the schema of the BigQuery table that is created when exporting DICOM metadata to BigQuery.
Terminology
To understand the schema and its components, familiarize yourself with the DICOM terminology. In particular, this page uses several terms found in 3.10 DICOM Data Structures and Encoding Definitions.
Overview
The Cloud Healthcare API automatically generates the BigQuery schema using the data you are exporting and the DICOM dictionary. The schema only contains columns for DICOM data elements that exist in the metadata. The only exception is the Person Name VR.
When exporting DICOM metadata, the Cloud Healthcare API attempts to export all data elements in the metadata. For information on what happens if an issue arises, see Conflicting and mismatching types.
Standard and private data elements
DICOM provides standard data elements that conform to a predefined specification. For a list of these data elements, see Registry of DICOM Data Elements.
In cases where you must communicate data that doesn't conform to the standard elements, you can use private data elements.
Standard data elements
The following behaviors apply to standard data elements. For private data element behavior, see Private data elements.
Column names
Columns in the generated BigQuery schema are named according to the keyword of the data element. For example, if the DICOM metadata contains a data element whose keyword is InstanceCreationDate, then the generated schema has a corresponding column named InstanceCreationDate.
Standard DICOM data element behavior
The following table shows a list of Value Representations (VRs) and their abbreviations. For any data element exported to BigQuery that contains one of these VRs, the data element uses the BigQuery data type found under "Data type":
| VR | Data type |
|---|---|
|
String |
| Date (DA) | Date |
| Time (TM) | Time |
| Date Time (DT) | Timestamp |
|
String |
| Person Name (PN) | Struct (Record) |
|
Floating point |
|
Integer |
| Sequence of Items (SQ) | Struct (Record) |
Nullable and repeated modes
Depending on the Value Multiplicity (VM) value of a data element, its
BigQuery column has one of two
modes: NULLABLE or REPEATED.
If a data element has a VM value of 1, which indicates that the data element
is unique, the data element uses the NULLABLE mode. For any other VM value,
the data element uses the REPEATED mode.
For example, as shown in the Registry of DICOM Data Elements,
the SOPInstanceUID keyword has a VM value of 1. As a result,
when it is exported to BigQuery, its mode is NULLABLE,
and its representation in the table looks like the following (when represented as
JSON):
"SOPInstanceUID": "0.0.000.000000.0.000.0000.0000000.0000.0000000000.000",
Conversely, the ImageType keyword has a VM value of 2-n.
As a result, when it is exported to BigQuery, its mode is
REPEATED, and its representation in the table looks like the following (when
represented as JSON):
"ImageType": [
"ORIGINAL",
"PRIMARY",
"OTHER",
"..."
],
Excluded VRs
Binary and long-form data is not exported to the generated BigQuery
table, so data elements containing the following VRs are not exported. Instead,
the following VRs are included in a separate column (called
DroppedTags.TagName) in the destination BigQuery table.
- Other Double (OD)
- Other Float (OF)
- Other Long (OL)
- Other Byte (OB)
- Other Word (OW)
- Unknown (UN)
- Sequence (SQ) tags containing more than approximately 1 MiB of data
- Attribute (AT), Floating Point Double (FD), Floating Point Single (FL),
Unsigned Long (UL), or Unsigned Short (US), if the Value Multiplicity (VM) is
greater than 512.
- For legacy reasons, tags of instances already ingested into the
Cloud Healthcare API might be included in the
DroppedTags.TagNamecolumn if the Value Multiplicity is greater than 64.
- For legacy reasons, tags of instances already ingested into the
Cloud Healthcare API might be included in the
Person Name VR
Each column in the BigQuery schema with a Person Name (PN) VR always contains three subcolumns, regardless of whether the subcolumns contain any data. The three subcolumns are:
- Alphabetic
- Ideographic
- Phonetic
Each of the three subcolumns has its own five subcolumns:
- FamilyName
- GivenName
- MiddleName
- NamePrefix
- NameSuffix
For example, consider the public tag "OperatorsName (0008,1070)," which has a VR of Person Name (PN). Suppose that the value of OperatorsName is "Darcy Smith." The schema will contain an OperatorsName column containing the subcolumns listed previously, but only Alphabetic.FamilyName (Smith) and Alphabetic.GivenName (Darcy) will contain values.
Private data elements
Some clinical implementations might require you to store custom data that does not fit within the structure of public data elements. As an alternative, you can use private data elements.
Private data elements with a VR of SQ (Sequence of Items) have the same behavior as standard data elements. Private data elements with a VR of SQ are called private data sequences.
Private data elements that do not have a VR of SQ are nested under a column
called OtherElements and are converted to strings. These private data
elements are called non-sequence private data. To
query non-sequence private data elements, your query must search within the
OtherElements column of the element.
The OtherElements column contains two subcolumns, "Data" and "Tag." The
Data column is the string representation of the private data element's value.
It is always type REPEATED. The Tag column uses the format
"Tag_HEX" where HEX is a hex string of the tag number.
LastUpdated and Type columns
The LastUpdated and Type columns are added to the BigQuery
table created when you export DICOM metadata. These columns are not standard
or private data elements, and they do not correspond to the Registry of
DICOM Data Elements.
The behavior of these columns is as follows:
- The
LastUpdatedcolumn contains a timestamp value that shows when the DICOM instance was inserted into or deleted from the DICOM store. - The
Typecolumn contains a string that shows what type of operation occurred. The possible values areCREATEorDELETE.
Conflicting and mismatching types
If a type conflict occurs, such as when a public tag is used with an incorrect
type, the public tag is treated as though it were a private tag. The value of
the data element is
nested under a column called OtherElements and the value is converted to a string.
For example, say that the DICOM metadata contains a tag with:
- A tag number "(4010,1017)"
- A VR of SL (Signed Long)
- A value of 32
(4010,1017) is same tag number as "Mass," which is a public tag name in the DICOM specification that has a VR of FL. The export operation expects a data element with the tag number of "(4010,1017)" to be the "Mass" public tag name with a VR of FL. Therefore, the export operation expects to convert the data element's value to a float (as shown in the table in Standard DICOM data element behavior
A type conflict occurs because any tags with a VR of SL use the integer
data type. The tag is therefore converted to a private tag and added to the
OtherElements column.
If a non-sequence public tag name is used for sequence data, a type mismatch occurs. As a result, the sequence is treated as though it were a private data element. Instead of using the public tag name as the column name in the BigQuery schema, the public tag name's hex number is used. The hex number is of type string.
Examples: Querying public and private data elements
Consider the following snippet of a schema represented as JSON. The schema was created after exporting DICOM data to BigQuery.
[
...
{
"name": "SOPInstanceUID",
"type": "STRING"
},
{
"fields": [
{
"fields": [
{
"mode": "REQUIRED",
"name": "Tag",
"type": "STRING"
},
{
"mode": "REPEATED",
"name": "Data",
"type": "STRING"
}
],
"mode": "REPEATED",
"name": "OtherElements",
"type": "RECORD"
}
],
"mode": "REPEATED",
"name": "Tag_12345678",
"type": "RECORD"
}
...
]
The following sample shows how to query for the SOPInstanceUID public data element.
To access the value of the column, run the following query:
#standardSQL SELECT SOPInstanceUID FROM `PROJECT_ID.DATASET_ID.TABLE_ID`
Running the query returns an output similar to the following:
[ ... { "SOPInstanceUID": "0.0.000.000000.0.000.0000.0000000.0000.0000000000.000" }, ... ]
The following sample shows how to query for non-sequence private data.
Run the following query against the OtherElements column which is
inside the Tag_12345678 column. Note the use of the UNNEST
operator, which is required because you are querying a RECORD.
#standardSQL SELECT Tag_12345678.OtherElements AS OtherElements FROM `PROJECT_ID.DATASET_ID.TABLE_ID`, UNNEST(Tag_12345678) AS Tag_12345678
Running the query returns an output similar to the following, depending
on the amount and type of data in the Tag_12345678.OtherElements column:
[ { "OtherElements": [ { "Tag": "Tag_12345678", "Data": [ "DATA" ] } ] }, { "OtherElements": [ { "Tag": "Tag_12345678", "Data": [ "DATA" ] } ] }, { "OtherElements": [ { "Tag": "Tag_12345678", "Data": [ "DATA" ] } ] } ]
JSON Schema
You can export DICOM instance metadata into a BigQuery table with a simpler schema where the full instance metadata is stored as a single JSON column. This can be useful when you prefer to work with the raw JSON structure of the DICOM metadata within BigQuery. It is also useful when the number of unique tags across all DICOM instances exceeds the maximum number of columns allowed in a BigQuery table.
To enable the JSON schema, set the schema_json field in the BigQueryDestination. For more details on configuring the schema, see BigQueryDestination.
When the JSON schema is enabled, the BigQuery table will have the following columns:
| Field Name | Type | Description |
|---|---|---|
StudyInstanceUID |
STRING | DICOM Tag (0020,000D). |
SeriesInstanceUID |
STRING | DICOM Tag (0020,000E). |
SOPInstanceUID |
STRING | DICOM Tag (0008,0018). |
SourceDicomStore |
STRING | The name of the source DICOM store. This field is only included if the includeSourceStore option is set to true in the export configuration. |
Type |
STRING | Indicates the type of operation that triggered the export (e.g., CREATE, DELETE). |
LastUpdated |
TIMESTAMP | Timestamp of the last update to the instance in the DICOM store. |
Metadata |
JSON | All DICOM tags for the instance, stored in a single JSON object. |
DroppedTags |
REPEATED STRING | List of tags that were dropped during the conversion, typically binary or very large tags. |
StorageClass |
STRING | The storage class of the instance (e.g., STANDARD, NEARLINE). |
BlobStorageSize |
INTEGER | Size of the blob storage in bytes. |
StructuredStorageSize |
INTEGER | Size of the structured storage in bytes. |
Example: Querying the Metadata JSON column
Here is an example of how to query a specific DICOM tag from the Metadata column following BigQuery's JSON data support. This query selects the PatientID and PatientAge tags:
#standardSQL SELECT Metadata.PatientID AS patient_id, Metadata.PatientAge AS patient_age FROM `PROJECT_ID.DATASET_ID.TABLE_ID`;
Running the query returns an output similar to the following:
/*-------------+---------------*
| patient_id | patient_age |
+--------------+---------------+
| "John-Doe" | "042Y" |
*--------------+---------------*/
You can adapt query to extract other DICOM tags from the Metadata JSON column as needed.
What's next
Learn more about BigQuery standard SQL operations and view examples.