This page shows you how to create and manage Storage Insights datasets and dataset configurations. Learn more about Storage Insights datasets.
Before you begin
Before you create and manage datasets and dataset configurations, complete the tasks in the following sections.
Get the required roles
To get the permissions that you need to create and manage datasets, ask your administrator to grant you the following IAM roles on your source projects:
- To create, manage, and view dataset configurations: Storage Insights Admin (roles/storageinsights.admin)
- To view, link, and unlink datasets:
  - Storage Insights Analyst (roles/storageinsights.analyst)
  - BigQuery Admin (roles/bigquery.admin)
- To delete linked datasets: BigQuery Admin (roles/bigquery.admin)
- To view and query datasets in BigQuery:
  - Storage Insights Viewer (roles/storageinsights.viewer)
  - BigQuery Job User (roles/bigquery.jobUser)
  - BigQuery Data Viewer (roles/bigquery.dataViewer)
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to create and manage datasets. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to create and manage datasets:
- Create dataset configuration: storageinsights.datasetConfigs.create
- View dataset configuration: storageinsights.datasetConfigs.get, storageinsights.datasetConfigs.list
- Manage dataset configuration: storageinsights.datasetConfigs.update, storageinsights.datasetConfigs.delete
- Link a BigQuery dataset: storageinsights.datasetConfigs.linkDataset
- Unlink a BigQuery dataset: storageinsights.datasetConfigs.unlinkDataset
- Query BigQuery linked datasets: bigquery.jobs.create or bigquery.jobs.*
You might also be able to get these permissions with custom roles or other predefined roles.
Enable the Storage Insights API
Console
Command line
To enable the Storage Insights API in your current project, run the following command:
gcloud services enable storageinsights.googleapis.com
For more details about enabling services for a Google Cloud project, see Enabling and disabling services.
Configure Storage Intelligence
Ensure that Storage Intelligence is configured on the project, folder, or organization that you want to analyze with datasets.
Create a dataset configuration
To create a dataset configuration and generate a dataset, follow these steps. For more information about the fields that you can specify as you create the dataset configuration, see Dataset configuration properties.
Console
- In the Google Cloud console, go to the Cloud Storage Storage Insights page.
Click Configure dataset.
In the Name your dataset section, enter a name for your dataset. Optionally, enter a description for the dataset.
In the Define dataset scope section, do the following:
Select one of the following options:
To get storage metadata for all projects in the current organization, select Include the organization.
To get storage metadata for all projects in the selected folders, select Include folders (sub-organizations/departments). For information about how to get folder IDs, see Viewing or listing folders and projects. To add folders, do the following:
- In the Folder 1 field, enter the folder ID.
- Optionally, to add multiple folder IDs, click + Add another folder.
To get storage metadata for the selected projects, select Include projects by providing project numbers. To learn how to find the project numbers, see Find the project name, number, and ID. To add projects, do the following:
- In the Project 1 field, enter the project number.
- Optionally, to add multiple project numbers, click + Add another project.
To add projects or folders in bulk, select Upload a list of projects/folders via CSV file. The CSV file must contain the project numbers or folder IDs that you want to include in the dataset.
Specify if you want to automatically include future buckets in the selected resource.
Optionally, to specify filters on buckets based on regions and bucket prefixes, expand the Filters (optional) section. Filters are applied additively on buckets.
You can include or exclude buckets from specific regions. For example, you can exclude buckets that are in the me-central1 and me-central2 regions. You can also include or exclude buckets by prefix. For example, if you want to exclude buckets that start with my-bucket, enter the my-bucket* prefix.
Click Continue.
In the Select retention period section, select a retention period for the data in the dataset.
In the Select location to store configured dataset section, select a location to store the dataset and dataset configuration.
In the Select service account type section, select a service agent type for your dataset. This service agent is created on your behalf when you create the dataset configuration. You can select one of the following service agents:
- Configuration-scoped service account: This service agent can only access and write the dataset generated by the particular dataset configuration.
- Project-scoped service account: This service agent can access and write datasets that are generated from all the dataset configurations in the project.
Upon creation of the service agent, you must grant it the required permissions. For more information about these service agents, see Dataset configuration properties.
Click Configure. It can take up to 48 hours after you configure the dataset for the first load of data to appear in the linked dataset.
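The prefix filter described in the steps above behaves like shell-style globbing. As a rough illustration only (this is not the service's actual matching code, and the bucket names are made up), the my-bucket* example can be sketched in Python:

```python
import fnmatch

# Hypothetical bucket names; only the "my-bucket*" pattern comes from the example above.
buckets = ["my-bucket-logs", "analytics-data", "my-bucket", "prod-my-bucket"]

# Exclude buckets whose names start with "my-bucket"; the pattern anchors at the start,
# so "prod-my-bucket" is not matched.
excluded = [name for name in buckets if fnmatch.fnmatch(name, "my-bucket*")]
remaining = [name for name in buckets if name not in excluded]

print(excluded)   # ['my-bucket-logs', 'my-bucket']
print(remaining)  # ['analytics-data', 'prod-my-bucket']
```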
Command line
To create a dataset configuration, run the gcloud alpha storage insights dataset-configs create command with the required flags:

```shell
gcloud alpha storage insights dataset-configs create DATASET_CONFIG_ID \
  --location=LOCATION \
  --organization=SOURCE_ORG_NUMBER \
  --retention-period-days=RETENTION_PERIOD_DAYS \
  SCOPE_FLAG
```

Where:

- DATASET_CONFIG_ID is the name you want to give to your dataset configuration. Names are used as the identifier of dataset configurations and are mutable. The name can contain up to 128 characters using letters, numbers, and underscores.
- LOCATION is the location the dataset configuration and dataset are stored in.
- SOURCE_ORG_NUMBER is the ID of the organization the source projects belong to. To learn how to find your organization ID, see Getting your organization resource ID.
- RETENTION_PERIOD_DAYS is the retention period for the data in the dataset.
- SCOPE_FLAG is one of the following flags that defines the scope of the data you want to collect:
  - --enable-organization-scope: Enables the dataset to collect insights from all buckets within the entire organization.
  - --source-folders=[SOURCE_FOLDER_NUMBERS,...]: Specifies a list of folder numbers to include in the dataset. To learn how to find a folder number, see Listing all projects and folders in your hierarchy.
  - --source-folders-file=FILE_PATH: Specifies multiple folder numbers by uploading a CSV file to a bucket.
  - --source-projects=[SOURCE_PROJECT_NUMBERS,...]: Specifies a list of project numbers to include in the dataset. For example, 464036093014. To learn how to find your project number, see Find the project name, number, and ID.
  - --source-projects-file=FILE_PATH: Specifies multiple project numbers by uploading a CSV file to a bucket.
Optionally, you can use additional flags to configure the dataset:

- Use --include-buckets=BUCKET_NAMES_OR_REGEX to include specific buckets by name or regular expression. If this flag is used, --exclude-buckets can't be used.
- Use --exclude-buckets=BUCKET_NAMES_OR_REGEX to exclude specific buckets by name or regular expression. If this flag is used, --include-buckets can't be used.
- Use --project=DESTINATION_PROJECT_ID to specify a project to use for storing your dataset configuration and generated dataset. If this flag is unused, the destination project is your active project. For more information about project IDs, see Creating and managing projects.
- Use --auto-add-new-buckets to automatically include any buckets that get added to source projects in the future.
- Use --skip-verification to skip checks and failures from the verification process, which includes checks for required IAM permissions. If used, some or all buckets might be excluded from the dataset.
- Use --identity=IDENTITY_TYPE to specify the type of service agent that gets created alongside the dataset configuration. Values are IDENTITY_TYPE_PER_CONFIG or IDENTITY_TYPE_PER_PROJECT. If unspecified, defaults to IDENTITY_TYPE_PER_CONFIG.
- Use --description=DESCRIPTION to write a description for the dataset configuration.
- Use --organization=ORGANIZATION_ID with the resource ID of the organization the source projects belong to. The dataset configuration excludes source projects outside of the specified organization. To learn how to find your organization ID, see Getting your organization resource ID. If unspecified, defaults to the source project's organization ID.
The following example creates a dataset configuration named my-dataset in the us-central1 region, for the organization with the ID 123456789, with a retention period of 30 days, and a scope limited to the projects 987654321 and 123123123:

```shell
gcloud alpha storage insights dataset-configs create my-dataset \
  --location=us-central1 \
  --organization=123456789 \
  --retention-period-days=30 \
  --source-projects=987654321,123123123
```
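The --include-buckets and --exclude-buckets flags are mutually exclusive. As a hedged sketch of that constraint (the validate_bucket_filters helper is ours, not part of gcloud, which performs its own validation), a client-side check might look like:

```python
def validate_bucket_filters(include=None, exclude=None):
    """Reject flag combinations that mirror the gcloud constraint described above.

    Hypothetical helper for illustration only: --include-buckets and
    --exclude-buckets can't be combined in a single create command.
    """
    if include and exclude:
        raise ValueError("--include-buckets and --exclude-buckets can't be used together")
    return include or exclude or []

# Either flag alone is fine; passing both raises ValueError.
print(validate_bucket_filters(include=["my-bucket*"]))  # ['my-bucket*']
```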
REST APIs
JSON API
Have the gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.

Create a JSON file that contains the following information:

```json
{
  "sourceProjects": {
    "project_numbers": ["PROJECT_NUMBERS", ...]
  },
  "retentionPeriodDays": "RETENTION_PERIOD_DAYS",
  "identity": {
    "type": "IDENTITY_TYPE"
  }
}
```

Replace:

- PROJECT_NUMBERS with the numbers of the projects you want to include in the dataset. You can specify one project or multiple projects. Projects must be specified as a list of strings. Alternatively, you can add an organization, or one or more folders containing the buckets and objects whose metadata you want to capture. To include folders or organizations, use the sourceFolders or organizationScope field respectively. For more information, see the DatasetConfig reference.
- RETENTION_PERIOD_DAYS with the number of days of data to capture in the dataset snapshot. For example, 90.
- IDENTITY_TYPE with the type of service agent that gets created alongside the dataset configuration. Values are IDENTITY_TYPE_PER_CONFIG or IDENTITY_TYPE_PER_PROJECT.
To create the dataset configuration, use cURL to call the JSON API with a CreateDatasetConfig request:

```shell
curl -X POST --data-binary @JSON_FILE_NAME \
  "https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs?datasetConfigId=DATASET_CONFIG_ID" \
  --header "Authorization: Bearer ACCESS_TOKEN" \
  --header "Accept: application/json" \
  --header "Content-Type: application/json"
```

Replace:

- JSON_FILE_NAME with the path to the JSON file you created in the previous step. Alternatively, you can pass an instance of DatasetConfig in the request body.
- PROJECT_ID with the ID of the project that the dataset configuration and dataset will belong to.
- LOCATION with the location the dataset and dataset configuration will reside in. For example, us-central1.
- DATASET_CONFIG_ID with the name you want to give to your dataset configuration. Names are used as the identifier of dataset configurations and are mutable. The name can contain up to 128 characters using letters, numbers, and underscores. The name must begin with a letter.
- ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.
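As an illustrative sketch (the build_dataset_config helper is hypothetical, not part of any Google SDK), the request body above can be assembled and serialized in Python before being passed to cURL:

```python
import json

def build_dataset_config(project_numbers, retention_days, identity_type):
    """Assemble a CreateDatasetConfig request body using the field names shown above."""
    return {
        # Project numbers must be a list of strings.
        "sourceProjects": {"project_numbers": [str(n) for n in project_numbers]},
        "retentionPeriodDays": str(retention_days),
        "identity": {"type": identity_type},
    }

body = build_dataset_config([987654321, 123123123], 90, "IDENTITY_TYPE_PER_CONFIG")
payload = json.dumps(body, indent=2)
print(payload)
```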
To troubleshoot snapshot processing errors that are logged in
error_attributes_view, see Troubleshoot dataset errors.
Grant the required permissions to the service agent
Google Cloud creates a configuration-scoped or project-scoped service
agent on your behalf when you create a dataset configuration. The service
agent follows the naming format
service-PROJECT_NUMBER@gcp-sa-storageinsights.iam.gserviceaccount.com and appears on the
IAM page of the Google Cloud console
when you select the Include Google-provided role grants checkbox.
You can also find the name of the service agent by
viewing the DatasetConfig resource using the JSON API.
To enable Storage Insights to generate and write datasets, ask your
administrator to grant the service agent the Storage Insights Collector
Service role (roles/storage.insightsCollectorService) on the organization
that contains the source projects.
This role must be granted to every configuration-scoped service agent
that gets created for each dataset configuration you want data from. If you're
using a project-scoped service agent, this role must only be
granted once for the service agent to be able to read and write datasets
for all dataset configurations within the project.
For instructions about granting roles on projects, see Manage access.
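Because the naming format above is mechanical, the service agent's address can be derived from the project number alone. A small sketch (the helper name is ours, not part of any SDK):

```python
def service_agent_email(project_number):
    """Build the Storage Insights service agent address from the documented naming format."""
    return f"service-{project_number}@gcp-sa-storageinsights.iam.gserviceaccount.com"

print(service_agent_email(464036093014))
# service-464036093014@gcp-sa-storageinsights.iam.gserviceaccount.com
```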
Link a dataset
To link a dataset to BigQuery, complete the following steps:
Console
- In the Google Cloud console, go to the Cloud Storage Storage Insights page.
Click the name of the dataset configuration that generated the dataset you want to link.
In the BigQuery linked dataset section, click Link dataset to link your dataset.
Command line
To link a dataset to BigQuery, run the gcloud alpha storage insights dataset-configs create-link command:

```shell
gcloud alpha storage insights dataset-configs create-link DATASET_CONFIG_ID \
  --location=LOCATION
```
Replace:
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to link.
- LOCATION with the location of your dataset. For example, us-central1.
As an alternative to specifying DATASET_CONFIG_ID and LOCATION, you can specify a full dataset configuration path. For example:

```shell
gcloud alpha storage insights dataset-configs create-link \
  projects/DESTINATION_PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID
```
Replace:
- DESTINATION_PROJECT_ID with the ID of the project that contains the dataset configuration. For more information about project IDs, see Creating and managing projects.
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to link.
- LOCATION with the location of your dataset and dataset configuration. For example, us-central1.
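The full dataset configuration path used above follows a fixed pattern, so it can be composed programmatically. This hypothetical helper (the function name is ours) shows how the pieces fit together:

```python
def dataset_config_path(project_id, location, dataset_config_id):
    """Compose the full resource path accepted in place of DATASET_CONFIG_ID and --location."""
    return f"projects/{project_id}/locations/{location}/datasetConfigs/{dataset_config_id}"

print(dataset_config_path("my-project", "us-central1", "my-dataset"))
# projects/my-project/locations/us-central1/datasetConfigs/my-dataset
```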
REST APIs
JSON API
Have the gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.

Create a JSON file that contains the following information:

```json
{
  "name": "DATASET_NAME"
}
```

Replace:

- DATASET_NAME with the name of the dataset you want to link. For example, my_project.my_dataset276daa7e_2991_4f4f_b9d4_e354b48426a2.
Use cURL to call the JSON API with a linkDataset DatasetConfig request:

```shell
curl --request POST --data-binary @JSON_FILE_NAME \
  "https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID:linkDataset" \
  --header "Authorization: Bearer ACCESS_TOKEN" \
  --header "Accept: application/json" \
  --header "Content-Type: application/json"
```

Replace:

- JSON_FILE_NAME with the path to the JSON file you created in the previous step.
- PROJECT_ID with the ID of the project that the dataset configuration belongs to.
- LOCATION with the location in which the dataset and dataset configuration reside. For example, us-central1.
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to link.
- ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.
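Note the shape of the request URL: the linkDataset custom verb is attached to the resource path with a colon. A hedged Python sketch of the endpoint construction (the helper name is ours, not part of any SDK):

```python
def link_dataset_url(project_id, location, dataset_config_id):
    """Build the linkDataset endpoint; the :linkDataset verb follows the resource path."""
    resource = f"projects/{project_id}/locations/{location}/datasetConfigs/{dataset_config_id}"
    return f"https://storageinsights.googleapis.com/v1/{resource}:linkDataset"

print(link_dataset_url("my-project", "us-central1", "my-dataset"))
```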
View and query linked datasets
To view and query linked datasets, follow these steps:
- In the Google Cloud console, go to the Cloud Storage Storage Insights page.
A list of dataset configurations that are created in your project appears.
Click the BigQuery linked dataset of the dataset configuration that you want to view.
The BigQuery linked dataset appears in the Google Cloud console. For information about the dataset schema of metadata, see Dataset schema of metadata.
You can query tables and views in your linked datasets in the same way you would query any other BigQuery table.
Unlink a dataset
To stop the dataset configuration from publishing to the BigQuery dataset, unlink the dataset. To unlink a dataset, complete the following steps:
Console
- In the Google Cloud console, go to the Cloud Storage Storage Insights page.
Click the name of the dataset configuration that generated the dataset you want to unlink.
In the BigQuery linked dataset section, click Unlink dataset to unlink your dataset.
Command line
To unlink the dataset, run the gcloud alpha storage insights dataset-configs delete-link command:

```shell
gcloud alpha storage insights dataset-configs delete-link DATASET_CONFIG_ID \
  --location=LOCATION
```

Replace:

- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to unlink.
- LOCATION with the location of your dataset and dataset configuration. For example, us-central1.
As an alternative to specifying DATASET_CONFIG_ID and LOCATION, you can specify a full dataset configuration path. For example:

```shell
gcloud alpha storage insights dataset-configs delete-link \
  projects/DESTINATION_PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID
```

Replace:

- DESTINATION_PROJECT_ID with the ID of the project that contains the dataset configuration. For more information about project IDs, see Creating and managing projects.
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to unlink.
- LOCATION with the location of your dataset and dataset configuration. For example, us-central1.
REST APIs
JSON API
Have the gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.

Create a JSON file that contains the following information:

```json
{
  "name": "DATASET_NAME"
}
```

Replace:

- DATASET_NAME with the name of the dataset you want to unlink. For example, my_project.my_dataset276daa7e_2991_4f4f_b9d4_e354b48426a2.
Use cURL to call the JSON API with an unlinkDataset DatasetConfig request:

```shell
curl --request POST --data-binary @JSON_FILE_NAME \
  "https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID:unlinkDataset" \
  --header "Authorization: Bearer ACCESS_TOKEN" \
  --header "Accept: application/json" \
  --header "Content-Type: application/json"
```

Replace:

- JSON_FILE_NAME with the path to the JSON file you created in the previous step.
- PROJECT_ID with the ID of the project that the dataset configuration belongs to.
- LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to unlink.
- ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.
View a dataset configuration
To view a dataset configuration, complete the following steps:
Console
- In the Google Cloud console, go to the Cloud Storage Storage Insights page.
Click the name of the dataset configuration you want to view.
The dataset configuration details are displayed.
Command line
To describe a dataset configuration, run the gcloud alpha storage insights dataset-configs describe command:

```shell
gcloud alpha storage insights dataset-configs describe DATASET_CONFIG_ID \
  --location=LOCATION
```

Replace:

- DATASET_CONFIG_ID with the name of the dataset configuration.
- LOCATION with the location of the dataset and dataset configuration.
As an alternative to specifying DATASET_CONFIG_ID and LOCATION, you can specify a full dataset configuration path. For example:

```shell
gcloud alpha storage insights dataset-configs describe \
  projects/DESTINATION_PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID
```

Replace:

- DESTINATION_PROJECT_ID with the ID of the project that contains the dataset configuration. For more information about project IDs, see Creating and managing projects.
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to view.
- LOCATION with the location of your dataset and dataset configuration. For example, us-central1.
REST APIs
JSON API
Have the gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.

Use cURL to call the JSON API with a GetDatasetConfig request:

```shell
curl -X GET \
  "https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID" \
  --header "Authorization: Bearer ACCESS_TOKEN" \
  --header "Accept: application/json" \
  --header "Content-Type: application/json"
```

Replace:

- PROJECT_ID with the ID of the project that the dataset configuration belongs to.
- LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
- DATASET_CONFIG_ID with the name of the dataset configuration.
- ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.
List dataset configurations
To list the dataset configurations in a project, complete the following steps:
Console
- In the Google Cloud console, go to the Cloud Storage Storage Insights page.
The list of dataset configurations is displayed.
Command line
To list dataset configurations in a project, run the gcloud alpha storage insights dataset-configs list command:

```shell
gcloud alpha storage insights dataset-configs list --location=LOCATION
```

Replace:

- LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
You can use the following optional flags to specify the behavior of the listing call:

- Use --page-size to specify the maximum number of results to return per page.
- Use --filter=FILTER to filter results. For more information on how to use the --filter flag, run gcloud topic filters and refer to the documentation.
- Use --sort-by=SORT_BY_VALUE to specify a comma-separated list of resource field key names to sort by. For example, --sort-by=DATASET_CONFIG_ID.
REST APIs
JSON API
Have the gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.

Use cURL to call the JSON API with a ListDatasetConfigs request:

```shell
curl -X GET \
  "https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs" \
  --header "Authorization: Bearer ACCESS_TOKEN" \
  --header "Accept: application/json" \
  --header "Content-Type: application/json"
```

Replace:

- PROJECT_ID with the ID of the project that the dataset configurations belong to.
- LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
- ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.
Update a dataset configuration
To update a dataset configuration, complete the following steps:
Console
- In the Google Cloud console, go to the Cloud Storage Storage Insights page.
Click the name of the dataset configuration you want to update.
In the Dataset configuration tab that appears, click Edit to update the fields.
Command line
To update a dataset configuration, run the gcloud alpha storage insights dataset-configs update command:

```shell
gcloud alpha storage insights dataset-configs update DATASET_CONFIG_ID \
  --location=LOCATION
```

Replace:

- DATASET_CONFIG_ID with the name of the dataset configuration.
- LOCATION with the location of the dataset and dataset configuration.
Use the following flags to update properties of the dataset configuration:

- Use --skip-verification to skip checks and failures from the verification process, which includes checks for required IAM permissions. If used, some or all buckets might be excluded from the dataset.
- Use --retention-period-days=DAYS to specify the moving number of days of data to capture in the dataset snapshot. For example, 90.
- Use --description=DESCRIPTION to write a description for the dataset configuration.
- Use --organization=ORGANIZATION_ID to specify the organization ID of the source project. If unspecified, defaults to the source project's organization ID.
REST APIs
JSON API
Have the gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.

Create a JSON file that contains the following optional information:

```json
{
  "organization_number": "ORGANIZATION_ID",
  "source_projects": {
    "project_numbers": ["PROJECT_NUMBERS", ...]
  },
  "retention_period_days": RETENTION_PERIOD
}
```

Replace:

- ORGANIZATION_ID with the resource ID of the organization to which the source projects belong. If unspecified, defaults to the source project's organization ID.
- PROJECT_NUMBERS with the project numbers you want to include in the dataset. You can specify one project or multiple projects. Projects must be specified in a list format.
- RETENTION_PERIOD with the moving number of days of data to capture in the dataset snapshot. For example, 90.
To update the dataset configuration, use cURL to call the JSON API with a PatchDatasetConfig request:

```shell
curl -X PATCH --data-binary @JSON_FILE_NAME \
  "https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID?updateMask=RETENTION_PERIOD" \
  --header "Authorization: Bearer ACCESS_TOKEN" \
  --header "Accept: application/json" \
  --header "Content-Type: application/json"
```

Replace:

- JSON_FILE_NAME with the path to the JSON file you created in the previous step.
- PROJECT_ID with the ID of the project that the dataset configuration belongs to.
- LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
- DATASET_CONFIG_ID with the name of the dataset configuration you want to update.
- RETENTION_PERIOD with the moving number of days of data to capture in the dataset snapshot. For example, 90.
- ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.
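The updateMask query parameter names the fields being changed, as a comma-separated list. As a hedged sketch (the patch_url helper is ours, and the retention_period_days field name follows the JSON body above), the PATCH URL might be built like this:

```python
from urllib.parse import urlencode

def patch_url(project_id, location, dataset_config_id, update_fields):
    """Build a PatchDatasetConfig URL; updateMask is a comma-separated list of field names."""
    base = (
        "https://storageinsights.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/datasetConfigs/{dataset_config_id}"
    )
    # urlencode percent-escapes the value, which is safe for single field names.
    return base + "?" + urlencode({"updateMask": ",".join(update_fields)})

url = patch_url("my-project", "us-central1", "my-dataset", ["retention_period_days"])
print(url)
```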
Delete a dataset configuration
To delete a dataset configuration, complete the following steps:
Console
- In the Google Cloud console, go to the Cloud Storage Storage Insights page.
Click the name of the dataset configuration you want to delete.
Click Delete.
Command line
To delete a dataset configuration, run the gcloud alpha storage insights dataset-configs delete command:

```shell
gcloud alpha storage insights dataset-configs delete DATASET_CONFIG_ID \
  --location=LOCATION
```

Replace:

- DATASET_CONFIG_ID with the name of the dataset configuration you want to delete.
- LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
You can use the following flags with the delete command:

- Use --auto-delete-link to unlink the dataset that was generated from the dataset configuration you want to delete. You must unlink a dataset before you can delete the dataset configuration that generated it.
- Use --retention-period-days=DAYS to specify the number of days of data to capture in the dataset snapshot. For example, 90.
As an alternative to specifying DATASET_CONFIG_ID and LOCATION, you can specify a full dataset configuration path. For example:

```shell
gcloud alpha storage insights dataset-configs delete \
  projects/DESTINATION_PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID
```
REST APIs
JSON API
Have the gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.

Use cURL to call the JSON API with a DeleteDatasetConfig request:

```shell
curl -X DELETE \
  "https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID" \
  --header "Authorization: Bearer ACCESS_TOKEN" \
  --header "Accept: application/json" \
  --header "Content-Type: application/json"
```

Replace:

- PROJECT_ID with the ID of the project that the dataset configuration belongs to.
- LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
- DATASET_CONFIG_ID with the name of the dataset configuration you want to delete.
- ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.