This document shows you how to configure Storage Insights datasets.
Before you begin
Before you configure a dataset, complete the following steps.
Get the required roles
To get the permissions that you need to configure datasets, ask your administrator to grant you the following IAM roles on your source projects:

- To configure a dataset: Storage Insights Admin (roles/storageinsights.admin)
- To link a dataset: Storage Insights Analyst (roles/storageinsights.analyst) and BigQuery Admin (roles/bigquery.admin)
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to configure datasets. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to configure datasets:

- Configure a dataset:
  - storageinsights.datasetConfigs.create
  - storage.buckets.getObjectInsights
- Link to a BigQuery dataset:
  - storageinsights.datasetConfigs.linkDataset
You might also be able to get these permissions with custom roles or other predefined roles.
Enable the Storage Insights API
Command line
To enable the Storage Insights API in your current project, run the gcloud services enable command:
gcloud services enable storageinsights.googleapis.com
For more information about enabling services for a Google Cloud project, see Enabling and disabling services.
Configure Storage Intelligence
Ensure that Storage Intelligence is configured for the project, folder, or organization that you want to analyze with datasets.
Create a dataset configuration
To create a dataset configuration, follow these steps. For more information about the fields you can specify for the dataset configuration, see How Storage Insights datasets work.
Console
- In the Google Cloud console, go to the Cloud Storage Storage Insights page.
Click Configure dataset.
In the Name your dataset section, enter a name for your dataset. Optionally, enter a description for the dataset. Names identify dataset configurations and are immutable. The name can contain up to 128 characters, including letters, numbers, and underscores, and must start with a letter.
In the Define dataset scope section, do the following:
Select one of the following options:
To get storage metadata for all projects in the current organization, select Include the organization.
To get storage metadata for all projects in the selected folders, select Include folders (Sub-organization/departments). For information about getting folder IDs, see Viewing or listing folders and projects. To add folders:
- In the Folder 1 field, enter the folder ID.
- Optionally, to add multiple folder IDs, click + Add another folder.
To get storage metadata for the selected projects, select Include projects by providing project numbers. To learn how to find project numbers, see Find the project name, number, and ID. To add projects, do the following:
- In the Project 1 field, enter the project number.
- Optionally, to add multiple project numbers, click + Add another project.
To add projects or folders in bulk, select Upload a list of projects/folders via CSV file. The CSV file must contain the project numbers or folder IDs to include in the dataset. You can specify up to 10,000 projects or folders in one dataset configuration.
Specify whether to automatically include future buckets in the selected resource.
Optionally, to specify filters on buckets based on regions and bucket prefixes, expand the Filters (optional) section. Filters are applied additively on buckets.
You can include or exclude buckets from specific regions. For example, you can exclude buckets in the me-central1 and me-central2 regions. You can also include or exclude buckets by prefix. For example, to exclude buckets that start with my-bucket, enter the my-bucket* prefix.
Click Continue.
In the Select retention period section, select a retention period for the data in the dataset.
Activity data is included in the dataset by default, and inherits the retention period of the dataset. To override the dataset retention period, select Specify a retention period for activity data, and then select the number of days to retain activity data for. To disable activity data, set the retention period to 0 days.
In the Select location to store configured dataset section, select a location to store the dataset. For example, us-central1.
In the Select service account type section, select a service agent type for your dataset. Choose either a configuration-scoped or project-scoped service agent for your dataset.
Click Configure.
Command line
To create a dataset configuration, run the gcloud storage insights dataset-configs create command with the required flags:

gcloud storage insights dataset-configs create DATASET_CONFIG_ID \
  --location=LOCATION \
  --organization=SOURCE_ORG_NUMBER \
  --retention-period-days=DATASET_RETENTION_PERIOD_DAYS \
  SCOPE_FLAG
Replace:
- DATASET_CONFIG_ID with the name for your dataset configuration. Names identify dataset configurations and are immutable. The name can contain up to 128 characters, including letters, numbers, and underscores, and must start with a letter.
- LOCATION with the location to store the dataset. For example, us-central1.
- SOURCE_ORG_NUMBER with the ID of the organization to which the source projects belong. To find your organization ID, see Getting your organization resource ID.
- DATASET_RETENTION_PERIOD_DAYS with the retention period for the data in the dataset.
- SCOPE_FLAG with any one of the following flags that defines the scope of the data to collect:
  - --enable-organization-scope: Enables the dataset to collect insights from all buckets within the organization.
  - --source-folders=[SOURCE_FOLDER_NUMBERS,...]: Specifies a list of folder numbers to include in the dataset. To learn how to find a folder number, see Listing all projects and folders in your hierarchy.
  - --source-folders-file=FILE_PATH: Specifies multiple folder numbers by uploading a CSV file to a bucket.
  - --source-projects=[SOURCE_PROJECT_NUMBERS,...]: Specifies a list of project numbers to include in the dataset. For example, 464036093014. To find your project number, see Find the project name, number, and ID.
  - --source-projects-file=FILE_PATH: Specifies multiple project numbers by uploading a CSV file to a bucket.
Optionally, use the following additional flags to configure the dataset:
- Use --include-buckets=BUCKET_NAMES_OR_REGEX to include specific buckets by name or regular expression. You can't use this flag with --exclude-buckets.
- Use --exclude-buckets=BUCKET_NAMES_OR_REGEX to exclude specific buckets by name or regular expression. You can't use this flag with --include-buckets.
- Use --project=DESTINATION_PROJECT_ID to specify a project for storing your dataset configuration and generated dataset. If you don't use this flag, the destination project is your active project. For more information about project IDs, see Creating and managing projects.
- Use --auto-add-new-buckets to automatically include any buckets added to source projects in the future.
- Use --skip-verification to skip checks and failures from the verification process, which includes checks for required IAM permissions. If you use this flag, some or all buckets might be excluded from the dataset.
- Use --identity=IDENTITY_TYPE to specify the scope of the service agent created with the dataset configuration. Values are IDENTITY_TYPE_PER_CONFIG or IDENTITY_TYPE_PER_PROJECT. If unspecified, the default is IDENTITY_TYPE_PER_CONFIG. For details, see Service agent type.
- Use --description=DESCRIPTION to add a description for the dataset configuration.
- Use --activity-data-retention-period-days=ACTIVITY_RETENTION_PERIOD_DAYS to specify the retention period for the activity data in the dataset. By default, activity data is included in the dataset, and inherits the retention period of the dataset. To override the dataset retention period, specify the number of days to retain activity data for. To exclude activity data, set ACTIVITY_RETENTION_PERIOD_DAYS to 0.
The following example creates a dataset configuration named my-dataset in the us-central1 region, for the organization with the ID 123456789, with a retention period of 30 days, and a scope limited to the projects 987654321 and 123123123:

gcloud storage insights dataset-configs create my-dataset \
  --location=us-central1 \
  --organization=123456789 \
  --retention-period-days=30 \
  --source-projects=987654321,123123123
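As a further sketch, the optional flags can be combined with the required ones in a single invocation. All of the IDs, the bucket pattern, and the description below are hypothetical values chosen for illustration:

```shell
# Hypothetical IDs, bucket pattern, and description, for illustration only.
DATASET_CONFIG_ID="my-dataset"
ORG_ID="123456789"

# Creates a project-scoped dataset configuration that also picks up future
# buckets and excludes buckets whose names match the "temp-.*" pattern.
gcloud storage insights dataset-configs create "${DATASET_CONFIG_ID}" \
  --location=us-central1 \
  --organization="${ORG_ID}" \
  --retention-period-days=30 \
  --source-projects=987654321,123123123 \
  --auto-add-new-buckets \
  --exclude-buckets="temp-.*" \
  --identity=IDENTITY_TYPE_PER_CONFIG \
  --description="Inventory dataset for the analytics team"
```

Because --exclude-buckets is used here, --include-buckets can't also be specified in the same command.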
JSON API
Have the gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.

Create a JSON file that contains the following information:

{
  "sourceProjects": {
    "project_numbers": ["PROJECT_NUMBERS", ...]
  },
  "retentionPeriodDays": "RETENTION_PERIOD_DAYS",
  "activityDataRetentionPeriodDays": "ACTIVITY_DATA_RETENTION_PERIOD_DAYS",
  "identity": {
    "type": "IDENTITY_TYPE"
  }
}
Replace:
- PROJECT_NUMBERS with the numbers of the projects you want to include in the dataset. You can specify one project or multiple projects. Projects must be specified as a list of strings. Alternatively, you can add an organization, or one or multiple folders that contain buckets and objects for which you want to update the metadata. To include folders or organizations, use the sourceFolders or organizationScope fields. For more information, see the DatasetConfig reference.
- RETENTION_PERIOD_DAYS with the number of days of data to capture in the dataset snapshot. For example, 90.
- ACTIVITY_DATA_RETENTION_PERIOD_DAYS with the number of days of activity data to capture in the dataset snapshot. By default, activity data is included in the dataset, and inherits the retention period of the dataset. To override the dataset retention period, specify the number of days to retain activity data for. To exclude activity data, set ACTIVITY_DATA_RETENTION_PERIOD_DAYS to 0.
- IDENTITY_TYPE with the type of service account that gets created alongside the dataset configuration. Values are IDENTITY_TYPE_PER_CONFIG or IDENTITY_TYPE_PER_PROJECT. For details, see Service agent type.
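For example, a filled-in request body scoped to two projects, keeping snapshot data for 90 days and excluding activity data, might look like the following sketch. The project numbers are hypothetical:

```json
{
  "sourceProjects": {
    "project_numbers": ["987654321", "123123123"]
  },
  "retentionPeriodDays": 90,
  "activityDataRetentionPeriodDays": 0,
  "identity": {
    "type": "IDENTITY_TYPE_PER_CONFIG"
  }
}
```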
To create the dataset configuration, use cURL to call the JSON API with a CreateDatasetConfig request:

curl -X POST --data-binary @JSON_FILE_NAME \
  "https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs?datasetConfigId=DATASET_CONFIG_ID" \
  --header "Authorization: Bearer $(gcloud auth print-access-token --impersonate-service-account=SERVICE_ACCOUNT)" \
  --header "Accept: application/json" \
  --header "Content-Type: application/json"
Replace:
- JSON_FILE_NAME with the path to the JSON file you created in the previous step. Alternatively, you can pass an instance of DatasetConfig in the request body.
- PROJECT_ID with the ID of the project that the dataset configuration and dataset will belong to.
- LOCATION with the location where the dataset and dataset configuration will reside. For example, us-central1.
- DATASET_CONFIG_ID with the name of your dataset configuration. Names identify dataset configurations and are immutable. The name can contain up to 128 characters, including letters, numbers, and underscores, and must start with a letter.
- SERVICE_ACCOUNT with the service account to impersonate. For example, test-service-account@test-project.iam.gserviceaccount.com.
To troubleshoot snapshot processing errors that are logged in error_attributes_view, see Storage Insights dataset errors.
Grant the required permissions to the service agent
Google Cloud creates a configuration-scoped or project-scoped service
agent when you create a dataset configuration. The service
agent follows the naming format
service-PROJECT_NUMBER@gcp-sa-storageinsights.iam.gserviceaccount.com and appears on the
IAM page in the Google Cloud console
when you select the Include Google-provided role grants checkbox.
You can also find the name of the service agent by
viewing the DatasetConfig resource using the JSON API.
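As a sketch of that lookup, you can fetch the DatasetConfig resource with cURL; the project ID, location, and configuration name below are hypothetical, and the service agent's email appears in the response:

```shell
# Hypothetical project, location, and configuration names.
PROJECT_ID="my-project"
LOCATION="us-central1"
DATASET_CONFIG_ID="my-dataset"

# The response body includes the service agent's email address.
curl -X GET \
  "https://storageinsights.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/datasetConfigs/${DATASET_CONFIG_ID}" \
  --header "Authorization: Bearer $(gcloud auth print-access-token)" \
  --header "Accept: application/json"
```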
To enable Storage Insights to generate and write datasets, ask your
administrator to grant the service agent the Storage Insights Collector
Service role (roles/storage.insightsCollectorService) on the organization
that contains the source projects.
You must grant this role to every configuration-scoped service agent
created for each dataset configuration from which you want data. If you use
a project-scoped service agent, you only need to grant this role once; that
service agent can then read and write datasets for all dataset configurations
within the project.
For instructions about granting roles for projects, see Manage access.
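As a minimal sketch, the grant can be made with gcloud; the organization ID and the service agent email below are hypothetical placeholders:

```shell
# Hypothetical organization ID and service agent email, for illustration only.
ORG_ID="123456789"
SERVICE_AGENT="service-987654321@gcp-sa-storageinsights.iam.gserviceaccount.com"

# Grants the Storage Insights Collector Service role on the organization
# that contains the source projects.
gcloud organizations add-iam-policy-binding "${ORG_ID}" \
  --member="serviceAccount:${SERVICE_AGENT}" \
  --role="roles/storage.insightsCollectorService"
```

Running this command requires permission to set IAM policy on the organization.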
Link a dataset
To link a dataset to BigQuery, complete the following steps:
Console
- In the Google Cloud console, go to the Cloud Storage Storage Insights page.
Click the name of the dataset configuration that generated the dataset you want to link.
In the BigQuery linked dataset section, click Link dataset to link your dataset.
Command line
To link a dataset to BigQuery, run the gcloud storage insights dataset-configs create-link command:

gcloud storage insights dataset-configs create-link DATASET_CONFIG_ID --location=LOCATION
Replace:
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset to link.
- LOCATION with the location of your dataset. For example, us-central1.
You can also specify a full dataset configuration path. For example:
gcloud storage insights dataset-configs create-link projects/DESTINATION_PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID
Replace:
- DESTINATION_PROJECT_ID with the ID of the project that contains the dataset configuration. For more information about project IDs, see Creating and managing projects.
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset to link.
- LOCATION with the location of your dataset and dataset configuration. For example, us-central1.
JSON API
Have the gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.

Use cURL to call the JSON API with a linkDataset DatasetConfig request:

curl -X POST \
  "https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID:linkDataset" \
  --header "Authorization: Bearer $(gcloud auth print-access-token --impersonate-service-account=SERVICE_ACCOUNT)" \
  --header "Accept: application/json" \
  --header "Content-Type: application/json"

Replace:
- PROJECT_ID with the ID of the project to which the dataset configuration belongs.
- LOCATION with the location where the dataset and dataset configuration reside. For example, us-central1.
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset to link.
- SERVICE_ACCOUNT with the service account to impersonate. For example, test-service-account@test-project.iam.gserviceaccount.com.
What's next
- View linked datasets.
- Query a linked dataset.
- Analyze your stored data with Gemini Cloud Assist.
- Manage your dataset configurations, including updating, viewing, listing, and deleting them.