Configure Storage Insights datasets

This document shows you how to configure Storage Insights datasets.

Before you begin

Before you configure a dataset, complete the following steps.

Get the required roles

To get the permissions that you need to configure datasets, ask your administrator to grant you the following IAM roles on your source projects:

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to configure datasets. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to configure datasets:

  • Configure a dataset:
    • storageinsights.datasetConfigs.create
    • storage.buckets.getObjectInsights
  • Link to BigQuery dataset: storageinsights.datasetConfigs.linkDataset

You might also be able to get these permissions with custom roles or other predefined roles.

Enable the Storage Insights API

Console

Enable the storageinsights.googleapis.com API

Command line

To enable the Storage Insights API in your current project, run the gcloud services enable command:

gcloud services enable storageinsights.googleapis.com

For more information about enabling services for a Google Cloud project, see Enabling and disabling services.

Configure Storage Intelligence

Ensure that Storage Intelligence is configured for the project, folder, or organization that you want to analyze with datasets.

Create a dataset configuration

To create a dataset configuration, follow these steps. For more information about the fields you can specify for the dataset configuration, see How Storage Insights datasets work.

Console

  1. In the Google Cloud console, go to the Cloud Storage Storage Insights page.

    Go to Storage Insights

  2. Click Configure dataset.

  3. In the Name your dataset section, enter a name for your dataset. Optionally, enter a description for the dataset. Names identify dataset configurations and are immutable. The name can contain up to 128 characters, including letters, numbers, and underscores, and must start with a letter.

  4. In the Define dataset scope section, do the following:

    • Select one of the following options:

      • To get storage metadata for all projects in the current organization, select Include the organization.

      • To get storage metadata for all projects in the selected folders, select Include folders (Sub-organization/departments). For information about getting folder IDs, see Viewing or listing folders and projects. To add folders:

        1. In the Folder 1 field, enter the folder ID.
        2. Optionally, to add multiple folder IDs, click + Add another folder.
      • To get storage metadata for the selected projects, select Include projects by providing project numbers. To learn how to find project numbers, see Find the project name, number, and ID. To add projects, do the following:

        1. In the Project 1 field, enter the project number.
        2. Optionally, to add multiple project numbers, click + Add another project.
      • To add projects or folders in bulk, select Upload a list of projects/folders via CSV file. The CSV file must contain the project numbers or folder IDs to include in the dataset. You can specify up to 10,000 projects or folders in one dataset configuration.

    • Specify whether to automatically include future buckets in the selected resource.

    • Optionally, to specify filters on buckets based on regions and bucket prefixes, expand the Filters (optional) section. Filters are applied additively on buckets.

      You can include or exclude buckets from specific regions. For example, you can exclude buckets in the me-central1 and me-central2 regions. You can also include or exclude buckets by prefix. For example, to exclude buckets that start with my-bucket, enter the my-bucket* prefix.

  5. Click Continue.

  6. In the Select retention period section, select a retention period for the data in the dataset.

  7. Activity data is included in the dataset by default, and inherits the retention period of the dataset. To override the dataset retention period, select Specify a retention period for activity data, and then select the number of days to retain activity data for. To disable activity data, set the retention period to 0 days.

  8. In the Select location to store configured dataset section, select a location to store the dataset. For example, us-central1.

  9. In the Select service account type section, select a service agent type for your dataset: either a configuration-scoped or a project-scoped service agent.

  10. Click Configure.

Command line

  1. To create a dataset configuration, run the gcloud storage insights dataset-configs create command with the required flags:

    gcloud storage insights dataset-configs create DATASET_CONFIG_ID \
      --location=LOCATION \
      --organization=SOURCE_ORG_NUMBER \
      --retention-period-days=DATASET_RETENTION_PERIOD_DAYS \
      (SCOPE_FLAG)
    

    Replace:

    • DATASET_CONFIG_ID with the name for your dataset configuration. Names identify dataset configurations and are immutable. The name can contain up to 128 characters, including letters, numbers, and underscores, and must start with a letter.

    • LOCATION with the location to store the dataset. For example, us-central1.

    • SOURCE_ORG_NUMBER with the ID of the organization to which the source projects belong. To find your organization ID, see Getting your organization resource ID.

    • DATASET_RETENTION_PERIOD_DAYS with the retention period for the data in the dataset.

    • SCOPE_FLAG with one of the following flags, which define the scope of the data to collect:

      • --enable-organization-scope: Enables the dataset to collect insights from all buckets within the organization.
      • --source-folders=[SOURCE_FOLDER_NUMBERS,...]: Specifies a list of folder numbers to include in the dataset. To learn how to find a folder number, see Listing all projects and folders in your hierarchy.
      • --source-folders-file=FILE_PATH: Specifies multiple folder numbers by uploading a CSV file to a bucket.
      • --source-projects=[SOURCE_PROJECT_NUMBERS,...]: Specifies a list of project numbers to include in the dataset. For example, 464036093014. To find your project number, see Find the project name, number, and ID.
      • --source-projects-file=FILE_PATH: Specifies multiple project numbers by uploading a CSV file to a bucket.

    Optionally, use the following additional flags to configure the dataset:

    • Use --include-buckets=BUCKET_NAMES_OR_REGEX to include specific buckets by name or regular expression. You can't use this flag with --exclude-buckets.

    • Use --exclude-buckets=BUCKET_NAMES_OR_REGEX to exclude specific buckets by name or regular expression. You can't use this flag with --include-buckets.

    • Use --project=DESTINATION_PROJECT_ID to specify a project for storing your dataset configuration and generated dataset. If you don't use this flag, the destination project is your active project. For more information about project IDs, see Creating and managing projects.

    • Use --auto-add-new-buckets to automatically include any buckets added to source projects in the future.

    • Use --skip-verification to skip checks and failures from the verification process, which includes checks for required IAM permissions. If you use this flag, some or all buckets might be excluded from the dataset.

    • Use --identity=IDENTITY_TYPE to specify the scope of the service agent created with the dataset configuration. Values are IDENTITY_TYPE_PER_CONFIG or IDENTITY_TYPE_PER_PROJECT. If unspecified, the default is IDENTITY_TYPE_PER_CONFIG. For details, see Service agent type.

    • Use --description=DESCRIPTION to add a description for the dataset configuration.

    • Use --activity-data-retention-period-days=ACTIVITY_RETENTION_PERIOD_DAYS to specify the retention period for the activity data in the dataset. By default, activity data is included in the dataset, and inherits the retention period of the dataset. To override the dataset retention period, specify the number of days to retain activity data for. To exclude activity data, set the ACTIVITY_RETENTION_PERIOD_DAYS to 0.

    The following example creates a dataset configuration named my-dataset in the us-central1 region, for the organization with the ID 123456789, with a retention period of 30 days, and a scope limited to the projects 987654321 and 123123123:

    gcloud storage insights dataset-configs create my-dataset \
      --location=us-central1 \
      --organization=123456789 \
      --retention-period-days=30 \
      --source-projects=987654321,123123123
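
    If you scope the dataset with --source-projects-file or --source-folders-file instead, the CSV file is a plain list of IDs, one per line. The following sketch creates such a file locally (the project numbers are illustrative; depending on your setup, FILE_PATH might need to point to a copy of the file uploaded to a bucket):

```shell
# Create a CSV listing one project number per line (illustrative values).
cat > source-projects.csv <<'EOF'
987654321
123123123
EOF

# Each line must contain a single project number or folder ID.
wc -l < source-projects.csv
```

    You can then pass the file with, for example, --source-projects-file=source-projects.csv.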
    

JSON API

  1. Install and initialize the gcloud CLI, which lets you generate an access token for the Authorization header.

  2. Create a JSON file that contains the following information:

    {
      "sourceProjects": {
        "project_numbers": ["PROJECT_NUMBERS", ...]
      },
      "retentionPeriodDays": "RETENTION_PERIOD_DAYS",
      "activityDataRetentionPeriodDays": "ACTIVITY_DATA_RETENTION_PERIOD_DAYS",
      "identity": {
        "type": "IDENTITY_TYPE"
      }
    }

    Replace:

    • PROJECT_NUMBERS with the numbers of the projects you want to include in the dataset. You can specify one project or multiple projects. Projects must be specified as a list of strings.

      Alternatively, you can add an organization, or one or multiple folders that contain buckets and objects for which you want to update the metadata. To include folders or organizations, use the sourceFolders or organizationScope fields. For more information, see the DatasetConfig reference.

    • RETENTION_PERIOD_DAYS with the number of days of data to capture in the dataset snapshot. For example, 90.

    • ACTIVITY_DATA_RETENTION_PERIOD_DAYS with the number of days of activity data to capture in the dataset snapshot. By default, activity data is included in the dataset, and inherits the retention period of the dataset. To override the dataset retention period, specify the number of days to retain activity data for. To exclude activity data, set ACTIVITY_DATA_RETENTION_PERIOD_DAYS to 0.

    • IDENTITY_TYPE with the type of service account that gets created alongside the dataset configuration. Values are IDENTITY_TYPE_PER_CONFIG or IDENTITY_TYPE_PER_PROJECT. For details, see Service agent type.
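
    For example, a completed file that scopes the dataset to two projects, keeps dataset snapshots for 90 days, keeps activity data for 30 days, and uses a configuration-scoped service agent might look like the following (the project numbers are illustrative):

```json
{
  "sourceProjects": {
    "project_numbers": ["987654321", "123123123"]
  },
  "retentionPeriodDays": "90",
  "activityDataRetentionPeriodDays": "30",
  "identity": {
    "type": "IDENTITY_TYPE_PER_CONFIG"
  }
}
```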

  3. To create the dataset configuration, use cURL to call the JSON API with a Create DatasetConfig request:

    curl -X POST --data-binary @JSON_FILE_NAME \
    "https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs?datasetConfigId=DATASET_CONFIG_ID" \
      --header "Authorization: Bearer $(gcloud auth print-access-token --impersonate-service-account=SERVICE_ACCOUNT)" \
      --header "Accept: application/json" \
      --header "Content-Type: application/json"

    Replace:

    • JSON_FILE_NAME with the path to the JSON file you created in the previous step. Alternatively, you can pass an instance of DatasetConfig in the request body.

    • PROJECT_ID with the ID of the project that the dataset configuration and dataset will belong to.

    • LOCATION with the location where the dataset and dataset configuration will reside. For example, us-central1.

    • DATASET_CONFIG_ID with the name of your dataset configuration. Names identify dataset configurations and are immutable. The name can contain up to 128 characters, including letters, numbers, and underscores, and must start with a letter.

    • SERVICE_ACCOUNT with the service account. For example, test-service-account@test-project.iam.gserviceaccount.com.

To troubleshoot snapshot processing errors that are logged in error_attributes_view, see Storage Insights dataset errors.

Grant the required permissions to the service agent

Google Cloud creates a configuration-scoped or project-scoped service agent when you create a dataset configuration. The service agent follows the naming format service-PROJECT_NUMBER@gcp-sa-storageinsights.iam.gserviceaccount.com and appears on the IAM page in the Google Cloud console when you select the Include Google-provided role grants checkbox. You can also find the name of the service agent by viewing the DatasetConfig resource using the JSON API.

To enable Storage Insights to generate and write datasets, ask your administrator to grant the service agent the Storage Insights Collector Service role (roles/storage.insightsCollectorService) on the organization that contains the source projects. You must grant this role to every configuration-scoped service agent that's created for a dataset configuration from which you want data. If you use a project-scoped service agent, you only need to grant the role once; the agent can then read and write datasets for all dataset configurations within the project.

For instructions about granting roles for projects, see Manage access.
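
As a sketch, the service agent's email address can be assembled from the project number using the naming format above; the role grant itself is shown as a comment because it requires organization-level permissions and the gcloud CLI. The project and organization numbers are illustrative:

```shell
PROJECT_NUMBER="123456789012"   # illustrative project number
ORGANIZATION_ID="123456789"     # illustrative organization ID

# Service agents follow the documented naming format:
SERVICE_AGENT="service-${PROJECT_NUMBER}@gcp-sa-storageinsights.iam.gserviceaccount.com"
echo "${SERVICE_AGENT}"

# An administrator could then run, for example:
# gcloud organizations add-iam-policy-binding "${ORGANIZATION_ID}" \
#   --member="serviceAccount:${SERVICE_AGENT}" \
#   --role="roles/storage.insightsCollectorService"
```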

Link your dataset to BigQuery

To link a dataset to BigQuery, complete the following steps:

Console

  1. In the Google Cloud console, go to the Cloud Storage Storage Insights page.

    Go to Storage Insights

  2. Click the name of the dataset configuration that generated the dataset you want to link.

  3. In the BigQuery linked dataset section, click Link dataset to link your dataset.

Command line

  1. To link a dataset to BigQuery, run the gcloud storage insights dataset-configs create-link command:

    gcloud storage insights dataset-configs create-link DATASET_CONFIG_ID --location=LOCATION

    Replace:

    • DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset to link.

    • LOCATION with the location of your dataset. For example, us-central1.

    You can also specify a full dataset configuration path. For example:

    gcloud storage insights dataset-configs create-link projects/DESTINATION_PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID

    Replace:

    • DESTINATION_PROJECT_ID with the ID of the project that contains the dataset configuration. For more information about project IDs, see Creating and managing projects.

    • DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset to link.

    • LOCATION with the location of your dataset and dataset configuration. For example, us-central1.

JSON API

  1. Install and initialize the gcloud CLI, which lets you generate an access token for the Authorization header.

  2. Use cURL to call the JSON API with a linkDataset DatasetConfig request:

    curl -X POST \
      "https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigsDATASET_CONFIG_ID:linkDataset?" \
        --header "Authorization: Bearer $(gcloud auth print-access-token --impersonate-service-account=SERVICE_ACCOUNT)" \
        --header "Accept: application/json" \
        --header "Content-Type: application/json"
    

    Replace:

    • PROJECT_ID with the ID of the project to which the dataset configuration belongs.

    • LOCATION with the location where the dataset and dataset configuration reside. For example, us-central1.

    • DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset to link.

    • SERVICE_ACCOUNT with the service account. For example, test-service-account@test-project.iam.gserviceaccount.com.
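
    As a sanity check, you can assemble the request URL from these values before calling the API; note the slash between datasetConfigs and the configuration name. The values below are illustrative:

```shell
PROJECT_ID="my-project"          # illustrative project ID
LOCATION="us-central1"
DATASET_CONFIG_ID="my-dataset"   # illustrative configuration name

URL="https://storageinsights.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/datasetConfigs/${DATASET_CONFIG_ID}:linkDataset"
echo "${URL}"
```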

What's next