使用多变量时序预测模型执行异常值检测
本教程介绍了如何执行以下任务:
- 创建 ARIMA_PLUS_XREG时序预测模型。
- 通过对模型运行 ML.DETECT_ANOMALIES函数来检测时序数据中的异常值。
本教程使用公共 epa_historical_air_quality 数据集中的以下表,该数据集包含从美国多个城市收集的每日 PM 2.5、温度和风速信息:
- epa_historical_air_quality.pm25_nonfrm_daily_summary
- epa_historical_air_quality.wind_daily_summary
- epa_historical_air_quality.temperature_daily_summary
所需权限
- 如需创建数据集,您需要拥有 - bigquery.datasets.createIAM 权限。
- 如需创建模型,您需要以下权限: - bigquery.jobs.create
- bigquery.models.create
- bigquery.models.getData
- bigquery.models.updateData
 
- 如需运行推理,您需要以下权限: - bigquery.models.getData
- bigquery.jobs.create
 
如需详细了解 BigQuery 中的 IAM 角色和权限,请参阅 IAM 简介。
费用
在本文档中,您将使用 Google Cloud的以下收费组件:
- BigQuery: You incur costs for the data you process in BigQuery.
 
 
 
  如需根据您的预计使用量来估算费用,请使用价格计算器。
  
如需了解详情,请参阅 BigQuery 价格。
准备工作
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- 
    
    
      In the Google Cloud console, on the project selector page, select or create a Google Cloud project. Roles required to select or create a project - Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- 
      Create a project: To create a project, you need the Project Creator
      (roles/resourcemanager.projectCreator), which contains theresourcemanager.projects.createpermission. Learn how to grant roles.
 
- 
  
    Verify that billing is enabled for your Google Cloud project. 
- 
  
  
    
      Enable the BigQuery API. Roles required to enable APIs To enable APIs, you need the Service Usage Admin IAM role ( roles/serviceusage.serviceUsageAdmin), which contains theserviceusage.services.enablepermission. Learn how to grant roles.
- 
    
    
      In the Google Cloud console, on the project selector page, select or create a Google Cloud project. Roles required to select or create a project - Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- 
      Create a project: To create a project, you need the Project Creator
      (roles/resourcemanager.projectCreator), which contains theresourcemanager.projects.createpermission. Learn how to grant roles.
 
- 
  
    Verify that billing is enabled for your Google Cloud project. 
- 
  
  
    
      Enable the BigQuery API. Roles required to enable APIs To enable APIs, you need the Service Usage Admin IAM role ( roles/serviceusage.serviceUsageAdmin), which contains theserviceusage.services.enablepermission. Learn how to grant roles.
- 在 Google Cloud 控制台中,前往 BigQuery 页面。 
- 在探索器窗格中,点击您的项目名称。 
- 点击 查看操作 > 创建数据集 
- 在 创建数据集 页面上,执行以下操作: - 在数据集 ID 部分,输入 - bqml_tutorial。
- 在位置类型部分,选择多区域,然后选择 US (multiple regions in United States)(美国[美国的多个区域])。 
- 保持其余默认设置不变,然后点击创建数据集。 
 
- 创建一个名为 - bqml_tutorial的数据集,并将数据位置设置为- US,说明为- BigQuery ML tutorial dataset:- bq --location=US mk -d \ --description "BigQuery ML tutorial dataset." \ bqml_tutorial - 该命令使用的不是 - --dataset标志,而是- -d快捷方式。如果省略- -d和- --dataset,该命令会默认创建一个数据集。
- 确认已创建数据集: - bq ls
- date:观察的日期
- PM2.5:每天的平均 PM2.5 值
- wind_speed:每天的平均风速
- temperature:每天的最高温度
- 转到 BigQuery 页面。 
- 在 SQL 编辑器窗格中,运行以下 SQL 语句: - CREATE TABLE `bqml_tutorial.seattle_air_quality_daily` AS WITH pm25_daily AS ( SELECT avg(arithmetic_mean) AS pm25, date_local AS date FROM `bigquery-public-data.epa_historical_air_quality.pm25_nonfrm_daily_summary` WHERE city_name = 'Seattle' AND parameter_name = 'Acceptable PM2.5 AQI & Speciation Mass' GROUP BY date_local ), wind_speed_daily AS ( SELECT avg(arithmetic_mean) AS wind_speed, date_local AS date FROM `bigquery-public-data.epa_historical_air_quality.wind_daily_summary` WHERE city_name = 'Seattle' AND parameter_name = 'Wind Speed - Resultant' GROUP BY date_local ), temperature_daily AS ( SELECT avg(first_max_value) AS temperature, date_local AS date FROM `bigquery-public-data.epa_historical_air_quality.temperature_daily_summary` WHERE city_name = 'Seattle' AND parameter_name = 'Outdoor Temperature' GROUP BY date_local ) SELECT pm25_daily.date AS date, pm25, wind_speed, temperature FROM pm25_daily JOIN wind_speed_daily USING (date) JOIN temperature_daily USING (date) 
- 转到 BigQuery 页面。 
- 在 SQL 编辑器窗格中,运行以下 SQL 语句: - CREATE OR REPLACE MODEL `bqml_tutorial.arimax_model` OPTIONS ( model_type = 'ARIMA_PLUS_XREG', auto_arima=TRUE, time_series_data_col = 'temperature', time_series_timestamp_col = 'date' ) AS SELECT * FROM `bqml_tutorial.seattle_air_quality_daily` WHERE date < "2023-02-01"; - 查询需要几秒钟才能完成,之后模型 - arimax_model会显示在- bqml_tutorial数据集中,并且可以在探索器窗格中访问。- 由于查询使用 - CREATE MODEL语句来创建模型,因此没有查询结果。
- 转到 BigQuery 页面。 
- 在 SQL 编辑器窗格中,运行以下 SQL 语句: - SELECT * FROM ML.DETECT_ANOMALIES ( MODEL `bqml_tutorial.arimax_model`, STRUCT(0.6 AS anomaly_prob_threshold) ) ORDER BY date ASC; - 结果类似于以下内容: - +-------------------------+-------------+------------+--------------------+--------------------+---------------------+ | date | temperature | is_anomaly | lower_bound | upper_bound | anomaly_probability | +--------------------------------------------------------------------------------------------------------------------+ | 2009-08-11 00:00:00 UTC | 70.1 | false | 67.647370742988727 | 72.552629257011262 | 0 | +--------------------------------------------------------------------------------------------------------------------+ | 2009-08-12 00:00:00 UTC | 73.4 | false | 71.7035428351283 | 76.608801349150838 | 0.20478819992561115 | +--------------------------------------------------------------------------------------------------------------------+ | 2009-08-13 00:00:00 UTC | 64.6 | true | 67.740408724826068 | 72.6456672388486 | 0.945588334903206 | +-------------------------+-------------+------------+--------------------+--------------------+---------------------+ 
- 转到 BigQuery 页面。 
- 在 SQL 编辑器窗格中,运行以下 SQL 语句: - SELECT * FROM ML.DETECT_ANOMALIES ( MODEL `bqml_tutorial.arimax_model`, STRUCT(0.6 AS anomaly_prob_threshold), ( SELECT * FROM UNNEST( [ STRUCT<date TIMESTAMP, pm25 FLOAT64, wind_speed FLOAT64, temperature FLOAT64> ('2023-02-01 00:00:00 UTC', 8.8166665, 1.6525, 44.0), ('2023-02-02 00:00:00 UTC', 11.8354165, 1.558333, 40.5), ('2023-02-03 00:00:00 UTC', 10.1395835, 1.6895835, 46.5), ('2023-02-04 00:00:00 UTC', 11.439583500000001, 2.0854165, 45.0), ('2023-02-05 00:00:00 UTC', 9.7208335, 1.7083335, 46.0), ('2023-02-06 00:00:00 UTC', 13.3020835, 2.23125, 43.5), ('2023-02-07 00:00:00 UTC', 5.7229165, 2.377083, 47.5), ('2023-02-08 00:00:00 UTC', 7.6291665, 2.24375, 44.5), ('2023-02-09 00:00:00 UTC', 8.5208335, 2.2541665, 40.5), ('2023-02-10 00:00:00 UTC', 9.9086955, 7.333335, 39.5) ] ) ) ); - 结果类似于以下内容: - +-------------------------+-------------+------------+--------------------+--------------------+---------------------+------------+------------+ | date | temperature | is_anomaly | lower_bound | upper_bound | anomaly_probability | pm25 | wind_speed | +----------------------------------------------------------------------------------------------------------------------------------------------+ | 2023-02-01 00:00:00 UTC | 44.0 | true | 36.89918003713138 | 41.8044385511539 | 0.88975675709801583 | 8.8166665 | 1.6525 | +----------------------------------------------------------------------------------------------------------------------------------------------+ | 2023-02-02 00:00:00 UTC | 40.5 | false | 34.439946284051572 | 40.672021330796483 | 0.57358239699845348 | 11.8354165 | 1.558333 | +--------------------------------------------------------------------------------------------------------------------+-------------------------+ | 2023-02-03 00:00:00 UTC | 46.5 | true | 33.615139992931191 | 40.501364463964549 | 0.97902867696346974 | 10.1395835 | 1.6895835 | +-------------------------+-------------+------------+--------------------+--------------------+---------------------+-------------------------+ 
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
创建数据集
创建 BigQuery 数据集以存储机器学习模型。
控制台
bq
如需创建新数据集,请使用带有 --location 标志的 bq mk 命令。 如需查看完整的潜在参数列表,请参阅 bq mk --dataset 命令参考文档。
API
使用已定义的数据集资源调用 datasets.insert 方法。
{ "datasetReference": { "datasetId": "bqml_tutorial" } }
BigQuery DataFrame
在尝试此示例之前,请按照《BigQuery 快速入门:使用 BigQuery DataFrames》中的 BigQuery DataFrames 设置说明进行操作。如需了解详情,请参阅 BigQuery DataFrames 参考文档。
如需向 BigQuery 进行身份验证,请设置应用默认凭证。如需了解详情,请参阅为本地开发环境设置 ADC。
准备训练数据
PM2.5、温度和风速数据位于不同的表中。通过组合这些公开表中的数据来创建 bqml_tutorial.seattle_air_quality_daily 训练数据表。bqml_tutorial.seattle_air_quality_daily 包含以下列:
新表包含 2009 年 8 月 11 日至 2022 年 1 月 31 日的每日数据。
创建模型
使用 bqml_tutorial.seattle_air_quality_daily 中的数据作为训练数据,创建多变量时序模型。
对历史数据执行异常值检测
针对用于训练模型的历史数据运行异常值检测。
对新数据执行异常值检测
对您生成的新数据运行异常值检测。