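Create a k-means model to cluster the London bicycle hires dataset

This tutorial shows how to use a k-means model in BigQuery ML to identify clusters in a set of data.

The k-means algorithm, which groups data into clusters, is a form of unsupervised machine learning. Unlike supervised machine learning, which is about predictive analytics, unsupervised machine learning is about descriptive analytics. It helps you understand your data so that you can make data-driven decisions.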
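
To make the clustering idea concrete, the following is a minimal pure-Python sketch of Lloyd's algorithm, the textbook procedure usually meant by "k-means". It is an illustration only and is not how BigQuery ML implements training. The function name `kmeans`, the sample points, and the choice of starting centroids are all assumptions made for this example.

```python
import math


def kmeans(points, centroids, max_iters=100):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(max_iters):
        # Assignment step: label each point with the index of its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of the points assigned to it.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # Converged: no centroid moved.
        centroids = new_centroids
    return centroids, assignments


# Hypothetical two-dimensional points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

No labels are supplied to the algorithm; the groups emerge only from distances between points, which is what makes the method unsupervised.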

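The queries in this tutorial use geography functions available in geospatial analytics. For more information, see Introduction to geospatial analytics.

This tutorial uses the London Bicycle Hires public dataset. The data includes start and stop timestamps, station names, and ride duration.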

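Objectives

This tutorial guides you through the following tasks:

Examine the data used to train the model.
Create a k-means clustering model.
Interpret the resulting data clusters by using BigQuery ML's cluster visualization.
Run the ML.PREDICT function on the k-means model to predict the likely cluster for a set of bike hire stations.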

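Costs

This tutorial uses billable components of Google Cloud, including the following:

BigQuery
BigQuery ML

For information about BigQuery costs, see the BigQuery pricing page.

For information about BigQuery ML costs, see BigQuery ML pricing.

Before you begin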

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, enable the BigQuery API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Enable the API

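Required permissions

To create the dataset, you need the bigquery.datasets.create IAM permission.
To create the model, you need the following permissions:
- bigquery.jobs.create
- bigquery.models.create
- bigquery.models.getData
- bigquery.models.updateData
To run inference, you need the following permissions:
- bigquery.models.getData
- bigquery.jobs.create

For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.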

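Create a dataset

Create a BigQuery dataset to store your k-means model:

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery
In the left pane, click Explorer.

If you don't see the left pane, click Expand left pane to open it.
In the Explorer pane, click your project name.
Click View actions > Create dataset.
On the Create dataset page, do the following:
- For Dataset ID, enter bqml_tutorial.
- For Location type, select Multi-region, and then select EU (multiple regions in European Union).
  
  The London Bicycle Hires public dataset is stored in the EU multi-region. Your dataset must be in the same location.
- Leave the remaining default settings as they are, and click Create dataset.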

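Examine the training data

Examine the data that you'll use to train your k-means model. In this tutorial, you cluster bike stations based on the following attributes:

Duration of rentals
Number of trips per day
Distance from city center

SQL

This query extracts data on cycle hires, including the start_station_name and duration columns, and joins this data with station information. As part of the join, it creates a calculated column that contains each station's distance from the city center. Then, in the stationstats common table expression, the query computes each station's attributes: the average ride duration, the number of trips, and the calculated distance_from_city_center column.

Follow these steps to examine the training data:

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery

In the query editor, paste in the following query and click Run: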

WITH
hs AS (
  SELECT
    h.start_station_name AS station_name,
    IF(
      EXTRACT(DAYOFWEEK FROM h.start_date) = 1
        OR EXTRACT(DAYOFWEEK FROM h.start_date) = 7,
      'weekend',
      'weekday') AS isweekday,
    h.duration,
    ST_DISTANCE(ST_GEOGPOINT(s.longitude, s.latitude), ST_GEOGPOINT(-0.1, 51.5)) / 1000
      AS distance_from_city_center
  FROM
    `bigquery-public-data.london_bicycles.cycle_hire` AS h
  JOIN
    `bigquery-public-data.london_bicycles.cycle_stations` AS s
    ON
      h.start_station_id = s.id
  WHERE
    h.start_date
    BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2016-01-01 00:00:00' AS TIMESTAMP)
),
stationstats AS (
  SELECT
    station_name,
    isweekday,
    AVG(duration) AS duration,
    COUNT(duration) AS num_trips,
    MAX(distance_from_city_center) AS distance_from_city_center
  FROM
    hs
  GROUP BY
    station_name, isweekday
)
SELECT *
FROM
stationstats
ORDER BY
distance_from_city_center ASC;

The results should look similar to the following:

[Image: query results]
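
The two calculated columns in the query can be sketched in plain Python. This is an approximation for illustration, not the BigQuery implementation: it assumes the haversine formula on a sphere with a mean Earth radius of 6371.0088 km, so values may differ slightly from what ST_DISTANCE returns. The helper names `distance_from_city_center_km` and `day_type` are hypothetical. The day-type helper follows the query's rule, where DAYOFWEEK values 1 (Sunday) and 7 (Saturday) count as the weekend.

```python
import math
from datetime import date

# Assumption: mean Earth radius in kilometers for a spherical approximation.
EARTH_RADIUS_KM = 6371.0088


def distance_from_city_center_km(lon, lat, center_lon=-0.1, center_lat=51.5):
    """Great-circle (haversine) distance in km from (lon, lat) to the center."""
    phi1 = math.radians(lat)
    phi2 = math.radians(center_lat)
    dphi = phi2 - phi1
    dlam = math.radians(center_lon - lon)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def day_type(d):
    """Return 'weekend' for Saturday or Sunday, otherwise 'weekday'."""
    # date.isoweekday() numbers Monday as 1 and Sunday as 7.
    return "weekend" if d.isoweekday() >= 6 else "weekday"


# Hypothetical station one degree of latitude north of the reference point.
print(round(distance_from_city_center_km(-0.1, 52.5), 1))
print(day_type(date(2015, 1, 3)), day_type(date(2015, 1, 5)))
```

Note that the column named isweekday holds the labels 'weekday' and 'weekend' rather than a boolean, which matches the query.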

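BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.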

import datetime
import typing

import pandas as pd
from shapely.geometry import Point

import bigframes
import bigframes.bigquery as bbq
import bigframes.geopandas
import bigframes.pandas as bpd

# Replace the placeholder string with your own Google Cloud project ID.
bigframes.options.bigquery.project = "your_gcp_project_id"
# Compute in the EU multi-region to query the London bicycles dataset.
bigframes.options.bigquery.location = "EU"

# Extract the information you'll need to train the k-means model in this
# tutorial. Use the read_gbq function to represent cycle hires
# data as a DataFrame.
h = bpd.read_gbq(
    "bigquery-public-data.london_bicycles.cycle_hire",
    col_order=["start_station_name", "start_station_id", "start_date", "duration"],
).rename(
    columns={
        "start_station_name": "station_name",
        "start_station_id": "station_id",
    }
)

# Use GeoSeries.from_xy to build a point for each station, and
# bigframes.bigquery.st_distance to measure each station's distance in
# meters from the reference point. Dividing by 1000 converts to kilometers.
cycle_stations = bpd.read_gbq("bigquery-public-data.london_bicycles.cycle_stations")
s = bpd.DataFrame(
    {
        "id": cycle_stations["id"],
        "xy": bigframes.geopandas.GeoSeries.from_xy(
            cycle_stations["longitude"], cycle_stations["latitude"]
        ),
    }
)
s_distance = bbq.st_distance(s["xy"], Point(-0.1, 51.5), use_spheroid=False) / 1000
s = bpd.DataFrame({"id": s["id"], "distance_from_city_center": s_distance})

# Define Python datetime objects in the UTC timezone for range comparison,
# because BigQuery stores timestamp data in the UTC timezone.
sample_time = datetime.datetime(2015, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)
sample_time2 = datetime.datetime(2016, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)

h = h.loc[(h["start_date"] >= sample_time) & (h["start_date"] <= sample_time2)]

# Replace each day-of-the-week number with the corresponding "weekday" or
# "weekend" label by using the Series.map method.
h = h.assign(
    isweekday=h.start_date.dt.dayofweek.map(
        {
            0: "weekday",
            1: "weekday",
            2: "weekday",
            3: "weekday",
            4: "weekday",
            5: "weekend",
            6: "weekend",
        }
    )
)

# Supplement each trip in "h" with the station distance information from
# "s" by merging the two DataFrames by station ID.
merged_df = h.merge(
    right=s,
    how="inner",
    left_on="station_id",
    right_on="id",
)

# Engineer features to cluster the stations. For each station, find the
# average trip duration, number of trips, and distance from city center.
stationstats = typing.cast(
    bpd.DataFrame,
    merged_df.groupby(["station_name", "isweekday"]).agg(
        {"duration": ["mean", "count"], "distance_from_city_center": "max"}
    ),
)
stationstats.columns = pd.Index(
    ["duration", "num_trips", "distance_from_city_center"]
)
stationstats = stationstats.sort_values(
    by="distance_from_city_center", ascending=True
).reset_index()

# Expected output results: >>> stationstats.head(3)
# station_name	isweekday duration  num_trips	distance_from_city_center
# Borough Road...	weekday	    1110	    5749	    0.12624
# Borough Road...	weekend	    2125	    1774	    0.12624
# Webber Street...	weekday	    795	        6517	    0.164021
#   3 rows × 5 columns
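
The feature-engineering step above groups trips by station and day type, then takes the mean duration, the trip count, and the maximum distance. The following pure-Python sketch mirrors that aggregation on a handful of made-up rows so the logic can be checked without a BigQuery connection. The function name `station_stats` and all sample values are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict

# Made-up rows: (station_name, isweekday, duration, distance_from_city_center).
trips = [
    ("Station A", "weekday", 600, 1.2),
    ("Station A", "weekday", 900, 1.2),
    ("Station A", "weekend", 1800, 1.2),
    ("Station B", "weekday", 300, 4.5),
]


def station_stats(rows):
    """Aggregate trips per (station_name, isweekday) pair."""
    groups = defaultdict(list)
    for name, day_type, duration, distance in rows:
        groups[(name, day_type)].append((duration, distance))

    stats = []
    for (name, day_type), values in groups.items():
        durations = [duration for duration, _ in values]
        stats.append(
            {
                "station_name": name,
                "isweekday": day_type,
                "duration": sum(durations) / len(durations),  # mean duration
                "num_trips": len(durations),  # count of trips
                "distance_from_city_center": max(dist for _, dist in values),
            }
        )
    # Order by distance, closest stations first, as the tutorial code does.
    return sorted(stats, key=lambda row: row["distance_from_city_center"])


stats = station_stats(trips)
for row in stats:
    print(row)
```

Each station can appear twice in the result, once for weekdays and once for weekends, so the rows fed to the clustering model are station and day-type pairs rather than stations alone.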