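Try BigQuery DataFrames

In this quickstart, you use the BigQuery DataFrames API in a BigQuery notebook to perform the following analysis and machine learning (ML) tasks:

  • Create a DataFrame over the bigquery-public-data.ml_datasets.penguins public dataset.
  • Calculate the average body mass of a penguin.
  • Create a linear regression model.
  • Create a DataFrame over a subset of the penguin data to use as training data.
  • Clean up the training data.
  • Set the model parameters.
  • Fit the model.
  • Score the model.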

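Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how the products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.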
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Verify that the BigQuery API is enabled.

    Enable the API

    If you created a new project, the BigQuery API is automatically enabled.

Required permissions

To create and run notebooks, you need the appropriate Identity and Access Management (IAM) roles.

Create a notebook

Follow the instructions in "Create a notebook from the BigQuery editor" to create a new notebook.

Try BigQuery DataFrames

To try BigQuery DataFrames, follow these steps:

  1. Create a new code cell in the notebook.
  2. Add the following code to the code cell:

    import bigframes.pandas as bpd
    
    # Set BigQuery DataFrames options
    # Note: The project option is not required in all environments.
    # On BigQuery Studio, the project ID is automatically detected.
    bpd.options.bigquery.project = your_gcp_project_id
    
    # Use "partial" ordering mode to generate more efficient queries, but the
    # order of the rows in DataFrames may not be deterministic if you have not
    # explicitly sorted it. Some operations that depend on the order, such as
    # head() will not function until you explicitly order the DataFrame. Set the
    # ordering mode to "strict" (default) for more pandas compatibility.
    bpd.options.bigquery.ordering_mode = "partial"
    
    # Create a DataFrame from a BigQuery table
    query_or_table = "bigquery-public-data.ml_datasets.penguins"
    df = bpd.read_gbq(query_or_table)
    
    # Efficiently preview the results using the .peek() method.
    df.peek()
    
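  3. Modify the bpd.options.bigquery.project = your_gcp_project_id line to specify your Google Cloud project ID. For example: bpd.options.bigquery.project = "myProjectID"

  4. Run the code cell.

    The code returns a DataFrame object with data about penguins.

  5. Create a new code cell in the notebook and add the following code: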

    # Use the DataFrame just as you would a pandas DataFrame, but calculations
    # happen in the BigQuery query engine instead of the local system.
    average_body_mass = df["body_mass_g"].mean()
    print(f"average_body_mass: {average_body_mass}")
    
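  6. Run the code cell.

    The code calculates the average body mass of the penguins and prints it to the Google Cloud console.

  7. Create a new code cell in the notebook and add the following code: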

    # Create the Linear Regression model
    from bigframes.ml.linear_model import LinearRegression
    
    # Filter down to the data we want to analyze
    adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]
    
    # Drop the columns we don't care about
    adelie_data = adelie_data.drop(columns=["species"])
    
    # Drop rows with nulls to get our training data
    training_data = adelie_data.dropna()
    
    # Pick feature columns and label column
    X = training_data[
        [
            "island",
            "culmen_length_mm",
            "culmen_depth_mm",
            "flipper_length_mm",
            "sex",
        ]
    ]
    y = training_data[["body_mass_g"]]
    
    model = LinearRegression(fit_intercept=False)
    model.fit(X, y)
    model.score(X, y)
    
  8. Run the code cell.

    The code returns the model's evaluation metrics.
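
To build intuition for what the cells above compute, the following is a minimal local sketch that uses only the Python standard library and does not call BigQuery. It assumes a handful of made-up rows, hypothetical short species labels, and a single numeric feature, whereas the quickstart model uses five feature columns, including categorical ones. It is an illustration of the ideas (skipping missing values in a mean, dropping incomplete rows, least squares without an intercept, and an R² score), not a description of how BigQuery implements them.

```python
from statistics import fmean

# Hypothetical rows standing in for the penguins table (all values are made up).
rows = [
    {"species": "Adelie", "flipper_length_mm": 180.0, "body_mass_g": 3610.0},
    {"species": "Adelie", "flipper_length_mm": 190.0, "body_mass_g": 3780.0},
    {"species": "Adelie", "flipper_length_mm": None, "body_mass_g": 3700.0},
    {"species": "Adelie", "flipper_length_mm": 200.0, "body_mass_g": 4010.0},
    {"species": "Gentoo", "flipper_length_mm": 220.0, "body_mass_g": None},
]

# 1. Average body mass: missing values are skipped, not treated as zero.
masses = [r["body_mass_g"] for r in rows if r["body_mass_g"] is not None]
average_body_mass = fmean(masses)

# 2. Keep one species, then drop rows that contain any missing value
#    (a local stand-in for the boolean filter and dropna() steps).
adelie = [r for r in rows if r["species"] == "Adelie"]
training = [r for r in adelie if all(v is not None for v in r.values())]

# 3. Fit y = slope * x with no intercept term (the idea behind
#    fit_intercept=False). For one feature, least squares gives
#    slope = sum(x * y) / sum(x * x).
x = [r["flipper_length_mm"] for r in training]
y = [r["body_mass_g"] for r in training]
slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# 4. Score the fit with R^2 = 1 - SS_res / SS_tot.
y_mean = fmean(y)
ss_res = sum((b - slope * a) ** 2 for a, b in zip(x, y))
ss_tot = sum((b - y_mean) ** 2 for b in y)
r2 = 1 - ss_res / ss_tot

print(f"average_body_mass: {average_body_mass}")
print(f"slope: {slope}, r2: {r2:.4f}")
```

In the quickstart itself, these calculations run in the BigQuery query engine, and the score is computed on the same rows used for fitting, so it describes how well the model fits the training data rather than how well it generalizes.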

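Clean up

The easiest way to avoid incurring charges is to delete the project that you created for this tutorial.

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next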