This tutorial teaches you how to use a boosted trees classifier model to predict the income range of individuals based on their demographic data. The model predicts whether a value falls into one of two categories, in this case whether an individual's annual income falls above or below $50,000.
This tutorial uses the
bigquery-public-data.ml_datasets.census_adult_income
dataset. This dataset contains the demographic and income information of US
residents from 2000 and 2010.
Create a dataset
Create a BigQuery dataset to store your ML model.
Console
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click
View actions > Create datasetOn the Create dataset page, do the following:
For Dataset ID, enter
bqml_tutorial
.For Location type, select Multi-region, and then select US (multiple regions in United States).
Leave the remaining default settings as they are, and click Create dataset.
bq
To create a new dataset, use the
bq mk
command
with the --location
flag. For a full list of possible parameters, see the
bq mk --dataset
command
reference.
Create a dataset named
bqml_tutorial
with the data location set toUS
and a description ofBigQuery ML tutorial dataset
:bq --location=US mk -d \ --description "BigQuery ML tutorial dataset." \ bqml_tutorial
Instead of using the
--dataset
flag, the command uses the-d
shortcut. If you omit-d
and--dataset
, the command defaults to creating a dataset.Confirm that the dataset was created:
bq ls
API
Call the datasets.insert
method with a defined dataset resource.
{ "datasetReference": { "datasetId": "bqml_tutorial" } }
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.
Prepare the sample data
The model you create in this tutorial predicts the income bracket for census respondents, based on the following features:
- Age
- Type of work performed
- Marital status
- Level of education
- Occupation
- Hours worked per week
The education
column isn't included in the training data, because
the education
and education_num
columns both express the respondent's level
of education in different formats.
You separate the data into training, evaluation, and prediction sets by creating
a new dataframe
column that is derived from the functional_weight
column.
Eighty percent of the data is used for training the model, and the remaining
twenty percent of the data is used for evaluation and prediction.
SQL
To prepare your sample data, create a view to
contain the training data. This view is used by the CREATE MODEL
statement
later in this tutorial.
Run the query that prepares the sample data:
In the Google Cloud console, go to the BigQuery page.
In the query editor, run the following query:
CREATE OR REPLACE VIEW `bqml_tutorial.input_data` AS SELECT age, workclass, marital_status, education_num, occupation, hours_per_week, income_bracket, CASE WHEN MOD(functional_weight, 10) < 8 THEN 'training' WHEN MOD(functional_weight, 10) = 8 THEN 'evaluation' WHEN MOD(functional_weight, 10) = 9 THEN 'prediction' END AS dataframe FROM `bigquery-public-data.ml_datasets.census_adult_income`;
In the left pane, click
Explorer:If you don't see the left pane, click
Expand left pane to open the pane.In the Explorer pane, search for the
bqml_tutorial
dataset.Click the dataset, and then click Overview > Tables.
Click the
input_data
view to open the information pane. The view schema appears in the Schema tab.
BigQuery DataFrames
Create a DataFrame called input_data
. You use input_data
later in this tutorial to use to train the model, evaluate it, and make predictions.
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.
Create the boosted trees model
Create a boosted trees model to predict census respondents' income bracket, and train it on the census data. The query takes about 30 minutes to complete.
SQL
Follow these steps to create the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
CREATE MODEL `bqml_tutorial.tree_model` OPTIONS(MODEL_TYPE='BOOSTED_TREE_CLASSIFIER', BOOSTER_TYPE = 'GBTREE', NUM_PARALLEL_TREE = 1, MAX_ITERATIONS = 50, TREE_METHOD = 'HIST', EARLY_STOP = FALSE, SUBSAMPLE = 0.85, INPUT_LABEL_COLS = ['income_bracket']) AS SELECT * EXCEPT(dataframe) FROM `bqml_tutorial.input_data` WHERE dataframe = 'training';
After the query completes, the
tree_model
model can be accessed through the Explorer pane. Because the query uses aCREATE MODEL
statement to create a model, you don't see query results.
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.
Evaluate the model
SQL
Follow these steps to evaluate the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
SELECT * FROM ML.EVALUATE (MODEL `bqml_tutorial.tree_model`, ( SELECT * FROM `bqml_tutorial.input_data` WHERE dataframe = 'evaluation' ) );
The results should look similar to the following:
+---------------------+---------------------+---------------------+-------------------+---------------------+---------------------+ | precision | recall | accuracy | f1_score | log_loss | roc_auc | +---------------------+---------------------+---------------------+-------------------+-------------------------------------------+ | 0.67192429022082023 | 0.57880434782608692 | 0.83942963422194672 | 0.621897810218978 | 0.34405456040833338 | 0.88733566433566435 | +---------------------+---------------------+ --------------------+-------------------+---------------------+---------------------+
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.
The evaluation metrics indicate good model performance, in particular,
the fact that the
roc_auc
score is greater than 0.8
.
For more information about the evaluation metrics, see Output.
Use the model to predict classifications
SQL
Follow these steps to forecast data with the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
SELECT * FROM ML.PREDICT (MODEL `bqml_tutorial.tree_model`, ( SELECT * FROM `bqml_tutorial.input_data` WHERE dataframe = 'prediction' ) );
The first few columns of the results should look similar to the following:
+---------------------------+--------------------------------------+-------------------------------------+ | predicted_income_bracket | predicted_income_bracket_probs.label | predicted_income_bracket_probs.prob | +---------------------------+--------------------------------------+-------------------------------------+ | <=50K | >50K | 0.05183430016040802 | +---------------------------+--------------------------------------+-------------------------------------+ | | <50K | 0.94816571474075317 | +---------------------------+--------------------------------------+-------------------------------------+ | <=50K | >50K | 0.00365859130397439 | +---------------------------+--------------------------------------+-------------------------------------+ | | <50K | 0.99634140729904175 | +---------------------------+--------------------------------------+-------------------------------------+ | <=50K | >50K | 0.037775970995426178 | +---------------------------+--------------------------------------+-------------------------------------+ | | <50K | 0.96222406625747681 | +---------------------------+--------------------------------------+-------------------------------------+
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.
The predicted_income_bracket
contains the predicted value from the model.
The predicted_income_bracket_probs.label
shows the two labels that the
model had to choose between, and the predicted_income_bracket_probs.prob
column shows the probability of the given label being the
correct one.
For more information about the output columns, see Classification models.