Bigquery Studio for Training the model
I'm preparing to train a logistic regression model using a publicly available dataset in Bigquery. I will use inbuilt functions availabe in BigQuery
Steps:
- Log in to Google Cloud Platform and Open BigQuery.
- Create the dataset in BigQuery
- Create and Train the ML Model
- Evaluate the ML Model
- Predict Using the ML Model
Create the BigQuery Dataset
- From the Google Cloud console main navigation (hamburger menu on the top left), select BigQuery. (Click DONE to close the "Welcome" dialog box if it is displayed.)
- Click the vertical three-dot menu next to the pinned project (in the Explorer) to View actions.
- Select Create dataset.
- For Dataset ID, enter acg_bqml.
- For Data location, select us.
- Click CREATE DATASET.
Create and Train the ML Model
- Run a SQL function to create and train a logistic regression model.
- Add the following query to the Query editor.
- This query first creates and trains the model with the dataset/table first_bqml.transactions_model.
- The query then declares the model_type as a logistic regression.
- The query selects the columns declared within the SELECT statement.
- Data is selected from the public bigquery-public-data.google_analytics_sample.ga_sessions_* dataset, where the * indicates you'll query a set of tables sharded by date.
- The WHERE clause indicates that tables between August 1, 2016 and June 30, 2017 will be queried:
CREATE MODEL `first_bqml.transactions_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
IF(totals.transactions IS NULL, 0, 1) AS label,
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(geoNetwork.country, "") AS country,
IFNULL(totals.pageviews, 0) AS pageviews
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170630'
- At the top, click RUN. It will take a minute or two for the query to process.
- In the Results tab at the bottom, click GO TO MODEL.
- Select the TRAINING tab. Observe the Loss is decreasing, while the Learn rate is increasing.
Evaluate the ML Model
- Evaluate the predicted values against the current data via SQL.
- Go back to the Query editor, and replace the existing query with the query below.
- This query will evaluate the model and is similar to the previous query.
- Note that the SELECT statement will include all fields from the evaluated model.
- Additionally, observe the inclusion of the ML.EVALUATE statement, which will evaluate the newly created model:
SELECT
*
FROM
ML.EVALUATE(MODEL `first_bqml.transactions_model`, (
SELECT
IF(totals.transactions IS NULL, 0, 1) AS label,
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(geoNetwork.country, "") AS country,
IFNULL(totals.pageviews, 0) AS pageviews
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
In the Results tab at the bottom, observe the fields returned, which evaluate the model's accuracy and precision.
- [{
- "precision": "0.468503937007874",
- "recall": "0.11080074487895716",
- "accuracy": "0.98534315834767638",
- "f1_score": "0.17921686746987953",
- "log_loss": "0.046242211011772519",
- "roc_auc": "0.98174825174825175"
- }]
Predict Using the ML Model
- Use BigQuery ML SQL to predict the number of transactions for various countries.
- Back up in the Query editor, replace the existing query with the query below to make predictions.
- The initial SELECT statement retrieves the country column and computes a sum of the predicted_label column from the ML.PREDICT function.
- The nested SELECT, FROM, and WHERE clauses are repeated, with the results grouped by country and ordered by total_predicted_purchases:
SELECT
country,
SUM(predicted_label) as total_predicted_purchases
FROM
ML.PREDICT(MODEL `first_bqml.transactions_model`, (
SELECT
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(totals.pageviews, 0) AS pageviews,
IFNULL(geoNetwork.country, "") AS country
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY country
ORDER BY total_predicted_purchases DESC
LIMIT 15
country,
SUM(predicted_label) as total_predicted_purchases
FROM
ML.PREDICT(MODEL `first_bqml.transactions_model`, (
SELECT
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(totals.pageviews, 0) AS pageviews,
IFNULL(geoNetwork.country, "") AS country
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY country
ORDER BY total_predicted_purchases DESC
LIMIT 15
In the Results tab at the bottom, observe the total_predicted_purchases column in the output. Note the United States has the most predicted purchases, followed by Taiwan, and then Canada.
No comments:
Post a Comment