Machine Learning with Big Query

 

Bigquery Studio for Training the model


I'm preparing to train a logistic regression model using a publicly available dataset in Bigquery. I will use inbuilt functions availabe in BigQuery


Steps:

  • Log in to Google Cloud Platform and Open BigQuery.
  • Create the dataset in BigQuery
  • Create and Train the ML Model
  • Evaluate the ML Model
  • Predict Using the ML Model


Create the BigQuery Dataset

  • From the Google Cloud console main navigation (hamburger menu on the top left), select BigQuery. (Click DONE to close the "Welcome" dialog box if it is displayed.)
  • Click the vertical three-dot menu next to the pinned project (in the Explorer) to View actions.
  • Select Create dataset.
  • For Dataset ID, enter acg_bqml.
  • For Data location, select us.
  • Click CREATE DATASET.





Create and Train the ML Model

  • Run a SQL function to create and train a logistic regression model.
  • Add the following query to the Query editor. 
    • This query first creates and trains the model with the dataset/table first_bqml.transactions_model. 
    • The query then declares the model_type as a logistic regression. 
    • The query selects the columns declared within the SELECT statement. 
    • Data is selected from the public bigquery-public-data.google_analytics_sample.ga_sessions_* dataset, where the * indicates you'll query a set of tables sharded by date. 
    • The WHERE clause indicates that tables between August 1, 2016 and June 30, 2017 will be queried:


CREATE MODEL `first_bqml.transactions_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'


bigquery Machinelearning


  • At the top, click RUN. It will take a minute or two for the query to process.
  • In the Results tab at the bottom, click GO TO MODEL.
  • Select the TRAINING tab. Observe the Loss is decreasing, while the Learn rate is increasing.


bigquery lodss function




Evaluate the ML Model

  • Evaluate the predicted values against the current data via SQL.
  • Go back to the Query editor, and replace the existing query with the query below. 
  • This query will evaluate the model and is similar to the previous query. 
  • Note that the SELECT statement will include all fields from the evaluated model. 
  • Additionally, observe the inclusion of the ML.EVALUATE statement, which will evaluate the newly created model:


SELECT
  *
FROM
  ML.EVALUATE(MODEL `first_bqml.transactions_model`, (
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))






In the Results tab at the bottom, observe the fields returned, which evaluate the model's accuracy and precision.




  • [{
  •   "precision": "0.468503937007874",
  •   "recall": "0.11080074487895716",
  •   "accuracy": "0.98534315834767638",
  •   "f1_score": "0.17921686746987953",
  •   "log_loss": "0.046242211011772519",
  •   "roc_auc": "0.98174825174825175"
  • }]



Predict Using the ML Model


  • Use BigQuery ML SQL to predict the number of transactions for various countries.
  • Back up in the Query editor, replace the existing query with the query below to make predictions. 
  • The initial SELECT statement retrieves the country column and computes a sum of the predicted_label column from the ML.PREDICT function. 
  • The nested SELECT, FROM, and WHERE clauses are repeated, with the results grouped by country and ordered by total_predicted_purchases:


SELECT
  country,
  SUM(predicted_label) as total_predicted_purchases
FROM
  ML.PREDICT(MODEL `first_bqml.transactions_model`, (
SELECT
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(totals.pageviews, 0) AS pageviews,
  IFNULL(geoNetwork.country, "") AS country
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY country
ORDER BY total_predicted_purchases DESC
LIMIT 15





In the Results tab at the bottom, observe the total_predicted_purchases column in the output. Note the United States has the most predicted purchases, followed by Taiwan, and then Canada.






No comments:

Post a Comment