Google Machine Learning - Study Notes - Data, Information and Analytics

Overview

This post in the study note series covers Google Cloud Machine Learning. All details are accurate at the time of writing; please refer to Google for current details.

Definitions

These are general definitions that are frequently used:

Term	Description
Label	True answer
Input	Predictor variable(s), what you can use to predict the label
Example	Input + corresponding label
Model	Math function that takes input variables and creates approximation to label
Prediction	Using model on unlabelled data
Regression	Continuous labels (e.g. size of tip)
Classification	Discrete labels (e.g. gender)
Linear Model	Neural Network with no hidden layers
Gradient Descent	Process of reducing error by iteratively adjusting the model's parameters
Weights/bias	Parameters we optimize
Batch size	The amount of data we compute error on
Epoch	One pass through the entire dataset
Evaluation	Is the model good enough? Has to be done on full dataset
Training	Process of optimizing the weights; includes gradient descent + evaluation
Mean Square Error	The loss measure for regression problems
Cross-entropy	The loss measure for classification problems
Accuracy	A more intuitive measure of skill for classifiers
Precision	Accuracy when classifier says “yes” (useful for unbalanced classes where there are many more yes-es than no-es)
Recall	Accuracy when the truth is “yes” (useful for unbalanced classes where there are very few yes-es)
ROC curve	A way to pick the threshold (of the probability that is output by the classifier) at which a specific precision or recall is reached. The area under the curve (AUC) is a threshold-independent measure of skill.
DG	Directed Graph
DNN	Deep Neural Network

Models

There are two types of model: supervised and unsupervised.

Supervised	Has labels, i.e. the correct answer
Unsupervised	No labels

Model types

Unsupervised ML is all about discovery, not prediction

Confusion

For business users, use a Confusion Matrix as it is more intuitive. A confusion matrix represents the percentage of times each label was predicted for each label in the training set during evaluation.
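As a rough illustration of the description above, the following sketch builds a confusion matrix as row percentages from two lists of labels. The function name and the cat/dog data are made up for the example; this is not how AutoML computes it internally.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Return {true_label: {predicted_label: percentage}}; each row sums to 100."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = {}
    for t in labels:
        row_total = sum(counts[(t, p)] for p in labels)
        matrix[t] = {
            p: (100.0 * counts[(t, p)] / row_total if row_total else 0.0)
            for p in labels
        }
    return matrix


# Hypothetical evaluation results for a two-label classifier.
true_labels = ["cat", "cat", "cat", "dog", "dog", "dog", "dog"]
predicted = ["cat", "cat", "dog", "dog", "dog", "dog", "cat"]

for true_label, row in confusion_matrix(true_labels, predicted).items():
    print(true_label, row)
```

Reading along a row shows where the examples of one true label ended up, which is usually easier to explain to a business audience than a single score.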

Error

To calculate the regression error, use the Mean Squared Error (MSE)
To calculate the classification error, use Cross-Entropy
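A minimal sketch of both error measures in plain Python follows. The function names are illustrative, and the cross-entropy shown is the binary (two-class) form, which is an assumption; multi-class problems use the general form.

```python
import math


def mean_squared_error(labels, predictions):
    """Average of the squared differences between label and prediction."""
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)


def binary_cross_entropy(labels, probabilities, eps=1e-12):
    """Average negative log-likelihood for 0/1 labels and predicted P(label = 1)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


print(mean_squared_error([3.0, 5.0], [2.0, 7.0]))  # (1 + 4) / 2 = 2.5
print(binary_cross_entropy([1, 0], [0.9, 0.2]))
```

Cross-entropy penalises confident wrong answers heavily, which is why it suits classifiers that output probabilities.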

Regularisation

This is a technique, commonly applied in regression, that constrains (regularizes) or shrinks the coefficient estimates towards zero. In other words, it discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.
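One common way to shrink coefficients is an L2 penalty added to the loss. The sketch below assumes that form (the notes do not say which penalty is meant), and the names and numbers are illustrative only.

```python
def l2_penalty(weights, strength):
    """Penalty that grows with the squared size of the weights."""
    return strength * sum(w * w for w in weights)


def regularised_loss(data_loss, weights, strength):
    """Loss the optimiser sees: fit to the data plus the complexity penalty."""
    return data_loss + l2_penalty(weights, strength)


# A model with small weights and slightly worse fit...
simple = regularised_loss(data_loss=0.50, weights=[0.1, -0.2], strength=0.1)
# ...versus a model with large weights and slightly better fit.
flexible = regularised_loss(data_loss=0.45, weights=[3.0, -4.0], strength=0.1)

print(simple < flexible)  # True: the penalty favours the simpler model
```

Raising `strength` pushes the optimiser harder towards small weights; setting it to zero removes regularization entirely.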

Accuracy

Accuracy is only a reliable measure when the data set is balanced, i.e. it has a similar number of examples of each class. If it is not balanced, use Precision and Recall instead.

Precision	Positive Predictive Value (TP/(TP + FP))
Recall	True Positive Rate (TP/(TP + FN))
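The two formulas above can be checked with a small sketch. The counts are invented to show why accuracy misleads on unbalanced data; returning 0.0 when the denominator is zero is a convention chosen for this example.

```python
def precision(tp, fp):
    """TP / (TP + FP): of the predicted positives, how many were right."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN): of the actual positives, how many were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# 1,000 examples, only 10 truly "yes"; a model that always answers "no".
tp, fp, fn, tn = 0, 0, 10, 990
print(accuracy(tp, tn, fp, fn))  # 0.99, looks excellent
print(recall(tp, fn))            # 0.0, it never finds a positive

# A model that finds 8 of the 10 positives with 2 false alarms.
print(precision(tp=8, fp=2))     # 0.8
print(recall(tp=8, fn=2))        # 0.8
```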

Precision vs Recall

Effectiveness

Precision checks how accurate the model is when it predicts a positive. In other words, of all the times the output is yes, how many are truly yes.
Recall checks how accurate the model is on the actual positives. In other words, of all the cases that are truly yes, how many the model finds.
Gradient Descent is an optimization algorithm to find the minimal value of a function. Gradient descent is used to find the minimal RMSE or cost function.
Dropout is a regularization method that removes a random selection of a fixed number of units in a neural network layer during training. The more units dropped out, the stronger the regularization.
To increase the Area Under the Curve (AUC) on evaluation data when the model is overfitting, increase regularization.
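To make the gradient descent description concrete, here is a minimal sketch that fits a straight line by repeatedly stepping against the gradient of the mean squared error (minimising MSE also minimises RMSE). The data, learning rate and step count are assumptions picked so the example converges.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by minimising mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step downhill, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))
```

A learning rate that is too large makes the updates overshoot and diverge; one that is too small makes convergence very slow.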

Combining Approaches

High confusion, low AUC scores, or low precision and recall scores can indicate that your model needs additional training data or has inconsistent labels. A very high AUC score and perfect precision and recall can indicate that the data is too easy and may not generalize well.

Fitting

Make sure models neither underfit nor overfit. To address overfitting, the following help improve the model's quality:

Increase the number of examples. The more data a model is trained with, the more cases it can learn from and the better its predictions become.
Tune hyperparameters, such as the number and size of hidden layers (for neural networks), and apply regularization, which means using techniques that make the model simpler, such as dropout (removing neurons at random) or adding "penalty" terms to the cost function.
Remove irrelevant features. Feature engineering is a wide subject, and feature selection is a critical part of building and training a model. Some algorithms have built-in feature selection, but in some cases data scientists need to manually select or remove features for debugging and for finding the best model output.

Integer Encoding

As a first step, each unique category value is assigned an integer value.

For example, “red” is 1, “green” is 2, and “blue” is 3.

This is called label encoding or integer encoding.

From https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
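The integer encoding described above can be sketched in a few lines. The helper name is hypothetical; codes are assigned in order of first appearance, starting at 1 to match the example.

```python
def integer_encode(values):
    """Assign each unique category an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return [mapping[value] for value in values], mapping


codes, mapping = integer_encode(["red", "green", "blue", "green"])
print(mapping)  # {'red': 1, 'green': 2, 'blue': 3}
print(codes)    # [1, 2, 3, 2]
```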

One-Hot Encoding

For categorical variables where no such ordinal relationship exists, the integer encoding is not enough.

For example:

red	green	blue
1	0	0
0	1	0
0	0	1

One-Hot Encoding Example

From https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
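The table above can be reproduced with a short sketch. The function name is illustrative, and the category order is passed in explicitly so the columns are predictable.

```python
def one_hot_encode(values, categories):
    """Return one row per value with a single 1 in that value's column."""
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return rows


categories = ["red", "green", "blue"]
for row in one_hot_encode(["red", "green", "blue"], categories):
    print(row)
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 1]
```

Unlike integer codes, these columns do not imply that "blue" is somehow larger than "red".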

Batch size and learning rate are related hyperparameters. Each has a Goldilocks value (neither too small nor too large) that is often hard to find.

Feature Crosses

Feature crosses are engineered based on our understanding of the problem, e.g. combine car colour (yellow) with city (NYC) so the model can learn that yellow cars in NYC are always cabs.
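A feature cross can be pictured as joining two categorical values into one new category and then one-hot encoding it. The sketch below uses made-up car data and a hypothetical `feature_cross` helper; it does not use any TensorFlow feature-column API.

```python
def feature_cross(*values):
    """Combine several categorical values into a single crossed category."""
    return "_x_".join(values)


# Hypothetical (colour, city) observations.
cars = [("yellow", "NYC"), ("yellow", "Boston"), ("white", "NYC")]

crossed = [feature_cross(colour, city) for colour, city in cars]
vocabulary = sorted(set(crossed))
one_hot = [[1 if c == v else 0 for v in vocabulary] for c in crossed]

print(vocabulary)
for row in one_hot:
    print(row)
```

Because "yellow_x_NYC" now has its own column, a linear model can give that specific combination its own weight, which neither colour nor city alone would allow.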

Images

DNNs are good for image-based problems, where the data points are dense and correlated.

SDLC

When looking at a problem:

Choose one attribute that needs to be predicted (i.e. the label)
Choose the other attributes that describe the label (i.e. the features)

Split into

Training Data
Validation Data
Test Data (Independent Test Data)

If this is not possible

Training Data
Validation Data (cross validate)
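Both splitting strategies above can be sketched with the standard library. The 80/10/10 proportions, the fixed seed and the fold count are assumptions for illustration, not values from the course.

```python
import random


def train_val_test_split(examples, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle once, then carve off independent test and validation sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def k_fold(examples, k=5):
    """Yield (train, validation) pairs; each fold is the validation set once."""
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, folds[i]


data = list(range(100))
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))  # 80 10 10

for fold_train, fold_val in k_fold(list(range(10)), k=5):
    print(len(fold_train), len(fold_val))  # 8 2, five times
```

Cross-validation reuses every example for both training and validation, which is why it is the fallback when there is too little data for a separate test set.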

Cloud AutoML

Cloud AutoML is a newer technology for automatically creating ML models. The APIs include:

Vision
Speech
Jobs
Translation
Natural Language

Cloud AutoML is a suite of machine learning products that enables developers with limited machine learning expertise to train high-quality models specific to their business needs. It relies on Google’s state-of-the-art transfer learning and neural architecture search technology

From https://cloud.google.com/automl/

Training Models

Open the AutoML Vision UI and click the lightbulb icon in the left navigation bar to display the available models.
To view the models for a different project, select the project from the drop-down list in the upper right of the title bar.

Click the row for the model you want to evaluate.
If necessary, click the Evaluate tab just below the title bar.
If training has been completed for the model, AutoML Vision shows its evaluation metrics.
To view the metrics for a specific label, select the label name from the list of labels in the lower part of the page.

Iterate

If you’re not happy with the quality levels, you can go back to earlier steps to improve the quality:

AutoML Vision allows you to sort the images by how “confused” the model is, by the true label and its predicted label. Look through these images and make sure they’re labelled correctly.
Consider adding more images to any labels with low quality.
You may need to add different types of images (e.g. wider angle, higher or lower resolution, different points of view).
Consider removing labels altogether if you don’t have enough training images.
Remember that machines can’t read your label name; it’s just a random string of letters to them. If you have one label that says “door” and another that says “door_with_knob” the machine has no way of figuring out the nuance other than the images you provide it.
Augment your data with more examples of true positives and negatives. Especially important examples are the ones that are close to the decision boundary (i.e. likely to produce confusion, but still correctly labelled).
Specify your own TRAIN, TEST, VALIDATION split. The tool randomly assigns images, but near-duplicates may end up in TRAIN and VALIDATION which could lead to overfitting and then poor performance on the TEST set.
Once you’ve made changes, train and evaluate a new model until you reach a high enough quality level.

TensorFlow

TensorFlow does lazy evaluation by default. You write a Directed Graph (DG) and then run the DG in a session to get a result. This mode is often used in production.
TensorFlow can also do eager evaluation (tf.eager). Operations run as soon as they are called and return a result immediately. This mode is often used in development.
TensorFlow allows for auto scaling.

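Python is the usual language for programming with TF, and many TF operations mirror NumPy. NumPy returns results straight away because evaluation is immediate, whereas TF (by default) builds a graph that is evaluated later. Two options to code with are:

np.array, np.add (immediate)
tf.constant, tf.add (lazy)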
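The difference between immediate and lazy evaluation can be shown without either library. The sketch below is a toy graph in plain Python: the `Node`, `constant` and `add` names are invented for the illustration and are not TensorFlow's API.

```python
class Node:
    """A node in a tiny directed graph. Nothing is computed until run()."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value

    def run(self):
        if self.op == "constant":
            return self.value
        if self.op == "add":
            left, right = (node.run() for node in self.inputs)
            return [a + b for a, b in zip(left, right)]
        raise ValueError(f"unknown op: {self.op}")


def constant(values):
    return Node("constant", value=list(values))


def add(a, b):
    return Node("add", inputs=(a, b))


# Immediate style: the sum exists as soon as the line runs.
eager_result = [a + b for a, b in zip([5, 3, 8], [3, -1, 2])]

# Lazy style: this only describes the computation...
graph = add(constant([5, 3, 8]), constant([3, -1, 2]))
# ...and the numbers appear only when the graph is run.
lazy_result = graph.run()

print(eager_result, lazy_result)  # [8, 2, 10] [8, 2, 10]
```

Separating the description of the computation from its execution is what lets a framework optimise the graph and distribute it across machines before running it.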

TF can:

Distribute computation. Like Java, it can run on many kinds of hardware
To read shared data use a TextLineDataset
When using many workers, make sure they don’t all see the same data by using dataset.shuffle

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=1000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

From https://www.tensorflow.org/api_docs/python/tf/estimator/train_and_evaluate

Train

Think “steps”, not “epochs” with production-ready, distributed models.

Gradient updates from slow workers could get ignored
When retraining a model with fresh data, we’ll resume from earlier number of steps (and corresponding hyper-parameters)
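To relate the two units, the sketch below converts epochs to steps, assuming one step processes exactly one batch. The helper name and the numbers are illustrative.

```python
import math


def steps_for_epochs(num_examples, batch_size, epochs):
    """Number of training steps needed to pass over the data `epochs` times."""
    return math.ceil(num_examples * epochs / batch_size)


# 10,000 examples in batches of 100, repeated for 10 epochs.
print(steps_for_epochs(num_examples=10_000, batch_size=100, epochs=10))  # 1000

# With fresh data the dataset grows, so the same 10 epochs mean more steps.
print(steps_for_epochs(num_examples=12_000, batch_size=100, epochs=10))  # 1200
```

A step count stays meaningful when the dataset size or batch size changes between runs, which is one reason to prefer it over an epoch count.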

Eval

The EvalSpec controls both the evaluation and the checkpointing of the model, since they happen at the same time. Checkpointing is an essential part of eval. Think of eval as exporting to TensorBoard.

TensorBoard

TensorBoard is a collection of visualization tools designed specifically to help you visualize TensorFlow.

TensorFlow graph
Plot quantitative metrics
Pass and graph additional data

HyperParameters

Your HyperParameters are the variables that govern the training process itself. For example, part of setting up a deep neural network is deciding how many hidden layers of nodes to use between the input layer and the output layer, and how many nodes each layer should use. These variables are not directly related to the training data. They are configuration variables. Note that parameters change during a training job, while hyperparameters are usually constant during a job.

From <https://cloud.google.com/ml-engine/docs/hyperparameter-tuning-overview>

HyperParameter tuning is a vital part of tuning a model, because the initial values are often chosen arbitrarily. Create task.py to parse command-line parameters and pass them along to train_and_evaluate, for example:

Number of hidden layers
Number of nodes in each hidden layer

Can also parameterise the output directory so that the results are not overwritten
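A task.py along these lines could parse its parameters with argparse. This is a sketch under assumptions: the flag names `--hidden_units`, `--train_steps` and `--output_dir`, and the bucket path, are hypothetical and not prescribed by the service.

```python
import argparse


def parse_args(argv=None):
    """Parse hyperparameters from the command line (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Training entry point sketch")
    parser.add_argument(
        "--hidden_units",
        default="64,32",
        help="Comma-separated node count for each hidden layer",
    )
    parser.add_argument("--train_steps", type=int, default=1000)
    parser.add_argument(
        "--output_dir",
        required=True,
        help="Where to write checkpoints; vary per run to avoid overwriting",
    )
    args = parser.parse_args(argv)
    # "128,64,32" -> [128, 64, 32]: three hidden layers of those sizes.
    args.hidden_units = [int(n) for n in args.hidden_units.split(",")]
    return args


# Simulated command line; a real job would call parse_args() with no arguments.
args = parse_args(
    ["--hidden_units", "128,64,32", "--output_dir", "gs://my-bucket/trial_1"]
)
print(len(args.hidden_units), args.hidden_units, args.train_steps, args.output_dir)
```

Encoding the layer sizes in a single flag covers both the number of hidden layers and the nodes per layer, and the parsed values can then be handed to whatever function builds the estimator.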

Online Prediction	Batch prediction
Optimized to minimize the latency of serving predictions.	Optimized to handle a high volume of instances in a job and to run more complex models.
Can process one or more instances per request.	Can process one or more instances per request.
Predictions returned in the response message.	Predictions written to output files in a Cloud Storage location that you specify.
Input data passed directly as a JSON string.	Input data passed indirectly as one or more URIs of files in Cloud Storage locations.
Returns as soon as possible.	Asynchronous request.
Accounts with the following IAM roles can request online predictions: Legacy Editor or Viewer; AI Platform Admin or Developer.	Accounts with the following IAM roles can request batch predictions: Legacy Editor; AI Platform Admin or Developer.
Runs on the runtime version and in the region selected when you deploy the model.	Can run in any available region, using any available runtime version. Though you should run with the defaults for deployed model versions.
Runs models deployed to AI Platform.	Runs models deployed to AI Platform or models stored in accessible Google Cloud Storage locations.
Can serve predictions from a TensorFlow SavedModel or a custom prediction routine (beta).	Can serve predictions from a TensorFlow SavedModel.
$0.0401 to $0.1349 per node hour (Americas). Price depends on machine type selection.	$0.0791 per node hour (Americas).

From https://cloud.google.com/ml-engine/docs/tensorflow/online-vs-batch-prediction

Saving your configuration

How you specify your cluster configuration depends on how you plan to run your training job:

PYTHON

Create a YAML configuration file representing the TrainingInput object, and specify the scale tier identifier and machine types in the configuration file. You can name this file whatever you want. By convention the name is config.yaml.

The following example shows the contents of the configuration file, config.yaml, for a job with a custom processing cluster.

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 9
  parameterServerCount: 3

Datalab

In Datalab, start locally on a sampled dataset, then scale out to GCP using serverless technology.

Dialogflow

Powered by Google’s machine learning. Dialogflow incorporates Google’s machine learning expertise and products such as Google Cloud Speech-to-Text.

Kubeflow

Kubeflow is a free and open-source software platform developed by Google and first released in 2018. Kubeflow is designed to develop machine learning applications, e.g. using TensorFlow, and to deploy these to Kubernetes.

From Wikipedia

Cloud Speech-to-Text

Synchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API, performs recognition on that data, and returns results after all audio has been processed. Synchronous recognition requests are limited to audio data of 1 minute or less in duration.

Asynchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API and initiates a Long Running Operation. Using this operation, you can periodically poll for recognition results. Use asynchronous requests for audio data of any duration up to 480 minutes.

Streaming Recognition (gRPC only) performs recognition on audio data provided within a gRPC bi-directional stream. Streaming requests are designed for real-time recognition purposes, such as capturing live audio from a microphone. Streaming recognition provides interim results while audio is being captured, allowing results to appear, for example, while a user is still speaking.
