Overview
This post in the study note series covers Google Cloud Machine Learning.
Definitions
These are general definitions that are frequently used:
Term | Description |
Label | True answer |
Input | Predictor variable(s), what you can use to predict the label |
Example | Input + corresponding label |
Model | Math function that takes input variables and creates approximation to label |
Prediction | Using model on unlabelled data |
Regression | Continuous labels (e.g. size of tip) |
Classification | Discrete labels (e.g. gender) |
Linear Model | Neural Network with no hidden layers |
Gradient Descent | Process of reducing error; used to find the best parameters (weights and biases) |
Weights/bias | Parameters we optimize |
Batch size | The amount of data we compute error on |
Epoch | One pass through the entire dataset |
Evaluation | Is the model good enough? Has to be done on full dataset |
Training | Process of optimizing the weights; includes gradient descent + evaluation |
Mean Square Error | The loss measure for regression problems |
Cross-entropy | The loss measure for classification problems |
Accuracy | A more intuitive measure of skill for classifiers |
Precision | Accuracy when classifier says “yes” (useful for unbalanced classes where there are many more yes-es than no-es) |
Recall | Accuracy when the truth is “yes” (useful for unbalanced classes where there are very few yes-es) |
ROC curve | A way to pick the threshold (of the probability that is output by the classifier) at which a specific precision or recall is reached. The area under the curve (AUC) is a threshold-independent measure of skill. |
DG | Directed Graph |
DNN | Deep Neural Network |
Models
There are two types of models: supervised and unsupervised.
Supervised | Has labels, i.e. the correct answer |
Unsupervised | No labels |
Unsupervised ML is all about discovery, not prediction.
Confusion
For business users, use a Confusion Matrix as it is more intuitive. A confusion matrix represents the percentage of times each label was predicted for each label in the training set during evaluation.
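As a rough illustration of what such a matrix contains, the sketch below computes, for each true label, the percentage of times each label was predicted. The function name and the cat/dog data are made up for this example.

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels):
    """Return {true_label: {predicted_label: percentage of that true label}}."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = {}
    for t in labels:
        row_total = sum(counts[(t, p)] for p in labels)
        matrix[t] = {
            p: (100.0 * counts[(t, p)] / row_total if row_total else 0.0)
            for p in labels
        }
    return matrix

# Hypothetical evaluation results for a cat/dog classifier.
y_true = ["cat", "cat", "cat", "dog", "dog", "dog", "dog"]
y_pred = ["cat", "cat", "dog", "dog", "dog", "dog", "cat"]
cm = confusion_matrix(y_true, y_pred)

for true_label, row in cm.items():
    print(true_label, {p: round(pct, 1) for p, pct in row.items()})
```

Each row sums to 100%, so a business user can read a row as "when the truth was X, the model said ...".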
Error
- To calculate the Regression Error, use a Mean Squared Error approach
- To calculate the Classification Error, use a Cross-Entropy approach
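A minimal sketch of both error measures in plain Python may help. The function names are my own, and the cross-entropy shown is the binary (two-class) form, which is an assumption about the kind of classifier being scored.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

mse = mean_squared_error([3.0, 5.0], [2.0, 5.0])   # one error of 1, one of 0
bce = binary_cross_entropy([1, 0], [0.5, 0.5])     # a classifier that always says 50%
```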
Regularisation
This is a form of regression that constrains (regularizes) or shrinks the coefficient estimates towards zero. In other words, this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.
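One common way to shrink coefficients is an L2 penalty added to the loss. The sketch below assumes that form (the notes do not say which penalty is meant), with a hypothetical strength parameter `lam`.

```python
def l2_regularized_loss(data_loss, weights, lam):
    """Data loss plus lam times the sum of squared weights (L2 penalty)."""
    penalty = lam * sum(w * w for w in weights)
    return data_loss + penalty

weights = [0.5, -2.0, 1.0]
loss_no_reg = l2_regularized_loss(0.25, weights, lam=0.0)
loss_reg = l2_regularized_loss(0.25, weights, lam=0.1)
```

Because large weights now cost more, the optimizer is pushed towards smaller coefficients and therefore a simpler model.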
Accuracy
Accuracy is only a reliable measure when the data set is balanced, i.e. it has a similar number of examples of each scenario. If it doesn’t, you need to use Precision and Recall.
Precision | Positive Predictive Value (TP/(TP + FP)) |
Recall | True Positive Rate (TP/(TP + FN)) |
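The two formulas above can be checked with a small worked example. The counts are invented to mimic an unbalanced data set with only 10 positives in 1,000 examples.

```python
def precision(tp, fp):
    """TP / (TP + FP): how often a 'yes' prediction is correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """TP / (TP + FN): how many of the true 'yes' cases were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts from an unbalanced data set.
tp, fp, fn, tn = 8, 2, 2, 988
p = precision(tp, fp)
r = recall(tp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
```

Here accuracy is about 99.6% even though one in five positives is missed, which is why accuracy alone is misleading on unbalanced data.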
Effectiveness
- Precision checks how accurate the model is when it predicts yes: TP / (TP + FP). It is the measure to watch when most of the outputs are positives, i.e. most of the output is yes.
- Recall checks how accurate the model is when the truth is yes: TP / (TP + FN). It is the measure to watch when most of the outputs are negatives, i.e. most of the output is no.
- Gradient Descent is an optimization algorithm to find the minimal value of a function. Gradient descent is used to minimize the RMSE or another cost function.
- Dropout is a regularization method that removes a random selection of a fixed number of units in a neural network layer. The more units dropped out, the stronger the regularization.
- If the model is overfitting, increasing regularization can increase the Area Under the Curve (AUC) measured on evaluation data.
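To make the gradient descent bullet concrete, here is a minimal sketch that fits a straight line by repeatedly stepping against the gradient of the mean squared error. The learning rate, step count and data are illustrative choices, not values from the course.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Move the parameters a small step in the direction that reduces the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = fit_line(xs, ys)
```

With this data the weights converge towards w = 2 and b = 1; a learning rate that is too large would instead make the updates diverge.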
Combining Approaches
- High confusion, low AUC scores, or low precision and recall scores can indicate that your model needs additional training data or has inconsistent labels. A very high AUC score and perfect precision and recall can indicate that the data is too easy and may not generalize well.
Fitting
Make sure models neither underfit nor overfit. To solve overfitting, the following would help improve the model’s quality:
- Increase the number of examples. The more data a model is trained with, the more use cases it can learn from and the better its predictions.
- Tune hyperparameters, such as the number and size of hidden layers (for neural networks), and apply regularization, which means using techniques to make your model simpler, such as the dropout method to remove units from the network or adding “penalty” parameters to the cost function.
- Remove irrelevant features. Feature engineering is a wide subject and feature selection is a critical part of building and training a model. Some algorithms have built-in feature selection, but in some cases data scientists need to manually select or remove features for debugging and finding the best model output.
Integer Encoding
As a first step, each unique category value is assigned an integer value.
For example, “red” is 1, “green” is 2, and “blue” is 3.
This is called label encoding or integer encoding.
From https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
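The mapping described above can be sketched in a few lines. The helper name is hypothetical, and integers are assigned in order of first appearance, starting at 1 to match the example.

```python
def integer_encode(values):
    """Assign each unique category the next integer, starting at 1."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
    return mapping, [mapping[v] for v in values]

colours = ["red", "green", "blue", "green"]
mapping, encoded = integer_encode(colours)
print(mapping)
print(encoded)
```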
One-Hot Encoding
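For categorical variables where no such ordinal relationship exists, the integer encoding is not enough. Instead, a one-hot encoding adds one binary column per category and sets exactly one of them to 1 in each row.
For example:
red | green | blue |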
1 | 0 | 0 |
0 | 1 | 0 |
0 | 0 | 1 |
From https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
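The table above can be reproduced with a short sketch. The function name is my own, and the category order is passed in explicitly so the columns are predictable.

```python
def one_hot_encode(values, categories):
    """Return one row per value with a 1 in the column of its category."""
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return rows

categories = ["red", "green", "blue"]
one_hot_rows = one_hot_encode(["red", "green", "blue"], categories)
for row in one_hot_rows:
    print(row)
```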
Batch size and learning rate interact. Each has a “Goldilocks” value that is often hard to find.
Feature Crosses
Feature crosses are engineered based on our understanding of the problem. E.g. combine the colour (yellow) with the city (NYC) so the model can learn that yellow cars in NYC are almost always cabs.
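A feature cross can be as simple as joining two categorical values into one new value. The sketch below assumes string features and a made-up `_x_` separator.

```python
def cross_features(colour, city):
    """Combine two categorical features into a single crossed feature."""
    return f"{colour}_x_{city}"

# Hypothetical (colour, city) pairs.
cars = [("yellow", "NYC"), ("yellow", "Rome"), ("white", "NYC")]
crossed = [cross_features(colour, city) for colour, city in cars]
print(crossed)
```

The crossed value `yellow_x_NYC` gets its own weight, so the model can treat it differently from yellow cars elsewhere or other cars in NYC.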
Images
DNNs are good for image based issues where the data points are dense and correlated
SDLC
When looking at a problem:
- Choose one attribute that needs to be predicted (i.e. the label)
- Choose other attributes that describe the label (i.e. the features)
Split the data into:
- Training Data
- Validation Data
- Test Data (Independent Test Data)
If this is not possible, split into:
- Training Data
- Validation Data (cross validate)
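A three-way split might be sketched as follows. The 80/10/10 fractions and the fixed seed are assumptions for illustration, not recommendations from the course.

```python
import random

def split_dataset(examples, train_frac=0.8, validation_frac=0.1, seed=42):
    """Shuffle once, then cut into training, validation and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test

train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

The test set is only touched once at the end, so it stays independent of any tuning done against the validation set.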
Cloud AutoML
Cloud AutoML is a new technology for automatically creating ML models. The API includes:
- Vision
- Speech
- Jobs
- Translation
- Natural Language
Cloud AutoML is a suite of machine learning products that enables developers with limited machine learning expertise to train high-quality models specific to their business needs. It relies on Google’s state-of-the-art transfer learning and neural architecture search technology.
From https://cloud.google.com/automl/
Training Models
Open the AutoML Vision UI and click the lightbulb icon in the left navigation bar to display the available models.
To view the models for a different project, select the project from the drop-down list in the upper right of the title bar.
- Click the row for the model you want to evaluate.
- If necessary, click the Evaluate tab just below the title bar. If training has been completed for the model, AutoML Vision shows its evaluation metrics.
- To view the metrics for a specific label, select the label name from the list of labels in the lower part of the page.
Iterate
If you’re not happy with the quality levels, you can go back to earlier steps to improve the quality:
- AutoML Vision allows you to sort the images by how “confused” the model is, by the true label and its predicted label. Look through these images and make sure they’re labelled correctly.
- Consider adding more images to any labels with low quality.
- You may need to add different types of images (e.g. wider angle, higher or lower resolution, different points of view).
- Consider removing labels altogether if you don’t have enough training images.
- Remember that machines can’t read your label name; it’s just a random string of letters to them. If you have one label that says “door” and another that says “door_with_knob” the machine has no way of figuring out the nuance other than the images you provide it.
- Augment your data with more examples of true positives and negatives. Especially important examples are the ones that are close to the decision boundary (i.e. likely to produce confusion, but still correctly labelled).
- Specify your own TRAIN, TEST, VALIDATION split. The tool randomly assigns images, but near-duplicates may end up in TRAIN and VALIDATION which could lead to overfitting and then poor performance on the TEST set.
- Once you’ve made changes, train and evaluate a new model until you reach a high enough quality level.
TensorFlow
- TensorFlow does lazy evaluation by default. You write a Directed Graph (DG) and then run the DG in a session to get a result. Often used in production
- TensorFlow can also do eager evaluation (tf.eager). Each operation runs immediately and returns a result, with no separate session step. Often used in development
- TensorFlow allows for auto scaling
Python is the default language for programming with TF, and the TF API mirrors NumPy. NumPy is quicker to work with as evaluation is immediate. Two options to code with are
- np.array, np.add (immediate)
- tf.constant, tf.add (lazy)
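The difference between immediate and lazy evaluation can be shown with a toy graph in plain Python. This is only an illustration of the idea, not TensorFlow's API: `Node`, `constant`, `add` and `run` are made-up names standing in for graph construction and a session run.

```python
class Node:
    """A node in a toy directed graph: an operation plus its input nodes."""
    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value

def constant(value):
    return Node("const", value=value)

def add(a, b):
    # Building the node does no arithmetic; it only records what to do.
    return Node("add", inputs=(a, b))

def run(node):
    """Evaluate the graph, playing the role of a session run."""
    if node.op == "const":
        return node.value
    if node.op == "add":
        left, right = (run(i) for i in node.inputs)
        return [x + y for x, y in zip(left, right)]
    raise ValueError(f"unknown op: {node.op}")

# Immediate style: the result exists as soon as the line executes.
eager_result = [x + y for x, y in zip([5, 3, 8], [3, -1, 2])]

# Lazy style: first describe the computation, then run it.
graph = add(constant([5, 3, 8]), constant([3, -1, 2]))
lazy_result = run(graph)
```

Until `run` is called, `graph` is just a description, which is what lets a framework optimise or distribute the work before executing it.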
TF can:
- Distribute computation. Like Java, it is portable and can run on a wide range of hardware
- To read shared data use a TextLineDataset
- When using many workers, make sure they don’t all see the same data by using dataset.shuffle
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=1000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
From https://www.tensorflow.org/api_docs/python/tf/estimator/train_and_evaluate
Train
Think “steps”, not “epochs” with production-ready, distributed models.
- Gradient updates from slow workers could get ignored
- When retraining a model with fresh data, we’ll resume from earlier number of steps (and corresponding hyper-parameters)
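To translate between the two ways of thinking, the number of steps for a given number of epochs can be worked out from the data set size and batch size. The helper below is a hypothetical convenience, and the figures are invented.

```python
import math

def steps_for_epochs(num_examples, batch_size, num_epochs):
    """Number of gradient steps needed to see the data num_epochs times."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# 10,000 examples in batches of 100 for 10 epochs.
max_steps = steps_for_epochs(10_000, 100, 10)
print(max_steps)
```

Specifying the step count directly keeps training length stable when fresh data changes the data set size.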
Eval
The EvalSpec controls the evaluation and the checkpointing of the model, since they happen at the same time. Checkpointing is an essential part of eval. Think of eval as exporting to TensorBoard.
TensorBoard
TensorBoard is a collection of visualization tools designed specifically to help you visualize TensorFlow. It can:
- Visualize the TensorFlow graph
- Plot quantitative metrics
- Pass and graph additional data
HyperParameters
Your HyperParameters are the variables that govern the training process itself. For example, part of setting up a deep neural network is deciding how many hidden layers of nodes to use between the input layer and the output layer, and how many nodes each layer should use. These variables are not directly related to the training data. They are configuration variables. Note that parameters change during a training job, while hyperparameters are usually constant during a job.
From <https://cloud.google.com/ml-engine/docs/hyperparameter-tuning-overview>
HyperParameter tuning is a vital part of tuning a model. Often the initial values are chosen arbitrarily. Create task.py to parse command-line parameters and pass them along to train_and_evaluate, for example:
- Number of hidden layers
- Number of nodes in each hidden layer
You can also parameterise the output directory so that the results are not overwritten.
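A task.py along these lines could use argparse from the standard library. The flag names, defaults and bucket path below are assumptions for illustration; the real names depend on what your trainer expects.

```python
import argparse

def parse_args(argv=None):
    """Parse hyperparameters and the output directory from the command line."""
    parser = argparse.ArgumentParser(description="Hypothetical trainer entry point")
    parser.add_argument("--hidden_layers", type=int, default=2,
                        help="number of hidden layers")
    parser.add_argument("--nodes_per_layer", type=int, default=64,
                        help="number of nodes in each hidden layer")
    parser.add_argument("--output_dir", required=True,
                        help="where checkpoints and results are written")
    return parser.parse_args(argv)

# Example invocation with an explicit argument list (hypothetical bucket name).
args = parse_args(["--hidden_layers", "3", "--output_dir", "gs://my-bucket/run1"])
hidden_units = [args.nodes_per_layer] * args.hidden_layers
print(hidden_units, args.output_dir)
```

Passing a different `--output_dir` for each trial is what keeps one run from overwriting another.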
Online Prediction | Batch prediction |
Optimized to minimize the latency of serving predictions. | Optimized to handle a high volume of instances in a job and to run more complex models. |
Can process one or more instances per request. | Can process one or more instances per request. |
Predictions returned in the response message. | Predictions written to output files in a Cloud Storage location that you specify. |
Input data passed directly as a JSON string. | Input data passed indirectly as one or more URIs of files in Cloud Storage locations. |
Returns as soon as possible. | Asynchronous request. |
Accounts with the following IAM roles can request online predictions: Legacy Editor or Viewer; AI Platform Admin or Developer | Accounts with the following IAM roles can request batch predictions: Legacy Editor; AI Platform Admin or Developer |
Runs on the runtime version and in the region selected when you deploy the model. | Can run in any available region, using any available runtime version. Though you should run with the defaults for deployed model versions. |
Runs models deployed to AI Platform. | Runs models deployed to AI Platform or models stored in accessible Google Cloud Storage locations. |
Can serve predictions from a TensorFlow SavedModel or a custom prediction routine (beta). | Can serve predictions from a TensorFlow SavedModel. |
$0.0401 to $0.1349 per node hour (Americas). Price depends on machine type selection. | $0.0791 per node hour (Americas). |
From https://cloud.google.com/ml-engine/docs/tensorflow/online-vs-batch-prediction
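For online prediction, the "JSON string" row can be pictured as below. This is a sketch that assumes the request body wraps the inputs in an `instances` list; the feature names are invented, and sending the request is left out.

```python
import json

def build_online_request(instances):
    """Serialise instances into a JSON string for an online prediction request."""
    if not instances:
        raise ValueError("at least one instance is required")
    return json.dumps({"instances": instances})

# Hypothetical feature names, for illustration only.
body = build_online_request([
    {"trip_distance": 2.5, "passengers": 1},
    {"trip_distance": 7.1, "passengers": 3},
])
print(body)
```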
Saving your configuration
How you specify your cluster configuration depends on how you plan to run your training job:
Create a YAML configuration file representing the TrainingInput object, and specify the scale tier identifier and machine types in the configuration file. You can name this file whatever you want. By convention the name is config.yaml.
The following example shows the contents of the configuration file, config.yaml, for a job with a custom processing cluster.
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 9
  parameterServerCount: 3
Datalab
In Datalab, start locally on a sampled dataset, then scale it out to GCP using serverless technology.
Dialogflow
Powered by Google’s machine learning. Dialogflow incorporates Google’s machine learning expertise and products such as Google Cloud Speech-to-Text.
Kubeflow
Kubeflow is a free and open-source software platform developed by Google and first released in 2018. Kubeflow is designed to develop machine learning applications, e.g. using TensorFlow, and to deploy these to Kubernetes.
From Wikipedia
Cloud Speech-to-Text
Synchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API, performs recognition on that data, and returns results after all audio has been processed. Synchronous recognition requests are limited to audio data of 1 minute or less in duration.
Asynchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API and initiates a Long Running Operation. Using this operation, you can periodically poll for recognition results. Use asynchronous requests for audio data of any duration up to 480 minutes.
Streaming Recognition (gRPC only) performs recognition on audio data provided within a gRPC bi-directional stream. Streaming requests are designed for real-time recognition purposes, such as capturing live audio from a microphone. Streaming recognition provides interim results while audio is being captured, allowing results to appear, for example, while a user is still speaking.
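The limits above suggest a simple rule for choosing a recognition method. The helper below is hypothetical and only encodes the 1 minute and 480 minute limits quoted in these notes; it does not call the Speech-to-Text API.

```python
SYNC_LIMIT_SECONDS = 60          # synchronous: 1 minute or less
ASYNC_LIMIT_SECONDS = 480 * 60   # asynchronous: up to 480 minutes

def choose_recognition_mode(duration_seconds, live_audio=False):
    """Pick a recognition method from the audio length and whether it is live."""
    if live_audio:
        return "streaming"
    if duration_seconds <= SYNC_LIMIT_SECONDS:
        return "synchronous"
    if duration_seconds <= ASYNC_LIMIT_SECONDS:
        return "asynchronous"
    raise ValueError("audio is longer than the 480 minute asynchronous limit")

print(choose_recognition_mode(45))
print(choose_recognition_mode(30 * 60))
print(choose_recognition_mode(0, live_audio=True))
```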