Before we dive into different validation techniques for various kinds of models, let's talk a little bit about validating deep learning models in general.
When you build a machine learning model, you train it with a set of data samples. The model learns from these samples and derives general rules from them. When you feed the same samples back to the model, it will perform quite well on them. However, when you feed the model new samples that weren't used during training, it will behave differently: it will most likely make worse predictions on those samples. This happens because your model will always tend to lean toward the data it has seen before.
But we don't want our model to be good at predicting the outcome for samples it has seen before. It needs to work well for samples that are new to the model, because in a production environment you will get different input that you need to predict an outcome for. To make sure that our model works well, we need to validate it using a set of samples that we didn't use for training.
Let's take a look at two different techniques for creating a dataset for validating a neural network. First, we'll explore how to use a hold-out dataset. After that we'll focus on a more complex method of creating a separate validation dataset.
The first and easiest method to create a dataset to validate a neural network is to use a hold-out set. You hold back one set of samples from training and use those samples to measure the performance of your model after training is complete:
The ratio between training and test samples is usually around 80% training samples to 20% test samples. This ensures that you have enough data to train the model and a reasonable number of samples to get a good measurement of its performance.
Usually, you choose random samples from the main dataset to include in the training and test sets. This ensures that you get an even distribution between the two sets.
You can produce your own hold-out set using the train_test_split function from the scikit-learn library. It accepts any number of arrays and splits each of them into a training and a test segment based on either the train_size or the test_size keyword parameter:
from sklearn.model_selection import train_test_split

# Split the features and labels into a training set (80%) and a test set (20%).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
It is good practice to randomly split your dataset each time you run a training session. Deep learning algorithms, such as the ones used in CNTK, are highly influenced by random-number generators and by the order in which you provide samples to the neural network during training. So, to even out the effect of sample order, you need to randomize the order of your dataset each time you train the model.
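As a minimal sketch of what this looks like in practice (assuming X and y are NumPy arrays, as in the earlier snippet), you can generate a fresh shuffle of the samples at the start of each training session:

import numpy as np

# Generate a new random permutation of the sample indices for this training run.
permutation = np.random.permutation(len(X))

# Apply the same permutation to features and labels so they stay aligned.
X_shuffled = X[permutation]
y_shuffled = y[permutation]

Because the permutation is different on every run, the network sees the samples in a different order each time you train it.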
Using a hold-out set works well when you want to quickly measure the performance of your model. It's also great when you have a large dataset or a model that takes a long time to train. But there are downsides to using the hold-out technique.
Your model is sensitive to the order in which samples were provided during training. Also, each time you start a new training session, the random-number generator in your computer will provide different values to initialize the parameters in your neural network. This can cause swings in performance metrics: sometimes you will get really good results, and sometimes really bad ones. In the end, this makes a single hold-out measurement unreliable.
Be careful when randomizing datasets that contain sequences of samples that should be handled as a single input, such as when working with a time series dataset. Libraries such as scikit-learn don't handle this kind of dataset correctly and you may need to write your own randomization logic.
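If you simply want to avoid breaking up the chronological order of a time series, one option is to split the data without shuffling at all. Here is a minimal sketch using the shuffle parameter of train_test_split:

# Keep the samples in chronological order: the first 80% of the series becomes
# the training set and the final 20% becomes the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)

For sequences that must be moved around as whole blocks, you will still need to write your own randomization logic.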
You can increase the reliability of the performance metrics for your model by using a technique called k-fold cross-validation. Cross-validation applies the same technique as the hold-out set, but repeats it a number of times, usually 5 to 10:
The process of k-fold cross-validation works like this: first, you split the dataset into k equally sized parts, called folds. In each round, one fold serves as the test set and the remaining folds form the training set. You train the model on the training folds and use the held-out fold to calculate the performance metrics for your model. This is repeated until every fold has been used as the test set once, usually 5 to 10 times in total. At the end of the cross-validation process, the metrics are averaged over all rounds, which gives you the final performance measurement. Most tools will also give you the individual values, so you can see how much variation there is between the different training runs.
Cross-validation gives you a much more stable performance measurement because it simulates a more realistic scenario: by running the same training process several times, you average out the effect of sample order and random initialization, and each run uses a separate hold-out fold to simulate unseen data.
Using k-fold cross-validation takes a lot of time when validating deep learning models, so use it wisely. If you're still experimenting with the setup of your model, you're better off using the basic hold-out technique. Later, when you're done experimenting, you can use k-fold cross-validation to make sure that the model performs well in a production environment.
Note that CNTK doesn't include support for running k-fold cross-validation. You need to write your own scripts to do so.
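A minimal sketch of such a script could use scikit-learn's KFold class to generate the splits; here, train_model and evaluate_model are hypothetical wrappers around your own CNTK training and evaluation code, and X and y are assumed to be NumPy arrays:

import numpy as np
from sklearn.model_selection import KFold

# Generate 5 train/test splits; each sample ends up in the test set exactly once.
kfold = KFold(n_splits=5, shuffle=True)

scores = []
for train_index, test_index in kfold.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # train_model and evaluate_model are hypothetical functions that wrap your
    # own CNTK training and evaluation logic.
    model = train_model(X_train, y_train)
    scores.append(evaluate_model(model, X_test, y_test))

# The average over all folds is the final performance metric; the individual
# scores show how much variation there is between the runs.
print('Mean score:', np.mean(scores), 'Per fold:', scores)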
When you start to collect metrics for a neural network, using either a hold-out dataset or k-fold cross-validation, you'll discover that the metrics differ between the training dataset and the validation dataset. In this section, we'll take a look at how to use the collected metrics to detect overfitting and underfitting problems in your model.
When a model is overfit, it performs...