torstai 22. helmikuuta 2018

Prep for exam, visiting lectures (Feb 22)

In the last lecture, we spent the first 30 minutes on an old exam. In particular, we learned how to calculate the ROC curve and the AUC for a test sample. Check the video to see how this was done.
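If you want to double-check a hand calculation, here is a minimal sketch of computing the same quantities with scikit-learn (the scores and labels below are made up for illustration, not taken from the exam):

from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical classifier scores and true labels
y_true   = [0, 0, 1, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

# ROC curve: false positive rate and true positive rate at each threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# Area under the ROC curve
print("AUC: %.3f" % roc_auc_score(y_true, y_scores))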

The last part of the lecture consisted of two visiting lectures: Antti Liski from Dain Studios and Miikka Ermes from Futurice presented their data science activities.

Good luck with the competition. I have arranged access to our classroom for GPU use. If you need more, I recommend renting an Amazon AWS GPU instance. Choose a "Deep learning instance" to have Anaconda, TensorFlow and Keras preinstalled. The cost is $0.9/hour (AFAIK; running one now), which should stay bearable even on a student budget.

maanantai 19. helmikuuta 2018

Regularization and feature selection (Feb 19)

We continued the discussion of regularization and noted that the L1 norm enforces weights that are exactly zero. This is explained by the figure on the right, where the regularized solutions are always restricted to a region defined by the norm and its upper bound. Because of the shape of the L1 region (red square), the minimum tends to occur at one of the corners, and the smaller the value of C, the more likely we are to hit one of them.
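As a concrete illustration, here is a small sketch (on synthetic data, so the exact counts carry no meaning) of how the number of exactly-zero coefficients of an L1-penalized logistic regression grows as C shrinks:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: only a few of the 50 features are informative
X, y = make_classification(n_samples = 200, n_features = 50,
                           n_informative = 5, random_state = 0)

for C in [1.0, 0.1, 0.01]:
    # The liblinear solver supports the L1 penalty
    clf = LogisticRegression(penalty = 'l1', C = C, solver = 'liblinear')
    clf.fit(X, y)
    n_zero = np.sum(clf.coef_ == 0)
    print("C = %.2f: %d of %d coefficients are exactly zero"
          % (C, n_zero, clf.coef_.size))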

Apart from the L1 and L2 norms, there is literature on other norms. Especially Lq for q < 1 is interesting, because it results in even sparser weight vectors. Unfortunately, such norms make the loss function non-convex, and the solution may be hard to find (the end result depends on the initial guess). An example of regions obtained this way is shown on the right, and now it is even easier to hit a corner (and get zero coefficients). Image credit: Caron et al., "Sparse Bayesian nonparametric regression", ICML 2008.

The feature selection property is useful, since discarding useless features can improve accuracy. On the other hand, it can be used together with a feature generator that produces a large number of redundant features. An example of this was our ICANN2011 MEG Mind Reading Competition submission. In that challenge, we generated altogether 11 different feature sets (statistics such as means, standard deviations, variances, ...), and selected the best feature set using both the L1 norm and a brute-force search over all combinations of two of the 11 statistics (more details here; available from the TUT network).
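In scikit-learn, the same idea can be sketched with SelectFromModel and an L1-penalized model; the synthetic data below is just a stand-in for a large, redundant feature matrix:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Stand-in for a feature generator producing many redundant features
X, y = make_classification(n_samples = 300, n_features = 100,
                           n_informative = 10, n_redundant = 60,
                           random_state = 0)

# L1-penalized model as the selector: features with zero weight are dropped
selector = SelectFromModel(
    LogisticRegression(penalty = 'l1', C = 0.1, solver = 'liblinear'))
selector.fit(X, y)

X_selected = selector.transform(X)
print("Kept %d of %d features" % (X_selected.shape[1], X.shape[1]))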

For the competition, you may also check the MLSP2013 Birds competition, which is closely related to the current competition in terms of data. A summary paper was written about the solutions the different teams used (available from the TUT network).

keskiviikko 14. helmikuuta 2018

Regularization and error estimation (Feb 14)

Today we studied error estimation using cross-validation. This includes K-fold CV, stratified K-fold CV, group K-fold CV and leave-one-out (LOO) CV. For each method there is a corresponding sklearn generator that produces the splits. The CV methods differ from a simple train_test_split in that they train and test many times: for example, 10-fold CV produces 10 estimates of accuracy instead of just one. Averaging the 10 numbers increases the stability and reliability of the estimate, and helps you in model selection (e.g., in deciding whether to use an SVM or LogReg).
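As a minimal sketch of such a comparison (using the load_data and extract_features helpers from the Jan 31 example further down the page):

import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from utils import load_data, extract_features

X_train, y_train, X_test, y_test, class_names = load_data("audio_data")
F_train = extract_features(X_train, axis = None)

# 10-fold stratified CV: each fold keeps the class proportions
cv = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 0)

for name, model in [("SVM", SVC()), ("LogReg", LogisticRegression())]:
    scores = cross_val_score(model, F_train, y_train, cv = cv)
    print("%s: %.3f +/- %.3f" % (name, np.mean(scores), np.std(scores)))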

Next, we studied L1 and L2 regularization with linear models. Regularization helps to avoid overfitting, where your model performs very well on training data but poorly on test data (see also the previous blog entry where the convnet overfits). These regularization techniques add a penalty to the loss function (such as the log loss or the hinge loss), and the penalty is either the L1 or the L2 norm of the weight vector. Note also that the same technique applies to deep nets, which are essentially stacked LogReg models. For example, in Keras, the regularizers module enables adding such a penalty if your net seems to overfit. In DNNs, the most widely used regularizer is probably dropout, which we have discussed earlier, but all of them serve the same purpose: avoiding overfitting to the training data.
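Here is a hedged sketch of what that looks like in Keras (the penalty weight 0.01 and the layer sizes are arbitrary placeholders):

from keras import regularizers
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# L2 penalty on the layer weights; swap in regularizers.l1 for an L1 penalty
model.add(Dense(128, activation = 'relu', input_shape = (100,),
                kernel_regularizer = regularizers.l2(0.01)))
model.add(Dense(10, activation = 'softmax'))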

There was a question about what tricks the winning teams are using to succeed in the competition. Among the reports due last Monday, I have spotted a few widely used ones. I don't want to spoil the competition by revealing them all, but here are a few general tricks:
  • Model averaging: instead of using just one model, train many predictors (LR, SVM, RF, deep net), and average their predictions; either by majority vote (which class is predicted most often) or averaging the class probabilities. If you want to get fancy, you may want to try different combinations of predictors in your local test bench and automate the choice of predictors to include. I have done this in this very related competition (you can find my name on the leaderboard). Also check the "congratulations thread" of the discussion forum there.
  • Semisupervised learning: You can learn from the test data as well with this one. Googling for "semi supervised logistic regression" will find many papers where the test data is used to aid the classification. Then there is the dirty trick we used in this competition (check the leaderboard here as well):
    • Train a model with the training data
    • Predict labels for the test data
    • Append the training set with test samples and predicted labels
    • Retrain the model with train + test data and with the fake labels
    • Predict the labels for test data again
  • The accuracy of the above usually improves if you include only those test samples whose predicted probability is high (e.g. > 0.6); a minimal sketch of this loop is shown right after the list.
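Below is that sketch of the pseudo-labeling loop, assuming X_train, y_train and X_test are numpy arrays as elsewhere on this page; LogReg is just an example of a classifier with predict_proba:

import numpy as np
from sklearn.linear_model import LogisticRegression

# 1) Train a model with the training data
model = LogisticRegression()
model.fit(X_train, y_train)

# 2) Predict labels (and probabilities) for the test data
proba = model.predict_proba(X_test)
y_pseudo = model.classes_[np.argmax(proba, axis = 1)]

# Keep only the confident test samples (max probability > 0.6)
confident = np.max(proba, axis = 1) > 0.6

# 3) Append the training set with the confident test samples and their labels
X_aug = np.concatenate([X_train, X_test[confident]])
y_aug = np.concatenate([y_train, y_pseudo[confident]])

# 4) Retrain with train + test data and the fake labels, 5) re-predict
model.fit(X_aug, y_aug)
y_pred = model.predict(X_test)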
The reports also discuss many other innovative tricks related to feature extraction and classifier choice. I will require the teams to disclose these in the final report (instructions out soon), and they will be opened to the whole world after the competition.

maanantai 12. helmikuuta 2018

Recurrent Networks (Feb 12)

Today we continued on the topic of deep learning. More specifically, we studied recurrent networks, including SimpleRNN, LSTM and GRU layers in Keras.

The recurrent networks differ from ordinary nets in that they remember their past states. This makes them attractive for sequence processing (e.g., audio signals or text sequences).

An example of an LSTM network in action is shown in the picture, which originally appeared in our recent paper. In that case, we detected different audio classes from a microphone signal via a transformation to a spectrogram representation.

It is possible to take individual spectrogram time frames and recognize them independently. However, this usually creates jitter in the event roll (classes turning on and off). Since the LSTM remembers the past, it produces a much smoother output.

For more information on the theory of recurrent nets, I recommend studying the already classic blog entry by Andrej Karpathy: The unreasonable effectiveness of recurrent neural networks.

At the end of the second hour, we looked at examples of recent highlights in deep learning; we will continue with that on Wednesday.

We also tested how well a recurrent network suits our competition task. The code below is an example of how to do this.


# Imports (assuming Keras with the TensorFlow backend and the same
# load_data helper from the utils module used elsewhere on this page)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelBinarizer
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.optimizers import SGD
from utils import load_data

# Load and binarize the data

path = "audio_data"
X_train, y_train, X_test, y_test, class_names = load_data(path)

lb = LabelBinarizer()
lb.fit(y_train)
y_train = lb.transform(y_train)
y_test  = lb.transform(y_test)
num_classes = y_train.shape[1]    

# Shuffle dimensions. 
# Originally the dims are (sample_idx, frequency, time)
# Should be like: (sample_idx, time, frequency), so that
# LSTM runs along 'time' dimension.

X_train = np.transpose(X_train, (0, 2, 1))
X_test  = np.transpose(X_test,  (0, 2, 1))

# Define the net. Just 1 LSTM layer; you can add more.

model = Sequential()
model.add(LSTM(128, return_sequences = False, 
               input_shape = X_train.shape[1:]))
model.add(Dense(num_classes, activation = 'softmax'))

# Define optimizer such that we can adjust the learning rate:
optimizer = SGD(lr = 1e-3)

# Compile and fit
model.compile(optimizer = optimizer,
              loss = 'categorical_crossentropy',
              metrics = ['accuracy'])

history = model.fit(X_train, y_train, epochs = 30,
                    validation_data = (X_test, y_test))

# Plot learning curve.
# History object contains the metrics for each epoch.
# If you want to use plotting on remote machine, import like this:
#
# > import matplotlib
# > matplotlib.use("PDF") # Prevents crashing due to no window manager
# > import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 1)
ax[0].plot(history.history['acc'], 'ro-', label = "Train Accuracy")
ax[0].plot(history.history['val_acc'], 'go-', label = "Test Accuracy")
ax[0].set_xlabel("Epoch")
ax[0].set_ylabel("Accuracy")
ax[0].legend(loc = "best")
ax[0].grid('on')

ax[1].plot(history.history['loss'], 'ro-', label = "Train Loss")
ax[1].plot(history.history['val_loss'], 'go-', label = "Test Loss")
ax[1].set_xlabel("Epoch")
ax[1].set_ylabel("Loss")
ax[1].legend(loc = "best")
ax[1].grid('on')

plt.tight_layout()
plt.savefig("Accuracy.pdf", bbox_inches = "tight")

The learning curve that the script produces is shown on the right. It seems that the accuracy is growing, but relatively slowly. This could be improved by adding 1-2 more LSTM layers and making them bigger (e.g., 128 -> 512 nodes).

It is noteworthy that training on a CPU is very slow. Here are some numbers:
  • CPU: about 30 minutes per epoch.
  • Tesla K80 GPU: about 80 seconds per epoch.
  • Tesla K80 GPU with CuDNN: about 11 seconds per epoch. Simply substitute "LSTM" by "CuDNNLSTM" in the above code.
  • Tesla K80 is about 3x slower than a high end gaming GPU, so with GTX1080Ti you might reach 5 s / epoch.
If you need a GPU, there are some in the classroom TC303. They are a bit old, however. If you are really in need of faster GPUs, talk to me during the lecture. We can arrange GPU nodes from TCSC for you. However, only do this if you really intend to use those machines.

If we add another recurrent layer, change LSTM to GRU, and increase the layer size to 256, we get a learning curve as shown on the right.
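For reference, here is a sketch of that modified architecture (X_train and num_classes as in the LSTM script above):

from keras.models import Sequential
from keras.layers import GRU, Dense

model = Sequential()
# The first recurrent layer must return the full sequence
# so that the second one has a sequence to consume.
model.add(GRU(256, return_sequences = True,
              input_shape = X_train.shape[1:]))
model.add(GRU(256, return_sequences = False))
model.add(Dense(num_classes, activation = 'softmax'))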


Additionally, below is an example of setting up a convnet for the competition.


# Extra imports for the convnet (Sequential, Dense, SGD and numpy as above)
from keras.layers import Conv2D, MaxPool2D, Flatten

# Add a channel dimension. Here the data is assumed to be in the original
# (sample_idx, frequency, time) order, i.e., without the transpose used
# in the LSTM example above.
X_train = X_train[..., np.newaxis]
X_test  = X_test [..., np.newaxis]

model = Sequential()

model.add(Conv2D(filters = 32, 
                 kernel_size = (5, 5),
                 activation = 'relu',
                 padding = 'same',
                 input_shape=(40, 501, 1)))

model.add(MaxPool2D(2,2))
model.add(Conv2D(filters = 32, 
                 kernel_size = (5, 5),
                 padding = 'same',
                 activation = 'relu'))
model.add(MaxPool2D(2,2))

model.add(Conv2D(filters = 32, 
                 kernel_size = (5, 5),
                 padding = 'same',
                 activation = 'relu'))
model.add(MaxPool2D(2,2))

model.add(Conv2D(filters = 32, 
                 kernel_size = (5, 5),
                 padding = 'same',
                 activation = 'relu'))
model.add(MaxPool2D(2,2))

model.add(Flatten())
model.add(Dense(128, activation = 'relu'))
model.add(Dense(y_train.shape[1], activation='softmax'))

model.summary()

optimizer = SGD(lr = 1e-3)
model.compile(optimizer = optimizer,
              loss = 'categorical_crossentropy',
              metrics = ['accuracy'])

This network reaches about 60% accuracy on the test set. Notice the clear sign of overfitting: the test loss actually increases after about 40 epochs.

Although neither of the example networks is magnificent in accuracy, they have a lot of potential if tuned in an appropriate manner.

Deep learning is not plug and play.

keskiviikko 7. helmikuuta 2018

Convolutional nets, pretraining, applications (Feb 7)

Today we studied the anatomy of a convnet. The first example was on computing the number of parameters of the net. We computed the number of weights of the network on slide 14 of the lecture notes. This number is also given by Keras' "model.summary()" function for each layer separately. Noteworthy things are the following:
  • The number of parameters of a convolutional layer follows the rule N = w * h * d * c + c (see the small check after this list), where
    • w is the convolution window width
    • h is the convolution window height
    • d is the number of input channels
    • c is the number of output channels
  • The number of parameters of a dense layer is simply the number of interconnections between the layers (plus one bias per output unit).
  • The number of parameters of conv layers does not depend on image size.
  • The convolutional pipeline works for any image size, but the dense layers are specific to the image size used in training.
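The rule can be verified with a tiny sketch: a single Conv2D layer with a 5x5 window, 3 input channels and 32 output channels should have 5 * 5 * 3 * 32 + 32 = 2432 parameters, regardless of the (here arbitrary) 64x64 image size:

from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
model.add(Conv2D(filters = 32, kernel_size = (5, 5),
                 input_shape = (64, 64, 3)))

# Should report 2432 parameters for the conv layer
model.summary()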
After that, we briefly discussed the history of deep learning. The Imagenet competitions had a great impact on speeding up the development. Over the course of time, people moved to deeper and deeper architectures, but often with less computation required.

The next topic was the effect of pretraining. It is usually best to start the training from a pretrained model instead of starting from scratch. Keras has the module "keras.applications", which provides a number of famous (Imagenet-pretrained) models. As an example, we looked at training a network to classify pictures of dogs and cats. It turned out that the pretrained model is superior, as illustrated by the figure below.
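A hedged sketch of the idea with keras.applications is below; the VGG16 base, the 150x150 input size and the small dense head are placeholders rather than the exact network from the lecture:

from keras.applications import VGG16
from keras.models import Sequential
from keras.layers import Flatten, Dense

# Imagenet-pretrained convolutional base, without the dense top
base = VGG16(weights = 'imagenet', include_top = False,
             input_shape = (150, 150, 3))
base.trainable = False   # keep the pretrained weights frozen at first

model = Sequential()
model.add(base)
model.add(Flatten())
model.add(Dense(256, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))   # dogs vs. cats: binary output

model.compile(optimizer = 'rmsprop',
              loss = 'binary_crossentropy',
              metrics = ['accuracy'])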

Remember the Monday 12.2. deadline for submitting your report on the assignment. Follow the instructions. The format is free. Concentrate on the facts.

maanantai 5. helmikuuta 2018

Deep Learning (Feb 5)

Today's topic was deep learning. We started from the traditional 1990s networks by looking at an example of classifying license plate characters, something that our company has done for 20 years. The difference between modern and 1990s nets is that current networks are both deeper (more layers) and larger (individual layers are wider). Moreover, there are structures that were uncommon in the old days, such as convolutional layers, ReLU nonlinearities and dropout regularization. All of this is enabled by the computational resources of modern GPUs.

Convolutional networks usually start with a convolutional pipeline, where each layer applies a number of convolution filters to highlight different things of interest. The convolution kernels (filter parameters) are learned from the data at training time. Here is an example of convolution in action, highlighting all vertical edges.


The code for reproducing the figure (apart from the data) is below.
 
import numpy as np
from scipy.signal import convolve2d 
import cv2
import matplotlib.pyplot as plt

x = cv2.imread("person1.jpg")
x = np.mean(x, axis = -1)   # convert to grayscale

# Zero-mean 3x3 kernel that responds to vertical edges
w = np.array([[0,1,1], [0,1,1], [0,1,1]])
w = w - np.mean(w)
y = convolve2d(x, w)

fig, ax = plt.subplots(1, 2, sharex = True, sharey = True)
ax[0].imshow(x, cmap = 'gray')
ax[1].imshow(y, cmap = 'gray')

plt.show()

When a sequence of convolution operations is stacked together, we get powerful feature extraction operators that are able to transform the data representation into something a computer can easily recognize. For example, the figure below illustrates (part of) the processing pipeline of our real-time age estimation demo.


In addition to the convolutions, the pipeline consists of maxpooling and ReLU nonlinearities, with three dense layers on top.

keskiviikko 31. tammikuuta 2018

Ensemble methods and neural nets (Jan 31)

Today we continued the ensemble methods started last time.

However, before that we discussed the problem of model comparison in the competition. Namely, train_test_split will mix samples from the same recording into the train and validation sets, and will give an overly optimistic 99% score. Here is a great post on the topic.
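A hedged sketch of a safer split is below, assuming X_train and y_train are numpy arrays and that you can derive a recording identifier for each sample (e.g., from the file names); groups is that per-sample identifier:

from sklearn.model_selection import GroupShuffleSplit

# groups: one recording id per training sample
gss = GroupShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 0)
train_idx, val_idx = next(gss.split(X_train, y_train, groups = groups))

# Samples from the same recording never end up on both sides of the split
X_tr, X_val = X_train[train_idx], X_train[val_idx]
y_tr, y_val = y_train[train_idx], y_train[val_idx]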

Random forests have the ability to estimate the importance of each feature. This is done by randomizing the features one at a time and observing how much the accuracy degrades. If a feature is important, then scrambling it drops the accuracy a lot, while scrambling a non-important feature has only a minor effect on performance. In the lecture, we looked at an example where we estimated the feature importances for the Kaggle competition data:


from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

from utils import load_data, extract_features
import matplotlib.pyplot as plt

if __name__ == "__main__":
    
    path = "audio_data"
    
    X_train, y_train, X_test, y_test, class_names = load_data(path)
    
    # Axes: 
    # 2 = average over time
    # 1 = average over frequency
    # None = average over both and concatenate
    
    F_train = extract_features(X_train, axis = None)
    F_test  = extract_features(X_test, axis = None)
    
    # Train random forest model and evaluate accuracy
    #model = RandomForestClassifier(n_estimators = 100)
    model = ExtraTreesClassifier(n_estimators = 100)
    
    model.fit(F_train, y_train)
    y_pred = model.predict(F_test)
    acc = accuracy_score(y_test, y_pred)
    
    print("Accuracy %.2f %%" % (100.0 * acc))
    
    # Plot feature importances
    importances = model.feature_importances_
    plt.bar(range(len(importances)), importances)
    plt.title("Feature importances")
    plt.show()

The resulting feature importances are plotted in the graph below. Here, the first 501 features are the frequency-wise averages for each of the 501 time points, and the last 40 features are the corresponding averages along the time axis. It can be clearly seen that the time averages are more significant for prediction accuracy.
After the RF part, we briefly looked at other ensembles: AdaBoost, GradientBoosting and Extremely Randomized Trees. We also mentioned xgboost, which has often been a winner in Kaggle competitions.

In the second hour, we started the neural network part of the course. The nets are based on mimicking human brain functionality, although they have diverged quite far from the original idea. Among the topics, we discussed the differences between 1990s nets and modern nets (more layers, more data, more exotic layers). At the end of the lecture, we looked at how Keras can be used to train a simple 2-layer dense network.
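For reference, a minimal sketch of such a 2-layer dense network in Keras; the layer width, input dimension and class count below are placeholders to be set according to your data:

from keras.models import Sequential
from keras.layers import Dense

num_features = 40   # placeholder: dimensionality of the feature vectors
num_classes = 10    # placeholder: number of target classes

model = Sequential()
model.add(Dense(64, activation = 'relu', input_shape = (num_features,)))
model.add(Dense(num_classes, activation = 'softmax'))

model.compile(optimizer = 'sgd',
              loss = 'categorical_crossentropy',
              metrics = ['accuracy'])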
