Deep Learning LeeRinji

Convolutional Neural Networks (CNNs / ConvNets)

Architecture Overview

Regular Neural Nets don’t scale well to full images. In CIFAR-10, images are only of size $32\times 32\times 3$ (32 wide, 32 high, 3 color channels), so a single fully-connected neuron in the first hidden layer of a regular Neural Network would have $32\times 32\times 3 = 3072$ weights. This amount still seems manageable, but clearly this fully-connected structure does not scale to larger images. For example, an image of more respectable size, e.g. $200\times 200\times 3$, would lead to neurons that have $200\times 200\times 3 = 120{,}000$ weights. Moreover, we would almost certainly want several such neurons, so the parameters would add up quickly! Clearly, this full connectivity is wasteful, and the huge number of parameters would quickly lead to overfitting.
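The arithmetic above can be checked in a couple of lines; the image sizes here are just the examples from the text:

```python
def fc_weights(width, height, channels):
    """Weights for ONE fully-connected neuron that sees the whole image."""
    return width * height * channels

print(fc_weights(32, 32, 3))    # a CIFAR-10 image: 3072 weights
print(fc_weights(200, 200, 3))  # a 200x200 RGB image: 120000 weights
```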

Convolutional Neural Networks take advantage of the fact that the input consists of images and they constrain the architecture in a more sensible way. In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons arranged in 3 dimensions: width, height, depth. (Note that the word depth here refers to the third dimension of an activation volume, not to the depth of a full Neural Network, which can refer to the total number of layers in a network.) For example, the input images in CIFAR-10 are an input volume of activations, and the volume has dimensions $32\times 32\times 3$ (width, height, depth respectively). As we will soon see, the neurons in a layer will only be connected to a small region of the layer before it, instead of all of the neurons in a fully-connected manner. Moreover, the final output layer would for CIFAR-10 have dimensions $1\times 1\times 10$, because by the end of the ConvNet architecture we will reduce the full image into a single vector of class scores, arranged along the depth dimension. Here is a visualization:


A regular 3-layer Neural Network


A ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels)

Layers used to build ConvNets

A simple ConvNet is a sequence of layers, and every layer of a ConvNet transforms one volume of activations to another through a differentiable function. We use three main types of layers to build ConvNet architectures: Convolutional Layer, Pooling Layer, and Fully-Connected Layer (exactly as seen in regular Neural Networks). We will stack these layers to form a full ConvNet architecture.


Example Architecture: Overview. We will go into more details below, but a simple ConvNet for CIFAR-10 classification could have the architecture [INPUT - CONV - RELU - POOL - FC].
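As a sketch of how the activation volume changes through this pipeline (the choice of 12 filters for the CONV layer is illustrative, not prescribed by the text):

```python
# Activation-volume shapes through [INPUT - CONV - RELU - POOL - FC] for CIFAR-10.
input_shape = (32, 32, 3)   # INPUT: the raw image
conv_shape  = (32, 32, 12)  # CONV: 12 filters of size 3x3, stride 1, pad 1
relu_shape  = conv_shape    # RELU: elementwise, shape unchanged
pool_shape  = (16, 16, 12)  # POOL: 2x2 max pooling with stride 2
fc_shape    = (1, 1, 10)    # FC: one score per CIFAR-10 class

for name, shape in [("INPUT", input_shape), ("CONV", conv_shape),
                    ("RELU", relu_shape), ("POOL", pool_shape), ("FC", fc_shape)]:
    print(name, shape)
```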

Convolutional Layer

To summarize, the Conv Layer:

- Accepts a volume of size $W_1\times H_1\times D_1$.
- Requires four hyperparameters: the number of filters $K$, their spatial extent $F$, the stride $S$, and the amount of zero padding $P$.
- Produces a volume of size $W_2\times H_2\times D_2$, where $W_2=(W_1-F+2P)/S+1$, $H_2=(H_1-F+2P)/S+1$, and $D_2=K$.
- With parameter sharing, introduces $F\cdot F\cdot D_1$ weights per filter, for a total of $(F\cdot F\cdot D_1)\cdot K$ weights and $K$ biases.

A common setting of the hyperparameters is $F=3$,$S=1$,$P=1$. However, there are common conventions and rules of thumb that motivate these hyperparameters.
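A minimal check of the spatial output-size formula $(W-F+2P)/S+1$ shows why this setting is popular: with $F=3$, $S=1$, $P=1$ the spatial size is preserved. (The second example below uses AlexNet-style numbers purely for illustration.)

```python
def conv_output_size(w, f, s, p):
    """Spatial output size of a conv layer: (W - F + 2P) / S + 1."""
    assert (w - f + 2 * p) % s == 0, "hyperparameters must tile the input evenly"
    return (w - f + 2 * p) // s + 1

print(conv_output_size(32, f=3, s=1, p=1))    # 32: F=3, S=1, P=1 preserves size
print(conv_output_size(227, f=11, s=4, p=0))  # 55: a larger filter with stride 4
```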





Pooling Layer

It is common to periodically insert a Pooling layer in-between successive Conv layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence to also control overfitting. The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, using the \textbf{MAX} operation. The most common form is a pooling layer with filters of size $2\times 2$ applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding $75\%$ of the activations. Every MAX operation would in this case be taking a max over 4 numbers (a little $2\times 2$ region in some depth slice). The depth dimension remains unchanged. More generally, the pooling layer:

- Accepts a volume of size $W_1\times H_1\times D_1$.
- Requires two hyperparameters: the spatial extent $F$ and the stride $S$.
- Produces a volume of size $W_2\times H_2\times D_2$, where $W_2=(W_1-F)/S+1$, $H_2=(H_1-F)/S+1$, and $D_2=D_1$.
- Introduces zero parameters, since it computes a fixed function of the input.
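The common $2\times 2$, stride-2 case can be sketched on a single depth slice with NumPy reshaping (a minimal illustration, not the cs231n implementation):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on one depth slice of shape (H, W)."""
    h, w = x.shape
    assert h % 2 == 0 and w % 2 == 0
    # Group pixels into 2x2 blocks, then take the max within each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))  # 2x2 output; each entry is the max of a 2x2 region
```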

The CIFAR-10 dataset

The CIFAR-10 dataset consists of 60000 $32\times 32$ colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class. Here are the classes in the dataset, as well as 10 random images from each:
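Each batch of the Python version of the dataset is a pickled dict whose b'data' entry is a $10000\times 3072$ uint8 array (one flattened image per row) and whose b'labels' entry is the list of class indices. A sketch of loading one batch (the path in the commented example is an assumption about where the archive was extracted):

```python
import pickle

def load_cifar_batch(path):
    """Load one CIFAR-10 batch (Python version); returns (data, labels)."""
    with open(path, 'rb') as f:
        batch = pickle.load(f, encoding='bytes')
    # b'data' rows store each 32x32x3 image channel-first:
    # 1024 red values, then 1024 green, then 1024 blue.
    return batch[b'data'], batch[b'labels']

# data, labels = load_cifar_batch('cs231n/datasets/cifar-10-batches-py/data_batch_1')
```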


The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. “Automobile” includes sedans, SUVs, things of that sort. “Truck” includes only big trucks. Neither includes pickup trucks.


Codes and Results

Run the commands below in a terminal to download the cs231n datasets and build the Cython extensions locally.

cd cs231n/datasets
./get_datasets.sh
cd ../
python setup.py build_ext --inplace

Return to the starting directory and run the command below to start training, writing the log to screen.log; main.py here stands for the training script shown at the end of this section. (It took nearly an hour on a CPU…)

python main.py | tee screen.log

Training ran for 10 epochs, 4900 iterations in total; the final result (from screen.log) is shown below, reaching roughly 70% accuracy on the training set.

(Epoch 10 / 10) train acc: 0.706000; val_acc: 0.598000

The loss curve over the course of training is shown in the figure below.



from cs231n.data_utils import get_CIFAR10_data
from cs231n.classifiers.cnn import ThreeLayerConvNet
from cs231n.solver import Solver
from matplotlib import pyplot

if __name__ == '__main__':
    # Train a three-layer ConvNet on CIFAR-10 using the cs231n Solver
    solver = Solver(ThreeLayerConvNet(), get_CIFAR10_data(), num_epochs=10)
    solver.train()
    pyplot.plot(solver.loss_history)  # the loss curve shown above
    pyplot.ylabel('Loss')
    pyplot.show()


Since the end of the semester is approaching, this lab was not written entirely from scratch the way the earlier BP-algorithm and neural-network labs were; instead it builds on code from cs231n, leaving as the core task writing the main function that trains the CNN. A rough diagram of the CNN used is shown below.



Since common deep-learning frameworks were not allowed, this computation-heavy run took nearly an hour on the CPU… machine learning really does devour computing resources. Presumably a large part of why deep learning has taken off in recent years is owed to advances in computing power.

That said, a semester of lab sessions finally comes to an end here. Looking back over the semester, there were 16 Labs, 4 Projects, 4 Theory assignments, and 1 Presentation (for which, lacking experience, I was criticized rather harshly by the teacher). Overall, I think this AI course packed in more knowledge than any other course I took this semester, and it genuinely taught me a lot (oddly, it was a required course for the previous cohort but became an elective this year). From search at the beginning (breadth-first search, iterative deepening, adversarial search, minimax pruning, and so on), through knowledge representation (Prolog, STRIPS), and finally to machine learning, both the lectures and the labs kept ramping up in difficulty; I spent several full days on this course every week, far beyond the TA's claim that "all lab work can be finished in class" — maybe I'm just not good enough. The machine-learning part in particular involved heavy formula derivation alongside the coding, which made me lament my poor mathematical intuition and foundations. Still, the course is very well designed, and the backgrounds of the various assignments are interesting and useful. I hope the knowledge learned in this course will truly come in handy someday~