CS 231N side notes

CNN models

  • KNN
    • The computational complexity of the Nearest Neighbor classifier is an active area of research. Approximate Nearest Neighbor (ANN) algorithms can accelerate the lookup.
  • Subderivative
    • At a point where the function is not differentiable, basically any value between the one-sided derivatives (a subgradient) can be used, e.g. anything in $[0, 1]$ for ReLU at $x = 0$
  • The power of preprocessing
  • RBF unit instead of Sigmoid unit
    • a topic that needs to be researched
    • after brief research, I found that RBF units are not typically used in deep neural networks. They are only applicable in low-dimensional space?
  • Conjugate gradient descent
    • Used instead of stochastic/mini-batch/steepest descent; a topic that needs to be researched
  • tf.nn.softmax_cross_entropy_with_logits
  • More about cross-entropy
    • From deepnotes, we have $\frac{\partial L}{\partial o_i} = p_i - y_i$; at this blog, we have $\frac{\partial L_i}{\partial f_k} = p_k - \mathbb{1}(y_i = k)$. They are essentially the same thing, where $f_k = o_i$. The $f_k$ is just the output layer, sometimes referred to as the logits (the linear combination $wx + b$). It's interesting that the derivative of a single cross-entropy loss has the same expression as the derivative of the sum of cross-entropies. Note that both are derived for a classification problem. Also, you might take the average of the sum of cross-entropies, in which case a $1/N$ weighting term appears when you do BP.
    • Neat implementation (sketch below):
    • Also the BP through ReLU and Sigmoid is essentially the same, with the minor difference that ReLU has to zero out the gradients where the inputs are negative => dhidden[hidden_layer <= 0] = 0
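    • A minimal numpy sketch of that implementation (my own reconstruction; scores and y are illustrative names for the [N, C] logits and the integer labels):
      • import numpy as np
      • scores = np.random.randn(5, 3)  # illustrative [N, C] logits
      • y = np.array([0, 2, 1, 1, 0])  # illustrative integer labels
      • N = scores.shape[0]
      • scores -= np.max(scores, axis=1, keepdims=True)  # shift for numerical stability
      • probs = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)
      • loss = -np.mean(np.log(probs[np.arange(N), y]))  # average cross-entropy
      • dscores = probs.copy()
      • dscores[np.arange(N), y] -= 1  # p_k - 1(y = k)
      • dscores /= N  # the 1/N weighting from averaging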
  • Dying RELU
  • Universal approximator
    • Any bounded, monotonically increasing function can be used as the activation function for universal approximation.
    • Sigmoid Example
    • Also the paper on this topic: Approximation by Superpositions of a Sigmoidal Function
      • Proof uses functional analysis, which I haven't learned…
    • A rectangle/sigmoid-like bump can be constructed from 4 ReLUs (intuitive Quora answer; sketch below)
    • However, deep NNs are still better in empirical experience. It's known that they are especially good at hierarchical data, such as image recognition.
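    • A minimal numpy sketch of the 4-ReLU bump construction (my own illustration; a, b, and delta are arbitrary parameters):
      • import numpy as np
      • relu = lambda z: np.maximum(0, z)
      • x = np.linspace(-1, 2, 300)
      • a, b, delta = 0.0, 1.0, 0.1  # height-1 bump on [a, b] with ramps of width delta
      • bump = (relu(x - a) - relu(x - a - delta) - relu(x - b) + relu(x - b - delta)) / delta
      • # summing many such bumps approximates any reasonable 1-D function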
  • Approaches that accelerate the network
    • Use ReLU, together with the special weight initialization. Paper: Delving Deep into Rectifiers
      • PReLU is also presented in this paper, but the ideas are essentially the same.
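      • A one-line sketch of that initialization as I understand it (the "Xavier/2" form; fan_in and fan_out are illustrative sizes of a ReLU layer):
        • import numpy as np
        • fan_in, fan_out = 512, 256  # illustrative layer sizes
        • W = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)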
    • Use Batch Normalization: layer-wise normalization of each mini-batch, followed by a learned scale and shift (sketch below). Paper: Batch Normalization
      • Notice that batch normalization differs from population normalization by the coefficient $m/(m-1)$ (the unbiased variance estimate).
      • Batch normalization can be seen as adding stochasticity into the NN. Sometimes you can drop dropout if you're using this trick.
      • Interestingly, most modern convnet architectures don't make use of local response normalization. In practice it had little effect on performance.
      • Batch norm explained
      • Well-explained blog post
      • More advanced blog post, also well explained. In this post the author directly computes the gradient instead of drawing the graph representation. Something to notice is that a special kind of math symbol is used. This combined with the previous blog would give one a very clear understanding of BP and the chain rule.
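      • A minimal numpy sketch of the batch-norm forward pass described above (my own illustration; x is an [N, D] mini-batch, gamma/beta are the learned scale and shift):
        • import numpy as np
        • x = np.random.randn(32, 100)  # illustrative mini-batch
        • gamma, beta, eps = np.ones(100), np.zeros(100), 1e-5
        • mu, var = x.mean(axis=0), x.var(axis=0)  # per-feature batch statistics
        • x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
        • out = gamma * x_hat + beta  # scale and shift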
    • Train a student network to mimic the teacher network, where the teacher network may be an ensemble of deep neural networks. Paper: Do Deep Nets Really Need to be Deep?
      • Train the shallow network to approximate the logits of the deep & complex network instead of the actual labels, since logits can be more informative than the pure [0,1] probability space.
    • Parameter Updates
      • SGD vanilla update
        • x += - learning_rate * dx
        • The drawback is that the learning rate is fixed, but of course you can anneal it.
        • Suitable for large-scale problem when computation time is a constraint.
        • SGD tricks
      • Momentum update
        • v = mu * v - learning_rate * dx # integrate velocity
        • x += v # integrate position
      • Nesterov Momentum (naive implementation)
        • x_ahead = x + mu * v
        • # evaluate dx_ahead (the gradient at x_ahead instead of at x)
        • v = mu * v - learning_rate * dx_ahead
        • x += v
      • Others
        • Adam
          • Computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients.

          • Looks like RMSProp with momentum applied to the first moment of the gradient
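          • A sketch of the update in the same style as the snippets above (without bias correction; typical defaults are beta1 = 0.9, beta2 = 0.999, eps = 1e-8):
            • m = beta1 * m + (1 - beta1) * dx  # first moment (momentum-like)
            • v = beta2 * v + (1 - beta2) * (dx ** 2)  # second moment (RMSProp-like)
            • x += - learning_rate * m / (np.sqrt(v) + eps)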
        • AdaDelta (variant of AdaGrad)
        • RMSprop
          • Fixes AdaGrad's monotonically decreasing learning rate problem
          • Moving average of the squared gradient
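          • A sketch of the update in the same style as the snippets above (decay_rate is a hyperparameter, typically 0.9, 0.99, or 0.999):
            • cache = decay_rate * cache + (1 - decay_rate) * dx ** 2  # moving average of squared gradients
            • x += - learning_rate * dx / (np.sqrt(cache) + eps)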
        • AdaGrad
          • maintains a per-parameter learning rate that improves performance on problems with sparse gradients (e.g. natural language and computer vision problems).

          • cache += dx**2
          • x += - learning_rate * dx / (np.sqrt(cache) + eps)
          • Notice that the weights that receive high gradients will have their effective learning rate reduced, while weights that receive small or infrequent updates will have their effective learning rate increased

    • Second order methods (Newton’s method or quasi-Newton’s method)
      • $x \leftarrow x - [H f(x)]^{-1} \nabla f(x)$
      • May have to do some research on quasi-Newton approaches
      • Generally too expensive to compute
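      • A one-step sketch on a toy quadratic (my own illustration; solve the linear system instead of forming the inverse):
        • import numpy as np
        • A, b = np.array([[3.0, 1.0], [1.0, 2.0]]), np.array([1.0, 1.0])  # toy quadratic f(x) = 0.5 x^T A x - b^T x
        • x = np.zeros(2)
        • H, grad = A, A @ x - b  # Hessian and gradient at x
        • x = x - np.linalg.solve(H, grad)  # one Newton step (exact minimizer for a quadratic)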
    • Convergence rate of stochastic gradient descent
      • The above is taken from SGD tricks
      • Convergence is linear in $t$, which means the residual error $p$ satisfies $-\log(p) \sim t$. The weird naming has a history.
    • Things related to choosing cost function
      • MSE: minimizing the mean squared error cost function would give a function that predicts the mean of y for each value of x
      • Mean absolute error: yields a function that predicts the median value of y for each x
      • Cross-entropy is often selected since it mitigates the diminishing-gradient problem at the last layer. I'm also assuming it's possible to use cross-entropy to predict a continuous output space by shifting and scaling the original output to [0,1].
    • Bayesian Hyperparameter Optimization
      • to appropriately balance the exploration - exploitation trade-off when querying the performance at different hyperparameters
      • need to be researched
      • need to read the Gaussian Process Book
    • Implementation of regularization
      • If a regularization term appears in the loss, then the corresponding code is executed during BP (sketch below). Usually the regularization strength is very small.
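      • A minimal sketch of how L2 regularization typically shows up in the code (my own illustration; reg is the regularization strength, W the weights, dW the gradient from the data loss):
        • loss += 0.5 * reg * np.sum(W * W)  # add 0.5 * reg * ||W||^2 to the data loss
        • dW += reg * W  # its gradient is simply reg * W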
    • 1x1 convolution
      • Reduce feature dimensionality
      • In a way it makes the network wider instead of deeper
      • Paper
    • Dilated convolution
      • must research
      • very useful in segmentation
      • merges spatial information across the inputs much more aggressively with fewer layers
      • Paper
    • Discarding pooling sometimes is a good thing to do
    • Also read about GANs, and all the papers listed in CS 229
      • VAEs and GANs; I heard they make use of Lagrangians
      • Great blog introducing GANs: Link
    • ResNet (see my notes on this topic as well)
      • use of skip connections; average pooling instead of an FC layer
      • heavy usage of batch normalization
      • A must read
    • GoogleNet (see my notes that explain GoogleNet)
    • Learn how to convert parameters into memory
      • VGG example (worked numbers below):
      • When using non-standard gradient descent, such as momentum, Adagrad, or RMSProp, the memory for the parameters needs to be multiplied by 3 or more, since these methods cache stepwise gradient statistics as well. That takes a big part of memory!
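      • A rough back-of-the-envelope sketch (assuming float32 and roughly 138M parameters for VGG-16; the 3x factor is the extra per-parameter caches mentioned above):
        • params = 138e6  # approximate VGG-16 parameter count
        • bytes_per_float = 4  # float32
        • print(params * bytes_per_float / 2**20)  # ~526 MB just for the weights
        • print(3 * params * bytes_per_float / 2**20)  # ~1.5 GB with momentum/Adam-style caches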
    • Saddle points vs local minima
      • Saddle points happen more often in high-dimensional space, since they only require some directions to point up and other directions to point down.
      • A local minimum says that of all the many directions I can move in, all of them cause the loss to go up. That has small probability in high dimensions.
    • TensorFlow vs. PyTorch+Caffe2
      • Parameters & dimension calculation:
        • conv layer
          • parameters = input_depth * filter_size^2 * output_depth
          • dimension_size = (input_size - filter_size)/stride + 1
          • depth = number of filters
        • pool layer
          • parameters = 0
          • dimension_size = (input_size - filter_size)/stride + 1
          • depth = input_depth
        • fc layer
          • parameter = input_size^2 * input_depth * output_depth
          • dimension_size = output_depth
          • depth = 1x1
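        • A small sketch of these formulas as code (my own helpers, adding the padding and bias terms the bullets above omit):
          • def conv_output_size(input_size, filter_size, stride=1, padding=0):
          •     return (input_size - filter_size + 2 * padding) // stride + 1
          • def conv_params(input_depth, filter_size, num_filters):
          •     return input_depth * filter_size ** 2 * num_filters + num_filters  # weights + biases
          • print(conv_output_size(32, 5))  # 28: a 32x32 input with 5x5 filters, stride 1, no padding
          • print(conv_params(3, 5, 10))  # 760: ten 5x5 filters over a depth-3 input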
      • Reasons for zero padding at the edges:
        • preserves dimensionality
        • helps prevent the border information from being washed away
      • Reasons for using 1x1 convolution instead of FC layers:
        • saves space in memory?
        • for an enlarged input image, fully convolutional layers give you more spatial samples of the output
        • reference
        • In FCNs, 1x1 convolution also preserves spatial information
      • Reasons for using smaller filters and stacking more of them
        • more depth with activations means more expressive features, more non-linearity, etc.
        • fewer parameters for the same receptive field over the input volume (worked numbers below)
        • expresses more powerful features of the input, and with fewer parameters
        • drawback: need bigger memory
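        • A worked comparison (the standard VGG-style argument; $C$ channels in and out at every layer, biases ignored): three stacked 3x3 conv layers see the same 7x7 region of the input as one 7x7 layer, but cost $3 \times (3^2 C^2) = 27C^2$ parameters versus $7^2 C^2 = 49C^2$.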
      • Reasons you don’t use regular NN
        • doesn't scale well when the image is large; the number of weights explodes
        • leads to overfitting
    • GoogleNet Case Study: Inception module
      • The problem with GoogleNet is that it is computationally expensive, as each inception module can only stack up depth.
      • also there are a huge number of multiplications going on
      • proposed solution:
        • Reduce depth by using 1x1 conv layers (also called “bottleneck layers”) before you do anything
        • There is no rigorous proof of the benefit gained from this; however, adding a 1x1 conv can be seen as taking some linear combinations of the previous feature maps and then introducing some non-linearity into the system. It also reduces redundancy? (rough worked numbers below)
        • Auxiliary classification outputs to inject additional gradient at lower layers:
          • what this means is that these additional outputs provide gradients that help alleviate the diminishing-gradient problem, since the network is so deep. Smart!
          • also, as mentioned in the literature, one possible thing to do is to average the outputs to get a better result?
        • No FC layer
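        • Rough worked numbers for the bottleneck saving (my own illustrative example, not the paper's exact figures): a 5x5 conv mapping a $28 \times 28 \times 256$ volume to 256 channels costs about $28 \cdot 28 \cdot 256 \cdot (5 \cdot 5 \cdot 256) \approx 1.3 \times 10^9$ multiplies; first reducing to 64 channels with a 1x1 conv costs $28 \cdot 28 \cdot 64 \cdot 256 + 28 \cdot 28 \cdot 256 \cdot (5 \cdot 5 \cdot 64) \approx 3.3 \times 10^8$, roughly a 4x reduction.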
    • ResNet Architecture
      • residual connections
      • can we do better by continuing to stack conv, activation, and pooling layers? No
      • experiments in 2015 showed that a deeper network, such as a 56-layer one, did worse than a 20-layer network on both training and testing. Not caused by overfitting!
      • Hypothesis: deeper models are more difficult to optimize, since a deeper network should be at least as good an approximator as its shallower counterpart (it could copy the shallow network and make the extra layers identities).
      • Under this hypothesis, you don't learn the direct mapping from input space to output space; instead you learn a residual function $f(x) = h(x) - x$, where $h(x)$ is the desired mapping and $x$ is the identity input. Why does this work, and is it easier than learning the direct mapping? They haven't proven it mathematically, but the idea is that, since deeper models are harder to optimize, many of a deep network's layers should be close to the identity, so we only have to learn the identity plus some delta. Again, this is just intuition for the hypothesis. In practice the model works pretty well.
      • ResNet has a property similar to L2 regularization: if you set the weights in a res block to zero, the block just computes (roughly) the identity mapping. In a way it encourages the network not to use the layers it doesn't need (sketch below).
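      • A minimal sketch of a residual block in that spirit (my own illustration, with plain matrix multiplies standing in for the conv layers; W1 and W2 are the block's weights):
        • import numpy as np
        • relu = lambda z: np.maximum(0, z)
        • def residual_block(x, W1, W2):
        •     f = relu(x @ W1) @ W2  # the residual f(x)
        •     return relu(f + x)  # skip connection adds the identity back
        • # with W1 = W2 = 0 the block reduces to relu(x), i.e. (roughly) the identity for post-ReLU inputs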
      • Active area of research! Connections!
      • No dropout used
      • Xavier/2 initialization from He et al.
      • SGD + Momentum (0.9)
      • Learning rate reduced (divided by 10) when the validation error plateaus
      • Batch Normalization after every CONV layer!
      • In 2015 it swept all competitions in image recognition (3.6% top-5 error), better than “human performance”!
      • Improving the ResNet
        • Improved ResNet block design by adjusting the layers in the block's path, giving a better path for backpropagation. He et al. 2016 show an increase in performance.
        • Making wider residual networks and shortening the depth. Zagoruyko et al. 2016 showed that simply adding more filters in each layer, rather than increasing the depth, can improve ResNet's performance. There is also the benefit of being more computationally efficient.
        • Stochastic depth, to alleviate the vanishing-gradient problem. Huang et al. 2016 randomly drop a subset of layers during each training pass and bypass them with the identity function.
        • Others. There are too many; I just got tired of recording them.
    • Beyond ResNets:
      • FractalNet: Ultra-Deep Neural Networks without Residuals Larsson et al. 2017
        • The key is more about transitioning effectively from shallow to deep network.
        • Trained with dropping out sub-paths
      • DenseNet: helps improving the problem with gradient vanishing Huang et al. 2017
      • SqueezeNet: much fewer parameters and much smaller model size, Iandola et al. 2017
        • The future (Google made :))
    • Other vision tasks
  • How to calculate IoU:
    • $0_{class}: 4/7$
    • $1_{class}: 2/6$
    • $2_{class}: 2/4$
    • $3_{class}: 3/4$
    • Mean IoU = $(4/7 + 2/6 + 2/4 + 3/4)/4 = 0.53869$
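    • A minimal sketch for computing mean IoU from prediction and ground-truth label maps (my own illustration; assumes every class appears, so the union is never zero):
      • import numpy as np
      • def mean_iou(pred, target, num_classes):
      •     ious = []
      •     for c in range(num_classes):
      •         inter = np.sum((pred == c) & (target == c))  # true positives for class c
      •         union = np.sum((pred == c) | (target == c))  # TP + FP + FN
      •         ious.append(inter / union)
      •     return np.mean(ious)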