# by Kevin Duh
- NAIST, Japan
- (will be affiliated with JHU, US during the winter school)
- DL is
- a family of methods that uses
- ...
- to learn high-level representations
- as opposed to engineered features
- Lee et al. (2009): vision, automatically learning features
- softmax
- the multinomial variant of the sigmoid (see the sketch below)
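A minimal numpy sketch of the softmax; the example scores are made up:

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; mathematically the result is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # hypothetical pre-activation scores
print(softmax(z))               # probabilities that sum to 1
```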
- toolkits
- Theano, Torch, pylearn2, Caffe, CNTK, CNN
- Bengio survey (Bengio 2009)
- too many layers -> overfitting
- layer-wise pre-training
- Hinton (2006) -> renewed interest in deep learning
- for image recognition, first do computer graphics! (model P(X) instead of P(Y|X))
- why does it work? Erhan et al. (2010)
- is pre-training necessary?
- 2006 yes
- now no
- ...in speech, still used
- word2vec is a kind of pre-training? Chen and Manning (2014)
- RBM
- p(x,h) = \frac{1}{Z_\theta} \exp(-E_\theta(x,h))
-        = \frac{1}{Z_\theta} \exp(x^T W h + b^T x + d^T h)
- x, h \in \{0,1\}
- 1/Z_\theta => normalizer, makes it a probability distribution
- W: correlation/interaction between x and h
- assume b = d = 0
- components of x are independent given h
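The conditional independence noted above can be written out explicitly; with b = d = 0 as assumed, the standard RBM conditionals are:

```latex
p(h_j = 1 \mid x) = \sigma\Big(\sum_i W_{ij} x_i\Big), \qquad
p(x_i = 1 \mid h) = \sigma\Big(\sum_j W_{ij} h_j\Big), \qquad
\sigma(a) = \frac{1}{1 + e^{-a}}
```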
- max likelihood training
- the derivative of the log-likelihood has two terms:
- one increases the probability of the training example x^(m)
- the other decreases the probability of configurations sampled from the model
- contrastive divergence algorithm
- given x^(m) and W
- sample h ~ P(h | x^(m))
- sample x~ ~ P(x | h)
- sample h~ ~ P(h | x~)
- w_ij += \gamma ( x_i^(m) h_j - x~_i h~_j )
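A minimal numpy sketch of one CD-1 update following the steps above (biases dropped as assumed earlier; the sizes and learning rate gamma are made up):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n_visible, n_hidden, gamma = 6, 4, 0.1                   # hypothetical sizes / learning rate
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
x = rng.integers(0, 2, size=n_visible).astype(float)     # one training example x^(m)

# positive phase: sample h ~ P(h | x^(m))
h = (rng.random(n_hidden) < sigmoid(x @ W)).astype(float)
# negative phase: reconstruct x~ ~ P(x | h), then h~ ~ P(h | x~)
x_neg = (rng.random(n_visible) < sigmoid(W @ h)).astype(float)
h_neg = (rng.random(n_hidden) < sigmoid(x_neg @ W)).astype(float)

# CD-1 weight update: w_ij += gamma * (x_i^(m) h_j - x~_i h~_j)
W += gamma * (np.outer(x, h) - np.outer(x_neg, h_neg))
```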
- pictorial
- x \mapsto P(x)
- algo
- distributed repr
- layer-wise pre-training
- intuition: Hinton, Osindero and Teh (2006)
- deep belief net = stacked RBM (Hinton et al 2006)
- p(x) = sum_{h, h', h''} p(x|h) p(h | h') p(h'|h'')
- generation
- init MLP with RBMs
- fine-tune with backprop
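A rough sketch of the greedy recipe: stack RBMs, then use their weights to initialize an MLP. The helper `train_rbm` is a hypothetical stand-in for the CD procedure sketched above, not a library call; sizes and data are made up:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_rbm(data, n_hidden):
    # placeholder for CD training (see the CD-1 sketch above); returns the learned weights
    W = 0.01 * np.random.randn(data.shape[1], n_hidden)
    # ... run contrastive divergence over `data` here ...
    return W

X = np.random.rand(100, 20)        # hypothetical unlabeled data: 100 examples, 20 features
layer_sizes = [15, 10]             # hidden-layer widths, made up

weights, inp = [], X
for n_hidden in layer_sizes:
    W = train_rbm(inp, n_hidden)   # train one RBM on the current representation
    weights.append(W)
    inp = sigmoid(inp @ W)         # push activations up to train the next layer

# `weights` now initializes the hidden layers of an MLP,
# which is then fine-tuned with backprop on labeled data
```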
- cheap replacement for RBMs
- stacked AEs
- h is continuous, deterministic
- denoising
- noise construction is domain specific
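A minimal sketch of one denoising-autoencoder update (masking noise, tied weights, squared-error loss; all sizes and rates are made up, and the exact noise and loss choices are assumptions, not necessarily the lecture's):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n_in, n_hid, lr = 8, 4, 0.1
W = 0.01 * rng.standard_normal((n_in, n_hid))
b, d = np.zeros(n_hid), np.zeros(n_in)

x = rng.random(n_in)                          # one training example
x_noisy = x * (rng.random(n_in) > 0.3)        # masking noise: randomly zero out inputs

h = sigmoid(x_noisy @ W + b)                  # continuous, deterministic hidden code
x_rec = sigmoid(h @ W.T + d)                  # reconstruct the *clean* x

# squared-error loss; backprop through the tied weights
delta_out = (x_rec - x) * x_rec * (1 - x_rec)
delta_hid = (delta_out @ W) * h * (1 - h)
W -= lr * (np.outer(x_noisy, delta_hid) + np.outer(delta_out, h))
d -= lr * delta_out
b -= lr * delta_hid
```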
- recurrence
- in the hidden layer
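A minimal sketch of that recurrence, written as a plain RNN step (sizes and the input sequence are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, T = 3, 5, 4
W_xh = 0.1 * rng.standard_normal((n_in, n_hid))
W_hh = 0.1 * rng.standard_normal((n_hid, n_hid))
xs = rng.standard_normal((T, n_in))          # a hypothetical input sequence

h = np.zeros(n_hid)
for x in xs:
    # the hidden state feeds back into itself at the next time step
    h = np.tanh(x @ W_xh + h @ W_hh)
print(h)
```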
- LSTM
- gates: forget (i.e., what to remember), input, output
- Gers et al. (2002)
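A minimal sketch of one LSTM step with the three gates listed above (biases dropped for brevity; this is one standard formulation, assumed rather than taken from the lecture):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
# one weight matrix per gate plus one for the candidate cell value
W_f, W_i, W_o, W_c = (0.1 * rng.standard_normal((n_in + n_hid, n_hid)) for _ in range(4))

x = rng.standard_normal(n_in)                # current input
h, c = np.zeros(n_hid), np.zeros(n_hid)      # previous hidden state and cell state

z = np.concatenate([x, h])
f = sigmoid(z @ W_f)                         # forget gate: what to keep of the old cell
i = sigmoid(z @ W_i)                         # input gate: how much new content to write
o = sigmoid(z @ W_o)                         # output gate: what to expose
c = f * c + i * np.tanh(z @ W_c)             # updated cell state
h = o * np.tanh(c)                           # new hidden state
```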
- convolution
- pooling
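A minimal 1-D sketch of convolution followed by max pooling (the input signal and filter are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 1.0])   # hypothetical input signal
w = np.array([0.5, 1.0, 0.5])                    # hypothetical filter of width 3

# convolution: slide the same filter over every window of the input
conv = np.array([x[i:i + len(w)] @ w for i in range(len(x) - len(w) + 1)])

# max pooling with width 2: keep the largest response in each window
pooled = np.array([conv[i:i + 2].max() for i in range(0, len(conv) - 1, 2)])
print(conv, pooled)
```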
- optimization
- making sgd work
- learning curve
- validation error increases from the start:
- regularization needed
- figure on the learning rate \gamma from Schaul et al. (2013)
- batch
- matrix-matrix multiplication instead of vector-matrix multiplication, faster (see the sketch below)
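The mini-batch point as a sketch: stacking examples into a matrix turns many vector-matrix products into a single matrix-matrix product (sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((100, 50))
batch = rng.standard_normal((32, 100))      # 32 examples stacked into one matrix

# one example at a time: 32 vector-matrix multiplications
outs = np.stack([x @ W for x in batch])
# whole mini-batch at once: a single matrix-matrix multiplication
outs_batched = batch @ W
assert np.allclose(outs, outs_batched)
```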
- random init
- in the range where the absolute derivative of the sigmoid is not too small
- fan-in: in-degree of a unit
- paper: Efficient BackProp (LeCun et al.)
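A sketch of fan-in-scaled random initialization in the spirit of Efficient BackProp; the exact rule used here (uniform in ±1/sqrt(fan-in)) is one common choice, assumed rather than taken from the slides:

```python
import numpy as np

def init_weights(fan_in, fan_out, rng):
    # small weights scaled by fan-in keep sigmoid pre-activations in the range
    # where the derivative is not too small
    bound = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W1 = init_weights(784, 100, rng)   # hypothetical layer sizes
W2 = init_weights(100, 10, rng)
```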
- non-convex
- momentum
- adaptive learning rates:
- adagrad: each weight has its own learning rate
- adadelta: no global learning rate that would have to be tuned
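Sketches of the update rules just listed, for momentum and AdaGrad (hyperparameters and the gradient are made up; AdaDelta is omitted for brevity):

```python
import numpy as np

def sgd_momentum(w, grad, velocity, lr=0.01, mu=0.9):
    # momentum: accumulate a velocity that smooths the gradient direction
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

def adagrad(w, grad, cache, lr=0.01, eps=1e-8):
    # AdaGrad: each weight gets its own effective learning rate,
    # shrinking where gradients have historically been large
    cache = cache + grad ** 2
    return w - lr * grad / (np.sqrt(cache) + eps), cache

w = np.zeros(3)
v = np.zeros(3)
c = np.zeros(3)
g = np.array([0.5, -1.0, 0.2])     # hypothetical gradient
w, v = sgd_momentum(w, g, v)
w, c = adagrad(w, g, c)
```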
### second order (Newton) methods
- quasi-Newton: L-BFGS
- conjugate gradient
- drop-out
- error goes on decreasing
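A sketch of dropout at training time; the inverted-dropout scaling used here is a common variant, assumed rather than taken from the lecture:

```python
import numpy as np

def dropout(h, p_drop, rng):
    # zero out each hidden unit with probability p_drop;
    # scale the survivors so the expected activation is unchanged (inverted dropout)
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask

rng = np.random.default_rng(0)
h = rng.random(10)                 # hypothetical hidden-layer activations
h_train = dropout(h, 0.5, rng)     # used during training; at test time use h as-is
```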
- multi-task learning
- Liu et al (2015)