1ambda edited this page Oct 22, 2014 · 1 revision

Last time, we introduced the gradient descent algorithm to solve regression, and looked at learning rates, vectorization, and so on. This time we learn about classification and regularization.

์ด ์ˆ˜์—…์ด ์žฌ๋ฐŒ๋Š” ์ด์œ ๋Š” ์ˆ˜์‹์„ ์ฆ๋ช…ํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค ์ˆ˜์‹์†์— ์ˆจ๊ฒจ์ง„ ๋‚ด์šฉ๋“ค์„ ์ง๊ด€์ ์œผ๋กœ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๊ฒŒ ์„ค๋ช…ํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๊ต์ˆ˜๋‹˜ ๊ณผ์ œ๋Š” ์ œ๋ฐœ ๊ทธ๋งŒ

Classification

regression ์ด continuous value ๋ฅผ ๋‹ค๋ฃฌ๋‹ค๋ฉด Classification ์€ discrete value ๋ฅผ ๋‹ค๋ฃฌ๋‹ค. ๋”ฐ๋ผ์„œ Classification (๋ถ„๋ฅ˜) ์˜ ์˜ˆ๋Š”,

  • ์ด๋ฉ”์ผ์ด ์ŠคํŒธ์ธ์ง€ / ์•„๋‹Œ์ง€
  • ์˜จ๋ผ์ธ ๊ฑฐ๋ž˜๊ฐ€ ์‚ฌ๊ธฐ์ธ์ง€ / ์•„๋‹Œ์ง€ (Online Transaction: Fraudulent)
  • ์•…์„ฑ ์ข…์–‘์ธ์ง€ / ์•„๋‹Œ์ง€

(http://stats.stackexchange.com)

์œ„์™€ ๊ฐ™์€ ๊ฒฝ์šฐ, Regression ์œผ๋กœ ๋ฌธ์ œ๋ฅผ ํ’€๋ฉด ๋‹น์žฅ์€ ๋งž์•„ ๋ณด์ด๋‚˜, ์ข…์–‘์ด ์ด์ƒํ•œ ์œ„์น˜์— ์ƒ๊ฒผ์„ ๊ฒฝ์šฐ ์•„๋ž˜์™€ ๊ฐ™์ด ์ง์„ ์ด ํฌ๊ฒŒ ๋ณ€ํ•œ๋‹ค.

(http://stats.stackexchange.com)

๋”ฐ๋ผ์„œ ์ด๋ ‡๊ฒŒ discrete value ์— ๋Œ€ํ•ด์„œ๋Š” Regression ๋ณด๋‹ค๋Š” Threshold ์— ๊ธฐ๋ฐ˜์„ ๋‘์–ด, h(x) ๊ฐ€ ์ผ์ • ๊ฐ’ ์ด์ƒ์ด๋ฉด y=1 ๋กœ ์˜ˆ์ธกํ•˜๋Š” ํŽธ์ด ๋” ์ •ํ™•๋„๊ฐ€ ๋†’์•„์ง„๋‹ค. ๊ฒŒ๋‹ค๊ฐ€ regression ์€ ์ง์„ ์ด๊ธฐ ๋•Œ๋ฌธ์—, 0 <= y <= 1 ์ธ y ์— ๋Œ€ํ•ด์„œ 0๋ณด๋‹ค ์ž‘๊ฑฐ๋‚˜, 1๋ณด๋‹ค ๋” ํฐ y ๋ฅผ ๋งŒ๋“ค์–ด๋‚ผ ์ˆ˜ ์žˆ๋‹ค.

์ด๋Ÿฐ ์ด์œ  ๋•Œ๋ฌธ์— Classification ๋ฌธ์ œ์— Regression ์„ ์ž˜ ์‚ฌ์šฉํ•˜์ง€ ์•Š๋Š”๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ y ์˜ ๋ฒ”์œ„๊ฐ€ 0 <= h(x) <= 1 ์„ ๊ฐ€์ง€๋Š” Logistic Regression ๋„ ์žˆ๋‹ค. ์ด๊ฑด Classification ์— ์‚ฌ์šฉ๋˜๊ธฐ๋„ ํ•œ๋‹ค.

Logistic Regression

์ด์ „์— ์–ธ๊ธ‰ํ–ˆ๋“ฏ์ด classification ์—์„  ์˜ˆ์ธก๋œ ๊ฐ’, ์ฆ‰ h(x) ๊ฐ’์ด 0 ๊ณผ 1์‚ฌ์ด์— ์žˆ๊ธธ ๋ฐ”๋ž€๋‹ค. ์ด๋ฅผ ์œ„ํ•ด logistic function, ํ˜น์€ sigmoid function ์ด๋ผ ๋ถˆ๋ฆฌ๋Š” ์•„๋ž˜ ์‹์„ hypothesis h(x) ์— ์ ์šฉํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™์€ ๊ทธ๋ฆผ์ด ๋‚˜์˜จ๋‹ค.

(http://www.saedsayad.com)

Since h(x) with the sigmoid applied has a maximum of 1, it can be read as the probability that y = 1 for input x. That is,

h(x) = P(y = 1 | x ; ฮธ)

Probability that y = 1, given x, parameterized by ฮธ (theta)

Looking at the sigmoid function, y is greater than 0.5 wherever the x-axis value is greater than 0, so beyond that point we can predict y = 1; conversely, where the x-axis value is less than 0 we predict y = 0.

๊ทธ๋Ÿฐ๋ฐ h(x) = g(0^T * x) ์ด๋ฏ€๋กœ, ๋ณธ๋ž˜์˜ hypothesis 0^T * x ๊ฐ€ 0์ด ๋˜๋Š” ์ง€์ ์„ ์ฐพ์œผ๋ฉด ๋œ๋‹ค.

(http://blog.csdn.net/abcjennifer/)

Decision Boundary

Now let's apply this to an actual problem. Suppose we have two groups as below, and a hypothesis h(x) = g(ฮธ0 + ฮธ1*x1 + ฮธ2*x2) to separate them.

(http://blog.csdn.net/abcjennifer/)

If we set ฮธ (theta) to [-3; 1; 1], the region where y = 1 is where ฮธ^T * x >= 0, that is, where -3 + x1 + x2 >= 0. Writing this out,

x1 + x2 >= 3, which gives the pink line in the figure above. This line is called the Decision Boundary, and it is exactly where ฮธ^T * x = 0, i.e. where h(x) = 0.5.
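The ฮธ = [-3; 1; 1] example can be checked directly; a small Python sketch (function name is mine):

```python
def predict(theta, x1, x2):
    # predict y = 1 exactly when theta^T x >= 0, i.e. when h(x) >= 0.5
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if z >= 0 else 0

theta = [-3, 1, 1]           # decision boundary: x1 + x2 = 3
print(predict(theta, 1, 1))  # 0 — below the line
print(predict(theta, 2, 2))  # 1 — above the line
```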

What about a non-linear decision boundary?

(http://blog.csdn.net/abcjennifer/)

Here we introduce new features x1^2 and x2^2, and set the parameter theta to [-1; 0; 0; 1; 1]. Solving the resulting inequality gives the circular decision boundary shown above.

By combining features well — that is, by building the right polynomial — you can even get peanut- or heart-shaped decision boundaries.
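The circular boundary works the same way; a minimal sketch assuming ฮธ = [-1; 0; 0; 1; 1] over the features [1, x1, x2, x1^2, x2^2]:

```python
def predict_circle(x1, x2):
    # theta^T x = -1 + 0*x1 + 0*x2 + 1*x1^2 + 1*x2^2
    z = -1 + x1 ** 2 + x2 ** 2
    return 1 if z >= 0 else 0   # boundary is the unit circle x1^2 + x2^2 = 1

print(predict_circle(0, 0))  # 0 — inside the circle
print(predict_circle(2, 0))  # 1 — outside the circle
```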

Cost Function

The remaining question is how to choose theta; let's look at the formulas again.

(http://www.holehouse.org/)

If we plug the current h(x) — the sigmoid-wrapped hypothesis — into the squared-error cost function used for linear regression, the resulting J(ฮธ) is non-convex. Gradient descent may then settle on a local optimum rather than the global optimum.

To prevent this, we need a cost function that is convex:

(http://www.holehouse.org/)

With this cost function, for y = 1 we get the graph below (recall that 0 <= h(x) <= 1). When y = 1, the cost blows up as h(x) goes toward 0, so gradient descent is pushed the other way, toward lower cost.

์ง๊ด€์ ์œผ๋กœ ๋ณด๋ฉด, h(x) ์ž์ฒด๋Š” y = 1 ์ผ ํ™•๋ฅ ์ธ๋ฐ, y = 1 ์ผ๋•Œ, h(x) = 0 ์ด๋ผ๋Š” ๊ฒƒ์€ ๋ง์ด ์•ˆ ๋˜๋ฏ€๋กœ ๋น„์šฉ์ด ๋ฌดํ•œ๋Œ€๋กœ ์ฆ๊ฐ€ํ•˜๋Š” ๊ฒƒ์ด ๋ง์ด ๋œ๋‹ค.

(http://www.holehouse.org/)

Conversely, the graph for y = 0 shows the cost falling as h(x) goes to 0 — that is, as the predicted probability that y = 1 goes to 0.

(http://www.holehouse.org/)

๊ฒฐ๊ตญ ์•„๋ž˜์˜ ์ƒˆ๋กœ์šด logistric regression cost function ์„ ์ด์šฉํ•˜๋ฉด, J(0) ๋ฅผ convex function ์œผ๋กœ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค.

(http://www.holehouse.org/)

Simplified Cost Function and Gradient Descent

Now let's write the cost function — previously split into the y = 0 and y = 1 cases — more compactly.

(http://www.holehouse.org/)

๋‹ค์Œ๊ณผ ๋™์ผํ•˜๋‹ค. y = 0, y = 1 ์„ ์ง์ ‘ ๋„ฃ์–ด๋ณด๋ฉด ๊ธˆ๋ฐฉ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

cost(hฮธ(x),y) = -y * log(hฮธ(x)) - (1-y) * log(1 - hฮธ(x))
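The combined expression is easy to verify numerically; a small sketch (values are illustrative):

```python
import math

def cost(h_x, y):
    # -y*log(h) - (1-y)*log(1-h): one expression covering both the
    # y = 1 branch (-log h) and the y = 0 branch (-log(1 - h))
    return -y * math.log(h_x) - (1 - y) * math.log(1 - h_x)

print(cost(0.99, 1))  # ≈ 0.01  — confident and correct: near-zero cost
print(cost(0.01, 1))  # ≈ 4.61  — confident but wrong: large cost
```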


์ž ์ด์ œ ๋‹ค์‹œ ๋ณธ๋ก ์œผ๋กœ ๋Œ์•„์™€์„œ, ์šฐ๋ฆฌ๋Š” ์ฒ˜์Œ์— `theta` ๋ฅผ ์ฐพ๊ธธ ์›ํ–ˆ๊ณ , ๊ทธ๋ž˜์„œ *gradient descent* ๋ฅผ ์“ฐ๋ ค๊ณ  ํ–ˆ๋Š”๋ฐ, ๋งˆ์นจ ๋ณด๋‹ˆ `h(x)` ๊ฐ€ *sigmoid function* ์ด ์ ์šฉ๋œ ํ˜•ํƒœ๋ผ์„œ *non-convex function* ์ด๋ฏ€๋กœ, `h(x)` ๋ฅผ ํฌํ•จํ•œ *cost-function* ์ด *convex function* ์ด ๋˜๋Š” ์‹์„ ์ฐพ์•„๋ƒˆ๋‹ค. ์ด์ œ ๊ทธ ์‹์„ *gradient descent* ์— ์ ์šฉํ•˜๋ฉด,

์ด๊ณ , ์ด์ œ ์ด๊ฑธ batch gradient descent ์— ์ ์šฉํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™์€๋ฐ, ์—ฌ๊ธฐ์— partial derivative ๋ฅผ ์ ์šฉํ•˜๋ฉด

๋†€๋ž๊ฒŒ๋„ linear regression ๊ณผ ๊ฐ™์€ ์‹์ด ๋‚˜์˜จ๋‹ค. ์˜ค์˜ค ๋จธ์‹ ๋Ÿฌ๋‹ ์˜ค์˜ค

(http://www.holehouse.org/)

๋‹ค๋งŒ ๋‹ค๋ฅธ์ ์€ hypothesis ๊ฐ€ sigmoid function ์„ ์ ์šฉํ•œ ํ˜•ํƒœ๋ผ๋Š” ๊ฒƒ,

(http://www.holehouse.org/)
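Putting the pieces together, here is a rough pure-Python sketch of batch gradient descent for logistic regression — the dataset, step size, and iteration count are made up for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(theta, X, y, alpha):
    # one batch update: theta_j -= alpha/m * sum_i (h(x_i) - y_i) * x_ij
    m = len(X)
    h = [sigmoid(sum(t * xj for t, xj in zip(theta, x))) for x in X]
    return [t - alpha / m * sum((h[i] - y[i]) * X[i][j] for i in range(m))
            for j, t in enumerate(theta)]

# tiny one-feature dataset (x0 = 1 is the bias term); labels split around x = 2
X = [[1, 0], [1, 1], [1, 3], [1, 4]]
y = [0, 0, 1, 1]
theta = [0.0, 0.0]
for _ in range(5000):
    theta = gradient_step(theta, X, y, 0.5)

# after training, h(x) crosses 0.5 between the two groups
print(sigmoid(theta[0] + theta[1] * 0) < 0.5)  # True
print(sigmoid(theta[0] + theta[1] * 4) > 0.5)  # True
```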

Advanced Optimization

์œ„์—์„œ ๋ณด์•˜๊ฒ ์ง€๋งŒ, J(0) ์˜ ์ตœ์†Œ๊ฐ’์„ ์ฐพ๊ธฐ ์œ„ํ•ด์„œ๋Š” ์•„๋ž˜ ๋‘๊ฐœ์˜ ๊ฐ’์„ ๊ตฌํ•ด์•ผ ํ•œ๋‹ค.

(http://www.holehouse.org/)

์ด ๊ฐ’๋“ค์„ ์ด์šฉํ•ด์„œ gradient descent ๋Œ€์‹  ๋‹ค์Œ์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

(1) Conjugate gradient
(2) BFGS
(3) L-BFGS

์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์˜ ์žฅ์ ์€, leanring rate ๋ฅผ ๊ณ ๋ฅผ ํ•„์š”๊ฐ€ ์—†๊ณ , ๋Œ€๋ถ€๋ถ„ gradient decsent ๋ณด๋‹ค ๋น ๋ฅด๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ ๋” ๋ณต์žกํ•˜๊ณ , ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋งˆ๋‹ค ๊ตฌํ˜„์ด ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ๋””๋ฒ„๊น…์ด ํž˜๋“ค์ˆ˜ ์žˆ๋‹ค. ์ž ์ด์ œ advanced optimization ์„ ์ด์šฉํ•ด ๋ณด์ž.

(http://www.holehouse.org/)

์œ„์™€ ๊ฐ™์€ ์‹์— ๋Œ€ํ•ด์„œ cost function ์„ octave ์—์„œ ์ด๋ ‡๊ฒŒ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค.

(http://www.holehouse.org/)


Now, feed the costFunction we wrote, an initial theta, and some options into Octave's built-in fminunc:

% define the options data structure
options = optimset('GradObj', 'on', 'MaxIter', 100);

% initialize the theta values
initialTheta = zeros(2,1);

% run the algorithm
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

optTheta is the theta we wanted to find, functionVal returns the final cost, and exitFlag tells us whether the algorithm converged or not.

If you want the theta for logistic regression, just write the cost function to match logistic regression.
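For readers without Octave, a rough Python analog of the same contract — a cost function that returns both the cost and its gradient — with plain gradient descent standing in for fminunc, assuming the simple quadratic example J(ฮธ) = (ฮธ1 - 5)^2 + (ฮธ2 - 5)^2 from the lecture notes:

```python
def cost_function(theta):
    # returns (jval, gradient): the same two things fminunc expects
    jval = (theta[0] - 5) ** 2 + (theta[1] - 5) ** 2
    gradient = [2 * (theta[0] - 5), 2 * (theta[1] - 5)]
    return jval, gradient

# stand-in for fminunc: repeatedly step along the returned gradient
theta = [0.0, 0.0]
for _ in range(100):
    jval, grad = cost_function(theta)
    theta = [t - 0.1 * g for t, g in zip(theta, grad)]

print(theta)  # ≈ [5.0, 5.0], where J(theta) is minimized
```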

Multiclass Classification

Now consider classification with more than just y = 0 or 1 (binary classification) — classification with many classes. For example, weather can be classified as sunny, cloudy, hot, cold, and so on.

one-vs-all (One-vs-rest)

One way to classify multiple classes is to pick one class and classify it against all the rest. Repeat this once per class:

(http://www.holehouse.org/)

In the figure above there are 3 classes, so taking i = 1, 2, 3 we get a prediction hฮธ^(i)(x) for each i. When some new input comes in, we classify it by choosing the i that maximizes hฮธ^(i)(x). Easy, right?
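A sketch of the one-vs-all decision rule, with three hypothetical already-trained ฮธ vectors (all the numbers are invented for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# one (made-up) trained theta per class, over features [1, x1, x2]
thetas = [[-1.0,  2.0,  0.0],   # class 1: large x1
          [-1.0,  0.0,  2.0],   # class 2: large x2
          [ 1.0, -2.0, -2.0]]   # class 3: near the origin

def predict_class(x1, x2):
    # evaluate every h_theta^(i)(x) and pick the class with the highest score
    x = [1, x1, x2]
    scores = [sigmoid(sum(t * xi for t, xi in zip(th, x))) for th in thetas]
    return scores.index(max(scores)) + 1

print(predict_class(3, 0))  # 1
print(predict_class(0, 3))  # 2
print(predict_class(0, 0))  # 3
```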

Overfitting

Overfitting ์€ ๋„ˆ๋ฌด๋‚˜ ๋งŽ์€ feature ๊ฐ€ ์žˆ์„ ๋•Œ๋Š” cost function ์ด ํŠธ๋ ˆ์ด๋‹ ์…‹์— ์ž˜ ๋งž์•„ 0 ์— ์ˆ˜๋ ด ํ•˜์ง€๋งŒ, ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ๊ฐ€ ๋“ค์–ด์™”์„๋•Œ๋Š” ์˜ˆ์ธก์„ ์ž˜ ํ•˜์ง€ ๋ชปํ•˜๋Š” ๊ฒฝ์šฐ๋ฅผ ๋งํ•œ๋‹ค. ๋‹ค์‹œ ๋งํ•ด hypothesis ๊ฐ€ ๋„ˆ๋ฌด ๊ณ ์ฐจ์›์˜ ๋‹คํ•ญ์‹์ด์–ด์„œ ๊ทธ๋ ‡๋‹ค. (too many parameters) ์ฆ‰ ์•„๋ž˜ ๊ทธ๋ฆผ์—์„œ ์ขŒ์ธก์€ ๊ฒฝํ–ฅ์„ ๋‚˜ํƒ€๋‚ด๊ธด ํ•˜์ง€๋งŒ ๋ชจ๋“  ํŠธ๋ ˆ์ด๋‹์…‹์„ ๊ฒฝ์œ ํ•˜๋Š” ์ง์„ ์€ ๋งŒ๋“ค์–ด๋‚ด์ง€ ๋ชปํ–ˆ๋‹ค. (under fit) ๋ฐ˜๋ฉด ๊ฐ€์žฅ ์šฐ์ธก์€, ํŠธ๋ ˆ์ด๋‹์…‹์„ ๋ชจ๋‘ ๊ฒฝ์œ ํ•˜๋Š” hypothesis ๋ฅผ ๋งŒ๋“ค์–ด ๋ƒˆ์ง€๋งŒ, ๋‹คํ•ญ์‹์˜ ์ฐจ์ˆ˜๊ฐ€ ๋„ˆ๋ฌด ๋†’์•„ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ๊ฐ€ ๋“ค์–ด์™”์„ ๋•Œ ์˜ˆ์ธกํ•˜์ง€ ๋ชปํ•  ์ˆ˜๊ฐ€ ์žˆ๋‹ค. can't apply, unable to generalize ๊ต์ˆ˜๋‹˜์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์Šฌ๋ผ์ด๋“œ์— ์ ์œผ์…จ๋‹ค.

It makes accurate predictions for examples in the training set, but it does not generalize well to make accurate predictions on new, previously unseen examples

(http://www.holehouse.org/)

logistic regression ์—์„œ๋„ Overfitting ์ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋‹ค.

(http://www.holehouse.org/)

It mostly occurs when the training set is small and there are many features. The remedies:

(1) Reduce the number of features — either select which features to use manually, or use a model selection algorithm.
(2) Use regularization — keep all the features, but change how much each feature contributes to the prediction.

Regularization, Cost function

Regularization ์€ ์›ํ•˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ hypothesis ์— ๊ธฐ์—ฌํ•˜๋Š” ๋ฐ”๋ฅผ ์กฐ์ ˆํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ์šฐ๋ฆฌ๊ฐ€ ๋งŒ์•ฝ์— 0_3 ๊ณผ 0_4 ๋ฅผ ์ตœ์†Œํ™” ํ•˜๊ณ  ์‹ถ๋‹ค๊ณ  ํ•˜์ž. ๊ทธ๋Ÿผ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์‹์„ ๋งŒ๋“ค๋ฉด ๋œ๋‹ค. ์ „์ฒด ์‹์˜ ์ตœ์†Œ๊ฐ’์„ ์ฐพ๋Š” ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์—, ์ƒ์ˆ˜๊ฐ€ 1000 ์ธ 0_3, 0_4 ๋Š” 0(zero) ์— ๊ฐ€๊นŒ์šด ์ˆ˜๊ฐ€ ๋‚˜์˜จ๋‹ค. ๋‹ค์‹œ ๋งํ•ด์„œ ์ด๋“ค ๋‘ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ๊ธฐ์—ฌํ•˜๋Š” ๋ฐ”๋ฅผ ์ค„์ธ ๊ฒƒ์ด๋‹ค.

(http://www.holehouse.org/)

The smaller the parameters, the simpler the hypothesis, and the less it overfits. To that end, we append a term with a regularization parameter ฮป to the cost function to control how much the parameters contribute, giving the expression below. The appended part is called the regularization term; note that j starts from 1 — meaning ฮธ_0 is not regularized.

(http://www.holehouse.org/)
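A sketch of the regularized cost in Python (data in the demo is illustrative; note that ฮธ_0 is excluded from the regularization term):

```python
import math

def regularized_cost(theta, X, y, lam):
    # J = (1/m) * sum[-y log h - (1-y) log(1-h)] + (lam/2m) * sum_{j>=1} theta_j^2
    m = len(X)
    total = 0.0
    for x, yi in zip(X, y):
        h = 1.0 / (1.0 + math.exp(-sum(t * xj for t, xj in zip(theta, x))))
        total += -yi * math.log(h) - (1 - yi) * math.log(1 - h)
    reg = lam / (2 * m) * sum(t ** 2 for t in theta[1:])  # theta_0 is skipped
    return total / m + reg

c0 = regularized_cost([0.0, 1.0], [[1, 0]], [1], 0.0)
c1 = regularized_cost([0.0, 1.0], [[1, 0]], [1], 10.0)
print(c1 - c0)  # 5.0 — exactly lam/(2m) * theta_1^2 = 10/2 * 1
```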

What happens if ฮป is very large? Every parameter other than ฮธ_0 converges to 0, so the hypothesis becomes a constant and underfits the training set.

Regularized Linear Regression

Since j in the regularization term starts from 1, separating ฮธ_0 out makes the cost function easier to compute; the gradient descent update can then be written as follows.

(http://www.holehouse.org/)

Rearranging the second of the two updates above gives the following,

<๋ฉด ๋œ๋‹ค. ์ด๋•Œ ์ด ๋งคํŠธ๋ฆญ์Šค์˜ `(0, 0)` ์œ„ ์‹์—์„œ ์•ž๋ถ€๋ถ„์€ ์•„๋ž˜์™€ ๊ฐ™๋‹ค. ๋ณดํ†ต `m` ์ด ๋งค์šฐ ํฌ๊ณ , `a` ๊ฐ€ ๋งค์šฐ ์ž‘์œผ๋ฏ€๋กœ ์œ„ ๊ฐ’์€ 1๋ณด๋‹ค ์ž‘๋‹ค. ์˜ˆ๋ฅผ ๋“ค๋ฉด `0.99 * 0_j` ์ฒ˜๋Ÿผ.

Now consider how to apply this to the normal equation. The original normal equation is shown below,

and to the X^T * X part we add ฮป times an (n+1) × (n+1) matrix. The (0, 0) entry of this matrix is 0 precisely so that regularization is not applied to ฮธ_0.

So what if X^T * X is non-invertible? As mentioned last time, this happens when there are too many redundant features, or when m <= n — that is, too many features relative to the size of the training set.

๋†€๋ž๊ฒŒ๋„, ฮป > 0 ์ด๋ฉด, ์•„๋ž˜ ์‹์—์„œ X^T * X + ฮป (ฮป's (0, 0) = 0) ์€ ์ œ๋Œ€๋กœ invertible ํ•จ์„ ์ฆ๋ช…ํ•  ์ˆ˜ ์žˆ๋‹ค. ๋‹ค์‹œ ๋งํ•ด์„œ regularzation ์„ ํ†ตํ•ด์„œ non-invertible ๋ฌธ์ œ๋„ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ.

(http://www.holehouse.org/)
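The ฮป-matrix is just an identity matrix with its (0, 0) entry zeroed; a sketch in plain Python:

```python
def reg_matrix(n):
    # (n+1) x (n+1) identity-like matrix with the (0, 0) entry zeroed,
    # so that theta_0 is left out of regularization in the normal equation
    M = [[1.0 if i == j else 0.0 for j in range(n + 1)] for i in range(n + 1)]
    M[0][0] = 0.0
    return M

print(reg_matrix(2))
# [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```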

Regularized Logistic Regression

As with linear regression, separate ฮธ_0 from ฮธ_1 onward and add the regularization term. The difference is that h(x) takes the sigmoid form.

And instead of hand-rolled gradient descent, the algorithms Octave provides (conjugate gradient, BFGS, L-BFGS, etc.) can be used through fminunc. As mentioned, this requires writing a cost function that returns jval and the gradient with respect to ฮธ (theta); since a regularization term has been added, simply add the corresponding value when computing the gradient for each ฮธ.

(http://www.holehouse.org/)
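A sketch of such a cost function in Python — it returns jval and the gradient in the shape an fminunc-style optimizer expects, with the ฮป term skipped for ฮธ_0 (data in the test is illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_function_reg(theta, X, y, lam):
    # returns (jval, grad): regularized logistic cost and its gradient
    m = len(X)
    h = [sigmoid(sum(t * xj for t, xj in zip(theta, x))) for x in X]
    jval = sum(-yi * math.log(hi) - (1 - yi) * math.log(1 - hi)
               for hi, yi in zip(h, y)) / m
    jval += lam / (2 * m) * sum(t ** 2 for t in theta[1:])  # skip theta_0
    grad = [sum((h[i] - y[i]) * X[i][j] for i in range(m)) / m
            for j in range(len(theta))]
    for j in range(1, len(theta)):   # lambda term for every theta_j, j >= 1
        grad[j] += lam / m * theta[j]
    return jval, grad

jval, grad = cost_function_reg([0.0, 0.0], [[1, 1], [1, -1]], [1, 0], 1.0)
print(jval)  # log(2): h = 0.5 everywhere when theta = 0
```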


Summary

3์ฃผ์งธ์—๋Š” Classification ๊ณผ Regularization ์— ๋Œ€ํ•ด์„œ ๋ฐฐ์› ๋‹ค. ์ˆ˜์—…์€ ์–ด๋ ต์ง€ ์•Š๋‹ค. ๊ณผ์ œ๊ฐ€ ๋ฌธ์ œ์ง€ ใ… ใ…  ๊ต์ˆ˜๋‹˜. ํŒŒ์ด์ฌ์œผ๋กœ ๊ณผ์ œ๋ฅผ ๋‚ด์ฃผ์…จ์œผ๋ฉด ์ข€ ๋” ๋ฐฐ์šฐ๋Š” ๋ง›์ด ์žˆ์—ˆ์„ํ…๋ฐ์š”!

References

(1) why-not-approach-classification-through-regression
(2) http://www.saedsayad.com
(3) http://blog.csdn.net/abcjennifer/
(4) http://www.holehouse.org/

Machine Learning by Andrew Ng, Coursera