Module4. Supervised Learning (Classification/Regression)

Part1. Foundation of Supervised Learning

[Machine learning problems]

Machine learning: the process of learning the patterns inherent in data
- Binary classification (Is it a spam mail or not?)
- Multi-class classification (Image recognition)
- Regression (House prices)

Learning pipeline
- Input data with label
  -> training (training data -> ML model -> desired output (label) -> parameter training: learn by gradually reducing the error)
  -> testing (unseen input -> ML model -> prediction output)
Problem formulation
- Input representation: choose the input features (requires domain knowledge)
- Hypothesis: a guess at the target function

[Model generalization]

Goal of machine learning
A model needs to perform well on unseen data

Errors
- Goal: test error close to train error, and train error close to 0
- Step 1) Make the test error close to the train error (variance ↓)
- Step 2) Make the train error close to 0 (bias ↓)
- Bias and variance are in a trade-off relationship

Avoid overfitting
- A complex model tends to be used to handle high-dimensional data -> overfitting problem
  => the amount of data needed grows exponentially
  => Data augmentation
  - Methods) regularization, ensemble

Cross-validation (CV)
- Validation data set: used to provide an unbiased evaluation of the model's fit
- If k = 5 -> train on k-1 folds, validate on 1 fold -> repeat 5 times -> an effect similar to data augmentation -> generalization ↑
- Cross-validation helps pick a model that avoids overfitting, at the cost of more computation
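
The k-fold procedure above can be sketched as follows. This is a minimal illustration, not the course's code; the helper name `k_fold_indices` and the sample count of 20 are assumptions made for the example.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)      # shuffle once
    folds = np.array_split(order, k)        # k roughly equal folds
    for i in range(k):
        val_idx = folds[i]                  # 1 fold for validation
        train_idx = np.concatenate(         # the other k-1 folds for training
            [folds[j] for j in range(k) if j != i]
        )
        yield train_idx, val_idx

# Hypothetical data set of 20 samples, k = 5
splits = list(k_fold_indices(n_samples=20, k=5))
```

Each sample lands in the validation fold exactly once, so averaging the k validation errors uses all the data for evaluation without ever validating on a sample the model was trained on in that round.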

Part2. Linear Regression

[Linear models]

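The model does not have to be linear in the input variable x (it is linear in the parameters θ)
Advantages
- Simplicity: easy to implement and interpret
- Generalization: higher chance that test error ≈ train error
Framework
- Hypothesis class
  h(x) = θ_0 + θ_1 x (univariate linear model, a single variable)
- Loss function
  Minimizing MSE
- Optimization algorithm
  Parameter optimization -> the process of finding θ
  -> Gradient descent algorithm: move in the direction in which the function decreases most steeply (the negative gradient)
  Keep changing θ to find the point where the gradient is 0; can get stuck in a local optimum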
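
Written out for the univariate model, the three framework pieces look roughly like this. The notation (m samples, superscript (i) for the i-th sample) is an assumption for illustration; some texts also put an extra factor 1/2 in front of the loss, which only rescales the gradient.

```latex
\begin{aligned}
h_\theta(x) &= \theta_0 + \theta_1 x \\
J(\theta_0, \theta_1) &= \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2 \\
\frac{\partial J}{\partial \theta_0} &= \frac{2}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr) \\
\frac{\partial J}{\partial \theta_1} &= \frac{2}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x^{(i)}
\end{aligned}
```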

[Gradient descent algorithm]

J: objective function that we want to optimize
α: the step size that controls the rate of moving down the error surface (a hyperparameter, positive)
θ: learnable parameter
Update the parameters until the minimum cost is reached
Compared with the normal equation, it finds the solution iteratively and works well even when n is large
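
A minimal sketch of the update θ ← θ − α∇J for an MSE objective, compared against the normal equation on toy data. The data, the step size 0.1, and the iteration count are assumptions chosen so the example converges; they are not values from the lecture.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=2000):
    """Minimize the MSE of h(x) = X @ theta with batch gradient descent."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        residual = X @ theta - y                # h(x) - y for every sample
        grad = (2.0 / m) * (X.T @ residual)     # gradient of the MSE
        theta = theta - alpha * grad            # step against the gradient
    return theta

# Toy data (assumed): y = 3 + 2x plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 + 2.0 * x + 0.01 * rng.normal(size=100)
X = np.column_stack([np.ones_like(x), x])       # column of ones for theta_0

theta_gd = gradient_descent(X, y)
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)    # normal equation, for comparison
```

On this small convex problem both routes reach the same θ; the iterative version avoids solving an n×n linear system, which is what makes it attractive when n is large.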

Part3. Gradient Descent Algorithm

Batch gradient descent
- Original form: every update has to consider all m samples -> stochastic gradient descent (SGD): m = 1
  -> iterations run faster than with batch gradient descent, but the updates are more affected by noise
- Limitation: local optimum
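
The m = 1 case can be sketched like this: one sample per update instead of the full batch. The function name, step size, and noise-free toy target are assumptions for illustration.

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.05, n_epochs=50, seed=0):
    """Fit h(x) = X @ theta using one sample (m = 1) per update."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    for _ in range(n_epochs):
        for i in rng.permutation(n_samples):        # visit samples in random order
            residual = X[i] @ theta - y[i]          # error on this single sample
            theta -= alpha * 2.0 * residual * X[i]  # gradient of its squared error
    return theta

# Noise-free toy target (assumed) so the result is easy to check
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=100)
X = np.column_stack([np.ones_like(x), x])
y = 3.0 + 2.0 * x

theta_sgd = sgd_linear_regression(X, y)
```

With noisy labels the per-sample updates would keep jittering around the optimum, which is the noise sensitivity mentioned above.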

[Some ideas to avoid local minimum]

Momentum
: Reflects, to some degree, the direction and speed of past gradient updates, providing the drive to keep learning even when the gradient becomes 0 at the current point
-> can escape a local minimum
- Weights past gradients (the older, the smaller the weight), i.e. an exponentially weighted moving average -> stable convergence
- Nesterov Momentum
  - lookahead gradient step
  - Standard momentum: gradient step + momentum step = actual step (vector sum)
  - Nesterov: first move by the momentum step, then add the gradient step evaluated there to determine the actual step
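
The difference between the two update rules is only where the gradient is evaluated. A small sketch on the toy objective J(θ) = θ², where the objective, α = 0.1, and β = 0.9 are assumptions for the example:

```python
import numpy as np

def momentum_step(theta, velocity, grad_fn, alpha=0.1, beta=0.9):
    """Standard momentum: gradient at the current point plus decayed velocity."""
    velocity = beta * velocity - alpha * grad_fn(theta)
    return theta + velocity, velocity

def nesterov_step(theta, velocity, grad_fn, alpha=0.1, beta=0.9):
    """Nesterov: take the momentum step first, evaluate the gradient there."""
    lookahead = theta + beta * velocity
    velocity = beta * velocity - alpha * grad_fn(lookahead)
    return theta + velocity, velocity

def grad_fn(theta):
    return 2.0 * theta          # gradient of the toy objective J(theta) = theta^2

theta_m, v_m = np.array([5.0]), np.zeros(1)
theta_n, v_n = np.array([5.0]), np.zeros(1)
for _ in range(200):
    theta_m, v_m = momentum_step(theta_m, v_m, grad_fn)
    theta_n, v_n = nesterov_step(theta_n, v_n, grad_fn)
```

The velocity is an exponentially weighted sum of past gradients, so the parameter keeps moving even at a point where the current gradient is 0.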

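AdaGrad: adaptively adjusts the learning rate along each direction -> learning efficiency ↑
- Drawback: as squared gradients accumulate, the learning rate keeps shrinking -> learning stalls
- Improvement -> RMSProp (dampens the influence of accumulated gradients)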
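
A sketch of why AdaGrad stalls and how RMSProp's decaying average avoids it. The constant gradient of 1.0, the step size, and the decay rate ρ = 0.9 are assumptions chosen to make the effect visible.

```python
import numpy as np

def adagrad_step(theta, cache, grad, alpha=0.1, eps=1e-8):
    cache = cache + grad ** 2                            # sum of squares only grows
    theta = theta - alpha * grad / (np.sqrt(cache) + eps)
    return theta, cache

def rmsprop_step(theta, cache, grad, alpha=0.1, rho=0.9, eps=1e-8):
    cache = rho * cache + (1.0 - rho) * grad ** 2        # decaying average instead
    theta = theta - alpha * grad / (np.sqrt(cache) + eps)
    return theta, cache

grad = np.array([1.0])                                   # constant gradient (assumed)
theta_a, cache_a = np.zeros(1), np.zeros(1)
theta_r, cache_r = np.zeros(1), np.zeros(1)
for _ in range(100):
    theta_a, cache_a = adagrad_step(theta_a, cache_a, grad)
    theta_r, cache_r = rmsprop_step(theta_r, cache_r, grad)

# Effective step size after 100 updates
adagrad_scale = 0.1 / (np.sqrt(cache_a) + 1e-8)
rmsprop_scale = 0.1 / (np.sqrt(cache_r) + 1e-8)
```

After 100 identical gradients AdaGrad's effective step has shrunk by a factor of 10, while RMSProp's stays close to the base step size.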

Adam: RMSProp + Momentum (the most widely used)
1. Compute the first moment (from momentum)
2. Compute the second moment (from RMSProp)
3. Correct the bias
4. Update the parameters
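
The four steps map onto code almost line by line. This is a sketch under common default values (β1 = 0.9, β2 = 0.999, ε = 1e-8); the step size 0.1 and the toy objective J(θ) = θ² are assumptions for the example.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad                    # 1. first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * grad ** 2               # 2. second moment (RMSProp)
    m_hat = m / (1.0 - beta1 ** t)                          # 3. bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # 4. parameter update
    return theta, m, v

theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
first_theta = None
for t in range(1, 501):                                     # t starts at 1
    grad = 2.0 * theta                                      # gradient of theta^2
    theta, m, v = adam_step(theta, m, v, grad, t)
    if t == 1:
        first_theta = theta.copy()
```

Because of the bias correction, the very first update has magnitude close to α regardless of how large the gradient is.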

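Learning rate scheduling
- Learning rate: if small, the loss decreases slowly and more iterations are needed; if large, the loss drops quickly at first but is hard to reduce further
- Adjust the learning rate adaptively at each stage of convergence -> learn fast early on, then lower the learning rate as training progresses => easier training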
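
One simple schedule that matches this description is step decay. The notes do not name a specific schedule, so the halving factor and the 10-epoch interval below are assumptions.

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

# Learning rate at a few epochs: large early on, smaller later
schedule = [step_decay(0.1, epoch) for epoch in (0, 9, 10, 25)]
```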

[Some optimization to avoid overfitting]

More features -> more parameters -> overfitting; MSE is sensitive to outliers
Method 1) Reduce the number of features
Method 2) Regularization
- Drive the θ of relatively less important features toward 0
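
As one concrete form of regularization, here is a sketch using an L2 (ridge) penalty in closed form. The choice of L2 is an assumption, since the notes do not name the penalty; L2 shrinks coefficients toward 0, whereas an L1 (lasso) penalty is the variant that can set them exactly to 0. The toy data is also assumed.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: (X^T X + lam * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Toy data (assumed): only feature 0 actually influences y
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = 4.0 * X[:, 0] + 0.1 * rng.normal(size=50)

theta_plain = ridge_fit(X, y, lam=0.0)     # no regularization
theta_ridge = ridge_fit(X, y, lam=10.0)    # penalized: smaller coefficients
```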

Part4. Linear Classification

Linear model with a set of features

[Framework]

Hypothesis class
h(x) = sign(W^T x)
Loss function (different from the loss functions used for regression)
- Zero-one loss: checks the prediction and outputs 0 if correct, 1 if wrong
- Hinge loss
- Cross-entropy loss
Optimization algorithm
- Gradient descent algorithm

[Score and margin]

Input data: x
Predicted label: h(x) = sign(W^T φ(x))
Target label: y (1 or -1)
Score: W^T φ(x) -> how confident we are in predicting (proportional to the distance from the decision boundary to the point)
Margin: (W^T φ(x)) y -> how correct we are (score × y; large when the prediction is correct)
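
A tiny numeric example of these definitions. The feature map `phi` (a bias term plus the raw inputs) and the weight values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def phi(x):
    """Hypothetical feature map: a constant bias term plus the raw inputs."""
    return np.array([1.0, x[0], x[1]])

W = np.array([-1.0, 2.0, 0.5])      # assumed weights
x = np.array([1.0, 2.0])
y = 1                               # target label in {+1, -1}

score = W @ phi(x)                  # -1*1 + 2*1 + 0.5*2 = 2.0
prediction = np.sign(score)         # h(x) = sign(score)
margin = score * y                  # positive means the prediction is correct
```

Flipping the target to y = -1 would give a margin of -2, signalling a confident but wrong prediction.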

[Loss function]

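Zero-one loss: its derivative is 0 almost everywhere -> use hinge loss instead
Hinge loss: takes the larger of 1 - margin and 0 -> 0 when the prediction is good
=> if the margin is smaller than 1 the gradient of the loss is -φ(x) y, otherwise 0
Cross-entropy loss: the most widely used
The more similar p and q are, the lower the loss
- The sigmoid function maps the real-valued score to a probability (a value between 0 and 1)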
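
The three losses can be sketched as small functions of the margin or of the predicted probability. The function names are assumptions, and for cross-entropy the ±1 label is assumed to be recoded as a target probability p in {0, 1}.

```python
import numpy as np

def zero_one_loss(margin):
    """1 if the prediction is wrong (margin <= 0), else 0."""
    return float(margin <= 0)

def hinge_loss(margin):
    """max(1 - margin, 0): zero once the margin reaches 1."""
    return max(1.0 - margin, 0.0)

def hinge_grad(phi_x, y, margin):
    """Gradient of the hinge loss w.r.t. W: -phi(x) * y if margin < 1, else 0."""
    return -phi_x * y if margin < 1 else np.zeros_like(phi_x)

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between target probability p and predicted probability q."""
    q = np.clip(q, eps, 1.0 - eps)      # avoid log(0)
    return -(p * np.log(q) + (1.0 - p) * np.log(1.0 - q))

phi_x = np.array([1.0, 1.0, 2.0])       # hypothetical feature vector
grad_small_margin = hinge_grad(phi_x, y=1, margin=0.5)
grad_large_margin = hinge_grad(phi_x, y=1, margin=2.0)
```

Unlike the zero-one loss, the hinge and cross-entropy losses give a nonzero gradient for poorly classified samples, which is what gradient descent needs.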

[Multiclass classification]

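One-vs-All: expresses a multiclass problem as a combination of binary classification problems
- The probability from each binary classifier is compared against a one-hot encoding
- One-hot encoding: a vector that records the label by putting 1 at the position of its class (0 elsewhere)
- Compare the one-hot encoded label with the predicted probabilities to compute the loss => training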
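
A sketch of the one-vs-all loss computation: one sigmoid output per class, compared against the one-hot target. The score values and the three-class setup are made up for illustration.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Row i has a 1 at column labels[i] and 0 elsewhere."""
    encoded = np.zeros((len(labels), n_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical scores from 3 binary classifiers (columns) for 2 samples (rows)
scores = np.array([[ 2.0, -1.0, 0.0],
                   [-3.0,  0.5, 1.5]])
probs = sigmoid(scores)                       # one probability per class
targets = one_hot(np.array([0, 2]), n_classes=3)

# Binary cross-entropy of each class output against its one-hot entry
loss = -np.mean(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))
predicted = probs.argmax(axis=1)              # class with the highest probability
```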

[Advantage of linear classification]

Simple (easy to implement and test)
Interpretability (you can estimate how the overall score changes for each one-unit increase in a feature -> makes the model interpretable)

Part5. Advanced Classification Model

[Support Vector Machine]

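: A method that picks the hyperplane with the largest margin (the distance between the hyperplane and the nearest sample) on both sides

Support vectors: the samples nearest to the hyperplane
Robust to outliers
Hard margin SVM: no sample is tolerated inside the margin
Soft margin SVM: tolerates some amount of error
Nonlinear transform & kernel trick: use a function (kernel) that maps the data from a low dimension (e.g. 2-D) into a higher dimension
Kernel function: when the data samples are not linearly separable, raises their dimension so that they become linearly separable
- Types) Polynomial, Gaussian radial basis function, Hyperbolic tangent
=> an optimization problem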
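
The kernel trick can be checked numerically: a degree-2 polynomial kernel on 2-D inputs gives the same value as an inner product in an explicit 3-D feature space, without ever building that space. The parameter names (`gamma`, `kappa`, `c`) and the homogeneous form of the polynomial kernel are assumptions for this sketch.

```python
import numpy as np

def poly_kernel(x, z, degree=2):
    """Homogeneous polynomial kernel (x . z)^degree."""
    return (x @ z) ** degree

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian radial basis function kernel."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def tanh_kernel(x, z, kappa=1.0, c=0.0):
    """Hyperbolic tangent kernel."""
    return np.tanh(kappa * (x @ z) + c)

def phi_degree2(x):
    """Explicit feature map matching poly_kernel with degree=2 on 2-D inputs."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

implicit = poly_kernel(x, z)                   # (1*3 + 2*(-1))^2 = 1
explicit = phi_degree2(x) @ phi_degree2(z)     # same value via the 3-D mapping
```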

[Artificial neural network]

: non-linear classification model, the basis of deep neural networks

Activation functions: map the score, a linear value, to a nonlinear value
- Sigmoid: has limits when stacking many layers
  when z is very large or very small the gradient becomes tiny -> learning slows -> ReLU
- ReLU: the most widely used; its derivative is 1 for positive inputs
Stacking multiple layers improves classification performance (can represent more complex boundaries) -> deep neural network
- Works well on high-dimensional data (image, video) -> heavily used in recent computer vision and image recognition research
Gradient Vanishing Problem: the deeper the layers, the smaller the gradient
- Breakthroughs
  - Pre-training + fine-tuning
  - Convolutional neural networks
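
The saturation of sigmoid and the resulting vanishing gradient can be seen with a few numbers. The input values and the depth of 10 layers are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # never larger than 0.25

def relu_grad(z):
    return (z > 0).astype(float)    # 1 for positive inputs, 0 otherwise

z = np.array([-10.0, 0.0, 10.0])
sig_g = sigmoid_grad(z)             # tiny at both extremes, 0.25 at z = 0
relu_g = relu_grad(z)

# Backpropagation multiplies one such factor per layer, so through 10
# sigmoid layers the activation-derivative product is at most 0.25^10.
depth = 10
sigmoid_bound = 0.25 ** depth
```

With ReLU the factor is exactly 1 on the active path, so the product does not shrink with depth for that reason alone.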

Part6. Ensemble Learning

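: A simple extension of algorithms you already use or have developed; a way to raise performance on supervised learning tasks

-> An approach that groups together and jointly uses a variety of machine learning models, whether they work by different mechanisms or the same one, regardless of the kind of algorithm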

[Ensemble Methods]

Aggregating a set of predictions
Make a decision by voting (majority rule)
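
Majority voting over ±1 labels reduces to the sign of the column sums. The predictions below are made up, and an odd number of models is assumed so that ties cannot occur.

```python
import numpy as np

def majority_vote(predictions):
    """predictions: (n_models, n_samples) array with labels in {-1, +1}."""
    return np.sign(predictions.sum(axis=0))

# Hypothetical predictions from 3 models (rows) on 3 samples (columns)
predictions = np.array([[ 1, -1,  1],
                        [ 1,  1, -1],
                        [-1, -1,  1]])
ensemble = majority_vote(predictions)
```

Each model is wrong on a different sample here, yet the vote can still be right on all of them, which is the intuition behind the stability gain.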

Advantages
- Improves prediction performance reliably
- Stable against noise
- Easy to implement
- Not too much parameter tuning
Disadvantage: not a compact representation
Basic idea: bagging and boosting

[Bagging]

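Training samples are randomly chosen
Even with the same model, different data -> each model learns different characteristics
Parallel: each data subset does not affect the other models
Reduces variance (robust to overfitting)
- Random sampling -> an effect similar to data augmentation
- Used with a set of simple models -> stable performance
Bootstrapping + aggregating
- Bootstrapping: generate many samples by resampling and train on them
  -> robust to noise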
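
Bootstrapping plus aggregating, sketched with a least-squares line as the base model. The base model, the toy data, and the choice of 25 resamples are assumptions for the example; for a linear model, averaging the fitted parameters is the same as averaging the predictions.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw len(X) samples with replacement."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def fit_line(X, y):
    """Base model: least-squares fit of y = theta_0 + theta_1 * x."""
    A = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta

# Noisy toy data (assumed): y = 1 + 2x + noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = 1.0 + 2.0 * X + 0.3 * rng.normal(size=200)

# Each model is trained independently on its own resample (parallelizable)
models = [fit_line(*bootstrap_sample(X, y, rng)) for _ in range(25)]
bagged_theta = np.mean(models, axis=0)         # aggregate by averaging
```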

[Boosting]

Cascading of weak classifiers
-> classifiers with high bias
Sequential -> performance ↑
AdaBoost: the representative boosting algorithm
Samples misclassified by the base classifier get higher weights and are used in the next round of training
Advantages
- Simple and easy to implement
- Flexible: can combine with any learning algorithm
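
The reweighting step of AdaBoost can be sketched as below, using the standard exponential update for labels in {-1, +1}. The function name and the five-sample example are assumptions, and the sketch assumes the weighted error lies strictly between 0 and 1.

```python
import numpy as np

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: raise the weights of misclassified samples."""
    miss = y_true != y_pred
    err = np.sum(weights[miss]) / np.sum(weights)    # weighted error, 0 < err < 1
    alpha = 0.5 * np.log((1.0 - err) / err)          # this classifier's vote weight
    new_weights = weights * np.exp(np.where(miss, alpha, -alpha))
    return new_weights / new_weights.sum(), alpha    # renormalize to sum to 1

# Hypothetical round: the base classifier gets only the last sample wrong
y_true = np.array([1, 1, -1, -1,  1])
y_pred = np.array([1, 1, -1, -1, -1])
weights = np.full(5, 0.2)

new_weights, alpha = adaboost_reweight(weights, y_true, y_pred)
```

After the update the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.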

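Decision tree ensembles: representative algorithms that apply bagging or boosting to a set of decision trees
- By bagging -> random forest: decision trees each trained differently
- By boosting -> gradient boosting machine
  A decision tree makes a decision at every node, so a single tree already resembles a cascade of simple decisions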

[Performance evaluation]

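Measure the model's accuracy
Confusion matrix: a table showing, for each combination of actual and predicted class, how many samples fell there (i.e. where the errors are)

Precision (P) = TP / (TP + FP)
Recall (R) = TP / (TP + FN)
For an unbalanced data set, P and R have to be looked at together to judge performance
ROC curve: performance comparisons between different classifiers

Error measure
- Depending on the nature of the data, figure out whether recall or precision matters more
- e.g.) detecting cancer patients -> a cancer patient must not be missed -> recall matters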
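
A small worked example of why accuracy alone can mislead on unbalanced data. The labels below are made up (3 positives out of 10), and the helper name `confusion_counts` is an assumption.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels in {0, 1}."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fp, fn, tn

# Hypothetical unbalanced data: 3 positives (e.g. patients), 7 negatives
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1])

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```

Accuracy comes out at 0.7, yet recall is only 1/3: two of the three positives were missed, which is the failure that matters in the cancer example.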

Lecture notes summary: LG Aimers AI Essential Course - Module4 by Professor 강제원 (Ewha Womans University)

License: Attribution-NonCommercial
