Supervised Learning
# What is supervised learning?
- Training an algorithm to output $y$ for a given $x$, using sufficient training samples of inputs $x$ paired with their correct outputs $y$
- Regression: predicting a number (infinitely many possible outputs)
- Classification: predicting categories (finite outputs)
# Linear Regression
- Given a training set, we can use a learning algorithm to learn a function $f$ that predicts an output $\hat{y}$ given an input $x$
- For linear regression, $f$ is a straight line. With parameters $w$ and $b$ we can then represent it as: $$f_{w,b}(x)=wx+b$$
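A minimal Python sketch of this model; the function name and parameter values are made up for illustration:

```python
# f_{w,b}(x) = w*x + b: predict y-hat for one input x (illustrative values).
def predict(x, w, b):
    return w * x + b

print(predict(1.5, w=200.0, b=100.0))  # 200.0*1.5 + 100.0 = 400.0
```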
# Cost Function
- Our objective is to find $w, b$ such that $\hat{y}^{(i)}=f_{w,b}(x^{(i)})$ is close to $y^{(i)}$ for all $(x^{(i)}, y^{(i)})$
- Squared error cost function: $$J(w,b)=\frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})^2$$
- where $m$ = number of training examples. The $\frac{1}{m}$ averages the error so it doesn't blow up as $m$ grows; the extra factor of $\frac{1}{2}$ is for computational convenience later (a code sketch of this cost follows this list)
- Now we can also rewrite it as: $$J(w,b)=\frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)})-y^{(i)}\right)^2$$
- This can be solved analytically for simple cost functions, but for a complicated $J$ we can instead minimize it numerically with gradient descent, described in the next section.
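A small Python sketch of the squared-error cost defined above; the function and variable names are my own, not from the course:

```python
def compute_cost(x, y, w, b):
    """J(w,b) = (1/2m) * sum over i of (f_{w,b}(x_i) - y_i)^2."""
    m = len(x)
    total = 0.0
    for i in range(m):
        f_wb = w * x[i] + b          # model prediction for example i
        total += (f_wb - y[i]) ** 2  # squared error for example i
    return total / (2 * m)

# A perfect fit gives zero cost: here y = 200x + 100 exactly.
print(compute_cost([1.0, 2.0], [300.0, 500.0], w=200.0, b=100.0))  # 0.0
```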
# Gradient Descent
initialize $w$, $b$, calculate $J(w,b)$
adjust $w$, $b$ to decrease $J(w,b)$
repeat until $J(w,b)$ hopefully settles near a minimum
Step 1: $w := w - \alpha\frac{d}{dw}J(w,b)$ (the $:=$ here is assignment, not equality), where $\alpha$ is the learning rate, a hyperparameter that controls how fast we change $w$
Step 2: do the same for $b$: $b := b - \alpha\frac{d}{db}J(w,b)$
Note: $w$ and $b$ must be updated at the same time (see the sketch below).
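A sketch of one update step illustrating the simultaneous update; `dJ_dw` and `dJ_db` stand for the two partial derivatives of $J$ at the current $(w, b)$, which are computed for linear regression later in these notes:

```python
# One gradient-descent step with a simultaneous update of w and b.
# Both updates use gradients evaluated at the old (w, b), never the half-updated values.
def gradient_step(w, b, dJ_dw, dJ_db, alpha):
    tmp_w = w - alpha * dJ_dw  # compute both new values first...
    tmp_b = b - alpha * dJ_db
    return tmp_w, tmp_b        # ...then assign them together
```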
# Learning rate
- If $\alpha$ is too small, then it will take many steps to reach the minimum
- If $\alpha$ is too large, then it might overshoot and never reach the minimum (see the toy example below)
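A toy illustration of the effect of $\alpha$ (not from the notes): a few gradient steps on $J(w)=w^2$, whose derivative is $2w$ and whose minimum is at $w=0$:

```python
# Gradient descent on J(w) = w^2 with derivative dJ/dw = 2w.
def run(alpha, steps=5, w=1.0):
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

print(run(alpha=0.01))  # ~0.904: small alpha creeps slowly toward the minimum at 0
print(run(alpha=1.5))   # -32.0: large alpha overshoots further each step and diverges
```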
# Gradient Descent for linear regression
Calculating the derivative for $w$: $$\frac{d}{dw}J(w,b)= \frac{d}{dw}\frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})^2=\frac{d}{dw}\frac{1}{2m}\sum_{i=1}^{m}(wx^{(i)}+b-y^{(i)})^2=\frac{1}{m}\sum_{i=1}^{m}(f_{w,b}(x^{(i)})-y^{(i)})x^{(i)}$$
and the derivative for $b$: $$\frac{d}{db}J(w,b)= \frac{d}{db}\frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})^2 =\frac{d}{db}\frac{1}{2m}\sum_{i=1}^{m}(wx^{(i)}+b-y^{(i)})^2=\frac{1}{m}\sum_{i=1}^{m}(f_{w,b}(x^{(i)})-y^{(i)})$$
Pseudocode for gradient descent:
repeat until convergence: $w := w - \alpha\,\mathrm{dJdw}$ and $b := b - \alpha\,\mathrm{dJdb}$ (updated simultaneously)
where $\mathrm{dJdw} = \frac{1}{m}\sum_{i=1}^{m}(f_{w,b}(x^{(i)})-y^{(i)})x^{(i)}$ and $\mathrm{dJdb} = \frac{1}{m}\sum_{i=1}^{m}(f_{w,b}(x^{(i)})-y^{(i)})$
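Putting the pieces together, a minimal Python implementation of gradient descent for single-feature linear regression; the names, data, and hyperparameters are illustrative:

```python
def compute_gradients(x, y, w, b):
    """dJdw and dJdb for the squared-error cost on a single-feature dataset."""
    m = len(x)
    dJdw, dJdb = 0.0, 0.0
    for i in range(m):
        err = (w * x[i] + b) - y[i]   # f_{w,b}(x_i) - y_i
        dJdw += err * x[i]
        dJdb += err
    return dJdw / m, dJdb / m

def gradient_descent(x, y, w=0.0, b=0.0, alpha=0.01, iters=10_000):
    """Repeatedly apply the simultaneous update of w and b."""
    for _ in range(iters):
        dJdw, dJdb = compute_gradients(x, y, w, b)
        w, b = w - alpha * dJdw, b - alpha * dJdb
    return w, b

# Data drawn from y = 2x + 1; descent should recover roughly w = 2, b = 1.
print(gradient_descent([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0]))
```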
# Multiple features
What if you have multiple features (variables)?
- $x_j$ = the $j^{\text{th}}$ feature
- $n$ = number of features
- $\vec{x}^{(i)}$ = features of the $i^{\text{th}}$ training example
- $x_j^{(i)}$ = value of feature $j$ in the $i^{\text{th}}$ training example
We can then express the linear regression model as: $$f_{\vec{w},b}(\vec{x})=w_1x_1+w_2x_2+\cdots+w_nx_n+b$$
Define $\vec{w}=[w_1\ w_2\ \cdots\ w_n]^T$ and $\vec{x}=[x_1\ x_2\ \cdots\ x_n]^T$, where $^T$ represents transpose. Then $$f_{\vec{w},b}(\vec{x})=\vec{w}\cdot\vec{x}+b$$ where $\cdot$ represents the dot product.
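A minimal NumPy sketch of the vectorized form; the feature values and weights below are invented for illustration:

```python
import numpy as np

w = np.array([0.5, 10.0, -2.0])    # one weight per feature (illustrative)
b = 4.0
x = np.array([1200.0, 3.0, 15.0])  # n = 3 feature values for one example (illustrative)

f_wb = np.dot(w, x) + b            # w . x + b
print(f_wb)                        # 0.5*1200 + 10*3 - 2*15 + 4 = 604.0
```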
# Feature Scaling
When the ranges of values your features can take on differ greatly, e.g.
- $x_1$ = square footage of house
- $x_2$ = number of bedrooms
this may cause gradient descent to run slowly.
Some examples of feature scaling:
# Max scaling
- divide each data point for a feature by the max value for that feature.
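A NumPy sketch of max scaling on a tiny made-up design matrix (rows = examples, columns = features):

```python
import numpy as np

X = np.array([[2000.0, 5.0],
              [1000.0, 2.0]])  # columns: square footage, bedrooms (made up)
X_max = X / X.max(axis=0)      # divide each column by its own maximum
print(X_max)                   # [[1.  1. ] [0.5 0.4]]
```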
# Mean normalization
- e.g. if a feature $x_j$ ranges from $x_{\min}$ to $x_{\max}$, we can scale it like so: $$x_j \leftarrow \frac{x_j-\mu_j}{x_{\max}-x_{\min}}$$
- where $\mu_j$ = mean of $x_j$ over the training set
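The same made-up matrix as in the max-scaling sketch, with mean normalization applied per column:

```python
import numpy as np

X = np.array([[2000.0, 5.0],
              [1000.0, 2.0]])  # same made-up data as above
X_mean_norm = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_mean_norm)             # [[ 0.5  0.5] [-0.5 -0.5]]
```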
# Z-score normalization
- find the standard deviation $\sigma_j$ and mean $\mu_j$ of each feature, then $$x_j \leftarrow \frac{x_j-\mu_j}{\sigma_j}$$
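And a z-score normalization sketch on the same made-up data:

```python
import numpy as np

X = np.array([[2000.0, 5.0],
              [1000.0, 2.0]])               # same made-up data as above
X_z = (X - X.mean(axis=0)) / X.std(axis=0)  # subtract mean, divide by std, per column
print(X_z)                                  # [[ 1.  1.] [-1. -1.]]
```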