Supervised Learning
# What is supervised learning?
- Training an algorithm to output $y$ for a given $x$ using sufficient training samples $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\ldots,(x^{(m)},y^{(m)})\}$, where $x^{(i)}$ is an input and $y^{(i)}$ is its correct output
- Regression: predicting a number (infinitely many possible outputs)
- Classification: predicting categories (finite outputs)
# Linear Regression
- Given a training set, we can use a learning algorithm to learn a function $f$ that predicts an output $\hat{y}$ given an input $x$
- For linear regression, $f$ is a straight line. With parameters $w,b$ we can then represent $f$ as: $$f_{w,b}(x)=wx+b$$
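A minimal sketch of this model in Python (the function name and example values are illustrative, not from the course):

```python
# Linear regression model f_{w,b}(x) = w*x + b for a single feature
def f_wb(x, w, b):
    """Return the prediction y-hat for input x given parameters w and b."""
    return w * x + b

# Example: with w = 2 and b = 1, the input x = 3 gives y-hat = 2*3 + 1 = 7
print(f_wb(3, w=2, b=1))  # 7
```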
# Cost Function
- Our objective is to find $w,b$ such that $\hat{y}^{(i)}$ is close to $y^{(i)}$ for all $(x^{(i)},y^{(i)})$.
- Squared error cost function: $$J(w,b)=\frac{1}{2m}\sum^{m}_{i=1}(\hat{y}^{(i)}-y^{(i)})^2$$
- where $m=$ number of training examples. The $\frac{1}{m}$ averages the error so it doesn't blow up as the training set grows, and the factor of $2$ is for computational convenience later (it cancels when differentiating).
- Now we can also rewrite it as: $$J(w,b)=\frac{1}{2m}\sum^{m}_{i=1}(f_{w,b}(x^{(i)})-y^{(i)})^2$$
- This can be solved analytically for simple cost functions, but for a complicated $J$, we can use gradient descent to minimize $J$ instead:
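As a sketch of how $J$ could be computed (NumPy-based; the helper name and data are illustrative assumptions):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w,b) = (1/2m) * sum_i (f_{w,b}(x^(i)) - y^(i))^2."""
    m = x.shape[0]
    y_hat = w * x + b                      # predictions for all training examples
    return np.sum((y_hat - y) ** 2) / (2 * m)

# Illustrative data lying exactly on y = 2x + 1, so the cost at w=2, b=1 is 0
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([3.0, 5.0, 7.0])
print(compute_cost(x_train, y_train, w=2.0, b=1.0))  # 0.0
```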
# Gradient Descent
- initialize $w,b$ and calculate $J$
- adjust $w,b$ to decrease $J$
- repeat until hopefully $J$ settles near a minimum
Step 1: ($=$ here denotes assignment, not equality) $$w=w -\alpha \frac{d}{dw}J(w,b)$$ where $\alpha$ is the learning rate, a hyperparameter that controls how fast we change $w$
Step 2: do the same for $b$ $$b=b -\alpha \frac{d}{db}J(w,b)$$
Note: $w$ and $b$ must be updated at the same time.
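A sketch of a single update step that makes the simultaneous update explicit (the derivative helpers passed in are assumed, not defined here):

```python
# One gradient descent step with a *simultaneous* update of w and b.
# compute_dJdw and compute_dJdb are assumed helpers returning dJ/dw and dJ/db.
def gradient_step(w, b, alpha, compute_dJdw, compute_dJdb):
    tmp_w = w - alpha * compute_dJdw(w, b)   # both derivatives use the old w, b
    tmp_b = b - alpha * compute_dJdb(w, b)
    return tmp_w, tmp_b                      # assign only after both are computed
```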
# Learning rate
- If $\alpha$ is too small, then it will take many steps to reach the minimum
- If $\alpha$ is too large, then it might never reach the minimum
# Gradient Descent for linear regression
Calculating derivatives for $w$, $$\frac{d}{dw}J(w,b)= \frac{d}{dw}\frac{1}{2m}\sum^{m}_{i=1}(\hat{y}^{(i)}-y^{(i)})^2$$$$=\frac{d}{dw}\frac{1}{2m}\sum^{m}_{i=1}(wx^{(i)}+b-y^{(i)})^2$$ which is equal to $$\frac{1}{m}\sum^{m}_{i=1}(f_{w,b}(x^{(i)})-y^{(i)})x^{(i)}$$
and the derivative for $b$, $$\frac{d}{db}J(w,b)= \frac{d}{db}\frac{1}{2m}\sum^{m}_{i=1}(\hat{y}^{(i)}-y^{(i)})^2$$$$=\frac{d}{db}\frac{1}{2m}\sum^{m}_{i=1}(wx^{(i)}+b-y^{(i)})^2$$ which is equal to $$\frac{1}{m}\sum^{m}_{i=1}(f_{w,b}(x^{(i)})-y^{(i)})$$
Pseudocode for gradient descent:
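A minimal NumPy sketch of the full loop (the initialization, learning rate, and data below are illustrative choices, not prescribed values):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, num_iters=1000):
    """Fit w, b by gradient descent on the squared error cost."""
    m = x.shape[0]
    w, b = 0.0, 0.0                        # initialize parameters
    for _ in range(num_iters):
        err = (w * x + b) - y              # f_{w,b}(x^(i)) - y^(i) for all i
        dJdw = np.sum(err * x) / m         # d/dw J(w,b)
        dJdb = np.sum(err) / m             # d/db J(w,b)
        w = w - alpha * dJdw               # simultaneous update
        b = b - alpha * dJdb
    return w, b

# Illustrative usage on data generated from y = 2x + 1
x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([3.0, 5.0, 7.0, 9.0])
print(gradient_descent(x_train, y_train))  # approximately (2.0, 1.0)
```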
where dJdw $= \frac{d}{dw}J(w,b)$ and dJdb $= \frac{d}{db}J(w,b)$.
# Multiple features
What if you have multiple features (variables)?
- $x_j = j^{th}$ feature
- $n$ = number of features
- $\vec{x}^{(i)}$= features of $i^{th}$ training example
- $x_{j}^{(i)}$ = value of feature $j$ in the $i^{th}$ training example
We can then express the linear regression model as:
$$f_{w,b}(x)=w_1x_1+w_2x_2+\cdots+w_nx_n+b$$ Define $\vec{w} = [w_1,\ldots,w_n]$ and $\vec{x}=[x_1,\ldots,x_n]^T$, where $T$ represents the transpose. Then $$f_{\vec{w},b}(\vec{x})=\vec{w}\cdot \vec{x}+b$$ where $(\cdot)$ represents the dot product.
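A sketch of the multi-feature prediction using NumPy's dot product (the values are illustrative):

```python
import numpy as np

def predict(x, w, b):
    """Multi-feature prediction f_{w,b}(x) = w . x + b."""
    return np.dot(w, x) + b

# Illustrative example with n = 3 features
w = np.array([0.5, -1.0, 2.0])
x = np.array([4.0, 3.0, 1.0])
b = 10.0
print(predict(x, w, b))  # 0.5*4 + (-1.0)*3 + 2.0*1 + 10 = 11.0
```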
# Feature Scaling
When the ranges of values your features can take differ greatly, e.g.
- $x_1$ = square footage of house $\in [500,5000]$
- $x_2$ = number of bedrooms $\in [1,5]$
this may cause gradient descent to run slowly.
Some examples of feature scaling:
# max scaling
- divide each data point for a feature by the max value for that feature.
# mean normalization
- e.g. if $300 \leq x_1 \leq 2000$, we can scale it like so: $$x_{1,\text{new}} = \frac{x_1-\mu_1}{2000-300}$$
- where $\mu_1$ = mean of $x_1$
# Z-score normalization
- find the standard deviation $\sigma_1$ and mean $\mu_1$ of the feature, then $$x_{1,\text{new}}=\frac{x_1-\mu_1}{\sigma_1}$$
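A combined sketch of the three scalings above (function names and the example feature are illustrative):

```python
import numpy as np

def max_scale(x):
    """Divide each value of a feature by that feature's maximum."""
    return x / np.max(x)

def mean_normalize(x):
    """(x - mean) / (max - min)."""
    return (x - np.mean(x)) / (np.max(x) - np.min(x))

def z_score_normalize(x):
    """(x - mean) / standard deviation."""
    return (x - np.mean(x)) / np.std(x)

# Illustrative feature column, e.g. square footage of a house
x1 = np.array([500.0, 1200.0, 2000.0, 5000.0])
print(max_scale(x1))          # values in (0, 1]
print(mean_normalize(x1))     # centered near 0, range at most 1
print(z_score_normalize(x1))  # mean 0, standard deviation 1
```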