Intro
To prepare for the upcoming exam, Machine Learning in Spring 2019, I wrote this article to record and share some insights about basic machine learning methods, including KNN, Perceptron, Naive Bayes (MLE, MAP), Logistic Regression, Linear Regression, and SVM.
Basic Idea
Basically, machine learning tries to use known data to predict unknown data. Whether learning is supervised or unsupervised depends on whether labels are paired with the data.
The hypothesis class is the set of possible classification functions you’re considering.
Loss function (Objective function)
It is your model’s target: it decides the final goal of your model and provides the learning gradient. Common loss functions are zero-one loss (a penalty of 1 for every mistake), square loss, absolute loss, etc.
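As a tiny sketch (the function names are mine), these common losses can be written as:

```python
def zero_one_loss(y, y_hat):
    """Penalty of 1 for any mistake, 0 otherwise."""
    return float(y != y_hat)

def square_loss(y, y_hat):
    """Penalizes large errors quadratically."""
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    """More robust to outliers than squared loss."""
    return abs(y - y_hat)
```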
No free lunch rule
You need to make some assumptions (inductive bias) to find the best method. As a result, no single method can handle all problems.
Train/Test split
It is used to measure your model’s score. Similarly, validation data is split off to validate your model during training without letting the model learn from the test data.
KNN
The K-Nearest Neighbor algorithm is a lazy, online-learning classification method based on the assumption that the most common label among an input’s nearest neighbors is the input’s label. As $n \to \infty$, this method becomes very accurate but also slow (high memory requirement, $O(nd)$).
Distance
A general version is the Minkowski distance $dist(x,z) = (\sum_{r=1}^{d}|x_{r}-z_{r}|^p)^{1/p}$: $p = 1$ gives the Manhattan distance, $p = 2$ the Euclidean distance, and $p \to \infty$ the Chebyshev distance.
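A minimal NumPy sketch of this distance and a brute-force k-NN prediction, assuming labeled arrays `X_train`, `y_train` (names are mine, not from the text):

```python
import numpy as np
from collections import Counter

def minkowski(x, z, p=2):
    """p=1: Manhattan, p=2: Euclidean, p=np.inf: Chebyshev."""
    if np.isinf(p):
        return np.max(np.abs(x - z))
    return np.sum(np.abs(x - z) ** p) ** (1.0 / p)

def knn_predict(X_train, y_train, x_query, k=3, p=2):
    """Return the most common label among the k nearest training points."""
    dists = [minkowski(x, x_query, p) for x in X_train]
    nearest = np.argsort(dists)[:k]                 # indices of the k closest samples
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
```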
Choose K
A large K reduces noise but makes class boundaries less distinct; a small K keeps sharper boundaries but is more sensitive to noise. Use cross validation or Bayesian methods to choose this hyperparameter. It’s usually an odd number (to avoid ties).
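For instance, choosing K by cross validation might look like this (scikit-learn and the iris data are my choices, just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = {}
for k in [1, 3, 5, 7, 9, 11]:                 # odd values avoid ties in the vote
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```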
1-NN
As $n \to \infty$, the error of 1-NN is no more than twice the error of the Bayes optimal classifier, under the assumption that $P(y^\star|x_t)=P(y^\star|x_{NN})$ (the nearest neighbor shares the query point’s label distribution).
Curse of Dim
In high-dimensional space, the distances between samples become large (and nearly uniform), and the amount of data needed to cover the space grows exponentially, so nearest neighbors stop being meaningful.
What about hyperplanes? Unlike pairwise distances, the distance from a point to a hyperplane does not blow up with dimensionality, which is one reason hyperplane-based classifiers (perceptron, SVM) remain usable in high dimensions.
Solutions: dimensionality reduction, numerosity reduction, data compression…
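A quick numerical sketch of the effect (my own demonstration, using SciPy): as $d$ grows, the average pairwise distance between random points increases while its relative spread shrinks, so the “nearest” neighbor is barely nearer than anyone else.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(500, d))          # 500 random points in the unit hypercube
    dists = pdist(X)                        # all pairwise Euclidean distances
    # mean distance grows with d, while the relative spread shrinks:
    print(d, round(dists.mean(), 2), round(dists.std() / dists.mean(), 3))
```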
Perceptron
The perceptron is a linear binary classifier that searches for a hyperplane separating the two classes.
Algorithm Process:
The perceptron uses SGD to optimize the loss function $J(\theta) = - \sum\limits_{x_i \in M}y^{(i)}\theta \bullet x^{(i)}$, where $M$ is the set of misclassified points.
So $\frac{\partial}{\partial \theta}J(\theta) = - \sum\limits_{x_i \in M}y^{(i)}x^{(i)}$, and the SGD update for a single misclassified point becomes $\theta = \theta + \alpha y^{(i)}x^{(i)}$.
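A minimal sketch of this update rule, assuming labels in $\{-1,+1\}$ and a bias column already appended to the data (function and variable names are mine):

```python
import numpy as np

def perceptron_train(X, y, lr=1.0, epochs=100):
    """X: (n, d) data with a bias column appended; y: labels in {-1, +1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(theta, x_i) <= 0:   # misclassified point
                theta += lr * y_i * x_i         # theta = theta + alpha * y_i * x_i
                mistakes += 1
        if mistakes == 0:                       # converged: all points separated
            break
    return theta
```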
Convergence
If the data are linearly separable with margin $\gamma$ and all points satisfy $\|x\| \leq R$, the perceptron converges after at most $(R/\gamma)^2$ updates.
Naive Bayes
Naive Bayes is a non-linear classification method. Its assumption is that all features are conditionally independent given the label (that’s why it is “naive”), so, based on the joint distribution learned from the training dataset, we can get the output Y from the input X via MAP.
In Bayes’ rule $P(B|A) = \frac{P(A|B)P(B)}{P(A)}$, B is the category and A is the feature vector, so we pick the category B with the highest posterior probability.
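A count-based sketch for discrete features that follows this MAP rule (the Laplace smoothing term `alpha` is my addition, not from the text):

```python
import numpy as np

def nb_fit(X, y, alpha=1.0):
    """X: (n, d) array of discrete features; y: labels. Returns everything nb_predict needs."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}              # P(B)
    counts, totals = {}, {}
    n_values = [len(np.unique(X[:, j])) for j in range(X.shape[1])]
    for c in classes:
        Xc = X[y == c]
        for j in range(X.shape[1]):
            vals, cnts = np.unique(Xc[:, j], return_counts=True)
            counts[(c, j)] = dict(zip(vals.tolist(), cnts.tolist()))
            totals[(c, j)] = len(Xc) + alpha * n_values[j]      # smoothed denominator
    return classes, priors, counts, totals, alpha

def nb_predict(x, model):
    """MAP prediction: argmax_c  log P(c) + sum_j log P(x_j | c)."""
    classes, priors, counts, totals, alpha = model
    scores = {}
    for c in classes:
        s = np.log(priors[c])
        for j, v in enumerate(x):
            s += np.log((counts[(c, j)].get(v, 0) + alpha) / totals[(c, j)])
        scores[c] = s
    return max(scores, key=scores.get)
```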
Generative Model & Discriminative Model
The difference between these models is how they obtain $P(Y|X)$. The former estimates the joint distribution $P(X,Y)$ and applies Bayes’ rule; the latter estimates $P(Y|X)$ directly.
For example, logistic regression (trained by gradient descent) is discriminative, and Bayes methods are generative.
MLE (Maximum Likelihood Estimation)[parameter]
- MLE gives the explanation of the data you observed.
- If n is large and your model/distribution is correct (that is H includes the true model), then MLE finds the true parameters.
- But the MLE can overfit the data if n is small. It works well when n is large.
- If you do not have the correct model (and n is small) then MLE can be terribly wrong!
- This is the (frequentist) statistics view: $\theta$ is a fixed parameter, not a random variable.
MAP (Maximum a Posteriori Probability Estimation) [random variable]
Add a prior knowledge term when estimating $\theta$; for the coin example, $\hat{\theta}=\frac{n_H+m}{n_H+n_T+2m}$, where $m$ acts as a pseudo-count of imagined heads and tails. As the amount of data increases, MAP converges to MLE, but when the sample size is limited the prior changes the estimate significantly.
As a result, we change the objective from MLE’s $\log P(D;\theta)$ to $\log P(D \mid \theta)+\log P(\theta)$. The added term is independent of the data and penalizes the parameters $\theta$ if they deviate too much from what we believe is reasonable.
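A tiny numerical illustration of the coin example above, with made-up counts, showing how the prior’s influence fades as the data grows:

```python
def mle(n_heads, n_tails):
    """Maximum likelihood estimate of P(heads)."""
    return n_heads / (n_heads + n_tails)

def map_estimate(n_heads, n_tails, m):
    """MAP estimate with m 'imagined' heads and m 'imagined' tails as the prior."""
    return (n_heads + m) / (n_heads + n_tails + 2 * m)

# Small sample: the prior matters a lot.
print(mle(3, 1), map_estimate(3, 1, m=10))                # 0.75 vs ~0.542
# Large sample: MAP converges to MLE.
print(mle(3000, 1000), map_estimate(3000, 1000, m=10))    # 0.75 vs ~0.749
```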
True Bayesian
Instead of a point estimate, the fully Bayesian approach integrates over all possible $\theta$: $P(y|x,D)=\int_{\theta}P(y|x,\theta)P(\theta|D)d\theta$. This integral is usually intractable.
Linear Regression
Linear Regression: $f(x_i) = \sum_{m=1}^{p}w_{m}x_{im}+w_{0}=w^{T}x_{i}$ (with a constant 1 appended to $x_i$ to absorb $w_0$).
Loss function: $J(w)=\frac{1}{n} \sum_{i=1}^{n}(y_{i}-f(x_{i}))^{2}$.
This squared loss can also be derived from MLE, assuming Gaussian noise: $y_i = w^{T}x_i + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0,\sigma^2)$.
When it comes to MAP, a Gaussian prior on $w$ adds an L2 penalty, which gives ridge regression.
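A minimal closed-form sketch of both estimates (ordinary least squares and the ridge/MAP variant; the $\lambda$ value and bias handling are my choices):

```python
import numpy as np

def fit_ols(X, y):
    """w = (X^T X)^{-1} X^T y, minimizing the squared loss J(w)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])            # append a constant for w_0
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

def fit_ridge(X, y, lam=1.0):
    """MAP with a Gaussian prior on w adds lam * I to the normal equations."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    d = Xb.shape[1]
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(d), Xb.T @ y)
```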
Logistic Regression
LR uses the sigmoid function ($\frac{1}{1+e^{-z}}$) to output a classification probability, but it is still a linear method drawing a linear hyperplane (decision boundary).
Use gradient descent (partial derivatives) to optimize the MLE loss, the negative log-likelihood $J(\theta) = -\sum_{i}\big[y_i\log h_\theta(x_i) + (1-y_i)\log(1-h_\theta(x_i))\big]$, where $h_\theta(x)$ is the sigmoid output.
It looks like linear regression with an activation function added; moreover, because this loss function is convex, gradient-based methods reach the global optimum.
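A gradient-descent sketch of this, assuming labels in $\{0,1\}$ and an arbitrary learning rate (names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_fit(X, y, lr=0.1, iters=1000):
    """Minimize the (convex) negative log-likelihood by batch gradient descent."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ theta)        # predicted P(y=1 | x)
        grad = X.T @ (p - y) / n      # gradient of the negative log-likelihood
        theta -= lr * grad
    return theta
```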
L2 Regularization
$\sum_{j=1}^{n}\theta^{2}_{j}$ is used to prevent the weights from becoming too large, which effectively simplifies the model. A hyperparameter weights this term, which is added to the loss function.
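Concretely, writing that hyperparameter as $\lambda$ (my notation) and the learning rate as $\alpha$, the regularized objective and its gradient step become:

$$J_{\text{reg}}(\theta) = J(\theta) + \lambda\sum_{j=1}^{n}\theta^{2}_{j}, \qquad \theta_j \leftarrow \theta_j - \alpha\left(\frac{\partial J}{\partial \theta_j} + 2\lambda\theta_j\right)$$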
SVM
Support Vector
The goal is to make sure all samples are not only separated by the hyperplane but also kept away from it. The distance between the two supporting hyperplanes (the margin) is $\frac{2}{|w|_2}$.
When training an SVM, samples that are not support vectors do not affect the solution; only the support vectors determine the hyperplane.
Target
Find the hyperplane with the maximum margin: $\max \;\; \gamma = \frac{y(w^Tx + b)}{|w|_2}$ (the distance from a point to the hyperplane).
A linear SVM uses the hinge loss $\ell(y, \hat y) = \max(0, 1-y\hat y)$ to maximize the margin, so the numerator is fixed to 1 and the target becomes $\max \;\; \frac{1}{|w|_2}$, i.e. $\min \;\; \frac{1}{2}|w|_2^{2} \;\; s.t. \;\; y_i(w^Tx_i + b) \geq 1$.
Soft SVM (Slack Variable)
Change the target to $\min \;\; \frac{1}{2}|w|_2^{2}+C\sum_{i=1}^{m}\zeta_{i} \;\; s.t. \;\; y_i(w^Tx_i + b) \geq 1-\zeta_{i}, \; \zeta_{i} \geq 0$, where C is a hyperparameter and $\zeta_i$ is a slack variable that softens the margin constraint.
There is also an L2 version that penalizes $\zeta_i^2$ instead.
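A subgradient-descent sketch of the soft-margin objective above (labels in $\{-1,+1\}$; this is my own minimal version, not a proper QP/SMO solver):

```python
import numpy as np

def linear_svm_fit(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)) by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                                   # points inside the margin (slack > 0)
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```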
Why do I write in English?
Practicing my shitty writing skills.