Machine Learning: Teaching a computer to work with data without explicit human intervention
-One way to accomplish this is supervised classification: teaching the computer to classify
data based on patterns discerned from examples labeled by humans
Other def: a computer finding patterns in data, allowing it to make predictions
Supervised Classification algorithms: (Output: Class label)
*************************************
Naive Bayes: Algorithm used to categorize data (very often text); assumes features are independent
of each other, which makes it very simple and fast to use
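A minimal sketch (scikit-learn assumed, matching the parameter names in these notes; the toy data is
made up; for the text data mentioned above, MultinomialNB over word counts is the usual variant):
# Naive Bayes on tiny made-up numeric data (scikit-learn assumed)
from sklearn.naive_bayes import GaussianNB

features_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
labels_train = [0, 0, 1, 1]  # one class label per training point

clf = GaussianNB()
clf.fit(features_train, labels_train)           # learn per-class feature distributions
print(clf.predict([[1.2, 1.9], [5.5, 8.5]]))    # expected: [0 1]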
SVM (Support Vector Machine): Outputs a line (more generally, a boundary) that divides up the data
-Finds the line with the largest margin (distance to the nearest points of both classes)
-Can be linear (straight-line divider, simpler but may underfit) or nonlinear
-Kernel Trick: make two-variable data separable by mapping it into a higher-dimensional space with
added variables, then expressing the solution back in the original two-variable system, which
results in a separable, nonlinear boundary
-Parameters: kernel, C, gamma
-Kernel: type of boundary that divides the data (linear, poly, rbf, etc.)
-C: higher C penalizes misclassified training points more, giving a more intricate boundary (may overfit)
-Gamma: higher gamma makes closer points influence the boundary more (may give a wiggly boundary that overfits)
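A minimal sketch (scikit-learn assumed; kernel/C/gamma are the parameters above, values illustrative):
# SVM with an RBF (nonlinear) kernel on made-up data (scikit-learn assumed)
from sklearn.svm import SVC

features_train = [[0, 0], [1, 1], [2, 0], [3, 1]]
labels_train = [0, 0, 1, 1]

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # try kernel="linear" for a straight divider
clf.fit(features_train, labels_train)
print(clf.predict([[0.5, 0.5], [2.5, 0.5]]))    # likely [0 1] on this toy data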
Decision Trees: Divides up the data with a series of simple (axis-parallel) cuts; many cuts together
can divide the data in multiple ways
-Entropy: measure of impurity among examples (splits are chosen to leave as little entropy as possible)
-Purity: how homogeneous the examples in a given split are
-Information gain: parent entropy minus weighted child entropy; lets the algorithm decide whether
to split or not (is entropy better in the child nodes?)
-Parameters: min_samples_split
-min_samples_split: minimum number of samples a node needs before it is allowed to split again
(higher = simpler tree, less overfitting)
-Pro: easy to understand  Con: often overfits
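A minimal sketch (scikit-learn assumed; criterion="entropy" ties to the entropy notes above):
# Decision tree on made-up data (scikit-learn assumed)
from sklearn.tree import DecisionTreeClassifier

features_train = [[0, 0], [1, 1], [2, 0], [3, 1]]
labels_train = [0, 0, 1, 1]

clf = DecisionTreeClassifier(criterion="entropy", min_samples_split=2)
clf.fit(features_train, labels_train)           # raising min_samples_split curbs overfitting
print(clf.predict([[0.4, 0.6]]))                # expected: [0]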
AdaBoost: Boosting algorithm that starts with very weak classifiers; over successive rounds of
boosting, the combined model eventually becomes very accurate
-Often uses a decision tree as the weak learner, and is then far better than a single decision tree
-Although the combined model becomes more and more complex, boosting tends not to overfit
-As long as the weak classifiers are substantially better than random guessing
-Boosting simplified: H is simply a weighted majority vote, that is, the result of a small-scale "election"
of the predictions of the weak classifiers
(http://rob.schapire.net/papers/explaining-adaboost.pdf)
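A minimal sketch (scikit-learn assumed; its default weak learner is a depth-1 decision tree, a "stump"):
# AdaBoost combining many weak tree stumps into a weighted vote (scikit-learn assumed)
from sklearn.ensemble import AdaBoostClassifier

features_train = [[0, 0], [1, 1], [2, 0], [3, 1]]
labels_train = [0, 0, 1, 1]

clf = AdaBoostClassifier(n_estimators=50)       # 50 rounds of boosting
clf.fit(features_train, labels_train)
print(clf.predict([[2.5, 0.2]]))                # expected: [1]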
*************************************
Regression: (Output: number)
***********
-Continuous results instead of discrete results (e.g. predicting a person's age vs. predicting
whether they have a specific disease)
-The closer the data points are to the regression line, the more accurate the regression is
-r squared measures accuracy of the regression (close to 0 is bad, close to 1 is good)
***********
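A minimal sketch (scikit-learn assumed; the age/net-worth data is made up to echo the continuous-output idea):
# Linear regression: numeric output plus the r squared score above (scikit-learn assumed)
from sklearn.linear_model import LinearRegression

ages = [[18], [25], [30], [42], [55]]           # single feature
net_worths = [50, 90, 120, 180, 240]            # continuous target (made up)

reg = LinearRegression()
reg.fit(ages, net_worths)
print(reg.predict([[35]]))                      # a number, not a class label
print(reg.score(ages, net_worths))              # r squared: close to 1 = good fit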
Overfitting: problem where the model takes the training data too literally (fits its noise), so the
dividing line is too convoluted to generalize to new data
General rules for ML Algorithms: (http://rob.schapire.net/papers/explaining-adaboost.pdf)
-Should be trained on "enough" training examples
-Should have low training error
-Should be as simple as possible
Test set: data held out from training, commonly around 20% of the whole dataset
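A minimal sketch (scikit-learn assumed) of holding out ~20% for testing:
# Split made-up data into 80% train / 20% test (scikit-learn assumed)
from sklearn.model_selection import train_test_split

features = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)
print(len(f_train), len(f_test))                # 8 2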
Bias: a high-bias model doesn't pay enough attention to the data (underfits)
Variance: a high-variance model pays too much attention to the data (overfits; bad when it runs into
data it hasn't seen)
Feature scaling: rescaling features to a common range so no single feature dominates, making the
output of algorithms that weigh or compare features more reliable
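A minimal sketch (scikit-learn assumed) rescaling one made-up feature to the range [0, 1]:
# MinMaxScaler maps the min to 0 and the max to 1 (scikit-learn assumed)
from sklearn.preprocessing import MinMaxScaler

weights = [[115.0], [140.0], [175.0]]           # made-up feature values
scaler = MinMaxScaler()
print(scaler.fit_transform(weights))            # -> [[0.], [0.4166...], [1.]]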
Feature selection: balancing bias and variance when deciding how many features to use
-Regularization: automatically penalizes extra features
-Lasso regression: regression with built-in (L1) regularization, which weighs the penalty on
coefficients against the accuracy of the fit and can zero out unhelpful features
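A minimal sketch (scikit-learn assumed; data is made up so that the second feature is pure noise):
# Lasso's L1 penalty should shrink the noise feature's coefficient to ~0 (scikit-learn assumed)
from sklearn.linear_model import Lasso

features = [[1, 9], [2, 3], [3, 7], [4, 1], [5, 5]]
targets = [2, 4, 6, 8, 10]                      # depends only on the first feature

reg = Lasso(alpha=0.1)                          # alpha controls penalty strength
reg.fit(features, targets)
print(reg.coef_)                                # second coefficient should be at or near 0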
PCA: projects multi-feature data onto the directions of maximal variance, creating principal components
-Principal component: a composite feature that represents latent features
-Maximal variance: the direction along which the data is most spread out; projecting onto it
minimizes information loss
-Latent features: underlying grouping of features
-Fit: use when creating the PCA; fit onto the training features only
-Transform: transform features using the fitted PCA
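A minimal sketch (scikit-learn assumed) of the fit-then-transform pattern described above:
# Reduce two made-up features to one principal component (scikit-learn assumed)
from sklearn.decomposition import PCA

features_train = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2]]
pca = PCA(n_components=1)                       # keep one principal component
pca.fit(features_train)                         # find the direction of maximal variance
reduced = pca.transform(features_train)         # project onto that direction
print(pca.explained_variance_ratio_)            # share of the variance kept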
Cross Validation: splitting the data into several train/test folds so that every point is used for
both training and validation at some stage
Parameter selection: choosing the parameter values that give the classifier its best accuracy
-GridSearchCV: automatically tries every combination of the given parameter values (using cross
validation) and keeps the most accurate one
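A minimal sketch (scikit-learn assumed) combining cross validation with the SVM parameters from earlier:
# Cross-validated grid search over kernel and C on made-up data (scikit-learn assumed)
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

features = [[0, 0], [1, 1], [2, 0], [3, 1], [0, 1], [3, 0]]
labels = [0, 0, 1, 1, 0, 1]

params = {"kernel": ["linear", "rbf"], "C": [1, 10]}
search = GridSearchCV(SVC(), params, cv=2)      # 2-fold cross validation
search.fit(features, labels)
print(search.best_params_)                      # best-scoring combination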
Confusion matrix: matrix of predicted vs. true labels, showing where each data point landed relative
to where it should be (true/false positives and negatives)
-Recall: TP / (TP + FN); good chance it identifies most positives, but may flag too many
-Precision: TP / (TP + FP); if flagged, good chance it's correct, but may miss some positives
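A minimal sketch (scikit-learn assumed) computing the three quantities above on made-up predictions:
# Confusion matrix, recall, and precision (scikit-learn assumed)
from sklearn.metrics import confusion_matrix, precision_score, recall_score

true_labels = [1, 1, 1, 0, 0, 0]
predictions = [1, 1, 0, 0, 0, 1]                # one false negative, one false positive

print(confusion_matrix(true_labels, predictions))
print(recall_score(true_labels, predictions))   # TP/(TP+FN) = 2/3
print(precision_score(true_labels, predictions))  # TP/(TP+FP) = 2/3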
Unsupervised learning: finding structure (e.g. clusters) in data without labels
**********************
K-Means: Commonly used algorithm that clusters data
-Repeatedly assigns each point to its nearest centroid, then moves each centroid to the center of
its assigned points; the centroids iteratively move toward the centers of the clusters
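A minimal sketch (scikit-learn assumed; note there are no labels, only made-up points):
# K-Means with two clusters (scikit-learn assumed)
from sklearn.cluster import KMeans

features = [[1, 1], [1.5, 2], [8, 8], [8.5, 9]]
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(features)                                # centroids iterate toward the cluster centers
print(km.labels_)                               # cluster assignment per point
print(km.cluster_centers_)                      # final centroid positions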
**********************