I went through this course twice; these are some personal notes I took while studying it. Treat them only as a reference, and please point out any mistakes.
Introduction
Tom Mitchell (1998) Well-posed Learning Problem:
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
supervised or unsupervised
- SVD: cocktail party problem algorithm
Linear Regression
- Hypothesis: hθ(x) = θTx = θ0 + θ1x1 + … + θnxn
- cost function: J(θ) = 1/(2m)·∑(hθ(x(i)) − y(i))^2
- Goal: minimize J(θ) over θ
- Gradient descent algorithm: repeat { θj := θj − α·∂J(θ)/∂θj }, updating all θj simultaneously
- J(θ) is bowl-shaped (convex), so there is no local optimum other than the global one
- "batch" gradient descent: each step uses all m training examples
- learning rate:
- too small -> slow convergence
- too large -> J(θ) may not decrease on every iteration and may not converge
- Normal equation: θ = (XTX)^(−1)·XTy
- Gradient Descent vs Normal Equation
- gradient descent: need to choose α; needs many iterations; works well even when n is large
- normal equation: no need to choose α; no iterations; slow when n is very large, since computing (XTX)^(−1) is roughly O(n^3) (see the sketch below)
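As a minimal illustration of the two approaches, here is a NumPy sketch (not course code; the synthetic data and the learning rate are my own choices) that fits the same dataset with batch gradient descent and with the normal equation:

```python
import numpy as np

# Minimal sketch: batch gradient descent vs. the normal equation
# on synthetic data (the feature matrix X already includes the bias column).
np.random.seed(0)
m = 100
X = np.c_[np.ones(m), np.random.rand(m, 1) * 10]   # m x 2, first column = 1
y = 4.0 + 3.0 * X[:, 1] + np.random.randn(m)       # true theta is roughly [4, 3]

def gradient_descent(X, y, alpha=0.01, iters=5000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / len(y)      # dJ/dtheta
        theta -= alpha * grad
    return theta

theta_gd = gradient_descent(X, y)
theta_ne = np.linalg.pinv(X.T @ X) @ X.T @ y        # normal equation
print(theta_gd, theta_ne)                           # both close to [4, 3]
```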
- Linear Algebra:
- Matrix multiplication is associative: (A∗B)∗C = A∗(B∗C)
- non-invertible matrices are called singular or degenerate; use pinv instead of inv
- Feature Normalization
- feature scaling and mean normalization: xi := (xi − μi)/si
- without it, gradient descent will oscillate inefficiently on its way down to the optimum when the feature ranges are very uneven
- Predictor: when predicting on a new example, apply the same normalization (the training-set μ and s) before computing hθ(x) (see the sketch below)
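A small sketch of mean normalization and feature scaling, assuming NumPy; the housing-style numbers are made up. The μ and σ computed on the training set are reused for new inputs:

```python
import numpy as np

def feature_normalize(X):
    # Mean normalization and feature scaling: x := (x - mu) / sigma
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# The same mu/sigma from the training set must be reused for new inputs.
X_train = np.array([[2104.0, 3], [1600.0, 3], [2400.0, 4], [1416.0, 2]])
X_norm, mu, sigma = feature_normalize(X_train)
x_new = (np.array([1650.0, 3]) - mu) / sigma
```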
Logistic Regression
- the “Sigmoid Function” (a.k.a. “Logistic Function”): g(z) = 1/(1 + e^(−z)), with hθ(x) = g(θTx)
- Decision Boundary
- threshold at 0.5: predict y = 1 when hθ(x) ≥ 0.5, i.e. when θTx ≥ 0
- the line that separates the area where y = 0 and where y = 1
- Cost Function
- We cannot use the same cost function as for linear regression, because with the logistic hypothesis it would be non-convex ("wavy"), with many local optima
- instead: Cost(hθ(x), y) = −y·log(hθ(x)) − (1−y)·log(1−hθ(x)), which is convex
- Gradient Descent
- intuition: a classification problem can be derived from a regression problem by using the separating property of the regression line
- advanced optimization algorithms (a sketch using one of these follows the list below)
- Conjugate gradient
- BFGS
- L-BFGS
- gradient descent
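A minimal sketch of the logistic cost and gradient, minimized with SciPy's BFGS instead of hand-written gradient descent; the toy data and the small clipping constant in the log are my own choices, not from the course:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    # J = -1/m * sum( y*log(h) + (1-y)*log(1-h) )
    return -np.mean(y * np.log(h + 1e-12) + (1 - y) * np.log(1 - h + 1e-12))

def grad(theta, X, y):
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / len(y)   # same form as linear regression, different h

# Toy data: two overlapping 2-D clusters, bias column prepended.
np.random.seed(1)
X = np.vstack([np.random.randn(50, 2) + 1, np.random.randn(50, 2) - 1])
y = np.r_[np.ones(50), np.zeros(50)]
X = np.c_[np.ones(len(y)), X]

res = minimize(cost, np.zeros(3), args=(X, y), jac=grad, method="BFGS")
print(res.x)   # fitted theta; the decision boundary is theta^T x = 0
```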
- Multiclass Classification: One-vs-all (train one binary classifier per class and predict the class whose classifier outputs the highest hθ(x))
- Overfitting:
- Reduce the number of features:
- select which features to keep
- model selection algorithm
- Regularization
- reduce the magnitude of parameters θj
- Regularization works well when we have a lot of slightly useful features
- when j = 0 the penalty is not applied (θ0 is not penalized and is allowed to be large; this is a convention, explained below)
θ0 only offsets the relationship, and its scale is therefore far less important to the problem. Moreover, in case a large offset is needed for whatever reason, regularizing it would prevent finding the correct relationship.
- cost function: J(θ) = 1/(2m)·[∑(hθ(x(i)) − y(i))^2 + λ·∑(j=1..n) θj^2]
- gradient descent
- normal equation in linear regression
- if m ≤ n, then XTX is non-invertible. However, when we add the term λ⋅L, then XTX + λ⋅L becomes invertible.
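A sketch of the regularized normal equation θ = (XTX + λ·L)^(−1)·XTy, with L equal to the identity except for a zero in the top-left corner so that θ0 is not penalized (the tiny m ≤ n example is made up):

```python
import numpy as np

def normal_eqn_regularized(X, y, lam):
    # theta = (X'X + lambda*L)^(-1) X'y, with L = identity except L[0, 0] = 0
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0            # do not penalize the intercept theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

# Example: m <= n case, where plain X'X would be singular.
X = np.c_[np.ones(3), np.random.randn(3, 5)]   # 3 examples, 6 parameters
y = np.random.randn(3)
theta = normal_eqn_regularized(X, y, lam=1.0)
```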
Neural Network
- why the bias unit is always 1: it is just a convention/preference; it is excluded from regularization because the network tends to work better when the bias terms are allowed to be large
- Non-linear Hypotheses
- feature mapping to polynomial features
- neural network
- Representation
- ai(j) = “activation” of unit i in layer j
- Θ(j) = matrix of weights controlling function mapping from layer j to layer j+1
- Intuitions
- XNOR
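A sketch of forward propagation computing XNOR by combining an AND unit, a (NOT x1) AND (NOT x2) unit and an OR unit, with hand-picked weights of the kind used in the lecture (the large ±10/±20/±30 values simply saturate the sigmoid):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked weights: large magnitudes push the sigmoid to ~0 or ~1.
THETA1 = np.array([[-30.0,  20.0,  20.0],    # a1 = x1 AND x2
                   [ 10.0, -20.0, -20.0]])   # a2 = (NOT x1) AND (NOT x2)
THETA2 = np.array([-10.0, 20.0, 20.0])       # output = a1 OR a2  -> XNOR

def xnor(x1, x2):
    a0 = np.array([1.0, x1, x2])             # input layer with bias unit
    a1 = sigmoid(THETA1 @ a0)                 # hidden layer
    a1 = np.r_[1.0, a1]                       # add bias unit
    return sigmoid(THETA2 @ a1)               # output layer

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(float(xnor(x1, x2))))  # prints 1, 0, 0, 1
```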
- Multiclass Classification
- the target y is a vector in which only one element equals 1 (one-hot), with one output unit per class
- Cost Function
- Backpropagation Algorithm
- δj(l) = “error” of node j in layer l
(proof still to be worked out)
- Gradient Checking
- gradient checking assures that our backpropagation works as intended; disable it once verified, because it is slow
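A sketch of two-sided numerical gradient checking. The analytic gradient here comes from a simple linear-regression cost as a stand-in; in practice it would be the gradient returned by backpropagation:

```python
import numpy as np

EPS = 1e-4

def numerical_gradient(J, theta):
    # Two-sided difference: (J(theta + eps*e_i) - J(theta - eps*e_i)) / (2*eps)
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = EPS
        grad[i] = (J(theta + e) - J(theta - e)) / (2 * EPS)
    return grad

# Stand-in cost (linear regression); for a neural network J would be the NN cost
# and analytic_grad would come from backpropagation.
np.random.seed(2)
X = np.c_[np.ones(20), np.random.randn(20, 3)]
y = np.random.randn(20)
J = lambda t: 0.5 * np.mean((X @ t - y) ** 2)
analytic_grad = lambda t: X.T @ (X @ t - y) / len(y)

theta = np.random.randn(4)
num = numerical_gradient(J, theta)
ana = analytic_grad(theta)
print(np.max(np.abs(num - ana)))   # should be tiny (~1e-9)
```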
- Random Initialization
- if all theta weights are initialized to zero, all hidden units compute the same thing and keep updating to the same values (symmetry is never broken)
- Initialize each Θ(l)ij to a random value in [−ϵ, ϵ]
- Training a Neural Network
- Randomly initialize the weights
- Implement forward propagation to get hθ(x(i))
- Implement the cost function
- Implement backpropagation to compute partial derivatives
- Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
- Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.
Advice for Applying Machine Learning
- Errors in your predictions can be diagnosed and addressed by:
- Getting more training examples (fixes high variance)
- Trying smaller sets of features (fixes high variance)
- Trying additional features (fixes high bias)
- Trying polynomial features (fixes high bias)
- Increasing λ (fixes high variance) or decreasing λ (fixes high bias)
- Train/Validation/Test Sets
- training set: optimize the parameters in Θ
- validation set: find the best model with the least cross-validation error
- test set: estimate the generalization error
- report these errors without the regularization term (evaluate J with λ = 0)
- Bias vs Variance
- underfit => high bias => Jtrain high / Jcv high / Jtrain ≈ Jcv
- overfit => high variance => Jtrain low / Jcv high (Jcv ≫ Jtrain)
- Learning Curves
- high bias => getting more training data will not (by itself) help
- high variance => getting more training data is likely to help
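A sketch of how a learning curve is computed: fit on growing subsets of the training set and record Jtrain (on the subset) and Jcv (on the full cross-validation set). The synthetic data and the plain normal-equation fit are illustrative only:

```python
import numpy as np

def fit(X, y):
    return np.linalg.pinv(X.T @ X) @ X.T @ y      # linear fit (normal equation)

def cost(theta, X, y):
    return 0.5 * np.mean((X @ theta - y) ** 2)

# Synthetic data split into training / cross-validation sets.
np.random.seed(3)
X = np.c_[np.ones(120), np.random.randn(120, 1)]
y = 1.0 + 2.0 * X[:, 1] + 0.5 * np.random.randn(120)
Xtr, ytr, Xcv, ycv = X[:80], y[:80], X[80:], y[80:]

for m in range(2, 81, 10):
    theta = fit(Xtr[:m], ytr[:m])
    j_train = cost(theta, Xtr[:m], ytr[:m])       # error on the examples trained on
    j_cv = cost(theta, Xcv, ycv)                  # error on the full CV set
    print(m, round(j_train, 3), round(j_cv, 3))
```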
- Diagnosing Neural Networks
- fewer parameters: prone to underfitting, but computationally cheaper
- more parameters: prone to overfitting and computationally expensive (increase λ to address the overfitting)
- Model Selection
- Intuition for the bias-variance trade-off:
- Complex model => sensitive to data => much affected by changes in X => high variance, low bias.
- Simple model => more rigid => does not change as much with changes in X => low variance, high bias.
- Regularization Effects:
- Small values of λ allow model to become finely tuned to noise leading to large variance => overfitting.
- Large values of λ pull weight parameters to zero leading to large bias => underfitting.
Machine Learning System Design
- Error Analysis
- Start with a simple algorithm, implement it quickly, and test it early.
- Plot learning curves to decide if more data, more features, etc. will help
- Error analysis: manually examine the errors on examples in the cross validation set and try to spot a trend.
- skewed classes: accuracy is misleading; use precision, recall and the F1 score instead
Support Vector Machines
- Optimization Objective: hinge loss
- projection: p(i) is the projection of x(i) onto the vector θ
- the vector for Θ is perpendicular to the decision boundary. In order for our optimization objective (above) to hold true, we need the absolute value of our projections p(i) to be as large as possible.
- Kernels: allow us to make complex, non-linear classifiers using Support Vector Machines.
- Gaussian Kernel: similarity(x, l(i)) = exp(−‖x − l(i)‖^2 / (2σ^2)), with landmarks l(i) placed at the training examples
- Logistic Regression vs. SVMs
- If n is large (relative to m), then use logistic regression, or SVM without a kernel (the “linear kernel”)
- If n is small and m is intermediate, then use SVM with a Gaussian Kernel
- If n is small and m is large, then manually create/add more features, then use logistic regression or SVM without a kernel.
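A sketch of this choice in practice, assuming scikit-learn is available: on concentric circles (not linearly separable) a linear model stays near chance level while an SVM with a Gaussian (RBF) kernel fits well. The dataset and hyperparameters are illustrative:

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Concentric circles: not linearly separable, so a Gaussian (RBF) kernel helps.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)

linear = LogisticRegression().fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("logistic regression:", linear.score(X, y))    # roughly chance level
print("SVM + Gaussian kernel:", rbf_svm.score(X, y)) # close to 1.0
```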
Unsupervised Learning
Clustering
- K-means
- ‘Cluster Assignment’ step
- ‘Move Centroid’ step
- the distortion cost function never increases during these two steps
- random initialization
- can get stuck in local optima
- run the algorithm on many different random initializations
- choosing K
- the elbow method
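A minimal NumPy sketch of the two alternating K-means steps, with random initialization from the training examples and multiple restarts to reduce the risk of a bad local optimum (data and K are made up):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]   # random init from data
    for _ in range(iters):
        # Cluster assignment step: assign each point to its nearest centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        c = d.argmin(axis=1)
        # Move centroid step: mean of the points assigned to each cluster
        new = []
        for k in range(K):
            pts = X[c == k]
            new.append(pts.mean(axis=0) if len(pts) else centroids[k])
        centroids = np.array(new)
    # Final distortion (cost) with the final centroids and assignments
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    c = d.argmin(axis=1)
    cost = d[np.arange(len(X)), c].mean()
    return centroids, c, cost

# Run several random initializations and keep the lowest-cost result.
X = np.vstack([np.random.randn(50, 2) + mu for mu in ([0, 0], [5, 5], [0, 5])])
best = min((kmeans(X, K=3, seed=s) for s in range(10)), key=lambda r: r[2])
```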
Dimensionality Reduction
- Motivation
- data compression:
- reduces the memory/disk space needed to store the data
- speeds up learning algorithms
- visualization: reduce the data to 2-D or 3-D so it can be plotted
- PCA(Principal Component Analysis)
- PCA minimizes the perpendicular (projection) distance to the lower-dimensional subspace; this is not linear regression, which minimizes the vertical distance between predictions and y
- preprocess: mean normalization and feature scaling first
- algorithm:
- covariance matrix: Σ = (1/m)·XTX
- compute its eigenvectors (e.g. via SVD: [U, S, V] = svd(Sigma))
- take the first k columns of the U matrix and compute z
- reconstruction: x_approx = Ureduce·z
- choose k to be the smallest value such that at least 99% of the variance is retained, i.e. (average squared projection error) / (total variation) ≤ 0.01
- Bad use of PCA: trying to prevent overfitting.
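A sketch of the PCA recipe above: normalize, form Σ = (1/m)·XTX, take its SVD, project onto the first k columns of U, reconstruct, and choose k by the 99%-variance-retained rule. The data and helper names are illustrative:

```python
import numpy as np

def pca(X):
    Sigma = X.T @ X / len(X)              # covariance matrix (X already normalized)
    U, S, _ = np.linalg.svd(Sigma)
    return U, S

def choose_k(S, keep=0.99):
    # smallest k with retained variance >= 99%
    ratio = np.cumsum(S) / np.sum(S)
    return int(np.searchsorted(ratio, keep) + 1)

np.random.seed(4)
X = np.random.randn(200, 5) @ np.random.randn(5, 5)   # correlated features
X = (X - X.mean(axis=0)) / X.std(axis=0)              # feature normalization

U, S = pca(X)
k = choose_k(S)
Z = X @ U[:, :k]               # project down to k dimensions
X_approx = Z @ U[:, :k].T      # reconstruction back to n dimensions
```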
Anomaly Detection
- Gaussian Distribution
- based on an independence assumption: p(x) = ∏j p(xj; μj, σj^2)
- Anomaly if p(x)<ϵ
- use the cross-validation set to choose parameter ϵ
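A sketch of the independent-Gaussian model and of choosing ϵ by F1 score on a labeled cross-validation set; the helper names (fit_gaussian, select_epsilon) are my own, not course functions:

```python
import numpy as np

def fit_gaussian(X):
    # Per-feature mean and variance (independence assumption)
    return X.mean(axis=0), X.var(axis=0)

def p(X, mu, var):
    # Product of the univariate Gaussian densities over the features
    return np.prod(np.exp(-(X - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var),
                   axis=1)

def select_epsilon(p_cv, y_cv):
    # Choose epsilon by F1 score on a labeled cross-validation set (y=1: anomaly)
    best_eps, best_f1 = 0.0, -1.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = (p_cv < eps).astype(int)
        tp = np.sum((pred == 1) & (y_cv == 1))
        fp = np.sum((pred == 1) & (y_cv == 0))
        fn = np.sum((pred == 0) & (y_cv == 1))
        if tp == 0:
            continue
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps
```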
- Anomaly Detection vs. Supervised Learning
- a very small number of positive examples
- many different “types” of anomalies
- feature transforms
- to make a feature look more Gaussian
- e.g. log(x), x^(1/2)
- Multivariate Gaussian Distribution
- automatically capture correlations between different features of x
- original (per-feature) model advantages: computationally cheaper, and it performs well even with a small training set
Recommender Systems
- Content Based Recommendations
θ(j) = parameter vector for user j
x(i) = feature vector for movie i
learn θ(j) for each user from the movies that user has rated
- Collaborative Filtering
- Collaborative Filtering Algorithm
- Initialize x(1),…,x(nm), θ(1),…,θ(nu) to small random values
- break symmetry: so that the learned features x(i) end up different from each other
- Minimize J(x(1),…,x(nm),θ(1),…,θ(nu)) using gradient descent
- For a user with parameters θ and a movie with (learned) features x, predict a star rating of θTx.
- this also gives the full predicted ratings matrix (entry (i, j) is θ(j)Tx(i))
- Implementation Detail
- normalizing the data relative to the mean rating of each movie (see the sketch below)
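A sketch of the mean-normalization detail: compute each movie's mean over only the entries that were actually rated, subtract it before learning, and add it back when predicting (the helper name and matrix conventions are assumptions):

```python
import numpy as np

def normalize_ratings(Y, R):
    # Y: num_movies x num_users ratings; R[i, j] = 1 if movie i was rated by user j
    Ymean = np.zeros(Y.shape[0])
    Ynorm = np.zeros_like(Y, dtype=float)
    for i in range(Y.shape[0]):
        rated = R[i] == 1
        if rated.any():
            Ymean[i] = Y[i, rated].mean()
            Ynorm[i, rated] = Y[i, rated] - Ymean[i]
    return Ynorm, Ymean

# Prediction for user j on movie i: theta_j . x_i + Ymean[i]
# A new user with no ratings then gets each movie's mean rating by default.
```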
Large Scale Machine Learning (Week 10)
- Stochastic Gradient Descent
- Mini-Batch Gradient Descent
- Online Learning
- continuously update θ using a continuous stream of incoming examples (learn from each example, then discard it)
- Map Reduce and Data Parallelism
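A sketch of stochastic vs. mini-batch gradient descent on linear regression; batch_size=1 gives SGD and a larger batch gives mini-batch. The learning rate, epoch count and synthetic data are illustrative:

```python
import numpy as np

def sgd(X, y, alpha=0.01, epochs=50, batch_size=1, seed=0):
    # batch_size=1 -> stochastic GD; batch_size>1 -> mini-batch GD
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(epochs):
        order = rng.permutation(m)                # shuffle examples each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            theta -= alpha * Xb.T @ (Xb @ theta - yb) / len(idx)
    return theta

np.random.seed(5)
X = np.c_[np.ones(1000), np.random.randn(1000, 2)]
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(1000)
print(sgd(X, y, batch_size=1))     # stochastic
print(sgd(X, y, batch_size=32))    # mini-batch
```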
Some derivations
- derivation of the normal equation (partial derivatives of the cost) in linear regression
- Gradient Descent in Logistic Regression
- BP in Neural Network
- machine learning algorithms that do not need feature scaling: those that do not use gradient descent
CS229 supplements
- Non-parametric machine learning algorithms
- Algorithms that make few assumptions about the form of the target function are called non-parametric machine learning algorithms. By not making assumptions, they are free to learn any functional form from the training data.
- Non-parametric learning is usually useful when you have a lot of data and little prior knowledge, and you do not want to spend too much effort choosing features/parameters.
- Some examples of non-parametric machine learning algorithms:
- decision trees, e.g. CART and C4.5
- Naive Bayes
- support vector machines
- neural networks
- Advantages of non-parametric machine learning algorithms:
- flexibility: can fit many different functional forms
- power: makes no (or only weak) assumptions about the target function
- performance: can achieve very good predictive performance
- Limitations of non-parametric machine learning algorithms:
- more data: requires much more training data to estimate the target function
- slower: training is usually slower because there are more parameters to fit
- overfitting: higher risk of overfitting, and the predictions are harder to interpret
- perceptron learning algorithm
- exponential family
- softmax regression
- GDA (Gaussian Discriminant Analysis)
- event models for text classification
- ablative analysis
- feature selection
- wrapper methods
- forward search
- backward search
- filter methods
- mutual information
- KL divergence
- Summary of dimensionality reduction algorithms:
- linear
- PCA
- the projected points preserve as much of the variance as possible
- do an SVD/eigendecomposition of the covariance matrix and keep the components with the largest eigenvalues
- FA
- assumes the observed samples are generated from low-dimensional latent points through a Gaussian prior, a linear transformation and additive noise
- solved with the EM algorithm
- ICA
- separates statistically independent non-Gaussian source signals (e.g. the cocktail party problem)
- nonlinear
- local
- LLE
- Laplacian Eigenmaps
- global
- kernel PCA
- Auto-encoder
- SOM
- MDS
- Ensemble methods: train a number of individual learners and choose a strategy for combining them
- the individual learners can be homogeneous or heterogeneous, strongly dependent (sequential, e.g. boosting) or weakly dependent (parallel, e.g. bagging)
- combination strategies: averaging (regression), voting (classification), learning a combiner (stacking)
- boosting: improves the ensemble based on its errors, by concentrating on the samples misclassified by the existing classifiers, training a new classifier and adding it to the ensemble
- ingredients: error rate e, learner weight α, sample weights D, and a combination strategy
- AdaBoost: boosting with decision stumps (single-split CART trees) or neural networks as weak learners (a sketch follows this block)
- train a weak learner G; e is its weighted error rate
- α = 1/2·log((1−e)/e)
- wi := wi/Z·exp(−α·yi·G(xi)), where Z normalizes the weights to sum to 1
- f(x) = sign(∑k αk·Gk(x))
- multiclass AdaBoost:
- M1: requires each weak classifier's accuracy to be greater than 1/2
- M2: uses a pseudo-loss and confidence-rated predictions
- MH: converts each sample's k-class label into k binary (1/0) labels
- SAMME: only requires each weak classifier to do better than 1/k accuracy
- pros and cons: not very prone to overfitting, but sensitive to outliers
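A sketch of binary AdaBoost with decision stumps implementing exactly the formulas above (weighted error e, α = 1/2·log((1−e)/e), re-weighting wi ∝ wi·exp(−α·yi·G(xi))); it assumes scikit-learn for the stump and labels in {−1, +1}:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=50):
    # y must be in {-1, +1}
    m = len(y)
    w = np.full(m, 1.0 / m)                      # sample weights D
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        e = np.sum(w * (pred != y))              # weighted error rate
        e = np.clip(e, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - e) / e)        # learner weight
        w = w * np.exp(-alpha * y * pred)        # re-weight the samples
        w /= w.sum()                             # Z normalizes the weights
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def predict(stumps, alphas, X):
    # f(x) = sign(sum_k alpha_k * G_k(x))
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)
```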
- boosting tree: forward stagewise additive modeling; at each step fit a new regression tree to the current residuals
- GBDT: the residual is approximated by the negative gradient of the loss
- XGBoost: also uses second-order (Hessian) information
- bagging: based on random resampling (bootstrap) of the data; voting for classification, averaging for regression
- random forest: bagging + CART
- from an N×M training set, bootstrap-sample N examples and randomly select m features
- grow each decision tree without pruning
- regression: average over the forest; classification: majority vote of the forest
- as m increases, the correlation between trees increases (raising the error rate), but each tree's predictive power also increases (lowering the error rate)
- OOB (out-of-bag) error rate: predict each sample using only the trees that were not trained on it, then compute the overall misclassification rate
- random forest extensions
- Extra Trees:
- use the original training set without bootstrap sampling, and pick the split thresholds of features at random
- the resulting trees are larger and often generalize better
- Totally Random Trees Embedding: a feature transformation
- build T decision trees to fit the data
- each sample is then represented by the indices of the leaves it falls into in every tree
- the resulting high-dimensional features can be fed to any classification or regression algorithm
- Isolation Forest: anomaly detection
- each tree is built on a sample far smaller than the training set
- trees split at random and use a relatively small maximum depth
- random forest pros and cons:
- parallelizable; efficient when the feature dimensionality is high; insensitive to partially missing features; gives feature importances (add noise to each feature and average oob2 − oob)
- prone to overfitting when the data are very noisy
- stacking: have every trained base model predict on the whole training set; the prediction of the j-th base model for the i-th training sample becomes the j-th feature of the i-th sample in a new training set, and a meta-model is then trained on this new training set (a sketch follows below)
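A sketch of this stacking recipe, assuming scikit-learn: out-of-fold predicted probabilities of each base model (rather than hard labels) become the new features for a logistic-regression meta-model. The particular base models are just examples:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
base_models = [RandomForestClassifier(n_estimators=50, random_state=0),
               SVC(kernel="rbf", probability=True, random_state=0)]

# Column j of the new training set = out-of-fold predictions of base model j.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

meta_model = LogisticRegression().fit(meta_features, y)

# At prediction time, refit the base models on all data and stack their outputs.
for model in base_models:
    model.fit(X, y)
new_features = np.column_stack([m.predict_proba(X)[:, 1] for m in base_models])
print(meta_model.score(new_features, y))
```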