Introduction
I used to mix up regression and classification. The simplest way to keep them apart: classification targets a discrete output space, e.g. given an image, deciding from its features whether it shows a cat or a dog; regression targets a continuous output space, e.g. predicting a distance or a probability.
Least Squares Linear Regression
First-Order Regression
Given a set of training data with associated function values $(x_i, y_i)$, $i = 1, \dots, n$ (where each $x_i$ is a vector), the linear regression model is
$$f(x) = w_0 + w_1 x_1 + \dots + w_d x_d = w^T x.$$
This is the same equation as in least-squares classification; the only difference is that here the values are continuous.\
$w_0$ is the bias term (the familiar $bias$); we usually fold it into the weight vector by appending a constant feature, which is both convenient and simplifies the computation.\
The general procedure for least-squares linear regression is:
Step 1: Define the model as a linear combination of basis functions:
$$f(x) = \sum_{j=0}^{m} w_j \phi_j(x) = w^T \phi(x)$$
Step 2: Rewrite the sum-of-squares error in terms of this model:
$$E(w) = \frac{1}{2}\sum_{i=1}^{n}\left(w^T\phi(x_i) - y_i\right)^2$$
Step 3: Matrix-vector notation: with $\Phi = [\phi(x_1), \dots, \phi(x_n)]$ and $Y = [y_1, \dots, y_n]^T$,
$$E(w) = \frac{1}{2}\left\|\Phi^T w - Y\right\|^2 \quad\Rightarrow\quad \hat{w} = (\Phi\Phi^T)^{-1}\Phi Y$$
with $\phi_0(x) = 1$; the $\phi_i(\cdot)$ are called basis functions, and the model is still linear in $w$.\
Here $\phi(x)$ can, for example, collect polynomial terms:
$$\phi(x) = (1, x, x^2, \dots, x^m)^T$$
For example, polynomials of different degree produce fits of varying quality (see the figures).
Depending on the degree, the polynomial may underfit or overfit; a quick sketch of the experiment follows.
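A minimal sketch of this experiment, on hypothetical synthetic data (not the course data): fit polynomials of several degrees by least squares and compare train/test error.

```python
import numpy as np

rng = np.random.default_rng(0)
x_tr = rng.uniform(0, 1, 12)                      # small training set
x_te = rng.uniform(0, 1, 100)                     # held-out test set
f = lambda x: np.sin(2 * np.pi * x)               # assumed ground truth
y_tr = f(x_tr) + rng.normal(0, 0.2, x_tr.shape)
y_te = f(x_te) + rng.normal(0, 0.2, x_te.shape)

for degree in (1, 3, 9):
    # Design matrix with rows phi(x_i) = [1, x_i, ..., x_i^degree]
    Phi_tr = np.vander(x_tr, degree + 1, increasing=True)
    Phi_te = np.vander(x_te, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)   # least-squares fit
    rmse = lambda Phi, y: np.sqrt(np.mean((Phi @ w - y) ** 2))
    print(f"degree={degree}: train RMSE={rmse(Phi_tr, y_tr):.3f}, "
          f"test RMSE={rmse(Phi_te, y_te):.3f}")
```

With 12 points, degree 9 typically drives the training error near zero while the test error blows up (overfitting); degree 1 leaves both errors high (underfitting).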
Maximum Likelihood Approach to Regression
Probabilistic Regression
Probabilistic regression rests on two assumptions:
Assumption 1: the target values are generated by adding noise to the function value:
$$y = f(x, w) + \epsilon$$
$y$: target value; $x$: input\
$f$: regression function; $w$: weights (parameters); $\epsilon$: noise\
Assumption 2: the noise is a Gaussian random variable:
$$\epsilon \sim \mathcal{N}(0, \beta^{-1}), \qquad p(y \mid x, w, \beta) = \mathcal{N}\!\left(y \mid f(x, w), \beta^{-1}\right)$$
where $f(x, w)$ is the mean and $\beta^{-1}$ the variance ($\beta$ is the precision). Note that $y$ is now a random variable with underlying distribution $p(y|x,w,\beta)$.
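A minimal sketch of this generative model (all numbers are hypothetical): draw targets by adding Gaussian noise with precision $\beta$ to a deterministic $f(x, w)$.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([0.5, -0.3])                   # assumed "true" parameters
beta = 25.0                                      # noise precision; variance 1/beta

x = rng.uniform(-1, 1, 100)
phi = np.c_[np.ones_like(x), x]                  # phi(x) = [1, x]
f = phi @ w_true                                 # deterministic part f(x, w)
eps = rng.normal(0, np.sqrt(1 / beta), x.shape)  # eps ~ N(0, 1/beta)
y = f + eps                                      # Assumption 1: y = f(x, w) + eps
```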
In detail:\
given data points $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$ and the associated function values $Y = [y_1, \dots, y_n]^T$,\
the conditional likelihood (assuming i.i.d. data) is
$$p(Y|X, w, \beta) = \prod_{i=1}^{n} p(y_i | x_i, w, \beta).$$
Plugging in the linear model gives
$$p(Y|X, w, \beta) = \prod_{i=1}^{n} \mathcal{N}\!\left(y_i \mid w^T\phi(x_i), \beta^{-1}\right),$$
where $w^T\phi(x_i)$ is the generalized linear regression model. We now maximize this likelihood with respect to $\beta$ and $w$.\
Taking the logarithm:
$$\log p(Y|X, w, \beta) = \frac{n}{2}\log\beta - \frac{n}{2}\log(2\pi) - \frac{\beta}{2}\sum_{i=1}^{n}\left(y_i - w^T\phi(x_i)\right)^2$$
The gradient with respect to $w$ is:
$$\nabla_w \log p(Y|X, w, \beta) = \beta\sum_{i=1}^{n}\left(y_i - w^T\phi(x_i)\right)\phi(x_i)$$
As before, define:
$$\Phi = [\phi(x_1), \dots, \phi(x_n)]$$
Setting the gradient to zero and solving yields:
$$\hat{w} = (\Phi\Phi^T)^{-1}\Phi Y$$
So maximum likelihood recovers exactly the same weights as least-squares regression, which gives us the following conclusion:
least squares is equivalent to assuming the targets are Gaussian-distributed.
So be aware that applying least squares carries this implicit assumption.\
Compared with least squares, however, maximum likelihood is more broadly applicable: besides the weights $w$, it also lets us estimate $\beta$:
$$\frac{1}{\hat\beta} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{w}^T\phi(x_i)\right)^2$$
and it lets us quantify the uncertainty of our estimates (via a loss function).
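A minimal sketch (with hypothetical data) of both maximum-likelihood estimates: $\hat{w}$ coincides with the least-squares solution, and $1/\hat\beta$ is the mean squared residual.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = 0.5 - 0.3 * x + rng.normal(0, 0.2, 200)   # true noise std 0.2, i.e. beta = 25

Phi = np.c_[np.ones_like(x), x]               # rows are phi(x_i)
w_ml = np.linalg.pinv(Phi) @ y                # ML weights = least-squares solution
beta_ml = 1 / np.mean((y - Phi @ w_ml) ** 2)  # ML precision: 1/beta = mean sq. residual
print(w_ml)       # approximately [0.5, -0.3]
print(beta_ml)    # approximately 25
```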
Loss Functions in Regression
Given a new data point $x_t$, least-squares regression predicts $y_t = x_t^T\hat{w}$.\
In maximum-likelihood regression, however, we obtain a whole distribution $p(y|x,w,\beta)$. How do we turn it into a concrete estimate of $y_t$? This is where the loss function comes in.
We then minimize the expected loss:
$$\mathbb{E}[L] = \iint L\big(y, f(x)\big)\, p(x, y)\, dx\, dy$$
For example, with the squared loss $L = \big(f(x) - y\big)^2$:
$$\mathbb{E}[L] = \iint \big(f(x) - y\big)^2\, p(x, y)\, dx\, dy$$
Setting the functional derivative with respect to $f(x)$ to zero,
$$\frac{\partial \mathbb{E}[L]}{\partial f(x)} = 2\int \big(f(x) - y\big)\, p(x, y)\, dy = 0
\quad\Rightarrow\quad f(x) = \int y\, p(y|x)\, dy = \mathbb{E}[y|x].$$
So under the squared loss, the optimal regression function is the mean $\mathbb{E}[y|x]$ of the posterior $p(y|x)$; this is also called the mean prediction.\
For our generalized linear regression model, the mean prediction is $f(x) = \hat{w}^T\phi(x)$. If we further place a Gaussian prior $p(w) = \mathcal{N}(0, \alpha^{-1}I)$ on the weights, the estimate becomes
$$\hat{w} = \left(\Phi\Phi^T + \frac{\alpha}{\beta}I\right)^{-1}\Phi Y.$$
Unlike the earlier result, an extra $\frac{\alpha}{\beta}$ term appears: the prior regularizes the pseudo-inverse. This is also known as ridge regression (岭回归).
Since we are on the topic, a closely related method is $LASSO$ regression. Lasso is very similar to ridge regression; the difference lies in the regularization term (LASSO uses $\ell_1$ regularization, ridge uses $\ell_2$). Both constrain the parameters and thereby prevent overfitting. In addition, Lasso can drive the parameters of weakly informative features exactly to zero, producing a sparse solution; in other words, it performs dimensionality reduction (feature selection) as part of training. An illustration of that sparsity follows.
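A sketch using scikit-learn's Ridge and Lasso (scikit-learn is not used elsewhere in this post, and the data are synthetic): with comparable regularization strengths, Lasso zeroes out the uninformative coefficients while ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features matter; the other eight are pure noise.
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.1, 100)

print(Ridge(alpha=1.0).fit(X, y).coef_.round(3))  # all 10 coefficients nonzero
print(Lasso(alpha=0.1).fit(X, y).coef_.round(3))  # most coefficients exactly 0
```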
The following figure illustrates the difference between ridge regression and least-squares regression:
MAP vs. Regularized Least Squares
Now add a regularization term to the earlier least-squares objective:
$$E(w) = \frac{1}{2}\sum_{i=1}^{n}\left(w^T\phi(x_i) - y_i\right)^2 + \frac{\lambda}{2}\|w\|^2$$
Solving for $w$ gives a new estimate:
$$\hat{w} = (\Phi\Phi^T + \lambda I)^{-1}\Phi Y$$
where $\lambda = \alpha / \beta$.\
In other words, adding a regularization term $\lambda$ to least-squares regression amounts to assuming Gaussian-distributed noise on the targets and a Gaussian-distributed prior on the parameters.\
Taking the earlier degree-9 polynomial regression as an example, different regularization strengths $\lambda$ give the following results:
As the figures show, $\lambda = \alpha / \beta$ controls the model complexity and thus determines the degree of overfitting.
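A minimal sketch of that effect (hypothetical data): compute the closed-form estimate $\hat{w} = (\Phi\Phi^T + \lambda I)^{-1}\Phi Y$ for a degree-9 polynomial and watch $\|\hat{w}\|$ shrink as $\lambda$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 10)
Phi = np.vander(x, 10, increasing=True).T      # degree-9 features, shape (10, n)

for lam in (1e-8, 1e-4, 1.0):
    w = np.linalg.inv(Phi @ Phi.T + lam * np.eye(10)) @ Phi @ y
    print(f"lambda={lam:g}: ||w|| = {np.linalg.norm(w):.2f}")
```

Tiny $\lambda$ leaves huge, wildly oscillating weights (overfitting); large $\lambda$ forces them toward zero (underfitting).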
Full Bayesian Regression
Taking a different perspective: we do not actually have to estimate the parameters $w$ at all; we only need to predict function values from the training data. That is, we can compute the marginal
$$p(y_t | x_t, X, Y) = \int p(y_t | x_t, w)\, p(w | X, Y)\, dw.$$
By Bayes' rule, the posterior inside the integral is
$$p(w | X, Y) = \frac{p(Y | X, w)\, p(w)}{p(Y | X)}.$$
If the prior and likelihood are Gaussian, this predictive distribution is also Gaussian:
$$p(y_t | x_t, X, Y) = \mathcal{N}\big(y_t \mid \mu(x_t), \sigma^2(x_t)\big)$$
with
$$\mu(x) = \beta\, \phi(x)^T \left(\alpha I + \beta\, \Phi\Phi^T\right)^{-1} \Phi Y, \qquad
\sigma^2(x) = \beta^{-1} + \phi(x)^T \left(\alpha I + \beta\, \Phi\Phi^T\right)^{-1} \phi(x).$$
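A minimal sketch of this predictive distribution on hypothetical data, matching the mean and variance formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0                      # prior and noise precision (assumed)
x = rng.uniform(-1, 1, 30)
y = 0.5 - 0.3 * x + rng.normal(0, np.sqrt(1 / beta), 30)
Phi = np.c_[np.ones_like(x), x].T            # columns are phi(x_i), shape (2, n)

S_inv = alpha * np.eye(2) + beta * Phi @ Phi.T            # posterior precision
phi_t = np.array([1.0, 0.3])                              # query point x_t = 0.3
mu_t = beta * phi_t @ np.linalg.solve(S_inv, Phi @ y)     # predictive mean
var_t = 1 / beta + phi_t @ np.linalg.solve(S_inv, phi_t)  # predictive variance
print(mu_t, np.sqrt(var_t))                  # mean should be near 0.5 - 0.3 * 0.3
```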
(Appendix) Homework code
The homework essentially implements everything in this post. (The code should be correct in the broad strokes, but some details may still be off; I got tired of fixing it, so this is where it stands for now.)\
(a):

```python
import numpy as np
import matplotlib.pyplot as plt

def linear_features(X_train, X_test, y_train, y_test, _lambda=0.01):
    """
    Ridge regression with linear features phi(x) = [x, 1].
    param:
        X_train(ndarray): shape = (N_train, 1)
        X_test(ndarray):  shape = (N_test, 1)
        y_train(ndarray): shape = (N_train, 1)
        y_test(ndarray):  shape = (N_test, 1)
        _lambda(float):   ridge coefficient
    return:
        rmse_train(float), rmse_test(float)
    """
    def add_dim_bias(X):
        """
        [x1, x2, ..., xn] -> [[x1, 1], [x2, 1], ..., [xn, 1]]^T
        param:
            X(ndarray): data, shape (N, 1)
        return:
            new_X(ndarray): data with bias feature, shape (2, N)
        """
        N_x = X.shape[0]
        bias_dim = np.ones((N_x, 1))
        return np.c_[X, bias_dim].T

    def loss_fn(X, y, N):
        """Root mean squared error of the predictions X^T w."""
        return np.sqrt(np.sum((X.T @ w - y) ** 2) / N)

    X_train = add_dim_bias(X_train)
    X_test = add_dim_bias(X_test)
    N_train = X_train.shape[1]
    N_test = X_test.shape[1]
    # Regularized least squares: w = (X X^T + lambda I)^{-1} X y, shape (2, 1)
    w = np.linalg.inv(X_train @ X_train.T + _lambda * np.eye(X_train.shape[0])) @ X_train @ y_train
    rmse_train = loss_fn(X_train, y_train, N_train)
    rmse_test = loss_fn(X_test, y_test, N_test)
    # Plot the training points and the fitted line
    plt.figure()
    x = np.linspace(X_train[0, :].min(), X_train[0, :].max(), 100)
    x = add_dim_bias(x)
    plt.scatter(X_train[0, :], y_train[:, 0], marker="o", color="black")
    plt.plot(x[0], (w.T @ x).flatten(), color="blue")
    plt.title("linear_features")
    plt.show()
    return rmse_train, rmse_test

def main_a():
    rmse_train, rmse_test = linear_features(X_train, X_test, y_train, y_test, _lambda=0.01)
    print("root mean squared error of the training data: ", rmse_train)
    print("root mean squared error of the test data: ", rmse_test)
```
(b):

```python
def polynomial_features(X_train, X_test, y_train, y_test, _lambda=0.01, degrees=[2, 3, 4]):
    """
    Ridge regression with polynomial features.
    param:
        X_train(ndarray): shape (N, 1)
        y_train(ndarray): shape (N, 1)
        _lambda(float): ridge coefficient
        degrees(list): list of polynomial degrees to fit
    """
    def add_degree(X, degree):
        """Map X to features [1, x, x^2, ..., x^degree], returned with shape (degree+1, N)."""
        N = X.shape[0]
        new_X = np.ones((N, 1))
        for i in range(1, degree + 1):
            new_X = np.c_[new_X, X ** i]
        return new_X.T

    def loss_fn(w, X, y):
        """Root mean squared error of the predictions X^T w."""
        N = len(y)
        return np.sqrt(np.sum((X.T @ w - y) ** 2) / N)

    # Fit one ridge model per degree
    w_lst = []
    for degree in degrees:
        tmp_X = add_degree(X_train, degree)
        tmp_w = np.linalg.inv(tmp_X @ tmp_X.T + _lambda * np.eye(tmp_X.shape[0])) @ tmp_X @ y_train
        w_lst.append(tmp_w)
    # Evaluate every model on the training and test data
    rmse_train = []
    rmse_test = []
    for i, degree in enumerate(degrees):
        rmse_train.append(loss_fn(w_lst[i], add_degree(X_train, degree), y_train))
        rmse_test.append(loss_fn(w_lst[i], add_degree(X_test, degree), y_test))
    for i, degree in enumerate(degrees):
        print("polynomial of degree={}, root mean squared error of the training data is: {}".format(degree, rmse_train[i]))
    for i, degree in enumerate(degrees):
        print("polynomial of degree={}, root mean squared error of the test data is: {}".format(degree, rmse_test[i]))
    # Plot each fitted polynomial over the training points
    for i, degree in enumerate(degrees):
        x = np.linspace(X_train[:, 0].min(), X_train[:, 0].max(), 100)
        x = add_degree(x, degree)
        plt.figure()
        plt.scatter(X_train[:, 0], y_train[:, 0], marker="o", color="black")
        plt.plot(x[1], (w_lst[i].T @ x).flatten(), color="blue")
        plt.title("polynomial_features with degree = {}".format(degree))
        plt.show()

def main_b():
    polynomial_features(X_train, X_test, y_train, y_test)
```
(c):

```python
def bayesian_linear_regression(X_train, y_train, X_test, y_test, mu=0, sigma=0.1, _lambda=0.01, std=[1, 2, 3]):
    def add_dim_bias(X):
        """
        [x1, x2, ..., xn] -> [[x1, 1], [x2, 1], ..., [xn, 1]]^T
        param:
            X(ndarray): data, shape (N, 1)
        return:
            new_X(ndarray): data with bias feature, shape (2, N)
        """
        N_x = X.shape[0]
        bias_dim = np.ones((N_x, 1))
        return np.c_[X, bias_dim].T

    def prediction_mu_sigma(X):
        """Predictive mean and per-point standard deviation for a feature matrix X of shape (2, N)."""
        pred = X.T @ w  # predictive mean, shape (N, 1)
        # sigma^2(x) = 1/beta + phi(x)^T (alpha I + beta Phi Phi^T)^{-1} phi(x),
        # with Phi the *training* features; the observation noise 1/beta is
        # already included, so nothing else needs to be added on top.
        sigma_2 = 1 / beta + X.T @ np.linalg.inv(S_inv) @ X
        return pred, np.sqrt(sigma_2.diagonal())

    def loss_fn(X, y):
        """Root mean squared error of the predictive means."""
        N = len(y)
        pred, _ = prediction_mu_sigma(X)
        return np.sqrt(np.sum((pred - y) ** 2) / N)

    def log_likelihood_fn(X, y):
        """Average Gaussian log-likelihood; the residual X^T w - y is column-shaped."""
        n = len(y)
        log_likelihood = n / 2 * (np.log(beta) - np.log(2 * np.pi)) - beta / 2 * (np.linalg.norm(X.T @ w - y) ** 2)
        return log_likelihood / n

    beta = 1 / (sigma ** 2)  # noise precision
    alpha = _lambda * beta   # prior precision, so that lambda = alpha / beta
    X_train = add_dim_bias(X_train)  # (2, N_train); y_train is (N_train, 1)
    X_test = add_dim_bias(X_test)    # (2, N_test);  y_test  is (N_test, 1)
    # MAP / ridge estimate of the weights, shape (2, 1)
    w = np.linalg.inv(X_train @ X_train.T + _lambda * np.eye(X_train.shape[0])) @ X_train @ y_train
    # Posterior precision alpha I + beta Phi Phi^T, built from the training features
    S_inv = alpha * np.eye(X_train.shape[0]) + beta * X_train @ X_train.T
    rmse_train = loss_fn(X_train, y_train)
    rmse_test = loss_fn(X_test, y_test)
    log_ll_train = log_likelihood_fn(X_train, y_train)
    log_ll_test = log_likelihood_fn(X_test, y_test)
    print("the RMSE of the train data is: {}".format(rmse_train))
    print("the RMSE of the test data is: {}".format(rmse_test))
    print("the average log-likelihood of the train data is: {}".format(log_ll_train))
    print("the average log-likelihood of the test data is: {}".format(log_ll_test))
    # Plot the mean prediction with 1/2/3 standard-deviation bands,
    # sorting by x so the line and the bands draw cleanly
    pred, x_std = prediction_mu_sigma(X_train)
    order = np.argsort(X_train[0])
    xs = X_train[0][order]
    pred = pred.flatten()[order]
    x_std = x_std[order]
    plt.figure()
    plt.scatter(X_train[0], y_train[:, 0], marker="o", color="black")
    plt.plot(xs, pred, color="blue")
    for s in std:
        plt.fill_between(xs, pred + s * x_std, pred - s * x_std, color="blue", alpha=0.2)
    plt.title("bayesian_linear_regression")
    plt.show()

def main_c():
    bayesian_linear_regression(X_train, y_train, X_test, y_test)
```
(d):

```python
def squared_exponential_features(X_train, y_train, X_test, y_test, _lambda=0.01, k=20, sigma=0.1, beta=10, std=[1, 2, 3]):
    def new_data(X):
        """
        Map X (N, 1) to squared-exponential features
        [1, exp(-beta/2 (x - a_1)^2), ..., exp(-beta/2 (x - a_k)^2)] with a_j = 0.1 j - 1,
        returned with shape (k+1, N). Here beta is the basis-function width.
        """
        X_feat = np.ones(X.shape)
        for j in range(1, k + 1):
            X_feat = np.hstack((X_feat, np.exp(-beta / 2 * np.power(X - (j * 0.1 - 1), 2))))
        return X_feat.T

    def prediction_mu_sigma(X):
        """Predictive mean and per-point standard deviation for a feature matrix X."""
        pred = X.T @ w
        # Predictive variance uses the *noise* precision 1/sigma^2, which is a
        # different quantity from the basis-function width beta above.
        sigma_2 = 1 / noise_beta + X.T @ np.linalg.inv(S_inv) @ X
        return pred, np.sqrt(sigma_2.diagonal())

    def loss_fn(X, y):
        """Root mean squared error of the predictive means."""
        N = len(y)  # length of the y actually passed in, not of y_test
        pred, _ = prediction_mu_sigma(X)
        return np.sqrt(np.sum((pred - y) ** 2) / N)

    def log_likelihood_fn(X, y):
        """Average Gaussian log-likelihood; the residual X^T w - y is column-shaped."""
        n = len(y)
        log_likelihood = n / 2 * (np.log(noise_beta) - np.log(2 * np.pi)) - noise_beta / 2 * (np.linalg.norm(X.T @ w - y) ** 2)
        return log_likelihood / n

    original_data = X_train.copy()
    X_train = new_data(X_train)   # (k+1, N_train)
    X_test = new_data(X_test)     # (k+1, N_test)
    noise_beta = 1 / (sigma ** 2)  # noise precision
    alpha = _lambda * noise_beta   # prior precision, so that lambda = alpha / beta
    w = np.linalg.pinv(X_train @ X_train.T + _lambda * np.eye(X_train.shape[0])) @ X_train @ y_train
    # Posterior precision alpha I + beta Phi Phi^T, built from the training features
    S_inv = alpha * np.eye(X_train.shape[0]) + noise_beta * X_train @ X_train.T
    rmse_train = loss_fn(X_train, y_train)
    rmse_test = loss_fn(X_test, y_test)
    log_ll_train = log_likelihood_fn(X_train, y_train)
    log_ll_test = log_likelihood_fn(X_test, y_test)
    print("the RMSE of the train data is: {}".format(rmse_train))
    print("the RMSE of the test data is: {}".format(rmse_test))
    print("the average log-likelihood of the train data is: {}".format(log_ll_train))
    print("the average log-likelihood of the test data is: {}".format(log_ll_test))
    # Plot: sort by the *original* inputs (row 0 of X_train is the constant bias
    # feature, so sorting by it would scramble the curve)
    pred, x_std = prediction_mu_sigma(X_train)
    order = np.argsort(original_data[:, 0])
    xs = original_data[order, 0]
    pred = pred.flatten()[order]
    x_std = x_std[order]
    plt.figure()
    plt.scatter(xs, y_train[order, 0], marker="o", color="black")
    plt.plot(xs, pred, color="blue")
    for s in std:
        plt.fill_between(xs, pred + s * x_std, pred - s * x_std, color="blue", alpha=0.2)
    plt.title("squared_exponential_features")
    plt.show()
```
The resulting plots look like this: