In machine learning, binary classification is one of the most fundamental and widely applied tasks. It aims to assign data points to one of two mutually exclusive classes, as in spam detection or disease diagnosis. Solving such problems effectively requires not only a suitable model but also a sound strategy for estimating its parameters. When the parameters admit no closed-form solution obtainable by simple algebraic manipulation, numerical optimization becomes essential. This article develops a probabilistic model for binary classification, derives the parameter estimates via maximum likelihood estimation (MLE), presents gradient descent as the key numerical optimization algorithm, and concludes with a rigorous derivation of its convergence under a fixed step size.
Probabilistic Modeling and Parameter Estimation for Binary Classification
Building the Probabilistic Model
Given a labeled binary classification dataset \mathcal D = \{(\pmb x_i, y_i)|\pmb x_i \in \mathbb R^{d + 1}, y_i \in \{0, 1\}\}_{i = 1}^N, whose samples are drawn independently and identically distributed (i.i.d.) from the true distribution, with label set K = \{0, 1\}, define the conditional class probability:
P(Y = 1|\pmb x) \doteq \mu(\pmb x)
where \pmb x \in \mathbb{R}^{d+1} and \mu(\pmb x) is the logistic function:
\mu(\pmb x) = \frac{1}{1+\exp(-\pmb \theta^\top \pmb x)}
where the parameter vector is \pmb \theta = [\theta_0, \cdots, \theta_{d}]^\top and \pmb x = [1, x_1, \cdots, x_{d}]^\top, the leading 1 pairing with the bias term \theta_0. The label can then be modeled by a Bernoulli (0-1) distribution with parameter
p \doteq P(Y = 1|\pmb x) = \mu(\pmb x)
By the probability mass function of the Bernoulli distribution:
\begin{align*}
P(Y = k|\pmb x) &= p^{k}(1-p)^{1-k} \\
&= \mu(\pmb x)^k(1 - \mu(\pmb x))^{1-k}
\end{align*}
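To make the model concrete, here is a minimal NumPy sketch of the conditional class probability above; it is an illustration only, and assumes the feature vector already carries the leading 1 for the bias term:

```python
import numpy as np

def mu(theta, x):
    """Logistic model: P(Y = 1 | x) = 1 / (1 + exp(-theta^T x))."""
    return 1.0 / (1.0 + np.exp(-theta @ x))

# Toy example with d = 2: x = [1, x_1, x_2] includes the bias feature.
theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, 0.8])
p = mu(theta, x)
print(p, 1.0 - p)  # Bernoulli parameters P(Y=1|x) and P(Y=0|x)
```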
Parameter Estimation via Maximum Likelihood
We estimate the parameter \pmb \theta by maximum likelihood estimation (MLE). First, construct the likelihood function:
\begin{align*}
\mathcal{L}(\pmb \theta|\mathcal D) &\doteq P(\mathcal D|\pmb \theta) \\
&= P((\pmb x_1, y_1), \cdots, (\pmb x_N, y_N)|\pmb \theta) \\
&\overset{i.i.d.}{=} \prod_{i = 1}^N P(\pmb x_i, y_i|\pmb \theta) \\
&= \prod_{i = 1}^N P(y_i|\pmb x_i, \pmb \theta) P(\pmb x_i) \\
&= \prod_{i = 1}^N \mu(\pmb x_i)^{y_i}(1 - \mu(\pmb x_i))^{1 - y_i} P(\pmb x_i)
\end{align*}
Taking the logarithm \ln(\cdot) of both sides turns the product into a sum:
\begin{align*}
\ln \mathcal L(\pmb \theta|\mathcal D) &= \sum_{i = 1}^N \ln \left[ \mu(\pmb x_i)^{y_i}(1 - \mu(\pmb x_i))^{1 - y_i} \right] + \sum_{i = 1}^N \ln P(\pmb x_i) \\
&= \sum_{i = 1}^N \left[y_i \ln \mu(\pmb x_i) + (1 - y_i) \ln(1 - \mu(\pmb x_i))\right] + \sum_{i = 1}^N \ln P(\pmb x_i)
\end{align*}
Since \ln(\cdot) is strictly increasing, it does not change the maximizer of the optimization problem. By MLE:
\begin{align*}
\mathop{\arg\max}_{\pmb \theta \in \mathbb{R}^{d + 1}} \ln \mathcal L(\pmb \theta|\mathcal D) &= \mathop{\arg\max}_{\pmb \theta \in \mathbb{R}^{d + 1}} \sum_{i = 1}^N \left[y_i \ln \mu(\pmb x_i) + (1 - y_i) \ln(1 - \mu(\pmb x_i))\right] + \sum_{i = 1}^N \ln P(\pmb x_i) \\
&= \mathop{\arg\max}_{\pmb \theta \in \mathbb{R}^{d + 1}} \sum_{i = 1}^N \left[y_i \ln \mu(\pmb x_i) + (1 - y_i) \ln(1 - \mu(\pmb x_i))\right] \\
&= \mathop{\arg\min}_{\pmb \theta \in \mathbb{R}^{d + 1}} \sum_{i = 1}^N \left[-y_i\ln \mu(\pmb x_i) - (1 - y_i) \ln(1 - \mu(\pmb x_i))\right]
\end{align*}
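The final minimization objective is exactly the cross-entropy loss. As a sketch (assuming X is the N×(d+1) design matrix with a leading column of ones and y the 0/1 label vector, names chosen here for illustration):

```python
import numpy as np

def negative_log_likelihood(theta, X, y):
    """Cross-entropy loss: sum_i -y_i ln mu_i - (1 - y_i) ln(1 - mu_i)."""
    mu = 1.0 / (1.0 + np.exp(-X @ theta))  # mu_i for every sample
    eps = 1e-12                            # numerical guard against log(0)
    return -np.sum(y * np.log(mu + eps) + (1 - y) * np.log(1 - mu + eps))
```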
Applying the first-order optimality condition:
\begin{align*}
&\nabla_{\pmb \theta} \left[-\ln \mathcal L(\pmb \theta |\mathcal D)\right] \\
=& \sum_{i = 1}^N -y_i \frac{1}{\mu(\pmb x_i)}\nabla_{\pmb \theta} \mu(\pmb x_i) - (1 - y_i)\frac{1}{1 - \mu(\pmb x_i)} \nabla_{\pmb \theta}(1-\mu(\pmb x_i)) \\
=&\sum_{i = 1}^N -\frac{y_i}{\mu(\pmb x_i)}[\mu(\pmb x_i)(1 - \mu(\pmb x_i)) \pmb x_i] - \frac{1 - y_i}{1 -\mu(\pmb x_i)}[-\mu(\pmb x_i)(1 - \mu(\pmb x_i))\pmb x_i] \\
=& \sum_{i = 1}^N -y_i(1 - \mu(\pmb x_i)) \pmb x_i + (1 - y_i)\mu(\pmb x_i)\pmb x_i \\
=& \sum_{i = 1}^N [-y_i + y_i\mu(\pmb x_i) + \mu(\pmb x_i) - y_i\mu(\pmb x_i)] \pmb x_i \\
=& \sum_{i = 1}^N [\mu(\pmb x_i) - y_i]\pmb x_i = \pmb 0
\end{align*}
The second step above uses \nabla_{\pmb \theta} \mu(\pmb x_i) = \mu(\pmb x_i)(1 - \mu(\pmb x_i))\pmb x_i. The resulting equation \sum_{i = 1}^N [\mu(\pmb x_i) - y_i]\pmb x_i = \pmb 0 cannot be reduced, by finitely many algebraic operations (addition, subtraction, multiplication, division, root extraction, etc.), to an algebraic equation; it is transcendental and admits no closed-form solution. Its solution must therefore be approximated numerically.
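The gradient above vectorizes neatly, which is what a numerical solver needs; a sketch under the same design-matrix convention as before:

```python
import numpy as np

def nll_gradient(theta, X, y):
    """Gradient of the negative log-likelihood: X^T (mu - y)."""
    mu = 1.0 / (1.0 + np.exp(-X @ theta))
    return X.T @ (mu - y)
```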
Gradient Descent
Gradient descent is a core iterative optimization algorithm in machine learning and optimization, used mainly to compute numerical solutions of smooth unconstrained optimization problems. The next two subsections present the standard form of gradient descent and give a rigorous convergence analysis under a fixed step size for unconstrained strongly convex problems.
The Standard Form of Gradient Descent
Consider the unconstrained problem of minimizing f(\pmb x), where \pmb x \in \mathbb{R}^n. Assuming the objective f(\pmb x) is smooth, we can expand it to first order (Taylor expansion) around a given point \pmb x_t:
f(\pmb x_t + \Delta \pmb x) \approx f(\pmb x_t) + [\nabla f(\pmb x_t)]^\top\Delta \pmb x
Let \Delta \pmb x = \gamma \pmb d, where \gamma \in \mathbb R_{++}, \pmb d \in \mathbb R^{n}, and \|\pmb d\|_2 = 1. The expansion becomes:
f(\pmb x_t + \gamma \pmb d) \approx f(\pmb x_t) + \gamma [\nabla f(\pmb x_t)]^\top \pmb d
By the Cauchy-Schwarz inequality:
|\nabla f(\pmb x_t)^\top \pmb d| \le \|\nabla f(\pmb x_t)\|_2 \cdot \|\pmb d\|_2 = \|\nabla f(\pmb x_t)\|_2
Since we want the linear term [\nabla f(\pmb x_t)]^\top \pmb d to be as negative as possible, its minimum -\|\nabla f(\pmb x_t)\|_2 is attained exactly when \pmb d points opposite the gradient. The direction of steepest descent is therefore:
\pmb d = -\frac{\nabla f(\pmb x_t)}{\|\nabla f(\pmb x_t)\|_2}
Absorbing the scalar \|\nabla f(\pmb x_t)\|_2 into \gamma, we define the update:
\pmb x_{t + 1} = \pmb x_t + \eta \pmb d, \qquad \eta > 0, \pmb d \in \mathbb{R}^n
with
\pmb d = -\nabla f(\pmb x_t)
This is the standard form of gradient descent: \eta is called the step size (or learning rate) and \pmb d the search direction. When \eta is held constant across iterations, the method is called gradient descent with a fixed step size.
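Translated into code, a minimal fixed-step gradient descent loop might look as follows; grad_f, eta, tol, and max_iter are illustrative choices of this sketch, not values prescribed above:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, tol=1e-8, max_iter=10_000):
    """Fixed-step gradient descent: x_{t+1} = x_t - eta * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:  # stop once the gradient is tiny
            break
        x = x - eta * g              # step along the negative gradient
    return x
```

Combined with the nll_gradient sketch above, gradient_descent(lambda t: nll_gradient(t, X, y), np.zeros(X.shape[1])) would approximate the MLE of the logistic model numerically.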
Convergence of Gradient Descent with a Fixed Step Size
Strong Convexity and Smoothness Assumptions, and the Problem Statement
Consider the unconstrained problem of minimizing f(\pmb x), where \pmb x \in \mathbb{R}^n. Assume the objective f(\pmb x) is a smooth convex function that is furthermore \mu-strongly convex, i.e.:
f(\pmb x) \ge f(\pmb y) + [\nabla f(\pmb y)]^\top(\pmb x - \pmb y) + \frac{\mu}{2}\|\pmb x - \pmb y\|_2^2
where \mu > 0 and the inequality holds for all \pmb x, \pmb y \in \mathbb{R}^n. In addition, f(\pmb x) must satisfy L-smoothness:
f(\pmb x) \le f(\pmb y) + [\nabla f(\pmb y)]^\top(\pmb x- \pmb y) + \frac{L}{2}\|\pmb x - \pmb y\|_2^2
where L > 0 and the inequality holds for all \pmb x, \pmb y \in \mathbb{R}^n. Under these assumptions, the question becomes: starting from an arbitrary initial point \pmb x_0, does the gradient descent iteration
\pmb x_{t + 1} = \pmb x_t + \eta \pmb d
with step size \eta \in \mathbb{R}_{++} and the negative gradient as the search direction, i.e.:
\pmb d = -\nabla f(\pmb x_t)
converge to the optimum \pmb x_*?
Detailed Derivation of Convergence
Since f(\pmb x) is L-smooth, the inequality
\begin{align*}
f(\pmb x) \le f(\pmb y) + [\nabla f(\pmb y)]^\top(\pmb x -\pmb y) + \frac{L}{2}\|\pmb x- \pmb y\|_2^2
\end{align*}
holds. Substituting \pmb x = \pmb y - \frac{\nabla f(\pmb y)}{L} yields:
\begin{align*}
f(\pmb y - \frac{\nabla f(\pmb y)}{L}) &\le f(\pmb y) + [\nabla f(\pmb y)]^\top(\pmb y - \frac{\nabla f(\pmb y)}{L} - \pmb y) + \frac{L}{2}\|\pmb y -\frac{\nabla f(\pmb y)}{L} - \pmb y\|_2^2 \\
&= f(\pmb y) - \frac{1}{L}\|\nabla f(\pmb y)\|_2^2 + \frac{L}{2} \cdot \frac{1}{L^2}\|\nabla f(\pmb y)\|_2^2 \\
&= f(\pmb y) - \frac{1}{2L} \|\nabla f(\pmb y)\|_2^2
\end{align*}
Since f attains its minimum at \pmb x_*, we have:
f(\pmb x_*) \le f(\pmb y - \frac{\nabla f(\pmb y)}{L}) \le f(\pmb y) - \frac{1}{2L} \|\nabla f(\pmb y)\|_2^2
and hence:
\frac{1}{2L}\|\nabla f(\pmb y)\|_2^2 \le f(\pmb y) - f(\pmb x_*) \iff \|\nabla f(\pmb y)\|_2^2 \le 2L(f(\pmb y) - f(\pmb x_*))
Substituting \pmb y = \pmb x_t yields:
\|\nabla f(\pmb x_t)\|_2^2 \le 2L(f(\pmb x_t) - f(\pmb x_*))
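As a quick numerical sanity check (an illustration, not part of the proof), this bound can be verified on the quadratic f(\pmb x) = \frac{1}{2}\pmb x^\top A \pmb x, which is L-smooth with L = \lambda_{\max}(A) and minimized at \pmb x_* = \pmb 0:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B.T @ B + np.eye(5)          # symmetric positive definite
L = np.linalg.eigvalsh(A).max()  # smoothness constant L = lambda_max(A)

x = rng.standard_normal(5)
lhs = np.linalg.norm(A @ x) ** 2       # ||grad f(x)||^2
rhs = 2 * L * (0.5 * x @ A @ x - 0.0)  # 2L * (f(x) - f(x_*)), f(x_*) = 0
assert lhs <= rhs + 1e-9
```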
Moreover, since f(\pmb x) is \mu-strongly convex, we have:
\begin{align*}
&f(\pmb x_*) \ge f(\pmb x_t) + \nabla f(\pmb x_t)^\top (\pmb x_* - \pmb x_t) + \frac{\mu}{2}\|\pmb x_* - \pmb x_t\|_2^2 \\
\implies & \nabla f(\pmb x_t)^\top (\pmb x_* - \pmb x_t) \le f(\pmb x_*) - f(\pmb x_t) - \frac{\mu}{2}\|\pmb x_t - \pmb x_*\|_2^2
\end{align*}
Now compute the squared distance between the iterate \pmb x_{t+1} and the optimum at iteration t+1:
\begin{align*}
\|\pmb x_{t + 1} - \pmb x_*\|_2^2 &= \|\pmb x_t - \eta \nabla f(\pmb x_t) - \pmb x_*\|_2^2 \\
&= (\pmb x_t - \pmb x_* - \eta \nabla f(\pmb x_t))^\top (\pmb x_t - \pmb x_* - \eta \nabla f(\pmb x_t)) \\
&= \|\pmb x_t - \pmb x_*\|_2^2 + 2\eta\left[\nabla f(\pmb x_t)\right]^\top (\pmb x_* - \pmb x_t) + \eta^2 \|\nabla f(\pmb x_t)\|_2^2
\end{align*}
Substituting the inequalities obtained from L-smoothness and \mu-strong convexity:
\begin{align*}
\|\pmb x_{t + 1} - \pmb x_*\|_2^2 &= \|\pmb x_t - \pmb x_*\|_2^2 + 2\eta\left[\nabla f(\pmb x_t)\right]^\top (\pmb x_* - \pmb x_t) + \eta^2 \|\nabla f(\pmb x_t)\|_2^2 \\
&\le \|\pmb x_t - \pmb x_*\|_2^2 + 2\eta \left[f(\pmb x_*) - f(\pmb x_t) - \frac{\mu}{2}\|\pmb x_t - \pmb x_*\|_2^2\right] + 2\eta^2L(f(\pmb x_t) - f(\pmb x_*)) \\
&= (1 - \eta \mu)\|\pmb x_t - \pmb x_*\|_2^2 + 2\eta(\eta L - 1)[f(\pmb x_t) - f(\pmb x_*)]
\end{align*}
For gradient descent to converge from an arbitrary initial point \pmb x_0, we need:
\lim_{t \to \infty} \|\pmb x_{t + 1} - \pmb x_*\|_2^2 = 0
It suffices to make the second term vanish and to keep the absolute value of the first term's coefficient below 1. Setting \eta = \frac{1}{L} eliminates the second term:
\|\pmb x_{t + 1} - \pmb x_*\|_2^2 \le (1 - \frac{\mu}{L})\|\pmb x_t - \pmb x_*\|_2^2
Strong convexity and L-smoothness together imply 0 < \mu \le L, so the coefficient automatically satisfies 0 \le 1 - \mu/L < 1. Unrolling the recursion gives \|\pmb x_t - \pmb x_*\|_2^2 \le (1 - \mu/L)^t \|\pmb x_0 - \pmb x_*\|_2^2 \to 0, i.e., gradient descent with step size \eta = 1/L converges, and it does so at a geometric (linear) rate.
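The geometric contraction is easy to observe numerically; a sketch on a strongly convex quadratic (with \mu = \lambda_{\min}(A), L = \lambda_{\max}(A), and \pmb x_* = \pmb 0), where the squared distance shrinks by at least the factor 1 - \mu/L at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B.T @ B + np.eye(5)                  # f(x) = 0.5 x^T A x, x_* = 0
mu_, L = np.linalg.eigvalsh(A)[[0, -1]]  # strong convexity / smoothness
eta = 1.0 / L                            # fixed step size eta = 1/L

x = rng.standard_normal(5)
for t in range(50):
    prev = np.linalg.norm(x) ** 2  # ||x_t - x_*||^2
    x = x - eta * (A @ x)          # gradient step: grad f(x) = A x
    assert np.linalg.norm(x) ** 2 <= (1 - mu_ / L) * prev + 1e-12
print(np.linalg.norm(x))           # close to 0 after 50 iterations
```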