The Mathematical Foundations of Logistic Regression: From Model Construction to Gradient Descent Convergence

In machine learning, binary classification is one of the most fundamental and widely applied tasks. It assigns data points to one of two mutually exclusive classes, as in spam filtering or disease diagnosis. Solving such problems effectively requires not only a suitable model but also an effective strategy for estimating its parameters. When the parameters admit no closed-form solution through simple algebraic manipulation, numerical optimization becomes the key. This article develops the probabilistic model for binary classification, derives the parameter-estimation problem via maximum likelihood estimation (MLE), introduces gradient descent, an important numerical optimization algorithm, and finally gives a rigorous mathematical derivation of its convergence under a fixed step size.

The Probabilistic Model for Binary Classification and Parameter Estimation

Constructing the Probabilistic Model

Given a labeled dataset for binary classification, $\mathcal D = \{(\pmb x_i, y_i) \mid \pmb x_i \in \mathbb R^{d + 1}, y_i \in \{0, 1\}\}_{i = 1}^N$, whose samples are drawn independently and identically distributed (i.i.d.) from the true distribution, with label set $K = \{0, 1\}$, define the class-conditional probability:

P(Y = 1|\pmb x) \doteq \mu(\pmb x)

where $\pmb x \in \mathbb{R}^{d+1}$ and $\mu(\pmb x)$ is the logistic function, given by:

\mu(\pmb x) = \frac{1}{1+\exp(-\pmb \theta^\top \pmb x)}

where the parameter vector is $\pmb \theta = [\theta_0, \cdots, \theta_{d}]^\top$ and $\pmb x = [1, x_1, \cdots, x_{d}]^\top$. The label can then be modeled by a Bernoulli (0-1) distribution, with parameter

p \doteq P(Y = 1|\pmb x) = \mu(\pmb x)

From the probability mass function of the Bernoulli distribution:

\begin{align*} P(Y = k|\pmb x) &= p^{k}(1-p)^{1-k} \\ &= \mu(\pmb x)^k(1 - \mu(\pmb x))^{1-k} \end{align*}
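As a sanity check of the model above, here is a minimal NumPy sketch of the logistic function and the resulting Bernoulli pmf; the concrete values of `theta` and `x` are illustrative, not from the text:

```python
import numpy as np

def mu(theta, x):
    """Logistic model P(Y = 1 | x) = 1 / (1 + exp(-theta^T x))."""
    return 1.0 / (1.0 + np.exp(-theta @ x))

def bernoulli_pmf(k, theta, x):
    """P(Y = k | x) = mu(x)^k * (1 - mu(x))^(1 - k), for k in {0, 1}."""
    p = mu(theta, x)
    return p**k * (1.0 - p)**(1 - k)

theta = np.array([0.5, -1.0, 2.0])   # [theta_0, theta_1, theta_2], illustrative
x = np.array([1.0, 0.3, 0.8])        # leading 1 is the bias feature
print(bernoulli_pmf(1, theta, x) + bernoulli_pmf(0, theta, x))  # sums to 1.0
```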

Parameter Estimation via Maximum Likelihood

We estimate the parameter $\pmb \theta$ by maximum likelihood estimation (MLE). First, construct the likelihood function:

\begin{align*} \mathcal{L}(\pmb \theta|\mathcal D) &\doteq P(\mathcal D|\pmb \theta) \\ &= P((\pmb x_1, y_1), \ldots, (\pmb x_N, y_N)|\pmb \theta) \\ &\overset{\text{i.i.d.}}{=} \prod_{i = 1}^N P(\pmb x_i, y_i|\pmb \theta) \\ &= \prod_{i = 1}^N P(y_i|\pmb x_i, \pmb \theta) P(\pmb x_i) \\ &= \prod_{i = 1}^N \mu(\pmb x_i)^{y_i}(1 - \mu(\pmb x_i))^{1 - y_i} P(\pmb x_i) \end{align*}

Taking the logarithm $\ln(\cdot)$ of both sides turns the product into a sum:

\begin{align*} \ln \mathcal L(\pmb \theta|\mathcal D) &= \sum_{i = 1}^N \ln \left[ \mu(\pmb x_i)^{y_i}(1 - \mu(\pmb x_i))^{1 - y_i} \right] + \sum_{i = 1}^N \ln P(\pmb x_i) \\ &= \sum_{i = 1}^N \left[y_i \ln \mu(\pmb x_i) + (1 - y_i) \ln(1 - \mu(\pmb x_i))\right] + \sum_{i = 1}^N \ln P(\pmb x_i) \end{align*}

Since $\ln(\cdot)$ is strictly monotone, it does not change the optimal solution of the optimization problem; moreover, $\sum_{i} \ln P(\pmb x_i)$ does not depend on $\pmb \theta$ and can be dropped. By MLE:

\begin{align*} \mathop{\arg\max}_{\pmb \theta \in \mathbb{R}^{d + 1}} \ln \mathcal L(\pmb \theta|\mathcal D) &= \mathop{\arg\max}_{\pmb \theta \in \mathbb{R}^{d + 1}} \sum_{i = 1}^N \left[y_i \ln \mu(\pmb x_i) + (1 - y_i) \ln(1 - \mu(\pmb x_i))\right] + \sum_{i = 1}^N \ln P(\pmb x_i) \\ &= \mathop{\arg\max}_{\pmb \theta \in \mathbb{R}^{d + 1}} \sum_{i = 1}^N \left[y_i \ln \mu(\pmb x_i) + (1 - y_i) \ln(1 - \mu(\pmb x_i))\right] \\ &= \mathop{\arg\min}_{\pmb \theta \in \mathbb{R}^{d + 1}} \sum_{i = 1}^N -y_i\ln \mu(\pmb x_i) - (1 - y_i) \ln(1 - \mu(\pmb x_i)) \end{align*}
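The final minimization above is the familiar binary cross-entropy loss. The following sketch expresses it in NumPy (the function and variable names are mine; `X` denotes the $N \times (d+1)$ design matrix whose first column is all ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(theta, X, y):
    """Cross-entropy objective: sum_i -y_i ln mu_i - (1 - y_i) ln(1 - mu_i)."""
    p = sigmoid(X @ theta)
    eps = 1e-12  # guard against ln(0) for saturated predictions
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```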

By the first-order optimality condition, and using the logistic derivative $\nabla_{\pmb \theta}\, \mu(\pmb x) = \mu(\pmb x)(1 - \mu(\pmb x))\pmb x$, we have:

\begin{align*} \nabla_{\pmb \theta} \left[-\ln \mathcal L(\pmb \theta |\mathcal D)\right] &= \sum_{i = 1}^N \left[-\frac{y_i}{\mu(\pmb x_i)}\nabla_{\pmb \theta} \mu(\pmb x_i) - \frac{1 - y_i}{1 - \mu(\pmb x_i)} \nabla_{\pmb \theta}(1-\mu(\pmb x_i))\right] \\ &= \sum_{i = 1}^N \left[-\frac{y_i}{\mu(\pmb x_i)}\mu(\pmb x_i)(1 - \mu(\pmb x_i))\pmb x_i + \frac{1 - y_i}{1 -\mu(\pmb x_i)}\mu(\pmb x_i)(1 - \mu(\pmb x_i))\pmb x_i\right] \\ &= \sum_{i = 1}^N \left[-y_i(1 - \mu(\pmb x_i)) + (1 - y_i)\mu(\pmb x_i)\right]\pmb x_i \\ &= \sum_{i = 1}^N \left[-y_i + y_i\mu(\pmb x_i) + \mu(\pmb x_i) - y_i\mu(\pmb x_i)\right] \pmb x_i \\ &= \sum_{i = 1}^N \left[\mu(\pmb x_i) - y_i\right]\pmb x_i = \pmb 0 \end{align*}

The equation $\sum_{i = 1}^N [\mu(\pmb x_i) - y_i]\pmb x_i = \pmb 0$ cannot be reduced, by finitely many algebraic operations (addition, subtraction, multiplication, division, taking roots, etc.), to an algebraic equation; it is transcendental and has no closed-form solution. Its root must therefore be approximated numerically.
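In matrix form the gradient derived above is $X^\top(\pmb\mu - \pmb y)$. The following sketch implements it, together with an optional central-difference check of the formula (it reuses the `sigmoid` and `neg_log_likelihood` helpers from the previous sketch):

```python
import numpy as np

def nll_gradient(theta, X, y):
    """Gradient of the negative log-likelihood: X^T (mu(X theta) - y)."""
    return X.T @ (sigmoid(X @ theta) - y)

def finite_diff_check(theta, X, y, h=1e-6):
    """Compare the analytic gradient against central differences."""
    g = np.zeros_like(theta, dtype=float)
    for j in range(theta.size):
        e = np.zeros_like(theta, dtype=float)
        e[j] = h
        g[j] = (neg_log_likelihood(theta + e, X, y)
                - neg_log_likelihood(theta - e, X, y)) / (2 * h)
    return np.allclose(g, nll_gradient(theta, X, y), atol=1e-4)
```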

Gradient Descent

Gradient descent is a core iterative optimization algorithm in machine learning and optimization, used mainly to compute numerical solutions of smooth unconstrained optimization problems. The next two subsections present the standard form of gradient descent and then give a rigorous convergence analysis for unconstrained convex problems under a fixed step size.

The Standard Form of Gradient Descent

Consider minimizing an unconstrained objective $f(\pmb x)$ with $\pmb x \in \mathbb{R}^n$. If $f(\pmb x)$ is smooth, its first-order Taylor expansion at a given point $\pmb x_t$ is:

f(\pmb x_t + \Delta \pmb x) \approx f(\pmb x_t) + [\nabla f(\pmb x_t)]^\top\Delta \pmb x

Let $\Delta \pmb x = \gamma \pmb d$, where $\gamma \in \mathbb R_{++}$, $\pmb d \in \mathbb R^{n}$, and $\|\pmb d\|_2 = 1$. The expansion becomes:

f(\pmb x_t + \gamma \pmb d) \approx f(\pmb x_t) + \gamma [\nabla f(\pmb x_t)]^\top \pmb d

By the Cauchy-Schwarz inequality:

|\nabla f(\pmb x)^\top \pmb d| \le \|\nabla f(\pmb x)\|_2 \cdot \|\pmb d\|_2 = \|\nabla f(\pmb x)\|_2

Equality holds exactly when $\pmb d$ is parallel to $\nabla f(\pmb x_t)$, so the inner product $[\nabla f(\pmb x_t)]^\top \pmb d$ attains its minimum value $-\|\nabla f(\pmb x_t)\|_2$, i.e., the fastest descent, in the direction:

\pmb d = -\frac{\nabla f(\pmb x_t)}{\|\nabla f(\pmb x_t)\|_2}
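A quick numerical illustration of the Cauchy-Schwarz argument, using an arbitrary illustrative gradient vector: no random unit direction achieves a smaller inner product with the gradient than the normalized negative gradient:

```python
import numpy as np

g = np.array([2.0, 4.0])                 # an illustrative gradient vector
d_star = -g / np.linalg.norm(g)          # claimed steepest-descent direction

rng = np.random.default_rng(0)
ds = rng.normal(size=(10_000, 2))
ds /= np.linalg.norm(ds, axis=1, keepdims=True)   # random unit directions
print((ds @ g).min() >= g @ d_star)      # True: g^T d_star = -||g||_2 is minimal
```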

Absorbing the scalar $\|\nabla f(\pmb x_t)\|_2$ into $\gamma$, we define the update:

\pmb x_{t + 1} = \pmb x_t + \eta \pmb d, \qquad \eta > 0, \pmb d \in \mathbb{R}^n

\pmb d = -\nabla f(\pmb x_t)

This is the standard form of gradient descent: $\eta$ is called the step size (or learning rate), and $\pmb d$ is called the search direction. When $\eta$ is held constant, the method is called fixed-step gradient descent.
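A minimal NumPy sketch of this standard form (the iteration cap and stopping tolerance are my own additions, not part of the update rule):

```python
import numpy as np

def gradient_descent(grad, x0, eta, num_iters=10_000, tol=1e-8):
    """Fixed-step gradient descent: x_{t+1} = x_t - eta * grad(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:  # gradient (nearly) vanished: stationary point
            break
        x = x - eta * g
    return x
```

Applied to the logistic-regression objective, one could call, e.g., `gradient_descent(lambda th: nll_gradient(th, X, y), np.zeros(d + 1), eta)`, with `nll_gradient` as sketched earlier.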

Convergence Proof for Fixed-Step Gradient Descent

Strong Convexity and Smoothness Assumptions; Problem Statement

Consider again the unconstrained minimization of $f(\pmb x)$, $\pmb x \in \mathbb{R}^n$. Assume the objective $f(\pmb x)$ is a smooth convex function that is moreover $\mu$-strongly convex, i.e.:

f(\pmb x) \ge f(\pmb y) + \nabla f(\pmb y)^\top(\pmb x - \pmb y) + \frac{\mu}{2}\|\pmb x - \pmb y\|_2^2

where $\mu > 0$ and the inequality holds for all $\pmb x, \pmb y \in \mathbb{R}^n$. In addition, $f(\pmb x)$ must satisfy $L$-smoothness, i.e.:

f(\pmb x) \le f(\pmb y) + \nabla f(\pmb y)^\top(\pmb x- \pmb y) + \frac{L}{2}\|\pmb x - \pmb y\|_2^2

where $L > 0$, for all $\pmb x, \pmb y \in \mathbb{R}^n$. Under these assumptions, the question becomes: given an arbitrary initial point $\pmb x_0$, does the gradient descent iteration

\pmb x_{t + 1} = \pmb x_t + \eta \pmb d

with fixed step size $\eta \in \mathbb{R}_{++}$ and the negative gradient as search direction, i.e.:

\pmb d = -\nabla f(\pmb x_t)

converge to the optimal solution $\pmb x_*$?

Detailed Derivation of Convergence

Since $f(\pmb x)$ is $L$-smooth, the inequality

\begin{align*} f(\pmb x) \le f(\pmb y) + \nabla f(\pmb y)^\top(\pmb x -\pmb y) + \frac{L}{2}\|\pmb x- \pmb y\|_2^2 \end{align*}

holds for all $\pmb x, \pmb y$. Substituting $\pmb x = \pmb y - \frac{\nabla f(\pmb y)}{L}$ yields:

\begin{align*} f\left(\pmb y - \frac{\nabla f(\pmb y)}{L}\right) &\le f(\pmb y) + \nabla f(\pmb y)^\top \left(\pmb y - \frac{\nabla f(\pmb y)}{L} - \pmb y\right) + \frac{L}{2}\left\|\pmb y -\frac{\nabla f(\pmb y)}{L} - \pmb y\right\|_2^2 \\ &= f(\pmb y) - \frac{1}{L}\|\nabla f(\pmb y)\|_2^2 + \frac{L}{2} \cdot \frac{1}{L^2}\|\nabla f(\pmb y)\|_2^2 \\ &= f(\pmb y) - \frac{1}{2L} \|\nabla f(\pmb y)\|_2^2 \end{align*}

Since $f$ attains its minimum at $\pmb x_*$:

f(\pmb x_*) \le f(\pmb y - \frac{\nabla f(\pmb y)}{L}) \le f(\pmb y) - \frac{1}{2L} \|\nabla f(\pmb y)\|_2^2

and therefore:

\frac{1}{2L}\|\nabla f(\pmb y)\|_2^2 \le f(\pmb y) - f(\pmb x_*) \iff \|\nabla f(\pmb y)\|_2^2 \le 2L(f(\pmb y) - f(\pmb x_*))

Substituting $\pmb y = \pmb x_t$ gives:

\|\nabla f(\pmb x_t)\|_2^2 \le 2L(f(\pmb x_t) - f(\pmb x_*))

Moreover, since $f(\pmb x)$ is $\mu$-strongly convex, we have:

\begin{align*} &f(\pmb x_*) \ge f(\pmb x_t) + \nabla f(\pmb x_t)^\top (\pmb x_* - \pmb x_t) + \frac{\mu}{2}\|\pmb x_* - \pmb x_t\|_2^2 \\ \implies & \nabla f(\pmb x_t)^\top (\pmb x_* - \pmb x_t) \le f(\pmb x_*) - f(\pmb x_t) - \frac{\mu}{2}\|\pmb x_t - \pmb x_*\|_2^2 \end{align*}

Now compute the squared distance between the iterate $\pmb x_{t+1}$ and the optimum at iteration $t+1$:

\begin{align*} \|\pmb x_{t + 1} - \pmb x_*\|_2^2 &= \|\pmb x_t - \eta \nabla f(\pmb x_t) - \pmb x_*\|_2^2 \\ &= (\pmb x_t - \pmb x_* - \eta \nabla f(\pmb x_t))^\top (\pmb x_t - \pmb x_* - \eta \nabla f(\pmb x_t)) \\ &= \|\pmb x_t - \pmb x_*\|_2^2 + 2\eta\left[\nabla f(\pmb x_t)\right]^\top (\pmb x_* - \pmb x_t) + \eta^2 \|\nabla f(\pmb x_t)\|_2^2 \end{align*}

Substituting the inequalities obtained from $L$-smoothness and $\mu$-strong convexity:

\begin{align*} \|\pmb x_{t + 1} - \pmb x_*\|_2^2 &= \|\pmb x_t - \pmb x_*\|_2^2 + 2\eta\left[\nabla f(\pmb x_t)\right]^\top (\pmb x_* - \pmb x_t) + \eta^2 \|\nabla f(\pmb x_t)\|_2^2 \\ &\le \|\pmb x_t - \pmb x_*\|_2^2 + 2\eta \left[f(\pmb x_*) - f(\pmb x_t) - \frac{\mu}{2}\|\pmb x_t - \pmb x_*\|_2^2\right] + 2\eta^2L(f(\pmb x_t) - f(\pmb x_*)) \\ &= (1 - \eta \mu)\|\pmb x_t - \pmb x_*\|_2^2 + 2\eta(\eta L - 1)[f(\pmb x_t) - f(\pmb x_*)] \end{align*}

For gradient descent to converge from an arbitrary initial point $\pmb x_0$, we should ensure:

\lim_{t \to \infty} \|\pmb x_{t + 1} - \pmb x_*\|_2^2 = 0

It suffices to make the second term non-positive (i.e., $\eta \le 1/L$, since $f(\pmb x_t) - f(\pmb x_*) \ge 0$) and to keep the coefficient of the first term strictly less than $1$ in absolute value. Choosing $\eta = \frac{1}{L}$ makes the second term vanish, leaving:

\|\pmb x_{t + 1} - \pmb x_*\|_2^2 \le (1 - \frac{\mu}{L})\|\pmb x_t - \pmb x_*\|_2^2

By $\mu$-strong convexity and $L$-smoothness, $0 < \mu \le L$, so the contraction factor automatically satisfies $0 \le 1 - \mu/L < 1$. Unrolling the recursion yields $\|\pmb x_t - \pmb x_*\|_2^2 \le (1 - \mu/L)^t \|\pmb x_0 - \pmb x_*\|_2^2 \to 0$ as $t \to \infty$, so fixed-step gradient descent with $\eta = 1/L$ converges (linearly) to $\pmb x_*$.
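The rate can be observed numerically. For the quadratic $f(\pmb x) = \frac{1}{2}\pmb x^\top A \pmb x$ with $A = \mathrm{diag}(\mu, L)$, which is exactly $\mu$-strongly convex and $L$-smooth with minimizer $\pmb x_* = \pmb 0$, the squared distance shrinks by at least the factor $1 - \mu/L$ per step. The concrete values of $\mu$, $L$, and $\pmb x_0$ below are illustrative:

```python
import numpy as np

mu_, L = 1.0, 10.0
A = np.diag([mu_, L])          # f(x) = 0.5 * x^T A x, minimizer x_* = 0
grad = lambda x: A @ x         # gradient of the quadratic
eta = 1.0 / L                  # the step size used in the proof

x = np.array([1.0, 1.0])
prev = np.linalg.norm(x) ** 2
for t in range(5):
    x = x - eta * grad(x)
    cur = np.linalg.norm(x) ** 2
    print(f"step {t}: ||x - x_*||^2 ratio = {cur / prev:.3f} <= {1 - mu_ / L:.3f}")
    prev = cur
```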
