Smooth Gradient Descent & Adagrad

1. Convergence of Gradient Descent for Smooth Functions

Gradient Descent는 Lipschitz 연속 함수에서 $O(1/\sqrt{T})$의 수렴률을 갖습니다. 이때 사용하는 스텝 사이즈는 $O(1/\sqrt{T})$입니다. 이번 섹션에서는 smooth convex functions를 다루며, 이러한 경우에 Gradient Descent는 더 빠른 수렴률을 얻을 수 있습니다.

함수 $f : \mathbb{R}^d \to \mathbb{R}$가 $\ell_2$ 노름에 대해 $\beta$-smooth 하다고 할 때, 다음을 만족해야 합니다:

$$
|\nabla f(x) - \nabla f(y)|_2 \leq \beta |x - y|_2, \quad \forall x, y \in \mathbb{R}^d
$$

smooth 함수들은 self-tuning 특성을 가집니다. 이는 최적해에 가까워질수록 기울기가 0에 근접함을 의미합니다.

(Gradient Step의 개선)

함수 $f$가 $\ell_2$ 노름에 대해 $\beta$-smooth 하다면, 다음이 성립합니다.

$$
|f(y) - f(x) - \nabla f(x)^T (y - x)| \leq \frac{\beta}{2} |y - x|_2^2
$$

Proof

미적분학 기본정리와 Cauchy-Schwarz 부등식을 사용하여 다음을 얻습니다.

$$
|f(y) - f(x) - \nabla f(x)^T (y - x)| = \left| \int_0^1 (y - x)^T (\nabla f(x + t(y - x)) - \nabla f(x)) dt \right|
$$

이는 $\beta$-smoothness를 적용하여 다음과 같이 변형할 수 있습니다.

$$
\leq \int_0^1 |y - x|_2 |\nabla f(x + t(y - x)) - \nabla f(x)|_2 dt \leq \int_0^1 \beta t |y - x|_2^2 dt = \frac{\beta}{2} |y - x|_2^2
$$

(Gradient Descent의 수렴성)

함수 $f : \mathbb{R}^d \to \mathbb{R}$가 $\ell_2$ 노름에 대해 $\beta$-smooth 하고 convex일 때, step size $\eta_t = 1/\beta$로 Gradient Descent를 적용하면 다음이 성립합니다.

$$
f(x_{T+1}) - f(x^*) \leq \frac{\beta \|x_1 - x^*\|_2^2}{2T}

$$

여기서 반복 횟수 $T$에 대해 Gradient Descent의 수렴 속도는 $O(1/T)$입니다.

Proof

이 수렴성 증명은 smooth convex 함수에서 step size를 고정한 상태로 적용된 Gradient Descent의 개선을 나타냅니다.

2. Projected Subgradient Method

convex 함수 $f$가 미분 가능하지 않을 때, subgradient를 이용하여 최적화를 수행할 수 있습니다.

(Subdifferential)

convex 함수 $f : \mathbb{R}^d \to \mathbb{R}$와 점 $x \in \text{dom}(f)$에 대해, $f$의 subdifferential은 다음으로 정의됩니다.

$$
\partial f(x) = { g : f(y) \geq f(x) + g^T (y - x), \forall y \in \text{dom}(f) }
$$

여기서 $g \in \partial f(x)$인 경우 $g$를 $f$의 subgradient라고 합니다.

Algorithm 1: Projected Subgradient Method

초기값 $x_1 \in X$을 설정합니다.
매 반복 $t$에 대해, subgradient $g_t \in \partial f(x_t)$를 구하고, step size $\eta_t > 0$를 사용하여 다음을 계산합니다.
$$
x_{t+1} = \text{Proj}_X (x_t - \eta_t g_t)
$$

여기서 $\text{Proj}_X(\cdot)$는 projection 연산자로 다음과 같이 정의됩니다.

$$
\text{Proj}_X (x) = \arg \min_{y \in X} \|x - y\|_2

$$

(Projection Operator의 성질)

두 점 $u, v \in \mathbb{R}^d$에 대해 다음이 성립합니다.

$$
|\text{Proj}_X(u) - \text{Proj}_X(v)|_2 \leq |u - v|_2
$$

Proof

이 증명은 projection operator의 성질에 의해 주어집니다.

(Subgradient Method의 수렴성)

함수 $f : \mathbb{R}^d \to \mathbb{R}$가 convex하고, 모든 $g \in \partial f(x)$에 대해 $|g|_2 \leq L$을 만족할 때, step size $\eta_t = \frac{C}{\sqrt{T}}$로 Subgradient Method를 적용하면 다음이 성립합니다.

$$
f\left( \frac{1}{T} \sum_{t=1}^{T} x_t \right) - f(x^*) \leq \frac{C'}{\sqrt{T}}
$$

3. Adaptive Gradient Method (AdaGrad)

AdaGrad는 learning rate를 자동으로 조정하는 알고리즘입니다. 이를 통해 smooth 함수와 Lipschitz 연속 함수에 대해 학습률을 사전 지식 없이 조정할 수 있습니다.

AdaGrad

초기값 $x_1 \in X$과 $S_0 = 0$을 설정합니다.
매 반복 $t$에 대해, subgradient $g_t \in \partial f(x_t)$를 구하고, 다음을 업데이트합니다.
$$
S_t = S_{t-1} + |g_t|_2^2, \quad \eta_t = \frac{R}{\sqrt{2 S_t}}
$$
$x_{t+1} = \text{Proj}_X (x_t - \eta_t g_t)$를 업데이트합니다.

(AdaGrad의 수렴성)

함수 $f : \mathbb{R}^d \to \mathbb{R}$가 convex하고, 모든 $g \in \partial f(x)$에 대해 $|g|_2 \leq L$을 만족할 때, AdaGrad 알고리즘을 적용하면 다음이 성립합니다.

$$
f\left( \frac{1}{T} \sum_{t=1}^{T} x_t \right) - f(x^*) \leq \frac{LR \sqrt{2}}{\sqrt{T}}
$$

(Smooth Functions에서 AdaGrad의 수렴성)

함수 $f : \mathbb{R}^d \to \mathbb{R}$가 $\beta$-smooth할 때, AdaGrad 알고리즘을 적용하면 다음이 성립합니다.

$$
f\left( \frac{1}{T} \sum_{t=1}^{T} x_t \right) - f(x^*) \leq \frac{\beta R^2}{T}
$$

'Optimization' 카테고리의 다른 글

1. Convex Optimization (0)	2024.09.30

1. Convergence of Gradient Descent for Smooth Functions

(Gradient Step의 개선)

Proof

(Gradient Descent의 수렴성)

Proof

2. Projected Subgradient Method

(Subdifferential)

Algorithm 1: Projected Subgradient Method

(Projection Operator의 성질)

Proof

(Subgradient Method의 수렴성)

3. Adaptive Gradient Method (AdaGrad)

AdaGrad

(AdaGrad의 수렴성)

(Smooth Functions에서 AdaGrad의 수렴성)

'Optimization' 카테고리의 다른 글

티스토리툴바