Gradient descent

Gradient descent minimizes a differentiable function \(f(x)\) by repeatedly moving in the direction opposite to the gradient, which is the direction of steepest ascent. It is the foundational algorithm behind training neural networks, logistic regression, and virtually every large-scale machine learning model.

Intuition

Imagine standing on a hilly landscape in thick fog. You cannot see the valley, but you can feel the slope under your feet. The steepest descent strategy: at each step, move in the direction where the ground drops most steeply. Repeat until you reach a flat point.

The gradient \(\nabla f(x)\) points in the direction of steepest ascent. Moving opposite to it, \(-\nabla f(x)\), gives the direction of steepest descent. The step size (how far to move) is controlled by the learning rate \(\alpha > 0\).

The update rule

\[x_{k+1} = x_k - \alpha \nabla f(x_k)\]

Starting from an initial point \(x_0\), this is applied iteratively until a stopping criterion is met: \(\|\nabla f(x_k)\| < \varepsilon\), maximum iterations reached, or the improvement in \(f\) falls below a threshold.

For a function of one variable: \(x_{k+1} = x_k - \alpha f'(x_k)\). The update subtracts the slope scaled by \(\alpha\).

Contour plot of a 2D function with gradient descent trajectories for three different learning rates

Low learning rate (green) converges safely but slowly. Good learning rate (blue) converges in few iterations. High learning rate (red) oscillates across the valley: the elliptical shape means the \(y\)-direction has much steeper curvature, and large steps overshoot.

The learning rate

The learning rate \(\alpha\) is the most important hyperparameter in gradient descent. For a function with Lipschitz-continuous gradient (gradient does not change too fast), the optimal fixed learning rate satisfies:

\[\alpha \leq \frac{1}{L}\]

where \(L\) is the Lipschitz constant of the gradient: \(\|\nabla f(x) - \nabla f(y)\| \leq L\|x-y\|\) for all \(x, y\). For the quadratic \(f(x) = \frac{1}{2}x^T H x\), \(L\) equals the largest eigenvalue of \(H\).

Loss vs iteration curves for four learning rates showing convergence, slow convergence, oscillation and divergence

For \(f(x)=x^2\), the Lipschitz constant is \(L=2\) and the safe threshold is \(\alpha \leq 0.5\). At \(\alpha=0.49\) (orange) convergence is very slow near the limit; at \(\alpha=0.6\) (red) the iterates diverge.

Convergence theory

For a convex function with \(L\)-Lipschitz gradient, gradient descent with constant step \(\alpha = 1/L\) achieves:

\[f(x_k) - f(x^*) \leq \frac{L \|x_0 - x^*\|^2}{2k}\]

This is sublinear convergence: \(O(1/k)\). To halve the error you need four times as many iterations.

For strongly convex functions (Hessian eigenvalues bounded below by \(\mu > 0\)), convergence is linear (geometric):

\[f(x_k) - f(x^*) \leq \left(1 - \frac{\mu}{L}\right)^k (f(x_0) - f(x^*))\]

The ratio \(\kappa = L/\mu\) is the condition number. Ill-conditioned problems (\(\kappa \gg 1\), elongated contours like the ellipse above) converge very slowly. This is why preconditioning and adaptive methods (Adam, RMSProp) matter.

Batch, stochastic, and mini-batch

In machine learning, \(f(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(\theta; x_i, y_i)\) is a sum over \(n\) training examples. Computing the full gradient requires evaluating all \(n\) terms, which is expensive for large datasets.

Batch gradient descent

Uses the full dataset to compute the gradient at each step. Exact gradient, stable convergence, but slow per iteration for large \(n\).

Stochastic gradient descent (SGD)

Uses a single randomly chosen example \(i\) at each step:

\[x_{k+1} = x_k - \alpha \nabla \ell(\theta; x_{i_k}, y_{i_k})\]

Much faster per iteration. The gradient is a noisy estimate: \(E[\nabla \ell_i] = \nabla f\). This noise prevents exact convergence but helps escape shallow local minima. Requires decreasing learning rate (\(\alpha_k \to 0\)) for theoretical convergence guarantees.

Mini-batch gradient descent

Uses a small batch of \(B\) examples (typically \(B = 32\) to \(512\)):

\[x_{k+1} = x_k - \frac{\alpha}{B} \sum_{i \in \mathcal{B}_k} \nabla \ell(\theta; x_i, y_i)\]

The standard in deep learning. Balances the variance reduction of batching with the speed of SGD. GPU parallelism makes batch computation nearly free for reasonable \(B\).

Loss curves comparing batch gradient descent, mini-batch and SGD showing the noise-convergence tradeoff

Batch GD (green) descends smoothly but slowly. SGD (red) descends faster initially but oscillates. Mini-batch (blue) balances both: fast descent with moderate noise.

Instead of a fixed \(\alpha\), line search finds the best step size at each iteration:

\[\alpha_k = \arg\min_{\alpha > 0} f(x_k - \alpha \nabla f(x_k))\]

Exact line search solves this minimization exactly (feasible for simple functions). Backtracking line search is practical: start with a large \(\alpha\) and halve it until the Armijo condition is satisfied:

\[f(x_k - \alpha \nabla f(x_k)) \leq f(x_k) - c \alpha \|\nabla f(x_k)\|^2\]

for some small constant \(c \in (0,1)\) (typically \(c = 10^{-4}\)). This ensures sufficient decrease without requiring exact minimization.

⚠️ Gradient descent fails for non-smooth and non-convex functions

Three situations where vanilla gradient descent is unreliable:

Non-smooth functions: if \(f\) is not differentiable at \(x_k\) (e.g., L1 regularization, ReLU), the gradient does not exist. Use subgradient methods or proximal algorithms instead.

Non-convex functions: gradient descent converges to a stationary point, not necessarily a global minimum. In deep learning this is accepted in practice (most local minima are empirically good), but there is no theoretical guarantee.

Saddle points: in high dimensions, stationary points are usually saddle points (gradient zero but not a minimum). Vanilla gradient descent can get stuck near saddle points for exponentially long. SGD noise and momentum methods (Adam) help escape them faster.

Extensions: momentum and adaptive methods

Vanilla gradient descent is rarely used in deep learning. Two important extensions:

Momentum: accumulates a velocity vector \(v\) in the direction of consistent gradients, dampening oscillations and accelerating convergence in low-curvature directions:

\[v_{k+1} = \beta v_k + \nabla f(x_k), \qquad x_{k+1} = x_k - \alpha v_{k+1}\]

Adam (Adaptive Moment Estimation): maintains per-parameter learning rates based on first and second moment estimates of the gradient. Effectively adapts \(\alpha\) individually for each dimension. The standard optimizer for deep learning.

Nesterov accelerated gradient (NAG): evaluates the gradient at a “lookahead” point \(x_k - \beta v_k\) rather than at \(x_k\), giving better theoretical convergence for convex functions (\(O(1/k^2)\) instead of \(O(1/k)\)).

💡 Gradient descent in R

# Manual gradient descent for any differentiable f
gradient_descent <- function(f, grad_f, x0, alpha=0.01,
                              max_iter=1000, tol=1e-6) {
  x    <- x0
  path <- matrix(NA, max_iter+1, length(x0))
  path[1,] <- x
  for (k in 1:max_iter) {
    g <- grad_f(x)
    x <- x - alpha * g
    path[k+1,] <- x
    if (sqrt(sum(g^2)) < tol) break
  }
  list(x=x, value=f(x), path=path[1:(k+1),])
}

# Example: minimize f(x,y) = x^2 + 4y^2
f     <- function(x) x[1]^2 + 4*x[2]^2
grad_f <- function(x) c(2*x[1], 8*x[2])
result <- gradient_descent(f, grad_f, x0=c(3,1.5), alpha=0.1)

# For machine learning with automatic differentiation: use torch or tensorflow
# For convex optimization: use CVXR which calls interior point solvers