Newton's method and BFGS

Newton’s method uses second-order information (the Hessian matrix) to take curved steps toward the optimum, achieving quadratic convergence: the number of correct digits roughly doubles at each iteration. BFGS approximates the Hessian from gradient information alone, getting near-Newton convergence without the \(O(n^3)\) cost of inverting a dense matrix.

Newton’s method

Gradient descent uses only the gradient: it moves in the direction of steepest descent with a fixed step size. This ignores the curvature of \(f\): in directions where the function curves steeply, the step is too large and the algorithm oscillates; in flat directions, it is too small.

Newton’s method fits a local quadratic approximation to \(f\) at \(x_k\) using the second-order Taylor expansion:

\[f(x) \approx f(x_k) + \nabla f(x_k)^T(x - x_k) + \frac{1}{2}(x-x_k)^T \nabla^2 f(x_k)(x-x_k)\]

Minimizing this quadratic in \(x\) gives the Newton step:

\[x_{k+1} = x_k - \left[\nabla^2 f(x_k)\right]^{-1} \nabla f(x_k)\]

\[p_k = -H_k^{-1} g_k \qquad \text{where } H_k = \nabla^2 f(x_k),\; g_k = \nabla f(x_k)\]

The Hessian \(H_k\) scales the gradient by the local curvature: in steep directions (large eigenvalues of \(H_k\)) the step is small; in flat directions (small eigenvalues) the step is large. This automatically adapts the step size to the geometry of \(f\).

Quadratic convergence

Near a non-degenerate local minimum \(x^*\) where \(H^* = \nabla^2 f(x^*)\) is positive definite, Newton’s method converges quadratically:

\[\|x_{k+1} - x^*\| \leq C \|x_k - x^*\|^2\]

for some constant \(C > 0\). If the error at step \(k\) is \(\varepsilon\), the error at step \(k+1\) is roughly \(C\varepsilon^2\). This means the number of correct decimal digits doubles at each iteration: if you have 2 correct digits, the next step gives 4, then 8, then 16.

Compare with gradient descent’s linear convergence \(\|x_{k+1} - x^*\| \leq \rho \|x_k - x^*\|\) for some \(\rho < 1\): errors decrease by a constant factor, not squaring.

Convergence curves on log scale comparing gradient descent and Newton method showing quadratic vs linear convergence

Newton’s method (red) reaches machine precision in 6 iterations. Gradient descent (blue) still has noticeable error after 40 iterations.

Limitations of Newton’s method

Despite quadratic convergence, Newton’s method has three serious practical limitations:

Cost per iteration: computing \(H_k\) requires \(O(n^2)\) entries; solving \(H_k p_k = -g_k\) requires \(O(n^3)\) operations (LU or Cholesky factorization). For \(n = 10^4\) parameters (a small neural network), this is \(10^{12}\) operations per step: completely impractical.

Hessian indefiniteness: far from the optimum, \(H_k\) may not be positive definite. The Newton step \(p_k = -H_k^{-1}g_k\) may then point uphill. Modified Newton methods add a regularization term: \(p_k = -(H_k + \lambda_k I)^{-1}g_k\) with \(\lambda_k\) chosen to ensure positive definiteness (Levenberg-Marquardt strategy).

Convergence is only local: quadratic convergence is guaranteed only near \(x^*\). Starting far away, Newton’s method can diverge or cycle. Always combine with a line search or trust region to ensure global convergence.

⚠️ Newton steps can diverge without safeguards

A pure Newton step without a line search can overshoot the minimum and diverge, especially when the Hessian is nearly singular or indefinite. In practice, Newton’s method is always implemented with either a backtracking line search (reduce step size until sufficient decrease is achieved) or a trust region (restrict the step norm to a ball where the quadratic approximation is reliable).

Quasi-Newton methods

Quasi-Newton methods approximate the Hessian (or its inverse) from gradient differences, avoiding explicit second-derivative computation. They achieve superlinear convergence: slower than Newton’s quadratic, but much faster than gradient descent’s linear convergence.

The key idea: use the change in gradient between successive iterates to update the Hessian approximation. If \(s_k = x_{k+1} - x_k\) and \(y_k = g_{k+1} - g_k\), the updated Hessian approximation \(B_{k+1}\) must satisfy the secant condition:

\[B_{k+1} s_k = y_k\]

This ensures the approximation is consistent with the observed curvature in the direction of the last step.

BFGS update

The Broyden-Fletcher-Goldfarb-Shanno (BFGS) update is the most successful quasi-Newton method. It updates the inverse Hessian approximation \(H_k^{-1} \approx B_k^{-1}\) directly:

\[B_{k+1}^{-1} = \left(I - \rho_k s_k y_k^T\right) B_k^{-1} \left(I - \rho_k y_k s_k^T\right) + \rho_k s_k s_k^T\]

\[\rho_k = \frac{1}{y_k^T s_k}\]

The BFGS update is a rank-2 update of the inverse Hessian: it adds information from the current step and gradient change. Key properties:

  • Preserves symmetry: \(B_{k+1}^{-1}\) is symmetric if \(B_k^{-1}\) is.
  • Preserves positive definiteness (when \(y_k^T s_k > 0\), guaranteed by a Wolfe line search).
  • Requires only \(O(n^2)\) storage and \(O(n^2)\) operations per update.
  • Converges superlinearly: \(\|x_{k+1} - x^*\| = o(\|x_k - x^*\|)\), meaning the convergence rate itself improves at each step.

Convergence comparison of gradient descent, BFGS and Newton on an ill-conditioned quadratic showing superlinear convergence of BFGS

BFGS (blue) sits between gradient descent (green, slow linear) and Newton (red, quadratic): it converges much faster than gradient descent without the \(O(n^3)\) cost of Newton.

L-BFGS: large-scale quasi-Newton

BFGS stores the full \(n \times n\) inverse Hessian approximation, requiring \(O(n^2)\) memory. For \(n = 10^6\) parameters (a typical deep learning model), this is completely infeasible.

Limited-memory BFGS (L-BFGS) stores only the last \(m\) curvature pairs \(\{(s_k, y_k)\}_{k=t-m}^{t-1}\) (typically \(m = 5\) to 20) and implicitly represents the Hessian approximation through these pairs. The matrix-vector product \(B_k^{-1} g_k\) is computed via a two-loop recursion in \(O(mn)\) time and \(O(mn)\) memory.

L-BFGS is the standard algorithm for large-scale smooth unconstrained optimization. It is used in logistic regression, Gaussian process fitting, and many machine learning applications where the Hessian is too large to store.

Method Convergence Cost/iteration Memory Best for
Gradient descent Linear \(O(n)\) \(O(n)\) Very large \(n\), non-smooth
BFGS Superlinear \(O(n^2)\) \(O(n^2)\) Medium \(n\) (\(\leq 10^4\))
L-BFGS Superlinear \(O(mn)\) \(O(mn)\) Large \(n\) (\(> 10^4\))
Newton Quadratic \(O(n^3)\) \(O(n^2)\) Small \(n\) (\(\leq 10^3\)), exact Hessian available

Line search and the Wolfe conditions

All quasi-Newton methods require a line search at each step to choose the step size \(\alpha_k\). The Wolfe conditions are standard:

Sufficient decrease (Armijo): \(f(x_k + \alpha_k p_k) \leq f(x_k) + c_1 \alpha_k g_k^T p_k\)

Curvature condition: \(g(x_k + \alpha_k p_k)^T p_k \geq c_2 g_k^T p_k\)

with \(0 < c_1 < c_2 < 1\) (typically \(c_1 = 10^{-4}\), \(c_2 = 0.9\)). The curvature condition ensures \(y_k^T s_k > 0\), which is required for BFGS to maintain positive definiteness of the Hessian approximation.

💡 Newton and BFGS in R

# optim() implements BFGS and L-BFGS-B
f      <- function(x) x[1]^4/4 - x[1]^2 + x[2]^2
grad_f <- function(x) c(x[1]^3 - 2*x[1], 2*x[2])

# BFGS
optim(c(2, 1), f, grad_f, method="BFGS")

# L-BFGS-B (with box constraints)
optim(c(2, 1), f, grad_f, method="L-BFGS-B",
      lower=c(-Inf,-Inf), upper=c(Inf,Inf))

# Newton's method via nlm()
nlm(f, c(2, 1), hessian=TRUE)

# For constrained problems: nloptr with BFGS
library(nloptr)
nloptr(x0=c(2,1), eval_f=f, eval_grad_f=grad_f,
       opts=list(algorithm="NLOPT_LD_LBFGS"))

In practice, always provide the gradient analytically when available: numerical differentiation is slower and less accurate.