First-order methods such as Steepest Descent and Conjugate Gradient use only the gradient of the function \( f(\mathbf{x}) \) to choose the direction toward a local minimum. Newton's Method, in contrast, uses both the first and second derivatives of the function at the current point to determine its step. Using the second derivative typically allows Newton's method to reach a local minimum in fewer iterations than first-order methods (Steepest Descent and Conjugate Gradient). However, it is important to choose the initial guess \( \mathbf{x}_0 \) close to the local minimum; otherwise the method may diverge.
Newton's method is an iterative optimization algorithm that uses the first and second derivatives of a function to find its local minima or maxima. In each iteration, the algorithm updates the current point \( \mathbf{x}_k \) using the formula: \[ \mathbf{x}_{k+1} = \mathbf{x}_k + d_k, \quad \text{where } \quad d_k = - H^{-1}(\mathbf{x}_k) \nabla f(\mathbf{x}_k) \] where \( d_k \) is the descent direction, \( H(\mathbf{x}_k) \) is the second derivative, also known as the Hessian, and \( \nabla f(\mathbf{x}_k) \) is the gradient of the function at \( \mathbf{x}_k \). For a function of one variable, the Hessian is a scalar; for a function of multiple variables, it is a square matrix. The Hessian provides information about the curvature of the function, allowing for more accurate step directions than first-order methods.
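To make the scalar-Hessian case concrete, here is a minimal sketch of a single Newton update for a one-variable function. The function \( f(x) = x^2 - 4x \) and the helper names below are illustrative assumptions, not taken from the text above.

```python
# Minimal sketch: one Newton update for a one-variable function.
# The example f(x) = x**2 - 4*x (minimum at x = 2) is illustrative only.

def newton_step_1d(x, grad, hess):
    """Return x - f'(x) / f''(x), the scalar-Hessian form of the update."""
    return x - grad(x) / hess(x)

grad = lambda x: 2 * x - 4   # first derivative of f(x) = x^2 - 4x
hess = lambda x: 2.0         # second derivative (scalar "Hessian")

x0 = 10.0
x1 = newton_step_1d(x0, grad, hess)
print(x1)  # 2.0 -- for a quadratic, a single step lands exactly on the minimum
```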
The algorithm for Newton's method is as follows:

1. Choose an initial guess \( \mathbf{x}_0 \) and a convergence tolerance.
2. At the current point \( \mathbf{x}_k \), compute the gradient \( \nabla f(\mathbf{x}_k) \) and the Hessian \( H(\mathbf{x}_k) \).
3. Compute the step \( d_k = -H^{-1}(\mathbf{x}_k) \nabla f(\mathbf{x}_k) \).
4. Update \( \mathbf{x}_{k+1} = \mathbf{x}_k + d_k \).
5. Repeat steps 2-4 until the gradient norm (or the step size) falls below the tolerance.

A Python sketch of these steps is given below.
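The following is a minimal sketch of these steps; the function name `newton_method`, the default tolerance, and the iteration cap are illustrative choices, not prescribed by the text.

```python
import numpy as np

def newton_method(grad, hess, x0, tol=1e-8, max_iter=50):
    """Minimal Newton's method sketch: x_{k+1} = x_k - H(x_k)^{-1} grad(x_k).

    grad(x) must return the gradient vector and hess(x) the Hessian matrix
    at the point x. Stops when the gradient norm drops below `tol`.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:        # converged: gradient is (nearly) zero
            break
        d = -np.linalg.solve(hess(x), g)   # d_k = -H^{-1} g, via a linear solve
        x = x + d                          # x_{k+1} = x_k + d_k
    return x
```

Solving the linear system \( H d_k = -\nabla f \) with `np.linalg.solve` is the usual way to apply \( H^{-1} \) without forming the inverse matrix explicitly.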
Consider the function \( f(x, y) = x^2 + y^2 \) with global minimum at \( (0, 0) \). The gradient and Hessian are given by: \[ \nabla f(x, y) = \begin{pmatrix} 2x \\ 2y \end{pmatrix}, \quad H(x, y) = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} \] Starting from an initial guess \( (x_0, y_0) = (1, 1) \), the algorithm iteratively updates the point until convergence. The first iteration would yield: \[ \mathbf{x}_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix} - \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}^{-1} \begin{pmatrix} 2 \\ 2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \] The algorithm converges to the minimum \( (0, 0) \) in one step.
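As a quick check, a sketch like the `newton_method` function above reproduces this one-step convergence; the lambda helpers below are illustrative.

```python
import numpy as np

# f(x, y) = x^2 + y^2 from the example above
grad = lambda p: np.array([2 * p[0], 2 * p[1]])      # gradient (2x, 2y)
hess = lambda p: np.array([[2.0, 0.0], [0.0, 2.0]])  # constant Hessian

x_min = newton_method(grad, hess, x0=[1.0, 1.0])     # uses the sketch defined above
print(x_min)  # [0. 0.] -- reaches the minimum in a single step
```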
Newton's method has several advantages:

- It uses curvature (second-derivative) information, so its steps are better scaled than pure gradient steps.
- Near a local minimum it typically converges in far fewer iterations than Steepest Descent or Conjugate Gradient.
- For a quadratic function, such as the example above, it reaches the minimum in a single step.
However, it also has some disadvantages:

- Each iteration requires computing the Hessian and solving a linear system with it, which becomes expensive when the number of variables is large.
- The Hessian must be invertible, and it should be positive definite for \( d_k \) to be a descent direction.
- The method is sensitive to the starting point: if \( \mathbf{x}_0 \) is far from the local minimum, the iterations may diverge.