## Visualize the differences and similarities between **gradient descent**, **gradient descent with momentum**, **RMSprop**, and **Adam**

If you are like me, equations don’t speak for themselves. To understand them, **I need to see what they do on a concrete example.** In this blog post, I apply this visualization principle to common optimization algorithms used in machine learning.

Nowadays, the Adam algorithm is a very popular choice. The Adam algorithm adds **momentum** and **self-tuning of the learning rate** to the plain-vanilla gradient descent algorithm. But what exactly are momentum and self-tuning?

Below is a visual preview of what these concepts refer to:

To keep things simple, I apply the different optimization algorithms to the bivariate linear regression model:

*y = a + bx*

The variable *y* represents a quantity we try to predict/explain using another variable *x*. The unknown parameters are the **intercept *a*** and the **slope *b***. To fit the model to the data, we minimize the mean of the squared difference between the model and the data, which can be compactly expressed as follows:

*Loss(a, b) = (1/m) ||y − a − bx||²*

(assuming we have *m* observations and using the Euclidean norm)
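Since every algorithm discussed here is driven by the gradient of this loss, it can help to write the gradient out explicitly. The index *i* over observations is notation I add here for the derivation:

$$
\frac{\partial\,\text{Loss}}{\partial a} = -\frac{2}{m}\sum_{i=1}^{m}\left(y_i - a - b\,x_i\right),
\qquad
\frac{\partial\,\text{Loss}}{\partial b} = -\frac{2}{m}\sum_{i=1}^{m} x_i\left(y_i - a - b\,x_i\right)
$$

The second component is weighted by *x*, so when *x* takes large values the loss is much steeper along the slope than along the intercept.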

By changing the values of *a* and *b*, we can hopefully improve the fit of the model to the data. A nice thing about the bivariate regression model is that we can plot the value of the loss function as a function of the unknown parameters *a* and *b*. Below is a surface plot of the loss function, with the black dot representing the minimum of the loss.

We can also visualize the loss function using a contour plot, where the lines are level sets (points such that *Loss(a, b) =* constant). Below, the white point represents the minimum of the loss function.

The plain-vanilla gradient descent algorithm consists in taking a step of size **η** in the **direction of steepest descent**, which is given by the negative of the gradient. Mathematically, the update rule looks like:
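Written for the parameter vector θ = (a, b), with *t* indexing iterations (both are notation I assume here), the standard form of this rule is:

$$
\theta_{t+1} = \theta_t - \eta\,\nabla \text{Loss}(\theta_t), \qquad \theta_t = (a_t, b_t)
$$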

In the next plot, I show one trajectory implied by the gradient descent algorithm. Points represent values of *a* and *b* across iterations, while arrows show the negative gradient of the loss function scaled by the step size, telling us where to move in the next iteration.

A key feature is that the gradient descent algorithm might create **some oscillations between level sets.** In a perfect world, we would instead like to move smoothly in the direction of the minimum. As we will see, adding momentum is one way to smooth the trajectory toward the minimum.
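To see these oscillations in numbers rather than in a picture, here is a minimal NumPy sketch of plain gradient descent. It is not the code used for the plots: the noise-free data on a symmetric grid, the helper names `gradient` and `gradient_descent`, and the number of steps are assumptions chosen to keep the example deterministic.

```python
import numpy as np

# Hypothetical noise-free data (an assumption): y = -2 + 2x on a symmetric grid
x = np.linspace(-5.0, 5.0, 101)
y = -2.0 + 2.0 * x

def gradient(theta):
    """Gradient of Loss(a, b) = (1/m) * ||y - a - b*x||^2 with respect to (a, b)."""
    a, b = theta
    residual = y - a - b * x
    return np.array([-2.0 * residual.mean(), -2.0 * (x * residual).mean()])

def gradient_descent(theta0, eta=0.1, n_steps=30):
    """Return the array of iterates theta_0, ..., theta_n (one row per iteration)."""
    path = [np.array(theta0, dtype=float)]
    for _ in range(n_steps):
        path.append(path[-1] - eta * gradient(path[-1]))
    return np.array(path)

path = gradient_descent(theta0=(2.0, 9.0))  # start at intercept 2, slope 9
intercept_error = path[:, 0] - (-2.0)       # signed distance to the true intercept
slope_error = path[:, 1] - 2.0              # signed distance to the true slope

print(np.sign(slope_error[:6]))          # signs alternate: the slope overshoots back and forth
print(np.round(intercept_error[:6], 3))  # shrinks steadily, with no sign change
```

With this step size the slope estimate jumps from one side of the true value to the other at every iteration, while the intercept creeps toward its target from one side only. That combination is what produces the zigzag pattern across level sets.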

**Momentum refers to the tendency of moving objects to continue moving in the same direction.** In practice, we can add momentum to gradient descent by taking into account previous values of the gradient. This can be done as follows:
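One common way to write it, following the convention in Ruder's overview and using a velocity term *u* (my notation, with *u* starting at zero), is:

$$
u_t = \gamma\,u_{t-1} + \eta\,\nabla \text{Loss}(\theta_t), \qquad \theta_{t+1} = \theta_t - u_t
$$

Setting γ = 0 gives back plain gradient descent. PyTorch's `torch.optim.SGD` accumulates the raw gradient and multiplies by the learning rate afterwards, which yields the same trajectory as long as the learning rate is constant.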

The higher the value of **γ**, the more past values of the gradient are taken into account in the current update.

In the next plot, I show the trajectories implied by the gradient descent algorithm **with** (in blue) and **without momentum** (in white).

Momentum reduces the fluctuations along the slope coefficient. **The large swings up and down tend to cancel out once the averaging effect of momentum starts to kick in.** As a result, with momentum we move faster in the direction of the true value.
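The cancelling-out effect can be checked with a small NumPy sketch. Again, this is an illustration under assumptions rather than the code behind the plots: the data are noise-free, the function name `descend` is hypothetical, and the momentum form is the velocity-based one with γ = 0.2 and η = 0.1.

```python
import numpy as np

# Hypothetical noise-free data (an assumption): y = -2 + 2x on a symmetric grid
x = np.linspace(-5.0, 5.0, 101)
y = -2.0 + 2.0 * x
theta_true = np.array([-2.0, 2.0])

def gradient(theta):
    """Gradient of the mean squared error with respect to (intercept, slope)."""
    a, b = theta
    residual = y - a - b * x
    return np.array([-2.0 * residual.mean(), -2.0 * (x * residual).mean()])

def descend(theta0, eta=0.1, gamma=0.0, n_steps=10):
    """Gradient descent with momentum; gamma=0 gives plain gradient descent."""
    theta = np.array(theta0, dtype=float)
    velocity = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(n_steps):
        # Blend the previous velocity with the current gradient step
        velocity = gamma * velocity + eta * gradient(theta)
        theta = theta - velocity
        path.append(theta.copy())
    return np.array(path)

plain = descend((2.0, 9.0), gamma=0.0)
with_momentum = descend((2.0, 9.0), gamma=0.2)

# Largest slope swing after the first two iterations, and final distance to the minimum
swing_plain = np.abs(plain[2:, 1] - theta_true[1]).max()
swing_momentum = np.abs(with_momentum[2:, 1] - theta_true[1]).max()
dist_plain = np.linalg.norm(plain[-1] - theta_true)
dist_momentum = np.linalg.norm(with_momentum[-1] - theta_true)
print(swing_plain, swing_momentum)
print(dist_plain, dist_momentum)
```

In this setup the slope swings are smaller with momentum, and after ten iterations the momentum run is closer to the true parameters. Other values of γ and η can behave differently, so it is worth experimenting with both.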

Momentum is a nice twist on gradient descent. Another line of improvement consists in **introducing a learning rate that is tailored to each parameter** (in our example: one learning rate for the slope, one learning rate for the intercept).

But how do we choose such a coefficient-specific learning rate? Note that the previous plots show that the gradient does not necessarily point toward the minimum, at least not during the first iterations.

Intuitively, we would like to give **less weight** to the moves in the up/down direction, and **more weight** to the moves in the left/right direction. The RMSprop updating rule embeds this desired property:
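In its usual form, with a decay rate β, a running average *v* and a small constant ε for numerical stability (these three symbols are assumptions on my side), the rule reads:

$$
\begin{aligned}
g_t &= \nabla \text{Loss}(\theta_t) \\
v_t &= \beta\,v_{t-1} + (1-\beta)\,g_t^{2} \\
\theta_{t+1} &= \theta_t - \eta\,\frac{g_t}{\sqrt{v_t} + \epsilon}
\end{aligned}
$$

The square, the square root and the division are all applied coordinate by coordinate, so the slope and the intercept each get their own effective step size.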

The first line just defines **g** to be the gradient of the loss function. The second line says that we calculate a running average of the square of the gradient. In the third line, we take a step in the direction given by the gradient, but divided by the square root of the running average of past squared gradients.

In our example, because the square of the gradient tends to be large for the slope coefficient, we take small steps in that direction. The opposite is true for the intercept coefficient (small values, large moves).
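A short NumPy sketch makes the rescaling concrete. It is a simplified illustration, not the PyTorch implementation used for the plots: the data are noise-free, the function name `rmsprop` is hypothetical, and the decay rate 0.99 and ε = 1e-8 are assumed values.

```python
import numpy as np

# Hypothetical noise-free data (an assumption): y = -2 + 2x on a symmetric grid
x = np.linspace(-5.0, 5.0, 101)
y = -2.0 + 2.0 * x

def gradient(theta):
    """Gradient of the mean squared error with respect to (intercept, slope)."""
    a, b = theta
    residual = y - a - b * x
    return np.array([-2.0 * residual.mean(), -2.0 * (x * residual).mean()])

def rmsprop(theta0, eta=0.1, beta=0.99, eps=1e-8, n_steps=30):
    """Return the iterates and the gradients evaluated along the way."""
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)                 # running average of squared gradients
    path, grads = [theta.copy()], []
    for _ in range(n_steps):
        g = gradient(theta)
        v = beta * v + (1.0 - beta) * g**2
        theta = theta - eta * g / (np.sqrt(v) + eps)  # per-coordinate rescaling
        path.append(theta.copy())
        grads.append(g)
    return np.array(path), np.array(grads)

path, grads = rmsprop((2.0, 9.0))  # start at intercept 2, slope 9
first_gradient = np.abs(grads[0])
first_step = np.abs(path[1] - path[0])
print(first_gradient)  # the raw gradient is far larger along the slope
print(first_step)      # after rescaling, both coordinates move by the same amount
```

Even though the raw gradient is more than ten times larger for the slope than for the intercept at the starting point, the first rescaled step has the same size in both coordinates. Relative to plain gradient descent, the slope move is shrunk and the intercept move is amplified.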

The Adam optimization algorithm has **momentum**, as well as **the adaptive learning rate** of RMSprop. Below is *almost* what Adam does:
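Leaving the bias correction aside, and writing β₁ and β₂ for the two decay rates and ε for a small stabilizing constant (symbols I assume here, following the usual presentation of Adam), the simplified rule is:

$$
\begin{aligned}
g_t &= \nabla \text{Loss}(\theta_t) \\
m_t &= \beta_1\,m_{t-1} + (1-\beta_1)\,g_t \\
v_t &= \beta_2\,v_{t-1} + (1-\beta_2)\,g_t^{2} \\
\theta_{t+1} &= \theta_t - \eta\,\frac{m_t}{\sqrt{v_t} + \epsilon}
\end{aligned}
$$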

The updating rule is very similar to that of RMSprop. **The key difference is momentum:** the direction of change is given by a running average of the past gradients.

The *actual* Adam updating rule uses “bias-corrected” values for *m* and *v*. In the first step, Adam initializes *m* and *v* to zero. To correct for the initialization bias, the authors suggest using reweighted versions of *m* and *v*:
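With *t* counting iterations from 1, *m* and *v* denoting the running averages of the gradient and of the squared gradient, and β₁, β₂ their decay rates (the β and ε symbols are assumptions on my side), the correction is usually written as:

$$
\hat{m}_t = \frac{m_t}{1-\beta_1^{\,t}}, \qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^{\,t}}, \qquad
\theta_{t+1} = \theta_t - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

Early on, the denominators are well below one, which undoes the shrinkage toward zero caused by the initialization. As *t* grows they approach one and the correction fades away.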

Below, we see that the trajectory induced by Adam is somewhat similar to the one given by RMSprop, but with a slower start.
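The slower start can be traced back to the update rule itself. Below is a minimal NumPy sketch of Adam with bias correction. It is not the PyTorch implementation used for the plots: the data are noise-free, the function name `adam` is hypothetical, and β₁ = 0.9, β₂ = 0.999 and ε = 1e-8 are assumed values.

```python
import numpy as np

# Hypothetical noise-free data (an assumption): y = -2 + 2x on a symmetric grid
x = np.linspace(-5.0, 5.0, 101)
y = -2.0 + 2.0 * x

def gradient(theta):
    """Gradient of the mean squared error with respect to (intercept, slope)."""
    a, b = theta
    residual = y - a - b * x
    return np.array([-2.0 * residual.mean(), -2.0 * (x * residual).mean()])

def adam(theta0, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=30):
    """Return the array of iterates produced by Adam with bias correction."""
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)   # running average of gradients (momentum)
    v = np.zeros_like(theta)   # running average of squared gradients
    path = [theta.copy()]
    for t in range(1, n_steps + 1):
        g = gradient(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g**2
        m_hat = m / (1.0 - beta1**t)   # bias-corrected first moment
        v_hat = v / (1.0 - beta2**t)   # bias-corrected second moment
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
        path.append(theta.copy())
    return np.array(path)

path = adam((2.0, 9.0))  # start at intercept 2, slope 9
first_step = np.abs(path[1] - path[0])
print(first_step)  # roughly eta in each coordinate, whatever the gradient size
```

At the first iteration the bias-corrected averages reduce to the gradient and its square, so the step is about η in each coordinate. RMSprop with a decay rate of 0.99 and no bias correction would instead start with a step of roughly η/√(1 − 0.99), ten times larger, which is consistent with Adam getting off to a slower start.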

The next plot shows the trajectories induced by the four optimization algorithms described above.

Key results are as follows:

- Gradient descent with momentum fluctuates less than gradient descent without momentum.
- Adam and RMSprop take a different route, moving slower in the slope dimension and faster in the intercept dimension.
- As expected, Adam displays some momentum: while RMSprop starts turning left towards the minimum, Adam has a harder time turning because of the accumulated momentum.

Below is the same graph, but in 3D:

In this blog post, my aim was for the reader to build an intuitive understanding of key optimization algorithms used in machine learning.

**Below you will find the code that was used to produce the graphs in this post.** Don’t hesitate to modify the learning rate and/or the loss function to see how this affects the different trajectories.

—

The following block of code loads dependencies, defines the loss function and plots the loss function (surface and contour plots):

```python
# A. Dependencies
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator

plot_scale = 1.25
plt.rcParams["figure.figsize"] = (plot_scale*16, plot_scale*9)

import numpy as np
import pandas as pd
import random
import scipy.stats
from itertools import product
import os
import time
from math import sqrt
import seaborn as sns; sns.set()
from tqdm import tqdm as tqdm
import datetime
from typing import Tuple
class Vector: pass
from scipy.stats import norm
import torch
from torch import nn
from torch.utils.data import DataLoader
import copy
import matplotlib.ticker as mtick
from torchcontrib.optim import SWA
from numpy import linalg as LA
import imageio as io #create gif
```

```python
# B. Create OLS problem
b0 = -2.0 #intercept
b1 = 2.0 #slope
beta_true = (b0 , b1)
nb_vals = 1000 #number of draws
mu, sigma = 0, 0.001 # mean and standard deviation
shocks = np.random.normal(mu, sigma, nb_vals)

# covariate
x0 = np.ones(nb_vals) #cst
x1 = np.random.uniform(-5, 5, nb_vals)
X = np.column_stack((x0, x1))

# Data
y = b0*x0 + b1*x1 + shocks

# Closed-form OLS estimate (X'X)^{-1} X'y, as a reference point
A = np.linalg.inv(np.matmul(np.transpose(X), X))
B = np.matmul(np.transpose(X), y)
np.matmul(A, B)

X_torch = torch.from_numpy(X).float()
y_torch = torch.from_numpy(y).float()

# Loss function and gradient (for plotting)
def loss_function_OLS(beta_hat, X, y):
    loss = (1/len(y))*np.sum(np.square(y - np.matmul(X, beta_hat)))
    return loss

def grad_OLS(beta_hat, X, y):
    mse = loss_function_OLS(beta_hat, X, y)
    G = (2/len(y))*np.matmul(np.transpose(X), np.matmul(X, beta_hat) - y)
    return G, mse
```

```python
# C. Plots for the loss function
os.makedirs("IMGS", exist_ok=True) # folder in which the figures are saved
min_val=-10.0
max_val=10.0
delta_grid=0.05
x_grid = np.arange(min_val, max_val, delta_grid)
y_grid = np.arange(min_val, max_val, delta_grid)
X_grid, Y_grid = np.meshgrid(x_grid, y_grid)

# Evaluate the loss on the grid (rows: slope, columns: intercept)
Z = np.zeros((len(y_grid), len(x_grid)))
for (y_index, y_value) in enumerate(y_grid):
    for (x_index, x_value) in enumerate(x_grid):
        beta_local = np.array((x_value, y_value))
        Z[y_index, x_index] = loss_function_OLS(beta_local, X, y)

fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
# Plot the surface.
surf = ax.plot_surface(X_grid, Y_grid, Z, cmap=cm.coolwarm, linewidth=0, antialiased=False, alpha=0.2)
ax.zaxis.set_major_locator(LinearLocator(10))
ax.zaxis.set_major_formatter('{x:.02f}')
# Loss at the true parameters (z-coordinate of the black dot)
loss_at_min = loss_function_OLS(np.array(beta_true), X, y)
ax.scatter([b0], [b1], [loss_at_min], s=100, c='black', linewidth=0.5)
x_min = -10
x_max = -x_min
y_min = x_min
y_max = -x_min
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.ylabel('Slope')
plt.xlabel('Intercept')
fig.colorbar(surf, shrink=0.5, aspect=5)
filename = "IMGS/surface_loss.png"
plt.savefig(filename)
plt.show()

# Plot contour
cp = plt.contour(X_grid, Y_grid, np.sqrt(Z), colors='black', linestyles='dashed', linewidths=1, alpha=0.5)
plt.clabel(cp, inline=1, fontsize=10)
cp = plt.contourf(X_grid, Y_grid, np.sqrt(Z))
plt.scatter([b0], [b1], s=100, c='white', linewidth=0.5)
plt.ylabel('Slope')
plt.xlabel('Intercept')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
filename = "IMGS/countour_loss.png"
plt.savefig(filename)
plt.show()
```

The next block of code defines functions so that we can solve the OLS problem using PyTorch. Here, using PyTorch is overkill, but the advantage is that we can use the pre-coded minimization algorithms (*torch.optim*):

```python
def loss_OLS(model, y, X):
    """
    Loss function for OLS
    """
    # Squared residuals; the model only receives the covariate (column 1 of X)
    R_squared = torch.square(y.unsqueeze(1) - model(X[:,1].unsqueeze(1)))
    return torch.mean(R_squared)

def set_initial_values(model, w, b):
    """
    Function to set the weight and bias to certain values
    """
    with torch.no_grad():
        for name, param in model.named_parameters():
            if 'linear_relu_stack.0.weight' in name:
                param.copy_(torch.tensor([w]))
            elif 'linear_relu_stack.0.bias' in name:
                param.copy_(torch.tensor([b]))

def create_optimizer(model, optimizer_name, lr, momentum):
    """
    Function to define an optimizer
    """
    if optimizer_name == "Adam":
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    elif optimizer_name == "SGD":
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    elif optimizer_name == "SGD-momentum":
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    elif optimizer_name == "Adadelta":
        optimizer = torch.optim.Adadelta(model.parameters(), lr=lr)
    elif optimizer_name == "RMSprop":
        optimizer = torch.optim.RMSprop(model.parameters(), lr=lr)
    else:
        raise ValueError("optimizer unknown")
    return optimizer
```

```python
def train_model(optimizer_name, initial_guess, true_value, lr, momentum):
    """
    Function to train a model
    """
    # initialize a model
    model = NeuralNetwork().to(device)
    #print(model)
    set_initial_values(model, initial_guess[0], initial_guess[1])
    for name, param in model.named_parameters():
        print(name, param)
    model.train()
    nb_epochs = 100
    use_scheduler = False
    freq_scheduler = 100
    freq_gamma = 0.95
    true_b = torch.tensor([true_value[0], true_value[1]])
    print(optimizer_name)
    optimizer = create_optimizer(model, optimizer_name, lr, momentum)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=freq_gamma)
    # Keep the data on the same device as the model
    X_dev, y_dev = X_torch.to(device), y_torch.to(device)
    # store mean loss by epoch
    loss_epochs = torch.zeros(nb_epochs)
    list_perc_abs_error = [] #store squared distance to the true parameters
    list_perc_abs_error_i = [] #store index i
    list_perc_abs_error_loss = [] #store loss
    list_norm_gradient = [] #store norm of gradient
    list_gradient = [] #store the gradient itself
    list_beta = [] #store parameters
    calculate_variance_grad = False
    freq_loss = 1
    freq_display = 10
    for i in tqdm(range(0, nb_epochs)):
        optimizer.zero_grad()
        # Calculate the loss
        loss = loss_OLS(model, y_dev, X_dev)
        loss_epochs[i] = float(loss.item())
        # Store the loss
        with torch.no_grad():
            # Extract weight and bias
            b_current = np.array([k.item() for k in model.parameters()])
            b_current_ordered = np.array((b_current[1], b_current[0])) #reorder (bias, weight)
            list_beta.append(b_current_ordered)
            perc_abs_error = np.sum(np.square(b_current_ordered - true_b.detach().numpy()))
            list_perc_abs_error.append(np.median(perc_abs_error))
            list_perc_abs_error_i.append(i)
            list_perc_abs_error_loss.append(float(loss.item()))
        # Calculate the gradient
        loss.backward()
        # Store the gradient
        with torch.no_grad():
            grad = np.zeros(2)
            for (index_p, p) in enumerate(model.parameters()):
                grad[index_p] = p.grad.item()
            #reorder (bias, weight)
            grad_ordered = np.array((grad[1], grad[0]))
            list_gradient.append(grad_ordered)
        # Take a gradient step
        optimizer.step()
        if i % freq_display == 0: #Monitor the loss
            loss, current = float(loss.item()), i
            print(f"loss: {loss:>7f}, squared parameter error {list_perc_abs_error[-1]:>7f}, [{current:>5d}/{nb_epochs:>5d}]")
        if (i % freq_scheduler == 0) and (i != 0) and use_scheduler:
            scheduler.step()
            print("i : {}. Decreasing learning rate: {}".format(i, scheduler.get_last_lr()))
    return model, list_beta, list_gradient

def create_gif(filenames, output_name):
    """
    Function to create a gif, using a list of images
    """
    with io.get_writer(output_name, mode='I') as writer:
        for filename in filenames:
            image = io.imread(filename)
            writer.append_data(image)
    # Remove files, except the final one (duplicates are dropped, order is kept)
    unique_filenames = list(dict.fromkeys(filenames))
    for filename in unique_filenames[:-1]:
        os.remove(filename)

# Define a neural network with a single node
# Get cpu or gpu device for training.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")
nb_nodes = 1

# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(1, nb_nodes)
        )

    def forward(self, x):
        out = self.linear_relu_stack(x)
        return out
```

Minimization using gradient descent:

```python
lr = 0.10 #learning rate
alpha = lr
init = (9.0, 2.0) #initial guess, passed as (weight, bias), i.e. (slope, intercept)
true_value = [-2.0, 2.0] #true value for parameters (intercept, slope)

# I. Solve
optimizer_name = "SGD"
momentum = 0.0
model_SGD, list_beta_SGD, list_gradient_SGD = train_model(optimizer_name , init, true_value, lr, momentum)

# II. Create gif
filenames = []
zoom=1 #to increase/decrease the length of vectors on the plot
max_index_plot = 30 #when to stop plotting

# Plot contour
cp = plt.contour(X_grid, Y_grid, np.sqrt(Z), colors='black', linestyles='dashed', linewidths=1, alpha=0.5)
plt.clabel(cp, inline=1, fontsize=10)
cp = plt.contourf(X_grid, Y_grid, np.sqrt(Z))

# Add points and arrows
for (index, (bb, grad)) in enumerate(zip(list_beta_SGD, list_gradient_SGD)):
    if index>max_index_plot:
        break
    if index == 0:
        label_1 = "SGD"
    else:
        label_1 = ""
    # Point
    plt.scatter([bb[0]], [bb[1]], s=10, c='white', linewidth=5.0, label=label_1)
    # Arrow: negative gradient scaled by the learning rate
    plt.arrow(bb[0], bb[1], - zoom * alpha* grad[0], - zoom * alpha * grad[1], color='white')
    # create file name and append it to a list
    filename = "IMGS/path_SGD_{}.png".format(index)
    filenames.append(filename)
    plt.xlabel('cst')
    plt.ylabel('slope')
    plt.legend()
    plt.savefig(filename)

filename = "IMGS/path_SGD.png"
plt.savefig(filename)
create_gif(filenames, "SGD.gif")
plt.show()
```

Minimization using gradient descent with momentum:

```python
optimizer_name = "SGD-momentum"
momentum = 0.2

# I. Solve
model_momentum, list_beta_momentum, list_gradient_momentum = train_model(optimizer_name , init, true_value, lr, momentum)

# II. Create gif
filenames = []
zoom=1 #to increase/decrease the length of vectors on the plot
max_index_plot = 30 #when to stop plotting

# Plot contour
cp = plt.contour(X_grid, Y_grid, np.sqrt(Z), colors='black', linestyles='dashed', linewidths=1, alpha=0.5)
plt.clabel(cp, inline=1, fontsize=10)
cp = plt.contourf(X_grid, Y_grid, np.sqrt(Z))

# Add points and arrows
for (index, (bb, grad, bb_momentum, grad_momentum)) in enumerate(zip(list_beta_SGD, list_gradient_SGD, list_beta_momentum, list_gradient_momentum)):
    if index>max_index_plot:
        break
    if index == 0:
        label_1 = "SGD"
        label_2 = "SGD-momentum"
    else:
        label_1 = ""
        label_2 = ""
    # Point
    plt.scatter([bb[0]], [bb[1]], s=10, c='white', linewidth=5.0, label=label_1)
    plt.scatter([bb_momentum[0]], [bb_momentum[1]], s=10, c='blue', linewidth=5.0, alpha=0.5, label=label_2)
    # Arrows
    #plt.arrow(bb_momentum[0], bb_momentum[1], - zoom * alpha* grad[0], - zoom * alpha * grad[1], color='white')
    plt.arrow(bb_momentum[0], bb_momentum[1], - zoom * alpha* grad_momentum[0], - zoom * alpha * grad_momentum[1], color="blue")
    # create file name and append it to a list
    filename = "IMGS/path_SGD_momentum_{}.png".format(index)
    filenames.append(filename)
    plt.xlabel('cst')
    plt.ylabel('slope')
    plt.legend()
    plt.savefig(filename)

filename = "IMGS/path_SGD_momentum.png"
plt.savefig(filename)
create_gif(filenames, "SGD_momentum.gif")
plt.show()
```

Minimization using RMSprop:

```python
optimizer_name = "RMSprop"
momentum = 0.0

# I. Solve
model_RMSprop, list_beta_RMSprop, list_gradient_RMSprop = train_model(optimizer_name , init, true_value, lr, momentum)

# II. Create gif
filenames = []
zoom=1 #to increase/decrease the length of vectors on the plot
max_index_plot = 30 #when to stop plotting

# Plot contour
cp = plt.contour(X_grid, Y_grid, np.sqrt(Z), colors='black', linestyles='dashed', linewidths=1, alpha=0.5)
plt.clabel(cp, inline=1, fontsize=10)
cp = plt.contourf(X_grid, Y_grid, np.sqrt(Z))

# Add points and arrows
for (index, (bb, grad, bb_RMSprop, grad_RMSprop)) in enumerate(zip(list_beta_SGD, list_gradient_SGD, list_beta_RMSprop, list_gradient_RMSprop)):
    if index>max_index_plot:
        break
    if index == 0:
        label_1 = "SGD"
        label_2 = "RMSprop"
    else:
        label_1 = ""
        label_2 = ""
    # Point
    plt.scatter([bb[0]], [bb[1]], s=10, c='white', linewidth=5.0, label=label_1)
    plt.scatter([bb_RMSprop[0]], [bb_RMSprop[1]], s=10, c='blue', linewidth=5.0, alpha=0.5, label=label_2)
    # Arrows (raw negative gradient, not the rescaled RMSprop step)
    plt.arrow(bb_RMSprop[0], bb_RMSprop[1], - zoom * alpha* grad_RMSprop[0], - zoom * alpha * grad_RMSprop[1], color="blue")
    # create file name and append it to a list
    filename = "IMGS/path_RMSprop_{}.png".format(index)
    filenames.append(filename)
    plt.xlabel('cst')
    plt.ylabel('slope')
    plt.legend()
    plt.savefig(filename)

filename = "IMGS/path_RMSprop.png"
plt.savefig(filename)
create_gif(filenames, "RMSprop.gif")
plt.show()
```

Minimization using Adam:

```python
optimizer_name = "Adam"
momentum = 0.0

# I. Solve
model_Adam, list_beta_Adam, list_gradient_Adam = train_model(optimizer_name , init, true_value, lr, momentum)

# II. Create gif
filenames = []
zoom=1 #to increase/decrease the length of vectors on the plot
max_index_plot = 30 #when to stop plotting

# Plot contour
cp = plt.contour(X_grid, Y_grid, np.sqrt(Z), colors='black', linestyles='dashed', linewidths=1, alpha=0.5)
plt.clabel(cp, inline=1, fontsize=10)
cp = plt.contourf(X_grid, Y_grid, np.sqrt(Z))

# Add points and arrows
for (index, (bb, grad, bb_Adam, grad_Adam)) in enumerate(zip(list_beta_SGD, list_gradient_SGD, list_beta_Adam, list_gradient_Adam)):
    if index>max_index_plot:
        break
    if index == 0:
        label_1 = "SGD"
        label_2 = "Adam"
    else:
        label_1 = ""
        label_2 = ""
    # Point
    plt.scatter([bb[0]], [bb[1]], s=10, c='white', linewidth=5.0, label=label_1)
    plt.scatter([bb_Adam[0]], [bb_Adam[1]], s=10, c='blue', linewidth=5.0, alpha=0.5, label=label_2)
    # Arrows (raw negative gradient, not the actual Adam step)
    plt.arrow(bb_Adam[0], bb_Adam[1], - zoom * alpha* grad_Adam[0], - zoom * alpha * grad_Adam[1], color="blue")
    # create file name and append it to a list
    filename = "IMGS/path_Adam_{}.png".format(index)
    filenames.append(filename)
    plt.xlabel('cst')
    plt.ylabel('slope')
    plt.legend()
    plt.savefig(filename)

filename = "IMGS/path_Adam.png"
plt.savefig(filename)
create_gif(filenames, "Adam.gif")
plt.show()
```

Creating the “Master plot” with the four trajectories together:

```python
max_iter = 100
freq_plot = 1 # assumption: plot every iteration (this value is not defined in the original code)
filenames = []
cp = plt.contour(X_grid, Y_grid, np.sqrt(Z), colors='black', linestyles='dashed', linewidths=1, alpha=0.5)
plt.clabel(cp, inline=1, fontsize=10)
cp = plt.contourf(X_grid, Y_grid, np.sqrt(Z))
colors = ["white", "blue", "green", "red"]

# Add points:
for (index, (bb_SGD, bb_momentum, bb_RMSprop, bb_Adam)) in enumerate(zip(list_beta_SGD, list_beta_momentum, list_beta_RMSprop, list_beta_Adam)):
    if index % freq_plot == 0:
        if index == 0:
            label_1 = "SGD"
            label_2 = "SGD-momentum"
            label_3 = "RMSprop"
            label_4 = "Adam"
        else:
            label_1, label_2, label_3, label_4 = "", "", "", ""
        plt.scatter([bb_SGD[0]], [bb_SGD[1]], s=10, linewidth=5.0, label=label_1, color=colors[0])
        plt.scatter([bb_momentum[0]], [bb_momentum[1]], s=10, linewidth=5.0, alpha=0.5, label=label_2, color=colors[1])
        plt.scatter([bb_RMSprop[0]], [bb_RMSprop[1]], s=10, linewidth=5.0, alpha=0.5, label=label_3, color=colors[2])
        plt.scatter([bb_Adam[0]], [bb_Adam[1]], s=10, linewidth=5.0, alpha=0.5, label=label_4, color=colors[3])
    if index > max_iter:
        break
    # create file name and append it to a list
    filename = "IMGS/img_{}.png".format(index)
    filenames.append(filename)
    plt.xlabel('cst')
    plt.ylabel('slope')
    plt.legend()
    # save frame
    plt.savefig(filename)
    #plt.close()

# build gif
create_gif(filenames, "compare_optim_algos.gif")
```

Creating the 3D “Master plot”:

```python
max_iter = 100
filenames = [] # start from an empty list of frames
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})

# Plot the surface.
surf = ax.plot_surface(X_grid, Y_grid, Z, cmap=cm.coolwarm, linewidth=0, antialiased=False, alpha=0.1)
ax.zaxis.set_major_locator(LinearLocator(10))
ax.zaxis.set_major_formatter('{x:.02f}')
ax.view_init(60, 35)
colors = ["black", "blue", "green", "red"]
x_min = -10
x_max = -x_min
y_min = x_min
y_max = -x_min

# Add points:
for (index, (bb_SGD, bb_momentum, bb_RMSprop, bb_Adam)) in enumerate(zip(list_beta_SGD, list_beta_momentum, list_beta_RMSprop, list_beta_Adam)):
    if index == 0:
        label_1 = "SGD"
        label_2 = "SGD-momentum"
        label_3 = "RMSprop"
        label_4 = "Adam"
    else:
        label_1, label_2, label_3, label_4 = "", "", "", ""
    ax.scatter([bb_SGD[0]], [bb_SGD[1]], s=100, linewidth=5.0, label=label_1, color=colors[0])
    ax.scatter([bb_momentum[0]], [bb_momentum[1]], s=100, linewidth=5.0, alpha=0.5, label=label_2, color=colors[1])
    ax.scatter([bb_RMSprop[0]], [bb_RMSprop[1]], s=100, linewidth=5.0, alpha=0.5, label=label_3, color=colors[2])
    ax.scatter([bb_Adam[0]], [bb_Adam[1]], s=100, linewidth=5.0, alpha=0.5, label=label_4, color=colors[3])
    if index > max_iter:
        break
    # create file name and append it to a list
    filename = "IMGS/img_{}.png".format(index)
    filenames.append(filename)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.ylabel('Slope')
    plt.xlabel('Intercept')
    plt.legend()
    # save frame
    plt.savefig(filename)

filename = "IMGS/surface_loss.png"
plt.savefig(filename)
plt.show()
create_gif(filenames, "surface_compare_optim_algos.gif")
```

—

- Ruder, Sebastian. “An overview of gradient descent optimization algorithms.” *arXiv preprint arXiv:1609.04747* (2016).
- Sutskever, Ilya, et al. “On the importance of initialization and momentum in deep learning.” *International Conference on Machine Learning*. PMLR, 2013.
