Visualizing the differences and similarities between gradient descent, gradient descent with momentum, RMSprop, and Adam
If you are like me, equations do not speak for themselves. To understand them, I need to see what they do on a concrete example. In this blog post, I apply this visualization principle to popular optimization algorithms used in machine learning.
Nowadays, the Adam algorithm is a very popular choice. Adam adds momentum and self-tuning of the learning rate to the plain-vanilla gradient descent algorithm. But what exactly are momentum and self-tuning?
Below is a visual preview of what these concepts refer to:
To keep things simple, I use the different optimization algorithms on the bivariate linear regression model:
y = a + bx
The variable y represents a quantity we try to predict/explain using another variable x. The unknown parameters are the intercept a and the slope b.
To fit the model to the data, we minimize the mean square of the difference between the model and the data, which can be compactly expressed as follows:
Loss(a, b) = (1/m) ‖y − a − bx‖²
(assuming we have m observations and using the Euclidean norm)
By changing the values of a and b, we can hopefully improve the fit of the model to the data. A nice feature of the bivariate regression model is that we can plot the value of the loss function as a function of the unknown parameters a and b. Below is a surface plot of the loss function, with the black dot representing the minimum of the loss.
We can also visualize the loss function using a contour plot, where the lines are level sets (points such that Loss(a, b) = constant). Below, the white point represents the minimum of the loss function.
The plain-vanilla gradient descent algorithm consists in taking a step of size η in the direction of steepest descent, which is given by the opposite of the gradient. Mathematically, the update rule looks like:
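With θ = (a, b) the parameter vector and t the iteration index:

θ_{t+1} = θ_t − η ∇Loss(θ_t)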
In the next plot, I show one trajectory implied by the gradient descent algorithm. Points represent values of a and b across iterations, while arrows are gradients of the loss function, telling us where to move in the next iteration.
A key feature is that the gradient descent algorithm may create some oscillations between level sets. In a perfect world, we would like instead to move smoothly in the direction of the minimum. As we will see, adding momentum is one way to smooth the trajectory toward the minimum value.
Momentum refers to the tendency of moving objects to keep moving in the same direction. In practice, we can add momentum to gradient descent by taking into account previous values of the gradient. This can be done as follows:
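One common formulation (the one PyTorch's SGD optimizer uses when momentum is enabled) keeps a running sum m of past gradients and steps along it:

m_t = γ m_{t−1} + ∇Loss(θ_t)
θ_{t+1} = θ_t − η m_t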
The higher the value of γ, the more past values of the gradient are taken into account in the current update.
In the next plot, I show the trajectories implied by the gradient descent algorithm with momentum (in blue) and without momentum (in white).
Momentum reduces the fluctuations in the slope coefficient. The big swings up and down tend to cancel out once the averaging effect of momentum starts to kick in. As a result, with momentum we move faster in the direction of the true value.
Momentum is a nice twist to gradient descent. Another line of improvement consists in introducing a learning rate that is tailored to each parameter (in our example: one learning rate for the slope, one learning rate for the intercept).
But how should we choose such a coefficient-specific learning rate? Note that the previous plots show that the gradient does not necessarily point toward the minimum, at least not during the first iterations.
Intuitively, we would like to give less weight to the moves in the up/down direction, and more weight to the moves in the left/right direction. The RMSprop updating rule embeds this desired property:
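With element-wise squares, square roots, and division, and ε a small constant that prevents division by zero:

g_t = ∇Loss(θ_t)
v_t = γ v_{t−1} + (1 − γ) g_t²
θ_{t+1} = θ_t − η g_t / (√v_t + ε)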
The first line just defines g to be the gradient of the loss function. The second line says that we calculate a running average of the square of the gradient. In the third line, we take a step in the direction given by the gradient, but rescaled by the square root of the running average of past squared gradients.
In our example, because the square of the gradient tends to be large for the slope coefficient, we take small steps in that direction. The opposite is true for the intercept coefficient (small squared gradients, big moves).
The Adam optimization algorithm has momentum, as well as the adaptive learning rate of RMSprop. Below is almost what Adam does:
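The running averages m and v play the same roles as in the momentum and RMSprop updates, with decay rates β₁ and β₂:

g_t = ∇Loss(θ_t)
m_t = β₁ m_{t−1} + (1 − β₁) g_t
v_t = β₂ v_{t−1} + (1 − β₂) g_t²
θ_{t+1} = θ_t − η m_t / (√v_t + ε)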
The updating rule is very similar to the one of RMSprop. The key difference is momentum: the direction of change is given by a running average of past gradients.
The actual Adam updating rule uses "bias-corrected" values for m and v. In the first step, Adam initializes m and v to zero. To correct for the initialization bias, the authors suggest using reweighted versions of m and v:
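With t the iteration index, the bias-corrected values and the resulting update are:

m̂_t = m_t / (1 − β₁^t)
v̂_t = v_t / (1 − β₂^t)
θ_{t+1} = θ_t − η m̂_t / (√v̂_t + ε)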
Below, we see that the trajectory induced by Adam is quite similar to the one given by RMSprop, but with a slower start.
The next plot shows the trajectories induced by the four optimization algorithms described above.
Key results are as follows:
- Gradient descent with momentum has fewer fluctuations than gradient descent without momentum.
- Adam and RMSprop take a different route, moving more slowly in the slope dimension and faster in the intercept dimension.
- As expected, Adam displays some momentum: while RMSprop starts turning left toward the minimum, Adam has a harder time turning because of the accumulated momentum.
Below is the same graph, but in 3D:
In this blog post, my objective was for the reader to build an intuitive understanding of key optimization algorithms used in machine learning.
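To make the four update rules concrete, here is a minimal NumPy sketch of them applied to this regression problem. It is a simplified illustration, not the code used for the figures (those rely on the PyTorch optimizers shown below), and the hyperparameter values (γ = 0.9, β₁ = 0.9, β₂ = 0.999, ε = 1e-8, 100 iterations) are just reasonable defaults:
# Minimal sketch of the four update rules on the OLS loss (illustration only)
import numpy as np
rng = np.random.default_rng(0)
x1 = rng.uniform(-5, 5, 1000)
X = np.column_stack((np.ones_like(x1), x1))       # design matrix: intercept + slope
y = -2.0 + 2.0 * x1 + rng.normal(0, 0.001, 1000)  # same true parameters as in the full script
def grad(beta):
    # Gradient of Loss(a, b) = (1/m) ||y - X beta||^2
    return (2 / len(y)) * X.T @ (X @ beta - y)
def run(rule, beta0=(9.0, 2.0), lr=0.1, gamma=0.9, beta1=0.9, beta2=0.999, eps=1e-8, n_iter=100):
    beta = np.array(beta0, dtype=float)
    m = np.zeros_like(beta)  # momentum buffer / first moment
    v = np.zeros_like(beta)  # running average of squared gradients
    for t in range(1, n_iter + 1):
        g = grad(beta)
        if rule == "gd":
            beta -= lr * g
        elif rule == "momentum":
            m = gamma * m + g
            beta -= lr * m
        elif rule == "rmsprop":
            v = gamma * v + (1 - gamma) * g**2
            beta -= lr * g / (np.sqrt(v) + eps)
        elif rule == "adam":
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g**2
            m_hat = m / (1 - beta1**t)  # bias correction
            v_hat = v / (1 - beta2**t)
            beta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return beta
for rule in ["gd", "momentum", "rmsprop", "adam"]:
    print(rule, run(rule))  # each trajectory should move toward the true parameters (-2.0, 2.0)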
Below you will find the code that was used to produce the graphs in this post. Don't hesitate to modify the learning rate and/or the loss function to see how this affects the different trajectories.
—
The next block of code loads the dependencies, defines the loss function, and plots it (surface and contour plots):
# A. Dependencies
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator
plot_scale = 1.25
plt.rcParams["figure.figsize"] = (plot_scale*16, plot_scale*9)
import numpy as np
import pandas as pd
import random
import scipy.stats
from itertools import product
import os
import time
from math import sqrt
import seaborn as sns; sns.set()
from tqdm import tqdm
import datetime
from typing import Tuple
class Vector: pass
from scipy.stats import norm
import torch
from torch import nn
from torch.utils.data import DataLoader
import copy
import matplotlib.ticker as mtick
from torchcontrib.optim import SWA
from numpy import linalg as LA
import imageio as io # create gif
# B. Create OLS problem
b0 = -2.0 # intercept
b1 = 2.0 # slope
beta_true = (b0, b1)
nb_vals = 1000 # number of draws
mu, sigma = 0, 0.001 # mean and standard deviation
shocks = np.random.normal(mu, sigma, nb_vals)
# covariate
x0 = np.ones(nb_vals) # constant
x1 = np.random.uniform(-5, 5, nb_vals)
X = np.column_stack((x0, x1))
# Data
y = b0*x0 + b1*x1 + shocks
# Closed-form OLS solution (sanity check)
A = np.linalg.inv(np.matmul(np.transpose(X), X))
B = np.matmul(np.transpose(X), y)
np.matmul(A, B)
X_torch = torch.from_numpy(X).float()
y_torch = torch.from_numpy(y).float()
# Loss function and gradient (for plotting)
def loss_function_OLS(beta_hat, X, y):
    loss = (1/len(y))*np.sum(np.square(y - np.matmul(X, beta_hat)))
    return loss
def grad_OLS(beta_hat, X, y):
    mse = loss_function_OLS(beta_hat, X, y)
    G = (2/len(y))*np.matmul(np.transpose(X), np.matmul(X, beta_hat) - y)
    return G, mse
# C. Plots for the loss function
min_val = -10.0
max_val = 10.0
delta_grid = 0.05
x_grid = np.arange(min_val, max_val, delta_grid)
y_grid = np.arange(min_val, max_val, delta_grid)
X_grid, Y_grid = np.meshgrid(x_grid, y_grid)
Z = np.zeros((len(y_grid), len(x_grid)))
for (y_index, y_value) in enumerate(y_grid):
    for (x_index, x_value) in enumerate(x_grid):
        beta_local = np.array((x_value, y_value))
        Z[y_index, x_index] = loss_function_OLS(beta_local, X, y)
# Loss evaluated at the true parameters (z-coordinate of the minimum)
true_value = loss_function_OLS(np.array(beta_true), X, y)
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
# Plot the surface.
surf = ax.plot_surface(X_grid, Y_grid, Z, cmap=cm.coolwarm, linewidth=0, antialiased=False, alpha=0.2)
ax.zaxis.set_major_locator(LinearLocator(10))
ax.zaxis.set_major_formatter('{x:.02f}')
ax.scatter([b0], [b1], [true_value], s=100, c='black', linewidth=0.5)
x_min = -10
x_max = -x_min
y_min = x_min
y_max = -x_min
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.ylabel('Slope')
plt.xlabel('Intercept')
fig.colorbar(surf, shrink=0.5, aspect=5)
filename = "IMGS/surface_loss.png"
plt.savefig(filename)
plt.show()
# Plot contour
cp = plt.contour(X_grid, Y_grid, np.sqrt(Z), colors='black', linestyles='dashed', linewidths=1, alpha=0.5)
plt.clabel(cp, inline=1, fontsize=10)
cp = plt.contourf(X_grid, Y_grid, np.sqrt(Z))
plt.scatter([b0], [b1], s=100, c='white', linewidth=0.5)
plt.ylabel('Slope')
plt.xlabel('Intercept')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
filename = "IMGS/contour_loss.png"
plt.savefig(filename)
plt.show()
The next block of code defines functions so that we can solve the OLS problem using PyTorch. Here, using PyTorch is overkill, but the advantage is that we can use the pre-coded minimization algorithms (torch.optim):
def loss_OLS(model, y, X):
    """
    Loss function for OLS
    """
    R_squared = torch.square(y.unsqueeze(1) - model(X[:,1].unsqueeze(1)))
    return torch.mean(R_squared)
def set_initial_values(model, w, b):
    """
    Function to set the weight and bias to certain values
    """
    with torch.no_grad():
        for name, param in model.named_parameters():
            if 'linear_relu_stack.0.weight' in name:
                param.copy_(torch.tensor([w]))
            elif 'linear_relu_stack.0.bias' in name:
                param.copy_(torch.tensor([b]))
def create_optimizer(model, optimizer_name, lr, momentum):
    """
    Function to define an optimizer
    """
    if optimizer_name == "Adam":
        optimizer = torch.optim.Adam(model.parameters(), lr)
    elif optimizer_name == "SGD":
        optimizer = torch.optim.SGD(model.parameters(), lr)
    elif optimizer_name == "SGD-momentum":
        optimizer = torch.optim.SGD(model.parameters(), lr, momentum)
    elif optimizer_name == "Adadelta":
        optimizer = torch.optim.Adadelta(model.parameters(), lr)
    elif optimizer_name == "RMSprop":
        optimizer = torch.optim.RMSprop(model.parameters(), lr)
    else:
        raise ValueError("optimizer unknown")
    return optimizer
def train_model(optimizer_name, initial_guess, true_value, lr, momentum):
    """
    Function to train a model
    """
    # initialize a model
    model = NeuralNetwork().to(device)
    #print(model)
    set_initial_values(model, initial_guess[0], initial_guess[1])
    for name, param in model.named_parameters():
        print(name, param)
    model.train()
    nb_epochs = 100
    use_scheduler = False
    freq_scheduler = 100
    freq_gamma = 0.95
    true_b = torch.tensor([true_value[0], true_value[1]])
    print(optimizer_name)
    optimizer = create_optimizer(model, optimizer_name, lr, momentum)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=freq_gamma)
    # store mean loss by epoch
    loss_epochs = torch.zeros(nb_epochs)
    list_perc_abs_error = [] # store abs value percentage error
    list_perc_abs_error_i = [] # store index i
    list_perc_abs_error_loss = [] # store loss
    list_norm_gradient = [] # store norm of gradient
    list_gradient = [] # store the gradient itself
    list_beta = [] # store parameters
    calculate_variance_grad = False
    freq_loss = 1
    freq_display = 10
    for i in tqdm(range(0, nb_epochs)):
        optimizer.zero_grad()
        # Calculate the loss
        loss = loss_OLS(model, y_torch, X_torch)
        loss_epochs[[i]] = float(loss.item())
        # Store the current parameters
        with torch.no_grad():
            # Extract weight and bias
            b_current = np.array([k.item() for k in model.parameters()])
            b_current_ordered = np.array((b_current[1], b_current[0])) # reorder (bias, weight)
            list_beta.append(b_current_ordered)
            perc_abs_error = np.sum(np.square(b_current_ordered - true_b.detach().numpy()))
            list_perc_abs_error.append(np.median(perc_abs_error))
            list_perc_abs_error_i.append(i)
            list_perc_abs_error_loss.append(float(loss.item()))
        # Calculate the gradient
        loss.backward()
        # Store the gradient
        with torch.no_grad():
            grad = np.zeros(2)
            for (index_p, p) in enumerate(model.parameters()):
                grad[index_p] = p.grad.detach().data
            # reorder (bias, weight)
            grad_ordered = np.array((grad[1], grad[0]))
            list_gradient.append(grad_ordered)
        # Take a gradient step
        optimizer.step()
        if i % freq_display == 0: # Monitor the loss
            loss, current = float(loss.item()), i
            print(f"loss: {loss:>7f}, percentage abs. error {list_perc_abs_error[-1]:>7f}, [{current:>5d}/{nb_epochs:>5d}]")
        if (i % freq_scheduler == 0) and (i != 0) and (use_scheduler == True):
            scheduler.step()
            print("i : {}. Decreasing learning rate: {}".format(i, scheduler.get_last_lr()))
    return model, list_beta, list_gradient
def create_gif(filenames, output_name):
    """
    Function to create a gif, using a list of images
    """
    with io.get_writer(output_name, mode='I') as writer:
        for filename in filenames:
            image = io.imread(filename)
            writer.append_data(image)
    # Remove files, except the final one
    for index_file, filename in enumerate(filenames):
        if index_file < len(filenames) - 1:
            os.remove(filename)
# Define a neural network with a single node
# Get cpu or gpu device for training.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")
nb_nodes = 1
# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(1, nb_nodes)
        )
    def forward(self, x):
        out = self.linear_relu_stack(x)
        return out
Minimization using gradient descent:
lr = 0.10 # learning rate
alpha = lr
init = (9.0, 2.0) # initial guess
true_value = [-2.0, 2.0] # true value for parameters
# I. Solve
optimizer_name = "SGD"
momentum = 0.0
model_SGD, list_beta_SGD, list_gradient_SGD = train_model(optimizer_name, init, true_value, lr, momentum)
# II. Create gif
filenames = []
zoom = 1 # to increase/decrease the length of vectors on the plot
max_index_plot = 30 # when to stop plotting
# Plot contour
cp = plt.contour(X_grid, Y_grid, np.sqrt(Z), colors='black', linestyles='dashed', linewidths=1, alpha=0.5)
plt.clabel(cp, inline=1, fontsize=10)
cp = plt.contourf(X_grid, Y_grid, np.sqrt(Z))
# Add points and arrows
for (index, (bb, grad)) in enumerate(zip(list_beta_SGD, list_gradient_SGD)):
    if index > max_index_plot:
        break
    if index == 0:
        label_1 = "SGD"
    else:
        label_1 = ""
    # Point
    plt.scatter([bb[0]], [bb[1]], s=10, c='white', linewidth=5.0, label=label_1)
    # Arrow for the gradient
    plt.arrow(bb[0], bb[1], - zoom * alpha * grad[0], - zoom * alpha * grad[1], color='white')
    # create file name and append it to a list
    filename = "IMGS/path_SGD_{}.png".format(index)
    filenames.append(filename)
    plt.xlabel('cst')
    plt.ylabel('slope')
    plt.legend()
    plt.savefig(filename)
filename = "IMGS/path_SGD.png"
plt.savefig(filename)
create_gif(filenames, "SGD.gif")
plt.show()
Minimization using gradient descent with momentum:
optimizer_name = "SGD-momentum"
momentum = 0.2# I. Resolve
model_momentum, list_beta_momentum, list_gradient_momentum = train_model(optimizer_name , init, true_value, lr, momentum)
# II. Create gif
filenames = []
zoom=1 #to extend/lower the size of vectors on the plot
max_index_plot = 30 #when to cease plotting
# Plot contour
cp = plt.contour(X_grid, Y_grid, np.sqrt(Z), colours='black', linestyles='dashed', linewidths=1, alpha=0.5)
plt.clabel(cp, inline=1, fontsize=10)
cp = plt.contourf(X_grid, Y_grid, np.sqrt(Z))
# Add factors and arrows
for (index, (bb, grad, bb_momentum, grad_momentum)) in enumerate(zip(list_beta_SGD, list_gradient_SGD, list_beta_momentum, list_gradient_momentum)):
if index>max_index_plot:
break
if index == 0:
label_1 = "SGD"
label_2 = "SGD-momentum"
else:
label_1 = ""
label_2 = ""
# Level
plt.scatter([bb[0]], [bb[1]], s=10, c='white', linewidth=5.0, label=label_1)
plt.scatter([bb_momentum[0]], [bb_momentum[1]], s=10, c='blue', linewidth=5.0, alpha=0.5, label=label_2)
# Arrows
#plt.arrow(bb_momentum[0], bb_momentum[1], - zoom * alpha* grad[0], - zoom * alpha * grad[1], coloration='white')
plt.arrow(bb_momentum[0], bb_momentum[1], - zoom * alpha* grad_momentum[0], - zoom * alpha * grad_momentum[1], coloration="blue")
# create file title and append it to a listing
filename = "IMGS/path_SGD_momentum_{}.png".format(index)
filenames.append(filename)
plt.xlabel('cst')
plt.ylabel('slope')
plt.legend()
plt.savefig(filename)
filename = "IMGS/path_SGD_momentum.png"
plt.savefig(filename)
create_gif(filenames, "SGD_momentum.gif")
plt.present()
Minimization using RMSprop:
optimizer_name = "RMSprop"
momentum = 0.0
# I. Resolve
model_RMSprop, list_beta_RMSprop, list_gradient_RMSprop = train_model(optimizer_name , init, true_value, lr, momentum)# II. Create gif
filenames = []
zoom=1 #to extend/lower the size of vectors on the plot
max_index_plot = 30 #when to cease plotting
# Plot contour
cp = plt.contour(X_grid, Y_grid, np.sqrt(Z), colours='black', linestyles='dashed', linewidths=1, alpha=0.5)
plt.clabel(cp, inline=1, fontsize=10)
cp = plt.contourf(X_grid, Y_grid, np.sqrt(Z))
# Add factors and arrows
for (index, (bb, grad, bb_RMSprop, grad_RMSprop)) in enumerate(zip(list_beta_SGD, list_gradient_SGD, list_beta_RMSprop, list_gradient_RMSprop)):
if index>max_index_plot:
break
if index == 0:
label_1 = "SGD"
label_2 = "RMSprop"
else:
label_1 = ""
label_2 = ""
# Level
plt.scatter([bb[0]], [bb[1]], s=10, c='white', linewidth=5.0, label=label_1)
plt.scatter([bb_RMSprop[0]], [bb_RMSprop[1]], s=10, c='blue', linewidth=5.0, alpha=0.5, label=label_2)
# Arrows
plt.arrow(bb_RMSprop[0], bb_RMSprop[1], - zoom * alpha* grad_RMSprop[0], - zoom * alpha * grad_RMSprop[1], coloration="blue")
# create file title and append it to a listing
filename = "IMGS/path_RMSprop_{}.png".format(index)
filenames.append(filename)
plt.xlabel('cst')
plt.ylabel('slope')
plt.legend()
plt.savefig(filename)
filename = "IMGS/path_RMSprop.png"
plt.savefig(filename)
create_gif(filenames, "RMSprop.gif")
plt.present()
Minimization using Adam:
optimizer_name = "Adam"
momentum = 0.0# I. Resolve
model_Adam, list_beta_Adam, list_gradient_Adam = train_model(optimizer_name , init, true_value, lr, momentum)
# II. Create gif
filenames = []
zoom=1 #to extend/lower the size of vectors on the plot
max_index_plot = 30 #when to cease plotting
# Plot contour
cp = plt.contour(X_grid, Y_grid, np.sqrt(Z), colours='black', linestyles='dashed', linewidths=1, alpha=0.5)
plt.clabel(cp, inline=1, fontsize=10)
cp = plt.contourf(X_grid, Y_grid, np.sqrt(Z))
# Add factors and arrows
for (index, (bb, grad, bb_Adam, grad_Adam)) in enumerate(zip(list_beta_SGD, list_gradient_SGD, list_beta_Adam, list_gradient_Adam)):
if index>max_index_plot:
break
if index == 0:
label_1 = "SGD"
label_2 = "Adam"
else:
label_1 = ""
label_2 = ""
# Level
plt.scatter([bb[0]], [bb[1]], s=10, c='white', linewidth=5.0, label=label_1)
plt.scatter([bb_Adam[0]], [bb_Adam[1]], s=10, c='blue', linewidth=5.0, alpha=0.5, label=label_2)
# Arrows
plt.arrow(bb_Adam[0], bb_Adam[1], - zoom * alpha* grad_Adam[0], - zoom * alpha * grad_Adam[1], coloration="blue")
# create file title and append it to a listing
filename = "IMGS/path_Adam_{}.png".format(index)
filenames.append(filename)
plt.xlabel('cst')
plt.ylabel('slope')
plt.legend()
plt.savefig(filename)
filename = "IMGS/path_Adam.png"
plt.savefig(filename)
create_gif(filenames, "Adam.gif")
plt.present()
Creating the “Grasp plot” with the 4 trajectories collectively:
max_iter = 100
freq_plot = 1 # how often to add a frame (every iteration)
filenames = []
cp = plt.contour(X_grid, Y_grid, np.sqrt(Z), colors='black', linestyles='dashed', linewidths=1, alpha=0.5)
plt.clabel(cp, inline=1, fontsize=10)
cp = plt.contourf(X_grid, Y_grid, np.sqrt(Z))
colors = ["white", "blue", "green", "red"]
# Add points:
for (index, (bb_SGD, bb_momentum, bb_RMSprop, bb_Adam)) in enumerate(zip(list_beta_SGD, list_beta_momentum, list_beta_RMSprop, list_beta_Adam)):
    if index % freq_plot == 0:
        if index == 0:
            label_1 = "SGD"
            label_2 = "SGD-momentum"
            label_3 = "RMSprop"
            label_4 = "Adam"
        else:
            label_1, label_2, label_3, label_4 = "", "", "", ""
        plt.scatter([bb_SGD[0]], [bb_SGD[1]], s=10, linewidth=5.0, label=label_1, color=colors[0])
        plt.scatter([bb_momentum[0]], [bb_momentum[1]], s=10, linewidth=5.0, alpha=0.5, label=label_2, color=colors[1])
        plt.scatter([bb_RMSprop[0]], [bb_RMSprop[1]], s=10, linewidth=5.0, alpha=0.5, label=label_3, color=colors[2])
        plt.scatter([bb_Adam[0]], [bb_Adam[1]], s=10, linewidth=5.0, alpha=0.5, label=label_4, color=colors[3])
        if index > max_iter:
            break
        # create file name and append it to a list
        filename = "IMGS/img_{}.png".format(index)
        filenames.append(filename)
        plt.xlabel('cst')
        plt.ylabel('slope')
        plt.legend()
        # save frame
        plt.savefig(filename)
#plt.close()
# build gif
create_gif(filenames, "compare_optim_algos.gif")
Creating the 3D "Master plot":
max_iter = 100
filenames = []
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
# Plot the surface.
surf = ax.plot_surface(X_grid, Y_grid, Z, cmap=cm.coolwarm, linewidth=0, antialiased=False, alpha=0.1)
ax.zaxis.set_major_locator(LinearLocator(10))
ax.zaxis.set_major_formatter('{x:.02f}')
ax.view_init(60, 35)
colors = ["black", "blue", "green", "red"]
x_min = -10
x_max = -x_min
y_min = x_min
y_max = -x_min
# Add points:
for (index, (bb_SGD, bb_momentum, bb_RMSprop, bb_Adam)) in enumerate(zip(list_beta_SGD, list_beta_momentum, list_beta_RMSprop, list_beta_Adam)):
    if index == 0:
        label_1 = "SGD"
        label_2 = "SGD-momentum"
        label_3 = "RMSprop"
        label_4 = "Adam"
    else:
        label_1, label_2, label_3, label_4 = "", "", "", ""
    ax.scatter([bb_SGD[0]], [bb_SGD[1]], s=100, linewidth=5.0, label=label_1, color=colors[0])
    ax.scatter([bb_momentum[0]], [bb_momentum[1]], s=100, linewidth=5.0, alpha=0.5, label=label_2, color=colors[1])
    ax.scatter([bb_RMSprop[0]], [bb_RMSprop[1]], s=100, linewidth=5.0, alpha=0.5, label=label_3, color=colors[2])
    ax.scatter([bb_Adam[0]], [bb_Adam[1]], s=100, linewidth=5.0, alpha=0.5, label=label_4, color=colors[3])
    if index > max_iter:
        break
    # create file name and append it to a list
    filename = "IMGS/img_{}.png".format(index)
    filenames.append(filename)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.ylabel('Slope')
    plt.xlabel('Intercept')
    plt.legend()
    # save frame
    plt.savefig(filename)
filename = "IMGS/surface_loss.png"
plt.savefig(filename)
plt.show()
create_gif(filenames, "surface_compare_optim_algos.gif")
—
- Ruder, Sebastian. "An overview of gradient descent optimization algorithms." arXiv preprint arXiv:1609.04747 (2016).
- Sutskever, Ilya, et al. "On the importance of initialization and momentum in deep learning." International Conference on Machine Learning. PMLR, 2013.
Really good series of videos on this topic: