dtorch.optim
Description
This module contains optimizers for machine learning.
The role of an optimizer is to apply the computed gradients to the model's parameters, using additional optimization techniques to make training faster.
The most commonly used one is the Adam optimizer.
Here's an example of how an optimizer is used:
from dtorch import optim
# create it by passing the model parameters to it
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
...
# during training
for input, target in dataset:
    # reset all the gradients in the network to zero, otherwise they would accumulate across iterations
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    # after computing the gradients with .backward(), call the optimizer to apply them
    optimizer.step()
The lr parameter is the learning rate. It acts like a velocity, controlling how fast the weights move towards the solution.
A high learning rate seems good at first, but it usually performs poorly once the approximation needs more precision.
You may think of it as a video game character with a speed of 1000 trying to reach a specific spot: it keeps overshooting the target.
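As a purely illustrative sketch (plain Python, not using dtorch), here is gradient descent on \(f(x) = x^2\) with two learning rates: the small one converges towards the minimum, while the large one overshoots further away at every step.
# gradient descent on f(x) = x ** 2, whose gradient is 2 * x
def descend(lr, steps=10):
    x = 5.0
    for _ in range(steps):
        x = x - lr * (2 * x)  # p = p - lr * g
    return x

print(descend(lr=0.1))  # ~0.54: steadily approaching the minimum at 0
print(descend(lr=1.5))  # 5120.0: overshooting further and further away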
Optimizers
- class Optimizer
The optimizer base class.
- zero_grad()
Resets the gradients of all parameters managed by the optimizer to zero.
- step()
Applies the gradients to the parameters, using the optimizer's update rule.
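For illustration only, here is a minimal stand-in showing the contract this base class describes; the attribute names and the .grad field are assumptions, not dtorch's actual internals.
class TinyOptimizer:
    """Illustrative stand-in for the Optimizer base class, not dtorch's actual code."""

    def __init__(self, params):
        # assumed: parameters expose a .grad attribute holding their gradient
        self._params = list(params)

    def zero_grad(self):
        # zero out every managed gradient so the next backward pass does not accumulate
        for p in self._params:
            p.grad = 0.0

    def step(self):
        # subclasses such as SGD or Adam implement the actual update rule
        raise NotImplementedError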
- class SGD(Optimizer)
An implementation of the Stochastic Gradient Descent update rule, with momentum if specified.
- __init__(params: list[Parameter] | list[OptimizerParam], lr=1e-3, momentum: float = 0.0)
- Parameters:
params (Union[list[Parameter], list[OptimizerParam]]) – the parameters of a model, or a list of parameters together with their settings.
lr (float) – The learning rate of the optimizer.
momentum (float) – the amount of velocity used when applying the gradients to the weights. Must be in [0, 1].
With \(p_{i,t}\) being the parameter i at step t and \(g_{i,t}\) being its gradient.
If there is no momentum: \(p_{i,t} = p_{i,t-1} - lr * g_{i,t}\)
Otherwise, with v being the velocity and m being the momentum:
\(v_t = m * v_{t-1} + (1 - m) * g_{i,t}\)
\(p_{i,t} = p_{i,t-1} - lr * v_t\)
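A minimal sketch of this update rule on plain Python floats (not dtorch's actual implementation), assuming one velocity entry per parameter:
def sgd_step(params, grads, velocities, lr=1e-3, momentum=0.0):
    # params, grads and velocities are parallel lists of floats
    for i in range(len(params)):
        if momentum == 0.0:
            params[i] -= lr * grads[i]  # p_{i,t} = p_{i,t-1} - lr * g_{i,t}
        else:
            velocities[i] = momentum * velocities[i] + (1 - momentum) * grads[i]
            params[i] -= lr * velocities[i]  # p_{i,t} = p_{i,t-1} - lr * v_t
    return params, velocities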
- class Adam(Optimizer)
Adaptive Moment Estimation is a gradient descent algorithm that adapts the learning rate of each parameter over time.
- __init__(params: list[Parameter], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0)
- Parameters:
params (list[Parameter]) – the parameters of a model, or a list of parameters together with their settings.
lr (float) – The learning rate of the optimizer.
betas (Tuple[float, float]) – the decay rates of the first (momentum) and second (adaptive learning rate) moment estimates.
eps (float) – a small constant added to the denominator for numerical stability.
weight_decay (float) – an L2-type regularization multiplier applied to the parameters.
With \(p_{i,t}\) being the parameter i at step t and \(g_i\) being its gradient.
If there is a weight_decay (wd) parameter: \(g_i = g_i + wd * p_{i,t}\)
Then, with \(m_t, v_t\) being the moment vectors and \(\beta_1, \beta_2\) being the betas coefficients:
\(m_t = \beta_1 * m_{t-1} + (1 - \beta_1) * g_i\)
\(v_t = \beta_2 * v_{t-1} + (1 - \beta_2) * g_i^2\)
\(a = m_t / (1 - \beta_1^t)\)
\(b = v_t / (1 - \beta_2^t)\)
\(p_{i,t} = p_{i,t-1} - lr * a / (\sqrt{b} + eps)\)
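A minimal sketch of this update on plain Python floats (not dtorch's actual implementation), with t being the 1-based step count:
import math

def adam_step(params, grads, m, v, t, lr=0.001, betas=(0.9, 0.999),
              eps=1e-08, weight_decay=0.0):
    # params, grads, m and v are parallel lists of floats
    b1, b2 = betas
    for i in range(len(params)):
        g = grads[i] + weight_decay * params[i]  # fold the L2 penalty into the gradient
        m[i] = b1 * m[i] + (1 - b1) * g          # first moment estimate
        v[i] = b2 * v[i] + (1 - b2) * g * g      # second moment estimate
        a = m[i] / (1 - b1 ** t)                 # bias-corrected first moment
        b = v[i] / (1 - b2 ** t)                 # bias-corrected second moment
        params[i] -= lr * a / (math.sqrt(b) + eps)
    return params, m, v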