dtorch.optim
============

Description
-----------

This module contains optimizers for machine learning. The role of an optimizer is to apply the computed gradients, together with additional optimization strategies, so that the model trains faster. The most commonly used is the ``Adam`` optimizer. Here's an example of its usage:

.. code:: python

    from dtorch import optim

    # create it by passing the model parameters to it
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    ...

    # during training
    for input, target in dataset:
        # reset all the gradients in the network to 0, otherwise they accumulate
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        # after computing the gradients with the .backward method, call the optimizer to apply them
        optimizer.step()

The ``lr`` parameter is the learning rate. It acts like a velocity controlling how fast the weights move towards the solution. A high learning rate seems good at first, but it usually performs poorly once the approximation needs more precision. You may think of it as a video game character with a speed of 1000 trying to stop at a specific location.

Optimizers
----------

.. py:class:: Optimizer

    The optimizer base class.

    .. py:method:: zero_grad()

        Reinitialize the gradients of the parameters managed by the optimizer.

    .. py:method:: step()

        Apply the gradients with the optimizations provided.

.. py:class:: SGD(Optimizer)

    An implementation of the Stochastic Gradient Descent update step, with momentum if specified.

    .. py:method:: __init__(params : list[Parameter] | list[OptimizerParam], lr : float = 1e-3, momentum : float = 0.0)

        :param Union[list[Parameter],list[OptimizerParam]] params: the parameters of a model, or a list of parameters with their settings.
        :param float lr: The learning rate of the optimizer.
        :param float momentum: the amount of velocity used when applying the gradients to the weights. Between [0, 1].

    With :math:`p_i` being the parameter i and :math:`g_i` being the gradient of parameter i, if there is no momentum parameter:

    :math:`p_{i,t} = p_{i,t-1} - lr * g_{i, t}`

    Otherwise, with *v* being the velocity and *m* the momentum:

    | :math:`v_t = m * v_{t-1} + (1 - m) * g_{i, t}`
    | :math:`p_{i,t} = p_{i,t-1} - lr * v_t`

.. py:class:: Adam(Optimizer)

    Adaptive Moment Estimation is a gradient descent algorithm that adapts the learning rate of each parameter over time.

    .. py:method:: __init__(params : list[Parameter], lr : float = 0.001, betas : Tuple[float, float] = (0.9, 0.999), eps : float = 1e-08, weight_decay : float = 0.0)

        :param list[Parameter] params: the parameters of a model, or a list of parameters with their settings.
        :param Tuple[float, float] betas: the exponential decay rates for the first and second moment estimates.
        :param float lr: The learning rate of the optimizer.
        :param float eps: A small constant added to the denominator for numerical stability.
        :param float weight_decay: A multiplier implementing L2-style regularization.

    With :math:`p_i` being the parameter i and :math:`g_i` being the gradient of parameter i, if there is a weight_decay (*wd*) parameter:

    :math:`g_i = g_i + wd * p_{i, t-1}`

    Then, with :math:`m_t, v_t` being the first and second moment vectors and :math:`B_1, B_2` the betas multipliers:

    | :math:`m_t = B_1 * m_{t-1} + (1 - B_1) * g_i`
    | :math:`v_t = B_2 * v_{t-1} + (1 - B_2) * g_i^2`
    | :math:`a = m_t / (1 - B_1^t)`
    | :math:`b = v_t / (1 - B_2^t)`
    | :math:`p_{i, t} = p_{i, t-1} - lr * a / (\sqrt{b} + eps)`
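Update rule sketches
--------------------

To make the SGD update rule above concrete, here is a minimal NumPy sketch of a single step, with and without momentum, following the formulas in this page. It is purely illustrative: the function name ``sgd_step``, its signature, and the variable names are assumptions for this example and are not part of the dtorch API.

.. code:: python

    import numpy as np

    def sgd_step(param, grad, velocity, lr=1e-3, momentum=0.0):
        """One SGD update on a single parameter array (illustrative helper, not dtorch API)."""
        if momentum == 0.0:
            # plain SGD: p_t = p_{t-1} - lr * g_t
            return param - lr * grad, velocity
        # momentum SGD: v_t = m * v_{t-1} + (1 - m) * g_t ; p_t = p_{t-1} - lr * v_t
        velocity = momentum * velocity + (1.0 - momentum) * grad
        return param - lr * velocity, velocity

    # usage on a toy parameter
    p = np.array([1.0, -2.0])
    g = np.array([0.5, 0.1])
    v = np.zeros_like(p)
    p, v = sgd_step(p, g, v, lr=0.01, momentum=0.9)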
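Similarly, here is a sketch of a single Adam step following the formulas above, including the optional weight decay term and the bias-corrected moment estimates. Again, ``adam_step`` and its signature are illustrative assumptions, not dtorch functions.

.. code:: python

    import numpy as np

    def adam_step(param, grad, m, v, t, lr=0.001, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
        """One Adam update on a single parameter array (illustrative helper, not dtorch API)."""
        b1, b2 = betas
        if weight_decay != 0.0:
            # L2-style regularization folded into the gradient: g = g + wd * p
            grad = grad + weight_decay * param
        # biased first and second moment estimates
        m = b1 * m + (1.0 - b1) * grad
        v = b2 * v + (1.0 - b2) * grad ** 2
        # bias correction (t is the 1-based step count)
        a = m / (1.0 - b1 ** t)
        b = v / (1.0 - b2 ** t)
        # parameter update
        param = param - lr * a / (np.sqrt(b) + eps)
        return param, m, v

    # usage on a toy parameter, at step t = 1
    p = np.array([1.0, -2.0])
    g = np.array([0.5, 0.1])
    m = np.zeros_like(p)
    v = np.zeros_like(p)
    p, m, v = adam_step(p, g, m, v, t=1)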