The Theory Behind the Learners
******************************

.. currentmodule:: autoaug.autoaugment_learners


Random Search Learner (:class:`RsLearner`)
##########################################

This learner is a purely randomised searcher, yet despite this it 
is a hard baseline to beat, as it is in many automatic hyperparameter 
tuning problem settings :cite:t:`randomsearch`. Each subpolicy is found by the following 
process; an augmentation is chosen from the available selection with 
an equal probability, followed by a probability from 0.0 to 1.0 in 
incrementations of 0.1, which, again, are equally likely. If the 
selected augmentation requires a magnitude, we use Python's random 
module to select a random integer between 0 and 9 inclusive, otherwise 
the magnitude is set to None. This is repeated to generate a complete 
subpolicy, and can be joined into a policy to be tested. The 
corresponding accuracy of the child network is then stored with the 
policy for logging.


Genetic Learner (:class:`GenLearner`)
#####################################

The genetic learner has similar elements to the Random Search 
learner, but uses information from previous sub-policies when generating 
new ones to more efficiently search for optimal augmentation parameters. 
The methodology for this learner is shown in :ref:`gen_pseudocode`;


In this algorithm each subpolicy is represented as a binary string. 
For example the subpolicy ((ShearX, 0.9, 3), (TranslateY, 0.1, 7)) 
would be represented as '000010010011001100010111' where each 
augmentation, probability, and magnitude is parsed as a 4-bit long 
digit. We then rank the effectiveness of a policy by the accuracy of 
the trained child network implementing it, and carry this by storing 
tuples of (policy, accuracy). We randomly select parents as previous 
policies, where the probability of selection is weighted by the 
corresponding accuracy. A child is then produced using the random 
crossover method, and this is taken as the next policy to test. In the
case that the binary number produced from this method did not 
correspond to an available augmentation, probability, or magnitude, 
or if edge cases were found (e.g. an augmentation with required 
magnitude None was assigned a float value), such cases were resolved 
with uniform random selection.


.. _gen_pseudocode:

.. figure:: /img/GenLearner_pseudo.jpg
   :alt: Genlutionary Learner (GenLearner) pseudocode
   :align: center
   :width: 700px

   GenLearner Pseudocode


Evolutionary Learner (:class:`EvoLearner`)
##########################################


The Evolutionary learner is similar in principle to the Genetic 
learner except that instead of the subpolicy being expressed as 
genes, the genes are expressed via the weights of a neural network. 
In this library :cite:t:`pygad` is used, which allows an easy 
implementation of such an evolutionary approach that is compatible 
with PyTorch networks. The algorithm is shown in :numref:`evo_pseudocode`;. 
For this learner a child network and controller are required. The 
architecture of the child network must be able to pass images and 
classify them, whereas the controller network must be able to take 
the same input images and output a policy. In this work, this output 
consisted of the number of functions used in the search, in addition 
to 2 nodes for the probability and magnitudes respectively, and then 
doubling this to account for a subpolicy consisting of pairs of 
transformations.


In order to calculate which augmentations are optimal, we 
consider the strength of pairs of augmentations, as well as the 
presence of those exact pairs in the network output. We first find 
the covariance of the outputs for the first and second augmentations, 
and chose the most strongly correlated pair of augmentations. We 
then returned to the batched network output to find all instances 
where these pairs of augmentations were chosen, and found the mean 
magnitude and probability from these. This could be repeated to 
generate multiple different subpolicies, at which point it was 
passed to the child network for training.


.. _evo_pseudocode:

.. figure:: /img/EvoLearner_pseudo.jpg
   :alt: Evolutionary Learner (EvoLearner) pseudocode
   :align: center
   :width: 700px

   EvoLearner Pseudocode


GRU Learner (:class:`GruLearner`)
#################################

The algorithm is shown in :numref:`gru_pseudocode`.
This model nearly identical to what was used in 
:cite:t:`autoaugment`, where the authors borrowed the model from this 
:cite:t:`neural_arch_search`. In their works an LSTM 
controller was used, which output a policy in the form of a 
length 10 sequence of vectors, each vector representing an 
operation. The LSTM controller was updated using proximal policy 
optimization (PPO) :cite:t:`ppo`, using the accuracy of the child 
network as the reward value. 

In the context of the PPO update, which was developed in the 
reinforcement learning literature, the subpolicies are the 
'actions' of our RL agent. 

We use GRU instead of LSTM because it's faster while empirically 
being functionally equivalent :cite:t:`gru`. 

For more details on how an RNN (LSTM or GRU) network can be 
updated using PPO see Section 3.2 of :cite:t:`neural_arch_search`. 
For more details on the hyperparameters of the LSTM they used during 
training, see 'Training Details' in the same paper.


.. _gru_pseudocode:

.. figure:: /img/GruLearner_pseudo.jpg
   :alt: GRU Learner (GruLearner) pseudocode
   :align: center
   :width: 700px

   GruLearner Pseudocode


UCB Learner (:class:`UcbLearner`)
#################################

UCB1 is often described as optimism in the face of uncertainty, 
and is described in :numref:`ucb_pseudocode`. We are trying to 
find the action with the highest reward given some uncertainty of 
the current expected rewards (:math:`q`-values). As the number of times an 
action has been taken  increases (relative to the total number of 
actions taken), its outcome becomes more certain. Instead of looking 
at raw :math:`q`-values, the algorithm takes the action counts into account, 
which can be seen in Equation :eq:`ucbeq`.


.. math::
    :label: ucbeq

    q_{i} + \sqrt{\frac{2ln(n)}{n_i}}


where :math:`i`` represents an action, :math:`q_i` is the associated :math:`q`-value, 
:math:`n` is the total action count, and :math:`n_i` is the action count. UCB1 
is a typical algorithm used in multi-arm bandit problems 
:cite:t:`kexugit_test_nodate` and, since the auto-augmentation problem 
can be rephrased as a multi-arm bandit problem (where each arm is 
applying an augmentation and each return is an accuracy score on a 
validation set), it is simple to use UCB1 as a starting case before 
moving to more complex algorithms. UCB1 is easy to understand and easy 
to implement but does not use neural networks, and so offers no real 
way to generalise between different types of augmentations.

It works by keeping track of how many times each augmentation has been 
attempted, and adjusting the raw :math:`q`-values (mean accuracy on validation 
set) by these counts (seen in the formula above). This new adjusted 
q-value is then used to decide which augmentation is applied in the next 
iteration (that is, which bandit arm is pulled).

Due to the fact UCB1 doesn't generalise between types of augmentation, 
it is quite slow compared to methods that involve neural networks (for 
example GRU). For this reason, we used a similar methodology to 
:cite:t:`cubuk_autoaugment_2019` however we limit the 
search space to 5 policies each with 5 subpolicies (the original paper 
has 14,000 policies each with 5 subpolicies). The raw accuracy on a 
fixed validation set is fed in as the reward.

UCB1 is very good at balancing the exploration-exploitation trade-off 
in an efficient manner that minimises regret, which is why it's often 
the go-to algorithm for multi-arm bandit problems.


.. _ucb_pseudocode:

.. figure:: /img/UcbLearner_pseudo.jpg
   :alt: UCB1 Learner (UcbLearner) pseudocode
   :align: center
   :width: 700px

   UcbLearner Pseudocode


.. bibliography::