The structure discovery algorithm proposed here takes an on-line, stepwise approach. Weights are added and removed as the algorithm progresses. Regularisation is applied by the choice of weights to add or remove, but can also be introduced into the regression algorithm used to learn the weight values.
For networks of even moderate size, it is impossible to test every possible weight, even in isolation, so a method for choosing which weights to consider is needed. A probability distribution is maintained, from which candidate weights are sampled and added to the model. The model then undergoes a training phase after which all the weights are tested for significance. Insignificant weights are removed and as the model grows, the weight picking distribution is altered to reflect its emerging structure.
At its most abstracted level, the algorithm proceeds as follows:
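A minimal Python sketch of this loop is given below. The helper names (empty_mohn, sample_weights, update_distributions and so on) are hypothetical stand-ins for the components detailed in the sections that follow, not functions from the paper.

```python
def discover_structure(data, n_neurons, n_iterations):
    model = empty_mohn(n_neurons)               # start with no weights
    P_o, P_n = init_distributions(n_neurons)    # Po(o) and Pn(i)
    discarded = set()
    for _ in range(n_iterations):
        # sample a batch of candidate weights and add them to the model
        candidates = sample_weights(P_o, P_n, model, discarded, batch=n_neurons)
        model.add_weights(candidates)
        model.train(data)                       # e.g. delta rule or LASSO
        removed = model.remove_insignificant_weights(data)
        discarded.update(removed)               # maintain the discard list
        # reshape the picking distributions to reflect the emerging structure
        P_o, P_n = update_distributions(model, P_o, P_n)
    return model
```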
A number of decisions are required when implementing Algorithm ?? in detail. They are:
- A representation of the probability distribution over unpicked weights
- A method for updating the weight picking distribution
- A choice of learning rule for calculating the new weight values
- A choice of regularisation method for removing weights
The following sections consider these points in more detail. These sections compare a number of choices for each step and are followed by an example algorithm based on one choice from each.
Representing the probability distribution across weights
The structure discovery algorithm is based on the premise that, because not every possible weight can even be considered, heuristics must be used to pick weights that have a higher chance of proving useful. The solution is to maintain a probability distribution over the possible weights in which the probability of a weight being selected is proportional to its chance of being useful. This requires a representation of the space of possible weights and a method for shaping a function to reflect each weight's potential usefulness.
Rather than use a single distribution that spans every possible weight, in this work two distributions are used. One covers the order of the weight and the other covers the probability of each neuron in the network being connected to that weight. Let the probability distribution over the weight orders of an n neuron MOHN be
$$ P_o(o) : o \in \{1 \dots n\} $$
(7)
and the probability of picking neuron i to be connected, given the current choice of order, o, be
$$ P_n(i) : i \in \{0 \dots n-1\} $$
(8)
The order, o is sampled first, and then a subset, Q of o neurons is sampled without replacement from \(P_n(i)\). Both distributions are discrete (there are n possible orders and n possible neurons to choose from), so their representation need not be from any parametrised class. The probabilities can be represented as a vector of size n with the usual constraint that each must be between 0 and 1 and they must sum to 1. How \(P_o(o)\) and \(P_n(i)\) evolve as the algorithm progresses is addressed next.
Updating the weight picking distributions
At the first iteration of the algorithm, the distributions \(P_o(o)\) and \(P_n(i)\) must be set up manually. This presents an opportunity to include any prior knowledge that exists about the function to be modelled and also allows some control over the complexity of the model to be imposed.
Distribution over weight orders
The choice made for the algorithm described here is to initialise \(P_o(o)\) with a discrete Laplace distribution centred at order c, where initially c=1 and c is incremented as lower order weights are either used or discarded.
$$ P_o(o) = \frac{1}{2b} e^{-\frac{|c-o|}{b}} $$
(9)
where b controls the width of the distribution. As the distribution is only sampled at integer points in a limited range, it must be normalised so that the probabilities sum to 1. This is done by summing Eq. 9 over the range of possible orders (1…n) to give a constant, Z, and then calculating the probability of picking a weight of each order as \(\frac{P_o(o)}{Z}\).
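For concreteness, a minimal numpy sketch of Eq. 9 with this normalisation might look as follows (the function name is illustrative):

```python
import numpy as np

def order_distribution(n, c, b=1.0):
    """Discrete Laplace over orders 1..n, centred at c, normalised (Eq. 9)."""
    orders = np.arange(1, n + 1)
    raw = np.exp(-np.abs(c - orders) / b) / (2 * b)
    return raw / raw.sum()   # divide by Z so the probabilities sum to 1
```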
In the early iterations of the algorithm, where c=1, there is a high probability of picking first order weights and an exponentially decreasing probability of picking weights of higher order. In subsequent iterations, \(P_o(o)\) is updated in two ways. Firstly, c is increased to allow the algorithm to pick weights of higher orders and, secondly, the values of existing weights are used to shape the distribution to guide the algorithm towards orders that have already yielded high value weights.
The weight order probability distribution, \(P_o(o)\), is updated by counting the proportion of weights in the current network that are of each order. Let this vector of proportions be \(p = p_1 \dots p_n\), where \(p_i\) is the number of weights at order i divided by the total number of weights in the network. These proportions are then used to update \(P_o()\), along with an updated version of the discrete Laplace distribution, as follows:
$$ P_o(i) \gets (1-(\alpha+\beta)) P_o(i) + \alpha p_{i} + \beta \frac{1}{2b} e^{-\frac{|c-i|}{b}} $$
(10)
where α is the proportion of the weight order counts, p, to include in the update and β is the proportion of the fresh discrete Laplace component centred on the current order mode, c, such that 0≤α≤1, 0≤β≤1 and 0<α+β≤1.
If α+β=1 the new distribution is a mixture of the current distribution of weight orders in the MOHN and the discrete Laplace distribution with a mode of c. If α+β<1 the distribution retains some memory of its previous shape, weighted by 1−(α+β). In the experiments reported in this paper, the values α=0.6, β=0.2 were used and found to work well.
The weight order mode, c, needs to be manipulated as learning progresses. In the work reported here, c was set to equal the lowest order with remaining unsampled weights. As lower weight orders are exhausted, the mode naturally moves up. Of course, this does not rule out higher orders being sampled; the α component will bias the sampling towards higher orders if they prove useful. The smaller the value of b, the faster the weight order distribution drops towards zero as it moves away from c.
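A sketch of the update in Eq. 10, reusing the order_distribution function above, might read as follows (it assumes the Laplace component is normalised before mixing, so that the result remains a distribution):

```python
import numpy as np

def update_order_distribution(P_o, order_counts, c, alpha=0.6, beta=0.2, b=1.0):
    """Eq. 10: blend the old distribution, the observed proportion of weights
    at each order and a fresh discrete Laplace centred at c."""
    p = order_counts / order_counts.sum()         # proportion of weights per order
    laplace = order_distribution(len(P_o), c, b)  # normalised, from the sketch above
    P_o = (1 - (alpha + beta)) * P_o + alpha * p + beta * laplace
    return P_o / P_o.sum()                        # guard against rounding drift
```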
Distribution over neurons
Once the order, o, of a new candidate weight has been sampled, the o neurons that it connects must be picked. These neurons are picked from a distribution, \(P_n()\), that evolves as each neuron is picked. The shape of \(P_n()\) is determined by a number of factors. Prior knowledge can be included by increasing the probability of variables that are known to be useful. If no prior knowledge is available, then \(P_n()\) starts off as a uniform distribution. Once there are some weights in the network, \(P_n()\) is determined by a mixture of the prior knowledge and the role played by each neuron in the existing network. To connect a weight of order o, there are two phases to the neuron picking procedure. The distribution from which the first neuron is picked is shaped by the contribution each neuron is already making. In exploratory mode, neurons that have not yet played a role are favoured and in exploitative mode, neurons that are already well connected are more likely to be picked. Subsequent neurons, up to o, are picked from a distribution that is reshaped by the set of neurons that are already connected, at orders other than o, to the set under construction.
The trade-off between exploration and exploitation can be managed. Exploration in this case means favouring neurons that have few or weak connections on the assumption that they do have a role to play, but it has yet to be found. Exploitation refers to picking neurons that already have connections on the assumption that those which have proved useful at some orders will also be useful at others.
The first step in picking the o neurons is to pick the first with a probability proportional to the contribution it makes to the model. Define the contribution made by neuron i as the sum of the absolute values of the weights connected to neuron i.
$$ C(i) = \sum_{w_{j}:x_{i} \in Q_{j}} |w_{j}| $$
(11)
where \(w_j : x_i \in Q_j\) iterates over the weights connected to \(x_i\). The proportion of the total contribution of all neurons made by neuron i is
$$ Cp(i) = \frac{C(i)}{\sum_{j=0}^{n-1} C(j)} $$
(12)
Now let ρ control the level of exploration, such that ρ=−1 means full exploration (biasing the search towards unused neurons), ρ=1 means full exploitation (biasing the search towards well used neurons) and ρ=0 leads to a uniformly random choice among the neurons. Any other value in −1<ρ<1 balances the degree of exploration against exploitation. The probability of picking neuron i is
$$ P_n(i) = \begin{cases} (1-\rho)\frac{1}{n}+\rho \, Cp(i), & \text{if } \rho > 0 \\ (1+\rho)\frac{1}{n}-\rho \, (1-Cp(i)), & \text{otherwise} \\ \end{cases} $$
(13)
Equation 13 causes the degree of exploration to vary when ρ<0 and causes the degree of exploitation to vary when ρ>0. The closer to zero the value of ρ gets, the more uniformly random the neuron selection becomes.
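A possible numpy rendering of Eqs. 11-13 is sketched below. The representation of weights as (neuron_set, value) pairs is an assumption made for illustration, and the result is renormalised because the exploration branch of Eq. 13 need not sum to exactly one:

```python
import numpy as np

def neuron_distribution(weights, n, rho):
    """Eqs. 11-13: probability of picking each neuron as the first member of a
    new weight. weights is assumed to be a list of (neuron_set, value) pairs."""
    C = np.zeros(n)
    for neuron_set, value in weights:    # Eq. 11: summed |w| touching neuron i
        for i in neuron_set:
            C[i] += abs(value)
    Cp = C / C.sum() if C.sum() > 0 else np.full(n, 1.0 / n)   # Eq. 12

    if rho > 0:   # exploitation: favour well connected neurons
        Pn = (1 - rho) / n + rho * Cp
    else:         # exploration: favour neurons with little contribution
        Pn = (1 + rho) / n - rho * (1 - Cp)
    return Pn / Pn.sum()   # renormalise; the explore branch need not sum to one
```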
The first neuron connected by the weight being built is selected by sampling the distribution defined in Eq. 13. Once the first neuron, \(x_i\), is picked, \(P_n(i)\) is updated in two ways. Firstly, the chosen neuron has its probability set to zero, so \(P_n(i)=0\), to prevent it being picked again. Then, \(P_n()\) is updated so that other neurons that are already connected to \(x_i\) at other orders have their probability of being chosen increased as follows
$$ P_n(i) \gets (1-\delta) P_n(i) + \delta \sum_{w: w \in V, x_{i} \in w} |w| $$
(14)
where V is the set of weights that are connected to any of the neurons that have been picked for the weight currently under construction. The sum is over all weights that are connected to both \(x_i\) and any of the other neurons already chosen for the new weight. The parameter δ∈[0,1] controls the mix of the previous shape of \(P_n()\) and the update. High values of δ cause the algorithm to favour neurons that are connected to those already in the set being built, and low values cause it to favour the contribution of each neuron in isolation. In this way, sets of neurons that form cliques due to low order connections have a higher probability of being connected at higher orders. Finally, when the number of neurons picked equals o−1, the probability associated with all neurons already connected to those neurons at order o is set to zero to ensure an existing weight is not picked.
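The whole neuron picking procedure might be sketched as follows. The clique_strength callback is a hypothetical stand-in for the sum in Eq. 14, and the final duplicate-weight check is noted but omitted:

```python
import numpy as np

def pick_neurons(order, P_n, clique_strength, delta=0.8):
    """Pick the o neurons for one candidate weight. clique_strength(j, chosen)
    is assumed to return the summed |w| of existing weights connecting neuron j
    to any neuron already in chosen (the sum in Eq. 14)."""
    P_n = P_n.copy()
    chosen = []
    while len(chosen) < order:
        i = np.random.choice(len(P_n), p=P_n / P_n.sum())
        chosen.append(i)
        P_n[i] = 0.0                         # a neuron cannot be picked twice
        for j in range(len(P_n)):            # Eq. 14: favour neurons linked to
            if P_n[j] > 0:                   # the set under construction
                P_n[j] = (1 - delta) * P_n[j] + delta * clique_strength(j, chosen)
        # (the final step that zeroes neurons which would duplicate an
        #  existing order-o weight is omitted here for brevity)
    return chosen
```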
Weights are not added and learned one at a time; they are added in batches. In the experiments reported here, the number of weights added at each iteration of the algorithm was set equal to the number of input neurons.
Efficient weight picking
Once a weight is already in the model or has been tested and discarded, it is considered used. Only unused weights should be considered for addition to the model. When the ratio of available weights to used weights is high, it is efficient to simply pick a random weight using the procedure above and check that it is not already in the network or in a list of weights that have been considered but removed from the network. To avoid unhelpful weights being repeatedly added and removed, a list of discarded weights is maintained. Newly sampled prospective weights are first compared to the members of this list and are not added if they have been tried recently. As weights may appear unhelpful as part of a poorly structured network but later prove to be of use when the rest of the structure is in place, the discard list is periodically emptied to give weights a second chance of inclusion.
This approach becomes inefficient when there are very few (or no) weights available at the chosen order, meaning that many picks may be needed before an available weight is found. To ensure that there are available weights at the chosen order, the algorithm keeps count of how many weights of each order have been used. There are \(n \choose o\) possible weights at each order, o, so when the order o count reaches this figure, the probability of picking a weight at that order is forced to zero.
Another efficiency enhancement to the algorithm is the inclusion of a ‘mopping up’ procedure that is activated when the number of used weights at order o reaches a certain percentage of the total (a threshold of 90 % is used in this paper). When the order o count reaches the threshold, the few remaining weights at order o are automatically added to the model and assessed. This allows the probability of picking from order o to then be forced to zero, thus avoiding many fruitless picks from that order.
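A sketch of this bookkeeping, using Python's math.comb for \(n \choose o\) (the function name and the dictionary of counts are illustrative):

```python
from math import comb

def order_status(used_counts, n, mop_up_at=0.9):
    """Classify each order as exhausted (force Po(o) to zero) or ready for
    'mopping up' (add the few remaining weights directly)."""
    exhausted, mop_up = [], []
    for o, used in used_counts.items():
        total = comb(n, o)          # n choose o possible weights at order o
        if used >= total:
            exhausted.append(o)
        elif used >= mop_up_at * total:
            mop_up.append(o)
    return exhausted, mop_up
```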
Learning rules for the weights
We have described two learning rules for a fixed-structure MOHN: the online delta method and the offline LASSO. Each method has different attractive properties for estimating weight values during structure discovery. At each iteration of the structure discovery algorithm, a small proportion of new weights is added to a network whose existing weight values are likely to be close to correct already. As the delta rule is incremental, it can take advantage of this fact rather than starting again from a new, empty network. New candidate weights can be initially set using Eq. 5, after which the entire new network is improved using Eq. 4. Algorithm ?? describes this process.
The nature of the regularisation in LASSO means that weights that are not needed have values forced to zero, removing the need for an additional weight removal decision, but at the cost of estimating the entire network structure from scratch at each iteration. Algorithm ?? describes the LASSO network update method. A single value for λ may be chosen or, as is usual in the application of LASSO, a number of different settings for λ may be tried.
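As an illustration only (not the implementation used in the paper), the refit-from-scratch step could be sketched with scikit-learn's Lasso. The feature matrix X, assumed here to hold one column per candidate weight, is a construct of this sketch:

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_network_update(X, y, lambdas=(0.001, 0.01, 0.1)):
    """Fit the whole network at several settings of the regularisation
    parameter and keep the coefficients that are not forced to zero."""
    best, best_score = None, -np.inf
    for lam in lambdas:
        model = Lasso(alpha=lam).fit(X, y)
        score = model.score(X, y)        # ideally scored on held-out data
        if score > best_score:
            best, best_score = model, score
    kept = np.flatnonzero(best.coef_)    # indices of surviving weights
    return best.coef_, kept
```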
Regularisation and weight removal
Regularisation refers to the process of introducing additional constraints into a machine learning process to prevent overfitting. This often takes the form of a penalty on complexity or a bound on the norm of the learned parameters. Regularisation can also involve the use of an out-of-sample test set. All of these methods may be applied to a MOHN, but the main means of regularising a MOHN is the removal of insignificant weights. In this section, two options for weight removal are considered. It is important to remove weights because the rules for updating the probability distributions from which new weights are chosen depend on the presence or absence of weights in the model. It is also desirable to keep the model small for reasons of parsimony, to avoid overfitting and to reduce the time required during learning and inference.
Equation 5 shows a first approximation to the correct value of a weight based on the difference between the mean function output for even and odd parity inputs to the weight. In cases where the difference between the distributions of the function output under each of the two parity conditions is not statistically significant, the weight may be excluded. A t-test is used to compare the mean function output between the odd and even parity input sets, allowing weights with a p-value above a chosen threshold to be removed. Some fine tuning of the critical p-value (\(p_{crit}\)) is required to ensure that the algorithm does not discard too many or too few weights. In this work, the critical p-value starts high (\(p_{crit}=0.3\)) and diminishes towards a lower limit (\(p_{crit}=0.001\)) as the network grows. The t-value is calculated as follows
$$ t=\frac{w}{\sqrt{\frac{\sigma_{w}}{|D|}}} $$
(15)
where w is the weight value, \(\sigma_w\) is the variance of f(x) and |D| is the number of training data points.
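A sketch of this test using scipy; the two-tailed p-value and the |D|−1 degrees of freedom are assumptions of this illustration rather than details given in the text:

```python
import numpy as np
from scipy import stats

def weight_p_value(w, var_f, n_data):
    """Two-tailed p-value for a weight from the t-value of Eq. 15."""
    t = w / np.sqrt(var_f / n_data)
    return 2.0 * stats.t.sf(abs(t), df=n_data - 1)

def prune(weights, var_f, n_data, p_crit):
    """Remove weights whose p-value exceeds the current threshold."""
    return [(q, w) for q, w in weights
            if weight_p_value(w, var_f, n_data) <= p_crit]
```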
Learning the weight values using LASSO forces some weights to zero, making the choice of which weights to remove from the network almost trivial. Removing all the weights with zero value is the simple part, but it is still necessary to choose the value of the regularisation parameter, λ. One approach is to calculate the coefficients at a number of different settings of λ and choose which weights to remove from that set.
The full algorithm
The full structure discovery algorithm is presented below, with reference to partial algorithms already described above.
One advantage of this approach to regression is that a lot of information is available during network learning. Firstly, the maximum number of possible weights at each order is calculated as \(n \choose o\), where o is the order and n is the total number of inputs. As the algorithm progresses, the number of weights of each order in the network may be reported and compared to the possible total. This gives a measure of the complexity of the network compared to the maximum possible complexity. By reporting the list of tried and discarded weights, it is also possible to monitor how much of the weight space the algorithm has sampled.
The user might choose to set an upper limit on the order of weights added to the network according to the size of the training sample.
Setting the control parameters
The discussion so far has introduced a number of control parameters governing the speed at which the probability distributions evolve, the trade-off between exploration and exploitation, and the sensitivity of the weight removal process. It may appear that there are a lot of parameters to balance but, in reality, most of them can be fixed at default values and never changed. For example, α, β and δ all control the rate at which the weight probability distributions forget their previous shape. In the work reported here they were set at α=0.6, β=0.2 and δ=0.8. The critical value for p in the t-test, \(p_{crit}\), starts at 0.3 and reduces towards 0.001, diminishing by a factor of 0.7 on each iteration of the algorithm. The parameter b, which controls the rate at which the discrete Laplace distribution over the weight orders drops to zero, was fixed at 1, which concentrates the sampling on the mode and one step either side of it.
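These defaults could be collected in one place; the class below simply records the values quoted above (its name and structure are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class StructureDiscoveryParams:
    """Default control parameter values used in this work."""
    alpha: float = 0.6          # weight of observed order proportions in Eq. 10
    beta: float = 0.2           # weight of the fresh discrete Laplace in Eq. 10
    delta: float = 0.8          # pull towards already-connected neurons in Eq. 14
    b: float = 1.0              # width of the discrete Laplace over orders
    p_crit_start: float = 0.3   # initial t-test threshold
    p_crit_floor: float = 0.001 # lower limit of the threshold
    p_crit_decay: float = 0.7   # multiplicative reduction per iteration
```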