Structure discovery in mixed order hyper networks
Kevin Swingler^{1}
DOI: 10.1186/s41044-016-0009-x
© The Author(s) 2016
Received: 15 October 2015
Accepted: 14 September 2016
Published: 1 October 2016
Abstract
Background
Mixed Order Hyper Networks (MOHNs) are a type of neural network in which the interactions between inputs are modelled explicitly by weights that can connect any number of neurons. Such networks have a human readability that networks with hidden units lack. They can be used for regression, classification or as content addressable memories and have been shown to be useful as fitness function models in constraint satisfaction tasks. They are fast to train and, when their structure is fixed, do not suffer from local minima in the cost function during training. However, their main drawback is that the correct structure (which neurons to connect with weights) must be discovered from data and an exhaustive search is not possible for networks of more than around 30 inputs.
Results
This paper presents an algorithm designed to discover a set of weights that satisfy the joint constraints of low training error and a parsimonious model. The combined structure discovery and weight learning process was found to be faster, more accurate and have less variance than training an MLP.
Conclusions
There are a number of advantages to using higher order weights rather than hidden units in a neural network but discovering the correct structure for those weights can be challenging. With the method proposed in this paper, the use of high order networks becomes tractable.
Keywords
High order neural networks, Structure discovery, Linkage learning
Background
Mixed order hypernetworks (MOHNs) [1] are neural networks in which weights can connect any number of neurons, rather than the usual two. They can be used as regression models or classifiers like MLPs, as content addressable memories like Hopfield networks [2], and as probability density estimators and fitness function models for use in optimisation [3]. MOHNs can form a basis for functions of the form \(f:\{-1,1\}^{n} \rightarrow \mathbb {R}\), making them universal function models in that space. They contain no hidden units, using higher order weights instead, which has advantages including improved human readability, faster and more stable weight learning, the ability to compare multiple networks like for like, and finer control over the complexity of the function (and so improved regularisation).
Higher order weights also have a disadvantage: the correct weights to include in a network must be identified. A network of n inputs can contain up to 2^{ n } different weights, so small networks may be fully connected and then pruned by removing insignificant weights. In larger networks, many of the possible weights cannot even be considered as the computational time required to do so is prohibitive. In these cases, weights must be chosen using heuristics that improve the chance of a weight contributing to the function being selected.
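To give a feel for how quickly the space of candidate weights grows, the following minimal sketch counts the weights available at each order using the binomial coefficient. The function name is illustrative and not part of the original work.

```python
from math import comb

def candidate_weights_by_order(n):
    """Return [C(n, 0), C(n, 1), ..., C(n, n)]: candidate weights at each order."""
    return [comb(n, o) for o in range(n + 1)]

for n in (10, 30, 100):
    counts = candidate_weights_by_order(n)
    # The per-order counts always add up to the 2^n total quoted in the text.
    assert sum(counts) == 2 ** n
    print(f"n={n}: order 1 -> {counts[1]}, order 2 -> {counts[2]}, "
          f"order 3 -> {counts[3]}, all orders -> {2 ** n}")
```

Low orders stay manageable for a while, but the total across all orders doubles with every additional input.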
This paper addresses the question of how to choose which weights to include in a MOHN when it is impossible to test more than a small fraction of the possible weights. The inclusion or exclusion of different weights has an effect on the variance/bias tradeoff of a model. Adding weights improves the training error but can, after a point, increase the test error. Adding weights will also increase the complexity of calculations made by the network, slowing both the learning process and inference.
The remainder of the paper is organised as follows. This section concludes with a short look at existing approaches to discovering structure in graphical models and is followed by a summary of mixed order hyper networks. “Statement of the problem” section describes the problem addressed by this paper and is followed by “Methods” section, which describes the proposed algorithm in detail. The “Results and discussion” section describes some experimental results and is followed by conclusions and suggestions for some future directions for this work.
Existing work
Inspiration for a method of discovering the structure of a MOHN can be sought from other fields where computational networks are built from data. This section considers methods for learning the structure of multi layer perceptrons (MLPs), Bayesian belief networks (BBNs) and Markov random fields (MRFs) and considers methods of feature selection.
Multi layer perceptrons
The standard structure of an MLP contains an input layer, one or more hidden layers and an output layer. Each layer is fully connected to the neurons in the layer above. This structure can be changed dynamically during training by adding or removing weights, or neurons together with their associated sets of weights. Bartlett [4], for example, proposed an algorithm that added hidden units each time the training error flattened, and removed units based on an information theoretic measure. He also pointed out that the network weights were often optimised to the existing structure, so adding new ones did not allow the network to escape the local optimum it was in. LeCun et al. [5] proposed the Optimal Brain Damage algorithm, which removes weights with low saliency, defined using the second derivative of the cost function. Some algorithms continue to train all of the weights after each iteration of adding or removing weights. Others, such as DMP3 [6], freeze existing weights and only train the newly added ones. Some algorithms add neurons in a restricted structure; for example, in the Upstart algorithm [7] the network becomes a tree structure as new neurons are added below existing parent neurons. Although not strictly a structure discovery approach, dropout [8] is a method that drops random neurons during training and then approximates the average output of all the resulting smaller networks at test time. There have also been evolutionary approaches to MLP structure discovery; for example, [9] introduces a new crossover operator to allow a GA to discover MLP structure, solving the permutation problem (being that network structure tells you little about network function). See [10] for a review of evolutionary approaches to neural network learning.
There is some meaning attached to each weight in a MOHN in a way that is not present in an MLP. Adding an additional fully connected hidden unit to a single hidden layer MLP (or removing one) involves all of the input units in a way that is not defined until the weight values are learned, and even then the unit’s contribution is unclear. In a MOHN, a weight connects a subset of the inputs and its contribution to the output value is clearly defined.
Bayesian belief networks
A BBN is a directed graph of conditional probability functions in which child nodes represent the distribution across a variable conditional on the values of its immediate parents. Structure discovery in a BBN involves finding a pattern of connectivity that accurately represents the joint distribution across the variables while keeping the complexity of the model low. Complexity can be controlled by limiting the number of parents a node can take (as used in the original K2 algorithm [11]) or by minimum description length (e.g. [12]). Genetic algorithms [13] and evolutionary programming [14, 15] have both been used to discover BBN structure, as have other methods such as branch and bound [16]. In all of these cases, the goal is to reduce the number of possible connections the algorithm has to choose from when adding more.
Markov random fields
An MRF is an alternative representation for joint probability functions using an undirected graph. MRF structure is very similar to that of a MOHN, with connections possible at any order. Structure discovery in MRFs has received less attention than that in MLPs and BBNs. Ravikumar et al. [17] and Lee et al. [18] have both used LASSO in the discovery of structure in MRFs, but address graphs with only pairwise connections.
McCall et al. [19] use statistical tests of independence to discover second order connections between pairs of variables subject to a limit on the number of connections a node can have and follow this with a clique finding algorithm to infer higher order connections [20].
Feature selection
Observing that each possible subset of k variables chosen from all n has an associated candidate weight, the problem of structure discovery may be thought of as a feature selection task where each subset represents a single feature. In data mining, feature selection is generally applied to choosing a subset of variables, and can be split into two main approaches. Wrapper methods [21] combine feature selection with a chosen modelling method, and filter methods work independently of a modelling method as a pre-learning step. Many methods, such as stepwise regression [22], use a greedy approach that involves growing a feature set incrementally. Other methods employ an evolutionary approach [23, 24].
Mixed order hyper networks
where \(\langle f(\mathbf {x}Q_{j}^{+})\rangle \) is the expected value of f(x) when the parity of the values in Q _{ j } is positive, and \(\langle f(\mathbf {x}Q_{j}^{-}) \rangle \) is the expectation when the parity across Q _{ j } is negative.
where λ controls the degree of regularisation. When λ=0, the LASSO solution becomes the ordinary least squares solution. With λ>0 the regularisation causes the sum of the absolute weight values to shrink such that weight values can be forced to zero. This not only allows LASSO to reject input variables that contribute little, but also to reject higher order weights that are not needed. When training a MOHN, the input to the LASSO regression algorithm is a vector containing a variable for each weight in the model. The value associated with each weight on which the regression is performed is the product of the input values connected to that weight. For a comparison of the MOHN learning rules, see [26].
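The construction of the regression variables described above can be sketched as follows. Each weight contributes one variable, namely the product of the inputs it connects, and a LASSO fit over those variables shrinks unneeded weights to exactly zero. The coordinate descent solver, the objective scaling (1/2D times the squared error plus λ times the sum of absolute weights) and the target function are assumptions made for illustration, not the implementation used in the paper.

```python
from itertools import product
from math import prod

def features(x, structure):
    """One regression variable per weight: the product of the inputs it connects."""
    return [prod(x[i] for i in Q) for Q in structure]

def soft_threshold(z, t):
    """Shrink z towards zero by t, returning exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_coordinate_descent(X, y, lam, sweeps=50):
    """Minimise (1/2D) * sum (y - Xw)^2 + lam * sum |w| by cyclic coordinate descent."""
    D, m = len(X), len(X[0])
    w = [0.0] * m
    for _ in range(sweeps):
        for j in range(m):
            # Correlate column j with the residual computed without weight j.
            rho = sum(
                X[d][j] * (y[d] - sum(X[d][k] * w[k] for k in range(m) if k != j))
                for d in range(D)
            ) / D
            scale = sum(X[d][j] ** 2 for d in range(D)) / D
            w[j] = soft_threshold(rho, lam) / scale
    return w

# Hypothetical target with one second order term and one first order term.
def f(x):
    return 2 * x[0] * x[1] + x[2]

structure = [(), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
inputs = list(product((-1, 1), repeat=3))
X = [features(x, structure) for x in inputs]
y = [f(x) for x in inputs]
w = lasso_coordinate_descent(X, y, lam=0.1)
kept = {Q: round(v, 6) for Q, v in zip(structure, w) if v != 0.0}
print(kept)
```

Only the two weights that appear in the target survive, each shrunk towards zero by λ; setting `lam=0` in this sketch recovers the least squares values.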
Swingler [1] showed how a fully connected MOHN, trained on an exhaustive sample of input, output pairs from any function is able to represent that function perfectly. Such a MOHN forms a basis for such functions that is equivalent to the Walsh basis [27]. Any weight that is not required to model the function will go to zero, and may be removed, leaving a sparse structure that still fully represents the function. Normally, an exhaustive sample of data is not available and the network cannot be fully connected due to the number of connections required. In these circumstances, weights must be added and removed in an iterative process designed to discover the nonzero weights. No weight value will go completely to zero based on only a sample of data, so a significance test is required when considering whether to keep or remove a weight. The goal of structure discovery in a MOHN is to add and learn coefficients for those weights that would not go to zero in the full model and to exclude those that would. In principle, if the right weights can be found, any function may be represented to arbitrary accuracy.
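The exhaustive case can be illustrated with a small sketch. It assumes the network output is the sum, over weights, of the weight value times the product of the connected inputs, and computes each weight as the Walsh coefficient of the target over all of \(\{-1,1\}^{n}\). The target function `g` is hypothetical.

```python
from itertools import combinations, product
from math import prod

def all_weights(n):
    """Every subset of the n inputs, i.e. every possible weight (2^n of them)."""
    return [Q for o in range(n + 1) for Q in combinations(range(n), o)]

def exhaustive_weights(f, n):
    """Weight for each subset Q: the mean of f(x) * prod(x_i for i in Q) over all x."""
    xs = list(product((-1, 1), repeat=n))
    return {Q: sum(f(x) * prod(x[i] for i in Q) for x in xs) / len(xs)
            for Q in all_weights(n)}

def mohn_output(weights, x):
    """Sum over weights of (weight value * product of the inputs it connects)."""
    return sum(w * prod(x[i] for i in Q) for Q, w in weights.items())

def g(x):  # hypothetical target
    return 3 + x[0] - 2 * x[1] * x[2]

full = exhaustive_weights(g, 3)
sparse = {Q: w for Q, w in full.items() if w != 0}  # drop weights that went to zero
print(sparse)
# The pruned network still reproduces the target on every input.
assert all(mohn_output(sparse, x) == g(x) for x in product((-1, 1), repeat=3))
```

With only a sample of the data the unneeded weights would be small rather than exactly zero, which is why the text goes on to require a significance test.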
Statement of the problem
By including the right weights with the right values, a MOHN can represent any function \(f:\{-1,1\}^{n} \rightarrow \mathbb {R}\). The set of weights present in a given MOHN defines its structure, while the values on those weights further define the function’s shape. Each subset of weights (i.e. structure) is capable of representing a subset of the possible function shapes (i.e. input → output mappings). For example, the subset containing all of the first order weights and no others is capable of representing any linear function of the inputs. There are \(2^{2^{n}}\) possible structures.
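As a worked illustration of that count (the values of n are chosen arbitrarily), each of the \(2^{n}\) possible weights is either present or absent, so:

```latex
\[
n = 3:\quad 2^{n} = 8 \text{ possible weights},\qquad 2^{2^{n}} = 2^{8} = 256 \text{ possible structures}
\]
\[
n = 5:\quad 2^{n} = 32 \text{ possible weights},\qquad 2^{2^{n}} = 2^{32} = 4\,294\,967\,296 \text{ possible structures}
\]
```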
Functions may be designed by hand by specifying the structure and values of the weights. This has been done with Hopfield networks to use them as optimisation fitness functions to model the travelling salesman problem [28, 29] and graph colouring problems [30]. More often, however, the structure and shape of the function must be discovered from data.
Consider two types of scenario in which it is useful to build a model of a function. In the first, data representing observations or measurements have been collected and the requirement is to model the underlying structure of the relationship between inputs and output in the data. This is the traditional approach in machine learning, data mining or statistical learning. Often the data is fixed (cannot be sampled at random) and noisy.
In the second scenario, a function that produces the output resulting from a given input already exists and can be sampled randomly. Reasons for sampling this function to generate data to train an alternative model (in this case a MOHN) are most often found when the output of the function needs to be optimised and evaluating the existing function is costly. A MOHN can be used as a fitness function model as part of an optimisation task as local optima can be quickly identified and even removed [31]. In such cases, it is usually assumed that there is no noise in the output (the fitness score) and that the function can be sampled at random.
In regression models such as these, the weight values may be calculated independently (for example, in single linear regressions) and joined if the variables are uncorrelated (orthogonal). However, if pairs of variables are correlated, then the weights must be calculated by considering all of the variables together. The correlation between variables depends to some extent on whether the training data is a fixed sample (as is usual in machine learning) or random samples from a function (as is usual in optimisation), the latter case allowing correlations to be reduced by the sampling regime. With a uniformly random sampling regime, correlations between inputs diminish as the sample size grows. A consequence of this is that the value taken by any single weight will depend in part on the presence of other weights in the network, so a weight that appears insignificant on its own may gain significance as the network grows.
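The claim that correlations between inputs diminish under uniform random sampling can be checked with a small simulation. This sketch uses the mean product of two independent ±1 inputs as a correlation estimate, which assumes zero mean and unit variance; the seed, sample sizes and number of trials are arbitrary choices.

```python
import random

def mean_abs_correlation(D, trials, rng):
    """Average |mean(a_i * b_i)| for two independent uniform +/-1 inputs of length D."""
    total = 0.0
    for _ in range(trials):
        a = [rng.choice((-1, 1)) for _ in range(D)]
        b = [rng.choice((-1, 1)) for _ in range(D)]
        total += abs(sum(ai * bi for ai, bi in zip(a, b))) / D
    return total / trials

rng = random.Random(0)  # fixed seed so the run is repeatable
results = {D: mean_abs_correlation(D, trials=200, rng=rng) for D in (10, 100, 1000)}
for D, r in results.items():
    print(D, round(r, 3))
```

The spurious correlation shrinks roughly with the square root of the sample size, so a fixed, small data set leaves more dependence between weight estimates than a large random sample does.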
The problem of MOHN structure discovery involves discovering a sufficient number of the nonzero weights to achieve an acceptably low error without the need to test every possible weight. The difference between MOHN structure discovery and feature selection lies in the fact that many candidate weights will never be tested. The choice of new weights must be guided by the existing network structure.
To discover the correct structure in a MOHN requires a sample of training data, which may be a fixed set of samples or arbitrary samples from the function to be learned. A method for learning the value to assign to a weight is also needed along with a method for testing the statistical significance of that value. Of course, a method for choosing which weights to consider (and so expend the computational effort of calculating a weight) is also required.
It is also important to impose some form of regularisation on the structure being built to avoid over fitting. Generally, regularisation is done by limiting the size of the weights, choosing a subset of variables to include in a model (feature selection) or controlling some aspect of a statistical model’s complexity (the number of hidden units in an MLP, for example). With a MOHN, one can (indeed, must) be more explicit about complexity by choosing the correct order for the weights.
Inspiration from existing work
We have already discussed various existing methods for discovering structure in different graphical function models. A common approach to MLP structure learning, Bayesian network learning and feature selection is to use evolutionary computation. An evolutionary approach has been discounted for this work for a number of reasons. Firstly, a population of networks with different structures would share parameters, leading to duplicated weight estimation calculations. This could be solved by maintaining a super set of weights so that each was only estimated once, but that super set would constitute one large MOHN in its own right, which would allow a better estimation of weight importance than that taken from a population of smaller networks.
Approaches to growing and pruning MLPs are hindered by the fact that there is no accessible meaning attached to hidden units. A MOHN, on the other hand, is transparent in the sense that the role of a weight is clearly defined in terms of a combination of input variables. The concept of training a small network until the error flattens and then adding more weights may be used, but the weights that are added may be chosen with some knowledge of their role, unlike the approach for an MLP. Much of the difficulty in estimating a MRF is due to the problem of evaluating the partition function, but the use of LASSO to regularise a network has been demonstrated, and will be considered in this work too.
Greedy approaches to feature selection often require the entire set of individual features to be evaluated at the first step. In subsequent steps, new combinations that are limited to those containing the best feature from the first step are explored. A MOHN is an unsuitable subject for this approach as there are 2^{ n } possible features—too many to consider even once. When building a large MOHN, there will be a great many weights that can never be evaluated. New candidate weights must be picked without knowing the estimate of their value, but based on the two things that can be known about them: the number of neurons they connect (their order) and the role of those neurons in the current network.
Many methods limit the degree of connectivity allowed in a graphical structure and these are often applied at the level of single nodes (for example, no node may have more than x links). This idea is extended in this work, firstly by controlling the order of connections (the number of nodes each weight is connected to) and secondly by allowing both an exploratory phase, where nodes that have not yet found a use are preferred over those with several connections already, and an exploitative phase where neurons that have proved useful already are more likely to be considered as part of higher order connections. Exploration has the effect of restricting the number of connections on nodes and exploitation is more akin to the results of greedy feature selection or clique based MRF structure discovery.
Methods
The structure discovery algorithm proposed here takes an online, stepwise approach. Weights are added and removed as the algorithm progresses. Regularisation is applied by the choice of weights to add or remove, but can also be introduced into the regression algorithm used to learn the weight values.
For networks of even moderate size, it is impossible to test every possible weight, even in isolation, so a method for choosing which weights to consider is needed. A probability distribution is maintained, from which candidate weights are sampled and added to the model. The model then undergoes a training phase after which all the weights are tested for significance. Insignificant weights are removed and as the model grows, the weight picking distribution is altered to reflect its emerging structure.
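At its most abstracted level, the algorithm repeatedly samples candidate weights, adds them to the model, retrains the model, removes insignificant weights and updates the weight picking distribution. This requires the following:
A representation of the probability distribution over unpicked weights
A method for updating the weight picking distribution
A choice of learning rule for calculating the new weight values
A choice of regularisation method for removing weights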
The following sections consider these points in more detail. These sections compare a number of choices for each step and are followed by an example algorithm based on one choice from each.
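The overall cycle of sampling, training, testing and removing weights can be sketched as a loop over pluggable components. All four callables below are hypothetical placeholders for the design choices discussed in the following sections, and the toy stand-ins exist only to make the skeleton runnable.

```python
def discover_structure(sample_batch, train, is_significant, update_distribution, iterations):
    """Skeleton of the add / train / test / remove cycle.

    weights maps a tuple of neuron indices to a weight value; the four
    callables are placeholders, not the paper's own components.
    """
    weights = {}
    discarded = set()
    for _ in range(iterations):
        # 1. Sample candidate weights and add any that are genuinely new.
        for Q in sample_batch(weights, discarded):
            if Q not in weights and Q not in discarded:
                weights[Q] = 0.0
        # 2. Learn values for the whole network (in place).
        train(weights)
        # 3. Remove insignificant weights and remember them.
        for Q in [Q for Q, w in weights.items() if not is_significant(Q, w)]:
            del weights[Q]
            discarded.add(Q)
        # 4. Reshape the weight picking distribution to reflect the new structure.
        update_distribution(weights)
    return weights, discarded

# Toy stand-ins: candidates come from a fixed queue and "training" copies known values.
true_values = {(0,): 1.0, (1, 2): -2.0}
queue = [[(0,), (1,)], [(1, 2), (0, 2)]]
found, dropped = discover_structure(
    sample_batch=lambda w, d: queue.pop(0) if queue else [],
    train=lambda w: w.update({Q: true_values.get(Q, 0.0) for Q in w}),
    is_significant=lambda Q, w: abs(w) > 0.5,
    update_distribution=lambda w: None,
    iterations=3,
)
print(found, dropped)
```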
Representing the probability distribution across weights
The structure discovery algorithm is based on the premise that, as not all possible weights can even be considered, heuristics for picking weights that have a higher chance of proving useful must be used. The solution is to maintain a probability distribution over the possible weights where the probability of a weight being selected is proportional to its chance of being useful. This requires a representation of the space of possible weights and a method for shaping a function to reflect a weight’s potential usefulness.
The weight picking distribution is represented by two distributions: \(P_{o}(o)\) over weight orders and \(P_{n}(i)\) over neurons. The order, o, is sampled first from \(P_{o}(o)\), and then a subset, Q, of o neurons is sampled without replacement from \(P_{n}(i)\). Both distributions are discrete (there are n possible orders and n possible neurons to choose from), so their representation need not be from any parametrised class. Each can be represented as a vector of n probabilities, with the usual constraint that each value must lie between 0 and 1 and the values must sum to 1. How \(P_{o}(o)\) and \(P_{n}(i)\) evolve as the algorithm progresses is addressed next.
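The two-stage sampling described above can be sketched with the standard library. This simplified version draws the neurons from a fixed vector of probabilities and does not reshape the neuron distribution between picks; the function name and the example probabilities are assumptions for illustration.

```python
import random

def sample_candidate(p_order, p_neuron, rng):
    """Sample an order o from p_order, then o distinct neurons from p_neuron.

    p_order[k] is the probability of order k + 1; p_neuron[i] is the
    probability of neuron i. Both are plain lists of length n.
    """
    n = len(p_neuron)
    order = rng.choices(range(1, n + 1), weights=p_order)[0]
    remaining = list(range(n))
    probs = list(p_neuron)
    chosen = []
    for _ in range(order):
        # Weighted draw among neurons not yet chosen, then remove the winner.
        k = rng.choices(range(len(remaining)), weights=probs)[0]
        chosen.append(remaining.pop(k))
        probs.pop(k)
    return tuple(sorted(chosen))

rng = random.Random(1)
n = 6
p_order = [0.6, 0.3, 0.1, 0.0, 0.0, 0.0]   # favours low orders
p_neuron = [1 / n] * n                      # uniform: no prior knowledge
batch = [sample_candidate(p_order, p_neuron, rng) for _ in range(n)]
print(batch)
```

Sorting the chosen indices gives each weight a canonical form, which makes it easy to check whether a candidate is already in the network.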
Updating the weight picking distributions
At the first iteration of the algorithm, the distributions \(P_{o}(o)\) and \(P_{n}(i)\) must be set up manually. This presents an opportunity to include any prior knowledge that exists about the function to be modelled and also allows some control over the complexity of the model to be imposed.
Distribution over weight orders
where b controls the width of the distribution. As the distribution is only sampled at integer points in a limited range, it must be normalised so that the probabilities sum to 1. This is done by summing Eq. 9 over the range of possible orders (1…n) to give a constant, Z, and then calculating the probability of picking a weight from each order as \(\frac {p_{o}(o)}{Z}\).
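The normalisation step can be sketched as follows. The unnormalised form used here, exp(-|o - c| / b) with mode c and width b, is an assumed parametrisation of a discrete Laplace distribution chosen to match the behaviour described in the text; it is not copied from the paper's equation.

```python
from math import exp

def order_distribution(n, c, b):
    """Normalised probabilities for orders 1..n, peaked at the mode c.

    Assumes an unnormalised value of exp(-|o - c| / b) at each order o.
    """
    unnormalised = [exp(-abs(o - c) / b) for o in range(1, n + 1)]
    Z = sum(unnormalised)  # normalising constant over the integer range 1..n
    return [u / Z for u in unnormalised]

p = order_distribution(n=6, c=1, b=1.0)
print([round(v, 3) for v in p])
```

With the mode at 1 the probabilities fall away exponentially with order, and a smaller b concentrates more of the mass on the mode.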
In the early iterations of the algorithm where c=1, there is a high probability of picking first order weights and an exponentially decreasing probability of picking weights of higher order. In subsequent iterations, \(P_{o}(o)\) is updated in two ways. Firstly, c is increased to allow the algorithm to pick weights with higher orders and secondly the values of existing weights are used to shape the distribution to guide the algorithm towards orders that have yielded high value weights already.
where α is the proportion of the weight order counts, p to include in the update and β is the proportion of the current order mode, c that is included such that 0≤α≤1, 0≤β≤1 and 0<α+β≤1.
If α+β=1 the new distribution is a mixture of the current distribution of weight orders in the MOHN and the discrete Laplace distribution with a mode of c. If α+β<1 the distribution retains some memory of its previous shape, weighted by 1−(α+β). In the experiments reported in this paper, the values α=0.6, β=0.2 were used and found to work well.
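The mixture described above can be sketched as a weighted sum of three vectors. The exact form is an assumption reconstructed from the prose: α weights the proportion of weights currently in the network at each order, β weights the mode-centred distribution, and 1−(α+β) weights the previous shape. The example vectors are made up.

```python
def update_order_distribution(current, order_counts, laplace, alpha=0.6, beta=0.2):
    """Blend weight-order counts, a mode-centred distribution and the previous shape.

    Expects 0 <= alpha <= 1, 0 <= beta <= 1 and 0 < alpha + beta <= 1.
    """
    total = sum(order_counts)
    # Proportion of weights at each order; fall back to the current shape
    # when the network is still empty (an assumption of this sketch).
    p = [k / total for k in order_counts] if total else list(current)
    memory = 1 - (alpha + beta)
    return [alpha * pi + beta * li + memory * ci
            for pi, li, ci in zip(p, laplace, current)]

current = [0.7, 0.2, 0.1]
laplace = [0.2, 0.6, 0.2]       # stand-in for the mode-centred distribution
counts = [5, 15, 0]             # weights currently in the network at orders 1..3
new = update_order_distribution(current, counts, laplace)
print([round(v, 3) for v in new])
```

Because all three inputs sum to 1 and the mixing proportions sum to 1, the result needs no further normalisation.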
The weight order mode, c, needs to be manipulated as learning progresses. In the work reported here, c was set to equal the lowest order with remaining unsampled weights. As lower weight orders are exhausted, the mode naturally moves up. Of course, this does not rule out higher orders being sampled: the α component will bias the sampling towards higher orders if they prove useful. The smaller the value of b, the faster the weight order distribution drops towards zero as it moves away from c.
Distribution over neurons
Once the order, o, of a new candidate weight has been sampled, the o neurons that it connects must be picked. These neurons are picked from a distribution, \(P_{n}()\), that evolves as each neuron is picked. The shape of \(P_{n}()\) is determined by a number of factors. Prior knowledge can be included by increasing the probability of variables that are known to be useful. If no prior knowledge is available, then \(P_{n}()\) starts off as a uniform distribution. Once there are some weights in the network, \(P_{n}()\) is determined by a mixture of the prior knowledge and the role played by each neuron in the existing network. To connect a weight of order o, there are two phases to the neuron picking procedure. The distribution from which the first neuron is picked is shaped by the contribution each neuron is already making. In exploratory mode, neurons that have not yet played a role are favoured and in exploitative mode, neurons that are already well connected are more likely to be picked. Subsequent neurons, up to o, are picked from a distribution that is reshaped by the set of neurons that are already connected to the existing set under construction at orders other than o.
The tradeoff between exploration and exploitation can be managed. Exploration in this case means favouring neurons that have few or weak connections on the assumption that they do have a role to play, but it has yet to be found. Exploitation refers to picking neurons that already have connections on the assumption that those which have proved useful at some orders will also be useful at others.
Equation 13 causes the degree of exploration to vary when ρ<0 and causes the degree of exploitation to vary when ρ>0. The closer to zero the value of ρ gets, the more uniformly random the neuron selection becomes.
where V is the set of weights that are connected to any of the neurons that have been picked for the weight currently under construction. The sum is over all weights that are connected to both x _{ i } and any of the other neurons already chosen for the new weight. The parameter δ∈[0,1] controls the mix of the previous shape of \(P_{n}()\) and the update. High values of δ cause the algorithm to favour neurons that are connected to those already in the set being built, and low values cause it to favour the contribution of each neuron in isolation. In this way sets of neurons that form cliques due to low order connections have a higher probability of being connected at higher orders. Finally, when the number of neurons picked equals o−1, the probability associated with all neurons already connected to those neurons at order o is set to zero to ensure an existing weight is not picked.
Weights are not added and learned one at a time; they are added in batches. In the experiments reported here, the number of weights added at each iteration of the algorithm was set to equal the number of input neurons.
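The role of ρ in trading exploration against exploitation can be illustrated with one possible functional form. The power law below, in which the picking probability is proportional to (contribution + ε) raised to the power ρ, is purely an assumption that reproduces the qualitative behaviour described for ρ; it is not the paper's Equation 13, and the contribution values are invented.

```python
def neuron_distribution(contributions, rho, eps=0.1):
    """Probability of picking each neuron, proportional to (contribution + eps) ** rho."""
    scores = [(s + eps) ** rho for s in contributions]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical contributions, e.g. summed absolute weight values per neuron.
contributions = [0.0, 0.5, 2.0]
explore = neuron_distribution(contributions, rho=-1.0)  # favours unused neurons
uniform = neuron_distribution(contributions, rho=0.0)   # ignores contributions
exploit = neuron_distribution(contributions, rho=1.0)   # favours well-used neurons
for name, dist in (("explore", explore), ("uniform", uniform), ("exploit", exploit)):
    print(name, [round(v, 3) for v in dist])
```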
Efficient weight picking
Once a weight is already in the model or has been tested and discarded, it is considered used. Only unused weights should be considered for addition to the model. When the ratio of available weights to used weights is high, it is efficient to simply pick a random weight using the procedure above and check that it is not already in the network or in a list of weights that have been considered but removed from the network. To avoid unuseful weights being repeatedly added and removed, a list of discarded weights is maintained. Newly sampled prospective weights are first compared to the members of this list and not added if they have been recently tried. As weights may appear unuseful as part of a poorly structured network, but later prove to be of use when the rest of the structure is in place, the discard list is periodically emptied to allow weights a second chance of inclusion.
This approach becomes inefficient when there are very few (or no) weights available at the chosen order, meaning very many choices are required before an available weight is found. To ensure that there are available weights at the chosen order, the algorithm keeps count of how many weights of each order have been used. There are \(n \choose o\) possible weights at each order, o, so when the order o count reaches this figure, the probability of picking a weight at that order is forced to zero.
Another efficiency enhancement to the algorithm is the inclusion of a ‘mopping up’ procedure that is activated when the number of used weights at order o reaches a certain percentage of the total (a threshold of 90 % is used in this paper). When the order o count reaches the threshold, the few remaining weights at order o are automatically added to the model and assessed. This allows the probability of picking from order o to then be forced to zero, thus avoiding many fruitless picks from that order.
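The bookkeeping behind the exhaustion check and the mopping up procedure can be sketched as below. The function names are illustrative, and enumerating all weights of an order is only sensible when that order is small enough to list, which is an assumption of this sketch.

```python
from itertools import combinations
from math import comb

def order_exhausted(n, order, used_count):
    """True when every weight of this order has been used, so its probability should be 0."""
    return used_count >= comb(n, order)

def mop_up(n, order, used, threshold=0.9):
    """Return the unused weights of this order once the used fraction reaches the threshold."""
    used_at_order = {Q for Q in used if len(Q) == order}
    if len(used_at_order) < threshold * comb(n, order):
        return []
    return [Q for Q in combinations(range(n), order) if Q not in used_at_order]

n = 5
used = set(combinations(range(n), 2)) - {(3, 4)}   # 9 of the 10 second order weights
print(mop_up(n, 2, used))
```

Once the returned weights have been added and assessed, the order is fully used and can be excluded from further sampling.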
Learning rules for the weights
We have described two learning rules for a fixed structured MOHN: the online delta method and the offline LASSO. Each method has different attractive properties for estimating weight values during structure discovery. At each iteration of the structure discovery algorithm, a small proportion of new weights are added to a network whose existing weight values are likely to already be close to the correct value. As the delta rule is incremental, it can take advantage of this fact rather than starting a new, empty network. New candidate weights can be initially set using Eq. 5, after which the entire new network is improved using Eq. 4. Algorithm ?? describes this process.
The nature of the regularisation in LASSO means that weights that are not needed have values forced to zero, removing the need for an additional weight removal decision, but at the cost of estimating the entire network structure from scratch at each iteration. Algorithm ?? describes the LASSO network update method. A single value for λ may be chosen or, as is usual in the application of LASSO, a number of different settings for λ may be tried.
Regularisation and weight removal
Regularisation refers to the process of introducing additional constraints to a machine learning process to prevent over fitting. This often takes the form of a penalty on complexity or a bound on the norm of the learned parameters. Regularisation can also involve the use of an out of sample test set. All of these methods may be applied to a MOHN but the main means of regularising a MOHN is the removal of insignificant weights. In this section, two options for weight removal are considered. It is important to remove weights because the rules for updating the probability distributions from which new weights are chosen depend on the presence or absence of weights in the model. It is also desirable to keep the model small for reasons of parsimony, to avoid over fitting and to reduce the time required during learning and inference.
where w is the weight value, σ _{ w } is the variance of f(x) and D is the number of training data points.
Learning the weight values using LASSO forces some weights to zero, making the choice of which weights to remove from the network almost trivial. Removing all the weights with zero value is the simple part, but it is still necessary to choose the value of the regularisation parameter, λ. One approach is to calculate the coefficients at a number of different settings of λ and choose which weights to remove from that set.
The full algorithm
The full structure discovery algorithm is presented below, with reference to partial algorithms already described above.
One advantage of this approach to regression is that a lot of information is available during network learning. Firstly, the maximum number of possible weights at each order is calculated as \(n \choose o\) where o is the order and n is the total number of inputs. As the algorithm progresses, the number of weights of each order in the network may be reported and compared to the possible total. This gives a measure of the complexity of the network compared to possible complexity. By reporting the list of tried and discarded weights, it is also possible to monitor how much of the weight space the algorithm has sampled.
The user might choose to set an upper limit on the order of weights added to the network according to the size of the training sample.
Setting the control parameters
The discussion so far has introduced a number of control parameters for controlling the speed at which probability distributions evolve, the tradeoff between exploration and exploitation, and the sensitivity of the weight removal process. It may appear that there are a lot of parameters to balance, but in reality, most of them can be fixed at default values and never changed. For example, α, β and δ all control the rate at which the weight probability distributions forget their previous shape. In the work reported here they were set at α=0.6, β=0.2 and δ=0.8. The critical value for p in the t-test, \(p_{crit}\), starts at 0.3 and reduces to 0.001, reducing by a factor of 0.7 on each iteration of the algorithm. The parameter b, which controls the rate at which the discrete Laplace distribution over the weight orders drops to zero, was fixed at 1, which concentrates the sample on the mode and 1 step either side of it.
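The schedule for the critical p value can be tabulated with a few lines of code. This sketch interprets "reducing by a factor of 0.7" as multiplying by 0.7 at each iteration and clamps the value at the stated floor; the function name is illustrative.

```python
def p_crit_schedule(start=0.3, floor=0.001, factor=0.7):
    """Critical p-values per iteration: multiply by factor until the floor is reached."""
    values = [start]
    while values[-1] > floor:
        values.append(max(floor, values[-1] * factor))
    return values

schedule = p_crit_schedule()
print(len(schedule) - 1, "reductions;",
      [round(v, 4) for v in schedule[:4]], "...", schedule[-1])
```

Under this interpretation the test starts out permissive, so that early candidate weights are not rejected too hastily, and becomes strict within a few tens of iterations.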
Results and discussion
Many neural network training algorithms are tested on data sets that are either from a public source, such as the UCI database or are of specific interest to the authors of the paper. This paper takes a different approach and uses a set of known test functions that generate training data. Functions with a known structure are chosen so that the results of the algorithm may be evaluated on how well a model structure matches the desired structure as well as by training error. One of the motivations behind the design of the MOHN is that it may be used as a fitness function model for heuristic optimisation tasks [3]. Taking inspiration from this fact, the functions to be learned in this paper are fitness functions to optimisation problems. In such problems, the MOHN acts as a metaheuristic as it knows nothing about the nature of the problem. It simply has a fitness function that it must learn to replicate. Having learned the function, the solutions will be attractors in the energy function of the network.
Graph colouring function
where \(e_d\) is the number of edges with a different colour at each end, \(e_t\) is the number of edges in the graph and \(i_1\) is the number of inputs in block i with a value 1. The output of the function is 1 when a correct colouring for the graph is present and each block has only one bit set to one. The function has interactions within each block at orders up to k, which control the only-one-colour constraint, and additional high order weights between blocks that are connected in the graph.
Comparing LASSO and delta learning
This paper has presented two possible learning rules for estimating the weight values on the network structure as it evolves: delta learning and LASSO. The design justification for using the delta rule after the addition of new weights is that the existing weights should already be close to their desired values, so intuition suggests that this will be faster than using LASSO across the whole network. This section presents some experimental results comparing the two using another well-known fitness function, known as the k-bit trap.
The ‘trap’ part of the function is defined by the second case in Eq. 18, which requires the output to be high when there are no zero values set in the inputs, counter to the first case, which causes the increased presence of zeros in the inputs to increase the function output. Most input patterns suggest a first order model would suffice, with only the patterns where all k bits are set to 1 contradicting that. The trap is that learning algorithms might favour a simple model and fail to add the high order components required to avoid a large error whenever all the inputs are set to 1.
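For readers unfamiliar with trap functions, a commonly used form can be sketched as below. It is an assumption that this matches the two cases described above in shape only: the exact scaling in Eq. 18 may differ, and the function name `trap` is hypothetical. Inputs are taken to be 0 or 1.

```python
def trap(bits):
    """A common k-bit trap: reward zeros, except that all ones scores highest."""
    k = len(bits)
    zeros = bits.count(0)
    if zeros == 0:
        return k          # the isolated optimum: every bit set to 1
    return zeros - 1      # otherwise more zeros means a higher output


# Output for k = 4 as the number of ones grows from 0 to 4.
by_ones = [trap([1] * u + [0] * (4 - u)) for u in range(5)]
print(by_ones)
```

The output falls steadily as ones are added and then jumps at the all-ones pattern, which is the single case a first order model cannot capture.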
Local minima
A well-known limitation of error descent learning algorithms for MLPs is the tendency to settle into a state that represents a local minimum in the cost function. The experimental results in [32, 33] suggest that this is because the MLP must learn the structure of the function and the parameters that fit that structure to the data at the same time, using the same weights. Training a fixed structure MOHN does not run the risk of settling in local minima: all of the learning rules are deterministic and will produce the same result, which represents the global error minimum for the given structure. However, the global minimum differs from structure to structure, so any fixed structure can be considered a local minimum. If a structure discovery algorithm is not able to move from any one structure to a better one, then its current state is a local minimum. The proposed algorithm can be made to avoid local minima simply by ensuring some degree of exploration. Of course, it might take an impractically long time to find the right weights in this way, but the algorithm will not become trapped.
Visualising network structure
During the training of the MLP, very little information is available compared to that from a MOHN. As the MOHN learns, the weight profile and the weight probability distributions may be reported and analysed to understand the progress being made. This section illustrates the structure discovery process using the same k-bit trap function described above. A visual representation of network structure is used to produce an image with n columns and W rows where each column represents a neuron and each row represents a single weight. The pixel at coordinate (i,j) is plotted if \(w_j\) is connected to \(u_i\) and its colour reflects the strength of the connection. If \(w_j\) is not connected to \(u_i\), then no pixel is plotted. The weights are sorted in combinatoric order, with first order weights at the top of the image, second order weights below them, and so on. If a weight is not present in the network, it does not appear in the image, so the height of the image depends on the number of weights in the network.
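The layout of such an image can be sketched as a small matrix builder. The sketch assumes weights are held in a dictionary keyed by the tuple of neuron indices they connect, takes combinatoric order to mean sorting by order and then by index, and renders sign only as text in place of colour. The names `structure_rows` and `render` and the example weights are hypothetical.

```python
def structure_rows(weights, n_inputs):
    """One row per weight, one column per neuron; None where no pixel is plotted."""
    ordered = sorted(weights, key=lambda idx: (len(idx), idx))
    rows = []
    for idx in ordered:
        members = set(idx)
        rows.append([weights[idx] if i in members else None
                     for i in range(n_inputs)])
    return rows


def render(rows):
    """Text stand-in for the image: '+' or '-' for a plotted pixel, '.' for none."""
    def cell(value):
        if value is None:
            return "."
        return "+" if value >= 0 else "-"
    return ["".join(cell(v) for v in row) for row in rows]


# Hypothetical network over 4 neurons with four weights present.
weights = {(0, 1, 2): 1.5, (1,): -0.5, (0,): 0.25, (0, 2): -2.0}
picture = render(structure_rows(weights, n_inputs=4))
print("\n".join(picture))
```

Only weights that are present produce a row, so the number of rows equals the number of weights in the network, as in the images described above.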
MOHN classifiers
It is also possible to train a MOHN as a classifier by assigning some neurons the role of inputs and others the role of outputs. The pattern of connectivity is restricted so that each weight connects m>0 inputs and exactly one output. There are no weights between subsets of inputs alone or outputs alone. The structure discovery algorithm works in the same way as described above with the one modification that a new weight is added for each output unit in turn, but the order and the connected inputs are chosen in the same way.
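The connectivity restriction can be expressed as a simple check on a candidate weight. This sketch assumes neurons are identified by integer indices and that a weight is a tuple of the indices it connects. The function name `is_classifier_weight` and the example index sets are hypothetical.

```python
def is_classifier_weight(weight, inputs, outputs):
    """True if the weight joins at least one input to exactly one output."""
    n_in = sum(1 for u in weight if u in inputs)
    n_out = sum(1 for u in weight if u in outputs)
    # Every connected neuron must be a known input or output.
    return n_out == 1 and n_in > 0 and n_in + n_out == len(weight)


# Hypothetical layout: neurons 0-3 are inputs, neurons 4 and 5 are outputs.
inputs, outputs = {0, 1, 2, 3}, {4, 5}
checks = {
    (0, 2, 4): is_classifier_weight((0, 2, 4), inputs, outputs),  # two inputs, one output
    (0, 1): is_classifier_weight((0, 1), inputs, outputs),        # inputs alone
    (4, 5): is_classifier_weight((4, 5), inputs, outputs),        # outputs alone
    (0, 4, 5): is_classifier_weight((0, 4, 5), inputs, outputs),  # two outputs
}
print(checks)
```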
A classifier MOHN was tested on the MNIST [34] hand written digit data set and compared to a standard MLP. A threshold was applied to the input data to create a binary training set but no other preprocessing was applied. The purpose of this test was to establish whether or not a MOHN could produce similar performance to an MLP on a given data set, not to achieve a new best accuracy for the MNIST data.
An MLP with 30 hidden units was trained on the thresholded MNIST training set data and achieved a correct classification rate of 95% on training and 94% on the test data. A MOHN was trained on the same data and achieved 94% on training and 93% on test. The results are comparable, but the MLP allows very little analysis of the structure of the resulting function. The MOHN, however, exposes the structure of the solution.
Conclusion
Mixed Order Hyper Networks offer a number of attractions for model fitting. The complexity of the model is easily controlled and understood, offering more information than a simple parameter count. The structure of the model is human readable and allows prior knowledge to be included. The model can handle missing values, or impute them if required and feature selection is an integral part of the learning process. However, the challenge of discovering the correct structure for the network must be addressed explicitly as the network cannot discover high order features as part of the weight estimation process, as an MLP can. This paper proposes a method for discovering the weight structure of a MOHN in a way that allows different levels of prior knowledge to be included. In addition, as solutions are human readable, it is possible for a number of human driven iterations of the algorithm to allow the pattern discovery ability of the human eye to drive the search. More work is required in this area, in particular on methods for visualising the network structure. Such an approach would work well with a suitable human computer interface and development of an intuitive GUI is required.
Declarations
Authors’ information
The author lectures at the University of Stirling in the department of Computing Science and Mathematics. He is programme director of an MSc. in Big Data and also runs a spin out company providing hardware, software and consultancy for data collection and analytics. His research interests concern neural networks that replace hidden units with high order connections. This paper is the latest in a line of publications on this topic.
Competing interests
The author declares that he has no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
1. Swingler K, Smith LS. Training and making calculations with mixed order hypernetworks. Neurocomputing. 2014; 141:65–75. doi:10.1016/j.neucom.2013.11.041.
2. Swingler K, Smith LS. An analysis of the local optima storage capacity of Hopfield network based fitness function models. Trans Comput Collective Intel XVII, LNCS 8790. 2014; 8790:248–71.
3. Swingler K. Local optima suppression search in mixed order hyper networks. In: Proc. UKCI 2015. Setúbal: SciTePress: 2015.
4. Bartlett EB. Dynamic node architecture learning: An information theoretic approach. Neural Networks. 1994; 7(1):129–40.
5. LeCun Y, Denker JS, Solla SA, Howard RE, Jackel LD. Optimal brain damage. In: NIPS. San Francisco: Morgan Kaufmann: 1989.
6. Andersen TL, Martinez TR. DMP3: A dynamic multilayer perceptron construction algorithm. Int J Neural Syst. 2001; 11(02):145–65.
7. Frean M. The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Comput. 1990; 2(2):198–209.
8. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: A simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014; 15:1929–58.
9. García-Pedrajas N, Ortiz-Boyer D, Hervás-Martínez C. An alternative approach for neural network evolution with a genetic algorithm: Crossover by combinatorial optimization. Neural Networks. 2006; 19(4):514–28.
10. Yao X, Liu Y. Towards designing artificial neural networks by evolution. Appl Math Comput. 1998; 91(1):83–90.
11. Cooper GF, Herskovits E. A Bayesian method for the induction of probabilistic networks from data. Mach Learn. 1992; 9:309–47.
12. Bouckaert RR. Probabilistic network construction using the minimum description length principle. In: ECSQARU. Berlin Heidelberg: Springer: 1993. p. 41–8.
13. Larrañaga P, Kuijpers CMH, Murga RH, Yurramendi Y. Learning Bayesian network structures by searching for the best ordering with genetic algorithms. IEEE Trans Syst Man Cybernet Part A. 1996; 26(4):487–93.
14. Wong ML, Lee SY, Leung KS. A hybrid data mining approach to discover Bayesian networks using evolutionary programming. In: GECCO. San Francisco: Morgan Kaufmann: 2002. p. 214–22.
15. Wong ML, Lam W, Leung KS. Using evolutionary programming and minimum description length principle for data mining of Bayesian networks. IEEE Trans Pattern Anal Mach Intell. 1999; 21(2):174–8.
16. De Campos CP, Ji Q. Efficient structure learning of Bayesian networks using constraints. J Mach Learn Res. 2011; 12:663–89.
17. Ravikumar P, Wainwright MJ, Lafferty J. High-dimensional graphical model selection using l1-regularized logistic regression. Neural Information Processing Systems. San Francisco: Morgan Kaufmann. 2006.
18. Lee SI, Ganapathi V, Koller D. Efficient structure learning of Markov networks using l1-regularization. In: Advances in Neural Information Processing Systems. San Francisco: Morgan Kaufmann: 2006. p. 817–24.
19. Brownlee AE, McCall JA, Shakya SK, Zhang Q. Structure learning and optimisation in a Markov network based estimation of distribution algorithm. In: Exploitation of Linkage Learning in Evolutionary Algorithms. Berlin Heidelberg: Springer: 2010. p. 45–69.
20. Brownlee A, McCall J, Lee C. Structural coherence of problem and algorithm: An analysis for EDAs on all 2-bit and 3-bit problems. In: 2015 IEEE Congress on Evolutionary Computation (CEC). IEEE Press: 2015. p. 2066–73.
21. Kohavi R, John GH. Wrappers for feature subset selection. Artif Intell. 1997; 97(1–2):273–324.
22. Hocking RR. A Biometrics invited paper. The analysis and selection of variables in linear regression. Biometrics. 1976; 32(1):1–49.
23. Bala J, Jong KD, Huang J, Vafaie H, Wechsler H. Using learning to facilitate the evolution of features for recognizing visual concepts. Evolutionary Computation. 1996; 4:297–311.
24. Cantú-Paz E. Feature subset selection with hybrids of filters and evolutionary algorithms. In: Scalable Optimization Via Probabilistic Modeling. Berlin Heidelberg: Springer: 2006. p. 291–314.
25. Tibshirani R. Regression shrinkage and selection via the lasso. J Royal Stat Soc Ser B (Methodological). 1996; 58:267–88.
26. Swingler K. A comparison of learning rules for mixed order hyper networks. In: Proc. IJCCI (NCTA). Setúbal: SciTePress: 2015.
27. Davidor Y. Epistasis variance: A viewpoint on GA-hardness. In: Foundations of Genetic Algorithms. San Francisco: Morgan Kaufmann: 1990. p. 23–35.
28. Hopfield JJ, Tank DW. Neural computation of decisions in optimization problems. Biol Cybernet. 1985; 52:141–52.
29. Wilson GV, Pawley GS. On the stability of the travelling salesman problem algorithm of Hopfield and Tank. Biol Cybern. 1988; 58(1):63–70. doi:10.1007/BF00363956.
30. Caparrós GJ, Ruiz MAA, Hernández FS. Hopfield neural networks for optimization: study of the different dynamics. Neurocomputing; 43(1–4):219–37.
31. Swingler K. Local optima suppression search in mixed order hyper networks. In: Computational Intelligence (UKCI), 2015 15th UK Workshop On: 2015.
32. Swingler K. A Walsh analysis of multilayer perceptron function. In: Proc. IJCCI (NCTA). Setúbal: SciTePress: 2014.
33. Swingler K. Computational Intelligence In: Merelo JJ, Rosa A, Cadenas JM, Dourado A, Madani K, Filipe J, editors. Studies in Computational Intelligence. Springer: 2016. p. 303–23, doi:10.1007/9783319263939_18. http://dx.doi.org/10.1007/9783319263939_18.
34. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998; 86(11):2278–324.