Data pre-processing
Suppose that there are m ariables x
1,…,x
m
as predictors. The data are divided into two parts: training dataset and test dataset. The algorithm takes three input files: TrX, TeX and TrY. TrX and TeX are predictor matrices for the training and test datasets, respectively. Each row represents a sample and each column represents a variable. TrY is a target vector or a response vector, which can have a real valued or binary. We standardize (subtract the mean and divide by the standard deviation) TrX and TeX to ease subsequent processing.
Intermediate feature generation
We generate 104 ~ 106 random binary intermediate features for each sample. Let K be the number of features to be generated and \( F=\left[\begin{array}{ccc}\hfill {f}_{11}\hfill & \hfill \cdots \hfill & \hfill {f}_{1K}\hfill \\ {}\hfill \vdots \hfill & \hfill \vdots \hfill & \hfill \vdots \hfill \\ {}\hfill {f}_{n1}\hfill & \hfill \cdots \hfill & \hfill {f}_{nK}\hfill \end{array}\right] \) be the feature matrix where fij is the jth intermediate feature of the ith sample. The kth intermediate feature vector f
k
= [f
1k
, …, f
nk
]T is generated as follows:
-
(1)
Randomly select a small subset of variables, e.g. x1, x3, x6.
-
(2)
Randomly assign weights to each selected variables. The weights are sampled from standard normal distribution, for example, w1, w3, w6 ~ N(0,1)
-
(3)
Obtain the weighted sum for each sample, for example z
i
= w
1
x
1i
+ w
3
x
3i
+ w
6
x
6i
for the ith sample.
-
(4)
Randomly pick one z
i
from the n generated z
i
, i = 1, …, n as the threshold T.
-
(5)
Assign bits values to fk according to the threshold T, \( {f}_{ik}=\left\{\begin{array}{c}\hfill 1,{z}_i\ge T\hfill \\ {}\hfill 0,{z}_i<T\hfill \end{array}\right. \), i = 1, …, n.
The process is repeated K times. The first feature is fixed to 1 to act as the interceptor. The bits are stored in a compact way that is memory efficient (32 times smaller than the real valued counterpart). Once the binary intermediate features matrix F is generated, it is used as the only predictors.
L2 regularized linear regression/logistic regression
For real valued TrY, we apply L2 regularized regression (ridge regression) on F and TrY. We model \( {\widehat{Y}}_i={\displaystyle \sum_j{\beta}_j{F}_{ij}} \), where β is the regression coefficient. The loss function to be minimized is \( Loss={\displaystyle \sum_i{\left(Tr{Y}_i-{\widehat{Y}}_i\right)}^2}+\frac{\lambda }{2}{\displaystyle \sum_{j\ne 1}{\beta}_j^2} \), where λ is a regularization parameter which can be selected by cross validation or provided by the user. The β is estimated by \( \widehat{\beta}= \arg { \min}_{\beta } Loss \).
For binary valued TrY, we apply L2 regularized logistic regression on F and TrY. We model \( {\widehat{Y}}_i=\frac{1}{1+ \exp \left(-{\displaystyle \sum_j{\beta}_j{F}_{ij}}\right)} \), where β is the regression coefficient. The loss function to be minimized is \( Loss={\displaystyle \sum_i- TrY \ln \widehat{Y}-\left(1- TrY\right) \ln \left(1-\widehat{Y}\right)}+\frac{\lambda }{2}{\displaystyle \sum_{j\ne 1}{\beta}_j^2} \), where λ is a regularization parameter. The β is estimated by \( \widehat{\beta}= \arg { \min}_{\beta } Loss \).
These models are standard statistical models [19]. The L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm) library was employed to perform the parameter estimation. The L-BFGS method only requires the gradient of the loss function and approximates the Hessian matrix with limited memory cost. Prediction is performed once the model parameters are estimated. Specifically, the same weights that generated the intermediate features in the training dataset were used to generate the intermediate features in the test dataset and use the estimated \( \widehat{\beta} \) in the training dataset to predict the phenotype Y in the test dataset.
Some optimization techniques are used to speed up the estimation: (1) using a relatively large memory (~1GB) to further speed up the convergence of L-BFGS by a factor of 5, (2) using SSE (Streaming SIMD Extensions) hardware instructions to perform bit-float calculations which speeds up the naive algorithm by a factor of 5, and (3) using multi-core parallelism with OpenMP (Open Multi-Processing) to speed up the algorithm.
Benchmarking
We benchmarked nine methods including linear regression (Linear), logistic regression (LR), k-nearest neighbor (KNN), neural network (NN), support vector machine (SVM), extreme learning machine (ELM), random forest (RF), generalized boosted regression models (GBM) and random bit regression (RBR). Our RBR method and usage are available on the website (https://sourceforge.net/projects/rbr/). The KNN method was implemented by our own C++ code. The other seven methods were implemented by R (version: 3.0.2) package: stats, nnet (version: 7.3–8), kernlab (version: 0.9–19), randomForest (version: 4.6–10), elmNN (version: 1.0), gbm (version: 2.1) accordingly. The major differences between C++ and R platforms were runtime, and we didn’t see any significant difference in prediction performance. Ten-fold cross validation was used to evaluate their performance. For methods that are sensitive to parameters, the parameters were manually tuned to obtain the best performances. The benchmarking was performed on a desktop PC, equipped with an AMD FX-8320 CPU and 32GB memory. The SVM on some large sample datasets failed to finish the benchmarking within a reasonable time (2 week). Those results are left as blank.
We first benchmarked all methods on a simulated dataset. The dataset contains 1000 training samples and 1000 testing samples. It contains two variables (X, Y) and is created with the simple formula: Y = Sin(X) + N(0, 0.1), X ∈ (−10π, 10π).
We then benchmarked all datasets from the UCI machine learning repository [20] with the following inclusion criterion: (1) the dataset contains no missing values; (2) the dataset is in dense matrix form; (3) for classification, only binary classification datasets are included; and (4) the included dataset should have a clear instruction and the target variable should be specified.
Overall, we tested 14 regression datasets. They are: 1) 3D Road Network [21], 2) Bike sharing [22], 3) buzz in social media tomhardware, 4) buzz in social media twitter, 5) computer hardware [23], 6) concrete compressive strength [24], 7) forest fire [25], 8) Housing [26], 9) istanbul stock exchange [27], 10) parkinsons telemonitoring [28], 11) Physicochemical properties of protein tertiary structure, 12) wine quality [29], 13) yacht hydrodynamics [30], and 14) year prediction MSD [31]. In addition, we tested 15 classification datasets: 1) banknote authentication, 2) blood transfusion service center [32], 3) breast cancer wisconsin diagnostic [33], 4) climate model simulation crashes [34], 5) connectionist bench [35], 6) EEG eye state, 7) fertility [36], 8) habermans survival [37], 9) hill valley with noise [38], 10) hill valley without noise [38], 11) Indian liver patient [39], 12) ionosphere [40], 13) MAGIC gamma telescope [41], 14) QSAR biodegradation [42], and 15) skin segmentation [43].
All methods were also applied on one psoriasis [44, 45] GWAS genetic dataset to predict disease outcomes. We used a SNP ranking method for feature selection which was based on allelic association p-values in the training datasets, and selected top associated SNPs as input variables. To ensure the SNP genotyping quality, we removed SNPs that were not in HWE (Hardy-Weinberg Equilibrium) (p-value < 0.01) in the control population.