Random Bits Regression: a Strong General Predictor for Big Data

To improve the accuracy and speed of regression and classification, we present a data-based prediction method, Random Bits Regression (RBR). This method first generates a large number of random binary intermediate/derived features based on the original input matrix, and then performs regularized linear/logistic regression on those intermediate/derived features to predict the outcome. Benchmark analyses on a simulated dataset, UCI machine learning repository datasets and a GWAS dataset showed that RBR outperforms other popular methods in accuracy and robustness. RBR (available at https://sourceforge.net/projects/rbr/) is fast and has modest memory requirements, and therefore provides a strong, robust and fast predictor in the big data era.


INTRODUCTION
Data-based modeling is becoming practical in predicting outcomes. We are interested in a general data-based prediction task: given a training data matrix (TrX), a training outcome vector (TrY) and a test data matrix (TeX), predict the test outcome vector (Ŷ). In the era of big data, two practically conflicting challenges are prominent: (1) prior knowledge of the subject (also known as domain-specific knowledge) is largely insufficient; (2) the computation and storage costs of big data are unaffordable.
To meet these challenges, this paper is devoted to modeling a large number of observations without domain-specific knowledge, using regression and classification. The widely used methods for regression and classification include linear regression, k-nearest neighbor (KNN) [1], support vector machine (SVM) [2], neural network (NN) [3,4], extreme learning machine (ELM) [5], deep learning (DL) [6], random forest (RF) [7] and boosting (GBM) [8], among others. Each method performs well on some types of datasets but has its own limitations on others [9][10][11][12]. A method with reasonable performance on a broad, if not universal, range of datasets is highly desired. [...] features. Despite their successes, each has its own drawbacks: the SVM kernel and its parameters need to be tuned by the user, and the memory requirement is large: $O(\text{sample}^2)$. The features of NN and DL are learnt and tuned iteratively, which is computationally expensive. The number of ELM features is usually too small for complex tasks. These drawbacks limit their applicability to complex tasks, especially when the data is big.
In this report, we propose a novel strategy, named Random Bits Regression (RBR), that takes advantage of a large number of intermediate features following Cover's theorem [13]. We first generate a huge number of ($10^4$ to $10^6$

Data Pre-processing
Suppose that there are m variables $x_1, \ldots, x_m$ as predictors. The data are divided into two parts: a training dataset and a test dataset. The algorithm takes three input files: TrX, TeX and TrY. TrX and TeX are the predictor matrices for the training and test datasets, respectively. Each row represents a sample and each column represents a variable. TrY is a target (response) vector, which can be real-valued or binary. We standardize TrX and TeX (subtract the mean and divide by the standard deviation) to ease subsequent processing.
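The standardization step can be sketched as follows. This is a minimal illustration rather than the RBR implementation: the function name `standardize` is hypothetical, and the text does not say which statistics are used for TeX, so the sketch assumes the training mean and standard deviation are reused for the test matrix.

```python
import numpy as np

def standardize(TrX, TeX):
    """Standardize both matrices column-wise (zero mean, unit variance).

    Assumption: the mean and standard deviation are estimated on TrX only
    and reused for TeX; the source text does not specify this detail.
    """
    TrX = np.asarray(TrX, dtype=float)
    TeX = np.asarray(TeX, dtype=float)
    mean = TrX.mean(axis=0)
    std = TrX.std(axis=0)
    std[std == 0] = 1.0  # constant columns: avoid division by zero
    return (TrX - mean) / std, (TeX - mean) / std

# Toy data: 3 training samples, 1 test sample, 2 variables.
TrX = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
TeX = np.array([[2.0, 40.0]])
TrX_std, TeX_std = standardize(TrX, TeX)
```

Reusing the training statistics keeps the test samples on the same scale as the training samples, which matters once thresholds are learnt on the training data.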
(2) Randomly assign a weight to each selected variable. The weights are sampled from the standard normal distribution, for example, $w_1, w_3, w_6 \sim N(0,1)$.
(3) Obtain the weighted sum for each sample, for example $z = w_1 x_1 + w_3 x_3 + w_6 x_6$.
The process is repeated K times. The first feature is fixed to 1 to act as the intercept. The bits are stored compactly, which is memory efficient (32 times smaller than the real-valued counterpart). Once the binary intermediate feature matrix F is generated, its columns are used as the only predictors.
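The bit generation can be sketched as below, under loudly stated assumptions. The excerpt shows only the weighting and weighted-sum steps, so the variable-selection rule (a random subset of at most `max_vars` variables) and the thresholding rule (the weighted sum of one randomly chosen training sample) are illustrative guesses. The names `random_bits`, `max_vars` and `seed` are hypothetical.

```python
import numpy as np

def random_bits(TrX, TeX, K, max_vars=3, seed=0):
    """Generate K binary features for the training and test matrices."""
    rng = np.random.default_rng(seed)
    n, m = TrX.shape
    F_tr = np.ones((n, K), dtype=np.uint8)  # column 0 stays 1: the intercept
    F_te = np.ones((TeX.shape[0], K), dtype=np.uint8)
    for k in range(1, K):
        # Assumption: pick a small random subset of the variables.
        size = rng.integers(1, min(max_vars, m) + 1)
        cols = rng.choice(m, size=size, replace=False)
        # Weights are sampled from the standard normal distribution.
        w = rng.standard_normal(size)
        # Weighted sum for each sample.
        z_tr = TrX[:, cols] @ w
        z_te = TeX[:, cols] @ w
        # Assumption: threshold at the weighted sum of a random training sample.
        threshold = z_tr[rng.integers(n)]
        F_tr[:, k] = z_tr > threshold
        F_te[:, k] = z_te > threshold
    return F_tr, F_te

data_rng = np.random.default_rng(1)
TrX = data_rng.standard_normal((50, 6))
TeX = data_rng.standard_normal((10, 6))
F_tr, F_te = random_bits(TrX, TeX, K=64)

# Compact storage: 8 bits per byte, i.e. 32 bits in the space of one float32.
packed = np.packbits(F_tr, axis=1)
```

The same weights and thresholds are applied to TeX so that each test bit means the same thing as the corresponding training bit.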

L2 Regularized Linear Regression/Logistic Regression
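For real-valued TrY, we apply L2 regularized regression (ridge regression) on F and TrY. We model $TrY = F\beta + \varepsilon$, where $\beta$ is the regression coefficient. $\beta$ is estimated by minimizing $Loss = \lVert TrY - F\beta \rVert^2 + \lambda \lVert \beta \rVert^2$, where $\lambda$ is a regularization parameter which can be selected by cross validation or provided by the user.
For binary-valued TrY, we apply L2 regularized logistic regression on F and TrY. We model $P(TrY_i = 1) = 1 / (1 + e^{-F_i \beta})$, where $\beta$ is the regression coefficient. The loss function to be minimized is $Loss = -\sum_i \left[ TrY_i \log p_i + (1 - TrY_i) \log (1 - p_i) \right] + \lambda \lVert \beta \rVert^2$, with $p_i = P(TrY_i = 1)$. These models are standard statistical models [14].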
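Both fits can be sketched in a few lines, assuming the usual textbook losses: the sum of squared errors plus λ times the squared norm of β for ridge regression, and the negative log-likelihood plus the same penalty for logistic regression. The function names, the plain gradient-descent solver and its `lr` and `n_iter` settings are illustrative choices, not the solver used by RBR. The sketch also penalizes the intercept for simplicity, which is an assumption.

```python
import numpy as np

def ridge_fit(F, y, lam):
    """Closed-form minimizer of ||y - F b||^2 + lam * ||b||^2."""
    F = np.asarray(F, dtype=float)
    K = F.shape[1]
    return np.linalg.solve(F.T @ F + lam * np.eye(K), F.T @ y)

def logistic_fit(F, y, lam, lr=0.1, n_iter=500):
    """Gradient descent on the L2 regularized logistic loss."""
    F = np.asarray(F, dtype=float)
    n, K = F.shape
    beta = np.zeros(K)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(F @ beta)))    # predicted probabilities
        grad = F.T @ (p - y) + 2.0 * lam * beta  # gradient of the loss
        beta -= lr * grad / n
    return beta

# Toy bit matrix: an intercept column plus one random bit.
F = np.array([[1, 0], [1, 1], [1, 0], [1, 1]])
y_real = np.array([1.0, 3.0, 1.0, 3.0])
y_bin = np.array([0.0, 1.0, 0.0, 1.0])

beta_ridge = ridge_fit(F, y_real, lam=1.0)
beta_logit = logistic_fit(F, y_bin, lam=0.1)
pred = (1.0 / (1.0 + np.exp(-(F @ beta_logit))) > 0.5).astype(float)
```

With a very small λ the ridge solution on this toy data approaches the exact fit (intercept 1, slope 2); larger λ shrinks both coefficients.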

Benchmarking
We benchmarked nine methods including linear regression (Linear), logistic [...] For methods that are sensitive to parameters, the parameters were manually tuned to obtain the best performance. The benchmarking was performed on a desktop PC equipped with an AMD FX-8320 CPU and 32 GB of memory. On some large-sample datasets, the SVM failed to finish the benchmarking within a reasonable time (2 weeks). Those results are left blank.
All methods were also applied to a psoriasis [39] GWAS dataset to predict disease outcomes. We used a SNP ranking method for feature selection, which was based on allelic association p-values in the training datasets, and selected the top associated SNPs as input variables. To ensure SNP genotyping quality, we removed SNPs that were not in Hardy-Weinberg Equilibrium (HWE) (p-value < 0.01) in the control population.
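The HWE quality filter can be illustrated with a short sketch. The text does not say which HWE test was used, so the one-degree-of-freedom chi-square test below is an assumption, and the SNP names and genotype counts are made up for illustration.

```python
import math

def hwe_p_value(n_AA, n_Aa, n_aa):
    """Chi-square (1 degree of freedom) test for Hardy-Weinberg Equilibrium."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    observed = (n_AA, n_Aa, n_aa)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

def keep_snp(control_counts, alpha=0.01):
    """Keep a SNP unless HWE is rejected at level alpha in the controls."""
    return hwe_p_value(*control_counts) >= alpha

# Hypothetical control genotype counts (AA, Aa, aa) for two SNPs.
control_genotypes = {"snp_a": (25, 50, 25), "snp_b": (50, 0, 50)}
kept = [name for name, counts in control_genotypes.items() if keep_snp(counts)]
```

Here `snp_a` matches the HWE proportions exactly and is kept, while `snp_b` has no heterozygotes at all and is removed.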

RESULTS
We first examined the nonlinear approximation accuracy of the 8 methods.
Figure 1 shows the curve fitting for the sine function with several learning algorithms.
We observed that linear regression, ELM and GBM failed on this dataset and that the SVM's fit was also unsatisfactory. In contrast, KNN, NN, RF and RBR produced good results.
Next we evaluated the performance of the eight methods for regression analysis. [...] the difference between the RBR and the best prediction was within 2%. RBR did not experience a breakdown on any of the 14 datasets. The random forest was the second best method; however, it suffered from failure on the yacht hydrodynamics dataset.
Finally, we investigated the performance of the RBR for classification. Table 2 shows the classification error percentages of the different methods on 16 datasets. RBR took 12 first places and 4 second places. In the cases where the RBR was not in first place, the difference between the RBR and the best classification was small and no failure was observed. Despite its simplicity, KNN was the second best method and took 3 first places. However, it suffered from failure/breakdown on the Climate Model Simulation Crashes, EEG Eye State, Hill Valley with noise, Hill Valley without noise, and Ionosphere datasets.

The RBR is also reasonably fast on big datasets. For example, it took two hours to process the largest dataset, Year Prediction MSD (515,345 samples, 90 features, and $10^5$ intermediate features).
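A back-of-the-envelope calculation shows why bit storage matters at this scale. It assumes one bit per intermediate feature versus a 32-bit float per feature. The actual memory layout of the RBR implementation is not described here, so the figures are only indicative.

```python
# Memory needed for the intermediate feature matrix of Year Prediction MSD.
n_samples = 515_345
n_bits = 10**5

packed_bytes = n_samples * n_bits // 8   # one bit per feature
float32_bytes = n_samples * n_bits * 4   # one 32-bit float per feature

packed_gb = packed_bytes / 1e9    # about 6.4 GB
float32_gb = float32_bytes / 1e9  # about 206 GB
ratio = float32_bytes // packed_bytes
```

Under these assumptions the packed matrix fits in the 32 GB of the benchmarking machine, whereas a float32 matrix of the same shape would not.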

DISCUSSIONS
Big data analysis consists of three scenarios: (1) [...] The second issue is how the results from each of the subsets are then combined to obtain an overall result. The RBR is closely related to boosting. Each RBR random bit can be viewed as a weak classifier. Logistic regression is the same as one kind of boosting algorithm named LogitBoost. The RBR method boosts those weak bits to form a strong classifier. The RBR is also closely related to neural networks: it is equivalent to a single hidden layer neural network in which the bits are the hidden units.
A large number of bits is a conjugate approach (we call it wide learning) to deep learning. As no back-propagation is required, the learning rule is quite simple and thus biologically feasible. Biologically, the brain has the capacity to form a huge feature layer (maybe $10^8$ to $10^{10}$) to approximate functions well.
The third issue is computational cost. The RBR scales well in memory and computation time compared to the SVM because it uses a fixed number of binary features. The RBR is faster than the random forest or boosting trees due to the lightweight nature of the bits.

CONCLUSION
In conclusion, the RBR is a strong, robust and fast off-the-shelf predictor, especially in the big data era.

CONFLICT OF INTEREST
There are no conflicts of interest.

ACKNOWLEDGMENTS
There is no funding support.