 Research
 Open Access
A subspace recursive and selective feature transformation method for classification tasks
Big Data Analytics, volume 2, Article number: 10 (2017)
Abstract
Background
Practitioners and researchers often find that the intrinsic representations of high-dimensional problems have far fewer independent variables. However, such intrinsic structure may not be easily discovered due to noise and other factors.
A supervised transformation scheme, RST, is proposed to transform features into lower-dimensional spaces for classification tasks. The proposed algorithm recursively and selectively transforms the features, guided by the output variables.
Results
We compared the classification performance of a linear classifier and a random forest classifier on the original data sets, on data sets transformed with RST, and on data sets transformed by principal component analysis and linear discriminant analysis. On 7 out of 8 data sets RST shows superior classification performance with linear classifiers, but less ideal performance with random forest classifiers.
Conclusions
Our tests show the proposed method’s capability to reduce feature dimensionality in general classification tasks while preserving useful information using linear transformations. Some limitations of the method are also pointed out.
Background
In machine learning tasks the intrinsic representations of high-dimensional data may have far fewer independent variables, as suggested by Hastie [1] in handwritten character recognition, the motion of objects [2], and array signal processing [3]. Most methods addressing this problem are domain-specific; in image processing, for example, the learning of representations often relies on locality and smoothness assumptions [4]. We are interested in a generally applicable transformation method that maps features into lower dimensions while preserving useful information for classification tasks.
The proposed algorithm reduces the dataset’s dimensionality by selectively projecting data points onto the decision plane determined by a fitted linear discriminative model. The algorithm can run recursively to make better projections.
Notation
Bold lowercase letters denote vectors, e.g. v, and capital letters denote matrices, e.g. M. A vector v’s L1 and L2 norms are denoted by ∥v∥_{1} and ∥v∥_{2}, respectively. The ith row and jth column of a matrix M are denoted by m(i) and m _{ j }, respectively. We use sample and data point interchangeably to refer to an observation in the data set.
Classification tasks
Given a dataset X∈R ^{m×n} (m samples and n features) and its corresponding class labels y∈{c _{1},…,c _{ c }}^{m}, where c is the number of classes and y _{ i } is sample X(i)’s ground-truth class. Let x be a row (sample) of X and θ be the model parameters.
A classification task is to construct a classifier h(θ):x↦y from the seen examples (X,y) such that for an unseen set of examples X _{ pred }, h(X _{ pred },θ) is as close as possible to y _{ pred }. The modelling process minimizes the empirical error between y and h(X,θ), denoted \(\hat {E}(\boldsymbol {y}, h(X, \theta))\). To avoid h overfitting (X,y), the complexity of h is penalized in terms of some norm of h, ∥h(·)∥_{ p }. Therefore a classifier can be trained by solving:

\(\min_{\theta}\; \hat {E}(\boldsymbol {y}, h(X, \theta)) + \lambda \|h(\cdot)\|_{p}\)
Support vector machine
The support vector machine (SVM) is a classic classification algorithm. In a two-class classification task (y∈{c _{1},c _{2}}^{m}), it minimizes \(\hat {E}(\boldsymbol {y}, h(X, \theta))\) by attempting to place a hyperplane, f(X)=w X+w _{0}, between data points from classes c _{1} and c _{2}. The classifier h(X _{ pred }) returns a vector of positive or negative signs to indicate the labels of X _{ pred }.
The hyperplane, often known as the decision plane, is optimized to maximize its minimal distance to the data points from c _{1} and c _{2} (also known as the margin). Vapnik showed that \(\hat {E}(y, h(X))\) can be minimized by maximizing the margin [5]. A penalty term C is used to control the cost of the decision plane misplacing a particular sample x _{ i }∈X on the wrong side. The model complexity can be penalized by minimizing w’s L _{1} norm ∥w∥_{1} or L _{2} norm ∥w∥_{2} [6].
In the proposed algorithm w is used to indicate each feature’s contribution to a discriminative model. Since w defines the direction of the hyperplane separating the two classes of data points in the n-dimensional space, a w _{ i }∈w (i=1,…,n) close to zero indicates the plane is nearly parallel to the i ^{th} axis. This intuition has been used in the Recursive Feature Elimination [7] method. Figure 1 is a plot of such weights.
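The intuition above can be illustrated with a minimal sketch, assuming a scikit-learn-style linear SVM; the iris data and the two-class restriction are illustrative choices, not from the paper:

```python
# Sketch: reading a linear SVM's weight vector w as per-feature
# contributions to the decision plane (|w_i| near 0 => feature i
# barely shapes the plane).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
mask = y < 2                          # restrict to a two-class problem
clf = LinearSVC(C=1.0, max_iter=5000).fit(X[mask], y[mask])

w = clf.coef_.ravel()                 # one weight per feature
importance = np.abs(w)                # |w_i| as feature i's contribution
ranking = np.argsort(importance)[::-1]  # most important feature first
```

The ranking is then available for downstream selection, which is exactly how the RFE method described next consumes these weights.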
Recursive feature elimination
Recursive Feature Elimination (RFE) is a supervised feature ranking and selection technique. It uses a classifier’s feature-related weights as feature importance metrics, such as w in SVMs or the coefficients in Fisher’s linear discriminant [7, 8].
With the desired number of features fixed and s the number of features to remove in each step, RFE starts by training a classifier on all training samples. It then recursively eliminates the s features with the lowest importance until the desired number of features is reached.
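The procedure above maps directly onto scikit-learn's `RFE` wrapper; the digits data set and the particular `n_features_to_select`/`step` values below are illustrative, not the paper's settings:

```python
# Sketch of RFE: repeatedly drop the s least important features
# (ranked by |w| from a linear SVM) until k features remain.
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)       # 1797 samples, 64 features
selector = RFE(LinearSVC(max_iter=5000),
               n_features_to_select=16,   # desired number of features k
               step=4)                    # s features removed per step
X_reduced = selector.fit_transform(X, y)  # keeps the 16 surviving columns
```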
Principle component analysis
Principal Component Analysis (PCA) is a feature transformation method that finds a set of orthogonal components that explain the maximum variance in the samples. It can also be used to reduce the dimensionality of the samples by projecting them onto the lower-dimensional space spanned by the principal components [9].
The PCA algorithm starts by subtracting the mean of X from every x _{ i }∈X. It then computes the covariance matrix \(\frac {Cov(X, X)}{m} \) and, via singular value decomposition, finds the eigenvectors with the k (k<m) greatest eigenvalues as the orthogonal components. Stacking the k eigenvectors as columns gives a transformation matrix A of size n×k, and the projection of X to a k-dimensional space is XA.
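The steps above can be sketched in a few lines of NumPy; the random data and k=3 are illustrative:

```python
# Sketch of PCA as described: center the data, obtain the principal
# directions via SVD, and project onto the top-k components.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 samples, 10 features
k = 3

Xc = X - X.mean(axis=0)                 # subtract the mean of X
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
A = Vt[:k].T                            # n x k matrix of top-k components
X_proj = Xc @ A                         # projection to a k-dim space
```

The rows of `Vt` are orthonormal, so the columns of `A` form an orthogonal basis of the k-dimensional target space, matching the "orthogonal components" in the text.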
Feature extraction with SVM
Tajiri et al. proposed a method for feature extraction using the weight coefficients of a learnt linear SVM classifier in binary classification problems [10]. The intuition is that, assuming the decision boundary of a linear SVM perfectly discriminates the two classes, the hyperplane orthogonal to the weight vector that determines the decision boundary is an ideal plane onto which to project the data points, embedding the discriminative information into the transformed data points.
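A minimal sketch of this projection, under our reading of [10]: removing each sample's component along the weight vector w places it on the hyperplane whose normal is w. The random data and weight vector are illustrative, not from the cited paper:

```python
# Sketch: project samples onto the hyperplane orthogonal to w by
# removing the component of each sample along w.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))            # 50 samples, 5 features
w = rng.normal(size=5)                  # weight vector of a linear model

u = w / np.linalg.norm(w)               # unit normal of the hyperplane
X_proj = X - np.outer(X @ u, u)         # strip the w-component
```

After the projection every point has zero component along w, so the data effectively live in an (n−1)-dimensional subspace, which is the dimensionality reduction exploited in RST below.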
Recursive Selective Feature Transformation (RST) for classification tasks
Feature importance
The weight vector w from a linear SVM model can be seen as a measure of how much each feature contributes to forming the decision plane.
From a geometric perspective, if w _{ i }=0, the axis of dimension i is parallel to the decision plane. Such a situation indicates that the decision plane does not split the samples into two classes along dimension i; in other words, feature i does not help the SVM model discriminate the samples.
If w _{ i }=−1 or 1, the axis of dimension i is orthogonal to the decision plane, which indicates that only by feature i can the decision plane discriminate the samples. In most cases w _{ i } does not reach −1 or 1.
Feature importance vector v
In a binary classification setting where c=2, our feature importance vector v∈ [ 0,1]^{n} takes the absolute values of the weight vector w∈ [ −1,1]^{n}.
In a multi-class setting (c>2), a k-class classification problem is solved by the One-versus-Rest [11] scheme, an ensemble of k classifiers, each trained to discriminate the training data of one class c _{ i }∈c from the rest (c∖c _{ i }). In a k-class setting there are k weight vectors, stacked as W∈ [ −1,1]^{k×n}, from all the sub-classifiers in the ensemble; v is then taken as the mean of the absolute values of W over its rows.
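Computing v from the stacked OvR weight matrix is a one-liner; a minimal sketch, using iris as an illustrative 3-class problem (note that scikit-learn's raw `coef_` is not normalized to [−1, 1], which the paper's notation assumes):

```python
# Sketch: feature importance vector v as the row-wise mean of |W|,
# where W holds the k one-versus-rest weight vectors (shape k x n).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)          # 3 classes -> 3 OvR models
clf = LinearSVC(max_iter=5000).fit(X, y)   # OvR is the default scheme

W = clf.coef_                              # shape (k, n) = (3, 4)
v = np.abs(W).mean(axis=0)                 # mean |weight| per feature
```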
Recursive Selective Feature Transformation (RST)
Consider a hypothetical dataset X with m rows of examples and n columns of features.
We adopted an improved version of the SVM feature extraction method by Tajiri et al. [10]. In short, their method projects data points onto the hyperplane orthogonal to the linear decision plane’s normal vector. Projecting the data points onto a hyperplane in the feature space naturally reduces the dimensionality by 1; in other words, the projected data points have dimensionality n−1. This approach maintains the data points’ inter-class separation while reducing dimensionality.
Therefore reducing the dimensionality to a much smaller number, say k, requires projecting the data points n−k times, which involves training an SVM model n−k times. Since the approximate time complexity of training a linear SVM model is O(max(m,n)·min(m,n)^{2}) [12], and data sets normally have far more rows than columns (m≫n), the time complexity of Tajiri et al.’s method in this scenario is around O(m(n−k)n ^{2}), which is expensive to compute for high-dimensional data.
To speed up the dimensionality reduction process, RST uses the SVM’s weight vector to select the top-k important features from the feature importance vector (Eq. 4), reducing the weight vector to length k, and projects the data points onto a k-dimensional feature subspace determined by the reduced weight vector. Thus k can be used as a parameter to set the desired dimensionality. (RST Step 1)
Projecting data points using the weights of the less important features is not ideal, since these features contributed little to forming the current decision plane. They may represent redundant or otherwise non-informative information and can normally be discarded, as in RFE. A linear decision plane may not generalise on those less important features; therefore RST performs dimensionality reduction on these features with RST Step 1. (RST Step 2)
To further improve the transformation quality while reducing the dimensionality, RST Step 1 and RST Step 2 run recursively. The time complexity of one combined step of RST Steps 1 and 2 is around O(m n ^{2}). In our implementation we replace the linear SVM [13] with a stochastic-gradient-descent-trained linear model (l2-regularised, hinge loss) on problems with over 10,000 samples to cope with the potential cache issues of SVM.
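One combined step can be sketched as follows. This is our interpretation of Steps 1 and 2, not the authors' reference implementation: the function name `rst_step`, the synthetic data, and the choice to project the retained subspace onto the plane of the reduced weights are all our own:

```python
# Minimal sketch of one combined RST step: train a linear SVM, keep
# the top-k features by importance v, then project the retained
# k-dimensional subspace onto the decision plane of the reduced w.
import numpy as np
from sklearn.svm import LinearSVC

def rst_step(X, y, k):
    clf = LinearSVC(max_iter=5000).fit(X, y)
    v = np.abs(clf.coef_).mean(axis=0)        # feature importance vector
    top = np.argsort(v)[::-1][:k]             # top-k features (Step 1)
    Xk = X[:, top]                            # restrict to the subspace
    wk = clf.coef_.mean(axis=0)[top]          # reduced weight vector
    u = wk / np.linalg.norm(wk)
    return Xk - np.outer(Xk @ u, u)           # project within the subspace

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # toy binary labels
X_t = rst_step(X, y, k=8)                     # one recursion: 20 -> 8 dims
```

Running `rst_step` again on `X_t` with a smaller k gives the recursion described in the text.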
Furthermore, multi-class problems can be difficult for feature extraction based on linear SVM models. The usual approach to multi-class problems is the One-versus-Rest (OvR) ensemble [11]. Under this approach, a k-class problem leads to k decision functions, i.e., k weight vectors and k corresponding intercepts, each representing the model for one “one class versus the rest” problem. Obviously, averaging the weight vectors, as in the linear models, brings little benefit in terms of generalisation. In RST the ith (i∈[1,k]) projection matrix is learnt from the ith weight vector in the ensemble. Thus, when training a classifier, the ith sub-classifier in the OvR ensemble learns on the data set transformed by the ith projection matrix; see Fig. 2 for an illustration. When predicting an unknown sample, RST transforms the sample by the k projection matrices into k transformed samples for the ensemble to predict.
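The per-class transformation can be sketched as below. The variable names (`views`) and the use of a simple hyperplane projection per weight vector are our illustrative reading of the scheme, not the authors' code:

```python
# Sketch of the multi-class handling: one projection per OvR weight
# vector, so each sub-classifier sees its own transformed view of X.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
clf = LinearSVC(max_iter=5000).fit(X, y)      # k = 3 OvR weight vectors

views = []
for w in clf.coef_:                           # one projection per class
    u = w / np.linalg.norm(w)
    views.append(X - np.outer(X @ u, u))      # class-specific transform
# At predict time, an unseen sample is pushed through all k transforms
# and each transformed copy is scored by its own sub-classifier.
```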
Methods
Evaluation procedures
The experiments run for 32 iterations. At each iteration the data set is randomly split with a train-test ratio of 3:2. Transformations are learnt by RST, linear discriminant analysis (LDA) [14] and PCA on the training set, then both the training and testing sets are transformed with the learnt transformer. RST runs for 6 recursions, and the transformation learnt at each recursion is benchmarked. For comparison, the number of output features from PCA and LDA is set to the number of dimensions of each data set transformed by RST. Due to the nature of LDA, the number of features after its dimensionality reduction is strictly less than the number of classes.
The values of the data sets have been scaled to [ 0,1] without any centering; no missing values or outliers (beyond +1.5 IQR) are present. A random forest classifier and a linear SVM classifier with L _{2} norm penalty and hinge loss are used as the benchmark classifiers.
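The protocol can be sketched as follows; we use iris as a stand-in data set and only 3 of the 32 iterations to keep the example fast:

```python
# Sketch of the evaluation loop: scale features to [0, 1] with no
# centering, then repeatedly split 3:2 and score a benchmark model.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X = MinMaxScaler().fit_transform(X)           # values scaled to [0, 1]

scores = []
for seed in range(3):                         # paper uses 32 iterations
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.4, random_state=seed)   # 3:2 train-test ratio
    clf = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))
```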
We evaluate the classification performance via the multi-class logarithmic loss [15]:

\(logloss = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log p_{i,c}\)
where N is the number of samples being tested, C is the number of classes, log is the natural logarithm, y _{ i,c } is 1 if observation i is in class c and 0 otherwise, and p _{ i,c } is the predicted probability that sample i is in class c. This metric takes into account the uncertainty of the classifier’s prediction according to how much it varies from the actual class label. That is, lower log loss indicates higher confidence that a classifier makes a correct prediction. Incorrect predictions or uncertain predictions will yield higher log loss.
However, a linear SVM does not naturally make probabilistic predictions; we obtained them by calibrating our linear SVM model using Niculescu-Mizil’s method [16] via a 3-fold cross-validation.
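A minimal sketch of this calibration-plus-scoring step, using scikit-learn's calibration wrapper and iris as an illustrative data set:

```python
# Sketch: wrap a linear SVM (no native probabilities) in 3-fold
# probability calibration, then score with multi-class log loss.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = CalibratedClassifierCV(LinearSVC(max_iter=5000), cv=3)
clf.fit(X_tr, y_tr)
loss = log_loss(y_te, clf.predict_proba(X_te))   # lower is better
```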
Data sets
Eight data sets are used to evaluate the proposed method:
1. otto group classification [17] (61878 samples, 93 dimensions, 9 classes)
2. mnist digits recognition [18] (70000 samples, 784 dimensions, 10 classes)
3. olivetti faces recognition [19] (400 samples, 4096 dimensions, 40 classes)
4. sonar: rock vs mine sensory readings [20] (207 samples, 60 dimensions, 2 classes)
5. hand written digits [21] (1797 samples, 64 dimensions, 10 classes)
6. hand written letter recognition [22] (20000 samples, 16 dimensions, 26 classes)
7. glass identification [23] (214 samples, 9 dimensions, 6 classes)
8. iris [24] (150 samples, 4 dimensions, 3 classes)
Discussion
Among all the transformer (original, i.e. no transformation; RST; LDA; PCA) and classifier (Random Forest, Linear SVM) combinations, SVM combined with the RST transformation has the best classification performance (lowest logarithmic loss) on all data sets except sonar and letter. On sonar, it is noteworthy that none of the transformations improves the classification performance over the original (untransformed) data set. On letter, RST+SVM performs better than original+SVM by a small margin but is outperformed by original+RandomForest.
Generally RST works better with linear SVM classifiers than with the non-linear Random Forest classifiers. With the Random Forest classifier, RST, along with the other two feature transformers, even deteriorates the classification performance. This is not surprising, since Random Forest’s feature extraction technique is to use multiple random feature subspaces, i.e., multiple random splits over the features; if the set of extracted features is optimised to be useful and compact, a random subspace of it is surely a less useful representation.
In most cases RST reduces the dimensionality and steadily reduces the log loss at the same time, up to a point where the log loss stops decreasing or starts to increase. Therefore, in practice it is preferable to hold out a subset of the data to validate the feature transformer learnt by RST during training, so that RST stops learning when no further improvement in log loss is seen.
Conclusion
The results show that RST is able to extract a fraction of the features from a high-dimensional data set in a way that improves classification performance in our empirical experiments. However, its performance is not up to the state of the art. For instance, on the Otto Group Classification Challenge data set [17], our result is only comparable to results from the second quartile, which typically come from gradient-boosted tree models with feature engineering.
It is noteworthy that three major limitations of RST need further work. Firstly, feature selection via SVM weights is not a smooth process, so optimising RST via efficient numeric methods, for instance stochastic gradient descent, is infeasible. Secondly, the RST algorithm essentially stacks multiple linear transformations. Although fitting data with linear models is fast and efficient, stacked linear transformations are still not capable enough to capture the non-linearity in the feature space. Thirdly, RST embeds discriminative information into the input space (feature space) from the output space (class labels); if some classes are similar [25] or some samples are mislabelled, RST will make less ideal transformations.
Abbreviations
LDA: Linear discriminant analysis
RFE: Recursive feature elimination
RST: Recursive selective feature transformation
SVM: Support vector machine
PCA: Principal component analysis
References
1. Hastie T, Simard PY. Metrics and models for handwritten character recognition. Stat Sci. 1998;13(1):54–65. doi:10.1214/ss/1028905973.
2. Tomasi C, Kanade T. Shape and motion from image streams under orthography: a factorization method. Int J Comput Vis. 1992;9(2):137–54.
3. Markovsky I. Low Rank Approximation. London: Springer London; 2012. doi:10.1007/978-1-4471-2227-2.
4. Bengio Y, Courville A, Vincent P. Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell. 2013;35(8):1798–828.
5. Vapnik VN. Statistical Learning Theory, Vol. 1. 1998.
6. Weston J, Elisseeff A, Schölkopf B, Tipping M. Use of the zero-norm with linear models and kernel methods. J Mach Learn Res. 2003;3:1439–61.
7. Gysels E, Renevey P, Celka P. SVM-based recursive feature elimination to compare phase synchronization computed from broadband and narrowband EEG signals in brain–computer interfaces. Signal Process. 2005;85(11):2178–89.
8. Louw N, Steel S. Variable selection in kernel Fisher discriminant analysis by means of recursive feature elimination. Comput Stat Data Anal. 2006;51(3):2043–55.
9. Jolliffe I. Principal Component Analysis. New York: Springer-Verlag; 2002.
10. Tajiri Y, Yabuwaki R, Kitamura T, Abe S. Feature extraction using support vector machines. Vol. 6444, LNCS. 2010. p. 108–15.
11. Rocha A, Goldenstein SK. Multiclass from binary: expanding one-versus-all, one-versus-one and ECOC-based approaches. IEEE Trans Neural Netw Learn Syst. 2014;25(2):289–302.
12. Chapelle O. Training a support vector machine in the primal. Neural Comput. 2007;19(5):1155–78. doi:10.1162/neco.2007.19.5.1155.
13. Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ. LIBLINEAR: a library for large linear classification. J Mach Learn Res. 2008;9:1871–4.
14. Yu H, Yang J. A direct LDA algorithm for high-dimensional data – with application to face recognition. Pattern Recogn. 2001;34(10):2067–70. doi:10.1016/S0031-3203(00)00162-X.
15. Rosasco L, De Vito E, Caponnetto A, Piana M, Verri A. Are loss functions all the same? Neural Comput. 2004;16(5):1063–76. doi:10.1162/089976604773135104.
16. Niculescu-Mizil A, Caruana R. Predicting good probabilities with supervised learning. In: Proceedings of the 22nd International Conference on Machine Learning, ICML ’05. New York: ACM Press; 2005. p. 625–32. doi:10.1145/1102351.1102430.
17. Kaggle. Otto Group Product Classification Challenge. 2015. https://www.kaggle.com/c/otto-group-product-classification-challenge/data. Accessed 05 June 2017.
18. LeCun Y, Cortes C, Burges CJC. MNIST handwritten digit database. AT&T Labs. 2010. http://yann.lecun.com/exdb/mnist.
19. Ahonen T, Hadid A, Pietikäinen M. Face recognition with local binary patterns. In: Computer Vision – ECCV 2004. 2004. p. 469–81. doi:10.1007/978-3-540-24670-1_36.
20. Gorman RP, Sejnowski TJ. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Netw. 1988;1(1):75–89.
21. Denker JS, Gardner W, Graf HP, Henderson D, Howard R, Hubbard WE, Jackel LD, Baird HS, Guyon I. Neural network recognizer for handwritten zip code digits. In: NIPS. 1988. p. 323–31.
22. Frey PW, Slate DJ. Letter recognition using Holland-style adaptive classifiers. Mach Learn. 1991;6(2):161–82.
23. Evett IW, Spiehler EJ. Rule induction in forensic science. In: Knowledge Based Systems. Halsted Press; 1988. p. 152–60. http://www.cs.ucl.ac.uk/staff/W.Langdon/ftp/papers/evett_1987_rifs.pdf.
24. Hertz JA, Krogh AS, Palmer RG, Weigend AS. Introduction to the Theory of Neural Computation. Artificial Intelligence. 1993;I(June):1–17.
25. Zhao X, Guan S, Man KL. An output grouping based approach to multiclass classification using support vector machines. In: Advanced Multimedia and Ubiquitous Engineering. 2016. p. 389–95. doi:10.1007/978-981-10-1536-6_51.
Acknowledgments
Not applicable.
Funding
This research is supported by Jiangsu Provincial Science and Technology under Grant No. BK20131182, China.
Availability of data and materials
All the data are available from online open data repositories:
1. iris [24]
https://archive.ics.uci.edu/ml/datasets/iris
2. hand written letter recognition [22]
https://archive.ics.uci.edu/ml/datasets/letter+recognition
3. glass identification [23]
https://archive.ics.uci.edu/ml/datasets/glass+identification
4. sonar: rock vs mine sensory readings [20]
http://archive.ics.uci.edu/ml/datasets/connectionist+bench+(sonar,+mines+vs.+rocks)
5. hand written digits [21]
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html
6. mnist digits recognition [18]
http://yann.lecun.com/exdb/mnist/
7. olivetti faces recognition [19]
http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
8. Otto Group Product Classification Challenge [17]
https://www.kaggle.com/c/otto-group-product-classification-challenge/data
Author information
Affiliations
Contributions
Conception and design of study: XZ, SSUG. Experiment: XZ. Drafting the manuscript: XZ. Revising the manuscript critically for important intellectual content: Xuan Zhao, Steven Sheng-Uei Guan. Both authors read and approved the final manuscript.
Corresponding author
Correspondence to Xuan Zhao.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Keywords
 Machine learning
 Feature transformation
 Feature selection
 Classification
 Subspace learning