 Review
 Open Access
A review on multitask metric learning
Peipei Yang^{1},
 Kaizhu Huang^{2} and
 Amir Hussain^{3}
 Received: 10 August 2017
 Accepted: 30 January 2018
 Published: 27 February 2018
Abstract
Distance metrics play an important role in machine learning and are crucial to the performance of a range of algorithms. Metric learning, which refers to learning a proper distance metric for a particular task, has attracted much attention in machine learning. In particular, multitask metric learning deals with the scenario where there are multiple related metric learning tasks. By jointly training these tasks, useful information is shared among them, which significantly improves their performance. This paper reviews the literature on multitask metric learning. Various methods are investigated systematically and categorized into four families. The central ideas of these methods are introduced in detail, followed by some representative applications. Finally, we conclude the review and propose a number of directions for future work.
Keywords
 Multitask learning
 Metric learning
 Review
Background
In the areas of machine learning, pattern recognition, and data mining, the concept of a distance metric usually plays an important role. For many algorithms, a proper distance metric is critical to their performance. For example, nearest neighbor classification relies on the metric to identify the nearest neighbors and determine their class, whilst k-means clustering uses the metric to determine which cluster a sample should belong to.
The metric is usually used as a measure of similarity or dissimilarity, and there are various types of predefined distance metrics, such as the Euclidean distance, cosine similarity, and Hamming distance. However, in practical applications, these general-purpose metrics are insufficient to capture the particular properties of diverse tasks. Therefore, researchers propose learning a metric from data for a particular task to improve algorithm performance. This is termed metric learning [1–7].
With the advent of data science, challenging and evolving problems have arisen. Obtaining training data is a costly process, hence complex models are being trained on small datasets, resulting in poor generalization. Alongside this, the number of tasks to be learnt has increased significantly. To overcome these problems, multitask learning was proposed [8–13]. It aims to consider multiple tasks simultaneously at a higher level, whilst transferring useful information among different tasks to improve their performance.
Since multitask learning was proposed by Caruana [8] in 1997, various strategies have been designed based on different assumptions. There are also some closely related topics, such as transfer learning [14, 15], domain adaptation [16], meta-learning [17], lifelong learning [18], learning to learn [19], etc. In spite of some minor discrepancies among them, they share the same basic idea: performance is improved by considering multiple learning tasks jointly and sharing information among tasks.
Under such a background, it is natural to consider the problem of multitask metric learning. However, most multitask learning algorithms designed for traditional models are difficult to apply to metric learning algorithms due to the obvious differences between the two kinds of models. To resolve this problem, a series of multitask metric learning approaches have been specifically designed for metric learning models. By properly coupling multiple metric learning tasks, their performances are effectively improved.
Metric learning has the particularity that its effect on performance can only be evaluated indirectly, through the algorithm relying on the metric. This requires a way of constructing the multitask learning framework different from that of traditional models. As far as we know, there is at present no review on multitask metric learning, hence this paper gives a general overview of the existing works.
The rest of the paper is organized as follows. First we provide an overview of the basic concepts of metric learning and briefly introduce multitask metric learning. Next, various strategies of multitask metric learning approaches are reviewed. We then introduce some representative applications of multitask metric learning, and conclude with a discussion on potential future issues.
Overview
In this section, we first provide an overview of metric learning, including its concept and several representative algorithms. Then a general description about multitask metric learning is presented, leaving the details of the algorithms for the next section.
A brief review on metric learning
The notion of a distance metric was originally a mathematical concept, referring to a function defined on \(\mathcal {X}\) as \(d:\mathcal {X}\times \mathcal {X}\rightarrow \mathbf {R}_{+}=[\!0,+\infty)\) satisfying positiveness, symmetry, and the triangle inequality [20]. In the machine learning community, a metric need not keep its original mathematical definition, but usually refers to a general measure of dissimilarity or similarity. Many machine learning algorithms, such as nearest neighbor classification and k-means clustering, use it to measure the dissimilarity between samples without explicitly referring to its definition.
There have been various types of predefined metrics for general purposes. For example, for two points in the d-dimensional space \(\mathbf {x}_{i},\mathbf {x}_{j}\in \mathcal {X}=\mathbb {R}^{d}\), the most frequently used Euclidean distance is defined as \(d(\mathbf{x}_{i},\mathbf{x}_{j})=\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}\). Another example is the Mahalanobis metric [21], defined as \(d_{\mathbf {M}}(\mathbf {x}_{i},\mathbf {x}_{j})=\sqrt {(\mathbf {x}_{i}-\mathbf {x}_{j})^{\top }\mathbf {M}(\mathbf {x}_{i}-\mathbf {x}_{j})}\), where the symmetric positive semidefinite matrix M is the Mahalanobis matrix which determines the metric.
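The Mahalanobis distance above can be computed directly from its definition. A minimal sketch (the function name is ours, not from the paper); note that with M equal to the identity it reduces to the Euclidean distance:

```python
import numpy as np

def mahalanobis_distance(x_i, x_j, M):
    """d_M(x_i, x_j) = sqrt((x_i - x_j)^T M (x_i - x_j)).

    M must be symmetric positive semidefinite for this to define a
    valid (pseudo-)metric.
    """
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(np.sqrt(diff @ M @ diff))

# With M = I, the Mahalanobis distance reduces to the Euclidean distance.
x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
d_euc = mahalanobis_distance(x, y, np.eye(2))  # -> 5.0
```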
In spite of their widespread usage, predefined metrics are incapable of capturing the variety of real applications. Considering their importance to the performance of algorithms, researchers propose to learn a metric from the data instead of using a predefined metric directly. By adapting the metric to the specific data of an algorithm, the performance is expected to be effectively improved. This is the central idea of metric learning.
However, it is hardly practical to learn a general metric by directly finding an optimum in function space. A practical way is to define a family of metrics determined by some parameters, and reduce the problem to solving for the optimal parameters. The Mahalanobis metric provides a perfect candidate for such a family: it has a simple formulation and is uniquely determined by the Mahalanobis matrix. In this case, metric learning is equivalent to learning the Mahalanobis matrix.
Xing et al. [1] proposed the idea of metric learning with the first algorithm in 2002. Since then, various metric learning methods have been proposed based on different strategies. Since metrics can be categorized into several families according to their properties, such as global vs. local, or linear vs. nonlinear, metric learning approaches can be categorized accordingly. The Mahalanobis metric is a typical global linear metric. Because existing multitask learning approaches are almost all based on global metrics, we focus on this type in this review, especially global linear metrics. Please refer to [22, 23] and their references for other types.
The side-information used to supervise metric learning is typically given in one of two forms:

Must-link / cannot-link constraints (positive/negative pairs):
$$ \begin{aligned} \mathcal{S}&=\{(\mathbf{x}_{i},\mathbf{x}_{j}): \mathbf{x}_{i}\ \text{and}\ \mathbf{x}_{j}\ \text{should be similar}\},\\ \mathcal{D}&=\{(\mathbf{x}_{i},\mathbf{x}_{j}): \mathbf{x}_{i}\ \text{and}\ \mathbf{x}_{j}\ \text{should be dissimilar}\}. \end{aligned} $$

Relative constraints (training triplets):
$$ \mathcal{R}=\{(\mathbf{x}_{i},\mathbf{x}_{j},\mathbf{x}_{k}): \mathbf{x}_{i}\ \text{should be more similar to}\ \mathbf{x}_{j}\ \text{than to}\ \mathbf{x}_{k}\}. $$
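In the common supervised setting, both kinds of side-information can be generated from class labels. A small sketch of such a construction (the helper names and the toy labels are ours, for illustration only):

```python
import numpy as np

def build_pairs(y):
    """Build must-link (S) and cannot-link (D) index pairs from labels."""
    S, D = [], []
    n = len(y)
    for i in range(n):
        for j in range(i + 1, n):
            (S if y[i] == y[j] else D).append((i, j))
    return S, D

def build_triplets(y):
    """Build relative constraints (i, j, k): x_i should be more similar
    to x_j (same class) than to x_k (different class)."""
    S, D = build_pairs(y)
    return [(i, j, k) for (i, j) in S for (a, k) in D if a == i]

y = np.array([0, 0, 1])
S, D = build_pairs(y)        # S = [(0, 1)], D = [(0, 2), (1, 2)]
R = build_triplets(y)        # R = [(0, 1, 2)]
```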
Using these constraints, we briefly introduce the strategies of some metric learning approaches. Xing's method [1] aims to maximize the sum of distances between dissimilar pairs while keeping the sum of squared distances between similar pairs small; it is an example of learning with positive/negative pairs. Large Margin Nearest Neighbors (LMNN) [2, 24] requires the k nearest neighbors to belong to the same class and pushes out all the impostors (instances of other classes existing in the neighborhood); its side-information is provided by relative constraints. Information-theoretic metric learning (ITML) [3], also built with positive/negative pairs, models the problem with a log-determinant regularizer. Sparse Metric Learning [6] uses the mixed L_{2,1} norm to achieve joint feature selection during metric learning, and Huang et al. [4, 5] propose a unified framework for Generalized Sparse Metric Learning (GSML). Robust Metric Learning (RML) [25] deals with noisy training constraints based on robust optimization.
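Many of these pairwise approaches can be optimized by projected gradient descent on the Mahalanobis matrix, projecting onto the positive semidefinite cone after each step. The following is a generic sketch of one such step (a simplified hinge-style loss of our own design, not any specific published objective):

```python
import numpy as np

def psd_project(M):
    """Project a symmetric matrix onto the PSD cone by clipping
    negative eigenvalues to zero."""
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.clip(w, 0.0, None)) @ V.T

def pairwise_loss_grad(X, S, D, M, margin=1.0):
    """Gradient of a hinge-style pairwise loss: shrink squared distances
    of similar pairs, push dissimilar pairs beyond `margin`."""
    G = np.zeros_like(M)
    for i, j in S:
        d = X[i] - X[j]
        G += np.outer(d, d)            # pulls similar pairs together
    for i, j in D:
        d = X[i] - X[j]
        if d @ M @ d < margin:         # hinge active: push apart
            G -= np.outer(d, d)
    return G

# One projected-gradient step on toy data.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
M = np.eye(2)
M = psd_project(M - 0.1 * pairwise_loss_grad(X, [(0, 1)], [(0, 2)], M))
```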
For nonlinear metrics, the metric is obtained by learning an appropriate nonlinear transformation f. Since deep learning has achieved remarkable successes in computer vision and machine learning [27], some researchers have recently proposed deep metric learning [28, 29]. These methods resort to deep neural networks to learn the nonlinear transformation, and differ from traditional neural networks in that their learning objectives are given by constraints on distances.
There are many metric learning methods because the metric plays an important role in many applications. We cannot introduce them all in detail due to space limitations; readers can refer to [22] for a systematic review on metric learning.
An overview of multitask metric learning
Since the concept of multitask learning was proposed by Caruana [8] in 1997, this topic has attracted much attention from researchers in machine learning. Many different methods have been proposed to construct frameworks for simultaneously learning multiple tasks with conventional models, such as linear classifiers or support vector machines. The performances of the original models are effectively improved by learning the tasks simultaneously.
However, these methods cannot be used directly for metric learning since there exist significant discrepancies between conventional learning models and metric learning models. Taking the popular support vector machine (SVM) [30] as an example of conventional models, we can show the differences between it and metric learning. First, the training data of the two models have different structures: for SVM, the training samples are given by points, each with a label, while for metric learning they are given by pairs or triplets, each with a label. Second, their models are of different types: the model of SVM is a single-input single-output function parameterized by a weight vector and a bias, while the model of metric learning is a double-input single-output function parameterized by a symmetric positive semidefinite matrix. Third, the algorithms affect the performance in different ways: for SVM, the classification accuracy is given by the algorithm directly, while for metric learning, the performance has to be evaluated indirectly by other algorithms working with the learned metric.
The existing multitask metric learning approaches can be categorized into four families according to their strategies:
1. Assume that the Mahalanobis matrix of each metric is composed of several components, and share some composition among tasks.
2. Predefine the relationship among tasks or learn such a relationship from data, and constrain the learning process with this relationship.
3. Use a common metric with proper regularization to couple the metrics.
4. Consider metric learning from the perspective of learning a transformation, and share some parts of the transformation.
Review on multitask metric learning approaches
In this section, we investigate the multitask metric learning approaches published to date and provide a detailed review of them. The methods are organized according to the type of strategy. We focus only on the models and algorithms in this section, leaving their application backgrounds for the next section. We also discuss the relations between some closely related methods.
Main features of multitask metric learning methods
| Name | Year | Multitask strategy | Solver | Dimension reduction | Side-information | Regularizer |
| --- | --- | --- | --- | --- | --- | --- |
| mtLMNN | 2010 | Shared composition | Projected gradient descent | No | Triplets | Frobenius norm |
| TML | 2010 | Task relationship learning | Alternating optimization | No | Pairs | Task covariance |
| mtMLCS | 2011 | Shared subspace | Gradient descent | Yes | Triplets | – |
| M^{2}SL | 2012 | Shared composition | Coordinate gradient descent | No | Pairs | Frobenius norm |
| GPmtML | 2012 | Geometry preserving | Alternating optimization | No | Triplets | Von Neumann divergence |
| mtSCML | 2014 | Shared sparse representation | Regularized dual averaging | Yes | Triplets | ℓ_{2}/ℓ_{1} norm |
| MtMCML | 2014 | Graph regularization | Alternating optimization | No | Pairs | Laplacian |
| TMTL | 2015 | Metric weighted sum | Direct calculation | No | Covariance | – |
| onlineSMDM | 2016 | Shared composition | Online projected gradient descent | No | Pairs | Frobenius norm |
| CPmtML | 2016 | Coupled projection | Stochastic gradient projection | Yes | Pairs | – |
| DMML | 2016 | Shared subnetwork | Subgradient descent | No | Pairs | – |
| HMML | 2017 | Shared composition | Not mentioned | No | Triplets | Trace norm |
| mtDCML | 2017 | Shared network | Gradient descent | No | Pairs | – |
Sharing composition of Mahalanobis matrices
Since the Mahalanobis metric is uniquely determined by the Mahalanobis matrix, a natural way to couple multiple related metrics is to share some composition of their Mahalanobis matrices. Specifically, the Mahalanobis matrix of each task is assumed to be composed of a common composition shared by all tasks and a task-specific composition preserving its particular properties. This strategy is the most popular way to construct a multitask metric learning model, and we introduce some representative methods below.
In (3), the side-information is incorporated by constraints generated from triplets, as in LMNN [2]. The regularization on the task-specific matrices M_{ t } represses the specialty of each task and encourages the shared part, while the regularization on M_{0} restricts the common part to be close to the identity. Together they keep the learnt metric of each task close to the Euclidean metric.
The hyperparameters γ_{t>0} control the balance between commonness and specialty, while γ_{0} controls the regularization of the common part. As γ_{t>0} increases, the task-specific parts become small and the learnt metrics of all tasks tend to be similar. When γ_{t>0}→∞, the algorithm learns a unique metric M_{0} for all tasks, while when γ_{t>0}→0, all tasks tend to be learnt individually. On the other hand, when γ_{0}→∞, the common part M_{0} becomes the identity and the Euclidean metric is obtained; when γ_{0}→0, there is no regularization on the common part. This model is convex and can be solved effectively.
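The shared-composition model and its regularizer can be written down concisely. A minimal sketch (function names are ours; the regularizer mirrors the mtLMNN-style form gamma_0‖M_0 − I‖_F² + Σ_t gamma_t‖M_t‖_F² described above):

```python
import numpy as np

def multitask_distance(x, y, M0, Mt):
    """Task-t distance under the shared-composition model M_t^total = M0 + Mt."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ (M0 + Mt) @ d))

def shared_composition_regularizer(M0, Ms, gamma0, gammas):
    """gamma0 * ||M0 - I||_F^2 + sum_t gamma_t * ||M_t||_F^2.
    Large gamma_t shrinks the task-specific parts toward zero, so all
    tasks approach the common metric M0; large gamma0 pulls M0 toward
    the identity (the Euclidean metric)."""
    d = M0.shape[0]
    r = gamma0 * np.linalg.norm(M0 - np.eye(d), 'fro') ** 2
    for Mt, g in zip(Ms, gammas):
        r += g * np.linalg.norm(Mt, 'fro') ** 2
    return float(r)
```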
This is the first attempt to apply a multitask approach to metric learning problem. It provides a simple yet effective way to improve the performance of metric learning by jointly learning multiple tasks. However, the idea of splitting each Mahalanobis matrix into a common part and an individual part is not easy to explain from the perspective of distance metric and can only deal with some simple cases.
The variable \(\delta _{t}^{ij}\) denotes the label of similar/dissimilar labeled pairs, and \(\sigma _{t}^{ij}\) is a predefined threshold for the hinge loss. The parameters b_{0} and b_{ t } represent weights for the shared and discriminative parts respectively, and the last term is the regularization on these weights.
Using this approach, the information contained in different tasks is shared among them and the multiple features are used in a more effective way. It uses the same strategy as mtLMNN to construct the multitask metric learning model and thus has similar advantages and disadvantages.
This method naturally introduces the idea of group sparsity to construct multitask metric learning, and the proposed approach is not difficult to implement. However, the algorithm requires the set of rank-one metrics to be pretrained, so they cannot be optimized jointly with the weights.
This method is simple enough that no optimization procedure is needed. Strictly speaking, however, it is not a typical metric learning method and can only deal with some special problems.
Online semi-supervised multitask distance metric learning (onlineSMDM) Li et al. [37] propose a semi-supervised metric learning approach that is capable of utilizing unlabeled data to learn the metric. The method is designed based on regularized distance metric learning [38] and extended to a multitask model called online semi-supervised multitask distance metric learning. It assumes each Mahalanobis matrix to be composed of a common part M_{0} and a task-specific part M_{ t }, as in [31], and proposes an online algorithm to solve the model effectively.
This method utilizes the unlabeled data by assigning them labels according to the original distances. The strategy of constructing the multitask model is the same as in the previous methods.
Hierarchical multitask metric learning (HMML) Zheng et al. [39] propose an approach to learn multiple sparse metrics hierarchically over a visual tree. In this work, a visual tree is first constructed to organize the categories in a coarse-to-fine fashion. Then a top-down approach is used to learn multiple metrics along the visual tree, where the model is expected to benefit from leveraging both the inter-node and the inter-level visual correlations.
where the parameters γ_{0} and γ_{ t }’s control the regularization on the common part and individual part respectively.
For non-root nodes at the mid-level of the visual tree, besides the inter-node correlations, the inter-level visual correlations between a parent node and its child nodes at the next level should also be exploited. Since all nodes on the same branch are similar, any node p characterizes the common visual properties of its child nodes. On the other hand, the task-specific metric M_{ p } for node p contains the task-specific composition. Thus, it is reasonable to utilize the task-specific metric of node p to help the learning of its child nodes. Based on this idea, the regularization β∥M_{0}−M_{ p }∥^{2} is added to the objective of (6) for non-root nodes, where M_{0} is the common metric shared among the child nodes under parent node p and M_{ p } is the task-specific metric for node p at the upper level.
This method introduces the hierarchical visual tree into multitask metric learning, which is used to guide the multitask learning and thus provides a more powerful capability of describing the relationship among tasks.
Task relationship learning and regularization
In that paper, the authors further propose a transfer metric learning method based on this model, by training the concatenated Mahalanobis matrix of only the target task while leaving the other matrices fixed as source tasks. The idea of learning the relationship between tasks is interesting, but the covariance between the vectorized Mahalanobis matrices is not easy to explain from the perspective of distance metric.
where \(\tilde {\mathbf {M}}_{i}=\text {vec}(\mathbf {M}_{i})\) converts the Mahalanobis matrix of the i-th task into a vector in a column-wise manner, DIA is a diagonal matrix with \(\mathbf {DIA}(i,i)=\sum _{j=1}^{T}{\mathbf {W}(i,j)}\), and thus the matrix L=DIA−W indeed defines the graph Laplacian matrix. The model can be optimized by an alternating method.
In this work, the authors empirically set the adjacency matrix as W(i,j)=1, which asserts that every pair of tasks is related. It is not difficult to prove that such a regularization is just a variant of the regularization of mtLMNN; therefore, the two methods are closely related in this special case.
This work naturally introduces graph regularization into multitask learning by applying a Laplacian to the vectorized Mahalanobis matrices. However, the relationship between two metrics remains vague, and the Laplacian matrix L is not easy to determine reasonably.
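The Laplacian regularizer over vectorized Mahalanobis matrices can be sketched in a few lines. Assuming the standard identity that tr(V L V^T) with V stacking the vec(M_i) as columns equals ½ Σ_{i,j} W(i,j)‖vec(M_i)−vec(M_j)‖² (the function name and toy matrices are ours):

```python
import numpy as np

def laplacian_regularizer(Ms, W):
    """Graph regularizer tr(V L V^T), V = [vec(M_1), ..., vec(M_T)],
    L = DIA - W with DIA(i,i) = sum_j W(i,j). Equals
    (1/2) * sum_{i,j} W(i,j) * ||vec(M_i) - vec(M_j)||^2, so related
    task metrics are pulled together."""
    V = np.stack([M.reshape(-1, order='F') for M in Ms], axis=1)
    L = np.diag(W.sum(axis=1)) - W
    return float(np.trace(V @ L @ V.T))

# With W(i,j) = 1 everywhere, every pair of tasks is coupled.
Ms = [np.eye(2), 2 * np.eye(2)]
W = np.ones((2, 2))
reg = laplacian_regularizer(Ms, W)  # -> 2.0
```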
Regularization with a common metric
In this framework, the loss L and constraints \(\mathcal {C}_{t}\) incorporate the side-information from training samples into the learning process, while the regularization D(M_{ t },M_{ c }) encourages the metric of each task to be similar to a common one M_{ c }, and D(M_{0},M_{ c }) further regularizes the common metric to be close to a predefined metric. When no further prior information is available, M_{0} is set to the identity I, defining a Euclidean metric.
The mtLMNN can easily be included as a special case of this framework by taking \(D(\mathbf{X},\mathbf{Y})=\|\mathbf{X}-\mathbf{Y}\|_{\mathrm{F}}^{2}\). The only difference lies in the constraints: the Mahalanobis matrix of the t-th task in mtLMNN is M_{0}+M_{ t }, where both parts are positive semidefinite, while the Mahalanobis matrix of the t-th task in (9) with the Frobenius norm is M_{ t }, and only this matrix is required to be positive semidefinite. The authors indicate that the latter actually provides a more reasonable model because the constraints in mtLMNN need not be so strict.
Through a series of theoretical analyses, this method is proved to encourage a higher degree of geometry preservation, and thus is more likely to keep the relative distances of samples consistent across different metrics. The introduced regularization is jointly convex, so the problem can be effectively solved by alternating optimization.
This is the first paper that attempts to construct multitask metric learning by sharing the supervised side-information among tasks. It provides a reasonable explanation from the perspective of metric learning. However, the macro-structure of the model is too simple and thus cannot describe more complex relationships among tasks.
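The feature table earlier lists the Von Neumann divergence as GPmtML's regularizer for coupling metrics. A minimal numerical sketch of that divergence, assuming full-rank symmetric positive-definite inputs (the function name and eigenvalue clipping are ours):

```python
import numpy as np

def von_neumann_divergence(X, Y, eps=1e-12):
    """Von Neumann divergence D(X || Y) = tr(X log X - X log Y - X + Y)
    for symmetric positive-definite matrices X, Y. It is zero iff X = Y,
    so using it as a regularizer pulls each task metric toward a common one."""
    def logm(A):
        # matrix logarithm via eigendecomposition of a symmetric matrix
        w, V = np.linalg.eigh(A)
        return (V * np.log(np.clip(w, eps, None))) @ V.T
    return float(np.trace(X @ logm(X) - X @ logm(Y) - X + Y))

X = np.diag([1.0, 2.0])
d_self = von_neumann_divergence(X, X)            # -> 0.0
d_other = von_neumann_divergence(np.eye(2), 2 * np.eye(2))  # > 0
```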
Sharing transformation
According to (1), learning a Mahalanobis distance is equivalent to learning a corresponding linear transformation. There are indeed some metric learning algorithms that aim to learn such a transformation directly, and this naturally provides a way to construct multitask metric learning by sharing some parts of the transformation.
Multitask metric learning based on common subspace (mtMLCS) Yang et al. [48] propose a multitask learning method based on the assumption of a common subspace. The idea is motivated by multitask feature learning [11], which learns a common sparse representation across multiple tasks. Based on the same assumption that all the tasks share a common low-dimensional subspace, the authors propose a multitask framework for metric learning by transformation.
To couple multiple tasks with a common low-dimensional subspace, the authors note that for any low-rank Mahalanobis matrix M, the corresponding linear transformation matrix L is of full row rank and has size r×d, where r=rank(M) is the dimension of the subspace. Applying a compact SVD to L gives L=UΛV^{⊤}, where V is a d×r matrix defining a projection onto the low-dimensional subspace, and UΛ defines a transformation within the subspace. This fact motivates a straightforward multitask strategy with a common subspace: share a common projection matrix V and learn an individual transformation \(\mathbf {R}_{t}\doteq \mathbf {U}_{t}\mathbf {\Lambda }_{t}\) for each task.
However, it is computationally complex to enforce an orthogonality constraint on V. On the other hand, orthogonality is not necessary for V to define a subspace: as long as the matrix has size r×d with r<d, it defines a subspace of dimensionality at most r, up to some full-rank transformation within the subspace. Therefore, a common matrix L_{0} of size r×d is used to realize the common projection instead of V^{⊤}, and the extra transformation can be absorbed into R_{ t }. The obtained multitask model learns a transformation L_{ t }=R_{ t }L_{0} for each task, where L_{0} defines the common subspace and R_{ t } defines the task-specific metric. This strategy is then incorporated into LMCA [49], a variant of LMNN [2] that learns the transformation.
This approach is simple to implement. Compared with approaches that learn metrics by learning Mahalanobis matrices, mtMLCS does not require the symmetric positive semidefinite constraint, and is thus much easier to optimize. However, the model is not convex, so the global optimum cannot be guaranteed.
Since this method learns the metric by optimizing the transformation L, it has merits and faults similar to those of mtMLCS. It is also designed for the simple case where the tasks are correlated by a common Mahalanobis matrix.
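The factored distance d_t(x, y) = ‖R_t L_0 (x − y)‖ is straightforward to compute. A toy sketch (the function name and the specific matrices are ours, chosen so the shared subspace is simply the first coordinate):

```python
import numpy as np

def mtmlcs_distance(x, y, R_t, L0):
    """Task-t distance with a shared projection L0 (r x d) and a
    task-specific transform R_t (r x r): d_t(x, y) = ||R_t L0 (x - y)||_2.
    This is a Mahalanobis metric with M_t = L0^T R_t^T R_t L0."""
    z = R_t @ L0 @ (np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.linalg.norm(z))

# Two tasks sharing a 1-D subspace of R^2 (hypothetical toy numbers).
L0 = np.array([[1.0, 0.0]])          # shared projection: first coordinate
R1, R2 = np.array([[1.0]]), np.array([[2.0]])
x, y = np.array([3.0, 5.0]), np.array([1.0, 9.0])
d1 = mtmlcs_distance(x, y, R1, L0)   # -> 2.0
d2 = mtmlcs_distance(x, y, R2, L0)   # -> 4.0
```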
Deep multitask metric learning (DMML) Soleimani et al. [51] propose a multitask version of deep metric learning. The method is built on discriminative deep metric learning (DDML) [29]. For any pair of points, DDML transforms the two points with a neural network, and the distance is then defined as the Euclidean distance between their transformations. Thus metric learning is performed by learning the parameters of the network.
This method is based on a simple yet effective idea in which part of the network weights are shared across multiple tasks. It is not difficult to implement by slightly modifying the original network architecture. However, only the first layer is shared across tasks in this model, which may not be the optimal choice, and it is not easy to determine how many layers should be shared.
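The shared-first-layer idea can be illustrated with a tiny two-layer embedding network. This is a numpy sketch of our own (random weights, tanh activations chosen for illustration), not the architecture from [51]:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(x, W_shared, W_task):
    """Two-layer embedding: W_shared is common to all tasks,
    W_task is task-specific (DMML-style weight sharing)."""
    h = np.tanh(W_shared @ x)      # shared first-layer representation
    return np.tanh(W_task @ h)     # task-specific second layer

def network_distance(x, y, W_shared, W_task):
    """Distance = Euclidean distance between network embeddings."""
    return float(np.linalg.norm(embed(x, W_shared, W_task)
                                - embed(y, W_shared, W_task)))

W_shared = rng.normal(size=(4, 3))                     # shared across tasks
W_tasks = [rng.normal(size=(2, 4)) for _ in range(2)]  # one per task
x, y = rng.normal(size=3), rng.normal(size=3)
d_task0 = network_distance(x, y, W_shared, W_tasks[0])
```

Training would update W_shared with gradients from all tasks, while each W_task receives gradients only from its own task's pair constraints.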
The strategy used in this paper to construct multitask metric learning is common in the multitask learning community. The model is flexible in its use of different auxiliary tasks. However, for some tasks it is difficult to choose a proper auxiliary task, and a bad auxiliary task may degrade performance.
Applications
Multitask metric learning has been widely used in a variety of practical applications, and we would like to introduce some representative works in this section.
Semantic categorization and social tagging with knowledge transfer among tasks Wang et al. [32] use their proposed multitask multi-feature similarity learning for large-scale visual applications. The metrics for visual categorization and automatic tagging are learned jointly within the framework, which brings several benefits. First, M^{2}SL learns a metric for each feature instead of concatenating the multiple features into one; this reduces the computational complexity from O(M^{2}d^{2}) to O(Md^{2}) and also the risk of overfitting. Second, the multitask framework is more flexible in exploring the intrinsic model sharing and feature weighting relations on image data with a large number of classes. Third, knowledge is transferred between semantic labels and social tagging information by the model, combining information from both sides for effective image understanding.
Offline signature verification Soleimani et al. [51] address the offline signature verification problem using deep multitask metric learning. For offline signature verification, there are writer-dependent (WD) approaches and writer-independent (WI) approaches, each with its particular advantages. The two are well integrated in this model, where the shared layer acts as a WI approach while the separated layers learn WD factors. In the experiments, DMML achieves better performance than other methods. For example, on the UTSig dataset with the HOG feature, DMML achieves an equal error rate (EER) of 17.45% while the SVM achieves an EER of 20.63%; with the DRT feature, DMML achieves an EER of 20.28% while the SVM achieves an EER of 27.64%.
Conclusion
In this paper, we have systematically reviewed multitask metric learning. Following a brief overview of metric learning, various multitask metric learning approaches were categorized into four families and introduced respectively. We reviewed their motivations, models, and algorithms, and also discussed and compared some closely related approaches. Finally, some representative applications of multitask metric learning were illustrated.
For future work, we suggest several potential issues for exploration. First, the theoretical analysis of multitask metric learning should be addressed. This has long been an important issue and has yielded multiple results [53–56], with most studies focusing on how multitask learning improves the generalization [57] of a conventional algorithm. However, as mentioned earlier, metric learning improves the performance of the algorithms that use the metric only indirectly, which makes these results difficult to apply to metric learning algorithms. There has also been some research [58–61] on the theoretical analysis of metric learning, but it is difficult to extend these results to the multitask setting. Whilst Yang et al. [44] have attempted to provide an intuitive explanation, the issue pertaining to multitask learning remains unresolved. Second, negative transfer among tasks should be avoided. Existing approaches are designed to couple multiple metrics without considering negative transfer, and are thus likely to deteriorate in performance when the tasks are not related. Third, most existing multitask metric learning approaches are designed for global linear metrics; they should be extended to more types of metric learning, including local and nonlinear metric learning. Finally, more applications of multitask metric learning are expected to be discovered.
Declarations
Funding
The paper was partially supported by National Natural Science Foundation of China (NSFC) under grant no.61403388, no.61473236, Natural science fund for colleges and universities in Jiangsu Province under grant no. 17KJD520010, Suzhou Science and Technology Program under grant no. SYG201712, SZS201613, Key Program Special Fund in XJTLU (KSFA01), and UK Engineering and Physical Sciences Research Council (EPSRC) grant numbers EP/I009310/1, EP/M026981/1.
Availability of data and materials
Data sharing not applicable to this article as no datasets were generated or analysed during the current study.
Authors’ contributions
PY conceived the whole structure of the idea and mainly drafted the manuscript. KH provided guidance for the whole manuscript and revised the draft. AH participated in the discussion and gave valuable suggestions on the idea. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
 Xing EP, Ng AY, Jordan MI, Russell SJ. Distance metric learning with application to clustering with side-information. In: Advances in Neural Information Processing Systems 15 (NIPS 2002), December 9-14, 2002, Vancouver, British Columbia, Canada. 2002. p. 505–12. http://papers.nips.cc/paper/2164-distance-metric-learning-with-application-to-clustering-with-side-information.
 Weinberger KQ, Saul LK. Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res. 2009; 10:207–44.
 Davis JV, Kulis B, Jain P, Sra S, Dhillon IS. Information-theoretic metric learning. In: Proceedings of the 24th International Conference on Machine Learning. 2007. p. 209–16.
 Huang K, Ying Y, Campbell C. GSML: A unified framework for sparse metric learning. In: Ninth IEEE International Conference on Data Mining. 2009. p. 189–98.
 Huang K, Ying Y, Campbell C. Generalized sparse metric learning with relative comparisons. Knowl Inf Syst. 2011; 28(1):25–45.
 Ying Y, Huang K, Campbell C. Sparse metric learning via smooth optimization. In: Bengio Y, Schuurmans D, Lafferty J, Williams CKI, Culotta A, editors. Advances in Neural Information Processing Systems 22. 2009. p. 2214–222.
 Ying Y, Li P. Distance metric learning with eigenvalue optimization. J Mach Learn Res. 2012; 13:1–26.
 Caruana R. Multitask learning. Mach Learn. 1997; 28(1):41–75.
 Evgeniou T, Pontil M. Regularized multi-task learning. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2004. p. 109–17.
 Argyriou A, Micchelli CA, Pontil M, Ying Y. A spectral regularization framework for multi-task structure learning. In: Advances in Neural Information Processing Systems 20. 2008. p. 25–32.
 Argyriou A, Evgeniou T. Convex multi-task feature learning. Mach Learn. 2008; 73(3):243–72.
 Zhang J, Ghahramani Z, Yang Y. Flexible latent variable models for multi-task learning. Mach Learn. 2008; 73(3):221–42.
 Zhang Y, Yeung DY. A convex formulation for learning task relationships in multi-task learning. In: Proceedings of the Twenty-Sixth Annual Conference on Uncertainty in Artificial Intelligence. 2010. p. 733–42.
 Pan SJ, Yang Q. A survey on transfer learning. IEEE Trans Knowl Data Eng. 2010; 22(10):1345–59.
 Dai W, Yang Q, Xue GR, Yu Y. Boosting for transfer learning. In: Proceedings of the 24th International Conference on Machine Learning, ICML ’07. New York: ACM; 2007. p. 193–200.
 Gopalan R, Li R, Chellappa R. Domain adaptation for object recognition: An unsupervised approach. In: Proceedings of IEEE International Conference on Computer Vision, ICCV 2011. 2011. p. 999–1006.
 Vilalta R, Drissi Y. A perspective view and survey of meta-learning. Artif Intell Rev. 2002; 18(2):77–95.
 Thrun S. Lifelong learning algorithms. In: Learning to Learn. USA: Springer; 1998. p. 181–209.
 Thrun S, Pratt L. Learning to Learn. USA: Springer; 2012.
 Burago D, Burago Y, Ivanov S. A Course in Metric Geometry. USA: American Mathematical Society; 2001. Chap. 1.1.
 Mahalanobis PC. On the generalised distance in statistics. In: Proceedings of the National Institute of Science, vol. 2. India; 1936. p. 49–55.
 Bellet A, Habrard A, Sebban M. A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709v4, 2014.
 Kulis B. Metric learning: A survey. Found Trends Mach Learn. 2013; 5(4):287–364.
 Weinberger KQ, Blitzer J, Saul L. Distance metric learning for large margin nearest neighbor classification. In: Advances in Neural Information Processing Systems 18. 2006.
 Huang K, Jin R, Xu Z, Liu CL. Robust metric learning by smooth optimization. In: The 26th Conference on Uncertainty in Artificial Intelligence. 2010. p. 244–51.
 Goldberger J, Roweis S, Hinton G, Salakhutdinov R. Neighbourhood components analysis. In: Advances in Neural Information Processing Systems. 2004. p. 513–20.
 Schmidhuber J. Deep learning in neural networks: An overview. Neural Netw. 2015; 61:85–117.
 Salakhutdinov R, Hinton G. Learning a nonlinear embedding by preserving class neighbourhood structure. In: Artificial Intelligence and Statistics. 2007. p. 412–9.
 Hu J, Lu J, Tan Y. Discriminative deep metric learning for face verification in the wild. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014. 2014. p. 1875–82.
 Vapnik VN. Statistical Learning Theory, 1st ed. USA: Wiley; 1998.
 Parameswaran S, Weinberger K. Large margin multi-task metric learning. In: Advances in Neural Information Processing Systems 23. 2010. p. 1867–75.
 Wang S, Jiang S, Huang Q, Tian Q. Multi-feature metric learning with knowledge transfer among semantics and social tagging. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012. 2012. p. 2240–7.
 Kwok JT, Tsang IW. Learning with idealized kernels. In: Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA. 2003. p. 400–7. http://www.aaai.org/Library/ICML/2003/icml03-054.php.
 Shi Y, Bellet A, Sha F. Sparse compositional metric learning. In: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27-31, 2014, Québec City, Québec, Canada. 2014. p. 2078–84. http://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/view/8224.
 Liu H, Zhang X, Wu P. Two-level multi-task metric learning with application to multi-classification. In: 2015 IEEE International Conference on Image Processing, ICIP 2015, Quebec City, QC, Canada, September 27-30, 2015. 2015. p. 2756–60.
 Köstinger M, Hirzer M, Wohlhart P, Roth PM, Bischof H. Large scale metric learning from equivalence constraints. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012. 2012. p. 2288–95.
 Li Y, Tao D. Online semi-supervised multi-task distance metric learning. In: IEEE International Conference on Data Mining Workshops, ICDM Workshops 2016, December 12-15, 2016, Barcelona, Spain. 2016. p. 474–9.
 Jin R, Wang S, Zhou Y. Regularized distance metric learning: Theory and algorithm. In: Advances in Neural Information Processing Systems, vol. 22. 2009. p. 862–70.
 Zheng Y, Fan J, Zhang J, Gao X. Hierarchical learning of multi-task sparse metrics for large-scale image classification. Pattern Recogn. 2017; 67:97–109.
 Zhang Y, Yeung DY. Transfer metric learning by learning task relationships. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2010.
 Zhang Y, Yeung DY. Transfer metric learning with semi-supervised extension. ACM Trans Intell Syst Technol (TIST). 2012; 3(3):54:1–54:28.
 Gupta AK, Nagar DK. Matrix Variate Distributions. Chapman & Hall/CRC Monographs and Surveys in Pure and Applied Mathematics, vol. 104. London: Chapman & Hall; 2000.
 Ma L, Yang X, Tao D. Person reidentification over camera networks using multitask distance metric learning. IEEE Trans Image Process. 2014; 23(8):3656–70.
 Yang P, Huang K, Liu CL. Geometry preserving multi-task metric learning. Mach Learn. 2013; 92(1):133–75.
 Yang P, Huang K, Liu CL. Geometry preserving multi-task metric learning. In: European Conference on Machine Learning and Knowledge Discovery in Databases, vol. 7523. 2012. p. 648–64.
 Dhillon IS, Tropp JA. Matrix nearness problems with Bregman divergences. SIAM J Matrix Anal Appl. 2008; 29:1120–46.
 Kulis B, Sustik MA, Dhillon IS. Low-rank kernel learning with Bregman matrix divergences. J Mach Learn Res. 2009; 10:341–76.
 Yang P, Huang K, Liu C. A multi-task framework for metric learning with common subspace. Neural Comput Applic. 2013; 22(7-8):1337–47.
 Torresani L, Lee K. Large margin component analysis. In: Advances in Neural Information Processing Systems 19 (NIPS 2006), Vancouver, British Columbia, Canada, December 4-7, 2006. 2006. p. 1385–92. http://papers.nips.cc/paper/3088-large-margin-component-analysis.
 Bhattarai B, Sharma G, Jurie F. CP-mtML: Coupled projection multi-task metric learning for large scale face retrieval. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. 2016. p. 4226–35.
 Soleimani A, Araabi BN, Fouladi K. Deep multitask metric learning for offline signature verification. Pattern Recogn Lett. 2016; 80:84–90.
 McLaughlin N, del Rincón JM, Miller PC. Person reidentification using deep convnets with multitask learning. IEEE Trans Circ Syst Video Technol. 2017; 27(3):525–39.
 Baxter J. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Mach Learn. 1997; 28(1):7–39.
 Baxter J. A model of inductive bias learning. J Artif Intell Res. 2000; 12:149–98.
 Blitzer J, Crammer K, Kulesza A, Pereira F, Wortman J. Learning bounds for domain adaptation. In: Advances in Neural Information Processing Systems 20 (NIPS 2007), Vancouver, British Columbia, Canada, December 3-6, 2007. 2007. p. 129–36. http://papers.nips.cc/paper/3212-learning-bounds-for-domain-adaptation.
 Ben-David S, Blitzer J, Crammer K, Kulesza A, Pereira F, Vaughan JW. A theory of learning from different domains. Mach Learn. 2010; 79(1-2):151–75.
 Bousquet O, Elisseeff A. Stability and generalization. J Mach Learn Res. 2002; 2:499–526.
 Balcan MF, Blum A, Srebro N. A theory of learning with similarity functions. Mach Learn. 2008; 72(1-2):89–112.
 Wang L, Sugiyama M, Yang C, Hatano K, Feng J. Theory and algorithm for learning with dissimilarity functions. Neural Comput. 2009; 21(5):1459–84.
 Perrot M, Habrard A. A theoretical analysis of metric hypothesis transfer learning. In: Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015. 2015. p. 1708–17. http://jmlr.org/proceedings/papers/v37/perrot15.html.
 Bellet A, Habrard A. Robustness and generalization for metric learning. Neurocomputing. 2015; 151:259–67.