### abstract ###
In multi-task learning several related tasks are considered simultaneously, with the hope that by an appropriate sharing of information across tasks, each task may benefit from the others
In the context of learning linear functions for supervised classification or regression, this can be achieved by including a priori information about the weight vectors associated with the tasks, and how they are expected to be related to each other
In this paper, we assume that tasks are clustered into groups, which are unknown beforehand, and that tasks within a group have similar weight vectors
We design a new spectral norm that encodes this a priori assumption, without the prior knowledge of the partition of tasks into groups, resulting in a new convex optimization formulation for multi-task learning
We show in simulations on synthetic examples and on the \textsc{iedb} MHC-I binding dataset, that our approach outperforms well-known convex methods for multi-task learning, as well as related non convex methods dedicated to the same problem
### introduction ###
Regularization has emerged as a dominant theme in machine learning and statistics, providing an intuitive and principled tool for learning from high-dimensional data
In particular, regularization by squared Euclidean norms or squared Hilbert norms has been thoroughly studied in various settings, leading to efficient practical algorithms based on linear algebra, and to very good theoretical understanding (see, eg ,  CITATION )
In recent years, regularization by non Hilbert norms, such as  SYMBOL  norms with  SYMBOL , has also generated considerable interest for the inference of linear functions in supervised classification or regression
Indeed, such norms can sometimes both make the problem statistically and numerically better-behaved, and impose various a priori knowledge on the problem
For example, the  SYMBOL -norm (the sum of absolute values) imposes some of the components to be equal to zero and is widely used to estimate sparse functions~ CITATION , while various combinations of  SYMBOL  norms can be defined to impose various sparsity patterns
While most recent work has focused on studying the properties of simple well-known norms, we take the opposite approach in this paper
That is, assuming a given prior knowledge, how can we design a norm that will enforce it
More precisely, we consider the problem of multi-task learning, which has recently emerged as a very promising research direction for various applications  CITATION
In multi-task learning several related inference tasks are considered simultaneously, with the hope that by an appropriate sharing of information across tasks, each one may benefit from the others
When linear functions are estimated, each task is associated with a weight vector, and a common strategy to design multi-task learning algorithm is to translate some prior hypothesis about how the tasks are related to each other into constraints on the different weight vectors
For example, such constraints are typically that the weight vectors of the different tasks belong (a) to a Euclidean ball centered at the origin~ CITATION , which implies no sharing of information between tasks apart from the size of the different vectors, i e , the amount of regularization, (b) to a ball of unknown center~ CITATION , which enforces a similarity between the different weight vectors, or (c) to an unknown low-dimensional subspace~ CITATION
In this paper, we consider a different prior hypothesis that we believe could be more relevant in some applications: the hypothesis that  the different tasks are in fact clustered into different groups, and that the weight vectors of tasks within a group are similar to each other
A key difference with  CITATION , where a similar hypothesis is studied, is that we don't assume that the groups are known a priori, and in a sense our goal is both to identify the clusters and to use them for multi-task learning
An important situation that motivates this hypothesis is the case where most of the tasks are indeed related to each other, but a few ``outlier'' tasks are very different, in which case it may be better to impose similarity or low-dimensional constraints only to a subset of the tasks (thus forming a cluster) rather than to all tasks
Another situation of interest is when one can expect a natural organization of the tasks into clusters, such as when one wants to model the preferences of customers and believes that there are a few general types of customers with similar preferences within each type, although one does not know beforehand which customers belong to which types
Besides an improved performance if the hypothesis turns out to be correct, we also expect this approach to be able to identify the cluster structure among the tasks as a by-product of the inference step, eg , to identify outliers or groups of customers, which can be of interest for further understanding of the structure of the problem
In order to translate this hypothesis into a working algorithm, we follow the general strategy mentioned above which is to design a norm or a penalty over the set of weights which can be used as regularization in classical inference algorithms
We construct such a penalty by first assuming that the partition of the tasks into clusters is known, similarly to~ CITATION
We then attempt to optimize the objective function of the inference algorithm over the set of partitions, a strategy that has proved useful in other contexts such as multiple kernel learning~ CITATION
This optimization problem over the set of partitions being computationally challenging, we propose a convex relaxation of the problem which results in an efficient algorithm
