{"title": "Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization", "book": "Advances in Neural Information Processing Systems", "page_first": 28, "page_last": 36, "abstract": "We address the problem of learning classifiers when observations have multiple views, some of which may not be observed for all examples. We assume the existence of view generating functions which may complete the missing views in an approximate way. This situation corresponds for example to learning text classifiers from multilingual collections where documents are not available in all languages. In that case, Machine Translation (MT) systems may be used to translate each document in the missing languages. We derive a generalization error bound for classifiers learned on examples with multiple artificially created views. Our result uncovers a trade-off between the size of the training set, the number of views, and the quality of the view generating functions. As a consequence, we identify situations where it is more interesting to use multiple views for learning instead of classical single view learning. An extension of this framework is a natural way to leverage unlabeled multi-view data in semi-supervised learning. Experimental results on a subset of the Reuters RCV1/RCV2 collections support our findings by showing that additional views obtained from MT may significantly improve the classification performance in the cases identified by our trade-off.", "full_text": "Learning from Multiple Partially Observed Views \u2013\nan Application to Multilingual Text Categorization\n\nMassih R. 
Amini\nInteractive Language Technologies Group\nNational Research Council Canada\nMassih-Reza.Amini@cnrc-nrc.gc.ca\n\nNicolas Usunier\nLaboratoire d\u2019Informatique de Paris 6\nUniversit\u00e9 Pierre et Marie Curie, France\nNicolas.Usunier@lip6.fr\n\nCyril Goutte\nInteractive Language Technologies Group\nNational Research Council Canada\nCyril.Goutte@cnrc-nrc.gc.ca\n\nAbstract\n\nWe address the problem of learning classifiers when observations have multiple views, some of which may not be observed for all examples. We assume the existence of view generating functions which may complete the missing views in an approximate way. This situation corresponds for example to learning text classifiers from multilingual collections where documents are not available in all languages. In that case, Machine Translation (MT) systems may be used to translate each document in the missing languages. We derive a generalization error bound for classifiers learned on examples with multiple artificially created views. Our result uncovers a trade-off between the size of the training set, the number of views, and the quality of the view generating functions. As a consequence, we identify situations where it is more interesting to use multiple views for learning instead of classical single view learning. An extension of this framework is a natural way to leverage unlabeled multi-view data in semi-supervised learning. Experimental results on a subset of the Reuters RCV1/RCV2 collections support our findings by showing that additional views obtained from MT may significantly improve the classification performance in the cases identified by our trade-off.\n\n1 Introduction\n\nWe study the learning ability of classifiers trained on examples generated from different sources, but where some observations are partially missing. 
This problem occurs for example in non-parallel multilingual document collections, where documents may be available in different languages, but each document in a given language may not be translated into all (or any) of the other languages.\n\nOur framework assumes the existence of view generating functions which may approximate missing examples using the observed ones. In the case of multilingual corpora these view generating functions may be Machine Translation systems which for each document in one language produce its translations in all other languages. Compared to other multi-source learning techniques [6], we address a different problem here by transforming our initial problem of learning from partially observed examples obtained from multiple sources into a classical multi-view learning problem. The contributions of this paper are twofold. We first introduce a supervised learning framework in which we define different multi-view learning tasks. Our main result is a generalization error bound for classifiers trained over multi-view observations. From this result we derive a trade-off between the number of training examples, the number of views and the ability of view generating functions to produce accurate additional views. This trade-off helps us identify situations in which artificially generated views may lead to substantial performance gains. We then show how the agreement of classifiers over their class predictions on unlabeled training data may lead to a much tighter trade-off. Experiments are carried out on a large part of the Reuters RCV1/RCV2 collections, freely available from Reuters, using 5 well-represented languages for text classification. 
Our results show that our approach yields improved classification performance in both the supervised and semi-supervised settings.\n\nIn the following two sections, we first define our framework, then the learning tasks we address. Section 4 describes our trade-off bound in the Empirical Risk Minimization (ERM) setting, and shows how and when the additional, artificially generated views may yield a better generalization performance in a supervised setting. Section 5 shows how to exploit these results when additional unlabeled training data are available, in order to obtain a more accurate trade-off. Finally, section 6 describes experimental results that support this approach.\n\n2 Framework and Definitions\n\nIn this section, we introduce basic definitions and the learning objectives that we address in our setting of artificially generated representations.\n\n2.1 Observed and Generated Views\n\nA multi-view observation is a sequence $x \\stackrel{\\mathrm{def}}{=} (x^1, ..., x^V)$, where different views $x^v$ provide a representation of the same object in different sets $\\mathcal{X}_v$. A typical example is given in [3] where each Web-page is represented either by its textual content (first view) or by the anchor texts which point to it (second view). In the setting of multilingual classification, each view is the textual representation of a document written in a given language (e.g. English, German, French).\n\nWe consider binary classification problems where, given a multi-view observation, some of the views are not observed (we obviously require that at least one view is observed). This happens, for instance, when documents may be available in different languages, yet a given document may only be available in a single language. Formally, our observations $x$ belong to the input set $\\mathcal{X} \\stackrel{\\mathrm{def}}{=} (\\mathcal{X}_1 \\cup \\{\\perp\\}) \\times ... \\times (\\mathcal{X}_V \\cup \\{\\perp\\})$, where $x^v = \\perp$ means that the $v$-th view is not observed. In binary classification, we assume that examples are pairs $(x, y)$, with $y \\in \\mathcal{Y} \\stackrel{\\mathrm{def}}{=} \\{0, 1\\}$, drawn according to a fixed, but unknown distribution $\\mathcal{D}$ over $\\mathcal{X} \\times \\mathcal{Y}$, such that $P_{(x,y) \\sim \\mathcal{D}}(\\forall v : x^v = \\perp) = 0$ (at least one view is available). In multilingual text classification, a parallel corpus is a dataset where all views are always observed (i.e. $P_{(x,y) \\sim \\mathcal{D}}(\\exists v : x^v = \\perp) = 0$), while a comparable corpus is a dataset where only one view is available for each example (i.e. $P_{(x,y) \\sim \\mathcal{D}}(|\\{v : x^v \\neq \\perp\\}| \\neq 1) = 0$).\n\nFor a given observation $x$, the views $v$ such that $x^v \\neq \\perp$ will be called the observed views. The originality of our setting is that we assume that we have view generating functions $\\Psi_{v \\to v'} : \\mathcal{X}_v \\to \\mathcal{X}_{v'}$ which take as input a given view $x^v$ and output an element of $\\mathcal{X}_{v'}$, that we assume is close to what $x^{v'}$ would be if it was observed. In our multilingual text classification example, the view generating functions are Machine Translation systems. These generating functions can then be used to create surrogate observations, such that all views are available. For a given partially observed $x$, the completed observation $\\underline{x}$ is obtained as:\n\n$$\\forall v, \\quad \\underline{x}^v = \\begin{cases} x^v & \\text{if } x^v \\neq \\perp \\\\ \\Psi_{v' \\to v}(x^{v'}) & \\text{otherwise, where } v' \\text{ is such that } x^{v'} \\neq \\perp \\end{cases} \\qquad (1)$$\n\nIn this paper, we focus on the case where only one view is observed for each example. This setting corresponds to the problem of learning from comparable corpora, which will be the focus of our experiments. Our study extends to the situation where two or more views may be observed in a straightforward manner. 
Our setting differs from previous multi-view learning studies [5] mainly in the straightforward generalization to more than two views and the use of view generating functions to induce the missing views from the observed ones.\n\n2.2 Learning objective\n\nThe learning task we address is to find, in some predefined classifier set $\\mathcal{C}$, the stochastic classifier $c$ that minimizes the classification error on multi-view examples (with, potentially, unobserved views) drawn according to some distribution $\\mathcal{D}$ as described above. Following the standard multi-view framework, in which all views are observed [3, 13], we assume that we are given $V$ deterministic classifier sets $(\\mathcal{H}_v)_{v=1}^V$, each working on one specific view (footnote 1). That is, for each view $v$, $\\mathcal{H}_v$ is a set of functions $h_v : \\mathcal{X}_v \\to \\{0, 1\\}$. The final set of classifiers $\\mathcal{C}$ contains stochastic classifiers, whose output only depends on the outputs of the view-specific classifiers. That is, associated to a set of classifiers $\\mathcal{C}$, there is a function $\\Phi_{\\mathcal{C}} : (\\mathcal{H}_v)_{v=1}^V \\times \\mathcal{X} \\to [0, 1]$ such that:\n\n$$\\mathcal{C} = \\{x \\mapsto \\Phi_{\\mathcal{C}}(h_1, ..., h_V, x) \\mid \\forall v, h_v \\in \\mathcal{H}_v\\}$$\n\nFor simplicity, in the rest of the paper, when the context is clear, the function $x \\mapsto \\Phi_{\\mathcal{C}}(h_1, ..., h_V, x)$ will be denoted by $c_{h_1,...,h_V}$. The overall objective of learning is therefore to find $c \\in \\mathcal{C}$ with low generalization error, defined as:\n\n$$\\epsilon(c) = E_{(x,y) \\sim \\mathcal{D}} \\, e(c, (x, y)) \\qquad (2)$$\n\nwhere $e$ is a pointwise error, for instance the 0/1 loss: $e(c, (x, y)) = c(x)(1 - y) + (1 - c(x))y$.\n\nIn the following sections, we address this learning task in our framework in terms of supervised and semi-supervised learning.\n\n3 Supervised Learning Tasks\n\nWe first focus on the supervised learning case. We assume that we have a training set $S$ of $m$ examples drawn i.i.d. according to a distribution $\\mathcal{D}$, as presented in the previous section. 
Depending on how the generated views are used at both training and test stages, we consider the following learning scenarios:\n\n- Baseline: This setting corresponds to the case where each view-specific classifier is trained using the corresponding observed view on the training set, and prediction for a test example is done using the view-specific classifier corresponding to the observed view:\n\n$$\\forall v, \\quad h_v \\in \\arg\\min_{h \\in \\mathcal{H}_v} \\sum_{(x,y) \\in S :\\, x^v \\neq \\perp} e(h, (x^v, y)) \\qquad (3)$$\n\nIn this case we pose $\\forall x, c^b_{h_1,...,h_V}(x) = h_v(x^v)$, where $v$ is the observed view for $x$. Notice that this is the most basic way of learning a text classifier from a comparable corpus.\n\n- Generated Views as Additional Training Data: The most natural way to use the generated views for learning is to use them as additional training material for the view-specific classifiers:\n\n$$\\forall v, \\quad h_v \\in \\arg\\min_{h \\in \\mathcal{H}_v} \\sum_{(x,y) \\in S} e(h, (\\underline{x}^v, y)) \\qquad (4)$$\n\nwith $\\underline{x}$ defined by eq. (1). Prediction is still done using the view-specific classifiers corresponding to the observed view, i.e. $\\forall x, c^b_{h_1,...,h_V}(x) = h_v(x^v)$. Although the test set distribution is a subdomain of the training set distribution [2], this mismatch is (hopefully) compensated by the addition of new examples.\n\n- Multi-view Gibbs Classifier: In order to avoid the potential bias introduced by the use of generated views only during training, we consider them also during testing. This becomes a standard multi-view setting, where generated views are used exactly as if they were observed. The view-specific classifiers are trained exactly as above (eq. 
4), but the prediction is carried out with respect to the probability distribution of classes, by estimating the probability of class membership in class 1 from the mean of the predictions of the view-specific classifiers:\n\n$$\\forall x, \\quad c^{mg}_{h_1,...,h_V}(x) = \\frac{1}{V} \\sum_{v=1}^V h_v(\\underline{x}^v) \\qquad (5)$$\n\n- Multi-view Majority Voting: With view generating functions involved in training and test, a natural way to obtain a (generally) deterministic classifier with improved performance is to take the majority vote associated with the Gibbs classifier. The view-specific classifiers are again trained as in eq. 4, but the final prediction is done using a majority vote:\n\n$$\\forall x, \\quad c^{mv}_{h_1,...,h_V}(x) = \\begin{cases} \\frac{1}{2} & \\text{if } \\sum_{v=1}^V h_v(\\underline{x}^v) = \\frac{V}{2} \\\\ I\\left(\\sum_{v=1}^V h_v(\\underline{x}^v) > \\frac{V}{2}\\right) & \\text{otherwise} \\end{cases} \\qquad (6)$$\n\nwhere $I(.)$ is the indicator function. The classifier outputs either the majority voted class, or one of the two classes with probability 1/2 in case of a tie.\n\nFootnote 1: We assume deterministic view-specific classifiers for simplicity and with no loss of generality.\n\n4 The trade-offs with the ERM principle\n\nWe now analyze how the generated views can improve generalization performance. Essentially, the trade-off is that generated views offer additional training material, therefore potentially helping learning, but can also be of lower quality, which may degrade learning.\n\nThe following theorem sheds light on this trade-off by providing bounds on the baseline vs. multi-view strategies. Note that such trade-offs have already been studied in the literature, although in different settings (see e.g. [2, 4]). The notion of function class capacity used here is the empirical Rademacher complexity [1]. 
Proof is given in the supplementary material.\n\nTheorem 1 Let $\\mathcal{D}$ be a distribution over $\\mathcal{X} \\times \\mathcal{Y}$, satisfying $P_{(x,y) \\sim \\mathcal{D}}(|\\{v : x^v \\neq \\perp\\}| \\neq 1) = 0$. Let $S = ((x_i, y_i))_{i=1}^m$ be a dataset of $m$ examples drawn i.i.d. according to $\\mathcal{D}$. Let $e$ be the 0/1 loss, and let $(\\mathcal{H}_v)_{v=1}^V$ be the view-specific deterministic classifier sets. For each view $v$, denote $e \\circ \\mathcal{H}_v \\stackrel{\\mathrm{def}}{=} \\{(x^v, y) \\mapsto e(h, (x^v, y)) \\mid h \\in \\mathcal{H}_v\\}$, and denote, for any sequence $S^v \\in (\\mathcal{X}_v \\times \\mathcal{Y})^{m_v}$ of size $m_v$, by $\\hat{R}_{m_v}(e \\circ \\mathcal{H}_v, S^v)$ the empirical Rademacher complexity of $e \\circ \\mathcal{H}_v$ on $S^v$. Then, we have:\n\nBaseline setting: for all $1 > \\delta > 0$, with probability at least $1 - \\delta$ over $S$:\n\n$$\\epsilon(c^b_{h_1,...,h_V}) \\leq \\inf_{h'_v \\in \\mathcal{H}_v} \\left[ \\epsilon(c^b_{h'_1,...,h'_V}) \\right] + 2 \\sum_{v=1}^V \\frac{m_v}{m} \\hat{R}_{m_v}(e \\circ \\mathcal{H}_v, S^v) + 6 \\sqrt{\\frac{\\ln(2/\\delta)}{2m}}$$\n\nwhere, for all $v$, $S^v \\stackrel{\\mathrm{def}}{=} \\{(x_i^v, y_i) \\mid i = 1..m \\text{ and } x_i^v \\neq \\perp\\}$, $m_v = |S^v|$ and $h_v \\in \\mathcal{H}_v$ is the classifier minimizing the empirical risk on $S^v$.\n\nMulti-view Gibbs classification setting: for all $1 > \\delta > 0$, with probability at least $1 - \\delta$ over $S$:\n\n$$\\epsilon(c^{mg}_{h_1,...,h_V}) \\leq \\inf_{h'_v \\in \\mathcal{H}_v} \\left[ \\epsilon(c^b_{h'_1,...,h'_V}) \\right] + \\frac{2}{V} \\sum_{v=1}^V \\hat{R}_m(e \\circ \\mathcal{H}_v, S^v) + 6 \\sqrt{\\frac{\\ln(2/\\delta)}{2m}} + \\eta$$\n\nwhere, for all $v$, $S^v \\stackrel{\\mathrm{def}}{=} \\{(\\underline{x}_i^v, y_i) \\mid i = 1..m\\}$, $h_v \\in \\mathcal{H}_v$ is the classifier minimizing the empirical risk on $S^v$, and\n\n$$\\eta = \\inf_{h'_v \\in \\mathcal{H}_v} \\left[ \\epsilon(c^{mg}_{h'_1,...,h'_V}) \\right] - \\inf_{h'_v \\in \\mathcal{H}_v} \\left[ \\epsilon(c^b_{h'_1,...,h'_V}) \\right] \\qquad (7)$$\n\nThis theorem gives us a rule for whether it is preferable to learn only with the observed views (the baseline setting) or preferable to use the view-generating functions in the multi-view Gibbs classification setting: we should use the former when $2 \\sum_v \\frac{m_v}{m} \\hat{R}_{m_v}(e \\circ \\mathcal{H}_v, S^v) < \\frac{2}{V} \\sum_v \\hat{R}_m(e \\circ \\mathcal{H}_v, S^v) + \\eta$, and the latter otherwise.\n\nLet us first explain the role of $\\eta$ (Eq. 7). The difference between the two settings is in the train and test distributions for the view-specific classifiers. $\\eta$ compares the best achievable error for each of the distributions. $\\inf_{h'_v \\in \\mathcal{H}_v} [\\epsilon(c^b_{h'_1,...,h'_V})]$ is the best achievable error in the baseline setting (i.e. without generated views); with the automatically generated views, the best achievable error becomes $\\inf_{h'_v \\in \\mathcal{H}_v} [\\epsilon(c^{mg}_{h'_1,...,h'_V})]$. Therefore $\\eta$ measures the loss incurred by using the view generating functions. In a favorable situation, the quality of the generating functions will be sufficient to make $\\eta$ small.\n\nThe terms depending on the complexity of the class of functions may be better explained using orders of magnitude. Typically, the Rademacher complexity for a sample of size $n$ is of order $O(1/\\sqrt{n})$ [1]. Assume, for simplicity, that all empirical Rademacher complexities in Theorem 1 are approximately equal to $d/\\sqrt{n}$, where $n$ is the size of the sample on which they are computed, and that $m_v = m/V$ for all $v$. Under these assumptions, the complexity term of the baseline bound is of order $d\\sqrt{V/m}$, while that of the multi-view Gibbs bound is of order $d/\\sqrt{m}$. The trade-off becomes:\n\nChoose the Multi-view Gibbs classification setting when: $d \\left( \\sqrt{\\frac{V}{m}} - \\frac{1}{\\sqrt{m}} \\right) > \\eta$\n\nThis means that we expect important performance gains when the number of examples is small, the generated views are of sufficiently high quality for the given classification task, and/or there are many views available. 
Note that our theoretical framework does not take the quality of the MT system into account in a standard way: in our setup, a good translation system is (roughly) one which generates bag-of-words representations that allow classes to be discriminated correctly.\n\nMajority voting One advantage of the multi-view setting at prediction time is that we can use a majority voting scheme, as described in Section 3. In such a case, we expect that $\\epsilon(c^{mv}_{h'_1,...,h'_V}) \\leq \\epsilon(c^{mg}_{h'_1,...,h'_V})$ if the view-specific classifiers are not correlated in their errors. This cannot be guaranteed in general, though, since we cannot prove anything better than $\\epsilon(c^{mv}_{h'_1,...,h'_V}) \\leq 2\\epsilon(c^{mg}_{h'_1,...,h'_V})$ (see e.g. [9]).\n\n5 Agreement-Based Semi-Supervised Learning\n\nOne advantage of the multi-view settings described in the previous section is that unlabeled training examples may naturally be taken into account in a semi-supervised learning scheme, using existing approaches for multi-view learning (e.g. [3]).\n\nIn this section, we describe how, under the framework of [11], the supervised learning trade-off presented above can be improved using extra unlabeled examples. This framework is based on the notion of disagreement between the various view-specific classifiers, defined as the expected variance of their outputs:\n\n$$\\mathcal{V}(h_1, ..., h_V) \\stackrel{\\mathrm{def}}{=} E_{(x,y) \\sim \\mathcal{D}} \\left[ \\frac{1}{V} \\sum_v h_v(\\underline{x}^v)^2 - \\left( \\frac{1}{V} \\sum_v h_v(\\underline{x}^v) \\right)^2 \\right] \\qquad (8)$$\n\nThe overall idea is that a set of good view-specific classifiers should agree on their predictions, making the expected variance small. This notion of disagreement has two key advantages. First, it does not depend on the true class labels, making its estimation easy over a large, unlabeled training set. 
The second advantage is that if, during training, it turns out that the view-specific classifiers have a disagreement of at most $\\mu$ on the unlabeled set, the set of possible view-specific classifiers that needs to be considered in the supervised learning stage is reduced to:\n\n$$\\mathcal{H}^*_v(\\mu) \\stackrel{\\mathrm{def}}{=} \\{h'_v \\in \\mathcal{H}_v \\mid \\forall v' \\neq v, \\exists h'_{v'} \\in \\mathcal{H}_{v'}, \\mathcal{V}(h'_1, ..., h'_V) \\leq \\mu\\}$$\n\nThus, the more the various view-specific classifiers tend to agree, the smaller the possible set of functions will be. This suggests a simple way to do semi-supervised learning: the unlabeled data can be used to choose, among the classifiers minimizing the empirical risk on the labeled training set, those with best generalization performance (by choosing the classifiers with highest agreement on the unlabeled set). This is particularly interesting when the number of labeled examples is small, as the train error is usually close to 0.\n\nTheorem 3 of [11] provides a theoretical value $B(\\epsilon, \\delta)$ for the minimum number of unlabeled examples required to estimate Eq. 8 with precision $\\epsilon$ and probability $1 - \\delta$ (this bound depends on $\\{\\mathcal{H}_v\\}_{v=1..V}$). The following result gives a tighter bound on the generalization error of the multi-view Gibbs classifier when unlabeled data are available. The proof is similar to Theorem 4 in [11].\n\nProposition 2 Let $0 \\leq \\mu \\leq 1$ and $0 < \\delta < 1$. 
Under the conditions and notations of Theorem 1, assume furthermore that we have access to $u \\geq B(\\mu/2, \\delta/2)$ unlabeled examples drawn i.i.d. according to the marginal distribution of $\\mathcal{D}$ on $\\mathcal{X}$. Then, with probability at least $1 - \\delta$, if the empirical risk minimizers $h_v \\in \\arg\\min_{h \\in \\mathcal{H}_v} \\sum_{(x^v, y) \\in S^v} e(h, (x^v, y))$ have a disagreement less than $\\mu/2$ on the unlabeled set, we have:\n\n$$\\epsilon(c^{mg}_{h_1,...,h_V}) \\leq \\inf_{h'_v \\in \\mathcal{H}_v} \\left[ \\epsilon(c^b_{h'_1,...,h'_V}) \\right] + \\frac{2}{V} \\sum_{v=1}^V \\hat{R}_m(e \\circ \\mathcal{H}^*_v(\\mu), S^v) + 6 \\sqrt{\\frac{\\ln(4/\\delta)}{2m}} + \\eta$$\n\nWe can now rewrite the trade-off between the baseline setting and the multi-view Gibbs classifier, taking semi-supervised learning into account. Using orders of magnitude, and assuming that for each view, $\\hat{R}_m(e \\circ \\mathcal{H}^*_v(\\mu), S^v)$ is $O(d_u/\\sqrt{m})$, with the proportional factor $d_u \\ll d$, the trade-off becomes:\n\nChoose the multi-view Gibbs classification setting when: $d\\sqrt{V/m} - d_u/\\sqrt{m} > \\eta$.\n\nThus, the improvement is even more important than in the supervised setting. Also note that the more views we have, the greater the reduction in classifier set complexity should be.\n\nNotice that this semi-supervised learning principle enforces agreement between the view-specific classifiers. In the extreme case where they almost always give the same output, majority voting is then nearly equivalent to the Gibbs classifier (when all voters agree, any vote is equal to the majority vote). We therefore expect the majority vote and the Gibbs classifier to yield similar performance in the semi-supervised setting.\n\n6 Experimental Results\n\nIn our experiments, we address the problem of learning document classifiers from a comparable corpus. We build the comparable corpus by sampling parts of the Reuters RCV1 and RCV2 collections [12, 14]. 
We used newswire articles written in 5 languages: English, French, German, Italian and Spanish. We focused on 6 relatively populous classes: C15, CCAT, E21, ECAT, GCAT, M11.\n\nFor each language and each class, we sampled up to 5000 documents from the RCV1 (for English) or RCV2 (for other languages). Documents belonging to more than one of our 6 classes were assigned the label of their smallest class. This resulted in 12-30K documents per language, and 11-34K documents per class (see Table 1). In addition, we reserved a test split containing 20% of the documents (respecting class and language proportions). For each document, we indexed the text appearing in the title (headline tag), and the body (body tags) of each article. As preprocessing, we lowercased, mapped digits to a single digit token, and removed non-alphanumeric tokens. We also filtered out function words using a stop-list, as well as tokens occurring in fewer than 5 documents.\n\nDocuments were then represented as a bag of words, using a TFIDF-based weighting scheme. The final vocabulary size for each language is given in table 1. The artificial views were produced using PORTAGE, a statistical machine translation system developed at NRC [15]. Each document from the comparable corpus was thus translated to the other 4 languages (footnote 2).\n\nTable 1: Distribution of documents over languages and classes in the comparable corpus.\n\nLanguage  # docs    (%)    # tokens\nEnglish   18,758    16.78  21,531\nFrench    26,648    23.45  24,893\nGerman    29,953    26.80  34,279\nItalian   24,039    21.51  15,506\nSpanish   12,342    11.46  11,547\nTotal     111,740\n\nClass  Size (all lang.)  (%)\nC15    18,816            16.84\nCCAT   21,426            19.17\nE21    13,701            12.26\nECAT   19,198            17.18\nGCAT   19,178            17.16\nM11    19,421            17.39\n\nFor each class, we set up a binary classification task by using all documents from that class as positive examples, and all others as negative. 
We first present experimental results obtained in supervised learning, using various amounts of labeled examples. We rely on linear SVM models as base classifiers, using the SVM-Perf package [8]. For comparisons, we employed the four learning strategies described in section 3: (1) the single-view baseline svb (Eq. 3), (2) generated views as additional training data gvb (Eq. 4), (3) multi-view Gibbs mvg (Eq. 5), and (4) multi-view majority voting mvm (Eq. 6). Recall that the second setting, gvb, is the most straightforward way to train and test classifiers when additional examples are available (or generated) from different sources. It can thus be seen as a baseline approach, as opposed to the last two strategies (mvg and mvm), where view-specific classifiers are both trained and tested over both original and translated documents. Note also that in our case (V = 5 views), additional training examples obtained from machine translation represent 4 times as many labeled examples as the original texts used to train the baseline svb. All test results were averaged over 10 randomly sampled training sets.\n\nTable 2: Test classification accuracy and F1 in the supervised setting, for both baselines (svb, gvb), Gibbs (mvg) and majority voting (mvm) strategies, averaged over 10 random sets of 10 labeled examples per view. 
\u2193 indicates statistically significantly worse performance than the best result, according to a Wilcoxon rank sum test (p < 0.01) [10].\n\nStrategy  C15           CCAT          E21           ECAT          GCAT          M11\n          Acc.   F1     Acc.   F1     Acc.   F1     Acc.   F1     Acc.   F1     Acc.   F1\nsvb       .559\u2193  .388\u2193  .639\u2193  .403\u2193  .557\u2193  .294\u2193  .579\u2193  .374\u2193  .800\u2193  .501\u2193  .651\u2193  .483\u2193\ngvb       .705   .474\u2193  .691\u2193  .464\u2193  .665\u2193  .351\u2193  .623\u2193  .424\u2193  .835\u2193  .595\u2193  .786\u2193  .589\u2193\nmvg       .693\u2193  .494\u2193  .681\u2193  .445\u2193  .665\u2193  .375\u2193  .620\u2193  .420\u2193  .834\u2193  .594\u2193  .787\u2193  .600\u2193\nmvm       .716   .521   .708   .478   .693   .405   .636   .441   .860   .642   .820   .644\n\nResults obtained in a supervised setting with only 10 labeled documents per language for training are summarized in table 2. All learning strategies using the generated views during training outperform the single-view baseline. This shows that, although imperfect, artificial views do bring additional information that compensates for the lack of labeled data. Although the multi-view Gibbs classifier predicts based on a translation rather than the original in 80% of cases, it produces almost identical performance to the gvb run (which only predicts using the original text). These results indicate that the translation produced by our MT system is of sufficient quality for indexing and classification purposes. Multi-view majority voting reaches the best performance, yielding a 6-17% improvement in accuracy over the baseline. A similar increase in performance is observed using F1, which suggests that the multi-view SVM appropriately handles unbalanced classes.\n\nFigure 1 shows the learning curves obtained on 3 classes, C15, ECAT and M11. 
These figures show that when there are enough labeled examples (around 500 for these 3 classes), the artificial views do not provide any additional useful information over the original-language examples. These empirical results illustrate the trade-off discussed in Section 4. When there are sufficient original labeled examples, additional generated views do not provide more useful information for learning than what view-specific classifiers have available already.\n\nWe now investigate the use of unlabeled training examples for learning the view-specific classifiers. Our overall aim is to illustrate our findings from section 5. Recall that in the case where view-specific classifiers are in agreement over the class labels of a large number of unlabeled examples, the multi-view Gibbs and majority vote strategies should have the same performance. In order to enforce agreement between classifiers on the unlabeled set, we use a variant of the iterative co-training algorithm [3]. Given the view-specific classifiers trained on an initial set of labeled examples, we iteratively assign pseudo-labels to the unlabeled examples for which all classifier predictions agree. We then train new view-specific classifiers on the joint set of the original labeled examples, and those unanimously (pseudo-)labeled ones. 
Key differences between this algorithm and co-training are the number of views used for learning (5 instead of 2), and the use of unanimous and simultaneous labeling.\n\nFootnote 2: The dataset is available from http://multilingreuters.iit.nrc.ca/ReutersMultiLingualMultiView.htm\n\n[Figure 1 plots: three panels (C15, ECAT, M11), each showing F1 (y-axis, 0.35 to 0.8) against labeled training size (x-axis, 10 to 500) for mvm, mvg and svb.]\n\nFigure 1: F1 vs. size of the labeled training set for classes C15, ECAT and M11.\n\nWe call this iterative process the self-learning multiple-view algorithm, as it also bears a similarity with the self-training paradigm [16]. Prediction from the multi-view SVM models obtained from this self-learning multiple-view algorithm is done either using Gibbs (mv^s_g) or majority voting (mv^s_m). These results are shown in table 3. For comparison we also trained a TSVM model [7] on each view separately, a semi-supervised equivalent to the single-view baseline strategy. Note that the TSVM model mostly outperforms the supervised baseline svb, although the F1 suffers on some classes. This suggests that the TSVM has trouble handling unbalanced classes in this setting.\n\nTable 3: Test classification accuracy and F1 in the semi-supervised setting, for single-view TSVM and multi-view self-learning using either Gibbs (mv^s_g) or majority voting (mv^s_m), averaged over 10 random sets using 10 labeled examples per view to start. For comparison we provide the single-view baseline and multi-view majority voting performance for supervised learning.\n\nStrategy  C15           CCAT          E21           ECAT          GCAT          M11\n          Acc.   F1     Acc.   F1     Acc.   F1     Acc.   F1     Acc.   F1     Acc.   F1\nsvb       .559\u2193  .388\u2193  .639\u2193  .403\u2193  .557\u2193  .294\u2193  .579\u2193  .374\u2193  .800\u2193  .501\u2193  .651\u2193  .483\u2193\nmvm       .716\u2193  .521\u2193  .708\u2193  .478\u2193  .693\u2193  .405\u2193  .636\u2193  .441\u2193  .860\u2193  .642\u2193  .820\u2193  .644\u2193\nTSVM      .721\u2193  .482\u2193  .721\u2193  .405\u2193  .746\u2193  .269\u2193  .665\u2193  .263\u2193  .876\u2193  .606\u2193  .834\u2193  .706\u2193\nmv^s_g    .772   .586   .762   .538   .765   .470   .691   .504   .903   .729   .900   .764\nmv^s_m    .773   .589   .766   .545   .767   .473   .701   .508   .905   .734   .901   .766\n\nThe multi-view self-learning algorithm achieves the best classification performance in both accuracy and F1, and significantly outperforms both the TSVM and the supervised multi-view strategy in all classes. As expected, the performance of the mv^s_g and mv^s_m strategies is similar.\n\n7 Conclusion\n\nThe contributions of this paper are twofold. First, we proposed a bound on the risk of the Gibbs classifier trained over artificially completed multi-view observations, which directly corresponds to our target application of learning text classifiers from a comparable corpus. We showed that our bound may lead to a trade-off between the size of the training set, the number of views, and the quality of the view generating functions. Our result identifies in which case it is advantageous to learn with additional artificial views, as opposed to sticking with the baseline setting in which a classifier is trained over single view observations. 
This result leads to our second contribution, which is\na natural way of using unlabeled data in semi-supervised multi-view learning. We showed that in the\ncase where view-specific classifiers agree over the class labels of additional unlabeled training data,\nthe previous trade-off becomes much tighter. Empirical results on a comparable multilingual\ncorpus support our findings by showing that additional views obtained using a Machine Translation\nsystem may significantly increase classification performance in the most interesting situation, when\nfew labeled data are available for training.\n\nAcknowledgements This work was supported in part by the IST Program of the European\nCommunity, under the PASCAL2 Network of Excellence, IST-2002-506778.\n\nReferences\n\n[1] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463\u2013482, 2003.\n\n[2] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In NIPS, 2007.\n\n[3] A. Blum and T. M. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, pages 92\u2013100, 1998.\n\n[4] K. Crammer, M. Kearns, and J. Wortman. Learning from multiple sources. Journal of Machine Learning Research, 9:1757\u20131774, 2008.\n\n[5] J. D. R. Farquhar, D. Hardoon, H. Meng, J. Shawe-Taylor, and S. Szedmak. Two view learning: SVM-2K, theory and practice. In Advances in Neural Information Processing Systems 18, pages 355\u2013362, 2006.\n\n[6] D. R. Hardoon, G. Leen, S. Kaski, and J. Shawe-Taylor (eds). NIPS workshop on learning from multiple sources. 2008.\n\n[7] T. Joachims. Transductive inference for text classification using support vector machines. In ICML, pages 200\u2013209, 1999.\n\n[8] T. Joachims. Training linear SVMs in linear time. 
In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), pages 217\u2013226, 2006.\n\n[9] J. Langford and J. Shawe-Taylor. PAC-Bayes & margins. In NIPS 15, pages 439\u2013446, 2002.\n\n[10] E. Lehmann. Nonparametric Statistical Methods Based on Ranks. McGraw-Hill, New York, 1975.\n\n[11] B. Leskes. The value of agreement, a new boosting algorithm. In COLT, pages 95\u2013110, 2005.\n\n[12] D. D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361\u2013397, 2004.\n\n[13] I. Muslea. Active learning with multiple views. PhD thesis, USC, 2002.\n\n[14] Reuters. Corpus, volume 2, multilingual corpus, 1996-08-20 to 1997-08-19. 2005.\n\n[15] N. Ueffing, M. Simard, S. Larkin, and J. H. Johnson. NRC\u2019s PORTAGE system for WMT. In ACL-2007 Second Workshop on SMT, pages 185\u2013188, 2007.\n\n[16] X. Zhu. Semi-supervised learning literature survey. Technical report, Univ. Wisconsin, 2007.\n\n\f", "award": [], "sourceid": 688, "authors": [{"given_name": "Massih R.", "family_name": "Amini", "institution": ""}, {"given_name": "Nicolas", "family_name": "Usunier", "institution": null}, {"given_name": "Cyril", "family_name": "Goutte", "institution": null}]}