### abstract ###
MISC	The Minimum Description Length principle for online sequence estimation/prediction in a proper learning setup is studied
MISC	If the underlying model class is discrete, then the total expected square loss is a particularly interesting performance measure: (a) this quantity is finitely bounded, implying convergence with probability one, and (b) it additionally specifies the convergence speed
MISC	For MDL, in general one can only have loss bounds which are finite but exponentially larger than those for Bayes mixtures
AIMX	We show that this is even the case if the model class contains only Bernoulli distributions
OWNX	We derive a new upper bound on the prediction error for countable Bernoulli classes
OWNX	This implies a small bound (comparable to the one for Bayes mixtures) for certain important model classes
OWNX	We discuss the application to Machine Learning tasks such as classification and hypothesis testing, and generalization to countable classes of iid models
### introduction ###
MISC	``Bayes mixture", ``Solomonoff induction", ``marginalization", all these terms refer to a central induction principle: Obtain a predictive distribution by integrating the product of prior and evidence over the model class
CONT	In many cases however, the Bayes mixture is computationally infeasible, and even a sophisticated approximation is expensive
MISC	The MDL or MAP (maximum a posteriori) estimator is both a common approximation for the Bayes mixture and interesting for its own sake: Use the model with the largest product of prior and evidence
MISC	In practice, the MDL estimator is usually being approximated too, since only a local maximum is determined
MISC	How good are the predictions by Bayes mixtures and MDL
MISC	This question has attracted much attention
MISC	In many cases, an important quality measure is the  total  or cumulative  expected loss  of a predictor
MISC	In particular the square loss is often considered
MISC	Assume that the outcome space is finite, and the model class is continuously parameterized
MISC	Then for Bayes mixture prediction, the cumulative expected square loss is usually small but unbounded, growing with  SYMBOL , where  SYMBOL  is the sample size  CITATION
MISC	This corresponds to an  instantaneous  loss bound of  SYMBOL
MISC	For the MDL predictor, the losses behave similarly  CITATION  under appropriate conditions, in particular with a specific prior
MISC	Note that in order to do MDL for continuous model classes, one needs to  discretize  the parameter space, see also  CITATION
MISC	On the other hand, if the model class is discrete, then Solomonoff's theorem  CITATION  bounds the cumulative expected square loss for the Bayes mixture predictions finitely, namely by  SYMBOL , where  SYMBOL  is the prior weight of the ``true" model  SYMBOL
MISC	The only necessary assumption is that the true distribution  SYMBOL  is contained in the model class, ie that we are dealing with  proper learning
MISC	It has been demonstrated  CITATION , that for both Bayes mixture and MDL, the proper learning assumption can be essential: If it is violated, then learning may fail very badly
MISC	For MDL predictions in the proper learning case, it has been shown  CITATION  that a bound of  SYMBOL  holds
MISC	This bound is exponentially larger than the Solomonoff bound, and it is sharp in general
MISC	A finite bound on the total expected square loss is particularly interesting:   It implies convergence of the predictive to the true probabilities with probability one
MISC	In contrast, an instantaneous loss bound of  SYMBOL  implies only convergence in probability
MISC	Additionally, it gives a  convergence speed , in the sense that errors of a certain magnitude cannot occur too often
MISC	So for both, Bayes mixtures and MDL, convergence with probability one holds, while the convergence speed is exponentially worse for MDL compared to the Bayes mixture
MISC	We avoid the term ``convergence rate" here, since the order of convergence is identical in both cases
MISC	It is eg  SYMBOL  if we additionally assume that the error is monotonically decreasing, which is not necessarily true in general)
MISC	It is therefore natural to ask if there are model classes where the cumulative loss of MDL is comparable to that of Bayes mixture predictions
AIMX	In the present work, we concentrate on the simplest possible stochastic case, namely discrete Bernoulli classes
OWNX	Note that then the MDL ``predictor" just becomes an estimator, in that it estimates the true parameter and directly uses that for prediction
OWNX	Nevertheless, for consistency of terminology, we keep the term predictor
MISC	It might be surprising to discover that in general the cumulative loss is still exponential
OWNX	On the other hand, we will give mild conditions on the prior guaranteeing a small bound
MISC	Moreover, it is well-known that the instantaneous square loss of the Maximum Likelihood estimator decays as  SYMBOL  in the Bernoulli case
MISC	The same holds for MDL, as we will see
MISC	If convergence speed is measured in terms of instantaneous losses, then much more general statements are possible  CITATION , this is briefly discussed in Section
MISC	A particular motivation to consider discrete model classes arises in Algorithmic Information Theory
MISC	From a computational point of view, the largest relevant model class is the class of all computable models on some fixed universal Turing machine, precisely prefix machine  CITATION
MISC	Thus each model corresponds to a program, and there are countably many programs
MISC	Moreover, the models are stochastic, precisely they are  semimeasures  on strings (programs need not halt, otherwise the models were even measures)
MISC	Each model has a natural description length, namely the length of the corresponding program
MISC	If we agree that programs are binary strings, then a prior is defined by two to the negative description length
MISC	By the Kraft inequality, the priors sum up to at most one
MISC	Also the Bernoulli case can be studied in the view of Algorithmic Information Theory
MISC	We call this the  universal setup : Given a universal Turing machine, the related class of Bernoulli distributions is isomorphic to the countable set of computable reals in  SYMBOL
MISC	The description length  SYMBOL  of a parameter  SYMBOL  is then given by the length of its shortest program
MISC	A prior weight may then be defined by  SYMBOL
MISC	If a string  SYMBOL  is generated by a Bernoulli distribution with computable parameter  SYMBOL , then with high probability the two-part complexity of  SYMBOL  with respect to the Bernoulli class does not exceed its algorithmic complexity by more than a constant, as shown by Vovk  CITATION
MISC	That is, the two-part complexity with respect to the Bernoulli class  is  the shortest description, save for an additive constant
MISC	Many Machine Learning tasks are or can be reduced to sequence prediction tasks
MISC	An important example is classification
MISC	The task of classifying a new instance  SYMBOL  after having seen (instance,class) pairs  SYMBOL  can be phrased as to predict the continuation of the sequence  SYMBOL
MISC	Typically the (instance,class) pairs are iid
MISC	Cumulative loss bounds for prediction usually generalize to prediction  conditionalized  to some inputs  CITATION
MISC	Then we can solve classification problems in the standard form
OWNX	It is not obvious if and how the proofs in this paper can be conditionalized
OWNX	Our main tool for obtaining results is the Kullback-Leibler divergence
OWNX	Lemmata for this quantity are stated in Section
OWNX	Section  shows that the exponential error bound obtained in  CITATION  is sharp in general
OWNX	In Section , we give an upper bound on the instantaneous and the cumulative losses
OWNX	The latter bound is small eg under certain conditions on the distribution of the weights, this is the subject of Section
OWNX	Section  treats the universal setup
OWNX	Finally, in Section  we discuss the results and give conclusions