### abstract ###
Given a random binary sequence $X = x_1, x_2, \ldots$ of random variables $x_t \in \{0,1\}$, $t = 1, 2, \ldots$, for instance one generated by a Markov source of order $k$ (each state represented by $k$ bits), let $p \equiv P(x_t = 1)$ be the probability of $x_t = 1$ and assume it is constant with respect to $t$ (due to stationarity). Consider a learner based on a parametric model, for instance a Markov model of order $d$, which trains on a sample sequence $X_1^n$ drawn randomly from the source. Test the learner's performance by giving it a sequence $X$ (generated by the source) and checking its predictions on every bit of $X$: an error occurs at time $t$ if the prediction $\hat{y}_t$ differs from the true bit value $x_t$. Denote by $\xi$ the sequence of errors, where the error bit $\xi_t$ at time $t$ equals $1$ or $0$ according to whether an error occurs or not, respectively. Consider the subsequence $\xi'$ of $\xi$ corresponding to the errors made when predicting a $1$, i.e., $\xi'$ consists of the bits of $\xi$ only at times $t$ such that $\hat{y}_t = 1$. In this paper we compute an upper bound on the deviation of the frequency of $1$s in $\xi'$ from its mean, showing the dependence on $n$, $d$, and the length of $\xi'$.
### introduction ###
From basic theory on finite Markov chains, since the matrix $M$ is stochastic (i.e., the sum of the elements in any row equals $1$), $M$ has a stationary joint probability distribution $P$, which is not necessarily unique. To keep the notation simple we use $P$ to denote also any marginal distribution derived from the stationary joint distribution. For instance, $P(x_t)$ denotes the stationary marginal probability of the bit at time $t$. Henceforth, all random binary sequences are assumed to be drawn according to this probability distribution $P$. Thus for any $t$ and $j$ satisfying $j \ge k$, the probability of a string $x_t^{t+j} = [x_t, \ldots, x_{t+j}]$ can be expressed by the chain rule as
$$P\left(x_t^{t+j}\right) = P\left(x_t^{t+k-1}\right)\prod_{i=t+k}^{t+j} P\left(x_i \,\middle|\, x_{i-k}^{i-1}\right).$$
Let us denote by $p \equiv P(x_t = 1)$ the stationary probability of the event $x_t = 1$ at time $t$.
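To make the factorization concrete, here is a minimal Python sketch (my own illustration, not from the paper) that evaluates the probability of a binary string under an order-$k$ source via the chain rule above; the stationary state distribution `pi` and the type-1 transition probabilities `p1` are hypothetical inputs.

```python
# A minimal sketch: P(x_t^{t+j}) for an order-k binary Markov source via the
# chain rule.  `pi` (stationary state distribution) and `p1` (probability of
# a 1 given the current k-bit state) are hypothetical, illustrative inputs.

from itertools import product

def string_probability(bits, k, pi, p1):
    """P(bits) = pi(first k bits) * prod over remaining bits of P(bit | state)."""
    assert len(bits) > k
    state = tuple(bits[:k])
    prob = pi[state]                      # stationary prob. of the initial state
    for b in bits[k:]:
        q = p1[state]                     # P(next bit = 1 | state)
        prob *= q if b == 1 else 1.0 - q
        state = state[1:] + (b,)          # shift the k-bit window
    return prob

# toy example with k = 2: uniform stationary distribution, biased transitions
k = 2
pi = {s: 0.25 for s in product((0, 1), repeat=k)}
p1 = {s: 0.7 if s[-1] == 1 else 0.3 for s in product((0, 1), repeat=k)}
print(string_probability([0, 1, 1, 0, 1], k, pi, p1))
```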
Data generation: We henceforth assume that the source has reached stationarity and produces the data sequence $X$ with respect to $P$.
Consider the learner's model $\mathcal{M}_d$. Its set of parameters are the true (unknown) probability values of transitions between states in $\mathcal{M}_d$, where the probability values are assigned according to the source distribution $P$. We denote them by $p_s$, $s \in \{0,1\}^d$. For instance, suppose $d = 3$ and consider the two states $s = [1,0,1]$ and $s' = [0,1,1]$, where $s'$ is obtained from $s$ by a type-1 transition (shift left and append a $1$). The corresponding transition probability is $p_s = P(s_{t+1} = s' \mid s_t = s)$. Based on $X_1^n$ the learner estimates $p_s$ by $\hat{p}_s = N_{s1}/N_s$, where for a state $s$, $N_s$ denotes the number of times that $s$ appears in $X_1^n$ and $N_{s1}$ denotes the number of times there is a transition from state $s$ to $s'$ in $X_1^n$. For instance, if $s$ appears $N_s = 4$ times in $X_1^n$ and is followed by a $1$ in $N_{s1} = 3$ of those appearances, then $\hat{p}_s = 3/4$. Thus the $\hat{p}_s$ are the frequencies of state transitions in $X_1^n$. Note that the $N_s$, $s \in \{0,1\}^d$, are dependent random variables, since the Markov chain may visit each state a random number of times while the counts must satisfy $\sum_{s \in \{0,1\}^d} N_s = n - d + 1$.
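The estimator is simple to state in code. The following sketch, under the definitions above, computes the counts $N_s$ and $N_{s1}$ and the plug-in estimates $\hat{p}_s$ from a training prefix; the function and variable names are mine.

```python
# A sketch of the learner's estimates under this section's definitions:
# n_s counts occurrences of the d-bit state s in the training prefix,
# n_s1 counts type-1 transitions (s followed by a 1), p_hat = n_s1 / n_s.

from collections import Counter

def estimate_transitions(x, d):
    n_s, n_s1 = Counter(), Counter()
    for i in range(len(x) - d):        # states are counted only when an
        s = tuple(x[i:i + d])          # outgoing bit exists, so the ratio
        n_s[s] += 1                    # n_s1 / n_s is always well defined
        if x[i + d] == 1:
            n_s1[s] += 1
    p_hat = {s: n_s1[s] / n_s[s] for s in n_s}
    return n_s, n_s1, p_hat

x_train = [0, 1, 1, 0, 1, 1, 1, 0, 0, 1]
print(estimate_transitions(x_train, d=2)[2])
```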
After training, the learner is tested on the remaining $m$ bits of the data, $x_{n+1}, \ldots, x_{n+m}$. It makes a binary prediction $\hat{y}_{t+1}$ for $x_{t+1}$, $n \le t \le n+m-1$, based on the maximum a posteriori probability, which is defined as follows: suppose that the current state is $s_t = s$; then the prediction is
$$\hat{y}_{t+1} = \begin{cases} 1 & \text{if } \hat{p}_s \ge 1/2 \\ 0 & \text{otherwise,} \end{cases}$$
where $\hat{p}_s$ is the estimate of $P(s' \mid s)$ for the state $s'$ obtained from $s$ by a type-1 transition, i.e., if $s = [b_1, \ldots, b_d]$ then $s' = [b_2, \ldots, b_d, 1]$. The corresponding true probability value is denoted by $p_s$.
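A minimal sketch of this maximum a posteriori rule follows; breaking the tie at $\hat{p}_s = 1/2$ toward $1$ matches the rule above, while the fallback for states never seen in training is my own assumption.

```python
# A minimal sketch of the MAP prediction rule: at state s, predict 1 exactly
# when the estimated type-1 transition probability is at least 1/2.

def predict_bit(state, p_hat, default=0):
    q = p_hat.get(state, None)
    if q is None:            # state never seen in training: fallback choice
        return default       # (an assumption of this sketch, not the paper's)
    return 1 if q >= 0.5 else 0
```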
Note that the prediction rule may be expressed alternatively as $\hat{y}_{t+1} = \mathbb{1}\left\{\hat{p}_{s_t} \ge 1/2\right\}$. We claim that the $\hat{p}_s$, $s \in \{0,1\}^d$, are independent random variables when conditioned on the vector $N = [N_s]_{s \in \{0,1\}^d}$. We now prove this claim, which will be used in a later section.
Let us denote by $s_i$, $d \le i \le n$, the state $[x_{i-d+1}, \ldots, x_i]$ at time $i$, and by $\sigma = [s_d, \ldots, s_n]$ the particular sequence of states corresponding to the sequence $X_1^n$. To show the dependence of $\hat{p}_s$ on $X_1^n$ we will sometimes write $\hat{p}_s(X_1^n)$. Then, assuming as in the problem setting below that $d$ is large enough for the sequence of $d$-bit states to be itself a Markov chain, the chain rule gives
$$P\left(X_1^n\right) = P(s_d) \prod_{i=d}^{n-1} P\left(s_{i+1} \mid s_i\right).$$
Since at every bit there are only two types of transitions, not every sequence of states is possible. For instance, if $d = 2$ then the state sequence $[01, 11]$ is valid but $[01, 00]$ is not, since a successor of $01$ must begin with a $1$. Denote by $\mathcal{S}$ the set of valid state sequences $\sigma$.
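The validity condition is just the overlap constraint on consecutive states: each step appends a single new bit. A sketch (names are mine):

```python
# A sketch of the validity condition on state sequences: the last d-1 bits
# of each state must equal the first d-1 bits of its successor.

def is_valid(states):
    return all(a[1:] == b[:-1] for a, b in zip(states, states[1:]))

print(is_valid([(0, 1), (1, 1)]))   # True:  01 -> 11 is a legal shift
print(is_valid([(0, 1), (0, 0)]))   # False: 01 cannot be followed by 00
```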
We now show that if $\sigma$ is in $\mathcal{S}$ then, conditioned on $N$, any other state sequence that visits the same states as $\sigma$ the same number of times (perhaps in a different order) must have the same probability. For any state $s$, denote by $N_{s1}$ the random variable whose value is the number of type-1 transitions from state $s$ in a sequence of random states, and by $\nu_{s1}$ the number of type-1 transitions from state $s$ in the particular sequence $\sigma$ (similarly, $\nu_s$ denotes the number of transitions out of $s$ in $\sigma$). Since all state transitions are either type-0 or type-1, we have
$$P\left(\sigma \mid s_d\right) = \prod_{s \in \{0,1\}^d} p_s^{\nu_{s1}} \left(1 - p_s\right)^{\nu_s - \nu_{s1}},$$
where $p_s$ was defined above.
Let $a$ be a non-negative integer parameter and associate with the random variable $N_{s1}$ a conditional probability function with parameter $\nu_s$,
$$\phi_s(a) \equiv P\left(N_{s1} = a \mid N_s = \nu_s\right).$$
Then the right side of the factorization above equals
$$\prod_{s \in \{0,1\}^d} p_s^{\nu_{s1}}\left(1 - p_s\right)^{\nu_s - \nu_{s1}}.$$
For a fixed value of $N$, the event ``$X_1^n = x$'' is equivalent to the event ``$N_{s1} = \nu_{s1}$ for every $s$''. Hence, alternatively, the right side can be expressed as
$$\prod_{s \in \{0,1\}^d} \phi_s\left(\nu_{s1}\right),$$
which is a product of probability functions of the random variables $N_{s1}$. So conditioned on $N$, and on the event that $X_1^n$ corresponds to a valid state sequence $\sigma \in \mathcal{S}$, the event that $X_1^n$ is generated by the source Markov chain is equivalent to the event that its corresponding state sequence has transition counts $N_{s1}$ that independently take the particular values $\nu_{s1}$ prescribed by $\sigma$. The claim is proved.
It also follows that $\hat{p}_s$ is the average of independent Bernoulli trials (a success taken as a type-1 transition from state $s$); conditioned on $N_s = \nu_s$, the count $N_{s1}$ is distributed according to the Binomial distribution with parameters $\nu_s$ and $p_s$.
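As a sanity check of this conditional Binomial law, here is a small Monte Carlo sketch (my own illustration, with a hypothetical order-$2$ source): it conditions on $N_s = \nu$ and tabulates the empirical distribution of $N_{s1}$, which should approximate $\mathrm{Binomial}(\nu, p_s)$.

```python
# Monte Carlo sketch: conditioned on the number of visits N_s = nu to a fixed
# state s, the count of type-1 transitions N_s1 should be Binomial(nu, p_s).

import random
from collections import Counter

d, n, nu = 2, 30, 5
s = (0, 1)
p1 = {t: 0.7 if t[-1] == 1 else 0.3
      for t in [(0, 0), (0, 1), (1, 0), (1, 1)]}    # hypothetical source

rng = random.Random(0)
hist = Counter()
for _ in range(20000):
    x = [rng.randint(0, 1) for _ in range(d)]       # uniform start, for
    for _ in range(n - d):                          # simplicity (the paper
        x.append(1 if rng.random() < p1[tuple(x[-d:])] else 0)  # is stationary)
    visits = [i for i in range(n - d) if tuple(x[i:i + d]) == s]
    if len(visits) == nu:                           # condition on N_s = nu
        hist[sum(x[i + d] for i in visits)] += 1    # realized N_s1

total = sum(hist.values())
for a in sorted(hist):                              # compare with Binomial(5, 0.7)
    print(a, hist[a] / total)
```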
We now summarize the problem setting under which the main result of the paper holds.

Problem setting: Let $k$ and $d$ be positive integers. Let $P$ be the stationary probability distribution based on a finite, ergodic and reversible Markov chain with probability-transition matrix $M$ that has a second largest eigenvalue $\lambda_2$. All probability values are measured according to $P$. Denote by $\epsilon \equiv 1 - \lambda_2$ the eigenvalue gap of $M$. After reaching stationarity the source generates a binary sequence $X = x_1, x_2, \ldots$ by repeatedly drawing the next bit according to $P$. Denote by $p \equiv P(x_t = 1)$ the stationary probability of a $1$.
Let $X_1^{n+m} = x_1, \ldots, x_{n+m}$, for positive integers $n$ and $m$, be a data-sequence obtained by randomly drawing according to $P$. Let the learner's model $\mathcal{M}_d$ be Markov of order $d$, and denote by $p_s$ the probability of making a type-1 transition from state $s$ of $\mathcal{M}_d$. The learner uses the first $n$ bits, $X_1^n$, to estimate $p_s$ by $\hat{p}_s(X_1^n) = N_{s1}/N_s$. Let $N_s$ denote the number of times that state $s$ appears in $X_1^n$, $s \in \{0,1\}^d$. After training, the learner's decision at state $s$ is to output $1$ if $\hat{p}_s \ge 1/2$, else to output $0$.
Denote by $q_s$ the probability that a Binomial random variable with parameters $N_s$ and $p_s$ is larger (or smaller) than $N_s/2$ given that $p_s$ is smaller (or larger) than $1/2$, respectively. Let $q \equiv \max_{s \in \{0,1\}^d} q_s$. Let $\mu$ denote the probability that the source produces a $0$ at a time when the learner predicts a $1$.
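The quantity $q_s$ can be computed exactly from the Binomial tail; a sketch under the definition above (names are mine):

```python
# A sketch of the mis-estimation probability q_s: the chance that a
# Binomial(nu, p) count lands on the wrong side of nu/2, i.e. that the
# plug-in estimate crosses 1/2 away from the true p.

from math import comb

def q_s(nu, p):
    pmf = lambda a: comb(nu, a) * p**a * (1 - p)**(nu - a)
    if p < 0.5:   # error event: estimate at least 1/2 (ties predict 1)
        return sum(pmf(a) for a in range(nu + 1) if a >= nu / 2)
    else:         # error event: estimate below 1/2
        return sum(pmf(a) for a in range(nu + 1) if a < nu / 2)

print(q_s(20, 0.3))   # small when nu is large and p is far from 1/2
```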
Using $\mathcal{M}_d$ with the estimates $\hat{p}_s$, the learner is tested incrementally on the remaining $m$ bits $x_{n+1}, \ldots, x_{n+m}$ of the data: it predicts an output bit $\hat{y}_t$ for bit $x_t$ to be $1$ if $\hat{p}_{s_{t-1}} \ge 1/2$, else $0$. Denote by $\xi = \xi_{n+1}, \ldots, \xi_{n+m}$ the sequence of mistakes, where $\xi_t = 1$ if $\hat{y}_t \ne x_t$, and $\xi_t = 0$ otherwise, $n+1 \le t \le n+m$. Denote by $\xi' = \xi_{t_1}, \ldots, \xi_{t_{m'}}$, $m' \le m$, the subsequence of $\xi$ with time instants $t_i$ corresponding to $1$-predictions, $\hat{y}_{t_i} = 1$, $1 \le i \le m'$. Note that $\xi'$ is also determined by a subsequence of the input sequence $X$ (at these times $\xi_{t_i} = 1 - x_{t_i}$), hence effectively the learner acts as a selection rule which picks certain bits $x_{t_i}$ from $X$.
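Putting the test phase together, the following sketch (my own, under the setting above) produces the mistake sequence $\xi$ and the selected subsequence $\xi'$; letting unseen states default to a $0$-prediction is my assumption, not the paper's.

```python
# A sketch of the test phase: slide over the last m bits, predict with the
# trained estimates p_hat, record the mistake sequence xi, and extract the
# subsequence xi_prime at the times where the learner predicted a 1.

def test_learner(x, n, d, p_hat):
    xi, xi_prime = [], []
    for t in range(n, len(x)):
        state = tuple(x[t - d:t])                     # current d-bit state
        y = 1 if p_hat.get(state, 0.0) >= 0.5 else 0  # MAP rule; unseen
        err = 1 if y != x[t] else 0                   # states default to 0
        xi.append(err)
        if y == 1:                  # selection rule: here err == 1 - x[t]
            xi_prime.append(err)
    return xi, xi_prime
```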
Let $\beta$ denote the deviation bound of the main theorem, a function of $n$, $d$, $m'$ and $\epsilon$, and assume that the learner's model order $d$ satisfies the compatibility condition with the source (in particular, that the sequence of $d$-bit states is itself a Markov chain). We now state the main result of the paper.
Before presenting the proof we make the following remarks. The effect of the training sequence length $n$ on $\beta$ is a decreasing one: as $n$ increases, the class of possible learnt models (the hypothesis class) decreases in size, thereby decreasing the bound $\beta$ on the deviation of the error sequence. The effect of the learner's model order $d$ is opposite to that of $n$: $\beta$ grows with $d$, since as $d$ increases the hypothesis class increases in size. The effect of the length $m'$ of the error sequence on $\beta$ is also a decreasing one; clearly, the longer the subsequence, the less chance that its frequency of $1$s deviates from the mean $\mu$. The effect of the inter-dependence between the states of the source model on $\beta$ enters through the eigenvalue gap $\epsilon$: as the dependence increases, $\epsilon$ decreases, which increases the possible deviation size $\beta$; conversely, as $\lambda_2$ decreases, the bits of the sequence $X$ become less dependent and $\beta$ decreases.
