### abstract ###
The Baum-Welch algorithm, together with its derivatives and variations, has been the main technique for learning Hidden Markov Models (HMMs) from observational data. We present an HMM learning algorithm based on the non-negative matrix factorization (NMF) of higher-order Markovian statistics that is structurally different from Baum-Welch and its associated approaches. The described algorithm supports estimation of the number of recurrent states of an HMM and iterates the NMF algorithm to improve the learned HMM parameters. Numerical examples are provided as well.
### introduction ###
Hidden Markov Models (HMMs) have been used successfully to model stochastic systems arising in a variety of applications ranging from biology to engineering to finance~CITATION. Following accepted notation for representing the parameters and structure of HMMs (see CITATION for example), we will use the following terminology and definitions:
 
 
$N$ is the number of states of the Markov chain underlying the HMM. The state space is $\{S_1, \dots, S_N\}$ and the system's state process at time $t$ is denoted by $x_t$;

$M$ is the number of distinct observables or symbols generated by the HMM. The set of possible observables is $\{v_1, \dots, v_M\}$ and the observation process at time $t$ is denoted by $y_t$. We denote by $y_a^b$ the subprocess $y_a, y_{a+1}, \dots, y_b$;

The joint probabilities $a_{ij}(k) = P(x_{t+1} = S_j,\ y_{t+1} = v_k \mid x_t = S_i)$ give the probability of moving from state $S_i$ to state $S_j$ while emitting the observable $v_k$. For each observable $v_k$ this defines a matrix $A(k) = (a_{ij}(k))$, and $A = \sum_{k} A(k)$ is the transition matrix of the state process $x_t$. The initial state distribution at $t = 1$ is $\Gamma = \{\gamma_1, \dots, \gamma_N\}$, where $\gamma_i = P(x_1 = S_i)$, $\gamma_i \geq 0$, and $\sum_i \gamma_i = 1$. An HMM is thus specified by the parameter set $\lambda = (\{A(k) \mid 1 \leq k \leq M\}, \Gamma)$.
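To make the parameter set concrete, here is a minimal sketch (in Python with NumPy; the two-state, two-symbol HMM and its probabilities are hypothetical, chosen only to satisfy the constraints) that encodes $\lambda$ and checks that $A = \sum_k A(k)$ is row-stochastic and that $\Gamma$ sums to one:

```python
import numpy as np

# Hypothetical 2-state, 2-symbol HMM; A[k-1][i, j] stores
# a_{ij}(k) = P(x_{t+1} = S_j, y_{t+1} = v_k | x_t = S_i)
# (Python indices are 0-based, the paper's notation is 1-based).
A = [
    np.array([[0.3, 0.1],
              [0.1, 0.2]]),  # A(1): transitions that emit v_1
    np.array([[0.2, 0.4],
              [0.3, 0.4]]),  # A(2): transitions that emit v_2
]
Gamma = np.array([0.5, 0.5])  # gamma_i = P(x_1 = S_i)

# The underlying chain's transition matrix is the sum over observables;
# each of its rows, and Gamma itself, must sum to one.
A_total = sum(A)
assert np.allclose(A_total.sum(axis=1), 1.0)
assert np.isclose(Gamma.sum(), 1.0)
```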
We present an algorithm for {\em learning} an HMM from single or multiple observation sequences. The traditional approach to learning an HMM is the Baum-Welch algorithm~CITATION, which has been extended in a variety of ways by others CITATION.
Recently, a novel and promising approach to the HMM approximation problem was proposed by Finesso et al.~CITATION. That approach is based on Anderson's HMM stochastic realization technique~CITATION, which demonstrates that a positive factorization of a certain Hankel matrix (consisting of observation string probabilities) can be used to recover the hidden Markov model's probability matrices. Finesso and his coauthors used recently developed non-negative matrix factorization (NMF) algorithms~CITATION to turn those stochastic realization techniques into an operational algorithm. Earlier ideas in this vein were anticipated by Upper in 1997~CITATION, although that work did not benefit from HMM stochastic realization techniques or NMF algorithms, both of which were developed after 1997.
Methods based on stochastic realization techniques, including the one presented here, are fundamentally different from Baum-Welch-based methods in that they take as input observation sequence {\em probabilities} rather than raw observation {\em sequences}. Anderson's and Finesso's approaches use system realization methods, while our algorithm is in the spirit of the Myhill-Nerode~CITATION construction for building automata from languages. In the Myhill-Nerode construction, states are defined as equivalence classes of pasts that produce the same futures. In an HMM, the ``future'' of a state is a probability distribution over future observations. Following this intuition, we derive our result in a manner that is more concise and elementary than the aforementioned approaches of Anderson and Finesso.
At a conceptual level, our algorithm operates as follows. We first estimate the matrix of an observation sequence's higher-order statistics. This matrix has a natural non-negative matrix factorization (NMF)~CITATION whose factors can be interpreted in terms of the probability distributions of future observations given the current state of the underlying Markov chain. Once estimated, these probability distributions can be used to estimate the transition probabilities of the HMM directly.
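The sketch below illustrates these two steps in Python with NumPy. The window length `L`, the Lee-Seung multiplicative-update rule (one standard NMF algorithm), and all function names are illustrative assumptions rather than the paper's exact construction:

```python
import numpy as np
from itertools import product

def empirical_matrix(seq, n_symbols, L=2):
    """Estimate the higher-order statistics matrix M, where M[u, w] is
    P(past window u, future window w) over all length-L words u and w,
    from a single observation sequence of symbol indices."""
    words = list(product(range(n_symbols), repeat=L))
    index = {w: i for i, w in enumerate(words)}
    M = np.zeros((len(words), len(words)))
    for t in range(len(seq) - 2 * L + 1):
        u = tuple(seq[t:t + L])          # past window
        w = tuple(seq[t + L:t + 2 * L])  # future window
        M[index[u], index[w]] += 1.0
    return M / M.sum()

def nmf(M, r, iters=1000, eps=1e-12, seed=0):
    """Lee-Seung multiplicative updates for M ~= W @ H with W, H >= 0;
    r is the hypothesized number of hidden states."""
    rng = np.random.default_rng(seed)
    W = rng.random((M.shape[0], r))
    H = rng.random((r, M.shape[1]))
    for _ in range(iters):
        H *= (W.T @ M) / (W.T @ W @ H + eps)
        W *= (M @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Each row of H, normalized to sum to one, estimates the distribution of
# the next L observations given one hidden state; the transition
# probabilities can then be read off from these distributions.
```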
The estimated HMM parameters can be used, in turn, to compute the NMF matrix factors as well as the underlying higher-order correlation matrix from data generated by the estimated HMM. We present a simple example in which an NMF factorization is exact but does not correspond to any HMM. This fact can be established by comparing the factors computed by the NMF with the factors derived from the estimated HMM parameters. This kind of comparison is not possible with other approaches~CITATION.
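One way to carry out that check, as a sketch under the notation above (the helper names are ours): given estimated matrices $A(k)$, the probability of observing a future word $w = v_{k_1} \cdots v_{k_L}$ from state $S_i$ is entry $i$ of $A(k_1) \cdots A(k_L) \mathbf{1}$; stacking these vectors over all length-$L$ words yields a factor directly comparable, up to row normalization and permutation, with the $H$ returned by the NMF:

```python
def future_given_state(A, word):
    """P(next observations spell out `word` | current state S_i), for
    every state i: entry i of A(k_1) @ ... @ A(k_L) @ 1 (0-based here)."""
    P = np.eye(A[0].shape[0])
    for k in word:
        P = P @ A[k]
    return P @ np.ones(P.shape[0])

def hmm_factor(A, words):
    """One row per hidden state, one column per future word; comparable
    with the NMF factor H once H's rows are normalized to sum to one."""
    return np.column_stack([future_given_state(A, w) for w in words])
```

If no row scaling and permutation brings the two factors into agreement, the exact NMF factorization cannot have come from any HMM.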
It is important to point out that computing the optimal non-negative matrix factorization of a positive matrix is NP-hard in the general case~CITATION, so in practice one computes only locally optimal factorizations. As we will show through examples, repeated iteration of the factorization and transition-probability estimation steps improves the factorizations and the overall model estimate.
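Schematically, that iteration might be organized as below; this is one plausible reading of the loop, not the paper's exact procedure, and `estimate_hmm` and `factors_from_hmm` are hypothetical helpers standing in for the estimation steps described above:

```python
def refine(M_emp, r, rounds=5):
    """Alternate NMF and HMM parameter estimation: factor the empirical
    statistics matrix, read off HMM parameters, convert them back into a
    factor pair, and use that pair to warm-start the next factorization."""
    W, H = nmf(M_emp, r)                   # nmf() as sketched earlier
    for _ in range(rounds):
        A, Gamma = estimate_hmm(W, H)      # hypothetical helper
        W, H = factors_from_hmm(A, Gamma)  # hypothetical helper
        W, H = nmf_warm(M_emp, W, H)       # re-factor from the HMM start
    return A, Gamma

def nmf_warm(M, W, H, iters=1000, eps=1e-12):
    """Multiplicative updates as in nmf(), started from given factors."""
    W, H = W.copy(), H.copy()
    for _ in range(iters):
        H *= (W.T @ M) / (W.T @ W @ H + eps)
        W *= (M @ H.T) / (W @ H @ H.T + eps)
    return W, H
```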
Details are provided below.
