### abstract ###
We consider the problem of joint universal variable-rate lossy coding and identification for parametric classes of stationary  SYMBOL -mixing sources with general (Polish) alphabets
Compression performance is measured in terms of Lagrangians, while identification performance is measured by the variational distance between the true source and the estimated source
Provided that the sources are mixing at a sufficiently fast rate and satisfy certain smoothness and Vapnik--Chervonenkis learnability conditions, it is shown that, for bounded metric distortions, there exist universal schemes for joint lossy compression and identification whose Lagrangian redundancies converge to zero as  SYMBOL  as the block length  SYMBOL  tends to infinity, where  SYMBOL  is the Vapnik--Chervonenkis dimension of a certain class of decision regions defined by the  SYMBOL -dimensional marginal distributions of the sources; furthermore, for each  SYMBOL , the decoder can identify the  SYMBOL -dimensional marginal of the active source up to a ball of radius  SYMBOL  in variational distance, eventually with probability one
The results are supplemented by several examples of parametric sources satisfying the regularity conditions
Index Terms: Learning, minimum-distance density estimation, two-stage codes, universal vector quantization, Vapnik--Chervonenkis dimension
### introduction ###
It is well known that lossless source coding and statistical modeling are complementary objectives
This fact is captured by the Kraft inequality (see Section~5.2 in Cover and Thomas  CITATION ), which provides a correspondence between uniquely decodable codes and probability distributions on a discrete alphabet
If one has full knowledge of the source statistics, then one can design an optimal lossless code for the source, and  vice versa
However, in practice it is unreasonable to expect that the source statistics are known precisely, so one has to design  universal  schemes that perform asymptotically optimally within a given class of sources
In universal coding, too, as Rissanen has shown in  CITATION , the coding and modeling objectives can be accomplished jointly: given a sufficiently regular parametric family of discrete-alphabet sources, the encoder can acquire the source statistics via maximum-likelihood estimation on a sufficiently long data sequence and use this knowledge to select an appropriate coding scheme
Even in nonparametric settings (e.g., the class of all stationary ergodic discrete-alphabet sources), universal schemes such as Ziv--Lempel  CITATION  amount to constructing a probabilistic model for the source
In the reverse direction, Kieffer  CITATION  and Merhav  CITATION , among others, have addressed the problem of statistical modeling (parameter estimation or model identification) via universal lossless coding
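The correspondence between code lengths and probability distributions given by the Kraft inequality can be made concrete with a small sketch. The Python snippet below is only an illustration and is not taken from the cited references: it assigns Shannon code lengths to a hypothetical four-letter distribution `p`, checks the Kraft sum, and recovers the sub-probability distribution implied by the lengths. The distribution and all function names are assumptions made for the example.

```python
import math

def shannon_lengths(p):
    """Codeword lengths l(x) = ceil(-log2 p(x)); assumes every p(x) > 0."""
    return {x: math.ceil(-math.log2(px)) for x, px in p.items()}

def kraft_sum(lengths):
    """Sum of 2^{-l(x)}; binary uniquely decodable codes require this to be <= 1."""
    return sum(2.0 ** -l for l in lengths.values())

def implied_distribution(lengths):
    """Sub-probability distribution q(x) = 2^{-l(x)} induced by the lengths."""
    return {x: 2.0 ** -l for x, l in lengths.items()}

# Hypothetical source distribution, chosen only for illustration.
p = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}
lengths = shannon_lengths(p)
q = implied_distribution(lengths)
```

Going from `p` to `lengths` is the modeling-to-coding direction; going from `lengths` to `q` is the coding-to-modeling direction. The masses of `q` sum to the Kraft sum, which is at most one.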
Once we consider  lossy  coding, though, the relationship between coding and modeling is no longer so simple
On the one hand, having full knowledge of the source statistics is certainly helpful for designing optimal rate-distortion codebooks
On the other hand, apart from some special cases (e.g., for i.i.d. Bernoulli sources and the Hamming distortion measure or for i.i.d. Gaussian sources and the squared-error distortion measure), it is not at all clear how to extract a reliable statistical model of the source from its reproduction via a rate-distortion code (although, as shown recently by Weissman and Ordentlich  CITATION , the joint empirical distribution of the source realization and the corresponding codeword of a ``good'' rate-distortion code converges to the distribution solving the rate-distortion problem for the source)
This is not a problem when the emphasis is on compression, but there are situations in which one would like to compress the source and identify its statistics at the same time
For instance, in  indirect adaptive control  (see, e.g., Chapter~7 of Tao  CITATION ) the parameters of the plant (the controlled system) are estimated on the basis of observation, and the controller is modified accordingly
Consider the discrete-time stochastic setting, in which the plant state sequence is a random process whose statistics are governed by a finite set of parameters
Suppose that the controller is geographically separated from the plant and connected to it via a noiseless digital channel whose capacity is  SYMBOL  bits per use
Then, given the time horizon  SYMBOL , the objective is to design an encoder and a decoder for the controller to obtain reliable estimates of both the plant parameters and the plant state sequence from the  SYMBOL  possible outputs of the decoder
To state the problem in general terms, consider an information source emitting a sequence  SYMBOL  of random variables taking values in an alphabet  SYMBOL
Suppose that the process distribution of  SYMBOL  is not specified completely, but it is known to be a member of some parametric class  SYMBOL
We wish to answer the following two questions:
1) Is the class  SYMBOL  universally encodable with respect to a given single-letter distortion measure  SYMBOL , by codes with a given structure (e.g., all fixed-rate block codes with a given per-letter rate, all variable-rate block codes, etc.)?
In other words, does there exist a scheme that is asymptotically optimal for each  SYMBOL ,  SYMBOL ?
2) If the answer to Question 1) is positive, can the codes be constructed in such a way that the decoder can not only reconstruct the source, but also identify its process distribution  SYMBOL , in an asymptotically optimal fashion?
In previous work  CITATION , we have addressed these two questions in the context of fixed-rate lossy block coding of stationary memoryless (i.i.d.) continuous-alphabet sources with parameter space  SYMBOL  a bounded subset of  SYMBOL  for some finite  SYMBOL
We have shown that, under appropriate regularity conditions on the distortion measure and on the source models, there exist joint universal schemes for lossy coding and source identification whose redundancies (that is, the gap between the actual performance and the theoretical optimum given by the Shannon distortion-rate function) and source estimation fidelity both converge to zero as  SYMBOL , as the block length  SYMBOL  tends to infinity
The code operates by coding each block with the code matched to the source with the parameters estimated from the preceding block
Comparing this convergence rate to the  SYMBOL  convergence rate, which is optimal for redundancies of fixed-rate lossy block codes  CITATION , we see that there is, in general, a price to be paid for doing compression and identification simultaneously
Furthermore, the constant hidden in the  SYMBOL  notation increases with the ``richness'' of the model class  SYMBOL , as measured by the Vapnik--Chervonenkis (VC) dimension  CITATION  of a certain class of measurable subsets of the source alphabet associated with the sources
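The block-by-block operation of the earlier scheme (each block is coded with the code matched to parameters estimated from the preceding block) can be pictured with a toy sketch. The Python code below is a heavily simplified stand-in and not the construction of the cited work: it assumes an i.i.d. Gaussian source with unknown mean, uses the sample mean as the estimator, and replaces the matched codebook with a uniform scalar quantizer centered at the quantized estimate. All names, the grid step, and the cell width are hypothetical.

```python
import random

def estimate_mean(block):
    """Parameter estimate from a block (the sample mean stands in for the estimator)."""
    return sum(block) / len(block)

def quantize_parameter(theta, grid_step):
    """First stage: map the estimate to an integer index on a parameter grid."""
    return round(theta / grid_step)

def encode_block(block, theta_index, grid_step, cell_width):
    """Second stage: uniform scalar quantizer centered at the quantized estimate."""
    center = theta_index * grid_step
    return [round((x - center) / cell_width) for x in block]

def decode_block(theta_index, indices, grid_step, cell_width):
    """Decoder output: the parameter estimate and the block reproduction."""
    center = theta_index * grid_step
    return center, [center + i * cell_width for i in indices]

def run(blocks, grid_step=0.1, cell_width=0.5):
    """Code each block using the parameter estimated from the preceding block."""
    outputs = []
    for prev, cur in zip(blocks, blocks[1:]):
        theta_index = quantize_parameter(estimate_mean(prev), grid_step)
        indices = encode_block(cur, theta_index, grid_step, cell_width)
        outputs.append(decode_block(theta_index, indices, grid_step, cell_width))
    return outputs

# Hypothetical usage: three blocks from an i.i.d. Gaussian source with mean 3.
rng = random.Random(0)
blocks = [[rng.gauss(3.0, 1.0) for _ in range(200)] for _ in range(3)]
outputs = run(blocks)
```

The decoder in this sketch obtains both a parameter estimate and a reproduction of the block from the same transmitted indices. This is the sense in which compression and identification are carried out jointly.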
The main limitation of the results of  CITATION  is the i.i.d. assumption, which is rather restrictive as it excludes many practically relevant model classes (e.g., autoregressive sources, or Markov and hidden Markov processes)
Furthermore, the assumption that the parameter space  SYMBOL  is bounded may not always hold, at least in the sense that we may not know the diameter of  SYMBOL   a priori
In this paper we relax both of these assumptions and study the existence and the performance of universal schemes for joint lossy coding and identification of stationary sources satisfying a mixing condition, when the sources are assumed to belong to a parametric model class  SYMBOL ,  SYMBOL  being an open subset of  SYMBOL  for some finite  SYMBOL
Because the parameter space is not bounded, we have to use variable-rate codes with countably infinite codebooks, and the performance of the code is assessed by a composite Lagrangian functional  CITATION  which captures the trade-off between the expected distortion and the expected rate of the code
Our result is that, under certain regularity conditions on the distortion measure and on the model class, there exist universal schemes for joint lossy source coding and identification such that, as the block length  SYMBOL  tends to infinity, the gap between the actual Lagrangian performance and the optimal Lagrangian performance achievable by variable-rate codes at that block length, as well as the source estimation fidelity at the decoder, converge to zero as  SYMBOL , where  SYMBOL  is the VC dimension of a certain class of decision regions induced by the collection  SYMBOL  of the  SYMBOL -dimensional marginals of the source process distributions
This result shows very clearly that the price to be paid for universality, in terms of both compression and identification, grows with the richness of the underlying model class, as captured by the VC dimension sequence  SYMBOL
The richer the model class, the harder it is to learn, which affects the compression performance of our scheme because we use the source parameters learned from past data to decide how to encode the current block
Furthermore, comparing the rate at which the Lagrangian redundancy decays to zero under our scheme with the  SYMBOL  result of Chou, Effros and Gray  CITATION , whose universal scheme is not aimed at identification, we immediately see that, by insisting on both compression and modeling, we inevitably sacrifice some compression performance
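The Lagrangian performance measure mentioned above weighs expected distortion against expected rate. The precise functional is not given in this section, so the Python sketch below simply assumes the common form "distortion plus a multiplier times rate" and evaluates it empirically for a toy variable-rate scalar code with a countably infinite codebook. The bounded metric min(|x - y|, 1), the integer-rounding quantizer, the Elias gamma length function, and all names are assumptions for illustration, not the paper's construction.

```python
def zigzag(i):
    """Map integers 0, -1, 1, -2, 2, ... to positive integers 1, 2, 3, 4, 5, ..."""
    return 2 * i + 1 if i >= 0 else -2 * i

def elias_gamma_length(m):
    """Length in bits of the Elias gamma codeword for a positive integer m."""
    return 2 * (m.bit_length() - 1) + 1

def bounded_distortion(x, y):
    """A bounded metric on the real line, used here as the single-letter distortion."""
    return min(abs(x - y), 1.0)

def encode(block):
    """Toy quantizer: round each letter to the nearest integer."""
    return [round(x) for x in block]

def code_length(word):
    """Total number of bits of a prefix-free description of the integer indices."""
    return sum(elias_gamma_length(zigzag(i)) for i in word)

def empirical_lagrangian(blocks, lam):
    """Per-letter average of (distortion + lam * codeword length) over the blocks."""
    total, letters = 0.0, 0
    for block in blocks:
        word = encode(block)
        recon = [float(i) for i in word]
        distortion = sum(bounded_distortion(x, y) for x, y in zip(block, recon))
        total += distortion + lam * code_length(word)
        letters += len(block)
    return total / letters

# Hypothetical data: two blocks of length two, multiplier lam = 0.1.
value = empirical_lagrangian([[0.2, 1.4], [-0.6, 2.0]], lam=0.1)
```

Because the integers form a countably infinite index set, the description length has to vary with the index. This is one way to see why an unbounded parameter space leads to variable-rate codes and a rate term in the performance measure.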
The paper is organized as follows
Section~ introduces notation and basic concepts related to sources, codes and Vapnik--Chervonenkis classes
Section~ lists and discusses the regularity conditions that have to be satisfied by the source model class, and contains the statement of our result
The result is proved in Section~
Next, in Section~ we give three examples of parametric source families (namely, i.i.d. Gaussian sources, Gaussian autoregressive sources and hidden Markov processes) which fit the framework of this paper under suitable regularity conditions
We conclude in Section~ and outline directions for future research
Finally, the Appendix contains some technical results on Lagrange-optimal variable-rate quantizers
