### abstract ###
We consider an agent interacting with an unmodeled environment
At each time, the agent makes an observation, takes an action, and incurs a cost
Its actions can influence future observations and costs
The goal is to minimize the long-term average cost
We propose a novel algorithm, known as the active LZ algorithm, for optimal control based on ideas from the Lempel-Ziv scheme for universal data compression and prediction
We establish that, under the active LZ algorithm, if there exists an integer  SYMBOL  such that the future is conditionally independent of the past given a window of  SYMBOL  consecutive actions and observations, then the average cost converges to the optimum
Experimental results involving the game of Rock-Paper-Scissors illustrate the merits of the algorithm
### introduction ###
Consider an agent that, at each integer time  SYMBOL , makes an observation  SYMBOL  from a finite observation space  SYMBOL , and takes an action  SYMBOL  selected from a finite action space  SYMBOL
The agent incurs a bounded cost  SYMBOL
The goal is to minimize the long-term average cost  SYMBOL  Here, the expectation is over the randomness in the  SYMBOL  process, and, at each time  SYMBOL , the action  SYMBOL  is selected as a function of the prior observations  SYMBOL  and the prior actions  SYMBOL
We will propose a general action-selection strategy called the  active LZ algorithm
In addition to the new strategy, a primary contribution of this paper is a theoretical guarantee that this strategy attains optimal average cost under weak assumptions about the environment
The main assumption is that there exists an integer  SYMBOL  such that the future is conditionally independent of the past given a window of  SYMBOL  consecutive actions and observations
In other words,  SYMBOL  where  SYMBOL  is a transition kernel and  SYMBOL  is the  SYMBOL -algebra generated by  SYMBOL
We are particularly interested in situations where neither  SYMBOL  nor even  SYMBOL  is known to the agent
That is, the dependence on history is finite, but its extent is unknown
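To make the setting concrete, the following is a minimal simulation sketch, not the formulation used in this paper. It assumes that the next observation is drawn from a kernel that sees only the last `K` (observation, action) pairs, and that the cost depends only on the current observation and action. The names `simulate_average_cost`, `echo_kernel`, `observation_cost`, and the two constant policies are hypothetical and introduced only for illustration.

```python
import random
from collections import deque


def simulate_average_cost(kernel, cost, policy, K, initial_pair, T, seed=0):
    """Return the empirical average cost over T steps.

    Assumptions (illustrative only):
      kernel(window) -> dict {observation: probability}, where window is a
        tuple of the last K (observation, action) pairs;
      policy(window, x) -> action chosen after observing x;
      cost(x, a) -> bounded immediate cost.
    """
    rng = random.Random(seed)
    # The window holds exactly the last K (observation, action) pairs.
    window = deque([initial_pair] * K, maxlen=K)
    total = 0.0
    for _ in range(T):
        dist = kernel(tuple(window))
        outcomes = list(dist)
        x = rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]
        a = policy(tuple(window), x)
        total += cost(x, a)
        window.append((x, a))
    return total / T


# Toy environment with K = 1: the next observation echoes the last action.
def echo_kernel(window):
    _, last_action = window[-1]
    return {last_action: 1.0}


# Toy cost: observing 1 costs one unit, observing 0 is free.
def observation_cost(x, a):
    return float(x)


always_zero = lambda window, x: 0
always_one = lambda window, x: 1

print(simulate_average_cost(echo_kernel, observation_cost, always_zero,
                            K=1, initial_pair=(0, 0), T=100))
print(simulate_average_cost(echo_kernel, observation_cost, always_one,
                            K=1, initial_pair=(0, 0), T=100))
```

In this toy case the two constant policies incur very different average costs, because actions shape future observations. An agent that knows neither the kernel nor `K` has to discover this from experience.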
Consider the following examples, which fall into the above formalism
The optimization problem is to find a sequence of functions  SYMBOL , where each function  SYMBOL  specifies an encoder at time  SYMBOL , so as to minimize the long-term average distortion  SYMBOL  Assume that the source is Markov of order  SYMBOL , but that both the transition probabilities for the source and the order  SYMBOL  are unknown
Setting  SYMBOL , define the observation at time  SYMBOL  to be the vector  SYMBOL  and the action at time  SYMBOL  to be  SYMBOL
Then, the optimal coding problem at hand falls within our framework (cf
CITATION  and references therein) \end{example}  With knowledge of the kernel  SYMBOL  (or even just the order of the kernel,  SYMBOL ), solving for the average cost optimal policy in either of the examples above via dynamic programming methods is relatively straightforward
This paper develops an algorithm that,  without knowledge of the kernel or its order , achieves average cost optimality
The active LZ algorithm we develop consists of two broad components
The first is an efficient data structure, a context tree on the joint process  SYMBOL , to store information relevant to predicting the observation at time  SYMBOL ,  SYMBOL , given the history available up to time  SYMBOL  and the action selected at time  SYMBOL ,  SYMBOL
Our prediction methodology borrows heavily from the Lempel-Ziv algorithm for data compression  CITATION
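As a rough illustration of such a data structure, the sketch below grows a tree by Lempel-Ziv style incremental parsing of (action, observation) pairs and keeps counts at each node for predicting the next observation given an action. It is a sketch under stated assumptions rather than the construction analyzed in this paper: the class name `ContextTree`, its methods, and the add-one-half smoothing of the counts are all illustrative choices.

```python
class ContextTree:
    """LZ78-style context tree over (action, observation) pairs (a sketch)."""

    def __init__(self, observations):
        self.observations = list(observations)
        self.root = self._new_node()
        self.node = self.root  # current context within the current phrase

    @staticmethod
    def _new_node():
        return {"counts": {}, "children": {}}

    def predict(self, action):
        """Estimate P(next observation | current context, action).

        Uses add-one-half smoothing of the counts (an assumption made here).
        """
        counts = self.node["counts"]
        m = len(self.observations)
        total = sum(counts.get((action, x), 0) for x in self.observations)
        return {
            x: (counts.get((action, x), 0) + 0.5) / (total + 0.5 * m)
            for x in self.observations
        }

    def update(self, action, observation):
        """Record one (action, observation) pair and move through the tree."""
        key = (action, observation)
        counts = self.node["counts"]
        counts[key] = counts.get(key, 0) + 1
        child = self.node["children"].get(key)
        if child is None:
            # Unseen continuation: add a leaf, end the phrase, restart at root.
            self.node["children"][key] = self._new_node()
            self.node = self.root
        else:
            # Seen continuation: descend, so the context grows by one pair.
            self.node = child


tree = ContextTree(observations=[0, 1])
tree.update(action=0, observation=1)
print(tree.predict(action=0))
```

Because phrases lengthen as more data arrives, the contexts used for prediction become deeper over time, which is what allows the order of the dependence to remain unspecified.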
The second component of our algorithm is a dynamic programming scheme that, given the probabilistic model determined by the context tree, selects actions so as to minimize costs over a suitably long horizon
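The planning step can be pictured with a generic finite-horizon backward induction over an estimated model. The sketch below is a textbook recursion on a small finite state space, not the specific scheme developed in this paper. The function name `finite_horizon_plan`, the layout `model[s][a]`, and the toy problem are assumptions made for illustration.

```python
def finite_horizon_plan(states, actions, model, cost, horizon):
    """Backward induction over `horizon` steps (horizon >= 1).

    Assumptions (illustrative only):
      model[s][a] is a dict {next_state: probability};
      cost(s, a) is the immediate cost of taking action a in state s.
    Returns (values, policy): the minimal expected total cost from each
    state and a minimizing first action for each state.
    """
    values = {s: 0.0 for s in states}  # cost-to-go with zero steps left
    policy = {}
    for _ in range(horizon):
        new_values = {}
        for s in states:
            best_a, best_q = None, float("inf")
            for a in actions:
                q = cost(s, a) + sum(
                    p * values[s2] for s2, p in model[s][a].items()
                )
                if q < best_q:  # ties keep the earlier action
                    best_a, best_q = a, q
            new_values[s] = best_q
            policy[s] = best_a
        values = new_values
    return values, policy


# Toy problem: the next state equals the chosen action; state 1 is costly.
states, actions = [0, 1], [0, 1]
model = {s: {a: {a: 1.0} for a in actions} for s in states}
values, policy = finite_horizon_plan(
    states, actions, model, cost=lambda s, a: float(s), horizon=3
)
print(values, policy)
```

In the toy problem the planner steers toward the cost-free state from every starting state. With an estimated model in place of the true one, the quality of the plan depends on how accurate the estimates are along the trajectories the plan visits.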
Absent knowledge of the order of the kernel,  SYMBOL , the two tasks above---building a context tree in order to estimate the kernel, and selecting actions that minimize long-term costs---must be carried out continually and in tandem, which creates an important tension between `exploration' and `exploitation'
In particular, on the one hand, the algorithm must select actions in a manner that builds an accurate context tree
On the other hand, the desire to minimize costs naturally restricts this selection
By carefully balancing these two competing demands, our algorithm achieves an average cost equal to that of an optimal policy with full knowledge of the kernel  SYMBOL
Related problems have been considered in the literature
Kearns and Singh  CITATION  present an algorithm for reinforcement learning in a Markov decision process
This algorithm can be applied in our context when  SYMBOL  is known, and asymptotic optimality is guaranteed
More recently, Even-Dar et al.  CITATION  present an algorithm for optimal control of partially observable Markov decision processes, a more general setting than what we consider here, and are able to establish theoretical bounds on convergence time
That algorithm, however, appears difficult to implement in practice, in contrast with the one we present here
Further, it relies on knowledge of a constant related to the amount of time a `homing' policy requires to achieve equilibrium
This constant may be challenging to estimate
Work by de Farias and Megiddo  CITATION  considers an optimal control framework where the dynamics of the environment are not known and one wishes to select the best of a finite set of `experts'
In contrast, our problem can be thought of as competing with the set of all possible strategies
The prediction problem for loss functions with memory and a Markov-modulated source considered by Merhav et al.  CITATION  is essentially a Markov decision problem, as the authors point out; again, in this case, knowing the structure of the loss function implicitly gives the order of the underlying Markov process
The active LZ algorithm is inspired by the Lempel-Ziv algorithm
This algorithm has been extended to address many problems, such as prediction  CITATION  and filtering  CITATION
In almost all cases, however, future observations are not influenced by actions taken by the algorithm
This is in contrast to the active LZ algorithm, which proactively anticipates the effect of actions on future observations
An exception is the work of Vitter and Krishnan  CITATION , which considers cache pre-fetching and can be viewed as a special case of our formulation
The Lempel-Ziv algorithm and its extensions revolve around a context tree data structure that is constructed as observations are made
This data structure is simple and elegant from an implementation point of view
The use of this data structure in reinforcement learning represents a departure from representations of state and belief state commonly used in the reinforcement learning literature
Such data structures have proved useful in representing experience in algorithms for engineering applications ranging from compression to prediction to denoising
Understanding whether and how some of this value can be extended to reinforcement learning is the motivation for this paper
The remainder of this paper is organized as follows
In Section~, we formulate our problem precisely
In Section~, we present our algorithm and provide computational results in the context of the rock-paper-scissors example
Our main result, as stated in Theorem~ in Section~, is that the algorithm is asymptotically optimal
Section~ concludes
