DIMACS TR: 93-74
Brownian Bandits
Authors: Avi Mandelbaum and Robert J. Vanderbei
ABSTRACT
We consider a multi-armed bandit whose arms are driven by Brownian motions:
the state of arm $i$ is modeled as a one-dimensional
Brownian motion $B^i$, $i = 1, \ldots, d$.
At each instant in time, a gambler must decide which subset of these $d$
arms to pull, holding the others fixed at their current states.
If arm $i$ is pulled when
$B^i$ is in state $x_i$, the gambler accumulates reward at rate $r_i(x_i)$.
The goal is to find a strategy that maximizes the accumulated discounted reward
over an infinite horizon (with discount rate $\lambda$).
Let
\Gamma_i(x_i) = \sup_{\tau > 0}
  \frac{\mathbb{E}_{x_i} \int_0^{\tau} e^{-\lambda t}\, r_i(B^i_t)\, dt}
       {\mathbb{E}_{x_i} \int_0^{\tau} e^{-\lambda t}\, dt},
where the supremum is over all stopping times $\tau$ of $B^i$.
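As a concrete illustration (ours, not the paper's), $\Gamma_i$ can be
approximated numerically via the standard retirement characterization:
$\Gamma_i(x)$ is the smallest $m$ for which the optimal stopping problem with
running reward $r_i - m$ has value zero at $x$. The following Python sketch
discretizes the Brownian motion as a symmetric random walk and bisects over
$m$; all grid, boundary, and tolerance parameters are illustrative choices.

    import numpy as np

    def gittins_index(r, x0, lam=0.1, h=0.05, xmax=5.0,
                      iters=20000, bisect=40):
        """Approximate the index Gamma(x0) of one Brownian arm with reward r.

        Uses the retirement characterization: Gamma(x0) is the smallest m
        such that sup_tau E[ int_0^tau e^{-lam t} (r(B_t) - m) dt ] = 0.
        The Brownian motion is discretized as a +/-h random walk with time
        step h^2; an illustrative sketch, not the paper's construction.
        """
        xs = np.arange(-xmax, xmax + h, h)
        dt = h * h
        beta = np.exp(-lam * dt)                  # per-step discount factor
        i0 = int(np.argmin(np.abs(xs - x0)))
        rx = r(xs)

        def stopped_value(m):
            # Value iteration for V = max(0, (r - m) dt + beta E[V(next)]).
            V = np.zeros_like(xs)
            for _ in range(iters):
                cont = (rx - m) * dt + beta * 0.5 * (np.roll(V, 1)
                                                     + np.roll(V, -1))
                cont[0], cont[-1] = cont[1], cont[-2]  # crude reflecting edge
                Vnew = np.maximum(0.0, cont)           # option to stop at 0
                if np.max(np.abs(Vnew - V)) < 1e-10:
                    break
                V = Vnew
            return V[i0]

        lo, hi = float(rx.min()), float(rx.max())  # Gamma lies in range of r
        for _ in range(bisect):
            mid = 0.5 * (lo + hi)
            if stopped_value(mid) > 1e-9:
                lo = mid       # continuing is still profitable: Gamma > mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

For instance, gittins_index(lambda x: np.exp(-x**2), 0.0) approximates the
index at the origin for the unimodal reward $r(x) = e^{-x^2}$, a reward of the
kind covered by the extension described below.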
Put
{\cal M}_i = \{\, x \in \mathbb{R}^d : \Gamma_i(x_i) = \max_j \Gamma_j(x_j) \,\}.
From prior work, one expects that
an optimal control is to pull arm $i$ when the state of the
bandit belongs to ${\cal M}_i$.
Equivalently, the optimal strategy follows the leader
among the processes $\Gamma_i(B^i)$, $i = 1, \ldots, d$.
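To make the follow-the-leader rule concrete, here is a minimal simulation
sketch (again ours, with an illustrative random-walk discretization). The
arguments rewards and indices are hypothetical lists of callables giving
$r_i$ and $\Gamma_i$; the approximation above could supply the latter, say
precomputed on a grid and interpolated.

    import numpy as np

    def follow_the_leader(rewards, indices, x0, lam=0.1, h=0.05,
                          steps=20000, seed=0):
        """Pull, at each step, the arm whose current index is largest."""
        rng = np.random.default_rng(seed)
        x = np.array(x0, dtype=float)
        dt = h * h
        beta = np.exp(-lam * dt)
        total, disc = 0.0, 1.0
        for _ in range(steps):
            vals = [g(xi) for g, xi in zip(indices, x)]
            i = int(np.argmax(vals))               # the current leader
            total += disc * rewards[i](x[i]) * dt  # reward from arm i only
            x[i] += rng.choice((-h, h))            # only the pulled arm moves
            disc *= beta
        return total                               # discounted reward to date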
Such results have been established for bounded monotone reward functions. In
this paper we extend their scope to certain unimodal reward functions. At the
same time,
we develop a framework within which general
rewards and diffusions can be studied.
Key words:
Gittins indices,
Brownian motion,
optimal stopping,
optimal switching
Paper available at:
ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/1993/93-74.ps