## Brownian Bandits

### Authors: Avi Mandelbaum and Robert J. Vanderbei

ABSTRACT

We consider a multi-armed bandit whose arms are driven by Brownian motions: the state of arm $i$ is modeled as a one-dimensional Brownian motion $B^i$, $i = 1, \ldots, d$. At each instant in time, a gambler must decide to pull some subset of these $d$ arms, holding the others fixed at their current state. If arm $i$ is pulled when $B^i$ is in state $x_i$, the gambler accumulates reward at rate $r_i(x_i)$. The goal is to find a strategy that maximizes the accumulated discounted reward over an infinite horizon (with discount rate $\lambda$). Let \Gamma_i(x_i) = \sup_{\tau > 0} \frac{ \Exp_{x_i} \int_0^{\tau} e^{-\lambda t} r_i(B^i_t) dt }{ \Exp_{x_i} \int_0^{\tau} e^{-\lambda t} dt } , where the supremum is over all stopping times $\tau$ of $B^i$. Put {\cal M}_i = \{ x \in {\reals}^d: \Gamma_i(x_i) = \max_j \Gamma_j(x_j) \}. >From prior work, one expects that an optimal control is to pull arm $i$ when the state of the bandit belongs to ${\cal M}_i$. Equivalently, the optimal strategy follows the leader among the processes $\Gamma_i(B^i)$, $i = 1, \ldots, d$. Such results have been established for bounded monotone reward functions. In this paper we extend the scope to cover certain unimodal functions. At the same time, we develop a framework within which general rewards and diffusions can be studied.

Key words: Gittins indices, Brownian motion, optimal stopping, optimal switching

Paper available at: ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/1993/93-74.ps