Optimal Robot Scheduling for Web Search Engines


E. G. Coffman, Jr.
Bell Labs, Lucent Technologies, USA
Zhen Liu
INRIA, Sophia Antipolis, France
Richard R. Weber
Cambridge University, UK

The Web has quickly become a major information publishing and retrieval mechanism on the Internet. But with the exponential growth in the number of Web sites and pages, it is no longer possible to browse manually any significant portion of the hypertext structure. This situation motivates the introduction of search engines such as Alta Vista, Lycos,... These systems consist of indexing engines for constructing a database of Web pages, and in many cases robots (more colorfully called Web crawlers, spiders, worms,...) for bringing new information to the indexing engine. To maintain the currency and completeness of the database, robots periodically make recursive traversals of the Web's hypertext structure by visiting pages, then the pages referenced by these pages, and so on. There are currently about 50 robots on the Web. Most of them, including the ones of interest here, are designed for resource discovery, but others are used for mirroring, maintenance (link checking), statistics, etc.

In the ongoing work reported here, we study robot scheduling problems that stem from the observation that the frequency at which a page is visited by a robot should depend on how often the page is expected to change. For example, assume there are N Web pages, labeled 1,2,...,N, which are to be visited repeatedly by a robot. Assume also that the contents of page i are modified at times that follow a Poisson process with rate lambda_i. Page i is considered up-to-date (with respect to the indexing engine) from the time it is visited by the robot until the next time it is modified, at which point it becomes out-of-date until it is revisited by the robot. The problem is to find the relative frequencies of the robot's visits to the pages such that the proportion of time that the pages are out-of-date is as small as possible. Under some simplifying assumptions on the intervisit times of the robot to the pages, we obtain an optimal solution to this problem. Variants of the problem will be discussed along with other search-engine performance issues.
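To make the trade-off concrete, the following sketch illustrates a two-page instance numerically under one of the simplifying assumptions mentioned above: the robot revisits page i at fixed intervals of length tau_i. With Poisson modifications at rate lambda_i, the expected fraction of time page i is out-of-date is then 1 - (1 - e^(-lambda_i*tau_i)) / (lambda_i*tau_i), since the expected up-to-date time per cycle is E[min(tau, X)] = (1 - e^(-lambda*tau))/lambda for X exponential with rate lambda. The function names, page parameters (lambda_1 = 5, lambda_2 = 1), and total visit rate (4 visits per unit time) are illustrative choices, not values from the paper; the grid search is a crude stand-in for the paper's analytical solution.

```python
import math

def stale_fraction(lam, tau):
    """Expected fraction of time a page is out-of-date, given that its
    contents change at Poisson rate lam and the robot revisits it at
    fixed intervals of length tau > 0."""
    x = lam * tau
    # Up-to-date time per cycle is E[min(tau, Exp(lam))] = (1 - e^{-x}) / lam,
    # so the up-to-date fraction is (1 - e^{-x}) / x.
    return 1.0 - (1.0 - math.exp(-x)) / x

def best_split(lam1, lam2, rate, steps=999):
    """Grid-search the share f of the robot's total visit rate devoted
    to page 1 (page 2 gets the remaining 1 - f), minimizing the average
    out-of-date fraction over the two pages."""
    best_f, best_val = None, float("inf")
    for k in range(1, steps):
        f = k / steps
        val = 0.5 * (stale_fraction(lam1, 1.0 / (f * rate))
                     + stale_fraction(lam2, 1.0 / ((1.0 - f) * rate)))
        if val < best_val:
            best_f, best_val = f, val
    return best_f, best_val

if __name__ == "__main__":
    f, val = best_split(5.0, 1.0, 4.0)
    print(f"optimal share for the fast page: {f:.3f}, "
          f"average out-of-date fraction: {val:.3f}")
```

In this example the optimal share for the fast-changing page comes out near 0.6: the page that changes five times as often is visited more frequently, but far less than five times as frequently, hinting at why the proportional-frequency rule one might guess at is not optimal in general.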