Chapter 1
Introduction or Exploration
This book is about mathematical models for optimizing in a sequential manner the allocation of effort between a number of competing projects. The effort and the projects may take a variety of forms. Examples are: an industrial processor and jobs waiting to be processed; a server with a queue of customers; an industrial laboratory with research projects; any busy person with jobs to do; a stream of patients and alternative treatments (yes, the patients do correspond to effort—see Problem 5 later in this chapter); a searcher who may look in different places. In every case effort is treated as being homogeneous, and the problem is to allocate it between the different projects so as to maximize the expected total reward which they yield. It is a sequential problem, as effort is allowed to be reallocated in a feedback manner, taking account of the pattern of rewards so far achieved. The reallocations are assumed to be costless, and to take a negligible time, since the alternative is to impose a travelling-salesman-like feature, thereby seriously complicating an already difficult problem.
The techniques which come under the heading of dynamic programming have been devised for sequential optimization problems. The key idea is a recurrence equation relating the expected total reward (call this the payoff) at a given decision time to the distribution of its possible values at the next decision time (see equation (2.2)). Sometimes this equation may be solved analytically. Otherwise a recursive numerical solution may, at any rate in principle, be carried out. This involves making an initial approximation to the payoff function, and then successive further approximations by substituting in the right-hand side of the recurrence equation. As Bellman (1957), for many years the chief protagonist of this methodology, pointed out, using the recurrence equation involves less computing than a complete enumeration of all policies and their corresponding payoffs, but none the less soon runs into the sands of intractable storage and processing requirements as the number of variables on which the payoff function depends increases.
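The successive-approximation scheme just described can be sketched in a few lines for a generic finite-state problem. Everything in the example below (the two states, rewards, transition probabilities, and discount factor) is illustrative rather than drawn from any particular problem in this book:

```python
# Successive approximation to the dynamic programming recurrence
#   v(s) = max_a [ r(s, a) + beta * sum_t P(t | s, a) * v(t) ],
# starting from an initial approximation and substituting repeatedly
# into the right-hand side until the values stop changing.

def value_iteration(rewards, transitions, beta=0.9, tol=1e-10):
    """rewards[s][a]: immediate reward; transitions[s][a]: {next_state: prob}."""
    v = [0.0] * len(rewards)              # initial approximation to the payoff
    while True:
        v_new = [
            max(
                rewards[s][a]
                + beta * sum(pr * v[t] for t, pr in transitions[s][a].items())
                for a in range(len(rewards[s]))
            )
            for s in range(len(rewards))
        ]
        if max(abs(a - b) for a, b in zip(v_new, v)) < tol:
            return v_new
        v = v_new

# Two states, two actions: action 0 moves to (or stays in) state 0,
# action 1 moves to (or stays in) state 1.
rewards = [[1.0, 0.0], [0.0, 2.0]]
transitions = [
    [{0: 1.0}, {1: 1.0}],
    [{0: 1.0}, {1: 1.0}],
]
v = value_iteration(rewards, transitions)   # converges to [18.0, 20.0]
```

The storage difficulty Bellman identified is visible even here: the table `v` has one entry per state, and in the allocation problems of this book the state is the joint state of all projects, so the table grows multiplicatively with the number of projects.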
For the problem of allocating effort to projects, the number of variables is at least equal to the number of projects. An attractive idea, therefore, is to establish priority indices for each project, depending on its past history but not that of any other project, and to allocate effort at each decision time only to the project with the highest current index value. To calculate these indices it should be possible to calibrate a project in a given state against some set of standard projects with simple properties. If this could be done we should have a reasonable policy without having to deal with any function of the states of more than one project.
That optimal policies for some problems of effort allocation are expressible in terms of such indices is well known. The first three problems described in this chapter are cases in point. Chapters 2, 3 and 4 set out a general theory which puts these results into context. They include several theorems asserting the optimality of index policies under different conditions. In fact five of the seven problems described in this chapter may be solved fairly rapidly by using the index theorem (Theorem 2.1) and its Corollary 3.5. Problem 4 requires Theorem 3.1, which is a continuous-time version of the index theorem.
Since they may change as more effort is allocated, these priority indices may aptly be termed dynamic allocation indices. The policies that they define include, in relevant circumstances, the cμ-rule, Smith's rule, and the shortest processing time rule (SPT). The main aim of this chapter is to exhibit their range of application by describing a number of particular instances. A second aim is to give an informal introduction to the main methods available for determining indices. These are by (i) interchange arguments; (ii) exploiting any special features of the bandit processes concerned, in particular those which lead to the optimality of myopic policies; (iii) calibration by reference to standard bandit processes, often involving iteration using the dynamic programming recurrence equation; and (iv) using the fact that a dynamic allocation index may be regarded as a maximized equivalent constant reward rate.
Problem 1 Single-Machine Scheduling (see also Section 2.12, Section 3.2 and Exercise 4.1)
There are n jobs ready to be processed on a single machine. A cost c_i is incurred per unit of time until job i has been completed, and the service time (or processing time) for job i is s_i. In what order should the jobs be processed so as to minimize the total cost?
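Anticipating the solution developed in Section 2.12: the total cost is minimized by processing the jobs in decreasing order of the index c_i/s_i, the weighted shortest-processing-time rule. A minimal sketch, with illustrative job data, checking the index order against complete enumeration:

```python
from itertools import permutations

# Total holding cost of processing jobs in a given order, and the index
# ordering by decreasing c/s (weighted shortest processing time).

def total_cost(jobs):
    """jobs: list of (c, s); cost c accrues per unit time until completion."""
    t = cost = 0.0
    for c, s in jobs:
        t += s              # job i completes at time t
        cost += c * t
    return cost

jobs = [(3.0, 2.0), (1.0, 1.0), (4.0, 1.0)]    # (cost rate c, service time s)
index_order = sorted(jobs, key=lambda j: j[0] / j[1], reverse=True)

# The index order achieves the best cost among all 3! processing orders.
best = min(total_cost(list(perm)) for perm in permutations(jobs))
assert total_cost(index_order) == best
```

Note that the index of a job here depends only on that job's own data, which is exactly the property motivating the dynamic allocation indices of this book.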
Problem 2 Gold-Mining (see also Section 3.5.2)
A woman owns n gold mines and a gold-mining machine. Each day she must assign the machine to one of the mines. When the machine is assigned to mine i there is a probability p_i that it extracts a proportion q_i of the gold left in the mine, and a probability 1 − p_i that it extracts no gold and breaks down permanently. To what sequence of mines on successive days should the machine be assigned so as to maximize the expected amount of gold mined before it breaks down?
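The classical answer, recovered in Section 3.5.2, is again an index policy: each day assign the machine to the mine maximizing p_i q_i x_i / (1 − p_i), where x_i is the gold still in mine i. The following sketch computes expected yields exactly by recursion over a long but finite horizon; the mine data, initial gold, and horizon are all illustrative:

```python
# Index policy for the gold-mining problem: each day choose the mine
# maximizing p*q*x / (1 - p), where x is the gold the mine still holds.

p = (0.9, 0.5)      # daily survival probabilities
q = (0.2, 0.9)      # extraction proportions

def index_policy(x):
    return max(range(len(x)), key=lambda i: p[i] * q[i] * x[i] / (1 - p[i]))

def expected_gold(x, policy, days=200):
    """Expected gold mined over `days` days when mine i holds x[i]."""
    if days == 0:
        return 0.0
    i = policy(x)
    x_next = tuple(g * (1 - q[i]) if j == i else g for j, g in enumerate(x))
    return p[i] * (q[i] * x[i] + expected_gold(x_next, policy, days - 1))

v_index = expected_gold((10.0, 10.0), index_policy)
v_mine0 = expected_gold((10.0, 10.0), lambda x: 0)   # always pick mine 0
assert v_index >= v_mine0
```

Truncating at 200 days is harmless here: the machine survives that long only with probability at most 0.9^200, so the neglected gold is negligible.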
Problem 3 Search (see also Section 3.5.4 and Exercise 3.6)
A stationary object is hidden in one of n boxes. A search of box i finds the object, if it is there, with probability q_i. The probability that the object is in box i is p_i, and changes by Bayes' theorem as successive boxes are searched. The cost of a single search of box i is c_i. How should the boxes be searched so as to minimize the expected cost of finding the object?
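Here too the optimal policy, derived in Section 3.5.4, is an index policy: always search the box maximizing p_i q_i / c_i, revising the location probabilities by Bayes' theorem after each unsuccessful search. A minimal sketch of one step, with illustrative probabilities and costs:

```python
# Search problem: index rule p*q/c, with the Bayes update of the
# location probabilities after an unsuccessful search of box i.

def next_box(p, q, c):
    return max(range(len(p)), key=lambda i: p[i] * q[i] / c[i])

def bayes_update(p, q, i):
    """Posterior location probabilities after box i is searched without success."""
    miss = 1.0 - p[i] * q[i]          # probability the search fails
    return [
        (pj * (1.0 - q[i]) if j == i else pj) / miss
        for j, pj in enumerate(p)
    ]

p = [0.5, 0.3, 0.2]    # prior location probabilities
q = [0.4, 0.9, 0.8]    # detection probabilities
c = [1.0, 2.0, 1.0]    # search costs

i = next_box(p, q, c)  # indices: 0.2, 0.135, 0.16, so box 0 is searched
p = bayes_update(p, q, i)
```

Notice that the index of box i again depends only on quantities attached to box i, even though a failed search of one box raises the posterior probabilities for all the others.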