A surrogate is a substitute for the real thing. In statistics, draws from predictive equations derived from a fitted model can act as a surrogate for the data-generating mechanism. If the fit is good (model flexible yet well-regularized, data rich enough, and fitting scheme reliable), then such a surrogate can be quite valuable. Gathering data is expensive, and sometimes getting exactly the data you want is impossible or unethical. A surrogate could represent a much cheaper way to explore relationships, and entertain "what ifs?". How do surrogates differ from ordinary statistical modeling? One superficial difference may be that surrogates favor faithful yet pragmatic reproduction of dynamics over other things statistical models are used for: interpretation, establishing causality, or identification. As you might imagine, that characterization oversimplifies.
The terminology came out of the physics, applied math and engineering literatures, where the use of mathematical models leveraging numerical solvers has been commonplace for some time. As such models became more complex, requiring more resources to simulate or solve numerically, practitioners increasingly relied on meta-models built from limited simulation campaigns. Often they recruited help from statisticians, or at least used setups resembling ones from statistics. Data collected via expensive computer evaluations tuned flexible functional forms that could be used in lieu of further simulation. Sometimes the goal was to save money or computational resources; sometimes to cope with an inability to perform future runs (expired licenses, off-line or over-impacted supercomputers). Trained meta-models became known as surrogates or emulators, with those terms often used interchangeably. (A surrogate is designed to emulate the numerics coded in the solver.) The enterprise of designing, running and fitting such meta-models became known as a computer experiment.
So a computer experiment is like an ordinary statistical experiment, except the data are generated by computer codes rather than physical or field observations, or surveys. Surrogate modeling is statistical modeling of computer experiments. Computer simulations are generally cheaper than physical observation, so the former could be entertained as an alternative or precursor to the latter. Although computer simulation can be just as expensive as field experimentation, computer modeling is regarded as easier because the experimental apparatus is better understood, and more aspects may be controlled. For example, many numerical solvers are deterministic, whereas field observations are noisy or have measurement error. For a long time, noise was the main occupant in the gulf between modeling and design considerations for surrogates, on the one hand, and more general statistical methodology on the other. But hold that thought for a moment.
Increasingly that gulf is narrowing, not so much because the nature of experimentation is changing (it is), but thanks to advances in machine learning. The canonical surrogate model, a fitted Gaussian process (GP) regression, which was borrowed for computer experiments from the geostatistics' kriging literature of the 1960s, enjoys wide applicability in contexts where prediction is king. Machine learners exposed GPs as powerful predictors for all sorts of tasks, from regression to classification, active learning/sequential design, reinforcement learning and optimization, latent variable modeling, and so on. They also developed powerful libraries, lowering the bar to application by non-expert practitioners, especially in the information technology world. Facebook uses surrogates to tailor its web portal and apps to optimize engagement; Uber uses surrogates trained on traffic simulations to route pooled ride-shares in real time, reducing travel and wait time.
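To make the GP-as-surrogate idea concrete, here is a minimal sketch, not drawn from the text, of fitting a GP regression to a handful of runs of a stand-in "simulator" and then predicting, with uncertainty, at unseen inputs instead of re-running the solver. The toy simulator, kernel choice, and settings are illustrative assumptions; scikit-learn is used only as one convenient library.

```python
# Hedged sketch: a GP regression surrogate for an expensive computer model.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def simulator(x):
    # stand-in for an expensive deterministic solver (assumption for illustration)
    return np.sin(2 * np.pi * x).ravel()

# a small "simulation campaign": a handful of carefully chosen runs
X_train = np.linspace(0, 1, 8).reshape(-1, 1)
y_train = simulator(X_train)

# fit the surrogate; the RBF kernel encodes smoothness of the response,
# and a tiny alpha acknowledges the (nearly) deterministic simulator
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-8)
gp.fit(X_train, y_train)

# emulate at new inputs: predictive mean and standard deviation
X_new = np.linspace(0, 1, 100).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)
```

The predictive standard deviation is what distinguishes a GP surrogate from a plain interpolator: it quantifies where further simulator runs would be most informative.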
Round about the same time, computer simulation as a means of scientific inquiry began to blossom. Mathematical biologists, economists and others had reached the limit of equilibrium-based mathematical modeling with cute closed-form solutions. They embraced simulation as a means of filling in the gap, just as physicists and engineers had decades earlier. Yet their simulations were subtly different. Instead of deterministic solvers based on finite elements, Navier–Stokes or Euler methods, they were building stochastic simulations and agent-based models to explore predator-prey (Lotka–Volterra) dynamics, the spread of disease, and the management of inventory or patients in health insurance markets. Suddenly, and thanks to an explosion in computing capacity, software tools, and better primary school training in STEM subjects (all decades in the making), simulation was enjoying a renaissance. We're just beginning to figure out how best to model these experiments, but one thing is for sure: the distinction between surrogate and statistical model is all but gone.
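As a toy illustration of the kind of stochastic simulation meant here, the sketch below (not from the text; parameter values and the event structure are made-up assumptions) generates one random predator-prey trajectory via Gillespie's algorithm. Unlike a deterministic ODE solve, each seed yields a different path, which is exactly what makes modeling such experiments a different game.

```python
# Hedged sketch: a stochastic Lotka-Volterra simulation (Gillespie-style).
import numpy as np

def lotka_volterra_ssa(prey=50, pred=25, a=1.0, b=0.02, c=1.0,
                       t_max=20.0, max_events=100_000, seed=0):
    """One stochastic predator-prey trajectory; rates a, b, c are illustrative."""
    rng = np.random.default_rng(seed)
    t, traj = 0.0, [(0.0, prey, pred)]
    for _ in range(max_events):
        # event rates: prey birth, predation (prey -> predator), predator death
        rates = np.array([a * prey, b * prey * pred, c * pred])
        total = rates.sum()
        if total == 0.0 or t >= t_max:
            break
        t += rng.exponential(1.0 / total)        # waiting time to next event
        event = rng.choice(3, p=rates / total)   # which event fires
        if event == 0:
            prey += 1                            # prey birth
        elif event == 1:
            prey -= 1                            # prey eaten ...
            pred += 1                            # ... predator reproduces
        else:
            pred -= 1                            # predator death
        traj.append((t, prey, pred))
    return traj

run1 = lotka_volterra_ssa(seed=1)   # each seed gives a different noisy realization
run2 = lotka_volterra_ssa(seed=2)
```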
If there's (real) field data, say on a historical epidemic, further experimentation may be almost entirely limited to the mathematical and computer modeling side. You can't seed a real community with Ebola and watch what happens. Epidemic simulations, and surrogates built from a limited number of expensive runs where virtual agents interact and transmit infection, can be calibrated to a limited ...