Attention-conservation notice: I talk a bit about replication in general, and go into detail about my attempt to replicate Scholz, Calbert and Smith's replication of the Bueno De Mesquita expected-utility model. Turns out replication is hard, but code makes it easier. Here's the code.
Bruce Bueno De Mesquita (generally known in lit reviews as BDM) is a political scientist at NYU who has also developed a public persona (and private consulting business) as a political forecaster. His reputation as "The New Nostradamus" (apparently meant as a compliment) is anchored on work he published in the 1980s using a game-theoretic model to correctly predict the ascension of Ayatollah Khamenei as Supreme Leader of Iran, among other things. This blog post is not about him; it's about replication.
Replication is a hot topic in science right now. An ongoing argument over replication in psychology was covered in a long Slate article, while replication seems to be (as of the time of this blog post) the main reason a new 'impossible space drive' is getting any attention at all.
In many disciplines, replication has a straightforward meaning. Scientists repeat a previous experiment and see whether they achieve the same results. Under the umbrella of quantitative and computational social sciences, the meaning of replication isn't as clear. I'm going to steer clear for now of issues to do with data-focused research (for example, the fact that the Twitter API 'rules of the road' specifically forbid sharing data) and instead focus on the modeling and simulation side.
What does replicating a model involve? If we simply want to reproduce the original results while adhering as closely as possible to the original protocol, it seems that replication means running the original code. Being able to replicate so closely is a luxury that social psychologists don't have, and it can absolutely be important! See the Reinhart–Rogoff affair for a high-profile example of the value of repeating the exact same analysis on the exact same dataset. Of course, if there's a mistake in the original code, this kind of replication won't catch it. Furthermore, obsolescence is a fact of life with computers. The paper introducing the seminal Garbage Can Model, a proto-agent-based model in organization theory from 1972, provides full FORTRAN code, which is essentially impossible to run as-is on contemporary computers.
Replicating a model can involve reimplementing it from scratch, potentially in a different language or environment, based on the descriptions provided in the original text. This comes much closer to the kind of replication we see in other sciences, where the replicators are effectively setting up a brand new experiment which attempts to come close to the original experiment. Uri Wilensky and Bill Rand have a good paper on replicating agent-based models which touches on many of the challenges involved.
Replicating a Replication
Back to BDM. While he published numerous papers using his expected utility model, the model itself was never fully specified, and indeed continued to evolve over time. In 2011, three other researchers, Jason Scholz, Gregory Calbert, and Glen Smith, went through BDM's publications and attempted to reverse-engineer a stable form of the model. This is a particularly heroic example of replication, and potentially extremely useful. In the absence of a full explanation of the model, the replicators have to examine the assumptions underlying the model and figure out the mathematical forms they imply. It also allows them to identify and call out some of the model's questionable assumptions or weaknesses. Importantly, the authors end the paper by fully spelling out the model algorithm.
I set out to replicate their replication, turning their math into Python code. Since some of the debates over replication in other disciplines bring in the replicators' motivations, here are mine. First of all, I am very interested in the BDM model as a rare example of a generalized agent-based model which is forecasting-oriented and has actually been used in the wild. I was also using it as a jumping-off point for a deeper dive into game theory and game-theoretic agents, potentially for use in my own research. Finally, I often find that I understand mathematical procedures best when I can see them written out in code, and was hoping that others would find my code a useful educational resource.
A very basic summary of the model. It consists of political actors competing over an outcome, represented by a number (for example, in how many years will new emission standards be introduced). Each actor has a preferred outcome, as well as a salience (how much the actor cares about achieving its preferred outcome) and capability (the actor's ability to influence or coerce other actors), which are also given numeric values. The positions, capabilities and saliences are all specified based on input from subject-matter experts. The actors iteratively attempt to influence ('challenge') one another to change their positions, and update their preferred outcomes in response to such challenges.
I refer to the BDM model as an agent-based model. Both BDM and Scholz et al. discuss the model primarily in terms of game theory. The first thing I realized when studying the Scholz paper is that the model is in fact an ABM: it is composed of heterogeneous actors, who interact with each other in discrete time steps. Furthermore, the actors are characterized by multiple attributes, and all need to perform the same calculations repeatedly, with different inputs. In other words, they're perfectly described as code objects.
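To make the "code objects" point concrete, here is a minimal sketch of how such an actor might be represented in Python. The class name, field names, the `position_range` helper, and the numbers are all illustrative assumptions rather than anything taken from the paper or from the actual replication code.

```python
from dataclasses import dataclass


@dataclass
class Actor:
    """One political actor. Field names are illustrative, not the paper's."""
    name: str
    position: float    # preferred outcome on the issue scale
    capability: float  # ability to influence or coerce other actors
    salience: float    # how much the actor cares about the outcome


def position_range(actors):
    """Return the (lowest, highest) preferred outcome among the actors."""
    positions = [a.position for a in actors]
    return min(positions), max(positions)


# Made-up values purely for illustration; not the paper's sample data.
actors = [
    Actor("A", position=4.0, capability=0.3, salience=0.8),
    Actor("B", position=10.0, capability=0.5, salience=0.4),
    Actor("C", position=7.0, capability=0.2, salience=0.6),
]
print(position_range(actors))
```

Because every actor carries the same attributes and runs the same calculations on different inputs, the per-actor formulas would naturally become methods on a class like this one.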
This math-to-code translation highlighted a handful of minor issues in the Scholz paper. For example, the text of the paper and the appendix specifying the algorithm both give the equations for two actors i and j's expected utilities of challenging one another as:
Note that both equations include the term $s_j$, which is actor j's salience. Of course, the theory suggests that in fact actor i should use actor j's salience and vice versa.
Similarly, the equation used to determine actors' risk exponent:
uses the subscript i to indicate two different things: the actor whose exponent is being calculated, and the 'iterator' for the max and min operations.
These were minor typographical issues. I encountered a more serious problem with the formula for calculating an actor's probability of successfully changing another actor's position:
My understanding of both the notation and the description in the paper was that, in the numerator, k if arg>0 implies only summing the terms which evaluate to greater than 0. This also makes sense with regard to the theory, as it can be interpreted as summing the 'votes' in favor of actor i's preferred outcome over actor j's. However, when I actually ran the code, the results were significantly different from those presented in the paper. The results were much closer when I included negative votes in the sum, but set 0 as a lower bound for the numerator as a whole, i.e.:
Note that when actors i and j hold identical positions, the equation above results in a divide-by-zero. Presumably when two actors have the same preferred outcome, there is no need for them to challenge each other, and therefore .
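The two readings of the numerator described above can be sketched in a few lines of Python. This is only an illustration of the difference between them: each entry of `votes` stands in for one third party's term in the sum, however the paper computes it, and the function names and sample numbers are hypothetical.

```python
def numerator_positive_only(votes):
    """First reading: sum only the terms that are greater than zero."""
    return sum(v for v in votes if v > 0)


def numerator_clamped(votes):
    """Second reading: sum every term, then use 0 as a lower bound."""
    return max(0.0, sum(votes))


# Made-up vote terms; a negative value is a vote for actor j over actor i.
votes = [0.2, -0.5, 0.1]
print(numerator_positive_only(votes))
print(numerator_clamped(votes))
```

With these made-up terms the first reading yields a positive numerator while the second is clamped to zero, which shows how far apart the two interpretations can land on the same inputs.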
The sample data provided in the paper lists the salience of each actor as a natural number, apparently from 0 to 100. However, the expected utility equation (above) includes the terms $s_j$ and $(1 - s_j)$, which strongly suggests that salience should be scaled from 0 to 1.
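A minimal sketch of that rescaling, assuming the raw saliences really do run from 0 to 100 (the paper does not say so explicitly). The function name and the `raw_max` parameter are hypothetical.

```python
def rescale_salience(raw, raw_max=100.0):
    """Map a raw salience score onto the 0-1 interval.

    raw_max=100 is an assumption based on the apparent range of the
    sample data, not something the paper states.
    """
    if not 0 <= raw <= raw_max:
        raise ValueError(f"salience {raw} is outside 0..{raw_max}")
    return raw / raw_max


print(rescale_salience(50))
```

Rejecting out-of-range values keeps a mis-scaled input from silently producing a salience above 1.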
Finally, Scholz et al. do not specify a value for T, actors' (fixed, homogeneous) assigned probability that a change in position by any counterpart will be in their favor. Their comparison of their expected utility charts with BDM's does not specify how many rounds into the model they are.
While I have implemented the procedures described in Scholz et al. to the best of my ability, I am not closely reproducing their results. Below is a comparison of the mean position each round as reported in the paper and produced by my model. While the scale is similar, there is no correlation in the movements from round to round. Additionally, the mean position in my model appears to exhibit nearly cyclic behavior absent from the original.
There are a few possible points of failure. I may not be calculating the mean position in the same way, though a formula for that is not explicitly given in the paper. I may also have a bug in the actors' decision rule, or in the expected utility calculation. The latter seems most likely. Below is a side-by-side comparison of the Scholz (left) and my own (right) "view from Belgium," comparing the expected utilities between Belgium and the other countries as derived from the sample input.
Using the metric which Scholz et al. themselves use, comparing the quadrant each actor is in, suggests a better fit than the mean position above. Nevertheless, the specific positions of the actors within the quadrant are clearly different. Furthermore, the UK, clearly one of the pivotal actors, is in the wrong quadrant entirely. I see similar results when comparing the expected utilities from the perspective of the other actors as well.
Note that Scholz et al. don't say how many rounds into the model their quadrant comparisons are. My quadrant charts above are after a single round; I did not notice any improvement when comparing them for rounds 2-10.
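The quadrant comparison can also be made mechanical rather than eyeballed from charts. The sketch below assumes each actor is plotted as a pair of expected utilities and that the quadrant is determined purely by the signs of the two values. The function names, the quadrant numbering, and the numbers are all illustrative, not taken from the paper.

```python
def quadrant(x, y):
    """Quadrant 1-4 by sign, counter-clockwise from the top right.

    Values of exactly zero are treated as non-negative (an assumption).
    """
    if x >= 0 and y >= 0:
        return 1
    if x < 0 and y >= 0:
        return 2
    if x < 0 and y < 0:
        return 3
    return 4


def quadrant_agreement(reference, replica):
    """Fraction of shared actors that fall in the same quadrant in both."""
    shared = reference.keys() & replica.keys()
    if not shared:
        raise ValueError("no actors in common")
    matches = sum(
        quadrant(*reference[name]) == quadrant(*replica[name])
        for name in shared
    )
    return matches / len(shared)


# Made-up expected-utility pairs for three hypothetical actors.
reference = {"A": (0.3, -0.2), "B": (-0.1, -0.4), "C": (0.5, 0.6)}
replica = {"A": (0.1, -0.5), "B": (-0.3, -0.1), "C": (-0.2, 0.4)}
print(quadrant_agreement(reference, replica))
```

A score like this captures only whether an actor lands in the right quadrant; it says nothing about how far its position within the quadrant has moved.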
I'm not writing this post to criticize Scholz, Calbert, and Smith, who did solid academic detective work in piecing together the BDM model. My first point is that replication is hard. This model has relatively few moving parts and no stochasticity, and the paper goes out of its way to provide the algorithm of the steps it entails. Yet despite all of this, my replication was only moderately successful.
A second point is that mathematical notation may not always be the best way to present a model. In this case, it gave rise to the typos and notation issues I noted above, small stumbling blocks on the road to replication. It is possible that a mistake currently present in the code also arises either from a mistake in the formulas or from my interpretation of them.
With a computational model such as this one, it may be clearer to present at least parts of it as pseudocode or actual code. This could help resolve some of the mathematical ambiguity, and hew closer to the form the model takes in practice. Executable code in particular provides a way of checking replications more robustly: instead of just comparing final results, values produced by the original and replicated models can be compared step by step in order to identify where they diverge.
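As a rough illustration of that step-by-step comparison, here is a sketch that walks two traces of intermediate values and reports the first place they disagree. It assumes each model can log its intermediate quantities as one dictionary per step. The function name, the trace format, and the numbers are hypothetical.

```python
import math


def first_divergence(original, replica, rel_tol=1e-9, abs_tol=1e-12):
    """Return (step, name, original_value, replica_value) for the first
    intermediate value that differs, or None if the traces agree."""
    for step, (orig_vals, repl_vals) in enumerate(zip(original, replica)):
        for name, orig_value in orig_vals.items():
            repl_value = repl_vals.get(name)
            if repl_value is None or not math.isclose(
                orig_value, repl_value, rel_tol=rel_tol, abs_tol=abs_tol
            ):
                return step, name, orig_value, repl_value
    return None


# Made-up traces: one dict of intermediate values per model step.
original_trace = [
    {"probability": 0.40, "expected_utility": 0.10},
    {"probability": 0.55, "expected_utility": 0.20},
]
replica_trace = [
    {"probability": 0.40, "expected_utility": 0.10},
    {"probability": 0.55, "expected_utility": -0.05},
]
print(first_divergence(original_trace, replica_trace))
```

Comparing with a tolerance rather than strict equality avoids flagging differences that come only from floating-point rounding across languages or platforms.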
Finally, despite not fully reproducing the results of the Scholz and original BDM models, the replication effort succeeded in giving me a much better grasp of the underlying theory. This is another value of replication not discussed as often. Replication cannot easily be done blindly, even in the presence of a well-defined algorithm. Each step requires thought and understanding, and provides opportunities to engage with the assumptions the model is based on and its results. I'll save my criticisms of the BDM model for future work, but it's worth noting that based on the web front-end to a more recent (proprietary) version of the model, it seems to have changed substantially from the version published in the academic literature.
My code is available on GitHub, and I'm hoping others will go through it and potentially find the mistakes I'm missing.