PRINTED FROM OXFORD HANDBOOKS ONLINE (www.oxfordhandbooks.com). © Oxford University Press, 2018. All Rights Reserved. Under the terms of the licence agreement, an individual user may print out a PDF of a single chapter of a title in Oxford Handbooks Online for personal use (for details see Privacy Policy and Legal Notice).

date: 21 July 2019

# Field Experiments and Natural Experiments

## Abstract and Keywords

This article evaluates the strengths and limitations of field experimentation. It first defines field experimentation and describes the many forms that field experiments take. It also interprets the growth and development of field experimentation. It then discusses why experiments are valuable for causal inference. The assumptions of experimental and nonexperimental inference are distinguished, noting that the value accorded to observational research is often inflated by misleading reporting conventions. The article elaborates on the study of natural experiments and discontinuities as alternatives to both randomized interventions and conventional nonexperimental research. Finally, it outlines a list of methodological issues that arise commonly in connection with experimental design and analysis: the role of covariates, planned vs. unplanned comparisons, and extrapolation. It concludes by dealing with the ways in which field experimentation is reshaping the field of political methodology.

This chapter assesses the strengths and limitations of field experimentation. The chapter begins by defining field experimentation and describing the many forms that field experiments take. The second section charts the growth and development of field experimentation. Third, we describe in formal terms why experiments are valuable for causal inference. Fourth, we contrast the assumptions of experimental and nonexperimental inference, pointing out that the value accorded to observational research is often inflated by misleading reporting conventions. The fifth section discusses the special methodological role that field experiments play insofar as they lay down benchmarks against which other estimation approaches can be assessed. Sixth, we describe two methodological challenges that field experiments frequently confront, noncompliance and attrition, showing the statistical and design implications of each. Seventh, we discuss the study of natural experiments and discontinuities as alternatives to both randomized interventions and conventional nonexperimental research. Finally, we review a list of methodological issues that arise commonly in connection with experimental design and analysis: the role of covariates, planned vs. unplanned comparisons, and extrapolation. The chapter concludes by discussing the ways in which field experimentation is reshaping the field of political methodology.

# 1 Definition of Field Experimentation

Field experimentation represents the conjunction of two methodological strategies, experimentation and fieldwork. Experimentation is a form of investigation in which units of observation (e.g. individuals, groups, institutions, states) are randomly assigned to treatment and control groups. In other words, experimentation involves a random procedure (such as a coin flip) that ensures that every observation has the same probability of being assigned to the treatment group. Random assignment ensures that in advance of receiving the treatment, the experimental groups have the same expected outcomes, a fundamental requirement for unbiased causal inference. Experimentation represents a deliberate departure from observational investigation, in which researchers attempt to draw causal inferences from naturally occurring variation, as opposed to variation generated through random assignment.

Field experimentation represents a departure from laboratory experimentation. Field experimentation attempts to simulate as closely as possible the conditions under which a causal process occurs, the aim being to enhance the external validity, or generalizability, of experimental findings. When evaluating the external validity of political experiments, it is common to ask whether the stimulus used in the study resembles the stimuli of interest in the political world, whether the participants resemble the actors who are ordinarily confronted with these stimuli, whether the outcome measures resemble the actual political outcomes of theoretical or practical interest, and whether the context within which actors operate resembles the political context of interest.

One cannot apply these criteria in the abstract, because they each depend on the research question that an investigator has posed. If one seeks to understand how college students behave in abstract distributive competitions, laboratory experiments in which undergraduates vie for small economic payoffs may be regarded as field experiments. On the other hand, if one seeks to understand how the general public responds to social cues or political communication, the external validity of lab studies of undergraduates has inspired skepticism (Sears 1986; Benz and Meier 2006). These kinds of external validity concerns may subside in the future if studies demonstrate that lab studies involving undergraduates consistently produce results that are corroborated by experimental studies outside the lab; for now, the degree of correspondence remains an open question.

The same may be said of survey experiments. By varying question wording and order, survey experiments may provide important insights into the factors that shape survey response, and they may also shed light on decisions that closely resemble survey response, such as voting in elections. Whether survey experiments provide externally valid insights about the effects of exposure to media messages or other environmental factors, however, remains unclear. What constitutes a field experiment therefore depends on how “the field” is defined. Early agricultural experiments were called field experiments because they were literally conducted in fields. But if the question were how to maximize agricultural productivity of greenhouses, the appropriate field experiment might be conducted indoors.

Because the term “field experiment” is often used loosely to encompass randomized studies that vary widely in terms of realism, Harrison and List (2004, 1014) offer a more refined classification system. “Artefactual” field experiments are akin to laboratory experiments, except that they involve a “non-standard” subject pool. Habyarimana et al. (2007), for example, conduct experiments in which African subjects win prizes depending on how quickly they can open a combination lock; the random manipulation is whether the person who instructs them on the use of such locks is a co-ethnic or member of a different ethnic group. “Framed” field experiments are artefactual experiments that also involve a realistic task. An example of a framed field experiment is Chin, Bond, and Geva’s (2000) study of the way in which sixty-nine congressional staffers made simulated scheduling decisions, an experiment designed to detect whether scheduling preference is given to groups associated with a political action committee. “Natural” field experiments unobtrusively assess the effects of realistic treatments on subjects who would ordinarily be exposed to them, typically using behavioral outcome measures. For example, Gerber, Karlan, and Bergan (2009) randomly assign newspaper subscriptions prior to an election and conduct a survey of recipients in order to gauge the extent to which the ideological tone of different papers manifests itself in the recipients’ political opinions. For the purposes of this chapter, we restrict our attention to natural field experiments, which have clear advantages over artefactual and framed experiments in terms of external validity. We will henceforth use the term field experiments to refer to studies in naturalistic settings; although this usage excludes many lab and survey experiments, we recognize that some lab and survey studies may qualify as field experiments, depending on the research question.

# 2 Growth and Development of Field Experimentation

Despite the allure of random assignment and unobtrusive measurement, field experimentation has, until recently, rarely been used in political science. Although non-randomized field interventions date back to Gosnell (1927), the first randomized field experiment to appear in a political science journal was Eldersveld’s (1956) study of voter mobilization in the Ann Arbor elections of 1953 and 1954. Assigning voters to receive phone calls, mail, or personal contact prior to election day, Eldersveld examined the marginal effects of different types of appeals, both separately and in combination with one another, using official records to measure voter turnout. The next field experiments, replications of Eldersveld’s study (Adams and Smith 1980; Miller, Bositis, and Baer 1981) and a study of the effects of franked mail on constituent opinions of their congressional representative (Cover and Brumberg 1982), appeared a quarter-century later. Although the number of laboratory and survey experiments grew markedly during the 1980s and 1990s, field experimentation remained quiescent. Not a single such experiment was published in a political science journal during the 1990s.

Nor were field experiments part of discussions about research methodology. Despite the fact that political methodology often draws its inspiration from other disciplines, important experiments on the effects of the negative income tax (Pechman and Timpane 1975) and subsidized health insurance (Newhouse 1993) had very little impact on methodological discussion in political science. The most influential research methods textbook, Designing Social Inquiry (King, Keohane, and Verba 1994, 125), scarcely mentions experiments in general, noting in passing that experiments are useful insofar as they “provide a useful model for understanding certain aspects of non-experimental design.” Books that champion qualitative methods, such as Mahoney and Rueschemeyer (2003), typically ignore the topic of experimental design, despite the fact that random assignment is compatible with, and arguably an important complement to, qualitative outcome and process measurement.

Field experimentation’s low profile in political science may be traced to two prevailing methodological beliefs. The first is that field experiments are infeasible. At every stage in the discipline’s development, leading political scientists have dismissed the possibility of experimentation. Lowell (1910, 7) declared, “we are limited by the impossibility of experiment. Politics is an observational, not an experimental, science.” After the behavioral revolution, the prospects for experimentation were upgraded from impossible to highly unlikely: “The experimental method is the most nearly ideal method for scientific explanation, but unfortunately it can only rarely be used in political science because of practical and ethical impediments” (Lijphart 1971, 684). Textbook discussions of field experiments reinforce this view. Consider Johnson, Joslyn, and Reynolds’s description of the practical problems confronting field experimentation in the third edition of their Political Science Research Methods text:

Suppose, for example, that a researcher wanted to test the hypothesis that poverty causes people to commit robberies. Following the logic of experimental research, the researcher would have to randomly assign people to two groups prior to the experimental treatment, measure the number of robberies committed by members of the two groups prior to the experimental treatment, force the experimental group to be poor, and then to remeasure the number of robberies committed at some later date.

(2001, 133)

In this particular example, the practical difficulties stem from an insistence on baseline measurement (in fact, baseline measurement is not necessary for unbiased inference, thanks to random assignment), while the ethical concerns arise because the intervention is presumed to involve making people poorer rather than making them richer.

The second methodological view that contributed to the neglect of experimentation is the notion that statistical methods can be used to overcome the infirmities of observational data. Whether the methods in question are maximum likelihood estimation, simultaneous equations and selection models, pooled cross-section time series, ecological inference, vector autoregression, or nonparametric techniques such as matching, the underlying theme in most methodological writing is that proper use of statistical methods generates reliable causal inferences. The typical book or essay in this genre describes a statistical technique that is novel to political scientists and then presents an empirical illustration of how the right method overturns the substantive conclusions generated by the wrong method. The implication is that sophisticated analysis of nonexperimental data provides reliable results. From this vantage point, experimental data look more like a luxury than a necessity. Why contend with the expense and ethical encumbrances of generating experimental data?

Long-standing suppositions about the feasibility and necessity of field experimentation have recently begun to change in a variety of social science disciplines, including political science. A series of ambitious studies have demonstrated that randomized interventions are possible. Criminologists have randomized police raids on crack houses in order to assess the hypothesis that public displays of police power deter other forms of crime in surrounding areas (Sherman and Rogan 1995). Chattopadhyay and Duflo (2004) have examined the effects of randomly assigning India’s Village Council head positions to women on the kinds of public goods that these rural administrative bodies provide. Economists and sociologists have examined the effects of randomly moving tenants out of public housing projects into neighborhoods with better schools, less crime, and more job opportunities (Kling, Ludwig, and Katz 2005). Hastings et al. (2005) have examined the effects of a “school choice” policy on the academic achievement of students and the voting behavior of parents. Olken (2005) examined the effects of various forms of administrative oversight, including grass-roots participation, on corruption in Indonesia. In political science, use of field experimentation became more widespread after Gerber and Green’s (2000) investigation of the mobilizing effects of various forms of nonpartisan campaign communication, with scholars examining voter mobilization campaigns directed at a range of different ethnic groups, in a variety of electoral contexts, and using an array of different campaign appeals (Cardy 2005; Michelson 2003). Experimentation has begun to spread to other subfields, such as comparative politics (Wantchekon 2003; Guan and Green 2006). Hyde (2006), for example, uses random assignment to study the effects of international monitoring efforts on election fraud.

Nevertheless, there remain important domains of political science that lie beyond the reach of randomized experimentation. Although the practical barriers to field experimentation are frequently overstated, it seems clear that topics such as nuclear deterrence or constitutional design cannot be studied in this manner, at least not directly. As a result, social scientists have increasingly turned to natural experiments (as distinct from “natural” field experiments) in which units of observation receive different treatments in a manner that resembles random assignment. Although there are no formal criteria by which to judge whether naturally occurring variation approximates a random experiment, several recent studies seem to satisfy the requirements of a natural experiment. For example, Miguel, Satyanath, and Sergenti (2004) examine the consequences of weather-induced economic shocks on the violent civil conflicts in sub-Saharan Africa, and Ansolabehere, Snyder, and Stewart (2000) use decennial redistricting to assess the “personal vote” that incumbent legislators receive by comparing voters that legislators retain from their old districts to new voters that they acquire through redistricting.

# 3 Experiments and Inference

The logic underlying randomized experiments, and research designs that attempt to approximate random assignment, is often explicated in terms of a notational system that has its origins in Neyman (1923) and is usually termed the “Rubin Causal Model,” after Rubin (1978; 1990). The notational system is best understood by setting aside, for the time being, the topic of experimentation and focusing solely on the definition of causal influence. For each individual i let Yi0 be the outcome if i is not exposed to the treatment, and Yi1 be the outcome if i is exposed to the treatment. The treatment effect is defined as:

$$\tau_i = Y_{i1} - Y_{i0} \tag{1}$$

In other words, the treatment effect is the difference between two potential states of the world, one in which the individual receives the treatment, and another in which the individual does not. Extending this logic from a single individual to a set of individuals, we may define the average treatment effect (ATE) as follows:

$$\mathrm{ATE} = E(\tau_i) = E(Y_{i1}) - E(Y_{i0}) \tag{2}$$

The concept of the average treatment effect implicitly acknowledges the fact that the treatment effect may vary across individuals in systematic ways. One of the most important patterns of variation in τi occurs when the treatment effect is especially large (or small) among those who seek out a given treatment. In such cases, the average treatment effect in the population may be quite different from the average treatment effect among those who actually receive the treatment.

Stated formally, the concept of the average treatment effect among the treated (ATT) may be written

$$\mathrm{ATT} = E(\tau_i \mid T_i = 1) = E(Y_{i1} \mid T_i = 1) - E(Y_{i0} \mid T_i = 1) \tag{3}$$

where Ti = 1 when a person receives a treatment. To clarify the terminology, Yi1|Ti = 1 is the outcome resulting from the treatment among those who are actually treated, whereas Yi0|Ti = 1 is the outcome that would have been observed in the absence of treatment among those who are actually treated. By comparing equations (2) and (3), it is apparent that the average treatment effect is not in general the same as the treatment effect among the treated.
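The distinction between the ATE and the ATT can be made concrete with a small numerical sketch. The six units and their potential outcomes below are hypothetical numbers invented for illustration; in real data only one of the two potential outcomes is ever observed for each unit, so this table could not be built from an actual study.

```python
# Hypothetical potential outcomes for six individuals (illustrative numbers only).
# Each tuple is (Y_i0, Y_i1, T_i): untreated outcome, treated outcome, treatment status.
units = [
    (2.0, 6.0, 1),
    (3.0, 7.0, 1),
    (1.0, 4.0, 1),
    (2.0, 3.0, 0),
    (4.0, 5.0, 0),
    (3.0, 4.0, 0),
]


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    values = list(values)
    return sum(values) / len(values)


# Individual treatment effects, tau_i = Y_i1 - Y_i0, as in equation (1).
effects = [y1 - y0 for (y0, y1, _) in units]

# Average treatment effect over all units, as in equation (2).
ate = mean(effects)

# Average treatment effect among the treated, as in equation (3).
att = mean(y1 - y0 for (y0, y1, t) in units if t == 1)

print(f"ATE = {ate:.2f}, ATT = {att:.2f}")
```

In this constructed example the treated units are those with the largest effects, so the ATT exceeds the ATE, which is the pattern described above for people who seek out a treatment.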

The basic problem in estimating a causal effect, whether the ATE or the ATT, is that at a given point in time each individual is either treated or not: Either Y1 or Y0 is observed, but not both. Random assignment solves this “missing data” problem by creating two groups of individuals that are similar prior to application of the treatment. The randomly assigned control group then can serve as a proxy for the outcome that would have been observed for individuals in the treatment group if the treatment had not been applied to them.

Having laid out the Rubin potential outcomes framework, we now show how it can be used to explicate the implications of random assignment. When treatments are randomly administered, the group that receives the treatment (Ti = 1) has the same expected outcome as the group that does not receive the treatment (Ti = 0) would if it were treated:

$$E(Y_{i1} \mid T_i = 1) = E(Y_{i1} \mid T_i = 0) \tag{4}$$

Similarly, the group that does not receive the treatment has the same expected outcome, if untreated, as the group that receives the treatment, if it were untreated:

$$E(Y_{i0} \mid T_i = 0) = E(Y_{i0} \mid T_i = 1) \tag{5}$$

Equations (4) and (5) are termed the independence assumption by Holland (1986) because the randomly assigned value of Ti conveys no information about the potential values of Yi. Equations (2), (4), and (5) imply that the average treatment effect may be written

$$\mathrm{ATE} = E(Y_{i1} \mid T_i = 1) - E(Y_{i0} \mid T_i = 0) \tag{6}$$

Because E (Yi1|Ti = 1) and E (Yi0|Ti = 0) may be estimated directly from the data, this equation suggests a solution to the problem of causal inference. The estimator implied by equation (6) is simply the difference between two sample means: the average outcome in the treatment group minus the average outcome in the control group. In sum, random assignment satisfies the independence assumption, and the independence assumption suggests a way to generate empirical estimates of average treatment effects.
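The difference-in-means estimator can be illustrated with a short simulation. This is a minimal sketch under invented assumptions: the outcome distributions, the sample size, and the self-selection rule (units with above-average untreated outcomes opt in) are all hypothetical and chosen only to make the contrast visible.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def simulate(n=20_000, seed=1):
    """Compare difference-in-means under random assignment and self-selection."""
    rng = random.Random(seed)

    # Hypothetical potential outcomes (Y_i0, Y_i1) with heterogeneous effects.
    units = []
    for _ in range(n):
        y0 = rng.gauss(10.0, 2.0)
        y1 = y0 + rng.gauss(1.5, 1.0)
        units.append((y0, y1))
    true_ate = mean([y1 - y0 for y0, y1 in units])

    # Random assignment: a coin flip decides which potential outcome is observed.
    treated, control = [], []
    for y0, y1 in units:
        if rng.random() < 0.5:
            treated.append(y1)
        else:
            control.append(y0)
    randomized = mean(treated) - mean(control)

    # Self-selection (assumed rule): units with high untreated outcomes opt in.
    takers = [y1 for y0, y1 in units if y0 > 10.0]
    others = [y0 for y0, y1 in units if y0 <= 10.0]
    self_selected = mean(takers) - mean(others)

    return true_ate, randomized, self_selected


true_ate, randomized, self_selected = simulate()
print(f"true ATE          = {true_ate:.3f}")
print(f"random assignment = {randomized:.3f}")
print(f"self-selection    = {self_selected:.3f}")
```

Under the coin flip, the difference between the two sample means lands close to the true average effect. Under the assumed selection rule, the same calculation also picks up the pre-existing gap between the groups, so it overstates the effect.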

Random assignment further implies that independence will hold not only for Yi, but for any variable Xi that might be measured prior to the administration of the treatment. For example, subjects’ demographic attributes or their scores on a pre-test are presumably independent of randomly assigned treatment groups. Thus, one expects the average value of Xi in the treatment group to be the same as in the control group; indeed, the entire distribution of Xi is expected to be the same across experimental groups. This property is known as covariate balance. It is possible to gauge the degree of balance empirically by comparing the sample averages for the treatment and control groups. One may also test for balance statistically by evaluating the null hypothesis that the covariates jointly have no systematic tendency to predict treatment assignment. Regression, for example, may be used to generate an F-test to evaluate the hypothesis that the slopes of all predictors of treatment assignment are zero. A significant test statistic suggests that something may have gone awry in the implementation of random assignment, and the researcher may wish to check his or her procedures. It should be noted, however, that a significant test statistic does not prove that the assignment procedure was nonrandom; nor does an insignificant test statistic prove that treatments were assigned using a random procedure. Balance tests provide useful information, but researchers must be aware of their limitations.
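A balance check can be sketched in a few lines. The example below swaps in a permutation test on a single covariate in place of the regression-based F-test mentioned above, because it needs nothing beyond the standard library; the function name `balance_p_value`, the age data, and the two assignment vectors are hypothetical.

```python
import random


def balance_p_value(covariate, assignment, draws=2000, seed=0):
    """Permutation p-value for the treatment/control gap in a covariate mean.

    Re-shuffles the assignment labels and counts how often the reshuffled
    gap is at least as large as the observed one.
    """
    rng = random.Random(seed)

    def gap(labels):
        treated = [x for x, z in zip(covariate, labels) if z == 1]
        control = [x for x, z in zip(covariate, labels) if z == 0]
        return abs(sum(treated) / len(treated) - sum(control) / len(control))

    observed = gap(assignment)
    shuffled = list(assignment)
    extreme = 0
    for _ in range(draws):
        rng.shuffle(shuffled)
        if gap(shuffled) >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (draws + 1)


# Hypothetical pre-treatment covariate: ages 20 through 79.
ages = list(range(20, 80))

# A properly randomized assignment of 30 treated and 30 control subjects.
random_assignment = [1] * 30 + [0] * 30
random.Random(42).shuffle(random_assignment)

# A botched assignment that treats only the oldest half of the sample.
sorted_assignment = [1 if age >= 50 else 0 for age in ages]

p_random = balance_p_value(ages, random_assignment)
p_sorted = balance_p_value(ages, sorted_assignment)
print(f"randomized: p = {p_random:.3f}; sorted by age: p = {p_sorted:.4f}")
```

The botched assignment yields a very small p-value, which would prompt a review of procedures. As the text cautions, a large p-value for the randomized assignment does not by itself prove that the procedure was random.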

We return to the topic of covariate balance below. For now, we note that random assignment obviates the need for multivariate controls. Although multivariate methods may be helpful as a means to improve the statistical precision with which causal effects are estimated, the estimator implied by equation (6) generates unbiased estimates without such controls.

For ease of presentation, the above discussion of causal effects skipped over two further assumptions that play a subtle but important role in experimental analysis. The first is the idea of an exclusion restriction. Embedded in equation (1) is the idea that outcomes vary as a function of receiving the treatment per se. It is assumed that assignment to the treatment group only affects outcomes insofar as subjects receive the treatment. Part of the rationale for using placebo groups in experimental design is the concern that subjects’ knowledge of their experimental assignment might affect their outcomes. The same may be said for double-blind procedures: When those who implement experiments are unaware of subjects’ experimental assignments, they cannot intentionally or inadvertently alter their measurement of the dependent variable.

A second assumption is known as the Stable Unit Treatment Value Assumption, or SUTVA. In the notation used above, expectations such as E (Yi1|Ti = ti) are all written as if the expected value of the treatment outcome variable Yi1 for unit i only depends upon whether or not the unit gets the treatment (whether ti equals one or zero). A more complete notation would allow for the consequences of treatments T1 through Tn administered to other units. It is conceivable that experimental outcomes might depend on the values of t1, t2, …, ti−1, ti+1, …, tn as well as the value of ti:

$$E(Y_{i1} \mid T_1 = t_1, T_2 = t_2, \ldots, T_{i-1} = t_{i-1}, T_i = t_i, T_{i+1} = t_{i+1}, \ldots, T_n = t_n)$$

By ignoring the assignments to all other units when we write this as E (Yi1|Ti = ti) we assume away spillovers from one experimental group to the other. Nickerson (2008) and Miguel and Kremer (2004) provide empirical illustrations of instances in which treatments administered to one person have effects on those around them.

Note that violations of SUTVA may produce biased estimates, but the sign and magnitude of the bias depend on the way in which treatment effects spill over across observations (cf. Nickerson 2008). Suppose an experiment were designed to gauge the ATT of door-to-door canvassing on voter turnout. Suppose that, by generating enthusiasm about the upcoming election, treating one person has the effect of increasing their probability of voting by π and the probability that their next-door neighbors vote by π* (regardless of whether the neighbors are themselves treated). If treatment subjects are as likely as control subjects to live next to a person who receives the treatment, the difference in voting rates in the treatment and control groups provides an unbiased estimate of π even though SUTVA is violated. On the other hand, suppose that canvassers increase voter turnout by conveying information about where to vote. Under this scenario, treating one person has the effect of increasing their probability of voting by π and the probability that their untreated next-door neighbors vote by π*; there is no effect on treated neighbors. This particular violation of SUTVA will boost the turnout rate in the control group and lead to an underestimation of π. As this example illustrates, the direction and magnitude of SUTVA-related bias are often difficult to know ex ante, and experimental researchers may wish to assess the influence of spillover effects empirically by randomly varying the density of treatments within different geographic units. Note that SUTVA, like other core assumptions of experimental inference, in fact applies to both experimental and observational research. The next section comments on the points at which the two modes of research diverge and the consequences of this divergence for the interpretation of experimental and observational findings.
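The two canvassing scenarios can be sketched as a simulation. Everything numerical here is an assumption made for illustration: households sit on a circle so that each has two next-door neighbors, half are treated by coin flip, baseline turnout is 0.40, π is 0.08, and π* is 0.02 per treated neighbor. To keep the sketch deterministic apart from the assignment itself, it averages voting probabilities instead of drawing individual votes.

```python
import random


def canvassing_estimates(n=100_000, base=0.40, pi=0.08, pi_star=0.02, seed=7):
    """Difference in mean turnout probability under two spillover scenarios."""
    rng = random.Random(seed)
    treated = [rng.random() < 0.5 for _ in range(n)]

    def treated_neighbors(i):
        # Households are arranged on a circle: index -1 wraps to the last one.
        return treated[i - 1] + treated[(i + 1) % n]

    def estimate(spillover_reaches_treated):
        totals = {True: 0.0, False: 0.0}
        counts = {True: 0, False: 0}
        for i in range(n):
            p = base + (pi if treated[i] else 0.0)
            # Enthusiasm scenario: spillover reaches everyone.
            # Information scenario: spillover reaches untreated households only.
            if spillover_reaches_treated or not treated[i]:
                p += pi_star * treated_neighbors(i)
            totals[treated[i]] += p
            counts[treated[i]] += 1
        return totals[True] / counts[True] - totals[False] / counts[False]

    return estimate(True), estimate(False)


enthusiasm_estimate, information_estimate = canvassing_estimates()
print(f"enthusiasm spillover:  {enthusiasm_estimate:.4f}")
print(f"information spillover: {information_estimate:.4f}")
```

With these assumed values, the enthusiasm scenario returns an estimate close to π = 0.08, because spillovers raise turnout equally in both groups. The information scenario returns roughly π minus π* times the average number of treated neighbors, about 0.06, because only the control group benefits from spillovers.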

# 4 Contrasting Experimental and Observational Inference

Observational studies compare cases that received the “treatment” (through some unknown selection mechanism) with those that did not receive it. Because random assignment is not used, there are no procedural grounds on which to justify the independence assumption. Without independence, equations (4) and (5) do not hold. We cannot assume that the expected value of the outcomes in the “treatment group,” E (Yi1|Ti = 1) equals the expected value of treated outcomes for the control group if treated, E (Yi1|Ti = 0), because the treatment group may differ systematically from the control group in advance of the treatment. Nor can we assume that the expected value of the untreated outcomes in the control group, E (Yi0|Ti = 0) equals the expected value of the untreated outcomes for the treatment group, E (Yi0|Ti = 1). Again, the treatment group in its untreated state may not resemble the control group.

There is no foolproof way to solve this problem. In an effort to eliminate potential confounds, a standard approach is to control for observable differences between treatment and control. Sometimes regression is used to control for covariates, but suppose one were to take the idea of controlling for observables to its logical extreme and were to match the treatment and control groups exactly on a set of observable characteristics X. For every value of X, one gathers a treatment and control observation.¹ This matching procedure would generate treatment and control groups that are perfectly balanced in terms of the covariates. However, in order to draw unbiased inferences from these exactly matched treatment and control groups, it is necessary to invoke the ignorability assumption described by Heckman et al. (1998), which stipulates that the treatment and potential outcomes are independent conditional on a set of characteristics X:

$$E(Y_{i1} \mid X_i, T_i = 1) = E(Y_{i1} \mid X_i, T_i = 0) \tag{7}$$

$$E(Y_{i0} \mid X_i, T_i = 0) = E(Y_{i0} \mid X_i, T_i = 1) \tag{8}$$

Note that these two assumptions parallel the two implications of randomization stated in equations (4) and (5), except that these are conditional on X. From these equations, it is easy to generate expressions for average treatment effects, using the same logic as above.
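Exact matching on a single discrete covariate can be sketched as follows. The records, the covariate labels, and the choice to weight each stratum by its share of the sample are assumptions made for illustration. The result has a causal interpretation only if ignorability conditional on X holds, which the data themselves cannot confirm.

```python
from collections import defaultdict

# Hypothetical observational records: (X, treated, outcome).
records = [
    ("young", 1, 0.30), ("young", 0, 0.20), ("young", 0, 0.22),
    ("old", 1, 0.70), ("old", 1, 0.66), ("old", 0, 0.60),
]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def naive_difference(rows):
    """Treated mean minus untreated mean, ignoring X."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)


def exact_matching_estimate(rows):
    """Within-stratum differences in means, weighted by stratum size."""
    strata = defaultdict(lambda: {0: [], 1: []})
    for x, t, y in rows:
        strata[x][t].append(y)

    weighted_sum, total = 0.0, 0
    for groups in strata.values():
        if not groups[0] or not groups[1]:
            continue  # no exact match exists for this value of X
        size = len(groups[0]) + len(groups[1])
        weighted_sum += size * (mean(groups[1]) - mean(groups[0]))
        total += size
    return weighted_sum / total


naive = naive_difference(records)
matched = exact_matching_estimate(records)
print(f"naive = {naive:.3f}, matched on X = {matched:.3f}")
```

In this constructed data set older people are both more likely to be treated and more likely to have high outcomes, so the naive comparison is much larger than the matched one. Matching removes the imbalance on X, but it cannot address imbalance on anything left unmeasured.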

Analysts of observational data reason along these lines when defending their estimation techniques. It is conventional to assume, for example, that conditional on a set of covariates, individuals who receive treatment have the same expected outcomes as those in the control group. Notice, however, the important difference between the assumptions imposed by experimental and observational researchers. The experimental researcher relies on the properties of a procedure, random assignment, to derive unbiased estimates. The observational researcher, on the other hand, relies on substantive assumptions about the properties of covariates. Whereas the properties of the random assignment procedure are verifiable (one can check the random number generator and review the clerical procedures by which numbers were assigned to units of observation), the validity of the substantive assumptions on which observational inferences rest is seldom verifiable in any direct sense.

Even if one believes that the assumptions underlying an observational inference are probably sound, the mere possibility that these assumptions are incorrect alters the statistical attributes of observational results. Gerber, Green, and Kaplan (2004) show formally that an estimate’s sampling distribution reflects two sources of variance: the statistical uncertainty that results when a given model is applied to data and additional uncertainty about whether the estimator is biased. Only the first source of uncertainty is accounted for in the standard errors that are generated by conventional statistical software packages. The second source of uncertainty is ignored. In other words, the standard errors associated with observational results are derived using formulas that are appropriate for experimental data but represent lower bounds when applied to observational data. This practice means that observational results are conventionally described in ways that are potentially quite misleading. More specifically, they are misleading in ways that exaggerate the value of observational findings.
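A stylized decomposition shows why nominal standard errors act as lower bounds. It is offered as an illustrative sketch, not as the derivation in Gerber, Green, and Kaplan (2004). The symbols $B$, $e$, $\sigma_B^2$, and $\sigma_e^2$ are assumptions introduced here: $B$ is the unknown bias of the observational estimator, which the researcher believes to be zero on average but with uncertainty $\sigma_B^2$, and $e$ is ordinary sampling error, independent of $B$.

$$\hat{\tau} = \tau + B + e, \qquad E(B) = 0, \quad \operatorname{Var}(B) = \sigma_B^2, \quad E(e) = 0, \quad \operatorname{Var}(e) = \sigma_e^2$$

$$E\big[(\hat{\tau} - \tau)^2\big] = E(B^2) + 2E(Be) + E(e^2) = \sigma_B^2 + \sigma_e^2$$

Conventional software reports only $\sigma_e$. Under random assignment $\sigma_B^2 = 0$ and the reported standard error is the whole story; for observational estimates any $\sigma_B^2 > 0$ makes the true root mean squared error larger than the nominal standard error.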

This point has been demonstrated empirically. Inspired by the pioneering work of LaLonde (1986), which assessed the correspondence between experimental and observational estimates of the effectiveness of job-training programs, Arceneaux, Gerber, and Green (2006) compare experimental and observational estimates of the effects of phone-based voter mobilization campaigns. Using a sample of more than one million observations, they find the actual root mean squared error associated with the observational estimates to be an order of magnitude larger than their nominal standard errors.

LaLonde-type comparisons expose a seldom-noticed deficiency in political methodology research. Heretofore, methodological debate was premised on the assumption that the value of a causal parameter is unknowable; as a result, the terms of debate hinged on the technical attributes of competing approaches. LaLonde’s empirical approach radically alters these debates, for it sometimes turns out that cutting-edge statistical techniques perform quite poorly.

That said, experiments are by no means free from methodological concerns. Field experiments in particular are susceptible to noncompliance and attrition, two problems that convinced some early critics of field experimentation to abandon the enterprise in favor of observational research designs (Chapin 1947). The next sections consider in depth the nature and consequences of these two problems.

# 5 Noncompliance

Sometimes only a subset of those who are assigned to the treatment group are actually treated, or a portion of the control group receives the treatment. When those who get the treatment differ from those who are assigned to receive it, an experiment confronts a problem of noncompliance. In experimental studies of get-out-the-vote canvassing, noncompliance occurs when some subjects who were assigned to the treatment group remain untreated because they are not reached. In clinical trials, subjects may choose to stop a treatment. In studies of randomized election monitoring, observers may fail to follow their assignment; consequently, some places assigned to the treatment group go untreated, while places assigned to the control group receive treatment.

How experimenters approach the problem of noncompliance depends on their objectives. Those who wish to gauge the effectiveness of an outreach program may be content to estimate the so-called “intent-to-treat” effect (ITT); that is, the effect of being randomly assigned to the treatment group, whether or not the treatment is actually received. At the end of the day, a program’s effectiveness is a function of both its effects on those who receive the treatment and the extent to which the treatment is actually administered. Other experimenters may be primarily interested in measuring the effects of the treatment on those who are actually treated. For them, the rate at which any specific program reaches the intended treatment group is of secondary interest. The formal discussion below shows how both the intent-to-treat and treatment-on-treated effects may be estimated from data.

Table 50.1 Summary of notation distinguishing assigned and received treatments

| | Experimental assignment: treatment group (Zi = 1) | Experimental assignment: control group (Zi = 0) |
|---|---|---|
| Treated (Ti = 1) | D1 = 1 | D0 = 1 |
| Not treated (Ti = 0) | D1 = 0 | D0 = 0 |

When there is noncompliance, a subject’s group assignment, Zi, is not equivalent to Ti, whether the subject gets treated or not. Angrist, Imbens, and Rubin (1996) extend the notation presented in equations (1) through (6) to the case where treatment group assignment and receipt of treatment can diverge. Let D1 = 1 when a subject assigned to the treatment group is treated, and let D1 = 0 when a subject assigned to the treatment group is not treated. Likewise, let D0 = 1 when a subject assigned to the control group is treated, and let D0 = 0 when a subject assigned to the control group is not treated. Using this notation, which has been summarized in Table 50.1, we can define a subset of the population, called “Compliers,” who get the treatment when assigned to the treatment group but not otherwise. “Compliers” are subjects for whom D1 = 1 and D0 = 0. In the simplest experimental design, in which all those in the treatment group get treated and no one in the control group does, every subject is a Complier. Note that whether a subject is a Complier is a function of both subject characteristics and the particular features of the experiment and is not a fixed attribute of a subject.

When treatments are administered exactly according to plan (Zi = Ti, ∀ i), the average causal effect of a randomly assigned treatment can be estimated simply by comparing mean treatment group outcomes and mean control group outcomes. What can be learned about treatment effects when there is noncompliance? Angrist, Imbens, and Rubin (1996) present a set of sufficient conditions for estimating the average treatment effect for the subgroup of subjects who are Compliers. Here we will first present a description of the assumptions and the formula for estimating the average treatment effect for the Compliers. We then elucidate the assumptions using an example.

In addition to the assumption that treatment group assignment Z is random, Angrist et al.’s result invokes the following four assumptions, the first two of which have been mentioned above:

Exclusion restriction. The outcome for a subject is a function of the treatment they receive but is not otherwise influenced by their assignment to the treatment group. In experiments this assumption may fail if subjects change their behavior in response to the treatment group assignment per se (as opposed to the treatment itself) or are induced to do so by third parties who observe the treatment assignment.

Stable Unit Treatment Value Assumption (SUTVA). Whether a subject is treated depends only on the subject’s own treatment assignment and not on the treatment assignment of any other subjects. Also, the subject’s outcome is a function of his or her treatment assignment and receipt of treatment, and not affected by the assignment of or treatment received by any other subject.

Monotonicity. For all subjects, the probability the subject is treated is at least as great when the subject is in the treatment group as when the subject is in the control group. This assumption is satisfied by design in experiments where the treatment is only available to the treatment group.

Nonzero causal effects of assignment on treatment. The treatment assignment has an effect on the probability that at least some subjects are treated. This is satisfied if the monotonicity assumption is slightly strengthened to require the inequality to be strict for at least some subjects.

Angrist et al. (1996, proposition 1) show that if assumptions 1–4 are satisfied then the effect of the treatment can be expressed as:

$$\frac{E(Y_1 - Y_0)}{E(D_1 - D_0)} = E\left(Y_1 - Y_0 \mid D_1 - D_0 = 1\right) \qquad (9)$$

where Yj is the outcome when Zi = j, Dj is the value of D when Zi = j, E(h) is the mean value of h in the subject population, and E(h|g) is the mean value of h in the subset of the subject population for which g holds.

The numerator on the left-hand side of equation (9) is the intent-to-treat (ITT) effect of Z on Y, the average causal effect of Z on Y for the entire treatment group, including those who do not get treated. The ITT may be expressed formally as

$$ITT = E(Y_1 - Y_0) \qquad (10)$$

The ITT is often used in program evaluation because it takes into account both the treatment effect and the rate at which the treatment is successfully administered. A program may have a weak intent-to-treat effect because the average treatment effect is weak or because the intervention is actually administered to a small portion of those assigned to receive the treatment.

The denominator of equation (9) is the ITT effect of Z on D, the average effect of being placed in the treatment group on the probability a subject is treated. The ratio of these ITT effects equals the average causal effect of the treatment for the Complier population, which is referred to as the Local Average Treatment Effect (LATE). When the control group is never treated, D0 = 0, the LATE is equal to the ATT defined in equation (3).

This proposition has an intuitive basis. Suppose that a share p of those in the treatment group are converted from untreated to treated by their assignment to the treatment group. Suppose that for those whose treatment status is changed by treatment assignment (the subjects in the treatment group who are Compliers), the average change in Y caused by the treatment is Π. How is the average outcome of those subjects assigned to the treatment group changed by the fact that they were given a random assignment to the treatment, rather than the control group?

The observed average value of Y for the treatment group is altered by pΠ, the share of the treatment group affected by the treatment assignment multiplied by the average treatment effect for compliers. If the control group subjects are not affected by the experiment, and the treatment assignment is random, then the difference between the treatment and control group outcome will be on average pΠ. To recover the average effect of the treatment on compliers, divide this difference by p. The average treatment effect on those who are induced to receive treatment by the experiment is therefore equal to the ratio on the left-hand side of (9): the difference in average group outcomes, inflated by dividing this by the change in the probability the treatment is delivered to a subject.

If treatment groups are formed by random assignment, the left-hand side of (9) can be estimated using the sample analogues for the left-hand side quantities. The ITT effect for Y is estimated by the mean difference in outcomes between the treatment and control group and the ITT for D is estimated by the mean difference in treatment rates in the treatment and control group. Equivalently, the LATE can also be estimated using two-stage least squares (2SLS), where subject outcomes are regressed on a variable for receipt of treatment, and treatment assignment is used as an instrument for receipt of treatment. Note that for the simplest case, where no one in the control group is treated and the treatment rate in the treatment group equals c, the LATE is equal to $ITT_Y / c$, where the numerator is the ITT estimate for Y, which is then inflated by the inverse of the treatment rate.
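The sample-analogue calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reproduction of any published analysis: the function name `wald_estimate`, the simulated canvassing data, and every numerical value (a 30 percent contact rate, a 10-point effect among Compliers, different baseline turnout for Compliers) are assumptions made purely for the example.

```python
import random


def wald_estimate(z, d, y):
    """Return (itt_y, itt_d, late) from assignment z, treatment receipt d, outcome y."""
    def mean(values):
        return sum(values) / len(values)

    y_treat = [yi for zi, yi in zip(z, y) if zi == 1]
    y_ctrl = [yi for zi, yi in zip(z, y) if zi == 0]
    d_treat = [di for zi, di in zip(z, d) if zi == 1]
    d_ctrl = [di for zi, di in zip(z, d) if zi == 0]

    itt_y = mean(y_treat) - mean(y_ctrl)  # effect of assignment on the outcome
    itt_d = mean(d_treat) - mean(d_ctrl)  # effect of assignment on treatment receipt
    return itt_y, itt_d, itt_y / itt_d


# Simulated canvassing experiment; all parameter values are hypothetical.
rng = random.Random(0)
n = 40000
true_effect = 0.10   # assumed effect of contact on turnout among Compliers
contact_rate = 0.30  # assumed share of Compliers

z, d, y = [], [], []
for _ in range(n):
    zi = rng.randint(0, 1)                   # random assignment
    complier = rng.random() < contact_rate   # would be reached if assigned
    di = 1 if (zi == 1 and complier) else 0  # control group is never treated
    base = 0.50 if complier else 0.35        # Compliers differ at baseline
    yi = 1 if rng.random() < base + true_effect * di else 0
    z.append(zi)
    d.append(di)
    y.append(yi)

itt_y, itt_d, late = wald_estimate(z, d, y)
print(f"ITT_Y = {itt_y:.3f}, ITT_D = {itt_d:.3f}, LATE = {late:.3f}")

# For contrast: a naive comparison of treated and untreated subjects,
# which ignores the fact that those who are reached differ at baseline.
treated_y = [yi for di, yi in zip(d, y) if di == 1]
untreated_y = [yi for di, yi in zip(d, y) if di == 0]
naive = sum(treated_y) / len(treated_y) - sum(untreated_y) / len(untreated_y)
print(f"naive treated-vs-untreated difference = {naive:.3f}")
```

In this simulated setting the ratio of the two ITT estimates should land near the assumed Complier effect, whereas the naive comparison is inflated because subjects who can be reached vote at higher rates even when untreated.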

A basic point of Angrist et al.’s analysis is that while some estimate can always be calculated, the estimate is the causal effect of the treatment only when the set of assumptions listed above is satisfied. These conditions are illustrated using an example in which there is failure to treat the treatment group, but treatment is unavailable to the control group. Suppose that after random assignment to treatment and control groups, investigators attempt to contact and then treat the treatment group. The treatment is unavailable to the control group. Examples of this design are get-out-the-vote canvassing experiments, where there is typically failure to treat some of those assigned to the treatment group. Let us now consider how the assumptions presented above apply to this example.

Exclusion restriction. The exclusion restriction is violated when treatment assignment has a direct effect on a subject’s outcome. In the example considered here subjects are not aware of their treatment group status (Z), and so they cannot change their behavior based on Z directly. However, it is possible that third parties can observe the treatment assignment. For example, a campaign worker might observe the list of persons assigned to be contacted for get-out-the-vote and then concentrate persuasive messages on these households. This violation of the exclusion restriction causes bias.

SUTVA. Recall that SUTVA requires that the treatment received by a subject does not alter the outcome for other subjects. The SUTVA assumption will fail if, for example, a subject’s probability of voting is affected by the treatment of their neighbors. If an experiment that contains multiple members of a household is analyzed at the individual rather than the household level, SUTVA is violated if a subject’s probability of voting is a function of whether another member of the household is treated. More generally, spillover effects are a source of violation of SUTVA. If there are decreasing returns to repeat treatments, spillovers will reduce the difference between the average treatment and control group outcomes without changing the recorded treatment rates. As a result, this violation of SUTVA will bias the estimated treatment effect toward zero.

Monotonicity. In the get-out-the-vote example each subject has a zero chance of getting the treatment when assigned to the control group, and therefore monotonicity is satisfied by the experimental design.

Nonzero effect of treatment assignment on the probability of treatment. This assumption is satisfied by the experimental design so long as some members of the assigned treatment group are successfully contacted.

A few final comments on estimating treatment effects when there is noncompliance:

The population proportion of Compliers is a function of population characteristics and the experimental design. The approach presented above produces an estimate of the average treatment effect for Compliers, but which subjects are Compliers may vary with the experimental design. An implication of this is that if the experimental protocols reach different types of subjects and the treatment rates are low, the treatment effect estimate for the same treatment may change across experiments. In order to detect heterogeneous treatment effects, the experimenter may wish to vary the efforts made to reach those assigned to the treatment group.

Generalization of average treatment effects beyond the Compliers requires additional assumptions. When there is noncompliance, the treatment effect estimate applies to Compliers, not the entire subject population. Statements about the effect of treating the entire population require assumptions about the similarity of treatment effects for Compliers and the rest of the subjects.

There is no assumption of homogeneous treatment effects. While both experimental and observational research often implicitly assumes constant treatment effects, the treatment effect estimated by 2SLS is the average treatment effect for Compliers. If homogeneous treatment effects are assumed, then LATE = ATT = ATE.

Given noncompliance, it is sometimes impossible to determine which particular individuals are Compliers. When no members of the control group can get the treatment, which subjects are Compliers can be directly observed, since they are the subjects in the treatment group who get the treatment. However, in situations where it is possible to be treated regardless of group assignment, the exact subjects who are the Compliers cannot be determined. Angrist et al. (1996) offer as an example the natural experiment produced by the Vietnam draft lottery. Those with low lottery numbers were more likely to enter the military than those with high numbers, but some people with high numbers did get “treated.” This suggests that some subjects would have joined the military regardless of their draft number. Subjects who get treated regardless of their group assignment are called “Always Takers”; they are people for whom D1 = 1 and D0 = 1. Since some of the subjects assigned to the treatment group are Always Takers, the set of treated subjects in the treatment group is a mix of Always Takers and Compliers. This mix of subjects cannot be divided into Compliers and Always Takers, since an individual subject’s behavior in the counterfactual assignment is never observed.
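Although individual Compliers cannot be identified when treatment is available to both groups, the population shares of each type can still be inferred from the two observed treatment rates, provided random assignment and monotonicity hold. The sketch below illustrates that bookkeeping. The function name and the two treatment rates are hypothetical, and the label "never takers" (subjects with D1 = 0 and D0 = 0) is a term from the broader literature rather than from the discussion above.

```python
def strata_shares(treated_rate_assigned, treated_rate_control):
    """Infer type shares from treatment rates in the assigned and control groups.

    Assumes random assignment and monotonicity, so that treated control
    subjects are Always Takers and untreated assigned subjects are never takers.
    """
    if treated_rate_assigned < treated_rate_control:
        raise ValueError("treatment rates are inconsistent with monotonicity")
    return {
        "always_takers": treated_rate_control,
        "compliers": treated_rate_assigned - treated_rate_control,
        "never_takers": 1.0 - treated_rate_assigned,
    }


# Hypothetical rates: 35% of the treatment group and 19% of the control group treated.
shares = strata_shares(0.35, 0.19)
for name, share in shares.items():
    print(f"{name}: {share:.2f}")
```

The Complier share computed here is the same quantity that appears in the denominator of equation (9), which is why gathering compliance data in both groups matters.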

In sum, problems of noncompliance present a set of design challenges for field experimental researchers. Awareness of these problems leads researchers to gather data on levels of compliance in the treatment and control groups, so that local average treatment effects may be estimated. It also encourages researchers to think about which core assumptions—SUTVA, monotonicity, exclusion, and nonzero assignment effects—are likely to be problematic in any given application and to design experiments so as to minimize these concerns.

# 6 Attrition

Less tractable are the problems associated with attrition. Attrition occurs when outcomes are unobserved for certain observations. Consider a simple example of an intervention designed to encourage political parties in closed-list systems to nominate women candidates as regional representatives. The outcomes for this study are Y = 1 (a woman is nominated), Y = 0 (no women are nominated), and Y = ? (the investigator does not know whether a woman is nominated). Suppose that the experimental results indicate that 50 percent of the jurisdictions in the treatment group had women nominees, 20 percent did not, and 30 percent remain unknown. The corresponding rates for the control group are 40 percent, 20 percent, and 40 percent. In the absence of any other information about these jurisdictions and the reasons why their outcomes are observed or unobserved, this set of results is open to competing interpretations. One could exclude missing data from the analysis, in which case the estimated treatment effect is .50/(.50 + .20) − .40/(.40 + .20) = .05. Alternatively, one could assume that those whose outcomes were not observed failed to field women candidates, in which case the estimated treatment effect is .10. Or, following Manski (1990), one could calculate the bounds around the estimated effect by assuming the most extreme possible outcomes. For example, if none of the treatment group’s missing observations fielded women candidates, but all of the missing observations in the control group did so, the treatment effect would be .50 − .80 = −.30. Conversely, if the missing observations in the treatment group are assumed to be Y = 1, but missing observations in the control group are assumed to be Y = 0, the estimated treatment effect becomes .80 − .40 = .40. As this example illustrates, the bounds {−.30, .40} admit an uncomfortably large range of potential values.
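The competing interpretations in this example can be reproduced with a short calculation. The sketch below is only an arithmetic aid for a binary outcome; the function name and dictionary keys are illustrative choices, and the input shares are those given in the example.

```python
def attrition_estimates(treat, control):
    """Treatment-effect estimates under different assumptions about missing outcomes.

    Each argument is a tuple (share with Y = 1, share with Y = 0, share missing).
    """
    t1, t0, tm = treat
    c1, c0, cm = control
    return {
        # Drop missing cases and compare rates among observed outcomes.
        "drop_missing": t1 / (t1 + t0) - c1 / (c1 + c0),
        # Treat every missing outcome as Y = 0.
        "missing_as_zero": t1 - c1,
        # Worst case: missing treatment cases are 0, missing control cases are 1.
        "lower_bound": t1 - (c1 + cm),
        # Best case: missing treatment cases are 1, missing control cases are 0.
        "upper_bound": (t1 + tm) - c1,
    }


results = attrition_estimates(treat=(0.50, 0.20, 0.30), control=(0.40, 0.20, 0.40))
for name, value in results.items():
    print(f"{name}: {value:+.2f}")
```

For a binary outcome the distance between the two extreme-case bounds equals the sum of the two missing shares (here .30 + .40 = .70), so the bounds can be tightened only by reducing attrition or by adding assumptions.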

The indeterminacy created by attrition can in principle be reduced by imposing some theoretical structure on the data. Imputation models, for example, attempt to use observed covariates to forecast the outcomes for missing values (Imbens and Rubin 1997; Dunn, Maracy, and Tomenson 2005). This approach will generate unbiased estimates under special conditions that may or may not hold for any given application. In situations where data are missing in a fashion that is random conditional on covariates, this approach will be superior to excluding missing data or substituting the unconditional mean. On the other hand, it is possible to construct examples in which imputation exacerbates the problem of bias, as when missing data arise due to systematic factors that are not captured by observed covariates.

Although the consequences of attrition are always to some degree speculative, certain experimental designs are better able than others to allay concerns about attrition. Consider, for example, Howell and Peterson’s (2004) analysis of the effects of school vouchers on students’ standardized test scores. From a pool of voucher applicants, a subset of voucher recipients was chosen at random. Outcomes were measured by a test administered privately to both voucher recipients and nonrecipients, but recipients were more likely to take the test than nonrecipients. Thus, Howell and Peterson faced a situation in which attrition rates differed between treatment and control groups. The fact that they administered an academic baseline test prior to random assignment enabled them to assess whether attrition was related to academic ability as gauged in the pre-test. This example provides another instance in which covariates, although not strictly necessary for experimental analysis, play a useful role.

# 7 Natural Experiments and Discontinuity Designs

In contrast to randomized experiments, where a randomization procedure ensures unbiased causal inference, natural experiments and regression discontinuity designs use near-random assignment to approximate experimentation. When evaluating near-random research designs, the key methodological issue is whether the treatment is unrelated to unmeasured determinants of the dependent variable. The plausibility of this claim will depend on substantive assumptions. For example, studies of the effects of lottery windfalls on consumption (Imbens, Rubin, and Sacerdote 2001) and political attitudes (Doherty, Gerber, and Green 2006) have invoked assumptions about the comparability of lottery players who win varying amounts in distinct lotteries over time. Although plausible, these assumptions are potentially flawed (lottery players may change over time), resulting in biased estimates. Note that the mere possibility of bias means that the reported standard errors potentially understate the mean squared error associated with the estimates for the reasons spelled out in Gerber, Green, and Kaplan (2004).

Regression discontinuity designs attempt to address concerns about bias by looking at sharp breakpoints that make seemingly random distinctions between units that receive a treatment and those that do not. When, for example, one gauges the effects of legislature size on municipal spending by taking advantage of the fact that Scandinavian countries mandate changes in legislature size as a step function of local population size (Pettersson-Lidbom 2004), the key assumption is that the change in legislature size is effectively random in the immediate vicinity of a population threshold that necessitates a change in legislative size. When the legislature changes in size as the municipal population grows from 4,999 to 5,000, is there an abrupt change in municipal spending? A hypothetical example of this kind of discontinuity is illustrated in Figure 50.1. In the vicinity of the discontinuity in legislative size, there is an abrupt shift in government expenditures.2 The practical problem with this type of design is that the data are typically sparse in the immediate vicinity of a breakpoint, requiring the analyst to include observations that are farther away and therefore potentially contaminated by other factors. The standard remedy for this problem is to control for these factors through use of covariates, such as polynomial functions of the variable (in this example, population size) on which the discontinuity is based. Pettersson-Lidbom (2004) demonstrates the robustness of his estimates across a variety of specifications, but as always there remains a residuum of uncertainty about whether municipal spending patterns reflect unmeasured aspects of population change.
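The logic of comparing units just below and just above a threshold can be sketched with simulated data. The example below fits a separate straight line on each side of the cutoff within a chosen bandwidth and takes the gap between the two fitted values at the cutoff; this is one simple variant of controlling for a function of the running variable, not the specification used by Pettersson-Lidbom. The function names, the bandwidth, the noise level, and the assumed jump of 2.0 spending units are all hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rd_jump(running, outcome, cutoff, bandwidth):
    """Estimate the discontinuity at `cutoff` using a linear fit on each side."""
    left = [(x - cutoff, y) for x, y in zip(running, outcome)
            if cutoff - bandwidth <= x < cutoff]
    right = [(x - cutoff, y) for x, y in zip(running, outcome)
             if cutoff <= x <= cutoff + bandwidth]
    # The running variable is centered at the cutoff, so each intercept is
    # the fitted outcome at the threshold itself.
    intercept_left, _ = fit_line(*zip(*left))
    intercept_right, _ = fit_line(*zip(*right))
    return intercept_right - intercept_left


# Simulated municipalities; every parameter value here is an assumption.
rng = random.Random(1)
population = [rng.uniform(3000, 7000) for _ in range(4000)]
spending = [10 + 0.001 * p + (2.0 if p >= 5000 else 0.0) + rng.gauss(0, 1.0)
            for p in population]

estimate = rd_jump(population, spending, cutoff=5000, bandwidth=1000)
print(f"estimated jump at the cutoff: {estimate:.2f}")
```

Narrowing the bandwidth keeps the comparison closer to the threshold but leaves fewer observations, which is the trade-off between bias and sparse data noted above.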

Even if one isolates a causal effect produced by a discontinuity, problems of interpretation remain. Posner (2004) argues that ethnic groups residing on the border of Zambia and Malawi provide an opportunity to study the effects of party competition and ethnic coalitions. The argument is that Chewa and Tumbuka peoples are more in conflict on the Malawi side of the border than the Zambian side, because they together comprise a much larger proportion of the overall Malawi electorate in a winner-takes-all presidential system. Here the issue is one of internal validity. Conceivably, the observed contrast in inter-group relations could be attributed to any difference between the two countries. To bolster his interpretation, Posner attempts to rule out differences in economic status, proximity to roads, and colonial history as potential sources of cross-border variation.

Fig. 50.1 Example of a regression discontinuity

Note: In this hypothetical example, inspired by Pettersson-Lidbom (2004), the number of seats in a local legislature increases abruptly when a town passes the population cutoff of 5,000 residents. The hypothesis tested here is whether local government expenditures (the Y-axis) increase in the vicinity of this breakpoint, presumably due to the exogenous change in legislative size.

# 8 Assorted Methodological Issues in the Design and Analysis of Experiments

Various aspects of design, analysis, and inference have important implications for the pursuit of long-term objectives, such as the generation of unbiased research literatures that eventually isolate the causal parameters of interest. This section briefly summarizes several of the leading issues.

## 8.1 Balance and Stratification

Random assignment is designed to create treatment and control groups that are balanced in terms of both observed and unobserved attributes. As N approaches infinity, the experimental groups become perfectly balanced. In finite samples, random assignment may create groups that, by chance, are unbalanced. Experimental procedures remain unbiased in small samples, in that there will be no systematic tendency for imbalance to favor the treatment group. Yet there remains the question of what to do about the prospect or occurrence of observed imbalances between treatment and control groups.

If information about covariates is available in advance of an experiment (perhaps due to the administration of a pre-test), a researcher can potentially reduce sampling variability of the estimated treatment effect by stratification or blocking. For example, if gender were thought to be a strong predictor of an experimental outcome, one might divide the sample into men and women, randomly assigning a certain fraction of each gender group to the treatment condition. This procedure ensures that gender is uncorrelated with treatment assignment. This procedure may be extended to include a list of background covariates. Within each type of attribute (e.g. women over sixty-five years of age, with less than a high school degree), the experimenter randomly assigns observations to treatment and control groups.
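A blocked assignment of this kind might be implemented along the following lines. The sketch is illustrative only: the function `block_randomize`, the subject records, and the 50 percent treatment fraction are assumptions introduced for the example.

```python
import random
from collections import defaultdict


def block_randomize(subjects, block_key, treat_fraction=0.5, seed=None):
    """Randomly assign treatment within blocks defined by `block_key`.

    `subjects` maps a subject id to a dict of pre-treatment attributes.
    Returns a dict mapping subject id to 1 (treatment) or 0 (control).
    """
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for subject_id, attrs in subjects.items():
        blocks[block_key(attrs)].append(subject_id)

    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)  # random order within the block
        n_treat = round(len(members) * treat_fraction)
        for rank, subject_id in enumerate(members):
            assignment[subject_id] = 1 if rank < n_treat else 0
    return assignment


# Hypothetical sample: six women and four men, blocked on gender.
subjects = {i: {"gender": "F" if i < 6 else "M"} for i in range(10)}
assignment = block_randomize(subjects, block_key=lambda a: a["gender"], seed=1)

for gender in ("F", "M"):
    ids = [i for i, attrs in subjects.items() if attrs["gender"] == gender]
    treated = sum(assignment[i] for i in ids)
    print(f"{gender}: {treated} of {len(ids)} assigned to treatment")
```

Because the same fraction of each block is treated, the blocking variable cannot be correlated with assignment in the realized sample. Blocking on several covariates only requires a `block_key` that returns a tuple of attributes.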

Alternatively, the researcher could stratify after the experiment is conducted by including the covariates as control variables in a multivariate model. The inclusion of covariates has the potential to increase the precision with which the treatment effect is estimated so long as the predictive accuracy of the covariates offsets the loss of degrees of freedom. The same principle applies also to pre-stratification; adding the strata-generating covariates as control variables may improve the precision of the experimental estimates by reducing disturbance variance.

Four practical issues accompany the use of covariates. First, although covariates potentially make for more precise experimental estimates, the covariates in question must predict the dependent variable. To the extent that covariates (or nonlinear functions of the covariates or interactions among covariates) have little predictive value, their inclusion may actually make the treatment estimates less precise. The same goes for nonparametric methods such as matching, which suspend the usual assumptions about the linear and additive effects of the covariates, but throw out a certain portion of the observations for which the correspondence between treatment and control covariates is deemed to be inadequate. Second, the use of covariates creates the potential for mischief insofar as researchers have discretion about what covariates to include in their analysis. The danger is that researchers will pick and choose among alternative specifications based on how the results turn out. This procedure has the potential to introduce bias. A third issue concerns potential interactions between the treatment and one or more covariates. Researchers often find subgroup differences. Again, the exploration of interactions raises the distinction between planned and unplanned comparisons. When interactions are specified according to an ex ante plan, the sampling distribution is well defined. When interactions are proposed after the researcher explores the data, the sampling distribution is no longer well defined. The results should be regarded as provisional pending confirmation in a new sample. Fourth, a common use of covariates occurs when the experimenter administers a pretest prior to the start of an intervention. Pre-tests, however, may lead subjects to infer the nature and purpose of the experiment, thereby contaminating the results.

## 8.2 Publication Bias

An important source of bias that afflicts both observational and experimental research is publication bias. If the size and statistical significance of experimental results determine whether they are reported in the scholarly publications, a synthesis of research findings may produce a grossly misleading estimate of the true causal effect. Consider, for example, the distortions that result when an intervention that is thought to contribute to democratic stability is evaluated using a one-tailed hypothesis test. If statistically insignificant results go unreported, the average reported result will exaggerate the intervention’s true effect.

Publication bias in experimental research literatures has several tell-tale symptoms. For one-tailed tests of the sort described above, studies based on small sample sizes should tend to report larger treatment effects than studies based on large sample sizes. The reason is that small-N studies require larger effect sizes in order to achieve statistical significance. A similar diagnostic device applies to two-tailed tests; the relationship between sample size and effect size should resemble a funnel. More generally, the observed sampling distribution of experimental results will not conform to the sampling distribution implied by the nominal standard errors associated with the individual studies. Inconclusive studies will be missing from the sampling distribution. The threat of publication bias underscores the tension between academic norms, which often stress the importance of presenting conclusive results, and scientific progress, which depends on unbiased reporting conventions.
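The relationship between sample size and reported effect size under a significance filter can be illustrated by simulation. In the sketch below, each study's estimate is drawn directly from a normal sampling distribution instead of being computed from simulated subjects, and a study is "published" only if it passes a one-tailed test at the .05 level. The true effect of 0.2, the unit outcome variance, the sample sizes, and the function name are all assumptions chosen for illustration.

```python
import random
from statistics import mean


def simulate_published(n_per_arm, true_effect, n_studies, rng, z_crit=1.645):
    """Simulate study estimates and keep those passing a one-tailed test.

    Assumes a difference in means with unit outcome variance in each arm,
    so the standard error is sqrt(2 / n_per_arm).
    Returns (mean of all estimates, mean of published estimates, share published).
    """
    se = (2.0 / n_per_arm) ** 0.5
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    published = [est for est in estimates if est / se > z_crit]
    return mean(estimates), mean(published), len(published) / n_studies


rng = random.Random(42)
true_effect = 0.2
results = {}
for n_per_arm in (25, 400):
    results[n_per_arm] = simulate_published(n_per_arm, true_effect, 2000, rng)
    all_mean, pub_mean, share = results[n_per_arm]
    print(f"n per arm = {n_per_arm}: mean of all estimates = {all_mean:.3f}, "
          f"mean of published estimates = {pub_mean:.3f}, share published = {share:.2f}")
```

Under these assumptions the full set of estimates centers on the true effect at either sample size, but the published small-N estimates are far larger than the published large-N estimates, which is the symptom described above.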

## 8.3 Extrapolation

Any experiment, whether conducted in the lab or field, raises questions about generalizability. To what extent do the findings hold for other settings? To what extent do they hold for other dosages or types of interventions? These questions raise important epistemological questions about the conditions under which one can safely generalize beyond the confines of a particular intervention and experimental setting. These questions have been the subject of lively, ongoing debate (Heckman and Smith 1995; Moffitt 2004).

Scholars have approached this issue in two complementary ways. The first is to bring to bear theories about the isomorphism of different treatments and contexts. Often, this type of theoretical argument is based on isolating the essential features of a given treatment or context. For example, one might say of the Gerber et al. (2009) study of the effects of newspaper exposure that the key ingredients are the political topics covered by the newspapers, the manner in which they are presented, the care and frequency with which they are read, and the partisanship and political sophistication of the people who are induced to read as a result of the intervention.

The theoretical exercise of isolating what are believed to be the primary characteristics of the treatment, setting, and recipients directs an empirical agenda of experimental extensions. By varying the nature of the treatment, how it is delivered, and to whom, the experimenter gradually assembles a set of empirical propositions about the conditions under which the experimental effect is strong or weak. Note that this empirical agenda potentially unifies various forms of experimental social science by asking whether lab experiments, survey experiments, and natural experiments converge on similar answers and, if not, why the experimental approach generates discrepant results. This process of empirical replication and extension in turn helps refine theory by presenting a set of established facts that it must explain.

## 8.4 Power and Replication

Often field experiments face practical challenges that limit their size and scope. For example, suppose one were interested in the effects of a political campaign commercial on 100,000 television viewers’ voting preferences. Ideally, one would randomly assign small units such as households or blocks to treatment and control groups; as a practical matter, it may only be possible to assign a small number of large geographic units to treatment and control. If individuals living near one another share unobserved attributes that predict their voting preferences, this kind of clustered randomization greatly reduces the power of the study (Raudenbush 1997). Indeed, it is possible to construct examples in which such an experiment has very little chance of rejecting a null hypothesis of no effect.
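The loss of power from clustered assignment can be gauged with the standard design-effect approximation for equal-sized clusters, 1 + (m − 1)ρ, where m is the cluster size and ρ is the intraclass correlation of the outcome. This formula is a textbook approximation and is not taken from the discussion above; the cluster count and the intraclass correlation of 0.05 in the sketch are hypothetical values.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from assigning whole clusters of equal size."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_individuals, cluster_size, icc):
    """Number of independently assigned individuals with equivalent precision."""
    return n_individuals / design_effect(cluster_size, icc)


# Hypothetical design: 100,000 viewers assigned as 20 geographic units of 5,000 each.
n_viewers = 100_000
cluster_size = 5_000
icc = 0.05  # assumed intraclass correlation of voting preferences

deff = design_effect(cluster_size, icc)
n_eff = effective_sample_size(n_viewers, cluster_size, icc)
print(f"design effect = {deff:.2f}, effective sample size = {n_eff:.0f}")
```

With these assumed values, 100,000 viewers carry roughly the information of a few hundred individually randomized subjects, which illustrates how such a study can have little chance of rejecting a null hypothesis of no effect.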

How should one proceed in this case? Although any given study may have limited diagnostic value, the accumulation of studies generates an informative posterior distribution. Unlike the failure to publish insignificant results, the failure to embark on low-power studies does not lead to bias, but it does slow the process of scientific discovery. In the words of Campbell and Stanley (1963, 3), we must

justify experimentation on more pessimistic grounds, not as a panacea, but rather as the only available route to cumulative progress. We must instill in our students the expectation of tedium and disappointment and the duty of thorough persistence, by now so well achieved in the biological and physical sciences.

# 9 Methodological Value of Experiments

Beyond the substantive knowledge that they generate, field experiments have the potential to make a profound methodological contribution. Rather than evaluate statistical methods in terms of the abstract plausibility of their underlying assumptions, researchers in the wake of LaLonde (1986) have made the performance of statistical methods an empirical question, by comparing estimates based on observational data to experimental benchmarks. Experimentation has been buoyed in part by the lackluster performance of statistical techniques designed to correct the deficiencies of observational data. Arceneaux et al. (2006), for example, find a wide disparity between experimental estimates and estimates obtained using observational methods such as matching or regression. This finding echoes results in labor economics and medicine, where observational methods have enjoyed mixed success in approximating experimental benchmarks (Heckman et al. 1998; Concato, Shah, and Horwitz 2000; Glazerman, Levy, and Myers 2003). By making the performance of observational methods an empirical research question, field experimentation is changing the terms of debate in the field of political methodology.

## References

Adams, W. C. and Smith, D. J. 1980. Effects of telephone canvassing on turnout and preferences: a field experiment. Public Opinion Quarterly, 44: 53–83.

Angrist, J. D., Imbens, G. W., and Rubin, D. B. 1996. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91: 444–55.

Ansolabehere, S., Snyder, J. M., Jr., and Stewart, C., III 2000. Old voters, new voters, and the personal vote: using redistricting to measure the incumbency advantage. American Journal of Political Science, 44: 17–34.

Arceneaux, K., Gerber, A. S., and Green, D. P. 2006. Comparing experimental and matching methods using a large-scale voter mobilization experiment. Political Analysis, 14: 1–36.

Benz, M. and Meier, S. 2006. Do people behave in experiments as in the field? Evidence from donations. Institute for Empirical Research in Economics Working Paper No. 248.

Campbell, D. T. and Stanley, J. C. 1963. Experimental and Quasi-Experimental Designs for Research. Boston: Houghton-Mifflin.

Cardy, E. A. 2005. An experimental field study of the GOTV and persuasion effects of partisan direct mail and phone calls. Annals of the American Academy of Political and Social Science, 601: 28–40.

Chapin, F. S. 1947. Experimental Designs in Sociological Research. New York: Harper and Brothers.

Chattopadhyay, R. and Duflo, E. 2004. Women as policy makers: evidence from a randomized policy experiment in India. Econometrica, 72: 1409–43.

Chin, M. L., Bond, J. R., and Geva, N. 2000. A foot in the door: an experimental study of PAC and constituency effects on access. Journal of Politics, 62: 534–49.

Concato, J., Shah, N., and Horwitz, R. I. 2000. Randomized, controlled trials, observational studies, and the hierarchy of research designs. New England Journal of Medicine, 342: 1887–92.

Cover, A. D. and Brumberg, B. S. 1982. Baby books and ballots: the impact of congressional mail on constituent opinion. American Political Science Review, 76: 347–59.

Doherty, D., Gerber, A. S., and Green, D. P. 2006. Personal income and attitudes toward redistribution: a study of lottery winners. Political Psychology, 27: 441–58.

Dunn, G., Maracy, M., and Tomenson, B. 2005. Estimating treatment effects from randomized clinical trials with noncompliance and loss to follow-up: the role of instrumental variables methods. Statistical Methods in Medical Research, 14: 369–95.

Eldersveld, S. J. 1956. Experimental propaganda techniques and voting behavior. American Political Science Review, 50: 154–65.

Gerber, A. S. and Green, D. P. 2000. The effects of canvassing, direct mail, and telephone contact on voter turnout: a field experiment. American Political Science Review, 94: 653–63.

Gerber, A. S., Green, D. P., and Kaplan, E. H. 2004. The illusion of learning from observational research. Pp. 251–73 in Problems and Methods in the Study of Politics, ed. I. Shapiro, R. Smith, and T. Massoud. New York: Cambridge University Press.

Gerber, A. S., Karlan, D. S., and Bergan, D. 2009. Does the media matter? A field experiment measuring the effect of newspapers on voting behavior and political opinions. American Economic Journal: Applied Economics.

Glazerman, S., Levy, D. M., and Myers, D. 2003. Nonexperimental versus experimental estimates of earnings impacts. Annals of the American Academy of Political and Social Science, 589: 63–93.

Gosnell, H. F. 1927. Getting-out-the-Vote: An Experiment in the Stimulation of Voting. Chicago: University of Chicago Press.

Guan, M. and Green, D. P. 2006. Non-coercive mobilization in state-controlled elections: an experimental study in Beijing. Comparative Political Studies, 39: 1175–93.

Habyarimana, J., Humphreys, M., Posner, D., and Weinstein, J. 2007. The Co-ethnic Advantage.

Harrison, G. W. and List, J. A. 2004. Field experiments. Journal of Economic Literature, 42: 1009–55.

Hastings, J. S., Kane, T. J., Staiger, D. O., and Weinstein, J. M. 2005. Economic outcomes and the decision to vote: the effect of randomized school admissions on voter participation. Unpublished manuscript, Department of Economics, Yale University.

Heckman, J. J. and Smith, J. A. 1995. Assessing the case for social experiments. Journal of Economic Perspectives, 9: 85–110.

Heckman, J. J., Ichimura, H., Smith, J., and Todd, P. 1998. Matching as an econometric evaluation estimator. Review of Economic Studies, 65: 261–94.

Holland, P. W. 1986. Statistics and causal inference. Journal of the American Statistical Association, 81: 945–60.

Howell, W. C. and Peterson, P. E. 2004. Uses of theory in randomized field trials: lessons from school voucher research on disaggregation, missing data, and the generalization of findings. American Behavioral Scientist, 47: 634–57.

Hyde, S. D. 2006. Foreign democracy promotion, norm development and democratization: explaining the causes and consequences of internationally monitored elections. Unpublished doctoral thesis, Department of Political Science, University of California, San Diego.

Imbens, G. W. and Rubin, D. B. 1997. Bayesian inference for causal effects in randomized experiments with noncompliance. Annals of Statistics, 25: 305–27.

Imbens, G. W., Rubin, D. B., and Sacerdote, B. I. 2001. Estimating the effect of unearned income on labor earnings, savings, and consumption: evidence from a survey of lottery winners. American Economic Review, 91: 778–94.

Johnson, J. B., Joslyn, R. A., and Reynolds, H. T. 2001. Political Science Research Methods, 4th edn. Washington, DC: CQ Press.

King, G., Keohane, R. O., and Verba, S. 1994. Designing Social Inquiry. Princeton, NJ: Princeton University Press.

Kling, J. R., Ludwig, J., and Katz, L. F. 2005. Neighborhood effects on crime for female and male youth: evidence from a randomized housing voucher experiment. Quarterly Journal of Economics, 120: 87–130.

LaLonde, R. J. 1986. Evaluating the econometric evaluations of training programs with experimental data. American Economic Review, 76: 604–20.

Lijphart, A. 1971. Comparative politics and the comparative method. American Political Science Review, 65: 682–93.

Lowell, A. L. 1910. The physiology of politics. American Political Science Review, 4: 1–15.

Mahoney, J. and Rueschemeyer, D. (eds.) 2003. Comparative Historical Analysis in the Social Sciences. New York: Cambridge University Press.

Manski, C. F. 1990. Nonparametric bounds on treatment effects. American Economic Review Papers and Proceedings, 80: 319–23.

Michelson, M. R. 2003. Getting out the Latino vote: how door-to-door canvassing influences voter turnout in rural central California. Political Behavior, 25: 247–63.

Miguel, E. and Kremer, M. 2004. Worms: identifying impacts on education and health in the presence of treatment externalities. Econometrica, 72: 159–217.

Miguel, E., Satyanath, S., and Sergenti, E. 2004. Economic shocks and civil conflict: an instrumental variables approach. Journal of Political Economy, 112: 725–53.

Miller, R. E., Bositis, D. A., and Baer, D. L. 1981. Stimulating voter turnout in a primary: field experiment with a precinct committeeman. International Political Science Review, 2: 445–60.

Moffitt, R. A. 2004. The role of randomized field trials in social science research: a perspective from evaluations of reforms of social welfare programs. American Behavioral Scientist, 47: 506–40.

Newhouse, J. P. 1993. Free for All? Lessons from the RAND Health Insurance Experiment. Boston: Harvard University Press.

Neyman, J. 1923. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Roczniki Nauk Rolniczych, 10: 1–51; repr. in English in Statistical Science, 5 (1990): 463–80.

Nickerson, D. W. 2008. Is voting contagious? Evidence from two field experiments. American Political Science Review, 102: 49–57.

Olken, B. A. 2005. Monitoring corruption: evidence from a field experiment in Indonesia. NBER Working Paper 11753.

Pechman, J. A. and Timpane, P. M. (eds.) 1975. Work Incentives and Income Guarantees: The New Jersey Negative Income Tax Experiment. Washington, DC: Brookings Institution.

Pettersson-Lidbom, P. 2004. Does the size of the legislature affect the size of government? Evidence from two natural experiments. Unpublished manuscript, Department of Economics, Stockholm University.

Posner, D. N. 2004. The political salience of cultural difference: why Chewas and Tumbukas are allies in Zambia and adversaries in Malawi. American Political Science Review, 98: 529–45.

Raudenbush, S. W. 1997. Statistical analysis and optimal design for cluster randomized trials. Psychological Methods, 2: 173–85.

Rubin, D. B. 1978. Bayesian inference for causal effects: the role of randomization. Annals of Statistics, 6: 34–58.

Rubin, D. B. 1990. Comment: Neyman (1923) and causal inference in experiments and observational studies. Statistical Science, 5 (4): 472–80.

Sears, D. O. 1986. College sophomores in the laboratory: influences of a narrow database on social-psychology’s view of human nature. Journal of Personality and Social Psychology, 51: 515–30.

Sherman, L. W. and Rogan, D. P. 1995. Deterrent effects of police raids on crack houses: a randomized, controlled experiment. Justice Quarterly, 12: 755–81.

Wantchekon, L. 2003. Clientelism and voting behavior: evidence from a field experiment in Benin. World Politics, 55: 399–422.

## Notes:

(1) This procedure satisfies the auxiliary assumption that cases with the same X values have a positive probability of being in either treated or control group. This assumption would be violated if treatment assignment were perfectly predicted by X.

(2) Pettersson-Lidbom (2004) actually finds little evidence of a link between legislature size and government expenditures in Scandinavia.