PRINTED FROM OXFORD HANDBOOKS ONLINE (www.oxfordhandbooks.com). © Oxford University Press, 2018. All Rights Reserved. Under the terms of the licence agreement, an individual user may print out a PDF of a single chapter of a title in Oxford Handbooks Online for personal use (for details see Privacy Policy and Legal Notice).


# Issues in Polling Methodologies: Inference and Uncertainty

## Abstract and Keywords

This chapter presents issues and complications in statistical inference and uncertainty assessment using public opinion and polling data. It emphasizes the methodologically appropriate treatment of polling results as binomial and multinomial outcomes and highlights issues with correctly specifying and explaining the margin of error. The chapter also examines the log-ratio transformation of compositional data, such as proportions of candidate support, as one possible approach to the difficult analysis of such information. The deeply flawed Null Hypothesis Significance Testing (NHST) paradigm is discussed, along with common inferential misinterpretations. The relevance of this discussion is illustrated with specific examples of errors from journalistic sources as well as from academic journals focused on measures of public opinion.

# Introduction

Researchers working with survey research and polling data face many methodological challenges, from sampling decisions to interpretation of model results. In this chapter we discuss some of the statistical issues that such researchers encounter, with the goal of describing underlying theoretical principles that affect such work. We start by describing polling results as binomial and multinomial outcomes so that the associated properties are correctly described. This leads to a discussion of statistical uncertainty and the proper way to treat it. The latter involves the interpretation of variance and the correct treatment of the margin of error. The multinomial description of data is extended into a compositional approach that provides a richer set of models with specified covariates. We then note that the dominant manner of understanding and describing survey research and polling data models is deeply flawed. This is followed by some examples.

The key purpose of this chapter is to highlight a set of methodological issues, clarifying underlying principles and identifying common misconceptions. Many practices are applied without consideration of their possibly deleterious effects. Polling data in particular generate challenges that need introspection.

# Polling Results as Binomial and Multinomial Outcomes

Survey research and public opinion data come in a wide variety of forms. However, for almost all kinds of polling data, two statistical distributions provide the background necessary to understand them fully and to interpret the associated properties correctly: the binomial and multinomial distributions.

The binomial distribution is usually used to model the number of successes in a sequence of independent yes/no trials. In the polling world, it is most useful when analyzing data that can take on only two different values, such as the predicted vote shares in a two-candidate race. The binomial probability mass function (PMF) and its properties are given by

• $\text{PMF: } \mathcal{BN}(x \mid n, p) = \dbinom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n, \quad 0 < p < 1$

• $E[X]=np$

• $Var[X]=np(1−p)$

So the binomial distribution has mean $np$, and the sample proportion $\hat{p} = X/n$ has standard error $\sqrt{p(1-p)/n}$. If $np$ and $n(1-p)$ are both greater than 5, then we can use the normal approximation for $\hat{p}$. One common application of the binomial distribution to public opinion data is the construction of confidence intervals (CIs) around a given estimate or prediction. More specifically, if we are interested in building a $1 - \alpha$ confidence interval around the unknown value of $p$ (i.e., the vote share of candidate 1), then we start with $\hat{p} = \bar{Y}$ and define

$\widehat{SE}(\hat{p}) = \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
(1)

by substituting the estimate $\hat{p}$ for $p$ into the formula for the standard error. Moreover, if we assume that there are $m$ separate binomial experiments, each with $n$ trials, then we can standardize using the z-score for $\hat{p}$:

$z_{\hat{p}} = \dfrac{\hat{p} - p}{\sqrt{p(1-p)/(mn)}}$
(2)

As an example, consider a study with 124 respondents who self-identified as Republican, in which the proportion supporting a particular Republican candidate was 0.46. The standard error of the proportion estimate is then given by

$\widehat{SE}(\hat{p}) = \sqrt{\dfrac{0.46(1-0.46)}{124}} \approx 0.0448$
(3)

The 95% confidence interval for π, the true population proportion, is calculated as follows:

$0.46 \pm 1.96 \times 0.0448 = [0.372,\, 0.548]$
(4)

This means that nineteen times out of twenty we expect an interval constructed this way to cover the true population proportion. More generally, a confidence interval is an interval that, over replications, contains the true value of the parameter $(1-\alpha) \times 100\%$ of the time, on average.
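These hand calculations are easy to verify directly; the following R sketch (variable names are ours) reproduces the standard error and 95% confidence interval for the 124-respondent example above.

```r
# SAMPLE SIZE AND OBSERVED PROPORTION FROM THE EXAMPLE
n <- 124
p.hat <- 0.46
# ESTIMATED STANDARD ERROR OF THE PROPORTION
se.hat <- sqrt(p.hat*(1 - p.hat)/n)
# 95% CONFIDENCE INTERVAL UNDER THE NORMAL APPROXIMATION
ci <- c(p.hat - 1.96*se.hat, p.hat + 1.96*se.hat)
round(ci, 3)
```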

The multinomial distribution is a generalization of the binomial distribution that allows for successes in k different categories. In the polling context, this is useful when working with data that can take on more than just two different values, for example vote shares in a multicandidate race such as primary elections. The multinomial probability mass function and its properties are given by

• $\text{PMF: } \mathcal{MN}(x \mid n, p_1, \ldots, p_k) = \dfrac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}, \quad x_i = 0, 1, \ldots, n, \quad \sum_{i=1}^{k} x_i = n, \quad 0 < p_i < 1$

• $E[Xi]=npi$

• $Var[Xi]=npi(1−pi)$

• $Cov[X_i, X_j] = -np_ip_j, \quad i \neq j$

One useful application of the multinomial distribution is the possibility of making predictions based on its properties. Three months before Donald Trump began his presidential campaign, an ABC News/Washington Post telephone survey of 444 Republicans between March 26 and March 29, 2015,1 gave the proportions for the listed candidates shown in table 13.1.2

Table 13.1 Support for Republican Candidates for President

| Candidate | Proportion |
| --- | --- |
| Jeb Bush | 0.20 |
| Ted Cruz | 0.13 |
| Scott Walker | 0.12 |
| Rand Paul | 0.09 |
| Mike Huckabee | 0.08 |
| Ben Carson | 0.07 |
| Marco Rubio | 0.07 |
| Chris Christie | 0.06 |
| Other | 0.10 |
| Undecided | 0.04 |
| None | 0.03 |


Table 13.2 Covariance with Rand Paul (planned n = 1,000)

| Candidate | $Cov[X_{\text{Paul}}, X_j]$ |
| --- | --- |
| Jeb Bush | −18.0 |
| Ted Cruz | −11.7 |
| Scott Walker | −10.8 |
| Mike Huckabee | −7.2 |
| Ben Carson | −6.3 |
| Marco Rubio | −6.3 |
| Chris Christie | −5.4 |

If we assume that this was a representative sample of likely Republican primary voters, then we can use the properties of the multinomial to make predictions. For example, suppose we intend to put another poll into the field with a planned sample of one thousand likely Republican primary voters and want the expected covariance of Rand Paul's count with that of each of the other candidates, from $Cov[X_i, X_j] = -np_ip_j$. This is shown in table 13.2. Notice from this equation that the covariances are all negative. This makes intuitive sense, since increased support for a chosen candidate has to come from the pool of support for all of the other candidates.
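The covariance calculation follows directly from the multinomial property $Cov[X_i, X_j] = -np_ip_j$; a short R sketch under the planned n = 1,000 (the vector names are ours):

```r
# PLANNED SAMPLE SIZE AND TABLE 13.1 PROPORTIONS
n <- 1000
p.paul <- 0.09
p.others <- c(Bush = 0.20, Cruz = 0.13, Walker = 0.12, Huckabee = 0.08,
              Carson = 0.07, Rubio = 0.07, Christie = 0.06)
# MULTINOMIAL COVARIANCE OF PAUL'S COUNT WITH EACH COMPETITOR'S COUNT
cov.paul <- -n*p.paul*p.others
cov.paul
```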

So in all such contests, gains by a single candidate necessarily come at the expense of other candidates from the multinomial setup. This is less clear in the binomial case, since the PMF is expressed through a series of trials in which a stated event happens or does not happen. With k categories in the multinomial, we get a covariance term between any two outcomes, which is useful in the polling context to understand which candidates’ fortunes most affect others. Of course any such calculation must be accompanied by a measure of uncertainty, since these are inferential statements.

# Explaining Uncertainty

This section discusses how uncertainty is a necessary component of polling and how to properly manage and discuss it. Since pollsters are using samples to make claims about populations, certain information about the uncertainty of this linkage should always be supplied. As part of its “Code of Ethics,” the American Association for Public Opinion Research (AAPOR) provides a list of twenty disclosure items for survey research, ranging from potential sponsors of a given survey to methods and modes used to administer the survey to sample sizes and a description of how weights were calculated.3 Most important, and also reflected in AAPOR’s Transparency Initiative, which was launched in October 2014, the association urges pollsters and survey researchers to provide a number of indicators that allow informed readers to reach their own conclusions regarding the uncertain character of the reported data.4 As this section will show, only a few additional items of information about data quality are crucial in allowing us to interpret results appropriately and with the necessary caution.5

Unfortunately, it is common to provide incomplete statistical summaries. Consider the LA Times article “Two years after Sandy Hook, poll finds more support for gun rights,” which appeared on December 14, 2014 (Kurtis Lee). This short piece described percentages from a Pew Research Center poll6 that took place December 3–7, 2014. In describing the structure of the poll, the article merely stated: “Pew’s overall poll has a margin of error of plus or minus 3 percentage points. The error margin for subgroups is higher.” Additional important information was omitted, including a national sample size of 1,507 adults from all fifty states plus the District of Columbia, 605 landline contacts, 902 cell phone contacts, the weighting scheme, α = 0.05, and more. This section clarifies the important methodological information that should accompany journalistic and academic accounts of polling efforts.

## Living with Errors

Whenever we make a decision, such as reporting an opinion poll result based on statistical analysis, we run the risk of making an error, because these decisions are based on probabilistic, not deterministic, statements. Define first δ as the observed or desired effect size. This could be something like a percentage difference between two candidates or a difference from zero. Conventionally, label the sample size n. With hypothesis testing, either implicit or explicit, we care principally about two types of errors. A Type I Error is the probability of rejecting the null hypothesis of no effect or relationship when it is in fact true. This is labeled α and is almost always set to 0.05 in polling and public opinion research. A Type II Error is the probability of failing to reject the null hypothesis when it is in fact false. This is labeled β. Often we care more about 1 – β, which is called power, than about β itself. The key issue is that these quantities are always traded off against one another through the joint determination of α, δ, β, and n: a smaller α implies a larger β, holding δ and n constant; a larger n allows smaller α and β plus a smaller detectable δ; and fixing α and β in advance (as is always done in prospective medical studies) gives a direct trade-off between the effect size and the data size.

Furthermore, these trade-offs are also affected by the variance in the data, σ². Increasing the sample size decreases the standard error of statistics of interest in proportion to $1/\sqrt{n}$. So the variance can be controlled with sampling to a desired level. The implication for the researcher with sufficient resources is that standard errors can be purchased down to a desired level by sampling enough cases. This, however, assumes that we know or have a good estimate of the true population variance. In advance, we usually do not know the underlying variance of the future data generating process for certain. While academic survey researchers often have at least a rough idea of the expected variance from previous work (be it their own or that of others), their counterparts in the media often have very good estimates of the variance due to much more repetition under similar circumstances.

Most polling results are expressed as percentages, summarizing attitudes toward politicians, proposals, and events. Since percentages can also be expressed as proportions, we can use some simple tools to make these trade-offs between objectives and determine the ideal values for α, δ, β, or n respectively (always depending on the others). Suppose we want to estimate the population proportion that supports a given candidate, π, and we want a standard error that is no worse than σ = 0.05. To test an effect size (support level) of 55%, we hypothesize p = 0.55. This is a Bernoulli setup, so we have a form for the standard error of some estimated $p^$:

$SE(\hat{p}) = \sqrt{\dfrac{p(1-p)}{n}}$
(5)

with a mathematical upper bound of $0.5/\sqrt{n}$. Using the hypothesized effect size, p = 0.55, this means

$SE(\hat{p}) = \sqrt{\dfrac{0.55(1-0.55)}{n}} = \dfrac{0.49749}{\sqrt{n}} \leq 0.05$
(6)

Rewriting this algebraically means that n = (0.49749/0.05)2 = 98.999. So 99 respondents are necessary to test an effect size of 55% with a standard error that is 0.05 or smaller. Again, notice that sample size is important because it is controllable and affects all of the other quantities in a direct way.
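This algebra can be checked in a single line of R (variable names are ours):

```r
# HYPOTHESIZED SUPPORT LEVEL AND TARGET STANDARD ERROR
p <- 0.55
se.target <- 0.05
# SOLVE sqrt(p*(1 - p)/n) = se.target FOR n
n.required <- p*(1 - p)/se.target^2
n.required
```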

Now suppose that we just want evidence that one candidate in a two-candidate race is in the lead. This is equivalent to testing whether π > 0.5, and is tested with the sample proportion $p^=x/n$, where x is the number of respondents claiming to support the candidate of interest. This time we do not have a value for p, so we will use the value that produces the largest theoretical standard error as a way to be as cautious as possible:

$SE(\hat{p}) = \sqrt{\dfrac{p(1-p)}{n}} \leq \sqrt{\dfrac{0.5(1-0.5)}{n}} = \dfrac{0.5}{\sqrt{n}}$
(7)

which maximizes σ due to the symmetry of the numerator: p(1 − p) is largest at p = 0.5. The 95% margin of error is created by multiplying this value by the α = 0.05 critical value under a normal distribution assumption:

$MOE_{95} = 1.96 \times \dfrac{0.5}{\sqrt{n}}$
(8)

which is used to create a reported 95% confidence interval:

$\hat{p} \pm 1.96 \times \dfrac{0.5}{\sqrt{n}}$
(9)

To understand whether there is evidence that our candidate is over 50%, we care about the lower bound of this confidence interval, which can be algebraically isolated,

$\hat{p} - 1.96 \times \dfrac{0.5}{\sqrt{n}} > 0.5 \quad \Longrightarrow \quad n > \left(\dfrac{1.96 \times 0.5}{\hat{p} - 0.5}\right)^2$
(10)

so at $\hat{p} = 0.55$ we need n = 385, and at $\hat{p} = 0.65$ we need only n = 43. This highlights an important principle: the higher the observed sample value, the fewer respondents are needed. If our hypothetical candidate is far in the lead, then we do not need to sample many people, but if both candidates are in a very close race, then more respondents are required to make an affirmative claim at the α = 0.05 level. Now what is the power of the test that the 95% CI will be completely above the comparison point of 0.5? Using a simple Monte Carlo simulation in R with one million draws, hypothesizing $p_0 = 0.55$, and using n = 99, we calculate

```r
# SET THE SIMULATION SAMPLE SIZE
m <- 1000000
# GENERATE m NORMALS WITH MEAN 0.55 AND STANDARD DEVIATION sqrt(0.55*(1-0.55)/99)
p.hat <- rnorm(m, 0.55, sqrt(0.55*(1-0.55)/99))
# CREATE A CONFIDENCE INTERVAL MATRIX THAT IS m * 2 BIG
p.ci <- cbind(p.hat - 1.96*0.5/sqrt(99), p.hat + 1.96*0.5/sqrt(99))
# GET THE PROPORTION OF LOWER BOUNDS GREATER THAN ONE-HALF
sum(p.ci[,1] > 0.5)/m
[1] 0.16613
```

showing that the probability that the complete CI is greater than 0.5 is 0.16613, which is terrible. More specifically, this means that there is only an approximately 17% chance of rejecting a false null hypothesis. Note that we fixed the sample size (99), fixed the effect size (0.55), fixed the significance level (α = 0.05), and got the standard error by assumption, but let the power be realized. How do we improve this number?

Suppose that we were dissatisfied with the result above and wanted n such that 0.8 of the 95% CIs do not cover 0.5 (80% power). We want the scaled difference between the expected lower bound and the threshold to equal the 0.80 quantile of the standard normal cumulative distribution function (CDF):

$\Phi\!\left(\dfrac{E[L] - 0.5}{\sigma/\sqrt{n}}\right) = 0.8$
(11)

Rewriting this gives

$\dfrac{E[L] - 0.5}{\sigma/\sqrt{n}} = \Phi^{-1}(0.8) = 0.8416$
(12)

Since $L = \hat{p} - z_{\alpha/2}(\sigma/\sqrt{n})$ by definition of a confidence interval for the mean, then

$\dfrac{\hat{p} - z_{\alpha/2}(\sigma/\sqrt{n}) - 0.5}{\sigma/\sqrt{n}} = 0.8416$

So we can calculate n by solving the equation:

$n = \left(\dfrac{(z_{\alpha/2} + 0.8416)\,\sigma}{\hat{p} - 0.5}\right)^2 = \left(\dfrac{(1.96 + 0.8416) \times 0.5}{0.55 - 0.5}\right)^2 = 784.9$

meaning that n = 785, using the cautious σ = 0.5.7 Notice that we needed a considerably greater sample size to get a power of 0.8, which is a standard criterion in many academic disciplines. We can also use R to check these calculations:

```r
# SET THE SAMPLE SIZE
n <- 785
# SET THE NUMBER OF SIMULATIONS
m <- 1000000
# CALCULATE THE ESTIMATE OF p
p.hat <- rnorm(m, 0.55, sqrt(0.55*0.45/n))
# CALCULATE THE CONFIDENCE INTERVAL
p.ci <- cbind(p.hat - 1.96*0.5/sqrt(n), p.hat + 1.96*0.5/sqrt(n))
# RETURN THE PROPORTION OF LOWER BOUNDS GREATER THAN 0.5
sum(p.ci[,1] > 0.5)/m
[1] 0.80125
```

Here we fixed the power level (1 – β = 0.8), fixed the effect size (using 0.55), fixed the significance level (α = 0.05), and got the standard error from the binomial assumption, but let the sample size be realized. What are the implications of this power stipulation? Anyone who is considering (or is actively) expending resources to collect samples should at least understand the power implications of the selected sample size. Perhaps a few more cases would considerably increase the probability of rejecting a false null hypothesis. Researchers who are not themselves collecting data generally cannot stipulate a power level, but it should still be calculated in order to fully understand the subsequent inferences being made.

To further illustrate the importance of sample size, suppose we are interested in testing whether support for a candidate is stronger in one state than in another. The standard error for the difference of proportions is

$\sigma_{diff} = \sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}$
(13)

or more cautiously, if we lack information we assume that p1 = p2 = 0.5 to get

$\sigma_{diff} = \sqrt{\dfrac{(0.5)(0.5)}{n_1} + \dfrac{(0.5)(0.5)}{n_2}} = 0.5\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$
(14)

Restricting the sample sizes to be equal gives $\sigma_{diff} = 0.5\sqrt{2/n}$, where n is the sample size in each group. Then for α = 0.05 and 1 – β = 0.8, in the approach where we do not know $p_1$ and $p_2$, the required total sample size (across both groups) is $n = [2.8/(p_1 - p_2)]^2$, where the multiplier 2.8 is simply $z_{\alpha/2} + z_{\beta} = 1.96 + 0.84$. However, if we have the necessary information, this becomes $n = 2[p_1(1-p_1) + p_2(1-p_2)][2.8/(p_1 - p_2)]^2$. Let us assume that we suspect that our candidate has 7.5% more support in California than in Arizona in a national election, and that we want to run two surveys to test this. If the surveys are equal in size, how big must the total sample size be such that there is 80% power and significance at 0.05, if the true difference in proportions is hypothesized to be 7.5%? For the 7.5% to be 2.8 standard errors from zero, we need n > (2.8/0.075)² = 1,393.8. What if the true difference in proportions is hypothesized to be 15%? Now, for the 15% to be 2.8 standard errors from zero, we need n > (2.8/0.15)² = 348.4. Going the other way, what about a hypothesized 2.5% lead? Then n > (2.8/0.025)² = 12,544. This shows again the principle that larger sample sizes are required to reliably detect smaller effect sizes with fixed α and β.
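These cautious total sample sizes can be computed directly in R; a minimal sketch:

```r
# HYPOTHESIZED DIFFERENCES IN PROPORTIONS
delta <- c(0.075, 0.15, 0.025)
# CAUTIOUS TOTAL SAMPLE SIZE (BOTH SURVEYS COMBINED), alpha = 0.05, POWER = 0.8
n.total <- (2.8/delta)^2
round(n.total, 1)
```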

More generally, suppose we state the sample sizes proportionally, q and (1 – q), such that qn is the size of the first group and (1 – q)n is the size of the second group. Now the standard error for difference of proportions is given by

$\sigma_{diff} = \sqrt{\dfrac{p_1(1-p_1)}{qn} + \dfrac{p_2(1-p_2)}{(1-q)n}}$
(15)

which has a cautious upper bound of

$\sigma_{diff} \leq 0.5\sqrt{\dfrac{1}{qn} + \dfrac{1}{(1-q)n}}$
(16)

With a little rearranging, we get

$\sigma_{diff}^2 = \dfrac{0.25}{n}\left(\dfrac{1}{q} + \dfrac{1}{1-q}\right)$
(17)

$n = \dfrac{0.25}{\sigma_{diff}^2}\left(\dfrac{1}{q} + \dfrac{1}{1-q}\right)$
(18)

But this has $\sigma_{diff}$ in the denominator, which requires information about the sample size beyond the proportional split, and we do not have it. This means that we need to rely on an approximation, the Fleiss (1981) equation:

$n_1 = \dfrac{\left(z_{\alpha/2}\sqrt{(r+1)\,\bar{p}(1-\bar{p})} + z_{\beta}\sqrt{r\,p_1(1-p_1) + p_2(1-p_2)}\right)^2}{r\,(p_1 - p_2)^2}, \qquad \bar{p} = \dfrac{p_1 + r\,p_2}{1 + r}, \quad r = \dfrac{1-q}{q}$
(19)

Since this is an estimate rather than a precise calculation, it has additional uncertainty included as part of the process. Unfortunately, since we are missing two quantities (n and σdiff), we need to resort to such a strategy. Obviously this should be noted in any subsequent write-up.
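As a sketch, the standard form of the Fleiss equation can be coded as an R function; the function name, and the choice to center the two proportions at 0.5 in the usage example, are our illustrative assumptions:

```r
# FLEISS-STYLE APPROXIMATE SIZE OF THE FIRST GROUP, WHERE r = n2/n1
# IS THE ALLOCATION RATIO IMPLIED BY q
fleiss.n1 <- function(p1, p2, r = 1, alpha = 0.05, power = 0.8) {
  z.a <- qnorm(1 - alpha/2)
  z.b <- qnorm(power)
  p.bar <- (p1 + r*p2)/(1 + r)
  ((z.a*sqrt((r + 1)*p.bar*(1 - p.bar)) +
    z.b*sqrt(r*p1*(1 - p1) + p2*(1 - p2)))^2)/(r*(p1 - p2)^2)
}
# EQUAL ALLOCATION, 7.5 POINT DIFFERENCE CENTERED AT 0.5
fleiss.n1(0.5375, 0.4625, r = 1)
```

With equal allocation this returns roughly 697 per group, close to the cautious total of 1,393.8 split between the two surveys.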

This section discussed the overt and proper ways that errors should be accounted for and discussed with survey and polling data. When statements are made about statistical analysis of such data, there is always some level of uncertainty, since the results are based on some unknown quantities. Furthermore, the data size, the sample variance, the (observed or desired) effect size, α, and power (1 – β) are all interacting quantities, and trade-offs have to be made. Therefore all aspects of the analysis with regard to these quantities should be reported to readers.

## Treating the Margin of Error Correctly

This section describes in more detail issues that arise in understanding the margin of error in reported results. In advance of the 2016 Democratic presidential primary, a YouGov poll for the Economist asked 325 Democratic registered voters between May 15 and May 18, 2015, to identify their choice,8 producing the percentages shown in table 13.3.

Recall that a margin of error is half of a 95% confidence interval, defined by

$MOE = \pm 1.96\sqrt{\operatorname{Var}(\hat{\theta})}$
(20)

where $\operatorname{Var}(\hat{\theta})$ comes from previous polls, is set by assumption, or is based on the actually observed sample proportions. Note that $\hat{\theta}$ is the random quantity and θ is fixed but unknown. Note further that, given the varying sample proportions in a poll such as the one reported by YouGov, the individual estimates will have individual margins of error associated with them. For example, for Hillary Clinton, the 95% confidence interval would be calculated as follows:

$0.60 \pm 1.96\sqrt{\dfrac{0.60(1-0.60)}{325}} = 0.60 \pm 0.053$
(21)

Table 13.3 Support from Democratic Registered Voters

| Candidate | Democratic Registered Voters |
| --- | --- |
| Clinton | 60% |
| Sanders | 12% |
| Biden | 11% |
| Webb | 3% |
| O’Malley | 2% |
| Other | 1% |
| Undecided | 11% |

N = 325
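Per-candidate margins of error for this poll can be computed in a single R expression (the vector names are ours):

```r
# SAMPLE SIZE AND REPORTED PROPORTIONS FROM TABLE 13.3
n <- 325
p <- c(Clinton = 0.60, Sanders = 0.12, Biden = 0.11, Webb = 0.03, OMalley = 0.02)
# 95% MARGIN OF ERROR FOR EACH CANDIDATE'S ESTIMATE
moe <- 1.96*sqrt(p*(1 - p)/n)
round(moe, 3)
```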

Since 95% is a strong convention in media polling, we restrict ourselves to this level.9 Accordingly, the margin of error for Hillary Clinton’s estimate would be roughly 5.3 points. However, for her potential competitor, Jim Webb, the margin of error would be considerably smaller. More specifically, we would get

$0.03 \pm 1.96\sqrt{\dfrac{0.03(1-0.03)}{325}} = 0.03 \pm 0.019$
(22)

In other words, the margin of error would only be 1.9 points in this case. Despite these differences in margins of error for different statistics in the same poll, media reports of polling results often give only a single margin of error. Per convention, that figure reflects the maximum possible margin of error, which would theoretically apply only to observed sample proportions that are exactly even. While this is a conservative convention that is unlikely to drastically distort results, there is unfortunately also widespread confusion about the interpretation of confidence and margins of error in media reporting, which can be more dangerous. As an example, the following is a generic statement that regularly accompanies polling reports in the New York Times:

In theory, in 19 cases out of 20, the results from such polls should differ by no more than plus or minus four to five percentage points from what would have been obtained by polling the entire population of voters.

This is correct, but misinterpretations are unfortunately extremely common as well. In a piece from the Milwaukee Journal Sentinel by Craig Gilbert, rather tellingly titled “Margin of error can be confusing” (October 11, 2000), we find this seemingly similar statement:

When a poll has a margin of error of 3 percentage points, that means there’s a 95 percent certainty that the results would differ by no more than plus or minus 3 points from those obtained if the entire voting age population was questioned.

This is not true because of the word certainty. Instead, it means that in 95% of replications, we would expect the true parameter to fall into that confidence interval on average. And it gets worse (from the same article):

Let’s say George W. Bush is up by 5 points. It sounds like this lead well exceeds the 3-point margin of error. But in fact, Bush’s support could be off by three points in either direction. So could Al Gore’s. So the real range of the poll is anywhere from an 11-point Bush lead to a 1-point Gore lead.

Here the author assumes that the candidates’ fortunes are independent. However, since losses by one candidate clearly imply gains by others, there is no such independence. This is called compositional data.

To illustrate the inconsistencies that can arise when ignoring the presence of compositional data and to clarify the correct way of interpreting the margin of error in such settings, consider a poll with three candidates: Bush, Gore, and other. The correct distributional assumption is multinomial with parameters [p1, p2, p3], for the true proportion of people in each group. Define [s1, s2, s3] as the sample proportions from a single poll. We are interested in the difference s1s2 for the two leading candidates. The expected value of this difference is p1p2, and the variance is

$\operatorname{Var}(s_1 - s_2) = \dfrac{p_1(1-p_1) + p_2(1-p_2) + 2p_1p_2}{n}$
(23)

where the standard deviation of the difference between the two candidates is the square root of this. Note the cancellation of minus signs: $\operatorname{Cov}(s_1, s_2) = -p_1p_2/n$, so subtracting twice the (negative) covariance adds the term $2p_1p_2/n$. Multiplying the standard deviation by 1.96 gives the margin of error at the 95% confidence level. For specific hypothesis testing of a difference, the z-score is

$z = \dfrac{s_1 - s_2}{\sqrt{\operatorname{Var}(s_1 - s_2)}}$
(24)

which is a simple calculation.

For example, assume that a poll with n = 1,500 respondents reports sBush = 0.47, sGore = 0.42, and sOther = 0.11. The newspaper claims that there is a 5 point difference with a 3% margin of error, so “the real range of the poll is anywhere from an 11-point Bush lead to a 1-point Gore lead.” The actual variance is produced by

$\operatorname{Var}(s_{Bush} - s_{Gore}) = \dfrac{0.47(0.53) + 0.42(0.58) + 2(0.47)(0.42)}{1500} = 0.00059167$
(25)

under the assumption that lost votes do not flow to the “other” candidate. The square root of this variance is 0.0243242. Finally, the margin of error, 1.96 × 0.0243242 = 0.04767543 ≈ 0.0477, is slightly less than the observed difference of 0.05, and therefore Gore could not actually be leading in terms of the 95% confidence interval. In fact, we should instead assume Bush’s lead to be anywhere between 5 − 4.77 = 0.23 and 5 + 4.77 = 9.77 percentage points. The formal hypothesis test (which gives the exact same information in different terms) starts with calculating z = 0.05/0.0243242 = 2.0556, meaning that the test statistic is just far enough into the tail to support a difference at the conventional α = 0.05 level. With n = 1,500 respondents, then, a 5 point observed difference is at the edge of statistical detectability. Suppose we want to calculate the power of this test with α = 0.01. Use the simulation method from above as follows:

```r
# SET THE SIMULATION SAMPLE SIZE
m <- 1000000
# GENERATE m NORMALS WITH MEAN 0.47 AND SD sqrt(0.47*(1 - 0.47)/1500)
p.hat <- rnorm(m, 0.47, sqrt(0.47*(1 - 0.47)/1500))
# CREATE A 0.01 CONFIDENCE INTERVAL MATRIX THAT IS m * 2 BIG
p.ci <- cbind(p.hat - 2.5758*0.5/sqrt(1500), p.hat + 2.5758*0.5/sqrt(1500))
# GET THE PROPORTION OF LOWER BOUNDS GREATER THAN GORE'S SHARE
sum(p.ci[,1] > 0.42)/m
[1] 0.90264
```

So we have a 90% chance of rejecting a false null hypothesis that the two candidates have identical support.
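The compositional margin of error used in this example generalizes to any pair of reported proportions; a minimal R function (the name moe.diff is ours):

```r
# 95% MARGIN OF ERROR FOR THE DIFFERENCE OF TWO MULTINOMIAL PROPORTIONS,
# ACCOUNTING FOR THEIR NEGATIVE COVARIANCE
moe.diff <- function(s1, s2, n) {
  1.96*sqrt((s1*(1 - s1) + s2*(1 - s2) + 2*s1*s2)/n)
}
# BUSH VERSUS GORE EXAMPLE FROM THE TEXT
moe.diff(0.47, 0.42, 1500)
```

This reproduces the 0.0477 margin computed above, as opposed to the naive ±3 points quoted by the newspaper.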

The purpose of this section has been to carefully describe the margin of error and how it is calculated. Since the margin of error is one-half of a confidence interval, its calculation is straightforward, even though the interpretation of the confidence interval is often mistaken. More subtly, with compositional data such as proportions of candidate support, the calculations must be done differently to account for the restriction that they sum to one. Failing to do so yields incorrect summaries that mislead readers.

# Understanding Proportions as Compositional Data

The data type represented by proportions of groups, by candidates, parties, and so forth is compositional. This means that the size of each group is described by a numerical ratio to the whole, and that these relative proportions are required to sum to one. Therefore, not only is the range of possible values bounded, the summation constraint also imposes relatively high (negative) correlations among values, since gains by one group necessarily imply aggregate losses by the others.

The statistical analysis of compositional data is much more difficult than it would initially appear. Since it is impossible to change a proportion without affecting at least one other proportion, these are clearly not independent random variables, and the covariance structure necessarily has negative bias. In fact the “crude” covariance matrix formed directly from a sample compositional data set will have the property that each row and column sum to zero, meaning that there must be at least one negative covariance term in every row and column. This means that correlations are not actually free to take on the full range of values from –1 to 1. Why is this important? Suppose we saw a correlation coefficient of 0.25. Most people would interpret this as indicating a weak relationship (subject to evaluation with its corresponding standard error, of course). However, it is possible that the structure of the compositional data at hand limits this correlation to a maximum of 0.30. Then it would be a strong effect, reaching 5/6 of its maximum possible positive value. Aitchison (1982) notes that these reasons lead to a lack of satisfactory parametric classes of distributions for compositional data.

There are several approaches in the methodological literature that have attempted but failed to develop useful parametric models of compositional data. One of the most common is to apply the Dirichlet distribution (Conner and Mosimann 1969; Darroch and James 1974; Mosimann 1975; James and Mosimann 1980; James 1981), a higher dimension counterpart to the beta distribution for random variables bounded by zero and one. This is a very useful parametrization, but it assumes that each of the proportions is derived from an independent gamma distributed random variable. In addition, the covariance matrix produced from a Dirichlet assumption has a negative bias, because it does not account for the summation restriction. Applying a multinomial distribution is unlikely to prove useful, since it also does not account for the summation requirement and focuses on counts rather than proportions (although this latter problem can obviously be solved with additional assumptions). Finally, linear approaches such as principal components analysis, principal components regression, and partial least squares will not provide satisfactory results, because the probability contours of compositional data are not linear (Hinkle and Rayens 1995).

The best manner for handling compositional data is Aitchison’s (1982) log-ratio contrast transformation. This process transforms the bounded and restricted compositions to Gaussian normal random variates. The primary advantage of this approach is that the resulting multivariate normality, achieved through the transformation and an appeal to the Lindeberg-Feller variant of the central limit theorem, provides a convenient inferential structure even in high dimensional problems.

## The Log-Ratio Transformation of Compositional Data

Compositional data with d categories on the unit interval are represented by a (d – 1)-dimensional simplex: $\mathcal{S}^d = \{(x_1, x_2, \ldots, x_d) : x_1, x_2, \ldots, x_d > 0;\; x_1 + x_2 + \cdots + x_d = 1\}$. This composition vector actually represents only a single data value and is therefore indexed by cases as well, $(x_{i1}, x_{i2}, \ldots, x_{id})$, for a collected data set. A single composition with d categories defines a point in an only (d – 1)-dimensional space, since knowledge of d – 1 components means the last can be obtained from the summation requirement. Often these compositions are created by normalizing data whose sample space is the d-dimensional positive orthant, but in the case of group proportions within an organization, the data are usually provided as racial, gender, or other proportions.

Aitchison (1982) introduced the following log-ratio transformation of the compositions on $\mathcal{S}^d$ to the (d – 1)-dimensional real space, $\mathbb{R}^{d-1}$:

$y_j = \log\!\left(\dfrac{x_j}{x_g}\right), \quad j = 1, \ldots, d, \quad j \neq g$
(26)

where $x_g$ is an arbitrarily chosen divisor from the set of categories. In the case of a data set of compositions, this transformation is applied to each case-vector using the same reference category in the denominator. One obvious limitation is that no compositional value can equal zero. Aitchison (1986) deals with this problem by adding a small amount to zero values, although this can lead to the problem of “inliers”: taking the log of a very small number produces a very large negative value. Bacon-Shone (1992) provides a solution that involves taking the log-ratio transformation on scaled ranks to prevent problems with dividing by or taking the log of zero values. In practice, it is often convenient to collapse categories with zero values into other categories. This works because such categories are typically not the center of interest.
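A minimal sketch of equation (26), including Aitchison’s add-a-small-amount fix for zeros, might look like this (the name `alr` and the `eps` default are our assumptions; a careful implementation would also renormalize after replacing zeros):

```python
import math

def alr(x, g=0, eps=1e-6):
    """Additive log-ratio transform of a composition x with divisor x[g].

    Maps the d-part simplex to (d - 1)-dimensional real space. Zero parts
    are bumped to eps (Aitchison's simple fix); note this can create the
    extreme negative "inliers" discussed in the text.
    """
    x = [max(xi, eps) for xi in x]
    return [math.log(xj / x[g]) for j, xj in enumerate(x) if j != g]

# Three-party example with the first party as the reference category:
y = alr([0.50, 0.30, 0.20])  # [log(0.6), log(0.4)]
```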

The log-ratio transformation shares the well-known linear transformation theory of multinomial distributions and has the class-preservation property that its distributional form is invariant to the choice of divisor category (Aitchison and Shen 1980). This means that the researcher can select the divisor reference category without regard for distributional consequences. The sample covariance matrix for the log-ratio transformed composition is mathematically awkward, so Aitchison (1982) suggests a “variation matrix” calculated term-wise by

$$\tau_{ij} = \operatorname{var}\!\left[\log\!\left(\frac{x_i}{x_j}\right)\right], \tag{27}$$

which is symmetric with zeros on the diagonal. Each entry measures the variability of the ratio of $x_i$ to $x_j$, which are vectors of proportions measured over time, space, or a randomized block design. Note that the entries are no longer truncated by the bounds that constrained the covariance matrix in the untransformed compositional form.
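Equation (27) can be computed directly; this sketch (the helper name is ours, and it uses the population variance) builds the variation matrix for a small set of hypothetical three-part compositions:

```python
import math
from statistics import pvariance

def variation_matrix(X):
    """Aitchison variation matrix for compositions X (one row per case).

    tau[i][j] is the variance over cases of log(x_i / x_j); the result is
    symmetric with a zero diagonal and is not bounded above.
    """
    d = len(X[0])
    logs = [[math.log(v) for v in row] for row in X]
    tau = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            # log(x_i / x_j) computed as a difference of logs:
            tau[i][j] = pvariance([r[i] - r[j] for r in logs])
    return tau

# Three cases of a three-part composition (illustrative numbers):
X = [[0.50, 0.30, 0.20], [0.60, 0.25, 0.15], [0.55, 0.28, 0.17]]
tau = variation_matrix(X)
```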

Aitchison further suggests that inference can be developed by appealing to the central limit theorem such that Y ~ MVN (μ, Σ). This is not an unreasonable appeal, since the Lindeberg-Feller central limit theorem essentially states that convergence to normality is assured, provided that no variance term dominates in the limit (Lehmann 1999, app. A1). This is guaranteed, since we start with bounded compositional data prior to the transformation.

To illustrate the application of Aitchison’s log-ratio contrast transformation, we use survey data from the fourth module of the Comparative Study of Electoral Systems (CSES).10 More specifically, in order to study the popular question of whether parties benefit from positioning themselves close to the mean voter position along the left-right scale, we employ two different questions that ask respondents to place themselves and each of their national parties on an eleven-point scale ranging from 0 (left) to 10 (right).11 Based on these questions, we first determine the mean voter position for a given country election by averaging all respondents’ left-right self-placements. We then compute party positions by calculating each party’s average placement. Our covariate of interest is then simply the absolute policy distance between each party’s position and the mean voter position in the respective election. Previous studies have repeatedly shown that as this policy distance increases, parties in established democracies tend to suffer electorally (Alvarez et al. 2000; Dow 2011; Ezrow et al. 2014).

To measure our outcome variable (party success), we employ two different techniques. The first is simply a given party’s observed vote share in the current lower house election. The second relies on the CSES surveys and is based on a question in which respondents indicate their vote choice in the current lower house election.12 Based on all nonmissing responses, we calculate each party’s vote share by dividing the number of respondents who indicated that they voted for a given party by the number of all respondents who indicated that they voted for any party in the respective country. We then apply Aitchison’s log-ratio transformation to both measures of party success, using the first party in each country’s CSES coding scheme (usually the largest party) as the reference category. Table 13.4 lists all these measures for the U.S. presidential election in 2012.
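The transformed values in Table 13.4 can be reproduced directly from the reported shares, taking the Democratic Party as the reference (divisor) category:

```python
import math

# Log-ratio transform of the Republican shares relative to the Democratic
# reference category, as reported in Table 13.4:
rep_vote = math.log(47.10 / 48.40)   # observed vote shares
rep_cses = math.log(30.91 / 69.09)   # CSES indicated vote shares

print(round(rep_vote, 3), round(rep_cses, 3))  # -0.027 -0.804
```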

Table 13.5 presents the results of four OLS models that regress the different measures of party success on a party’s distance to the mean voter position. As expected, in all four model specifications, the coefficient estimate for policy distance is negative, indicating that as a party’s distance from the mean voter position increases, that party tends to lose public support. However, the more interesting part of this exercise is the effect of the log-ratio transformation on the results: using both the observed vote share and the CSES-based measure of indicated vote share, accounting for the compositional nature of the data by applying Aitchison’s transformation leads to a loss in reliability of the estimated coefficients. In other words, with this specific data set and model specification, not considering the compositional characteristics of the data at hand would lead journalists or scholars to potentially overestimate the reliability of their findings.13

Table 13.4 (Transformed) Vote Shares and Indicated Vote Shares, CSES USA 2012

| Party | Vote Share | Transformed Vote Share | CSES Indicated Vote (N) | CSES Indicated Vote (%) | Transformed CSES Indicated Vote |
|---|---|---|---|---|---|
| Democratic Party | 48.40 | 0 | 921 | 69.09 | 0 |
| Republican Party | 47.10 | −.027 | 412 | 30.91 | −.804 |
| Missing | | | 596 | | |

Table 13.5 The Effect of Policy Distance on Vote Share (CSES)

| | Vote Share | Transformed Vote Share | CSES Indicated Vote (%) | Transformed CSES Indicated Vote |
|---|---|---|---|---|
| Policy Distance | −2.35 (1.30) [−5.18; .48] | −.11 (.08) [−.29; .07] | −.03 (.01) [−.06; −.00] | −.07 (.14) [−.37; .22] |
| Constant | 18.54 (2.88) [12.27; 24.82] | −1.09 (.24) [−1.62; −.57] | .21 (.03) [.15; .27] | −1.36 (.28) [−1.97; −.75] |
| Observations | 81 | 81 | 87 | 87 |
Note: The table reports estimated coefficients from OLS regressions and robust standard errors (clustered by election) in parentheses. 95% confidence intervals are reported in brackets. The four different outcome variables are defined in the text.

Extending the previous discussion of the multinomial setup, this section has highlighted the unique challenges that researchers and journalists face when working with compositional data such as vote shares or proportions of party support. The summation constraint of compositional data requires different techniques if we want to convey results and the uncertainty associated with them correctly. Aitchison’s log-ratio contrast transformation offers one such approach, which we recommend here.14
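To make this recommendation concrete, here is a hedged end-to-end sketch using entirely simulated data (not the CSES data): the focal party’s vote share declines with policy distance, the shares are log-ratio transformed with the rival party as the divisor, and OLS is fit by the closed-form solution. All names and numeric settings are our own illustration.

```python
import math
import random

random.seed(1)

# Simulate a two-party setting: the focal party's share falls with its
# policy distance from the mean voter (assumed, illustrative numbers).
n = 200
dist = [random.uniform(0, 5) for _ in range(n)]
share = [min(max(0.5 - 0.05 * d + random.gauss(0, 0.03), 0.01), 0.99)
         for d in dist]

# Log-ratio transform with the rival party as the divisor category:
y = [math.log(s / (1 - s)) for s in share]

# OLS slope and intercept via the usual closed form:
mx, my = sum(dist) / n, sum(y) / n
beta = (sum((x - mx) * (v - my) for x, v in zip(dist, y))
        / sum((x - mx) ** 2 for x in dist))
alpha = my - beta * mx
```

As in Table 13.5, the estimated slope on policy distance comes out negative because the simulated shares shrink with distance.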

# The Null Hypothesis Significance Test

This section discusses problems with the frequently used Null Hypothesis Significance Test (NHST). The key problem is that this procedure does not inform results in the way that many people assume. Such interpretation problems cause readers to believe that results are more reliable than they likely are. This was first discussed in political science by Gill (1999), followed by Ward et al. (2010) and Rainey (2014). Objections to the use of the NHST go all the way back to Rozeboom (1960), who described it as a “strangle-hold,” and Bakan (1966), who called it “an instance of the kind of essential mindlessness in the conduct of research.” Most of the early objections came from scholars in psychology, who have generated literally hundreds of articles and book chapters describing the problems with the NHST. Yet it endures and dominates in studies with survey research and polling data. Why? There are two main reasons. First, “it creates the illusion of objectivity by seemingly juxtaposing alternatives in an equivalent manner” (Gill 1999). So it looks and feels scientific. Second, faculty unthinkingly regurgitate it to their graduate students (and others), who graduate, get jobs, and repeat the cycle. Hardly a Kuhnian (1996) path of scientific progress. So the NHST thrives for pointless reasons.

To get a better understanding of the problems that commonly arise with respect to the NHST, we briefly describe some of the major flaws:

1. The basis of the NHST is the logical argument of modus tollens (denying the consequent), which makes an assumption, observes some real-world event, and then determines the consistency of the assumption by checking it against the observation:

If X, then Y.

Y is not observed.

Therefore, not X.

The problem with modus tollens as part of the NHST is that its usual certainty statements are replaced with probabilistic ones:

If X, then Y is highly likely.

Y is not observed.

Therefore, X is highly unlikely.

While this logic might seem plausible at first, it actually turns out to be a fallacy. Observing data that are atypical under a given assumption does not imply that the assumption is likely false. In other words, almost a contradiction of the null hypothesis does not imply that the null hypothesis is almost false. The following example illustrates the fallacy:

If a person is an American, then it is highly unlikely that she is the President of the United States.

The person is the President of the United States.

Therefore, it is highly unlikely that she is an American.
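The fallacy can be made numeric with rough, assumed population figures (ours, not the chapter’s):

```python
# Roughly 330 million Americans, one of whom is President (assumed figure):
n_americans = 330_000_000
p_president_given_american = 1 / n_americans  # "If American, President is highly unlikely"
p_american_given_president = 1.0              # yet the reverse conditional is certain

# The probabilistic modus tollens reasons from the tiny forward probability
# to "the President is probably not American," which is plainly false.
```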

2. The inverse probability problem highlights a common problem in interpreting the NHST. It is a widespread belief that the smaller the p-value, the greater the probability that the null hypothesis is false. According to this incorrect interpretation, the NHST produces P(H0|D), the probability of H0 being true given the observed data D. However, the NHST actually first assumes H0 to be true and then asks for the probability of observing D or more extreme data. This is clearly P(D|H0). However, P(H0|D) would in fact be the more desirable test, as it could be used to find the hypothesis with the greatest probability of being true given some observed data. Bayes’s law allows for a better understanding of the two unequal probabilities:

$$P(H_0|D) = \frac{P(D|H_0)\,P(H_0)}{P(D)}. \tag{28}$$

As a consequence, P(H0|D) = P(D|H0) is only true if P(H0) = P(D), for which we usually do not have any theoretical justification. Unfortunately, P(H0|D) is what people want from an inferential statement. A practical consequence of this misunderstanding is the belief that three stars behind a coefficient estimate imply that the null is less likely than if the coefficient had only one star, even though the whole regression table is created under the initial assumption that the null is in fact true.
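A toy calculation with assumed numbers (a 50/50 prior and modest power, both our own choices) shows how far apart the two sides of equation (28) can be:

```python
# Assumed values: prior P(H0) = 0.5; a result with P(D|H0) = 0.05
# (a "significant" p-value) but modest power, P(D|H1) = 0.30.
p_h0, p_d_h0, p_d_h1 = 0.5, 0.05, 0.30

p_d = p_d_h0 * p_h0 + p_d_h1 * (1 - p_h0)  # law of total probability
p_h0_d = p_d_h0 * p_h0 / p_d               # Bayes's law

print(round(p_h0_d, 3))  # 0.143: nearly three times the 0.05 many readers infer
```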

3. There are two common misconceptions about the role of sample size in the NHST. First is the belief that statistical significance in a large-sample study implies substantive real-world importance. This is a concern in polling and public opinion, because it implies a bias against work on small or difficult-to-reach populations, which inherently allow only smaller sample sizes and therefore larger p-values. The correct interpretation is that as the sample size increases, we are able to distinguish progressively smaller population effect sizes. Second is the interpretation that for a given p-value in a study that rejects the null hypothesis, a larger sample size implies a more reliable result. This is false, as two studies that reject the null hypothesis with the same p-value are equally likely to make a Type I error, which is independent of their sample size.15
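The first misconception can be illustrated with a short calculation: holding a modest effect and its standard deviation fixed, only the larger sample crosses the conventional 0.05 threshold. The helper `z_test_p` is our own sketch of a two-sided one-sample z test:

```python
import math

def z_test_p(effect, sd, n):
    """Two-sided p-value for a one-sample z test of a mean against zero."""
    z = effect / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

# The same effect of 0.1 standard deviations, at two sample sizes:
p_small = z_test_p(0.1, 1.0, 100)    # about 0.317: not "significant"
p_large = z_test_p(0.1, 1.0, 1000)   # about 0.002: "significant"
```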

4. A fourth criticism of the NHST is based on its asymmetrical nature. If the test statistic is sufficiently atypical given the null hypothesis, then the null hypothesis is rejected. However, if the test statistic is not sufficiently atypical, then the null hypothesis is not accepted. In other words, H1 is held innocent until shown guilty, whereas H0 is held guilty until shown innocent. As a consequence, failing to reject the null hypothesis does not rule out an infinite number of other competing research hypotheses. A nonrejected null hypothesis essentially provides no information about the world; it means only that given the observed data, one cannot make any assertion about the relationship. A serious misinterpretation can arise from this asymmetry: the incorrect belief that a nonstatistically significant effect is evidence that the effect is zero. However, lack of evidence of an effect is not evidence of a lack of an effect. If published, such an incorrect statement (that the hypothesized relationship does not exist) damages future knowledge: unless researchers are clearly aware of the error, it will discourage them from investigating the effect using other data or models, and they will move on to new hypothesized relationships, since the initial effect has already been “shown” not to exist.

There are more problems with the NHST, including the arbitrariness of α, its bias in the model selection process, the fallacy of believing that one minus the p-value is the probability of replication, the problems it causes with regard to cross-validation studies, and its detachment from actual substantive significance (see Gill 1999 or Ziliak and McCloskey 2008). However, the four problems highlighted here and the examples in the next section should be enough to demonstrate the flawed nature of the NHST and warrant either very cautious use of it or—even better—a switch to principled alternatives.

# Polling Examples

To illustrate some of the mistakes that are commonly made when scholars and journalists encounter nonrejected null hypotheses, we analyzed all twenty issues of Public Opinion Quarterly (POQ) published over the last five years (volume 74 in 2010 to volume 78 in 2014). More specifically, we searched for the expression “no effect” in all articles, research notes, and research syntheses and found it in 31 of 168 manuscripts (18.5%).16 Not all of those cases are necessarily problematic. In fact, many of them are referring to previous research and summarize earlier studies as finding no effects for a given hypothesized relationship.

Nonetheless, a number of cases are directly related to nonrejected null hypotheses and draw either implicit or explicit conclusions. While some are more carefully worded than others, all are methodologically problematic. Examples of somewhat careful wordings include formulations that do not unequivocally rule out any effect at all, but are a bit more cautious in describing their results. For example, in an article on voting technology and privacy concerns, the authors find that “being part of the political minority had little to no effect on post-election privacy judgments” (POQ 75; emphasis added). Similarly, in their study on different survey devices, another set of authors conclude that “[a]mong those who did not answer one or more items, there appears to be no effect from device on the number of items not answered” (POQ 75; emphasis added).

Other articles contain both cautiously and not so cautiously worded conclusions. For example, in an analysis of interviewer effects, the authors first describe a model that “also accounts for area variables, which have virtually no effect on either the interviewer-level variance or the DIC diagnostic,” but then later on incorrectly claim that “interviewer gender has no effect among male sample units” (POQ 74; emphasis added). A similarly problematic combination of conclusions can be found in another article on modes of data collection, in which the authors first correctly state “that very few of the study characteristics are significantly correlated with the observed positivity effect,” but then in the very next sentence wrongly state that “there are no effects on [odds ratios] for the negative end of the scale” (POQ 75; emphasis added).

These types of absolutist conclusions that claim a null effect based on a nonrejected null hypothesis are the most problematic, and we find them in POQ articles in each of the last five years. In 2010 a study claimed that “[h]ousehold income has no effect on innumeracy” (POQ 74). In 2011 a set of authors concluded that “[r]esidential racial context had no effect on changes in perception” (POQ 75). The next year, an article stated that “the number of prior survey waves that whites participated in had no effect on levels of racial prejudice” (POQ 76), and in the subsequent year two authors claimed that “fear has no effect on [symbolic racism]” (POQ 77). Examples from 2014 include the conclusions that “[f]or low-sophistication respondents who were unaware of the ACA ruling, conservatism has no effect at all on Supreme Court legitimacy”; that “[attitude importance] had no effect on the balance of pro and con articles read”; and that “[g]ender and marital status have no effect on perceptions of federal spending benefit” (POQ 78).

However, there are also articles that correctly deal with nonrejected null hypotheses. For example, in a study on the effect of issue coverage on the public agenda, the author correctly interprets the analysis with conclusions such as “[t]he null hypothesis that Clinton coverage had no effect cannot be rejected,” or “we cannot confidently reject the null hypothesis that President Clinton’s coverage had no effect on public opinion” (POQ 76). This is exactly how failing to reject the null hypothesis should be interpreted. Given the asymmetrical setup of the NHST, a nonstatistically significant effect does not imply that the effect is (near) zero. Instead, it merely allows us to conclude that we cannot reject the null hypothesis.

The implication of the errors outlined here is that less savvy readers (or even sophisticated readers under some circumstances) will take away the message that the corresponding data and model have “shown” that there is no relationship. Returning to the quoted example above, “interviewer gender has no effect among male sample units,” the incorrect message is that interviewer gender does not matter, whereas it could matter with different but similar data/models, under different interviewing circumstances, when the questions are about gender, in different age or race groups, and so forth. As stated previously, publishing this mistake will have a chilling effect on future research unless the future researchers are clearly aware that the statement is in error. Errors of this kind may result from general sloppiness by authors, but the resulting effect is exactly the same.

# Conclusion

Survey research and polling are conducted by both academics and practitioners, and methodological training varies considerably between these groups. Here we have attempted to explain some underlying statistical principles that improve the interpretation of results from models and summaries. We have also tried to describe problematic procedures and practices that lead to misleading conclusions. Some of these are relatively benign, but others change how readers of the subsequent work interpret findings. A major theme in this process is correctly considering the uncertainty that is inherent in working with these kinds of data. This uncertainty comes from sampling procedures, instrument design, implementation, data complexity, missingness, and model choice. Often it cannot be avoided, which makes it all the more important to analyze and discuss it appropriately. A second theme is the correct manner of understanding and reporting results. All statistical tests involve Type I and II errors, effect sizes, and a set of assumptions. Not considering all of these appropriately leads to unreliable conclusions about candidate support, the effect of covariates on choice, trends, and future predictions. We hope that we have provided some clarity on these issues.

## References

Aitchison, J. 1982. “The Statistical Analysis of Compositional Data.” Journal of the Royal Statistical Society, Series B 44: 139–177.

Aitchison, J. 1986. The Statistical Analysis of Compositional Data. London: Chapman & Hall.

Aitchison, J., and S. M. Shen. 1980. “Logistic-Normal Distributions: Some Properties and Uses.” Biometrika 67: 261–272.

Alvarez, R. M., J. Nagler, and S. Bowler. 2000. “Issues, Economics, and the Dynamics of Multiparty Elections: The 1997 British General Election.” American Political Science Review 94: 131–149.

Bacon-Shone, J. 1992. “Ranking Methods for Compositional Data.” Applied Statistics 41 (3): 533–537.

Bakan, D. 1966. “The Test of Significance in Psychological Research.” Psychological Bulletin 66: 423–437.

Conner, R. J., and J. E. Mosimann. 1969. “Concepts of Independence for Proportions with a Generalization of the Dirichlet Distribution.” Journal of the American Statistical Association 64: 194–206.

Darroch, J. N., and I. R. James. 1974. “F-Independence and Null Correlations of Continuous, Bounded-Sum, Positive Variables.” Journal of the Royal Statistical Society, Series B 36: 467–483.

Dow, Jay K. 2011. “Party-System Extremism in Majoritarian and Proportional Electoral Systems.” British Journal of Political Science 41: 341–361.

Ezrow, L., J. Homola, and M. Tavits. 2014. “When Extremism Pays: Policy Positions, Voter Certainty, and Party Support in Postcommunist Europe.” Journal of Politics 76: 535–547.

Fleiss, J. L. 1981. Statistical Methods for Rates and Proportions. 2nd ed. New York: Wiley.

Gill, J. 1999. “The Insignificance of Null Hypothesis Significance Testing.” Political Research Quarterly 52: 647–674.

Hinkle, J., and W. Rayens. 1995. “Partial Least Squares and Compositional Data: Problems and Alternatives.” Chemometrics and Intelligent Laboratory Systems 30: 159–172.

James, I. R. 1981. “Distributions Associated with Neutrality Properties for Random Proportions.” In Statistical Distributions in Scientific Work, edited by C. Taille, G. P. Patil, and B. Baldessari, 4:125–136. Dordrecht, Holland: D. Reidel.

James, I. R., and J. E. Mosimann. 1980. “A New Characterization of the Dirichlet Distribution Through Neutrality.” Annals of Statistics 8: 183–189.

Kuhn, T. S. 1996. The Structure of Scientific Revolutions. 3rd ed. Chicago: University of Chicago Press.

Lehmann, E. L. 1999. Elements of Large-Sample Theory. New York: Springer-Verlag.

Mosimann, J. E. 1975. “Statistical Problems of Size and Shape: I, Biological Applications and Basic Theorems.” In Statistical Distributions in Scientific Work, edited by G. P. Patil, S. Kotz, and J. K. Ord, 187–217. Dordrecht, Holland: D. Reidel.

Pawlowsky-Glahn, V., and A. Buccianti. 2011. Compositional Data Analysis: Theory and Applications. Chichester, UK: Wiley.

Rainey, C. 2014. “Arguing for a Negligible Effect.” American Journal of Political Science 58: 1083–1091.

Rozeboom, W. W. 1960. “The Fallacy of the Null Hypothesis Significance Test.” Psychological Bulletin 57: 416–428.

Ward, M. D., B. D. Greenhill, and K. M. Bakke. 2010. “The Perils of Policy by P-Value: Predicting Civil Conflicts.” Journal of Peace Research 47: 363–375.

Ziliak, S. T., and D. N. McCloskey. 2008. The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor: University of Michigan Press.

## Notes:

(2.) The reported margin of sampling error was ±3.5 percentage points.

(5.) The mathematical discussion below is based on the assumption that random samples do indeed reflect a random sample of the population of interest. While this assumption is commonly made, it clearly does not hold for opt-in Internet-based surveys and can be seriously doubted for conventional surveys with high levels of nonresponse. A recent discussion of the problems that can arise in these settings can be found at http://www.huffingtonpost.com/2015/02/03/margin-of-error-debate_n_6565788.html and http://www.washingtonpost.com/blogs/monkey-cage/wp/2015/02/04/straight-talk-about-polling-probability-sampling-can-be-helpful-but-its-no-magic-bullet/.

(7.) This can be calculated in R using qnorm(0.8) = 0.84162; qnorm is the standard normal quantile function, returning the value $y$ that satisfies $\Phi(y) = \int_{-\infty}^{y} f_N(x)\,dx = 0.8$.

(9.) However, it is important to note that there is nothing theoretical or fundamental about this number; it is simply a common convention.

(10.) Our data come from the second advance release of Module 4 from March 20, 2015, which covers election studies from a total of seventeen different countries. http://www.cses.org/datacenter/module4/module4.htm.

(11.) The exact question wording is: “In politics people sometimes talk of left and right. Where would you place [YOURSELF/PARTY X] on a scale from 0 to 10 where 0 means the left and 10 means the right?”

(12.) For the 2012 French and U.S. elections, we used the respondents’ vote choice in the first round of the current presidential elections.

(13.) Moreover, the eclectic collection of countries covered in this advance release of the CSES Module 4 and the overly simplistic model specifications might cause the effects described above to be weaker than one would usually expect.

(14.) For a far more comprehensive discussion of the field and different techniques, see Pawlowsky-Glahn and Buccianti (2011).

(15.) This misconception results from a misunderstanding of Type II errors. If two studies are identical in every way apart from their sample size, and both fail to reject the null hypothesis, then the larger sample size study is less likely to make a Type II error.

(16.) When also including the four special issues of POQ that were published during that time, we find 34 of 203 articles include the term (16.7%).