Show Summary Details

Page of

PRINTED FROM OXFORD HANDBOOKS ONLINE (www.oxfordhandbooks.com). © Oxford University Press, 2018. All Rights Reserved. Under the terms of the licence agreement, an individual user may print out a PDF of a single chapter of a title in Oxford Handbooks Online for personal use (for details see Privacy Policy and Legal Notice).

Subscriber: null; date: 16 January 2019

Bayesian analysis in item response theory applied to a large-scale educational assessment

Abstract and Keywords

This article discusses the use of a Bayesian model that incorporates differential item functioning (DIF) in analysing whether cultural differences may affect the performance of students from different countries in the various test items which make up the OECD’s Programme for International Student Assessment (PISA) test of mathematics ability. The PISA tests in mathematics and other subjects are used to compare the educational attainment of fifteen-year old students in different countries. The article first provides a background on PISA, DIF and item response theory (IRT) before describing a hierarchical three-parameter logistic model for the probability of a correct response on an individual item to determine the extent of DIF remaining in the mathematics test of 2003. The results of Bayesian analysis illustrate the importance of appropriately accounting for all sources of heterogeneity present in educational testing and highlight the advantages of the Bayesian paradigm when applied to large-scale educational assessment.

Keywords: differential item functioning (DIF), cultural differences, Programme for International Student Assessment (PISA), mathematics, educational attainment, item response theory (IRT), three-parameter logistic model, Bayesian analysis, educational testing, educational assessment

22.1 Introduction

The OECD’s 1 Programme for International Student Assessment (PISA) collects information, every three years, about 15-year-old students in several countries around the world. It examines how well students are prepared to meet the challenges of the future, rather than how well they master particular curricula. The data collected in PISA surveys contains valuable information for researchers, policy makers, educators, parents and students. It allows countries to assess how well their 15-year-old students are prepared for life in a large context and to compare themselves to other countries. As it is mentioned in PISA 2003 Technical Report2, it is now recognized that the future economic and social well-being of countries is closely linked to the knowledge and skills of their population.

The first PISA survey was in 2000 and 43 countries participated in the assessment. The others were in 2003 with 41 countries, in 2006 with 57 countries and in 2009 with 67 countries. At least one country from each continent is present in the surveys, the majority being from Europe. A list with the countries participating in 2003 is presented in Appendix B.

Since PISA is applied to a large number of countries, many differences are found among the students participating, from cultural to curricular ones. These differences may influence the characteristics of some items in each country. For example, if an item is strongly related to some cultural aspect of an specific country, it is expected that this item would be easier for students from this country than for students from another one where such cultural aspect is not present. If this difference is not taken into account, such item is not good to assess the abilities of students from these two countries. Such phenomenon is called differential item functioning (DIF) and has been heavily studied in the (p. 625) psychometric literature. In general, one group is fixed as the reference group and the other ones as the focal groups, and the item functioning in the latter ones is compared to the item functioning in the reference group.

DIF analysis can be separated in two different stages. The first one is the detection of items with DIF and the second one is the quantification and explanation of the DIF detected in the first stage. Normally, the latter one is based on regression structures to identify features of the items that present differential functioning. Usually, the first stage is divided in two substages: the estimation of the students abilities or a matching of the students using groups, where students in the same group are assumed to have the same ability; and the DIF detection using these estimated abilities. Both substages are linked because, in general, it is impossible to generate abilities which are totally free of DIF. Therefore, the most natural approach would be to treat them together.

However, along with the integrated modelling of the DIF, comes the problem of model identifiability, which reflects the conceptual difficulty of the problem. For this reason, additional hypotheses on the DIF parameters of the model are assumed in order to guarantee identifiability. Nevertheless, such hypotheses may be too restrictive.

Situations like this, where, besides the information from the data, initial hypotheses on the parameters are needed, are very suitable for Bayesian approaches. The use of appropriate prior distributions, elicited in a Bayesian framework and based on previous knowledge on the items’ DIF, can be less restrictive than usual hypotheses, but still enough to guarantee identifiability.

Preparation of a large test like PISA somehow screens the DIF to avoid such items to compose the test. It is then not expected that items with large DIF, that have a great influence on the proficiencies, are found. On the other hand, it is impossible to eliminate all DIF in the tests, specially in such large-scale tests.

This chapter presents a DIF analysis of the Mathematics test in PISA 2003, for the English speaking countries, using an integrated Bayesian model. Such model allows items to present DIF without affecting the quality of the estimated proficiencies. Moreover, the DIF analysis may be useful to detect educational differences among the countries.

22.2 Programme for International Student Assessment (PISA)3

22.2.1 What does PISA assess?

PISA is based on a dynamic learning model that takes into account the fact that knowledge and ability must be continuously acquired to have a successful (p. 626) adaptation to a constant changing world. Therefore, PISA aims to assess how much knowledge and ability, essential for an effective participation in society, is acquired by students near the end of compulsory education. Differently from previous international assessments (IEA, TIMMS, OREALC, etc.),4 PISA does not focus only in curricular contents. It also emphasizes the knowledge required in modern life. Besides that, data about the student’s study habits and their motivation and preferences for different types of leaning situation is also collected. The word Literacy is used in the sequel to show the amplitude of the knowledge, abilities and competencies being assessed, and it covers:

  • contents or structures of the knowledge the students have to acquire in each domain;

  • processes to be executed;

  • the context in which these knowledge and abilities are applied.

In Mathematics, for example, the competence is assessed in items that cover from basic operations to high order abilities involving reasoning and mathematical discoveries. The Literacy in Maths is assessed in three dimensions:

  • the content of Mathematics – firstly defined in terms of wide mathematical concepts and then related to branches of the discipline;

  • the process of Mathematics – general mathematical competence in use of mathematical language, selection of models and procedures, and abilities to solve problems;

  • situations where Mathematics is used – varying from specific contexts to the ones related to wider scientific and public issues.

In PISA 2003, four subject domains were tested, with mathematics as the major domain, and reading, science and problem solving as minor domains. It was a paper-and-pencil test. The tests were composed by three item formats: multiple-choice response, closed-constructed response and open-constructed response. Multiple choice items were either standard multiple choice, where the students had to choose the best answer out of a limited number of options (usually four), or complex multiple choice where the students had to choose one of several possible responses (true/false, correct/incorrect, etc.) for each of the statements presented. In closed-constructed response items, a wider range of responses was possible. Open-constructed response items required more extensive writing, or showing a calculation, and frequently included some explanation or justification. Pencils, erasers, rulers, and in some cases calculators, were provided. The decision on providing or not calculators was made (p. 627) by the national centres and was based on standard national practice. No items in the pool required a calculator but, in some items, its use could facilitate computation. Since the model used here to analyse the data is for standard multiple-choice items, the items which do not have this format were converted into dichotomous items. Most of them were already dichotomized by PISA. The remaining items were dichotomized by putting 1 for full credit and 0 for partial or no credit.

The complex probabilistic sample, involving stratification and clustering, was obtained by randomly selecting the schools. Inside the selected schools, the students who were from 15 years and 3 months to 16 years and 2 months old were also randomly chosen. The students had to be in the 7th or 8th year of Basic School or in High School. The sample was stratified by the schools’ location (urban or rural). Besides, information on physical infrastructure of the school, geographic region, type of school (private or public) and number of enroled students were also used as variables in the implicit stratification.

22.2.2 The structure of PISA 2003 Maths test

More than a quarter of a million students, representing almost 30 million 15-year-old students enrolled in the schools of the 41 participating countries, were assessed in 2003.

The Maths test was composed by 85 items, from which 20 were retained from PISA 2000 for linking purposes. These items were selected from a previous set of 217 items by expert groups based on several criteria. Some important characteristics of the 85 selected items are presented in Tables 22.1, 22.2 and 22.3. The characteristics presented in these tables will be considered in the DIF analysis performed here.

22.3 Differential item functioning (DIF)

Differential item functioning (DIF) is the phenomenon where the characteristics of an item differ among groups of individuals. Such characteristics may

Table 22.1 Mathematics main study items (item format by competence cluster).

Item format

Competence cluster

Reproduction

Connections

Reflection

Total

Multiple-choice response

7

14

7

28

Closed-constructed response

7

4

2

13

Open-constructed response

12

22

10

44

Total

26

40

19

85

(p. 628)

Table 22.2 Mathematics main study items (content category by competence cluster).

Content category

Competence cluster

Reproduction

Connections

Reflection

Total

Space and shape

5

12

3

20

Quantity

9

11

3

23

Change and relationships

7

8

7

22

Uncertainty

5

9

6

20

Total

26(31%)

40(47%)

19(22%)

85

include discrimination and difficulty. For example, students with same level of knowledge and ability but different cultural background may lead to different probabilities of correctly answering an item with DIF.

The concern about DIF arose with the desire of creating test items that were not affected by cultural and ethnic characteristics of the individuals taking the tests (cf. Cole, 1993). If one does not consider DIF when it exists, one may be led to wrong conclusions regarding, especially, the students knowledge and abilities.

Studies conducted by the Educational Testing Service (ETS) (see Stricker and Emmerich, 1999), in the USA, indicate that DIF may exist due to three factors in a large-scale assessment context: the familiarity to the item’s content, which can be associated to the exposure to the theme or to a cultural factor; the personal interest in the content; and a negative emotional reaction caused by the item’s content.

It seems reasonable to think that items with DIF should be avoided since these would favor some group(s) of students, besides having some technical and statistical implications and other ethical issues. Nevertheless, DIF items may be very useful to study social and cultural differences that are not easily noticed. In particular, in educational assessment, DIF items can help detecting contents that are treated differently among the groups and may point out in which groups the instruction of such contents should change.

Table 22.3 Mathematics main study items (content category by item format).

Content category

Item format

Multiple-choice response

Closed-constructed response

Open-constructed response

Total

Space and shape

8

6

6

20

Quantity

6

2

15

23

Change and relationships

3

4

15

22

Uncertainty

11

1

8

20

Total

28

13

44

85

(p. 629) Explanation of DIF is a hard task. Besides that, the pedagogical and technical structure of a large-scale assessment like PISA aims to create high quality items that do not in general present differential functioning. However, it is known that the characteristics of a country and its level of economic development have great influence in the social and cultural life of its population, which reflects in education. For this reason, different countries are expected to organize the curricula in different ways, some countries may give more importance to some themes and explore more some contents. The challenge in explaining the DIF in PISA is to notice patterns in the items that present DIF. In order to do so, it is important to have a large number of items that are quite different.

22.4 Bayesian model for DIF analysis

The most common approach to undertake a statistical analysis of educational assessment data is by using item response theory (IRT). It is a psychometric theory extensively used in education assessment and cognitive psychology to analyse data arising from answers given to items belonging to a test, questionnaire, etc., and it is very useful to produce scales.

The IRT arose, formally, in the work of Lord (1952) and Rasch (1960). The basic idea of the theory is to apply models, generally parametric, where the parameters represent important features of the items and of the subjects answering the items. Some common item parameters are discrimination, difficulty and guessing. The subject parameters are individual characteristics, for example, a type of ability or some other latent trait possibly associated to his/her psychological condition.

An item can be dichotomous, if the answer is either correct or not; polytomous, if the answer can be classified in more than two categories; and also continuous, if the answer is classified in a continuous scale. There are IRT models for all those situations, but only models for dichotomic items will be considered here.

The increasing use of IRT in educational assessment and the concerns about differential item functioning are leading researchers to propose new models that take DIF into account. In this context, Soares et al. (2009) propose a new IRT Bayesian model that is a generalization of the three-parameter logistic model. The proposed model incorporates the detection, quantification and explanation of DIF in an integrated approach.

Typically, in educational assessment, a test is formed by I items, but a student j only answers a subset I(j) of these items. Let Yij, j = 1, … , J, be the score attributed to the answer given by the student j to the item iI(j) ⊂ {1, … , I}. In the case where i is a dichotomous item, Yij = 1 if the answer is correct and Yij = 0 if the answer is wrong. (p. 630)

Bayesian analysis in item response theory applied to a large-scale educational assessmentClick to view larger

Fig. 22.1 Item’s characteristic curve for the three-parameter logistic model with a = 0.5, b = 0 and c = 0.2.

Define Pr( Y ij =1| θ j , a i , b i )= π ij , and consider logit( π ij )=ln( π ij / 1 π ij ) . If logit( π ij )= Δ ij , then π ij =logi t 1 ( Δ ij )=1/1+ e Δ ij . One of the most used models in IRT is the three parameters logistic model proposed by Birnbaum (1968), which models the probabilities πij as

(22.1)

π ij = c i +( 1 c i )logi t 1 ( Δ ij ) Δ ij = D ai ( θ j b i )

where ai, bi and ci are the discrimination, difficulty and guessing of item i, respectively, ∀i. θj is the ability, proficiency or knowledge of student j, ∀j, and D is a scale factor designed to approximate the logistic link to the normal one (and set here to 1.7).

A good way to interpret the three-parameter logistic model is by analysing the item’s characteristic curve generated by the model, presented in Figure 22.1. Note from the model that the greater ai is, the greater will be the slope of the curve. This means that, the more discriminant an item is, the larger is the difference in the probability of correct answer for students with different proficiencies, that is, the larger is the capability of the item to discriminate the students. Moreover, the maximum slope is attained at θ = b, as can be seen in the figure.

The difficulty parameter bi interferes in the location of the curve. If the value of bi is increased, the curve moves right and the probability of correct answer is reduced (the item becomes harder). Alternatively, if bi is decreased, the curve moves left and the probability of correct answer is increased (the item becomes easier).

Furthermore, note that lim θ j Pr( Y ij =1| θ j , a i , b i , c i )= c i . Hence, ci is the minimum probability that a student has to correctly answer item i. ci is then (p. 631) called guessing parameter because, since it is a multiple choice item, there is always a probability that the student answers it correctly by guessing. Lord (1952) noticed that, in general, the percentage of correct answers for an item, in very low levels of proficiency, was smaller than the inverse of the number of options in the item. Experts in the area report that they have observed varied behaviours for those percentages.

In general, there can be different types of DIF (see Hanson (1998) for a wider characterization). For the three parameters model, the types of DIF can be characterized according to the difficulty, discrimination and guessing. The model proposed here does not consider the possibility of DIF in the guessing parameter. Although it is possible, the applicability of this case is substantially limited by the known difficulties in the estimation of this parameter and by practical restrictions.

Suppose that the students are grouped in G groups. The integrated Bayesian DIF model used in this chapter associates the student’s answer to his/her ability via (22.1) with

Δ ij =D a ig ( θ j b ig ),

where aig is the discrimination parameter for item i and group g, big is the difficulty parameter for item i and group g, ci is the guessing parameter for item i, for i = 1, … , I , j = 1, … , J and g = 1, … , G.

Since the item difficulty is a location parameter, it is natural to think about its DIF in the additive form and thus it is set as b ig = b i d ig b . Analogously, since the discrimination is a scale parameter, it is natural to think about its DIF in the multiplicative form and thus it is set as a ig = e d ig a a i ( >0 ) . Thus, d ig b ( with d i1 b =0 ) represents the DIF related to the difficulty of the item in each group and d ig a ( with d i1 a =0 ) represents the DIF related to the discrimination of the item in each group. Use of the exponential term in the discrimination places the DIF parameter over the line and combines naturally with a normal regression equation setting. Alternatively, the DIF parameter can be specified directly without the exponential form. This leads to a log-normal regression model. The two forms are equivalent but the former was preferred here.

It is assumed, a priori, that θ j | λ g( j ) N( μ g( j ) , σ g( j ) 2 ) , where g(j) means the group which the student j belongs to and λ g =( μ g( j ) , σ g( j ) 2 ),j=1,,J . It is admitted that λ 1 =( μ 1 , σ 1 2 )=( 0,1 ) to guarantee the identification of the model. On the other hand, λ g =( μ g , σ g 2 ),g=2,,G , is unknown and must be estimated along with the other parameters.

The model is completed with specifications of the prior distributions for the parameters. Let N be the Normal distribution, L N the Log-Normal distribution, Be the Beta distribution and IG the Inverse-Gamma distribution, so, the prior (p. 632) distributions assumed for the structural parameters are:

a i LN( μ a i , σ a i 2 ), b i N( μ b i , σ b i 2 )and c i Be( α c i , β c i ),fori=1,,I.

The prior distributions for the parameters of the abilities’ distributions are:

μ g N( μ 0g , σ 0g 2 )and σ g 2 IG( α g , β g ),g=2,G.

The set of anchor items (items for which d ig a = d ig b =0,g=1,,G ) is represented by IA ⊂ {1, … , I}. The set of items for which the parameters vary between the groups is represented by Idif = {1,…, I} − IA. Moreover, I dif a I dif is the set of items with DIF in the discrimination and I dif b I dif is the set corresponding to the DIF in the difficulty. Naturally I dif = I dif a I dif b . Notice that if an item belongs to Idif, it does not necessarily mean that this item has DIF in the usual meaning of the term. It means that it is not an anchor item and it can potentially have DIF. Besides that, this can be used as admissible information for the explanatory structure imposed to the DIF, which cannot be performed with the anchor items.

Let Z ig h , be the DIF indicator variable of item i in group g, for parameter h, for h = a, b. Therefore, Z ig h =1 if parameter h of item i has DIF in group g, and Z ig h =0 otherwise. Two possibilities may be considered: one where Z ig h is known, which means that the anchor items are known a priori and the DIF analysis considers all the other items; and another one where Z ig h is unknown and must be identified. In other words, it is not known a priori if the item has or does not have DIF. The latter one will be used in the analysis of PISA.

Finally, a regression structure is considered for d ig h in the DIF explanation as follows

(22.2)

d ig h = γ 0g h + k=1 K h γ kg h W ik h + η ig h , if Z ig h =1.

γ kg h are the fixed parameters of the DIF model, W ik h are the explanatory variables associated to the items and η ig h is the item specific random factor for each group. It is also assumed for modelling simplification that random factors are independent with η ig h N( 0, ( τ g h ) 2 ) , for g = 2, … , G.

The regression structure is imposed for all items but the anchor ones. Set W i h =( 1, W i1 h , W i K h h ) and γ g h = ( γ 0g h ,, γ K h g h ) ' . When Z ig h =1 the conditional distribution of d ig h is given by ( d ig h | γ g h , W i h , ( τ g h ) 2 )N( W i h γ g h , ( τ g h ) 2 ) . When Z ig h =0, d ig h will be assumed to have a reduced variance normal distribution ( d ig h | ( τ g h ) 2 , Z ig h =0 )N( 0, s 2 ( τ g h ) 2 ) where s2 is chosen to be small enough to ensure that d ig h is tightly concentrated around (but not degenerated at) 0. This strategy was proposed for variable selection in a regression model by George and McCulloch (1993).

(p. 633) The distribution of d ig h | γ g h , W i h , Z ig h , ( τ g h ) 2 can then be written as follows

( d ig h | γ g h , W i h , Z ig h , ( τ g h ) 2 )N( ( W i h γ g h ) Z ig h , [ s 2 ] 1 Z ig h ( τ g h ) 2 ).

Suitable prior distributions are γ g h N( γ 0 h , S 0 h ), ( τ g h ) 2 ~IG( α g h , β g h ) and Z ig h ~Ber( π ig h ) , where Ber is the Bernoulli distribution.

The model proposed here is very general. Apart from the usual parameters, it also has a DIF indicator variable Z ig h which can either be estimated along with all the other parameters of the model or be fixed a priori. The items for which Z ig h is not fixed at 0 include additional variables to the model. These variables are used for the DIF explanation and constitute a regression model which may or not have covariates.

Prior distributions play a very important role in a Bayesian model. In this particular model, they are very important in the selection of the anchor items, as it will be seen. If one wishes to set an item as an anchor one, one simply sets π ig h =0,g=2,G,h=a,b . Naturally, π ig h may be set as zero for some but not all groups. In the same way, if one wishes to include an item in the DIF analysis, independent of the DIF’s magnitude, one simply sets π ig h =1,g=2,,G , for some h. However, the most interesting use of this prior distributions is to consider previous information and beliefs about the items’ functioning to identify parameters more precisely and effectively. The estimation of the parameters is performed by using MCMC methods to obtain a sample from the joint posterior distribution of all the parameters. Details on these methods are presented in Appendix B.

The level of generality introduced by this model aims to represent the complexity of the problem. However, along with the integrated modelling of the DIF, comes the problem of model identifiability. Such problems are described and studied in Soares et al. (2009). The authors show that identifiability is achieved by either imposing informative prior distributions for some Z ig h parameters or fixing some items not to have DIF. Nevertheless, in many cases, the model is identifiable without any of these two actions because of the informative priors attributed to the other parameters.

22.5 DIF analysis of PISA 2003

The Bayesian model presented in Section 22.4 is now used to perform a DIF analysis of the PISA 2003 Maths test in English speaking countries. The database obtained had 84 of the 85 items of the test. The estimates of such missing item’s parameters are not presented in the PISA Technical Report 2003 either. Great Britain is defined as the reference group (group 1). Table 22.4 shows the other groups. (p. 634)

Table 22.4 English speaking countries used in the DIF analysis.

Country

Group

Great Britain

1

Canada

2

Australia

3

Ireland

4

USA

5

New Zealand

6

22.5.1 Prior distribution

The following prior distributions are used:

a i ~LN( 0,2 ), b i N( 0,1 ), c i Be( 5,17 ), μ g N( 0,1 ), σ g 2 IG( 0.1,0.1 ), γ g h N( 0,I ), ( τ g h ) 2 IG( 0.5,0.5 ) and Z ig h Ber( 0.5 ), for i=1,84,g =2,,6,h=a,b.

The value chosen for parameter s2 is 1/200,000.

The prior distributions of the item parameters are chosen according to what is expected, in general, when the proficiencies follow a standard Normal distribution, which is the case of the reference group. The discrimination parameters are expected to vary mostly between 0 and 3, and the L N(0, 2) has probability 0.70 of being smaller than 3. Most of the difficulty parameters are expected to lie between −2 and 2, and a few of them to have absolute value greater than 2. The standard Normal distribution is then a suitable prior to describe this behavior. The guessing parameter is expected to be low (less than 0.3), since many items were not multiple choice ones, and then should have a low probability of correct guessing. The Be(5, 17) distribution gives 0.8 probability to values smaller than 0.3.

The mean of the proficiencies in the focal groups are not expected to be very far (more than 1 unit) from the mean of the reference group. For this reason, a standard Normal distribution is used for these parameters. For the variance of the proficiencies, a prior distribution with large variance was preferred.

The DIF parameters are expected to have absolute value smaller than 0.5 and, for a few items, between 0.5 and 1. For this reason, the effect of a binary covariate is also expected to be around these values. Therefore, a standard Normal distribution is used for the coefficients in the regression analysis. For the variance of the regression error, a prior distribution with large variance is adopted. (p. 635)

Bayesian analysis in item response theory applied to a large-scale educational assessmentClick to view larger

Fig. 22.2 Estimates of difficulty parameters in the reference group and in the groups where they had a posterior probability of DIF greater than 0.5.

Since there is no prior information about how likely each item is to present DIF, a symmetric prior distribution Ber(0.5) is used for the DIF indicator variables Z ig h .

22.5.2 DIF detection

Figures 22.2, 22.3 and 22.4 show some results for the item parameters. For the discrimination and difficulty parameters, the estimates in the reference group (GBR) and in the groups where the item is likely to have DIF a posteriori are presented.

A nice feature of the model is that it incorporates the uncertainty about DIF in the estimates. This means that the model does not have to ‘decide’ if an item has DIF. It outputs a posterior distribution on the parameters that describes

Bayesian analysis in item response theory applied to a large-scale educational assessmentClick to view larger

Fig. 22.3 Estimates of discrimination parameters in the reference group and in the groups where they had a posterior probability of DIF greater than 0.5.

(p. 636)

Bayesian analysis in item response theory applied to a large-scale educational assessmentClick to view larger

Fig. 22.4 Estimates of guessing parameter.

the uncertainty about that, particularly, for the posterior distribution of the DIF parameters. It is up to the researcher to analyse this uncertainty and draw conclusions from it.

If an item i is likely to have DIF a posteriori in group g and parameter h, say P( Z ig h =1|Y )>0.5 , the posterior distribution of d ig h will probably be bimodal with one mode around 0. In order words, this posterior distribution is a mixture of a distribution with mean around 0 and a very small variance and another distribution. The former one is the distribution of ( d ig h | Z ig h =0,Y ) and the latter is the distribution of ( d ig h | Z ig h =1,Y ) . The more likely the item is to have DIF a posteriori, the greater is the weight of the second component of the mixture.

The decision on an item presenting or not DIF is based on the posterior distribution of the respective Z ig h . If one item is assumed to have DIF, the estimation of this DIF can be made by the distribution of ( d ig h | Z ig h =1,Y ) . For this reason, the estimates of the item parameters in the focal groups, when the item has a posterior probability of DIF greater than 0.5, presented in Figures 22.2 and 22.3, are the posterior means of ( d ig h | Z ig h =1,Y ) .

Concerning difficulty, 20 items have posterior probability of DIF smaller than 0.5 in all groups, that is, they are more likely not to present any noticeable DIF. No item has this DIF probability greater than 0.5 for more than three countries. Forty-five DIF parameters have absolute value greater than 0.3, which is a considerable magnitude for DIF, and nine have this value greater than 0.5, which is a large value for DIF.

Considering discrimination, seven items have posterior probability of DIF smaller than 0.5 in all groups. Nine items have this DIF probability greater than 0.5 for four countries and no item has this probability greater than 0.5 for all the countries. On the other hand, only eight DIF parameters are greater than (p. 637) 0.3 (which increases the discrimination by 35% if it is positive and decreases by 26% if it is negative). Only one DIF parameter is greater than 0.5; this value increases the discrimination by 65% if it is positive and decreases by 40% if it is negative.

Finally, note that most of the guessing parameters estimates are less than 0.15. This was expected since most of the items are not multiple choice ones. It would be reasonable to fix c = 0 for these items. However, this was not imposed, in order to enable the comparison of the results obtained with the integrated Bayesian model with the ones from Bilog-mg software (see Thissen, 2001) without DIF. Bilog-mg uses the same scale for the proficiencies and fits a three parameters logistic model.

22.5.3 DIF explanation

Four covariates were chosen to explain the DIF in the English speaking countries. The first one is the Content category shown in Table 22.2. It is a categorical variable with four categories: Space and shape; Quantity; Change and relationships; and Uncertainty. Three dummy variables are used to introduce this covariate to the regression model. Quantity is the base category, the first dummy variable refers to Space and shape, the second one to Change and relationships, and the last one to Uncertainty.

The second covariate used in the DIF explanation is Competence cluster shown in Table 22.1. It is also a categorical variable and has three categories: Reproduction; Connections; and Reflections. Two dummy variables represent the covariate where Connections is the base category, the first dummy variable refers to Reproduction and the second one represents Reflection.

The third covariate is a binary variable and indicates if the item has support of graphical resource. Finally, the fourth covariate represents the size of the question, measured by the number x of words and standardized using the rule (x − 75)/100.

Although the complete analysis can be performed in an integrated way, incorporating all model uncertainty, the DIF explanation is performed here in a separate step. The DIF magnitude among the six countries is not large, since few items with large DIF were detected in the previous section. This way, it would be difficult to highlight possible effects of covariates in the DIF.

The analysis is performed by fixing the items that had a small probability of presenting DIF in all groups in the first analysis as anchor items (13, 21, 43, 61, 67 and 74). All the other items are fixed to have DIF, that is π ig h =1 . This way, the analysis presented in this section is designed only to identify possible effects of covariates in the DIF. (p. 638)

Table 22.5 Results of the DIF explanation regression model for DIF in difficulty.

Covariates

Coefficients

Canada

Australia

Ireland

USA

New Zealand

Model 1

    Intercept

0.03

−0.09

0.001

0.01

−0.02

    Space and shape

−0.11

0.02

−0.27∗∗

−0.13

0.11

    Change

−0.13

−0.01

−0.06

−0.15

0.01

    Uncertainty

−0.14∗∗

−0.15∗∗

−0.15

−0.33

−0.07

Model 2

    Intercept

−0.03

−0.05

0.08

−0.03

0.03

    Space and shape

−0.04

0.06

−0.24∗∗

−0.07

0.10

    Uncertainty

−0.08

−0.15∗∗

−0.11

−0.29∗∗

−0.10

    Reproduction

0.04

0.003

−0.09

−0.003

0.009

    Reflection

−0.02

−0.03

−0.10

0.08

0.002

Model 3

    Intercept

−0.03

−0.19

0.03

0.02

−0.04

    Space and shape

−0.05

0.02

−0.25∗∗

−0.03

0.10

    Uncertainty

−0.08

−0.12

−0.13

−0.27∗∗

−0.09

    Graphical support

0.003

0.11

−0.02

−0.09

0.01

Model 4

    Intercept

0.02

−0.03

0.03

−0.01

0.02

    Space and shape

−0.04

0.05

−0.24∗∗

−0.02

0.11

    Uncertainty

−0.08

−0.15

−0.13

−0.23∗∗

−0.08

    Question size

0.03

−0.03

−0.01

0.07

0.002

∗ – significant at 90% and ∗∗ – significant at 95%. Significant at α%means that the α% posterior credibility interval does not include 0.

Several models were fitted for the DIF explanation. The first model for DIF in difficulty has only the covariates to indicate the Content category. In each of the following steps, a new covariate was included and the covariates that were not significant at 95% in all group in the previous model were removed. For the discrimination, the same models used for difficulty were fitted.

The results, presented in Tables 22.5 and 22.6, show that Space and shape items are more difficult for students from Ireland compared to the other countries. These items are also somewhat easier for students from New Zealand. Moreover, items related to Uncertainty are harder for students from the USA compared to the other five countries. They are also slightly harder for students from Australia and Ireland than for the ones from the other three countries.

Regarding the discrimination of the items, the results show that, in general, items with DIF in discrimination are more discriminant for students from Great Britain, followed by New Zealand and Ireland, Canada and Australia, and they are less discriminant for students from the USA. On the other hand, (p. 639)

Table 22.6 Results of the DIF explanation regression model for DIF in discrimination.

Covariates

Coefficients

Canada

Australia

Ireland

USA

New Zealand

Model 1

    Intercept

−0.14∗∗

−0.14∗∗

−0.12

−0.27∗∗

−0.05

    Space and shape

0.06

−0.006

0.12

0.09

0.008

    Change

−0.01

−0.02

0.02

−0.01

−0.02

    Uncertainty

−0.006

0.05

0.07

0.15

0.03

Model 2

    Intercept

−0.16∗∗

−0.17∗∗

−0.06

−0.26∗∗

−0.11

    Space and shape

0.08

0.008

0.10

0.09

0.02

    Uncertainty

−0.003

0.04

0.05

0.15

0.03

    Reproduction

−0.01

0.01

−0.08

−0.02

−0.02

    Reflection

0.02

0.10

−0.03

0.07

0.05

Model 3

    Intercept

−0.19∗∗

−0.16∗∗

−0.16∗∗

−0.27∗∗

−0.09

    Space and shape

0.06

0.02

0.09

0.09

0.05

    Uncertainty

0.01

0.07

0.07

0.17

0.03

    Graphical support

0.04

0.02

0.05

0.04

−0.05

Model 4

    Intercept

−0.17∗∗

−0.16∗∗

−0.12∗∗

−0.22∗∗

−0.09

    Space and shape

0.08

0.01

0.08

0.11

0.02

    Uncertainty

−0.02

0.05

0.01

0.16

0.01

    Question size

−0.06

−0.09

−0.20∗∗

−0.08

−0.13

∗ – significant at 90% and ∗∗ – significant at 95%. Significant at α% means that the α% posterior credibility interval does not include 0.

if only items related to Uncertainty are considered, they are, on average, as discriminant for students from the USA as they are for students from Great Britain and they are more discriminant in these two countries than in the other four countries. Furthermore, the question size has an influence on the DIF in discrimination in Ireland and New Zealand. Larger questions make the items less discriminant in these countries, specially in Ireland.

22.5.4 Analysis of the proficiencies

The results obtained for the distribution of the proficiencies in the DIF analysis without covariates is presented. They are compared with the results from the original analysis of PISA and from the analysis with the Bilog-mg software.

Bilog-mg also allows the existence of DIF, but only in the difficulty. The scale in Bilog-mg is defined by assuming a standard normal distribution for the proficiencies from the reference group. The same is used for the Bayesian model proposed in this chapter. (p. 640)

Table 22.7 Estimates of the mean and standard deviation of the proficiencies in each country.

Country

Parameter

PISA

Bilog-mg∗

IBM∗

Bilog-mg

IBM

Australia

Mean

524.11

522.95

522.15

0.1602

0.1515

    N = 235,486

Std. Dev.

95.60

94.40

93.78

1.0209

1.1042

Canada

Mean

532.70

529.43

528.77

0.2303

0.2231

    N = 330,098

Std. Dev.

87.33

90.00

85.26

0.9734

0.9221

Great Britain

Mean

508.14

508.14

508.14

0

0

    N = 696,215

Std. Dev.

92.47

92.47

92.47

1

1

Ireland

Mean

503.52

502.87

508.10

−0.0570

−0.0005

    N = 54,838

Std. Dev.

85.32

85.62

82.37

0.9259

0.8908

New Zealand

Mean

524.17

521.56

523.68

0.1452

0.1681

    N = 48,606

Std. Dev.

98.17

96.72

92.71

1.0459

1.0026

U.S.A.

Mean

483.64

484.85

474.99

−0.2518

−0.3585

    N = 3,140,301

Std. Dev.

95.37

91.40

97.52

0.9884

1.0546

Total

Mean

493.81

494.33

487.45

−0.1494

−0.2238

    N = 4,505,544

Std. Dev.

95.72

92.88

97.46

1.0045

1.0540

∗ refers to the modified scales.

Table 22.7 shows the mean and variance of the distributions of the proficiencies in each country considering:

  • the original PISA proficiency (pv1math), which does not consider DIF;

  • the results obtained with Bilog-mg without DIF;

  • the results obtained with Bilog-mg without DIF in a modified scale;

  • the results obtained with the integrated Bayesian model (IBM) proposed in this chapter;

  • the results obtained with the integrated Bayesian model (IBM) in a modified scale.

The modified scale referred to above consists in transforming the estimates in order to make the mean and variance of the reference group (GBR) the same as in the PISA scale.

PISA uses the Rasch model and a partial credit model and fixes the mean and standard deviation of all the proficiencies to be 500 and 100, respectively. For this reason, the proficiencies obtained with the integrated Bayesian model can not be directly compared to the ones from the original PISA results. They should be compared to the results from Bilog-mg, with GBR as the reference group. The transformed scales of the IBM and Bilog-mg are presented just to give an idea about the differences among the countries, compared with the original results from PISA.

Table 22.7 shows that the results are very similar with and without DIF in four of the six countries. Differences are only found in Ireland, where the mean increases when DIF is considered, and in the USA, where the mean decreases (p. 641) if DIF is accounted. It is important to mention that the data was weighted by the sampling weights.

22.6 Conclusions

The analysis presented here shows the importance of appropriately accounting for all sources of heterogeneity present in educational testing. Incorporation of differentiation in the education pattern of countries allows the possibility for explanation of possible causes for it. This can lead the way for improvement in the schooling systems. The use of the Bayesian paradigm provides a number of advantages ranging from inclusion of relevant background information to allowance for model identification.

In the context of the specific application considered, a host of indicators differentiating the educational systems of the English-speaking countries were identified. These may help to understand the nature and possible origins of the difference between them and lead the way for incorporation of beneficial practices in the currently available systems.

Appendix

A. Broader context and background

A.1 Markov chain Monte Carlo methods

The use of Bayesian statistics to handle statistical inference problems has grown considerably since the 1990s. This is due, mainly, to the advances in the field of stochastic simulation, particulary Markov chain Monte Carlo (MCMC) methods.

Bayesian inference is mainly based on the posterior distribution of the quantities of interest (generally parameters). But in most practical situations this distribution is not fully available. In most of the cases it is known apart from a constant. Bayes theorem states that the posterior distribution of a quantity θ, which is the distribution of θ given some observed data X, is given by:

(22.3)

p( θ|X )= p( X|θ )p( θ ) Θ p( X|θ )p( θ )dθ .

The posterior distribution of θ is proportional to the product of the likelihood function and the prior distribution of θ as it is shown in the numerator of (22.3). The denominator of (22.3) is a constant with respect to θ and is generally not analytically available due to the complexity of the integral.

Since such integral cannot be solved analytically, it is not possible to compute expectations with respect to the posterior distribution of θ. Examples include the expectation and variance of θ or any probability like P(θ ∈ A|X) for a given subset A of the state space of θ. Therefore, specially when θ is multidimensional, (p. 642) knowing only the numerator in (22.3) is not enough to have useful posterior information on θ.

However, if it is possible to have a sample from this posterior distribution, one could obtain the Monte Carlo (MC) estimator of the desired expectations. The Monte Carlo estimator of E[g(θ)|X] is given by

(22.4)

E ^ [ g( θ )|X ]= 1 n i=1 n g( θ ( i ) )

where θ(1)(1), , θ(n) is a sample from the posterior distribution of θ.

Monte Carlo estimators have some nice properties. First of all, they are unbiased estimators. Secondly, if g2(θ) has finite expectation under p(θ|X), the variance of the MC estimator is O(1/n). Moreover, they are strongly consistent estimators, since Ê[g(θ)|X] converges almost surely to E[g(θ)|X] by the strong law of large numbers.

Markov chain Monte Carlo methods consist of obtaining a sample from the posterior distribution of θ through iterative simulation and using this sample to calculate estimates of expectations. The idea is to construct a Markov chain for θ which has p(θ|X) as invariant distribution. This means that, p( θ ( i ) ) i p( θ|X ) , for any arbitrary initial value θ(0)(0) of the chain. Thus, for an arbitrarily large iteration i, θ(i)p(θ|X) approximately, and consequently θ(i+1)p(θ|X). This way, starting from an arbitrary seed θ(0)(0), values simulated from such Markov chain after a large number of iterations should come from a distribution very close to the posterior distribution of θ.

In practice, a number N of iterations is chosen in a way that the chain is believed to be in equilibrium after this iteration and a sample θ(N+1), , θ(N+M) is used to obtain estimates of expectations. These first N iterations of the chain are called burn-in period and the estimators used are

(22.5)

E ^ [ g( θ )|X ]= 1 M i=N+1 N+M g( θ ( i ) ) .

Note that the sample obtained is not independent since it consists of a realisation of a Markov chain. However, it is still reasonable to use the estimator in (22.5) since the ergodic theorem states that, if the chain is ergodic (see Gamerman and Lopes, 2006) and E[g(θ)|X] < ∞, then

(22.6)

1 n i=1 n g( θ ( i ) ) a.s. E[ g( θ )|X ], asn .

This is a Markov chain equivalent of the law of large numbers and states that averages of values from the Markov chain are strongly consistent estimators of (p. 643) expectations w.r.t. to the invariant distribution, despite the dependence structure imposed by the chain.

The most used MCMC methods in Bayesian inference are the Gibbs sampling and the Metropolis – Hastings algorithms. The former consists in defining the transition distribution of the chain as the full conditional distributions.

Suppose that θ = (θ1, , θd), where each component θi can be a scalar, a vector or a matrix, and that the full conditional distributions pii, X) are known and it is possible to simulate from them, where θi is the vector θ without the i-th component.

It can be shown that a Markov chain with transition distributions given by the full conditional distributions has p(θ|X) as invariant distribution. This way, the Gibbs sampling algorithm is given by:

  1. 1. Initialize the iteration counter of the chain j = 1 and set initial values θ ( 0 ) =( θ 1 ( 0 ) ,, θ d ( 0 ) ) .

  2. 2. Obtain a new value θ ( j ) =( θ 1 ( j ) , θ d ( j ) ) from θ(j−1) through successive generation of values

    θ 1 ( j ) p( θ 1 | θ 2 ( j1 ) ,, θ d ( j1 ) ) θ 2 ( j ) p( θ 2 | θ 1 ( j ) , θ 3 ( j1 ) ,, θ d ( j1 ) ) θ d ( j ) ~p( θ 1 | θ 1 ( j ) ,, θ d1 ( j ) ).

  3. 3. Change counter j to j + 1 and return to step 2 until convergence is reached.

In the case where it is not feasible to simulate directly from the full conditional distributions, the Metropolis – Hastings algorithm can be used. The idea is to choose a transition distribution p(ψ, ϕ) in a way that it constitutes a Markov chain that has p(θ|X) as invariant distribution. A sufficient condition for that is to have a reversible chain, that is

(22.7)

π( ψ )p( ψ,ϕ )=π( ϕ )p( ϕ,ψ ), ( ψ,ϕ ) ,

where π represents the posterior distribution.

The distribution p(ψ, ϕ) consists of two elements: an arbitrary transition distribution q(ψ, ϕ) and a probability α(ψ, ϕ) such that

(22.8)

p( ψ,ϕ )=q( ψ,ϕ )α( ψ,ϕ ) ifψϕ.

This transition kernel defines a density p(ψ, ϕ) for every possible value of θ different from ψ. Therefore, there is a positive probability left for the chain to (p. 644) remain in ψ given by

(22.9)

p( ψ,ψ )=1 q( ψ,ϕ ) α( ψ,ϕ )dϕ.

Hastings (1970) proposed to define the acceptance probability in a way that it defines a reversible chain when combined with the arbitrary transition distribution. Such probability is given by

(22.10)

α( ψ,ϕ )=min{ 1, π( ϕ )q( ψ,ϕ ) π( ψ )q( ψ,ϕ ) }.

Note that it is not necessary to know the constant in the denominator of (22.3), since it cancels in the expression of the acceptance probability. The Metropolis – Hastings algorithm may be detailed as follows:

  1. 1. Initialize the iteration counter of the chain j = 1 and set initial values θ ( 0 ) =( θ 1 ( 0 ) , θ d ( 0 ) ) .

  2. 2. Move the chain to a new value ϕ generated from the density q(j−1), ·).

  3. 3. Evaluate the acceptance probability of the move α(j−1), ϕ) given by (22.10).

  4. 4. If the move is accepted, θ(j) = ϕ. If it is not accepted, θ(j) = θ(j−1) and the chain does not move.

  5. 5. Change counter j to j + 1 and return to step 2 until convergence is reached.

Step 4 is performed by generating a value u from a unit uniform distribution. If uα, the move is accepted, if u > α the chain does not move.

In the case where some (but not all) the full conditional distributions are known, the algorithm sometimes called Metropolis-within-Gibbs can be used. This is the algorithm used in this chapter to draw samples from the joint posterior distribution of all the parameters.

The idea of the Metropolis-within-Gibbs algorithm is to construct a Gibbs sampler and, for the components of θ that cannot be directly sampled from the full conditional distribution, a Metropolis – Hastings step is used. That is, such components are drawn from an arbitrary transition distribution and have the moves accepted with an acceptance probability that constitutes a chain that has the correspondent full conditional as invariant distribution. In other words, on the ( j + 1)-th iteration, a component θi which cannot be directly drawn from its full conditional distribution is drawn from an arbitrary transition distribution q( θ i ( j ) ,φ ) (p. 645) and has the move accepted with probability

α( θ i ( j ) ,ϕ )=min{ 1, π i ( ϕ )q( ϕ, θ i ( j ) ) π i ( θ i ( j ) )q( θ i ( j ) ,ϕ ) }

where πi is the full conditional distribution of the i-th component of θ.

Such transition distributions constitute a Markov chain with invariant distribution given by the posterior distribution of θ. For more details on the MCMC methods described here and other issues on stochastic simulation in Bayesian inference see Gamerman and Lopes (2006).

A.2 MCMC details for our DIF model

The estimation of the parameters of the model presented in this chapter is performed by using MCMC methods to obtain a sample from the joint posterior distribution of these parameters. The method used is Gibbs sampling with Metropolis – Hastings steps.

Basically, all the parameters that appear explicitly in the likelihood are drawn using a Metropolis – Hastings step with a suitable random walk as the proposal distribution since it is not possible to directly draw from their full conditional distributions. These parameters are the proficiencies, the item parameters and the DIF parameters. For all the other parameters, it is possible to directly draw from their full conditional distributions.

The parameters are simulated as follows:

  • Abilities.

    Samples from p(θ|β, d, λ, γ, T, Y, W, Z), are drawn from

    p( θ| θ j ,β,d,λ,γ,T,Y,W,Z )=p( θ j | β I( j ) , d I( j )g( j ) , λ g( j ) , Y j ) p( Y j | θ j , β I( j ) , d I( j )g( j ) , λ g( j ) ) p( θ j | β I( j ) , d I( j )g( j ) )p( θ j | λ g( j ) ) =p( Y j | θ j , β I( j ) , d I( j )g( j ) )p( θ j | λ g( j ) ) = iI( j ) p( Y ij | θ j , β i , d ig( j ) )p( θ j | λ g( j ) ), j=1,,J.

    The calculations above are obtained by assuming independence between the students’ abilities, and between the answers Yij when conditioned to the abilities and to the items’ parameters. It is not easy to directly draw from the distribution above because of its complexity. So, the Metropolis – Hastings algorithm is used. A normal transition kernel is used and the proposal for (p. 646) the new state is

    θ j l q( θ j | θ j l1 )=N( θ j l1 , c θ 2 ).

    The tuning parameter is set as cθθ = 0.1, chosen after a pilot study to assure an appropriate acceptance rate of the chain.

  • Parameters of the distributions of the groups’ abilities.

    It is possible to directly draw from the full conditional distributions of the means and variances of the abilities’ distribution since conjugate prior distributions were chosen.

    • Mean of the distributions of the groups’ abilities.

      If a prior distribution μ g ~N( μ 0g , σ 0g 2 ) is chosen, the following full conditional distribution is obtained:

      p( μg| )=p( μ g | θ J( g ) , σ g 2 )p( θ J( g ) | μ g , σ g 2 )p( μ g | σ g 2 ) = jJ( g ) p( θ j | μ g , σ g 2 )p( μ g )

      so:

      ( μ g | )N( m g , s g 2 ), where : m g = jJ( g ) θ j σ 0g 2 + μ 0g σ g 2 n g σ 0g 2 + σ g 2 and s g = σ g σ 0g n g σ 0g 2 + σ g 2 .

      J(g) represents the set of the students in group g and ng is the number of students in group g, for g = 2, … , G.

    • – Variance of the distributions of the groups’ abilities.

      p( σ g 2 | )=p( σ g 2 | θ J( g ) , μ g ) jJ( g ) p( θ j | μ g , σ g 2 ) p( σ g 2 ).

      If a prior distribution σ g 2 ~IG( α g , β g ) is chosen, the following full conditional distribution is obtained:

      ( σ g 2 | )IG( α g + n g 2 , jJ( g ) ( θ j μ g ) 2 +2 β g 2 ),g=2,,G.

  • (p. 647) Structural parameters β.

    Under the hypotheses of local independence of the items, samples of p(β|·) are drawn from:

    p( β i | θ J( i ) , d i , Y J( i ) )p( Y J( i ) | θ J( i ) , β i , d i )p( β i | θ J( i ) , d i ) = jJ( i ) p( Y ij | θ j , β i , d ig( j ) )p( a i )p( b i ) p( c i ), i=1,,I.

    The last equality comes from the prior independence between item parameters. The chosen prior distributions of the parameters are:

    a i ~LN( μ ai , σ a i 2 ); b i Ñ( μ b i , σ b i 2 ) and c i ~Beta( α c i , β c i ).

    In general, μ a i =0, σ a i 2 =2, μ b i =0, σ b i 2 =1, α c i =5, β c i =17 . These values are used, for example, as default values in the software Bilog-mg. In some cases, these values have to be modified to assure a better fit of current data or to add past information about the parameters.

    Once again, it is not possible to directly draw from the full conditional distribution and the Metropolis – Hastings algorithm is used assuming the following transition kernels:

    a i l ~LN( ln( a i l1 ) c a ), b i l Ñ( b i l1 , c b 2 )e c i l U[ c i l1 δ, c i l1 +δ ].

    In general, the values ca = 0.02, cb = 0.1, δ = 0.05 are used.

  • Structural DIF parameters.

    To draw samples from p(dh|·), h = a, b, samples are independently drawn from:

    p( d ig h | )=p( d ig h | d i,g h , d g h , Z h , θ J( i,g ) , β i , γ g ,T, Y J( i,g ) ,W ) p( Y J( i ) | θ J( i,g ) , β i , d ig h )p( d ig h | d i,g h , d g h ,W, γ g ,T, Z ig h ) = jJ( i,g ) p( Y ij | θ j , β i , d ig h ) p( d ig h | W i h , γ g h , τ g h , Z ig h ), i I dif h ,g=2,,G.

    In the last equality, it is assumed that T h = ( τ g h ) 2 I , where I is the identity matrix nidh × nidh, and nidh is the number of items for which DIF in parameter h is assumed, h = a, b. The conditional prior distribution of the DIF parameters is ( d ig h | W i h , γ g h , τ g h , Z ig h )~N( W i h γ g h , ( τ g h ) 2 )if Z ig h =1 , and the transition kernel used in the Metropolis – Hastings algorithm is the (p. 648) following:

    ( d ig h ) l+1 Ñ( ( d ig h ) l ,0.1 ),i.

    On the other hand, ( d ig h | W i h , γ g h , τ g h , Z ig h )~N( 0, s i 2 ( τ g h ) 2 )if Z ig h =0 . In practical situations, since si is very small, d ig h ~0if Z ig h =0 .

  • Parameters of the DIF regression structure.

    For the parameters γg, g = 2, … , G, samples are drawn from:

    p( γ g h | )=p( γ g h | d g h , T g h , W h , Z h )p( d g h | γ g h , W h , T g h , Z h )p( γ g h ).

    If a prior distribution γ g h ~N( γ 0 h , S 0 h ) is assumed, the following full conditional distribution is obtained:

    ( γ g h | d g h , T g h , W h , Z h )Ñ( H,L ), where:

    L= [ ( W I( Z ig h =1 ) h ) T ( T g h ) 1 W I( Z ig=1 h ) h + ( S 0 h ) 1 ] 1 and H=L[ ( W I( Z ig h =1 ) h ) T ( T g h ) 1 d I( Z ig h =1 ),g h + ( S 0 h ) 1 γ 0 h ].

    Samples of ( τ g h ) 2 are drawn from:

    p( ( τ g h ) 2 | )=p( ( τ g h ) 2 | d g h , γ g h , W I( Z ig h =1 ) h , Z h ) p( d g h | ( τ g h ) 2 , γ g h , W I( Z ig=1 h ) h , Z h )p( ( τ g h ) 2 ).

    If a prior distribution ( τ g h ) 2 ~IG( α τ g h , β τ g h ) is assumed, the following full conditional distribution is obtained:

    ( ( τ g h ) 2 | )IG( α τ g h + Z ig h 2 ,[ 1 2 ( d g h W I( Z ig h =1 ) h γ g ) T ( d g h W I( Z ig=1 h ) h γ g )+ β τ g h ] ),

    for g = 2,… , G.

  • DIF indicator variable.

    p( Z ig h | )=p( Z ig h | d ig h , T h , W i h , γ h )=p( d ig h | Z ig h , T h , W i h , γ h )p( Z ig h )

    (p. 649) If a prior distribution Z ig h ~Ber( π ig h ) is assumed, the full conditional distribution ( Z g h | )~Ber( ω ig h ) is obtained, where:

    ω ig h = 1 c ω π ig h exp( 1 2( τ g 2 ) ( d ig W i h γ g h ) 2 ),and: c ω = π ig h exp( 1 2( τ g 2 ) ( d ig W i h γ g h ) 2 )+( 1 π ig h )exp( 1 2( s i 2 τ g 2 ) ( d ig h ) 2 ), forg=2,,G.

A.3 Bayesian methods in item response theory

Since the late 1960s, knowledge about populations of students has been used in order to provide improved estimates for individual abilities. In the early works of Birnbaum (1969) and Owen (1975), Bayes estimates of ability parameters were obtained in item response models under the assumption that the item parameters are known. Novick et al. (1972) and Rubin (1980) proposed Bayesian and empirical Bayesian solutions, respectively, to predict the performance of students from a Law school using information from other Law schools. Also, the most used traditional Bayesian methods for ability estimation were proposed in the early 1980s: the Expected A-Posteriori (EAP; Bock and Mislevy 1982) and Modal A-Posteriori (MAP; Samejima 1980).

Bayesian methods have been used in IRT for the estimation of structural parameters since the early 1980s. Applying the hierarchical Bayesian approach suggested in Lindley and Smith (1972), Swaminathan and Gifford (1982), Swaminathan and Gifford (1985) and Swaminathan and Gifford (1986) proposed Bayesian procedures for simultaneous estimation of the item and ability parameters in the Rasch, two-parameter and three parameter models, respectively. The authors showed that the non-convergence of the estimates of the discrimination and guessing parameters, which is common in joint maximum likelihood methods, can be controlled by appropriate specification of prior distributions for all parameters. Mislev (1986) and Tsutakawa and Lin (1986) proposed to use the EM algorithm (Dempster et al. 1977) to obtain a modal marginal posterior distribution estimator for the item parameters, as it is proposed by Bock and Aitkin (1981) for maximum marginal likelihood estimation.

The importance of Bayesian approaches in item response theory have been steadily growing since the 1980s, with incorporation of new models. Such models try to include the effect of covariates or grouping in the estimation of latent abilities. Multidimensional structures and local dependence of the items have also been proposed. This way, as the complexity of the models increase, also does the difficulty on the estimation of the parameters of the models. (p. 650) Classical methods, like maximization of marginal likelihood, and even some Bayesian methods, like maximum posterior distribution, use the EM algorithm or some of its variations and, basically, treat the latent variables as ‘missing data’.

However, the use of the EM algorithm becomes harder as the complexity increases. For this reason, the use of methods based on stochastic simulation has grown since the 1990s, specially MCMC methods. In a pioneering study, Albert (1992) applied the Gibbs sampler to the two-parameter model using the normal distribution function as the link function. Béguin and Glas (2001) proposed a Gibbs sampler for the three-parameter and the multidimensional models. Fox and Glass (2001, 2003) studied an IRT model with a hierarchical regression structure on the ability parameters, with both latent and observed covariates. The latent covariates are jointly estimated with the parameters of the model. All of those works use normal distributions for the latent variables and estimate the abilities using augmented data techniches (cf. Tanner and Wong, 1987).

Patz and Junker (1999b) applied the Gibbs sampler to estimate parameters in a two-parameter model. Differently from the works cited above, that use augmented data, the authors use the Metropolis-within-Gibbs algorithm. Patz and Junker (1999a) extended this work for the three-parameter model, polytomic models and models with missing data.

DIF analysis is an appropriate environment for a genuine Bayesian formulation due to the complex structure of the models and the subjective decision features involved, which can be formulated through Bayesian arguments. For example, Zwick et al. (1999, 2000) and Zwick and Thayer (2002) considered a formulation where the MH D-DIF statistic is represented by a normal model where the mean is equal to the real DIF parameter. These authors use Empirical Bayes (EB) for the posterior estimation of the parameters. Sinharay et al. (2006) considered the same formulation and proposed informative prior distributions based on past information. They showed that the full Bayes (FB) method leads to improvements if compared to the two other approaches, specially in small samples. Wang et al. (2008) proposed a Bayesian approach to study DIF based on testlet response theory (cf. Wainer et al., 2007).

B. Countries that participated in PISA 2003

OECD countries:

Australia, Austria, Belgium, Canada, Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Iceland, Ireland, Italy, Japan, Korea, Luxembourg, Mexico, Netherlands, New Zealand, Norway, Poland, Portugal, Slovak Republic, Spain, Sweden, Switzerland, Turkey, United Kingdom, Scotland, United States.

(p. 651) Partner countries:

Brazil, Hong Kong-China, Indonesia, Latvia, Liechtenstein, Macao-China, Russian Federation, Serbia and Montenegro, Thailand, Tunisia, Uruguay.

Acknowledgments

The first author acknowledges grants from CNPq-Brazil and FAPERJ-Brazil.

References

Albert, J. H. (1992). Bayesian estimation of normal ogive item response models using Gibbs sampling. Journal of Educational Statistics, 17, 251–269.Find this resource:

    Béguin, A. A. and Glass, C. A. W. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66, 541–562.Find this resource:

      Birnbaum, A. (1968). Some latent traits models and their use in inferring examinee’s ability. In reissue of Statistical Theories of Mental Test Scores (ed. F. M. Lord and M. R. Novick). Addison-Wesley, Reading, MA.Find this resource:

        Birnbaum, A. (1969). Statistical theory for logistic mental test models with a prior distribution of ability. Journal of Mathematical Psychology, 6, 258–276.Find this resource:

          Bock, R. D. and Aitkin, M. (1981). Marginal maximum likelihood estimator of item parameters: an application of an EM algorithm. Psychometrika, 46, 443–459.Find this resource:

            Bock, R. D. and Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6, 431–444.Find this resource:

              Cole, N. S. (1993). Differential Item Functioning, Chapter on History and development of DIF. Lawrence Erlbaum, Hillsdale, NJ.Find this resource:

                Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B, 39, 1–38.Find this resource:

                  Fox, J. P. and Glass, C. A. W. (2001). Bayesian estimation of a mutilevel model using Gibbs sampling. Psychometrika, 66, 271–288.Find this resource:

                    Fox, J. P. and Glass, C. A. W. (2003). Bayesian modeling of measurement error in predictor variables using IRT. Psychometrika, 68, 169–191.Find this resource:

                      Gamerman, D. and Lopes, H. F. (2006). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, (2nd edn). Chapman & Hall/CRC, New York.Find this resource:

                        George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of American Statistical Association, 85, 398–409.Find this resource:

                          Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109.Find this resource:

                            Lindley, D. V. and Smith, A. F. (1972). Bayesian estimates for the linear model (with discussion). Journal of the Royal Statistical Society B, 34, 1–41.Find this resource:

                              Lord, F. M. (1952). A theory of test scores. Psychometric Monograph, 7. Psychometric society.Find this resource:

                                Mislev, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177–195.Find this resource:

                                  Novick, M. R., Jackson, P. H., Thayer, D. T. and Cole, N. S. (1972). Estimating multiple regression in M-groups: a cross-validation study. British Journal of Mathematical and Statistical Psychology, 5, 33–50.Find this resource:

                                    (p. 652) Owen, R. J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. Journal of the American Statistical Association, 70, 351–360.Find this resource:

                                      Patz, R. J. and Junker, B. W. (1999a). Applications and extensions of MCMC in IRT. Journal of Educational and Behavioral Statistics, 24, 342–366.Find this resource:

                                        Patz, R. J. and Junker, B. W. (1999b). A straightforward approach to Markov chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24, 146–178.Find this resource:

                                          Rasch (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Institute for Educational Research.Find this resource:

                                            Rubin, D. B. (1980). Using empirical Bayes techniques in the law school validity studies. Journal of the American Statistical Association, 73, 801–827.Find this resource:

                                              Samejima, F. (1980). Is Bayesian estimation proper for estimating the individual’s ability? Research Report 80–3, University of Tennessee, Department of Psychology, Knoxville, TN.Find this resource:

                                                Sinharay, S., Dorans, N. J., Grant, M. C., Blew, E. O. and Knorr, C. M. (2006). Using past data to enhance small-sample DIF estimation: a Bayesian approach. Technical Report ETS RR-06-09, Educational Testing Service, Princeton, NJ.Find this resource:

                                                  Soares, T. M., Gonçalves, F. B. and Gamerman, D. (2009). An integrated Bayesian model for DIF analysis. Journal of Educational and Behavioral Statistics, 34, 348–377.Find this resource:

                                                    Stricker, L. J. and Emmerich, W. (1999). Possible determinants of differential item functioning: familiarity, interest and emotional reaction. Journal of Educational Measurement, 36, 347–366.Find this resource:

                                                      Swaminathan, H. and Gifford, J. A. (1982). Bayesian estimation in the rasch model. Journal of Educational Statistics, 7, 175–191.Find this resource:

                                                        Swaminathan, H. and Gifford, J. A. (1985). Bayesian estimation in the two parameter logistic model. Psychometrika, 50, 349–364.Find this resource:

                                                          Swaminathan, H. and Gifford, J. A. (1986). Bayesian estimation in the three parameter logistic model. Psychometrika, 51, 589–601.Find this resource:

                                                            Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distribution by data augmentation. Journal of the American Statistical Association, 82, 528–550.Find this resource:

                                                              Thissen, D. (2001). IRTLRDIF v.2.0.b: Software for the Computation of the Statistics Involved in Item Response Theory Likelihood-Ratio Tests for Differential Item Functioning. http://www.unc.edu/dthissen/dl.html: Scientific Software International.

                                                              Tsutakawa, R. K. and Lin, H. Y. (1986). Bayesian estimation of item response curves. Psychometrika, 51, 251–267.Find this resource:

                                                                Wainer, H., Bradlow, E. T. and Wang, X. (2007). Testlet Response Theory. Cambridge University Press, New York.Find this resource:

                                                                  Wang, X., Bradlow, E. T., Wainer, H. and Muller, E. S. (2008). A Bayesian method for studying DIF: a cautionary tale filled with surprises and delights. Journal of Educational and Behavioral Statistics, 33, 363–384.Find this resource:

                                                                    Zwick, R. and Thayer, D. T. (2002). Application of an empirical Bayes enhancement of mantel–haenszel DIF analysis to a computerized adaptive test. Applied Psychological Measurement, 26, 57–76.Find this resource:

                                                                      Zwick, R., Thayer, D. T. and Lewis, C. (1999). An empirical Bayes approach to Mantel–Haenszel DIF analysis. Journal of Educational Measurement, 36, 1–28.Find this resource:

                                                                        Zwick, R., Thayer, D. T. and Lewis, C. (2000). Using loss functions for DIF detection: an empirical Bayes approach. Journal of Educational and Behavioral Statistics, 25, 225–247.Find this resource:

                                                                          Notes:

                                                                          (1) OECD – Organisation for Economic Co-Operation and Development.

                                                                          (3) The information presented in this section is based on the PISA 2003 Technical Report. http://www.pisa.oecd.org/dataoecd/49/60/35188570.pdf.

                                                                          (4) IEA – International Association for Evaluation of Educational Achievement, TIMSS – Trends in International Mathematics and Science Study, OREALC/UNESCO – Oficina Regional de Educación para América Latina e Caribe.