feds · May 31, 2015

Achievement Gap Estimates and Deviations from Cardinal Comparability

Abstract

This paper assesses the sensitivity of standard empirical methods for measuring group differences in achievement to violations in the cardinal comparability of achievement test scores. The paper defines a distance measure over possible weighting functions (scalings) of test scores. It then constructs worst-case bounds for the bias in the estimated achievement gap (or achievement gap change) that could result from using the observed rather than the true test scale, given that the true and observed scales are no more than a fixed distance from each other. The worst-case weighting functions have simple, closed-form expressions consisting of achievement thresholds, flat regions in which test scores are uninformative, and regions in which the observed test scores are actually cardinally comparable. The paper next estimates these worst-case weighting functions for black/white and high-/low-income achievement gaps and gap changes using data from several commonly employed surveys. The results of this empirical exercise suggest that cross-sectional achievement gap estimates tend to be quite robust to scale misspecification. In contrast, achievement gap change estimates seem to be quite sensitive to the choice of test scale. Standard empirical methods may not robustly identify the sign of the trend in achievement inequality between students from different racial groups and income classes. Furthermore, ordinal methods may be more powerful and will continue to have the correct size when the test scale has been misspecified.

Finance and Economics Discussion Series
Divisions of Research & Statistics and Monetary Affairs
Federal Reserve Board, Washington, D.C.

Achievement Gap Estimates and Deviations from Cardinal Comparability
Eric R. Nielsen
2015-040

Please cite this paper as: Nielsen, Eric R. (2015). “Achievement Gap Estimates and Deviations from Cardinal Comparability,” Finance and Economics Discussion Series 2015-040. Washington: Board of Governors of the Federal Reserve System, http://dx.doi.org/10.17016/FEDS.2015.040.

NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. References in publications to the Finance and Economics Discussion Series (other than acknowledgement) should be cleared with the author(s) to protect the tentative character of these papers.

ACHIEVEMENT GAP ESTIMATES AND DEVIATIONS FROM CARDINAL COMPARABILITY

ERIC R. NIELSEN
THE FEDERAL RESERVE BOARD

JEL Codes: C18, I24, I26

1. Introduction

Researchers frequently use test-score data to assess group differences in achievement. The vast majority of such investigations assume that some known normalization of the test scores renders them cardinally comparable in the sense that a given score change has the same meaning throughout the range of possible scores.
Furthermore, such investigations typically assume that a given test score has the same meaning across different surveys, ages, or time periods.1 However, neither of these assumptions is well motivated by either economic or psychometric theory. If either fails, standard estimates of achievement gaps and achievement-gap changes (“gaps/changes”) can be severely biased. Such estimates are no longer even guaranteed to correctly identify the sign of the achievement gap/change.

Date: May 12, 2015. Preliminary and incomplete. Please do not cite or circulate without explicit permission of the author. Rick Ogden provided excellent research assistance for this project. The views and opinions expressed in this paper are solely those of the author and do not reflect those of the Board of Governors or the Federal Reserve System. Contact: Division of Research and Statistics, Board of Governors of the Federal Reserve System, Mail Stop 97, 20th and C Street NW, Washington, D.C. 20551. eric.r.nielsen@frb.gov. (202) 872-7591.

1 Consider SAT scores. If SAT scores are comparable over time, a student who earns a 600 on the math section in 1980 should have the same achievement as a student who earns a 600 in 2010. If the SAT has a cardinal (interval) scale, then a student who improves her math score from 400 to 500 has improved by the same amount as a student whose score increased from 600 to 700.

In a parallel working paper, I show how to make achievement comparisons between different groups of students using only the ordinal content of achievement test scores.2 I also show that focusing on the cardinal/ordinal distinction is not mere methodological pedantry; standard, cardinal methods suggest that the gap in achievement between youth from high- and low-income households widened in recent decades, whereas more-robust ordinal methods strongly suggest the opposite.

The necessary conditions for ordinal statistics to unambiguously identify achievement gaps/changes are quite demanding. Two main conditions are needed, and each is likely to fail in many applied settings. First, it must be possible to place test scores on a common scale so that a given score corresponds to the same underlying level of achievement regardless of the year, cohort, or age group from which the score was drawn.3 Second, various first-order stochastic dominance conditions must hold between the relevant test-score distributions.4 Although these dominance conditions are satisfied in some instances, for many economically interesting achievement comparisons they are not met.

The stringency of the necessary conditions for valid ordinal inference means that many achievement comparisons are inherently ambiguous or scale dependent. There are many situations in which we really cannot determine with certainty how achievement inequality has changed, as much as we would like to and as strongly as standard cardinal methods suggest that we can. Should researchers then simply plead ignorance when ordinal estimates are inconclusive or infeasible?

There are good reasons to resist such radical agnosticism. Test scales may not be perfectly cardinal, yet they may still carry useful cardinal information.
For example, suppose we are comparing three students with SAT scores of 1000, 1500, and 1510. It seems plausible that the student with a 1500 truly is closer to the 1510 student than she is to the 1000 student, even if the ratio of the score differences in the true cardinal scale is not exactly 1/50. Eschewing cardinality completely may be throwing away a lot of useful information and, thus, unnecessarily decreasing one’s power to detect achievement differences. If some known test rescaling is truly cardinal, then cardinal statistical tests applied to this scale will have greater power to detect achievement-gap changes than will entirely ordinal approaches.5

2 For an up-to-date draft of that paper, please see the top link at https://sites.google.com/site/ericnielsenecon/research.

3 Many standardized tests are renormed every year, violating the common-scale assumption. If there are common items across the different tests, or if a group of students were randomly assigned to each different test, then it is possible to construct a common scale against which all test-takers from any survey can be coherently ordered. I abstract from this problem in the theory sections of this paper, and I take great care to put scores in “equivalent units” in the empirical sections.

4 In particular, the “high” group score distribution must first-order dominate the “low” group score distribution within a given year/cohort for the sign of the cross-sectional achievement gap to be unambiguous. For an achievement gap change to be unambiguous, the high group in the earlier period must first-order dominate the high group in the later period, and the low group in the later period must first-order dominate the low group in the earlier period.

5 Section 7 demonstrates the greater power of cardinal methods in the case that test scores are normally distributed and cardinally comparable.
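The dominance condition described in footnote 4 can be checked directly on sample data. The following is a minimal sketch, not the paper's procedure: it compares empirical CDFs at every pooled sample point and ignores sampling error. The function names and the score lists are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical CDF of `sample` as a function of x."""
    xs = sorted(sample)
    n = len(xs)
    return lambda x: bisect_right(xs, x) / n


def first_order_dominates(high, low):
    """True if the empirical distribution of `high` first-order dominates
    that of `low`: F_high(x) <= F_low(x) at every pooled sample point,
    with strict inequality at least once."""
    f_high, f_low = ecdf(high), ecdf(low)
    grid = sorted(set(high) | set(low))
    diffs = [f_low(x) - f_high(x) for x in grid]
    return all(d >= 0 for d in diffs) and any(d > 0 for d in diffs)


# Hypothetical scores, for illustration only.
group_a = [520, 560, 610, 650, 700]
group_b = [480, 530, 580, 620, 660]
group_c = [400, 540, 600, 640, 760]  # wider spread, so the CDFs cross

print(first_order_dominates(group_a, group_b))
print(first_order_dominates(group_a, group_c))
```

In this toy data, group A dominates group B, so the sign of their gap does not depend on the scale. Group A and group C have crossing CDFs, so some increasing rescaling of the scores can reverse the sign of their gap.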

Intuitively, if a known test scale is “almost” cardinal, cardinal statistical tests may correctly identify the sign of an achievement gap/change in the limit and have greater power than ordinal tests in finite samples. Of course, if the test scale used is actually very far from the true cardinal scale, then cardinal methods may misidentify achievement gaps/changes in the limit and will definitely have incorrect size and power in finite samples.

In order to operationalize this intuitive tradeoff, it is necessary to formalize what it means for the true test scale and the observed scale to be “far” from each other. Therefore, I study the failure of cardinal comparability as a specification problem. In particular, I introduce a distance measure that allows me to quantify how far apart two candidate test scales are. Next, I suppose that nothing is known about the true cardinal test scale other than that it lies within a fixed distance of the observed test scale. I then search for the unobserved true scale satisfying the hypothesized distance restriction that maximizes the difference between the observed and true achievement gaps/changes. By studying the worst-case bias as a function of the hypothesized distance between the true and observed scales, I can test the sensitivity of standard methods to deviations in the cardinality of test scales.

On the theoretical side, I derive closed-form expressions for the test scales that maximize positive and negative bias relative to the observed scale. Under fairly general conditions, these weighting functions depend only on the distance restriction imposed and a finite vector of statistics of the component test-score distributions being compared. The worst-case weighting functions are all piecewise-linear, with both flat regions (where changes in observed test scores are uninformative) and cardinal regions (where changes in observed test scores map linearly to changes in true achievement).
Furthermore, the weighting functions often contain discontinuous jumps, or achievement thresholds, where a small change in the observed test score corresponds to a large change in true achievement.

I estimate the worst-case weighting functions and resulting scale sensitivities for black/white and high-/low-income achievement gaps/changes in the National Longitudinal Surveys of Youth (NLSY) and the National Education Longitudinal Surveys (NELS/ELS). The cross-sectional achievement gap estimates are quite robust in these data. It is often not possible to find a rescaling of the test scores that flips the sign of a given estimate regardless of the distance restriction. In other cases, the minimum distance needed for the observed scale to misidentify the sign of the true gap is very large. For instance, to flip the sign of the black/white reading achievement gap in the NLSY97, the weights placed on test scores by the true and observed scales must differ by at least 2 standard-deviation units somewhere on the range of observed scores. In contrast, gap-change estimates are typically much more sensitive to scale deviations. It is possible to pick a large enough distance restriction to flip the sign of the estimate for every gap-change estimate I examine. Furthermore, the deviations required to effect a sign flip are often quite small. For example, if the true and observed scales are allowed to differ by only 0.15 standard deviations somewhere on their support, the sign of the income-achievement gap change for reading may be misidentified in the NELS/ELS.

My empirical results cast serious doubt on research that uses cardinal methods to measure time trends in achievement inequality. Some of the most well-studied achievement-gap changes estimated using very widely used data sources are not robust to minor changes in the test scale used. Since there are no good reasons to prefer the observed test scale to any other, estimates of changes in achievement inequality over time using this scale are not credible. Researchers assessing changes in achievement inequality over time should be much more circumspect in their deployment of standard cardinal methods or should eschew scale-dependent techniques entirely.

This paper is not entirely negative, because I also develop a set of tools that allow a researcher to assess whether a particular achievement gap/change estimate is sensitive to the choice of scale. These tools are straightforward to apply and do not require more data than would be used in standard empirical gap/change calculations. With my toolkit in hand, empirical researchers can proceed using standard methods and simply check whether or not their particular conclusions are overly scale dependent before switching to less familiar and less powerful ordinal approaches.

The rest of the paper proceeds as follows. Section 2 reviews the relevant literature on achievement gaps, test score cardinality, and the relationship between stochastic dominance and social welfare. Section 3 lays out the notation, defines the necessary mathematical objects, and justifies the normalizations and simplifications I employ.
Section 4 derives the worst-case weighting functions for a general class of achievement gap/change estimates. Section 5 outlines a number of empirically relevant extensions to the theoretical bounding analysis. Section 6 assesses the sensitivity of a number of achievement gap/change estimates to cardinal deviations using the NLSY and NELS/ELS data. Section 7 investigates the power and size of cardinal and ordinal tests in the presence of cardinal deviations. Section 8 concludes. Appendices A through D contain figures, point estimates, additional background, and technical discussion.

2. Literature Review

The economics literature using cardinal methods to assess group differences in achievement is vast. Fryer and Levitt[7, 8], Clotfelter, Ladd, and Vigdor[5], Duncan and Magnuson[6], Hanushek and Rivkin[9], and Neal[15], among many others, use cardinal methods to assess changes in black/white achievement inequality in the United States.6 Reardon[19] employs cardinal methods to argue that the gap in achievement between high- and low-income youth has widened tremendously over the past several decades. Finally, research assessing school and teacher performance through value-added models (VAMs) and papers estimating the productivity of various inputs such as class size and teacher quality on student achievement also typically assume that test scores are cardinal measures.7

6 Neal[15] does recognize, however, that “[a]chievement has no natural units,” and so he also analyzes the percentile rankings of black versus white test takers.

This paper is not the first in either economics or psychometrics to argue that normalized test scores are not cardinally comparable. In psychometrics, Stevens[23] and Lord[14] argue that most psychometric test scores are inherently ordinal. In economics, Lang[13], Bond and Lang[3], Cascio and Staiger[4], Reardon[18], and Nielsen[16] all discuss the sensitivity of standard achievement gap/change estimates to order-preserving transformations of the test scores. The analysis in Bond and Lang[3] is particularly relevant to this paper. These authors search over a fairly general class of order-preserving transformations of test scores in order to find rescalings that maximize and minimize the apparent change in black/white achievement inequality through the first several years of school. Their worst-case transformations typically consist of a set of achievement thresholds with mostly flat regions between sharp jumps. Interestingly, their functional forms are quite similar to those I derive theoretically in this paper.

Ultimately, economists and policymakers are not interested in the test scores themselves, but rather in the (social) value of the achievement represented by the test scores. This formulation yields an isomorphism between measuring achievement gaps and using social welfare functions to rank income distributions.
In this context, it has been shown that first-order stochastic dominance (FOSD) is both necessary and sufficient for all increasing social welfare functions to agree on the ranking of two distributions, while all increasing concave functions will agree on the ranking of two distributions related by second-order stochastic dominance (SOSD).8 Aaberge, Havnes, and Mogstad[20] note that first- and second-order dominance often fail to hold in empirical applications ranking income distributions. In response, they derive economically interpretable preference functions that allow unambiguous ranking of distribution functions under dominance of any order. In principle, their approach could also be used to rank test-score distributions when FOSD and SOSD fail to hold. However, doing so would require imposing conditions on the social welfare function that are less plausible when applied to test scores than when applied to income. For example, concavity can be justified for income by appealing to diminishing marginal utility. However, concavity may not make sense for test scores because the relationship between scores and life outcomes may be quite convex. Even if the social welfare function is concave in life outcomes, it may not be concave in test scores.9

7 For example, Krueger[12] and Hoxby[11] both use test scores cardinally to estimate the effect of class size on student achievement gains. Value-added methodologies such as those expounded in Raudenbush[17] and elsewhere also suppose that (normalized) test scores are cardinally comparable.

8 Indeed, in this paper, and in Nielsen[16], I make extensive use of this fact.

3. Formal Setting and Assumptions

Suppose a population of students has test scores $s$ distributed according to cumulative distribution function (cdf) $F$. Furthermore, suppose that the test scores are weakly ordinally perfect in the sense that true achievement $a$ corresponding to test score $s$ is given by $a = \psi(s)$ for some weakly increasing function $\psi$.10

Let $W_0(s)$ be the true value of the underlying achievement corresponding to test score $s$. $W_0$ is the composition of several conceptually distinct maps: the map $\psi$ from test scores to true achievement, the map from true achievement to economically relevant life outcomes, and the map from life outcomes to social welfare. Even assuming that the choice of the social welfare function is uncontroversial, the first of these maps is not knowable and the second is very difficult to estimate even with the richest data.11 Therefore, I will assume throughout this paper that $W_0$ is inaccessible to the researcher.

The only a priori restriction I place on $W_0$ is that it is weakly increasing in $s$:

\[ s > s' \Rightarrow W_0(s) \ge W_0(s') \;\wedge\; W_0(s) > W_0(s') \Rightarrow s > s'. \]

Weak monotonicity is a natural assumption in this setting because higher test scores must correspond to weakly higher underlying achievement, and positive life outcomes should be causally linked to higher true achievement. I do not assume $W_0$ is strictly monotone because I want to allow for the possibility that changes in test scores in some regions do not change overall welfare, either because the scores themselves are uninformative or because higher achievement does not always lead to better outcomes.12 Even if the map from test scores to achievement is strictly monotone, either or both of the maps from achievement to life outcomes or from life outcomes to social welfare may have flat regions.
Weak monotonicity does not rule out the possibility that $W_0$ is constant everywhere. The worst-case $W_0$’s may actually be constant when the true scale is allowed to be very different than the observed scale. However, the worst-case weighting functions will be strictly increasing somewhere in all but the most extreme cases. Unless I explicitly specify otherwise, I will treat generic weighting functions $W_0$ in the remaining analysis as having at least two values $s > \tilde{s}$ such that $W_0(s) > W_0(\tilde{s})$.

9 For example, consider a test of athletic ability and suppose that we are interested in lifetime labor income. Reasonable preferences on income will likely be concave, but the relationship between athletic ability and income may be highly convex. The increase in income associated with moving from the level of a good college basketball player to the level of LeBron James is so large that it may well swamp any concavity in social welfare.

10 This implies that for two students $i$ and $j$ with test scores $s_i > s_j$, $a_i$ should be weakly greater than $a_j$. Whether $\psi$ is weakly or strictly monotone is not crucial for the analysis. The advantage of maintaining only weak monotonicity conceptually is that it allows test scores to be uninformative in some regions. Of course, $\psi$ must be strictly increasing somewhere if the test is to be useful at all in differentiating students by achievement.

11 Life outcomes such as longevity, health, total labor market earnings, marriage quality, and so forth are only fully revealed decades after most achievement test scores are recorded. Estimating even some of these outcomes with the best longitudinal data available is a major econometric challenge. Nielsen[16] carries out such a calculation for lifetime earnings in the National Longitudinal Surveys of Youth (NLSY) data.

12 Real-world institutions often treat test scores in some ranges as being uninformative; for example, graduate economics departments typically do not distinguish between GRE scores of 165-170.

Consider the problem of comparing two distinct test-score distributions $F$ and $\tilde{F}$ given that $W_0$ is unknown. The total value of $F$ depends on $W_0$ because $V(W_0,F) = E_F[W_0(s)] = \int W_0(s)\,dF(s)$. It is straightforward to show that $V(W_0,F) > V(W_0,\tilde{F})$ for any increasing $W_0$ if and only if $F \succ \tilde{F}$, where $\succ$ denotes strict FOSD. If FOSD does not hold, there is no unambiguous way to compare $F$ and $\tilde{F}$ in that there must exist distinct increasing functions $W$ and $\tilde{W}$ such that $V(W,F) > V(W,\tilde{F})$ and $V(\tilde{W},F) < V(\tilde{W},\tilde{F})$. In contrast, misspecifications of $W_0$ cannot lead to erroneous conclusions about the sign of the achievement gap if $F \succ \tilde{F}$, although the relative magnitudes of the true and observed achievement gaps may be very different.

I make a number of technical assumptions and normalizations on the observed test-score distributions and true score weighting functions in order to simplify the analysis. These assumptions do not rule out any economically interesting cases and permit much cleaner statements and proofs of the main results.

Definition 3.1. $F$ satisfies (A1) iff:
(i) $F \in \mathcal{F}$, the space of univariate distributions with continuous densities everywhere on their support. Let $f$ denote the probability density function (pdf) associated with $F$.
(ii) $\mathrm{Support}(F) = [0,1]$.

Part (i) of definition 3.1 is convenient for technical reasons and does not rule out any interesting cases. Part (ii) is just a normalization and is also without loss of generality since test scores can always be rescaled to fit in $[0,1]$ from whatever cardinal scale the researcher prefers.13

Definition 3.2. $W_0$ satisfies (A2) iff:
(i) $W_0$ is integrable with respect to any $F$ satisfying (A1).
(ii) $W_0$ is weakly increasing and right-continuous in $s$.
(iii) $W_0(s) \in [0,1]$ for all $s \in \mathrm{Support}(F)$.

Part (i) of definition 3.2 is again a technical assumption and does not rule out any interesting cases. The weakly increasing assumption in part (ii) was justified previously. The requirement in part (ii) that $W_0$ be right-continuous is another technical assumption that guarantees uniqueness of the “worst-case” weighting functions.14 Part (iii) normalizes $W_0(s)$ to have the same support as $F$. This normalization is without loss of generality because welfare is bounded and can only ever be identified up to affine transformations. One can change the units of $W_0$ without changing anything in the analysis except for the units of the distance restriction and the resulting biases. For the remainder of the paper, I will always suppose that (A1) and (A2) hold. Figure A.1 plots several possible $W_0$’s when (A2) holds. The figure shows that $W_0$ may be convex, concave, linear, and discontinuous while still satisfying (A2).

13 Suppose a researcher has a candidate cardinal scale such that test scores follow distribution $\tilde{F}$ with $\mathrm{Support}(\tilde{F}) = (a,b) \subset (-\infty,\infty)$. Since $a$ and $b$ are finite, an affine transformation will rescale test scores to $[0,1]$ while preserving the purported cardinality of $\tilde{F}$.

14 In particular, the worst-case $W_0$’s will often have discontinuous jumps somewhere on $\mathrm{Support}(F)$. Right-continuity rules out the existence of multiple $W_0$’s that differ only on these (measure-0) jumps.

In order to assess how sensitive a given achievement gap/change estimation method is to scale deviations, I must first define a distance measure on test scales. Given two candidate test scales, I define the distance between them using the sup norm.

Definition 3.3. Let $W$ and $\tilde{W}$ be test-score weighting functions on $[0,1]$. The distance between $W$ and $\tilde{W}$ is

\[ D(W,\tilde{W}) \equiv \sup_{x \in [0,1]} |W(x) - \tilde{W}(x)|. \]

$D$ is a well-defined distance function on the space of weakly increasing functions with domain and range on $[0,1]$.15

The sup norm gives an intuitive way to assess the degree to which two weighting functions disagree. If $D(W,\tilde{W})$ is very small, then at no point on $[0,1]$ do $W$ and $\tilde{W}$ differ by very much. In contrast, when $D(W,\tilde{W})$ is large, there are regions where $W$ and $\tilde{W}$ weigh scores very differently. Definition 3.3 is not the only way to formalize the notion of distance between weighting functions. For instance, one could define $\mathcal{D}(W,\tilde{W}) \equiv \int |W(x) - \tilde{W}(x)|\,dx$. This alternative definition has the advantage that it will assess a large difference in the case that $W$ and $\tilde{W}$ differ by a small amount everywhere on $[0,1]$. Using $\mathcal{D}$ instead of $D$ substantially complicates the analysis and is therefore left for future work.

Consider measuring the cross-sectional achievement gap between two groups of students as well as the changes in the cross-sectional gap over time.
Labeling the groups $A$ and $B$, and letting $F_{A,t}$ and $F_{B,t}$ denote their test-score distributions in period $t$, the true cross-sectional achievement gap between them is given by

\[ \Delta V(W_0,A,B,t) \equiv V(W_0,F_{A,t}) - V(W_0,F_{B,t}) = \int_0^1 W_0(s)\,\underbrace{[f_{A,t}(s) - f_{B,t}(s)]}_{\equiv\, \Delta f_t(s)}\,ds. \]

Similarly, the change in the achievement gap between $A$ and $B$ from $t$ to $t+1$ is16

\[ \Delta V(W_0,A,B,t,t+1) \equiv \Delta V(W_0,A,B,t+1) - \Delta V(W_0,A,B,t) = \int_0^1 W_0(s)\,\underbrace{[\Delta f_{t+1}(s) - \Delta f_t(s)]}_{\equiv\, \Delta f_{t+1,t}(s)}\,ds. \]

15 That is, for any three such functions $W$, $X$, and $Y$, the following hold: (i) $D(W,X) \ge 0$, (ii) $D(W,X) = 0$ if and only if $W = X$, (iii) $D(W,X) = D(X,W)$, and (iv) $D(X,W) \le D(X,Y) + D(Y,W)$.

16 I will exclusively use language describing gap-changes over time. However, nothing in the analysis requires time to be the dimension along which change is assessed. For instance, one could replace “$t$” with “urban school district” and “$t+1$” with “suburbanschooldistrict”, written out as “suburban school district,” and nothing about the mathematics would change.
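The two gap definitions above can be evaluated numerically for any candidate weighting function. The following is a minimal sketch under stated assumptions: the four densities are hypothetical polynomial densities on $[0,1]$, chosen only because their moments are easy to verify by hand, and the midpoint rule stands in for exact integration. None of the names come from the paper.

```python
def integrate(g, n=20000):
    """Midpoint-rule approximation to the integral of g over [0, 1]."""
    h = 1.0 / n
    return h * sum(g((i + 0.5) * h) for i in range(n))


def gap(W, f_a, f_b):
    """Cross-sectional gap: integral of W(s) * [f_a(s) - f_b(s)] over [0, 1]."""
    return integrate(lambda s: W(s) * (f_a(s) - f_b(s)))


def gap_change(W, f_a0, f_b0, f_a1, f_b1):
    """Gap in the later period minus gap in the earlier period."""
    return gap(W, f_a1, f_b1) - gap(W, f_a0, f_b0)


# Hypothetical densities on [0, 1], for illustration only.
f_a0 = lambda s: 2.0 * s          # group A, period t
f_b0 = lambda s: 2.0 * (1.0 - s)  # group B, period t
f_a1 = lambda s: 3.0 * s * s      # group A, period t+1
f_b1 = lambda s: 1.0              # group B, period t+1 (uniform)

identity = lambda s: s            # the observed scale I(s) = s
convex = lambda s: s ** 3         # one alternative increasing scale

for name, W in (("identity", identity), ("convex", convex)):
    print(name, gap(W, f_a0, f_b0), gap_change(W, f_a0, f_b0, f_a1, f_b1))
```

For these densities the integrals can be done by hand: the period-$t$ gap is $1/3$ under the identity scale and $3/10$ under the cubic scale, and the gap change is $-1/12$ and $-1/20$, respectively. Both the level and the change depend on the scale even though the underlying score distributions are held fixed.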

In both of these cases, the object of interest consists of an integral from 0 to 1 of the function $W_0(s)\Delta f(s)$, where $\Delta f$ is some sum and difference of density functions across the relevant comparison groups. The specific context matters only insofar as it alters $\Delta f$. Therefore, I will characterize bias in expressions with the general form $\Delta V(W_0,\Delta f) \equiv \int_0^1 W_0(s)\Delta f(s)\,ds$, while leaving the exact objective (cross-sectional or gap-change) in the background.

Suppose that $I(s) = s$ were used to calculate $\Delta V$ instead of $W_0$. The “pseudo-gap” as measured by $I$ would then be given by $\Delta V(I,\Delta f) = \int_0^1 s\,\Delta f(s)\,ds$. The bias created from using $I$ instead of $W_0$ is just the difference between these two $\Delta V$’s. There are two cases to consider: one that maximizes the degree to which the true difference is larger than the observed difference, and one that maximizes the degree to which the observed difference overestimates the true difference.

\[ \mathcal{B}^+(I,W_0,\Delta f) = \int_0^1 (W_0(s) - s)\,\Delta f(s)\,ds \tag{3.1} \]

\[ \mathcal{B}^-(I,W_0,\Delta f) = \int_0^1 (s - W_0(s))\,\Delta f(s)\,ds \tag{3.2} \]

$\mathcal{B}^+$ will be large when $\Delta f(s)$ and $(W_0(s) - s)$ have the same sign, while $\mathcal{B}^-$ will be large when the opposite is true. The worst-case $W_0$’s for a given $k$ are just those weighting functions that maximize $\mathcal{B}^+$ and $\mathcal{B}^-$ among all weighting functions that satisfy $D(W,I) \le k$.

Definition 3.4. Suppose that all component test-score distributions in $\Delta f$ satisfy (A1). The worst-case $W_0$’s satisfying (A2) and $D(I,W) \le k$ for a given distance restriction $k$ are then given by

\[ W_0^+(s \mid k,\Delta f) \equiv \arg\max_{W \in \mathcal{W} \,\wedge\, D(I,W) \le k} \mathcal{B}^+(I,W,\Delta f) \]

\[ W_0^-(s \mid k,\Delta f) \equiv \arg\max_{W \in \mathcal{W} \,\wedge\, D(I,W) \le k} \mathcal{B}^-(I,W,\Delta f), \]

where $\mathcal{W}$ is the set of weighting functions satisfying (A2). Let $\bar{\mathcal{B}}^+(k) = \mathcal{B}^+(I,W_0^+(s \mid k,\Delta f),\Delta f)$ and $\bar{\mathcal{B}}^-(k) = \mathcal{B}^-(I,W_0^-(s \mid k,\Delta f),\Delta f)$ denote the values of the worst-case biases given $k$.

Although $W_0^+$ and $W_0^-$ both depend on $k$ and $\Delta f$, I will often omit these arguments for brevity when their specific identities are not important.
Unless certain symmetry conditions hold on $\Delta f$, $\bar{\mathcal{B}}^+(k) \neq \bar{\mathcal{B}}^-(k)$ for most values of $k$ greater than 0. Both biases are 0 when $k = 0$, as $W_0^+$ and $W_0^-$ are both identically equal to $I$ in this case.

It is very difficult to make precise statements about these biases if the various component test-score densities are unrestricted other than the conditions imposed by (A1). Therefore, I will consider a number of special cases that encompass many realistic empirical scenarios.
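The sup-norm distance of definition 3.3 and the bias expressions (3.1) and (3.2) can be approximated on a grid for any single candidate scale. The sketch below is illustrative only: the candidate scale $W(s) = \sqrt{s}$ and the density difference $\Delta f(s) = 4s - 2$ (the difference of the hypothetical densities $2s$ and $2(1-s)$) are assumptions made for the example, and the function names are not from the paper.

```python
import math


def integrate(g, n=20000):
    """Midpoint-rule approximation to the integral of g over [0, 1]."""
    h = 1.0 / n
    return h * sum(g((i + 0.5) * h) for i in range(n))


def sup_distance(W, W_tilde, n=1000):
    """Grid approximation to D(W, W~) = sup over [0, 1] of |W(x) - W~(x)|."""
    return max(abs(W(i / n) - W_tilde(i / n)) for i in range(n + 1))


def bias_plus(W, delta_f):
    """Equation (3.1): integral of (W(s) - s) * delta_f(s) over [0, 1]."""
    return integrate(lambda s: (W(s) - s) * delta_f(s))


def bias_minus(W, delta_f):
    """Equation (3.2): integral of (s - W(s)) * delta_f(s) over [0, 1]."""
    return integrate(lambda s: (s - W(s)) * delta_f(s))


identity = lambda s: s
candidate = math.sqrt                 # hypothetical "true" scale satisfying (A2)
delta_f = lambda s: 4.0 * s - 2.0     # hypothetical density difference

print(sup_distance(identity, candidate))
print(bias_plus(candidate, delta_f), bias_minus(candidate, delta_f))
print(bias_plus(identity, delta_f))
```

Here the candidate scale lies at distance $1/4$ from the observed scale, and the two biases are equal in magnitude and opposite in sign ($\mp 1/15$) because (3.2) is the negative of (3.1) for a fixed $W$. When the candidate equals $I$, the bias is zero. This evaluates one admissible scale; it does not search for the worst case.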

Definition 3.5. $\Delta f$ satisfies (A3) iff all of its component densities satisfy (A1) and if $\exists!\, s^* \in (0,1)$ such that $\Delta f(s^*) = 0$, $\Delta f(s) < 0\ \forall s \in (0,s^*)$, and $\Delta f(s) > 0\ \forall s \in (s^*,1)$.

Assumption (A3) simply says that $\Delta f$ is negative for low values of $s$, positive for high values of $s$, and crosses 0 only once on $(0,1)$.17 Although (A3) might appear to be very narrow, it actually encompasses a number of empirically relevant cases. For example, suppose that $\Delta f(s) = f_{A,t}(s) - f_{B,t}(s)$. If the raw distributions of $A$ and $B$ are both unimodal and symmetric with similar variances and if $A$ has a higher mean than $B$, $\Delta f$ will typically satisfy (A3) after normalization.18 Whenever $F_A \succ F_B$, $\Delta f(s) = f_A(s) - f_B(s)$ will satisfy (A3). The reverse implication ((A3) implying FOSD) does not generally hold. However, even in cases where FOSD does not hold, achievement gap/change estimates will typically be quite robust under (A3).

In many interesting applications, particularly those involving achievement gap changes, $\Delta f$ will cross 0 more than once on $(0,1)$. Definition 3.6 extends definition 3.5 to allow for multiple crossing points.

Definition 3.6. $\Delta f$ satisfies (A4) for $N > 1$ if the following conditions hold:
(i) $\exists\, s_0^*, s_1^*, s_2^*, \ldots, s_N^*, s_{N+1}^*$ with $s_0^* \equiv 0 < s_1^* < s_2^* < \ldots < 1 \equiv s_{N+1}^*$ such that $\Delta f(s_i^*) = 0\ \forall i \in \{1,\ldots,N\}$.
(ii) $\Delta f(s) \neq 0$ if $s \notin \{0, s_1^*, \ldots, s_N^*, 1\}$.
(iii) $\Delta f(s) < 0$ for $s \in (0,s_1^*)$ and $\mathrm{sign}[\Delta f(s)] = -\mathrm{sign}[\Delta f(s')]$ whenever $s \in (s_{i-1}^*, s_i^*)$ and $s' \in (s_i^*, s_{i+1}^*)$, $i \in \{1,\ldots,N\}$.

Definition 3.6 says that there are exactly $N$ points on $(0,1)$ where $\Delta f$ is 0 and that at none of these points does $\frac{d\Delta f(s)}{ds}$ equal 0. Furthermore, the definition says that $\Delta f$ is negative before the first interior 0. This means that if $N$ is odd, $\Delta f(s) > 0$ on $(s_N^*,1)$, and if $N$ is even, $\Delta f(s) < 0$ on this interval. Figure A.2 in appendix A displays three $\Delta f$’s consistent with (A3) and two $\Delta f$’s consistent with (A4) for $N = 6$.
Assumption (A4) defines a very general class of functions. Since (iii) can always be guaranteed by choosing which distributions to label A and which to label B, the only substantive restrictions placed on $\Delta f$ by (A4) are that it only cross 0 a finite number of times, that there be no intervals with positive measure on which $\Delta f$ is 0, and that there be no 0's at which $\frac{d\Delta f(s)}{ds} = 0$. (A4) will be satisfied generically for virtually any finite sum or difference of densities from any commonly used distributional families.

$^{17}$ Note that (A3) only restricts $\Delta f(0)$ to be less than or equal to 0 and $\Delta f(1)$ to be greater than or equal to 0.
$^{18}$ For example, if $A \sim N(\mu_A, \sigma)$ and $B \sim N(\mu_B, \sigma)$, then $\Delta f$ will satisfy (A3) once the normalizations in (A1) are imposed.
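The crossing conditions in (A3) and (A4) can be checked numerically by counting the interior sign changes of $\Delta f$ on a grid. The sketch below does this for two equal-variance normal densities in the spirit of the example in footnote 18. The means, the common standard deviation, the grid, and the helper names `normal_pdf` and `count_sign_changes` are illustrative assumptions, not values from the paper. The densities are also not renormalized as (A1) would require; the means are placed symmetrically around 0.5 so that truncating both densities to $[0,1]$ would rescale them by the same constant and leave the crossing pattern unchanged.

```python
import math

def normal_pdf(s: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma^2) evaluated at s."""
    z = (s - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def count_sign_changes(values, tol: float = 1e-12) -> int:
    """Count sign changes, skipping entries that are numerically zero."""
    signs = [1 if v > 0 else -1 for v in values if abs(v) > tol]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# Hypothetical groups: A has the higher mean, both share the same spread.
grid = [i / 1000 for i in range(1, 1000)]  # interior points of (0, 1)
delta_f = [normal_pdf(s, 0.6, 0.15) - normal_pdf(s, 0.4, 0.15) for s in grid]

n_crossings = count_sign_changes(delta_f)  # number of interior zeros crossed
starts_negative = delta_f[0] < 0           # sign of delta_f near s = 0
print(n_crossings, starts_negative)
```

Under these assumptions the difference is negative near 0 and changes sign once, which is the (A3) pattern. A count of $N > 1$ with a negative first segment would instead point to (A4).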

4. Bounding Analysis Using the Sup Norm

I now construct closed-form expressions for $W_0^+$ and $W_0^-$ when either (A3) or (A4) hold. Under either assumption, both $W_0^+$ and $W_0^-$ have relatively simple functional forms for any value of $k \in [0,1]$. Unfortunately, it will not generally be possible to find closed-form expressions for $\bar{\mathcal{B}}^+(k)$ and $\bar{\mathcal{B}}^-(k)$. Nonetheless, knowing the forms of $W_0^+$ and $W_0^-$ makes simulating $\bar{\mathcal{B}}^+(k)$ and $\bar{\mathcal{B}}^-(k)$ relatively straightforward.

The worst-case weighting functions under (A4) nest the worst-case functions under (A3) as special cases. Even though doing so is technically redundant, I will present results for (A3) separately, as $W_0^+$ and $W_0^-$ have particularly simple and intuitive interpretations in this case. Therefore, suppose first that (A1)-(A3) hold. Theorem 4.1 below shows that the only influence that $\Delta f$ has on $W_0^+$ is through $s^*$.

Theorem 4.1. If (A1)-(A3) hold, then $W_0^+$ has the form$^{19}$

$$W_0^+(s \mid k, s^*) = \begin{cases} \max\{s-k,\, 0\}, & s \in [0, s^*) \\ \min\{s+k,\, 1\}, & s \in [s^*, 1] \end{cases} \tag{4.1}$$

Proof. In appendix C. □

$^{19}$ I will always include $s^*$ in the “upper half” of $W_0^+$ or $W_0^-$. This choice is arbitrary and unimportant since $s^*$ has 0 measure. Therefore, $W_0^+(s \mid k, s^*)$ could just as well be defined by $\max\{s-k, 0\}$ on $[0, s^*]$ and $\min\{s+k, 1\}$ on $(s^*, 1]$.

Although equation (4.1) in theorem 4.1 is somewhat difficult to parse, the intuition behind it is quite simple. Recall that $\mathcal{B}^+$ is large when $[W_0(s) - s]$ and $\Delta f(s)$ have the same sign, implying that $\mathcal{B}^+$ will be maximized when $W_0^+$ is as far as possible below the 45 degree line for values of $s$ less than $s^*$ and as far above the diagonal as possible when $s$ is greater than $s^*$. The farthest possible value below $s$ consistent with $D(I, W_0^+) \le k$ is just $\max\{s-k, 0\}$, which is the expression for $W_0^+$ on $[0, s^*)$, while the farthest possible value above is $\min\{s+k, 1\}$, which defines $W_0^+$ on $[s^*, 1]$.

In order to understand the definition in more detail, it is helpful to examine a number of cases determined by the size of $s^*$, $1 - s^*$, and $k$. If $k \ge \max\{s^*, 1-s^*\}$, then the $D \le k$ restriction never binds and $W_0^+$ is just a step function given by $W_0^+(s \mid k, s^*) = 0$ for $s < s^*$ and $W_0^+(s \mid k, s^*) = 1$ for $s \ge s^*$. If $k < \min\{s^*, 1-s^*\}$, the constraint that $D \le k$ binds on both intervals $[0, s^*)$ and $[s^*, 1]$ and $W_0^+(s \mid k, s^*)$ becomes

$$W_0^+(s \mid k, s^*) = \begin{cases} 0, & s \le k \\ s - k, & s \in (k, s^*) \\ s + k, & s \in [s^*, 1-k) \\ 1, & s \ge 1 - k. \end{cases} \tag{4.2}$$

Figure A.3 in appendix A plots equation (4.2).

The analysis for $W_0^-$ under (A1)-(A3) is substantially more involved than the analysis for $W_0^+$. The complicating factor is that $\mathcal{B}^-$ is large when $[W_0(s) - s]$ and $\Delta f$ have opposite signs. Therefore, $W_0^-$ would “like” to be as far above the diagonal as possible on $[0, s^*)$ and as far below the diagonal as possible on $[s^*, 1]$. But $W_0^-$ must be weakly increasing, so the larger $W_0^-(s^*)$ is, the smaller the possible bias contribution is on $[s^*, 1]$. $W_0^-$ must trade off these competing forces.

Equation (4.3) in theorem 4.2 defines $W_0^-(s \mid k, s^*, s_c)$, where $s_c = W_0^-(s^*)$. The functional form of $W_0^-$ is straightforward to derive given $s_c$. $W_0^-$ must be as far above the diagonal as possible on $[0, s^*)$ consistent with both $D(I, W_0^-) \le k$ and $W_0^-(s^*) = s_c$, while $W_0^-$ must be as far below the diagonal as possible for values of $s$ greater than $s^*$. Each potential choice of $s_c$ trades off bias creation below and above $s^*$ differently. Since (A1)-(A3) imply that this tradeoff is a smooth function of $s_c$, there must be some value of $s_c$ that maximizes $\bar{\mathcal{B}}^-(k)$.

Theorem 4.2. If (A1)-(A3) hold, then for some $s_c \in [\max\{s^*-k, 0\}, \min\{s^*+k, 1\}]$, $W_0^-$ is given by

$$W_0^-(s \mid k, s^*, s_c) = \begin{cases} \min\{s+k,\, s_c\}, & s \in [0, s^*) \\ \max\{s-k,\, s_c\}, & s \in [s^*, 1]. \end{cases} \tag{4.3}$$

Proof. In appendix C. □

Equation (4.3) is much easier to understand if one considers several special cases. For example, if $s_c > k$ and $s_c + k < 1$, then equation (4.3) simplifies to

$$W_0^-(s \mid k, s^*, s_c) = \begin{cases} s + k, & s < s_c - k \\ s_c, & s \in [s_c - k, s_c + k] \\ s - k, & s > s_c + k. \end{cases} \tag{4.4}$$

Equation (4.4) in some sense defines the most general form for $W_0^-$. The other possible forms are just those cases where one or both of the kink points $s_c - k$ and $s_c + k$ lie on the boundary of $[0,1]$.
If $k$ is large enough that $s_c - k \le 0$ and $s_c + k \ge 1$, $W_0^-$ will simply equal $s_c$ everywhere on $[0,1]$. If $s_c - k > 0$ and $s_c + k \ge 1$ then only the lower kink point is still present. Similarly, if $s_c - k \le 0$ and $s_c + k < 1$ then only the upper kink point is present.$^{20}$ Figure A.4 illustrates these possibilities by plotting $W_0^-$ for three different values of $s_c$.

Theorem 4.2 does not fully characterize $W_0^-$ because it does not pin down $s_c$. Since $s_c$ and $k$ jointly determine the form of $W_0^-$, for a fixed $k$, $s_c$ indexes all of the possible $W_0^-$'s consistent with $D(I, W_0^-) \le k$. Each candidate $s_c$ yields a different negative bias $\mathcal{B}^-(I, W_0^-(\cdot \mid s_c), \Delta f)$ and the worst-case $s_c$ is just the point in $[s^*-k, s^*+k]$ that maximizes $\mathcal{B}^-(I, W_0^-(\cdot \mid s_c), \Delta f)$. In practice, calculating this worst-case $s_c$ explicitly is very tedious and fairly uninformative.$^{21}$ One exception is the special case that $\Delta f(s^* - x) = -\Delta f(s^* + x)$, $\forall x \in [0, \tfrac{1}{2}]$, which implies $s^* = s_c = 0.5$.

Both $W_0^+$ and $W_0^-$ under (A1)-(A3) have an intuitive interpretation for cross-sectional achievement gaps in the case that $F_A \succ F_B$. FOSD implies that any weighting scheme will measure a positive achievement gap between A and B. The maximum possible true gap between A and B is given by $\Delta V(W_0^+(s \mid k, s^*), \Delta f)$. Since the scores in A dominate those in B, type-B students have relatively greater density among scores close to 0 and relatively lower density among scores close to 1. The true gap between A and B will therefore be very large if scores close to 0 are given as little weight as possible while scores close to 1 are weighted quite heavily, which is exactly what $W_0^+$ does. Symmetrically, the true gap between them will be as small as possible exactly when low scores are given as much weight as possible relative to high scores, which, again, is just what $W_0^-$ does.
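The closed forms in equations (4.1) and (4.3) are short enough to code directly, and the worst-case $s_c$ can then be found by a simple grid search. The following is a minimal sketch under stated assumptions: the example $\Delta f(s) = 4s - 2$ (from the hypothetical densities $f_A(s) = 2s$ and $f_B(s) = 2 - 2s$), the midpoint-rule integrator, the grid sizes, and all function names are illustrative and not taken from the paper. The bias integrals are taken to be $\int_0^1 (W(s) - s)\Delta f(s)\,ds$ on the positive side and $\int_0^1 (s - W(s))\Delta f(s)\,ds$ on the negative side, which is how the text and footnote 21 describe them.

```python
def w_plus(s: float, k: float, s_star: float) -> float:
    """Equation (4.1): lowest feasible value below s*, highest feasible from s* on."""
    return max(s - k, 0.0) if s < s_star else min(s + k, 1.0)

def w_minus(s: float, k: float, s_star: float, s_c: float) -> float:
    """Equation (4.3): climb to the plateau s_c, then stay on it as long as allowed."""
    return min(s + k, s_c) if s < s_star else max(s - k, s_c)

def delta_f(s: float) -> float:
    """Assumed example: f_A(s) = 2s and f_B(s) = 2 - 2s, so delta_f(s) = 4s - 2."""
    return 4.0 * s - 2.0

def integrate(g, n: int = 4000) -> float:
    """Midpoint rule on [0, 1]."""
    return sum(g((i + 0.5) / n) for i in range(n)) / n

def bias_plus(k: float, s_star: float) -> float:
    """Positive-side bias: integral of (W - s) * delta_f."""
    return integrate(lambda s: (w_plus(s, k, s_star) - s) * delta_f(s))

def bias_minus(k: float, s_star: float, s_c: float) -> float:
    """Negative-side bias: integral of (s - W) * delta_f."""
    return integrate(lambda s: (s - w_minus(s, k, s_star, s_c)) * delta_f(s))

def worst_case_s_c(k: float, s_star: float, steps: int = 200) -> float:
    """Grid search for s_c over [max(s* - k, 0), min(s* + k, 1)]."""
    lo, hi = max(s_star - k, 0.0), min(s_star + k, 1.0)
    candidates = [lo + (hi - lo) * j / steps for j in range(steps + 1)]
    return max(candidates, key=lambda c: bias_minus(k, s_star, c))

k, s_star = 0.2, 0.5
s_c = worst_case_s_c(k, s_star)
print(s_c, bias_plus(k, s_star), bias_minus(k, s_star, s_c))
```

Because this example satisfies $\Delta f(0.5 - x) = -\Delta f(0.5 + x)$, the grid search should land on $s_c$ at (or within one grid step of) 0.5, matching the special case noted in the text. For a $\Delta f$ estimated from data, `delta_f` would be replaced by an interpolated density difference.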
The robustness of a cardinal gap/change estimate to deviations in scale depends on how rapidly the associated biases $\mathcal{B}^+$ and $\mathcal{B}^-$ increase as $k$ increases. If these biases increase rapidly with $k$, then relatively small cardinal deviations may be sufficient to flip the sign of the gap/change estimate. In contrast, if they increase slowly, such a reversal will only be possible when $k$ is quite large. In general, it is not possible to derive closed-form expressions for $\frac{\partial \mathcal{B}^+}{\partial k}$ and $\frac{\partial \mathcal{B}^-}{\partial k}$ because these derivatives depend on the particular shape of $\Delta f$. Nonetheless, in the case that $\Delta f$ satisfies (A3) or (A4), it is still possible to gain some intuition about what features of $\Delta f$ determine how quickly the positive-side and negative-side biases increase with increases in $k$. I only present the analysis for the case that (A3) holds; the results are qualitatively similar under (A4), but the exposition is messier and less intuitive.

$^{20}$ That is, if $s_c - k > 0$ and $s_c + k \ge 1$, equation (4.3) becomes
$$W_0^-(s) = \begin{cases} s + k, & s < s_c - k \\ s_c, & s \in [s_c - k, 1]. \end{cases}$$
If $s_c - k \le 0$ and $s_c + k < 1$ then equation (4.3) simplifies to
$$W_0^-(s) = \begin{cases} s_c, & s < s_c + k \\ s - k, & s \ge s_c + k. \end{cases}$$
$^{21}$ The difficulty is that the integral $\int_0^1 (s - W_0^-(s \mid s_c))\,\Delta f(s)\,ds$ does not generally have a closed form.
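How quickly the positive-side bias grows with $k$ can be explored numerically even when no closed form is available. The sketch below evaluates the bias under the single-crossing weighting function of equation (4.1) at several values of $k$ and approximates its slope by a central finite difference. The example $\Delta f(s) = 4s - 2$ with $s^* = 0.5$, the step size `h`, the grid size, and the function names are all illustrative assumptions rather than anything specified in the paper.

```python
def w_plus(s: float, k: float, s_star: float) -> float:
    """Equation (4.1): worst-case weighting function on the positive side."""
    return max(s - k, 0.0) if s < s_star else min(s + k, 1.0)

def delta_f(s: float) -> float:
    """Assumed example density difference with a single crossing at 0.5."""
    return 4.0 * s - 2.0

def bias_plus(k: float, s_star: float = 0.5, n: int = 4000) -> float:
    """Midpoint-rule approximation of the integral of (W - s) * delta_f on [0, 1]."""
    total = 0.0
    for i in range(n):
        s = (i + 0.5) / n
        total += (w_plus(s, k, s_star) - s) * delta_f(s)
    return total / n

def slope(k: float, h: float = 1e-3) -> float:
    """Central finite-difference approximation to the derivative of the bias in k."""
    return (bias_plus(k + h) - bias_plus(k - h)) / (2.0 * h)

for k in (0.05, 0.2, 0.4):
    print(k, bias_plus(k), slope(k))
```

For this particular $\Delta f$ the integral can be done by hand for $k < 0.5$, giving a bias of $k - 2k^2 + \tfrac{4}{3}k^3$ and a slope of $(1 - 2k)^2$, so the numerical slope should be close to 0.36 at $k = 0.2$ and should shrink as $k$ rises. With an empirical $\Delta f$ only the numerical route is available.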

Theorem 4.3. If (A1)-(A3) hold and $k$ is sufficiently close to 0, then

$$\frac{\partial \mathcal{B}^+}{\partial k} = \int_k^{1-k} |\Delta f(s)|\,ds$$

$$\frac{\partial \mathcal{B}^-}{\partial k} = \int_{s_c+k}^{1} \Delta f(s)\,ds - \int_0^{s_c-k} \Delta f(s)\,ds - \frac{\partial s_c}{\partial k}\int_{s_c-k}^{s_c+k} \Delta f(s)\,ds.$$

Proof. In appendix C. □

Theorem 4.3 characterizes $\frac{\partial \mathcal{B}^+}{\partial k}$ and $\frac{\partial \mathcal{B}^-}{\partial k}$ for values of $k$ relatively close to 0.$^{22}$ The theorem shows that $\frac{\partial \mathcal{B}^+}{\partial k}$ depends on the total area (both positive and negative) between $\Delta f$ and 0 on the interval $[k, 1-k]$. If $\Delta f$ is mostly far away from 0 in this central subinterval, then the positive-side bias will increase rapidly with $k$. Furthermore, $\frac{\partial \mathcal{B}^+}{\partial k}$ is monotonically decreasing in $k$ and approaches 0 from above as $k$ approaches 0.5. The expression for $\frac{\partial \mathcal{B}^-}{\partial k}$ is somewhat harder to interpret because $s_c$ is only defined implicitly. For simplicity, suppose that $\Delta f$ satisfies $\Delta f(0.5 - x) = -\Delta f(0.5 + x)$ for any $x \in [0, 0.5]$. It is immediate in this case that $s_c$ is equal to 0.5 for all values of $k$, which implies that $\frac{\partial \mathcal{B}^-}{\partial k}$ depends only on the total area (positive and negative) between 0 and $\Delta f$ on the intervals $[0, 0.5-k]$ and $[0.5+k, 1]$, that is, on the “tails” of $[0,1]$. $\mathcal{B}^-$ will generally be more sensitive than $\mathcal{B}^+$ to the properties of $\Delta f$ near the endpoints of $[0,1]$, even in the typical case that $s_c$ depends on $k$.

Theorem 4.3 also implies that $\left.\frac{\partial \mathcal{B}^+}{\partial k}\right|_{k=0} = \left.\frac{\partial \mathcal{B}^-}{\partial k}\right|_{k=0} = \int_0^1 |\Delta f(s)|\,ds$. For values of $k$ very close to 0, $\mathcal{B}^+$ and $\mathcal{B}^-$ increase mostly symmetrically with $k$. As $k$ grows larger, the relevant subintervals of $[0,1]$ contributing the most to $\mathcal{B}^+$ and $\mathcal{B}^-$ become more and more different. This divergence, coupled with possible increases or decreases in $s_c$ as $k$ grows larger, means that $\frac{\partial \mathcal{B}^+}{\partial k}$ and $\frac{\partial \mathcal{B}^-}{\partial k}$ will not be equal generically when $k$ is greater than 0.

I now relax the single-crossing assumption in favor of (A4). This modification substantially complicates the determination of $W_0^+$ and $W_0^-$, although closed-form expressions still exist for both weighting functions.
The source of the complication is the tension between setting $W_0^+$ or $W_0^-$ as low (or high) as possible over an interval $[s_i^*, s_{i+1}^*]$ and setting it as high (or low) as possible on $[s_{i+1}^*, s_{i+2}^*]$. For example, consider $W_0^+$ in the case that $N = 2$. Since $\mathcal{B}^+$ is large when $[W_0(s) - s]$ and $\Delta f$ have the same sign, the contribution to $\mathcal{B}^+$ on $[s_1^*, s_2^*]$ is maximized when $W_0^+(s) = s + k$. However, $W_0^+(s)$ on $[s_2^*, 1]$ cannot be less than $W_0^+(s_2^*)$, but bias in this region is made larger the more negative $W_0^+(s) - s$ is. Therefore, maximizing the bias contribution on $[s_1^*, s_2^*]$ minimizes the bias contribution on $[s_2^*, 1]$. Finding $W_0^+$ requires that one balance these competing forces, and the strength of these forces depends solely on the particular shape of $\Delta f$ and the value of $k$.

$^{22}$ In particular, the expression for $\frac{\partial \mathcal{B}^+}{\partial k}$ assumes that $k < \min\{s^*, 1-s^*\}$, while the expression for $\frac{\partial \mathcal{B}^-}{\partial k}$ supposes that $k < \min\{s_c, 1-s_c\}$.
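The $N = 2$ tradeoff described above can be made concrete with a small grid search over the value that the weighting function takes at the second crossing point. The sketch below assumes the three-piece form that Theorem 4.4 gives for $N = 2$: $\max\{s-k, 0\}$ up to $s_1^*$, $\min\{s+k, s_2^+\}$ on $(s_1^*, s_2^*]$, and $\max\{s-k, s_2^+\}$ afterwards. The example $\Delta f(s) = -\cos(2\pi s)$, which corresponds to the hypothetical densities $1 \mp 0.5\cos(2\pi s)$ and crosses 0 at 0.25 and 0.75, is an illustrative assumption, as are the grid sizes and function names.

```python
import math

S1, S2 = 0.25, 0.75  # crossing points of the assumed delta_f

def delta_f(s: float) -> float:
    """Assumed example: negative on (0, S1), positive on (S1, S2), negative on (S2, 1)."""
    return -math.cos(2.0 * math.pi * s)

def w_plus_n2(s: float, k: float, s2_plus: float) -> float:
    """Three-piece worst-case form for N = 2, where s2_plus is the value at S2."""
    if s <= S1:
        return max(s - k, 0.0)      # as far below the diagonal as allowed
    if s <= S2:
        return min(s + k, s2_plus)  # as far above as allowed, capped at s2_plus
    return max(s - k, s2_plus)      # as far below as monotonicity permits

def bias_plus(k: float, s2_plus: float, n: int = 4000) -> float:
    """Midpoint-rule approximation of the integral of (W - s) * delta_f on [0, 1]."""
    total = 0.0
    for i in range(n):
        s = (i + 0.5) / n
        total += (w_plus_n2(s, k, s2_plus) - s) * delta_f(s)
    return total / n

def best_s2_plus(k: float, steps: int = 200) -> float:
    """Grid search for s2_plus over [max(S2 - k, 0), min(S2 + k, 1)]."""
    lo, hi = max(S2 - k, 0.0), min(S2 + k, 1.0)
    grid = [lo + (hi - lo) * j / steps for j in range(steps + 1)]
    return max(grid, key=lambda c: bias_plus(k, c))

k = 0.2
s2_plus = best_s2_plus(k)
print(s2_plus, bias_plus(k, s2_plus), bias_plus(k, S2 + k), bias_plus(k, S2 - k))
```

In this symmetric example neither extreme wins: pushing `s2_plus` to its upper limit helps on the middle interval but hurts on the last one, and the search settles on an interior value near 0.75. With an asymmetric $\Delta f$ the balance point would shift toward whichever interval carries more mass.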

The functional forms of $W_0^+$ and $W_0^-$ under assumption (A4) also depend on whether $N$ is even or odd. As with $W_0^-$ under (A3), both $W_0^+$ and $W_0^-$ are parametrized by the values they take at the various $\Delta f$ crossing points. In particular, $W_0^+$ is parametrized by its values at even-indexed crossing points ($s_i^*$ such that $i$ is even), while $W_0^-$ depends on its values at the odd-indexed crossing points. Theorem 4.4 below characterizes $W_0^+$ when (A4) holds for an arbitrary $N$. Figure A.5 plots potential worst-case weighting functions $W_0^+$ for the cases $N = 2$ and $N = 3$.

Theorem 4.4. If (A1), (A2), and (A4) hold for $N \in \mathbb{N}$, then there exists a non-decreasing sequence $0 \le s_2^+ \le s_4^+ \le \cdots \le 1$ such that $W_0^+(s_i^* \mid k) = s_i^+ \in [\max\{s_i^* - k, 0\}, \min\{s_i^* + k, 1\}]$ for even $i \le N$ and such that

$$W_0^+(s \mid k) = \begin{cases} \max\{s-k,\, 0\}, & s \le s_1^* \\ \min\{s+k,\, s_2^+\}, & s \in (s_1^*, s_2^*] \\ \max\{s-k,\, s_2^+\}, & s \in (s_2^*, s_3^*] \\ \quad\vdots \\ \max\{s-k,\, s_N^+\}, & s \in (s_N^*, 1] \wedge N \text{ even} \\ \min\{s+k,\, 1\}, & s \in (s_N^*, 1] \wedge N \text{ odd}. \end{cases} \tag{4.5}$$

Proof. In appendix C. □

Theorem 4.5 below characterizes $W_0^-$ for an arbitrary $N$. Figure A.6 plots potential worst-case weighting functions $W_0^-$ when $N = 2$ or $N = 3$.

Theorem 4.5. If (A1), (A2), and (A4) hold for $N \in \mathbb{N}$, then there exists a non-decreasing sequence $0 \le s_1^- \le s_3^- \le \cdots \le 1$ such that $W_0^-(s_i^* \mid k) = s_i^- \in [\max\{s_i^* - k, 0\}, \min\{s_i^* + k, 1\}]$ for odd $i \le N$ and such that

$$W_0^-(s \mid k) = \begin{cases} \min\{s+k,\, s_1^-\}, & s \le s_1^* \\ \max\{s-k,\, s_1^-\}, & s \in (s_1^*, s_2^*] \\ \min\{s+k,\, s_3^-\}, & s \in (s_2^*, s_3^*] \\ \quad\vdots \\ \min\{s+k,\, 1\}, & s \in (s_N^*, 1] \wedge N \text{ even} \\ \max\{s-k,\, s_N^-\}, & s \in (s_N^*, 1] \wedge N \text{ odd}. \end{cases} \tag{4.6}$$

Proof. In appendix C. □

Theorems 4.1 through 4.5 show that bias is maximized when the true weighting function has achievement thresholds, flat regions, cardinal regions, and kinks. Both $W_0^+$ and $W_0^-$ generically consist of regions where increases in scores are not valuable, regions where the true value increases 1-1 with observed test scores, and discontinuous achievement thresholds where the true value jumps up by a large amount. Although these worst-case $W_0$'s may look extreme compared with most test scales, they are not economically implausible. For example, consider a test score equal to the share of the Russian Cyrillic alphabet that a student knows. This test scale is interval in the sense that each score increment of $\frac{1}{33}$ corresponds to a new, identifiable skill: knowing a letter of the alphabet. However, a plausible economic weighting should be mostly flat for scores between 0 and $\frac{32}{33}$ and display a sizable jump up between $\frac{32}{33}$ and 1 because knowing the whole alphabet is a prerequisite for reading and writing in the Russian language. Similarly, a job may require a constellation of skills such that the productivity of a worker lacking any one of the skills is 0 while the productivity of a worker possessing all of the requisite skills is quite high. Finally, selective institutions may employ admissions thresholds, again creating discontinuities and kinks in the economically-relevant score weighting function.

5. Extensions

The approach presented in section 4 is substantially more general than it might at first appear. In particular, similar methods can be applied to bound both the bias in regressions using test scores as outcome variables and the bias in mean difference calculations when achievement has multiple dimensions. A complete, formal analysis of these extensions is beyond the scope of the present paper. In this section, I sketch out results for two special cases.
First, I demonstrate that theorems 4.4 and 4.5 can be straightforwardly applied to bound the bias in regression coefficients of test scores on binary predictor variables. Second, I show that these same theorems can be used to bound mean differences when there are multiple dimensions of achievement that enter $W_0$ additively separably.$^{23}$

$^{23}$ The techniques from section 4 can also be adapted to study regressions of test scores on continuously distributed covariates. A rigorous analysis of this extension is the subject of ongoing work that should appear as a separate working paper in the coming months. In contrast, analyzing multiple dimensions of achievement when $W_0$ is not additively separable presents substantial technical problems and is an area of active research.

Consider the ordinary least squares (OLS) regression of $s$ on some binary indicator $D$. The goal is to characterize the worst-case bias in the resulting regression coefficient on $D$ due to misspecification in the scale of $s$. The probability limit (plim) of the OLS estimator in this baseline regression is $\beta(I) = E[s \mid D = 1] - E[s \mid D = 0]$. If instead we had regressed $W_0(s)$ on $D$, the plim of the resulting regression coefficient would be $\beta(W_0) = E[W_0(s) \mid D = 1] - E[W_0(s) \mid D = 0]$. The difference in these plims is

$$\Delta\beta \equiv \left(E[W_0(s) \mid D = 1] - E[W_0(s) \mid D = 0]\right) - \left(E[s \mid D = 1] - E[s \mid D = 0]\right).$$

Let $f_0$ denote the pdf of $s$ conditional on $D = 0$, and $f_1$ the pdf conditional on $D = 1$. $\Delta\beta$ can then be written as $\Delta\beta = \int_0^1 (W_0(s) - s)[f_1(s) - f_0(s)]\,ds$. This is exactly the same objective function that yields $W_0^+(s \mid k)$ and $W_0^-(s \mid k)$ as worst-case weights under the restriction that $D(W, I) \le k$, assuming that $\Delta f \equiv f_1(s) - f_0(s)$ satisfies either (A3) or (A4).$^{24}$

The assumption maintained thus far that achievement has only one dimension is unrealistic: a large and growing body of research suggests that there are multiple types of achievement relevant for labor market outcomes.$^{25}$ The mean-difference bounding analysis can be easily extended to the special case that there are multiple types of achievement that enter welfare additively separably. In particular, suppose that achievement has two dimensions with corresponding ordinally perfect test scores $x$ and $y$.$^{26}$ Let $W_0(x,y)$ denote the true cardinal value of the test-score pair $(x,y)$, and suppose that this function is known to be additively separable: $W(x,y) = H(x) + G(y)$ for two increasing functions $H$ and $G$. Denote by $F$, $F_x$, and $F_y$ the joint and marginal distributions of $x$ and $y$, respectively.

Additive separability in $W$ implies that $V(W,F)$ can be decomposed into the sum $V(H,F_x) + V(G,F_y)$.$^{27}$ In turn, this implies that $F_A$ will be preferred to $F_B$ for all increasing functions $G$ and $H$ only if $F_{A,x} \succeq F_{B,x}$ and $F_{A,y} \succeq F_{B,y}$ both hold. The dependence between $x$ and $y$ does not matter here; all joint distributions $F$ with equal marginals will be ranked equally by any additively separable $W$.

Additive separability in $W$ does not imply that the bounding analysis can be carried out separately for each dimension of achievement. There are two subtleties that prevent one from considering each margin separately in constructing worst-case bounds.
The first subtlety is that using the sup norm to operationalize the distance restriction between $W_0$ and $I$ links the two dimensions of achievement because the magnitude and sign of the difference in one dimension determine the range of feasible differences along the other dimension.$^{28}$ A minor tweak to the definition of $D$ stating that the sup norm distance restriction must hold separately in each dimension is sufficient to remove this dependence. Formally, define the new distance measure as follows:

$^{24}$ The assumption that $\Delta f$ satisfies (A3) or (A4) in this context is again quite general and will be satisfied in many economically relevant settings. $D$ can always be defined such that $\Delta f$, and not $-\Delta f$, satisfies (A3) or (A4).
$^{25}$ Kautz, Heckman, et al. [24] provides a good introduction to and overview of this literature.
$^{26}$ In empirical work, researchers typically assume that these dimensions are latent factors and that observed test scores depend on some combination of the underlying factors. I abstract from these issues here, and simply suppose that we can craft tests which ordinally measure achievement along each relevant dimension.
$^{27}$ To see this, note that $V(W,F) = \int_0^1\!\int_0^1 H(x) f(x,y)\,dy\,dx + \int_0^1\!\int_0^1 G(y) f(x,y)\,dy\,dx$. But $\int_0^1\!\int_0^1 H(x) f(x,y)\,dy\,dx = \int_0^1 H(x) f_x(x)\,dx = V(H,F_x)$ and $\int_0^1\!\int_0^1 G(y) f(x,y)\,dy\,dx = \int_0^1 G(y) f_y(y)\,dy = V(G,F_y)$.
$^{28}$ To see this, consider the restriction $D(W_0, W) \le k$ and suppose that $\sup_x [H_0(x) - H(x)] = \lambda k$ for some $\lambda \in (0,1)$. Then the maximum possible value of $\sup_y [G_0(y) - G(y)]$ is $(1-\lambda)k$, while the minimum possible value is $-(1+\lambda)k$.

Definition 5.1. Suppose that $W(x,y) = H(x) + G(y)$ and $\tilde{W}(x,y) = \tilde{H}(x) + \tilde{G}(y)$. The pairwise distance between $W$ and $\tilde{W}$ is defined as

$$D_p(W, \tilde{W}) = \max\left\{ \sup_{x \in [0,1]} |H(x) - \tilde{H}(x)|,\ \sup_{y \in [0,1]} |G(y) - \tilde{G}(y)| \right\}.$$

It is straightforward to verify that $D_p$ is a valid distance measure. Under $D_p$, the possible values of $\tilde{G}(y)$ consist of the entire interval $[G(y) - k, G(y) + k]$ for any functions $\tilde{H}$ and $H$.

The second subtlety is that it may not be possible to define A and B such that $\Delta f_x$ and $\Delta f_y$ simultaneously satisfy (A3) or (A4). For example, if $\Delta f_x$ and $-\Delta f_y$ both satisfy (A4), then no reshuffling of labels can bring both $\Delta f$'s into alignment. Since A and B may be interchanged freely, there are only two distinct situations to consider: $\Delta f_x$ and $\Delta f_y$ both satisfy (A4) or only one of them does. These cases can be handled by noting that in the single-dimensional case $\mathcal{W}^+(s \mid k, \Delta f) = \mathcal{W}^-(s \mid k, -\Delta f)$ and $\mathcal{W}^-(s \mid k, \Delta f) = \mathcal{W}^+(s \mid k, -\Delta f)$ always hold.

Theorem 5.2. Suppose that (A1) and (A2) hold and that $D_p$ is used as the measure of distance between weighting functions. If $\Delta f_x$ and $\Delta f_y$ both satisfy (A4) for $N_x$ and $N_y$, $\mathcal{W}_x^+$ and $\mathcal{W}_y^+$ are given by equation 4.5 while $\mathcal{W}_x^-$ and $\mathcal{W}_y^-$ are given by equation 4.6. If instead $\Delta f_x$ and $-\Delta f_y$ satisfy (A4), then the worst-case weights for $x$ are unchanged. In contrast, $\mathcal{W}_y^+$ is given by equation 4.6 and $\mathcal{W}_y^-$ is given by equation 4.5.

Theorems 5.2, 4.4, and 4.5 give a general method for constructing worst-case weighting functions in the two dimensional case. This analysis can be generalized easily to more than two dimensions, provided that $W_0$ is additively separable in all dimensions.

6. Empirical Sensitivity Analysis

This section assesses the sensitivity to cardinal scale misspecifications of standard achievement gap/change estimates derived from several commonly used data sets.
My basic approach is to use empirical test-score distributions to estimate the $\Delta f$ associated with some achievement gap/change of interest. Given an estimate for $\Delta f$, I then numerically approximate $\bar{\mathcal{B}}^+(k)$ and $\bar{\mathcal{B}}^-(k)$ for various values of $k$. The headline conclusion from this exercise is that cross-sectional gaps are often quite robust to cardinal deviations, whereas gap changes are typically much less robust. The values of $k$ that are needed to flip the sign of most cross-sectional estimates are quite large (or nonexistent in the commonly occurring case that FOSD holds), while the values of $k$ that are needed to flip the sign of many gap change estimates are much smaller.

6.1. Data and Method. I employ four commonly-used surveys in this paper: the NLSY 1979 and 1997, the NELS 1988, and the ELS 2002. The two NLSY surveys were designed to be nationally representative and directly comparable to each other, as were the NELS and the ELS. All four surveys have comparable demographic, income, and achievement data that allow me to estimate both income and racial achievement gaps/changes. Please refer to appendix D for a more detailed discussion of these data.

I always restrict my analysis to students who were between the ages of 15 and 17 at the time of testing. I make this restriction for two reasons. First, students in this age range are relatively close to completing school, so their test scores should provide a summary of the cumulative effects of endowments and investments over time by parents, schools, and the students themselves. Second, estimates using a narrow range of student ages are not sensitive to how test scores are adjusted for student age. I do not age adjust the test scores in my baseline specifications. However, using age-adjusted scores yields similar conclusions about the sensitivity of achievement gaps to scale misspecification. Because of the timing of the surveys, I use the first follow-up survey from the NELS, collected in 1990. I use base-year data for the remaining three surveys.

Valid gap change estimates require that test scores have a constant interpretation over time.$^{29}$ Fortunately, it is possible to scale achievement scores in these surveys such that students from the NELS can be ranked consistently against students from the ELS and students in the NLSY79 can be ranked consistently against students in the NLSY97. Although the exact psychometric details differ somewhat between the pairs of surveys, the basic feature that allows such a scaling is the existence of a group of test takers who answered test questions appearing on both of the relevant achievement tests.

Each pair of surveys collects consistently defined and comparable student demographic and household income variables.
The demographic comparisons I make are by race, sex, and household income. The only subtleties involve the use of income. For the NLSY surveys, I use a comprehensive measure of household income that sums income for all household members from all sources. I use this continuous variable to define high-income youth as those respondents with household income in the top 20% of the year-specific household income distribution and low-income youth as those in the bottom 20%. The NELS and ELS surveys only record income categorically, so I define “high-income” and “low-income” to be the sets of categories that most closely approximate the upper and lower quintiles. The ELS employs imputation to fill in missing values of income and other demographics. I drop the imputed values, and I also drop missing observations and invalid responses for all variables in all four surveys. At present, my analysis does not adjust for selection into the final sample.$^{30}$

$^{29}$ In many data sets, test scores are renormed each year, invalidating this assumption. Simply normalizing scores to have a mean of 0 and a standard deviation of 1 within each year/age group is not likely to be an adequate response.
$^{30}$ In Nielsen [16] and follow-up work using the NELS/ELS, I find that neither ordinal nor cardinal income-achievement gap/change estimates are sensitive to these choices. This does not automatically imply, however, that the estimated sensitivity to cardinal deviations will be similarly unaffected. I will check the robustness of my results to these data choices in future work.

I approximate $\Delta f$, $W_0^+$, $W_0^-$, $\Delta V(W_0^+)$, and $\Delta V(W_0^-)$ numerically. I estimate the various $\Delta f$'s by first estimating each component density on a grid using a smoothed kernel estimator. I then renormalize the densities so that each has support on $[0,1]$ and estimate $\Delta f$ as the sum or difference in these normalized distributions. Importantly, I use the same normalization for all of the component densities in $\Delta f$, which guarantees that the normalized scores will still correctly order students from different surveys by their underlying achievement. $W_0^+$ and $W_0^-$ are parametrized by their values at the zeros of $\Delta f$. Therefore, I search over a grid of all possible values at these crossing points and select the configuration that maximizes bias given $k$. The results are not very sensitive to the fineness of the grid I employ.

6.2. Black/White Achievement Inequality. The $\Delta f$ functions relevant for assessing black/white achievement inequality all satisfy either (A3) or (A4) for $N = 2$ or $N = 3$. Both $\Delta f_{1990}$ and $\Delta f_{2002}$ satisfy (A3) in the NELS/ELS data; white achievement is much higher than black achievement in both surveys. Furthermore, $-\Delta f_{t+1,t}$ satisfies (A4) for $N = 3$ for both math and reading.$^{31}$ All of the cross-sectional $\Delta f_t$'s again satisfy (A3) in the NLSY data, while the gap-change $\Delta f_{t+1,t}$'s satisfy (A4) for either $N = 3$ (math) or $N = 2$ (reading). Figures A.7-A.8 plot these $\Delta f$ functions.

Figure A.9 plots $\Delta V(I, \Delta f)$, $\Delta V(W_0^-, \Delta f)$, and $\Delta V(W_0^+, \Delta f)$ as functions of $k$ for both math and reading achievement in the NELS/ELS data. The qualitative results are the same for both achievement measures, so I will discuss only the math estimates. The observed math gap in the ELS is somewhat larger than the observed gap in the NELS. Standard methods would therefore conclude that achievement inequality increased between the two surveys.$^{32}$ As $k$ grows larger, $\Delta V(W_0^+, \Delta f)$ and $\Delta V(W_0^-, \Delta f)$ diverge from the observed cross-sectional gaps in each survey. $\Delta V(W_0^-, \Delta f)$ crosses 0 and turns negative at $k \approx 0.34$ in the NELS; the observed black/white achievement gap in the NELS may not even correctly identify the sign of the true gap. In contrast, $\Delta V(W_0^-, \Delta f)$ never crosses 0 in the ELS data; misspecified test scales will never misidentify the sign of the black/white achievement gap in this survey. The observed black/white achievement gap change between the NELS and ELS is slightly greater than 0. As before, both $\Delta V(W_0^+, \Delta f)$ and $\Delta V(W_0^-, \Delta f)$ fan out from the observed gap as $k$ increases. $\Delta V(W_0^-, \Delta f)$ crosses 0 at $k \approx 0.29$. The change in the black/white achievement gap is relatively robust to cardinal deviations in these data.

$^{31}$ (A4) with $N = 3$ only holds for math achievement after the difference in the kernel-smoothed density estimates is smoothed one more time. For low values of $s$, the “raw” density difference bounces around close to 0, barely crossing 0 a number of times. Technically, then, I should compute bias in this case using (A4) and $N = 5$. I smooth a second time because removing these wiggles results in substantial improvements in computational speed and code simplicity. Furthermore, since the initial smoothed density estimates are only approximations, and since regions where $\Delta f$ is close to 0 cannot contribute much to total bias, the conclusions derived using the twice-smoothed data should be almost identical to those using the unsmoothed density difference.
$^{32}$ The thought experiment here is that these observed gaps and $\Delta f$ estimates are the population values as the group sample sizes tend to infinity.

Figure A.10 plots $\Delta V(I, \Delta f)$, $\Delta V(W_0^-, \Delta f)$, and $\Delta V(W_0^+, \Delta f)$ as functions of $k$ for the NLSY data. The cross-sectional achievement gap estimates are somewhat less sensitive to $k$ than the gaps in the NELS/ELS data. The sign of the math gap will never be misidentified in either survey. For $k > 0.39$, the reading gaps using $W_0^-$ turn negative, but they remain very close to 0. With slightly different smoothing settings on the kernel estimation, these asymptotes also remain above 0.$^{33}$ In contrast to the NELS/ELS data, the observed mean difference in scores suggests that black/white inequality decreased moderately between these two surveys. However, these gap change estimates are much more sensitive to changes in $k$. $\Delta V(W_0^+)$ crosses 0 and becomes positive at $k \approx 0.1$.

6.3. High-/Low-Income Achievement Gaps/Changes. I repeat the sensitivity analysis in the NELS/ELS and NLSY for achievement gaps/changes between youth from high- and low-income households. Generally, the cardinal sensitivity is more pronounced for income-achievement gaps/changes than for black/white estimates. Figure A.12 shows that the cross-sectional $\Delta f$'s for math and reading in the NELS/ELS data satisfy (A3), while the gap-change $\Delta f$'s satisfy (A4) for $N = 3$ (math) or $N = 2$ (reading). Figure A.13 plots the cross-sectional and gap-change $\Delta V$'s for different values of $k$. The observed cross-sectional gaps are positive and quite large.$^{34}$ For math achievement, the observed gap in the NELS is slightly larger than the observed gap in the ELS, while for reading achievement the situation is reversed. In neither survey does $\Delta V(W_0^-)$ for math ever drop below 0. For reading achievement, $\Delta V(W_0^-)$ barely dips below 0 for $k > 0.4$ in the NELS and never crosses 0 in the ELS.
In contrast, the income-achievement gap change estimates are not at all robust. The observed gap changes are fairly close to 0, so that relatively small values of $k$ are sufficient to flip the sign of the observed versus the true gap change. In the NELS/ELS data, $\Delta V(W_0^+)$ for math goes from negative to positive at $k \approx 0.1$, while $\Delta V(W_0^-)$ for reading flips from positive to negative at $k \approx 0.04$. Cardinal methods applied to almost any test scale would correctly identify a large positive income achievement gap in any cross section, but cardinal methods applied to misspecified scales could quite easily misidentify the sign of the gap change in the NELS/ELS data.

$^{33}$ I plan to develop valid inferential procedures for this setting in future work.
$^{34}$ In many data sets covering recent decades, the top vs. bottom quintile achievement gap measured in standard-deviation units is roughly equal to the black/white achievement gap.

6.4. What if Z-Scores Are Used? The calculations in section 6.3 deviate from most of the literature on achievement inequality in that they do not use cohort/year/age z-scores to estimate achievement differences. Instead, they use equivalent scores that enable one to rank students from different surveys against each other consistently. There are strong reasons to prefer equivalent scores, and there is no reason to think that z-score gap/change estimates will be more robust to cardinal deviations. Indeed, I demonstrate in this section that estimates computed using z-scores are at least as sensitive to cardinal deviations as estimates using equivalent scores, if not more so.

Figures A.11 and A.14 reproduce figures A.9 and A.13 for black/white achievement gaps/changes in the NELS/ELS data using survey/age z-scores instead of equivalent scores. The difference in robustness between cross-sectional and gap-change estimates is even starker using z-scores. Neither cross-sectional black/white math gap ever falls below 0, while the $W_0^-$-measured gap change flips sign at $k \approx 0.06$. This is a much lower critical value than the $k \approx 0.3$ needed to flip the sign using equivalent scores. The z-score gap/change estimates for reading achievement likewise do not suggest greater robustness to cardinal deviations. The differences in cross-sectional income-achievement gap sensitivity are less dramatic. The income-achievement gap change estimates are substantially more sensitive to cardinal deviations than are the cross-sectional estimates; the observed gap change using reading z-scores is very close to 0, so that the sign of the estimate flips at $k \approx 0$.

6.5. The Magnitude of k. The empirical estimates using the NELS and ELS cohorts showed that some achievement gaps/changes are identified up to sign no matter how different the true and observed test scales are. For other achievement gaps/changes, the sign may be misidentified by the observed test scores for sufficiently large values of $k$. The magnitude of the smallest $k$ for which a sign reversal is possible varies enormously across different comparisons, from a minimum of 0.04 to a maximum of 0.4. Since the bounding analysis is well-defined for any $k$ in $[0,1]$, a value of 0.04 might seem small and 0.4 might seem large. However, it is not actually clear what the scale of $k$ means.
Pinning down the scale of $k$ is a fundamentally hard problem since the relevant units of achievement are not knowable (remember that I simply normalized both $s$ and $W_0(s)$ to be in $[0,1]$). This section explores a number of methods to determine what constitutes a “large” or a “small” value of $k$.

Education researchers are familiar with test scores normalized to have a mean of 0 and a standard deviation of 1. Although my work here and in other papers questions whether such z-scores have an interpretable scale, it is still possible for me to report $\Delta V^+$, $\Delta V^-$, and $k$ in standard-deviation units. For instance, the math z-scores in the NELS and ELS have a range of -2.2 to 2.4, which implies that $k = 0.04$ corresponds to $(2.4 + 2.2) \times 0.04 \approx 0.18$ standard-deviation units, while $k = 0.4$ corresponds to about 1.8 standard-deviation units. Students typically gain about 0.07 standard deviations of achievement per month in primary school, so a difference of 0.18 is neither very large nor very small by this metric, while 1.8 is huge.^35 Cross-sectional black/white and high-/low-income mean achievement gaps are typically around 0.5 to 0.8 standard deviations, again making $k = 0.04$ seem relatively small and $k = 0.4$ relatively large.^36

^35 Krueger [12] uses the Tennessee STAR experiment to estimate that smaller class sizes correspond to achievement gains of about 0.22 standard deviations. He argues that this figure corresponds to about 3 months of progress in school. Since most of the literature examining the effects of various inputs on student achievement applies cardinal methods to z-scores, I can compare the “z-score” units of $k$ to virtually any educational effect size I wish. For example, Hanushek and Rivkin [10] review the literature on teacher value-added models and report that a standard deviation in teacher performance is associated with student gains on the order of 0.1 to 0.2 standard deviations.

^36 In my data, the black/white math gap is 0.79 in the NELS and 0.84 in the ELS. Fryer and Levitt [7] estimate black/white achievement gaps for early elementary school students of between 0.4 and 0.7. Reardon [19] estimates the math achievement gap between students from the 90th and 10th percentiles of the household income distribution to be around 1 in the NELS and 1.1 in the ELS. In contrast, I estimate that the NELS math income-achievement gap is 1.039 and in the ELS it is 0.904.

Figure A.15 plots $W_0^+$ and $W_0^-$ for $k = 0.1$ and $k = 0.4$ using the income-achievement math $\Delta f$ estimated from the NELS/ELS data. For these data, $k = 0.1$ is sufficient for the observed test scores to misidentify the sign of the true gap change. The right panel of figure A.15 shows that the worst-case weighting functions for $k = 0.1$ do not look particularly extreme. Under both $W_0^+$ and $W_0^-$, the observed scores are cardinal for most of $[0,1]$, and neither weighting function ever strays too far from the identity function. In contrast, $W_0^+$ and $W_0^-$ look very different from the identity when $k = 0.4$; the observed scores are almost never cardinal and the jumps at the achievement thresholds are very large.

Figure A.16 plots $W_0^-(s \mid k = 0.04)$ and $W_0^+(s \mid k = 0.04)$ for the case that $\Delta f$ is symmetric and satisfies (A3). Since $\Delta f$ is symmetric, all of the weighting functions are symmetric as well. Visual inspection suggests that $k = 0.04$ is not much of a deviation, while $k = 0.40$ marks a substantial departure from cardinality.

6.6. Estimation Error. The analysis so far has ignored estimation error in calculating the values $k^*$ for which the $W_0^+$- or $W_0^-$-weighted gaps/changes flip sign relative to their observed counterparts.^37 The $\Delta f$'s that critically determine the sensitivity of the gap/change estimates to cardinal deviations are themselves estimated from the data. The true $\Delta f$'s might differ substantially from their sample analogues, which implies that the estimated $k^*$'s may differ from their population values.

^37 Formally, define $k^* = \inf\{k \mid \Delta V(W_0^-(s \mid k), \Delta f) < 0\}$ if $\Delta V(I, \Delta f) > 0$ and $k^* = \inf\{k \mid \Delta V(W_0^+(s \mid k), \Delta f) > 0\}$ if $\Delta V(I, \Delta f) < 0$ in the case that a sign flip is possible. If there is no such $k$, then set $k^* = 1$.

From one perspective, this concern is secondary to the main thrust of the paper. The estimated $\Delta f$'s are consistent estimates of the population $\Delta f$'s, and, as such, they are plausible guesses for $\Delta f$'s that govern bias in important, applied settings. The empirical results show that for most of these $\Delta f$'s, it is possible to flip the sign of the gap/change estimate for sufficiently large values of $k$. Furthermore, the results show that $k^*$ is often quite small. Even without knowing the estimation errors associated with my empirical procedure, I have certainly supplied ample evidence that cardinal methods applied to test-score data are quite likely to be sensitive to scale misspecification.

However, in order to state with confidence that the specific gaps/changes I have identified as being sensitive to cardinal deviations are in fact sensitive to cardinal deviations, I need some way to account for the effect of estimation error on $\Delta V^+$ and $\Delta V^-$. Bootstrapping is difficult to implement in this setting because the forms of $W_0^+$ and $W_0^-$ depend on the number of times $\Delta f$ crosses 0, and different bootstrap iterations may result in $\Delta f$'s that cross 0 a different number of times. This problem is most acute for the empirical estimates of gap changes; the cross-sectional $\Delta f$'s essentially never cross 0 more than once on the interior of $[0,1]$. An additional difficulty is that the bootstrap has not been formally justified in this setting. Working out these theoretical and empirical challenges is on the agenda for future research.

7. Power and Size Calculations

I have shown that for sufficiently large values of $k$, cardinal methods using observed test scores may misidentify the sign of an achievement gap/change in the limit as the group sample sizes tend to infinity. For small values of $k$, the incorrectly specified scale will correctly identify the sign of the achievement gap/change, although the relative magnitudes of the true and observed gaps may be quite far off. In this case, is there any advantage to using the comparatively simple, cardinal approaches familiar to most researchers? It turns out that there is: statistical power. In a loose sense, cardinal methods use more of the information contained in the test-score distribution. If that information mostly preserves the relevant cardinal differences in the true test scale, then such methods may be more likely to reject false null hypotheses at a given level.

7.1. Theoretical Discussion, Cross-Sectional Achievement Gaps. Consider the problem of assessing which of two test-score distributions, $F_A$ or $F_B$, represents greater overall achievement given independent, random samples of sizes $N_A$ and $N_B$ from each population. Suppose that $F_A \succ F_B$, so that any reasonable method for assessing achievement differences should asymptotically reject with probability 1 the null hypothesis that group B has more achievement.
Given that $F_A \succ F_B$ is true, the power of a given testing procedure is just the probability that the false null $F_B \succeq F_A$ is rejected.

I use the procedure developed in Barrett and Donald [2] to test for stochastic dominance. This method allows one to test the null $H_0: F_B(s) \le F_A(s)\ \forall s$ against the alternative $H_1: \exists\, \tilde{s} \mid F_B(\tilde{s}) > F_A(\tilde{s})$ using a test statistic, $\widehat{BD}$, that is a modified form of the well-known Kolmogorov-Smirnov statistic.^38 Since the null of this test is exactly the false null that we wish to reject when $F_A \succ F_B$, the relevant power is just the probability that this null is rejected. To my knowledge, there is no analytic formula for the power of this test. Therefore, I use simulation in the next section (7.2) to compare the power of the Barrett and Donald testing procedure to the power of (cardinal) z-tests of the difference in group means when $F_A \succ F_B$.

^38 Formally, they define

$$\widehat{BD} \equiv \sqrt{\frac{N_A N_B}{N_A + N_B}}\, \sup_s \left( \hat{F}_A(s) - \hat{F}_B(s) \right), \qquad \hat{F}_G(s) = \frac{1}{N_G} \sum_{i=1}^{N_G} I(s_i < s),$$

and show that $\Pr(\widehat{BD} > c) \to \exp(-2c^2)$ when $(N_A, N_B) \to (\infty, \infty)$ such that $\frac{N_A}{N_A + N_B} \to \lambda > 0$. This implies that the level-$\alpha$ critical value $c_\alpha$ is given by $c_\alpha = \sqrt{-\tfrac{1}{2} \ln(\alpha)}$.
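As a rough illustration of footnote 38, the following Python sketch computes a one-sided statistic of this form and its asymptotic critical value. It is a minimal sketch rather than the implementation used for the results in this paper. The function names are hypothetical, the empirical CDFs use a weak inequality (which makes no difference for continuous scores), and the supremum is taken over $\hat{F}_B - \hat{F}_A$, the direction that provides evidence against the stated null $H_0: F_B(s) \le F_A(s)$. Footnote 38 prints the difference in the opposite order, so that orientation is an assumption here.

```python
import numpy as np


def bd_statistic(sample_a, sample_b):
    """One-sided two-sample statistic in the spirit of Barrett and Donald.

    Large values are evidence against the null F_B(s) <= F_A(s) for all s.
    The orientation of the difference is an assumption of this sketch.
    """
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    n_a, n_b = a.size, b.size
    # The supremum of a difference of step functions is attained at a
    # sample point, so it suffices to evaluate at the pooled scores.
    pooled = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, pooled, side="right") / n_a
    cdf_b = np.searchsorted(b, pooled, side="right") / n_b
    scale = np.sqrt(n_a * n_b / (n_a + n_b))
    return float(scale * np.max(cdf_b - cdf_a))


def bd_critical_value(alpha):
    """Asymptotic level-alpha critical value, c = sqrt(-ln(alpha) / 2)."""
    return float(np.sqrt(-0.5 * np.log(alpha)))


# Toy usage with made-up scores: group A lies entirely above group B.
statistic = bd_statistic([1.0, 2.0], [0.0, 0.5])
reject = statistic > bd_critical_value(0.05)
```

With samples this small the asymptotic critical value is only indicative; the toy data are there to show the call pattern, not to reproduce any estimate in the text.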

Suppose it is known that the test-score distributions of groups A and B have the same shape but that $F_A$ is shifted to the right relative to $F_B$.^39 Let $\sigma_G$ and $\mu_G$ represent the standard deviation and mean, respectively, of the test scores in groups $G \in \{A, B\}$. Since $F_A$ is simply $F_B$ shifted to the right, $\sigma_A = \sigma_B$ and $\mu_A > \mu_B$. In this case, the null and alternative hypotheses that correspond to the $\widehat{BD}$ test of FOSD are $H_0: \Delta\mu \le 0$ and $H_1: \Delta\mu > 0$, where $\Delta\mu \equiv \mu_A - \mu_B$. The statistic

$$\hat{Z}_{\Delta\mu} \equiv \frac{\widehat{\Delta\mu}}{\sigma \sqrt{N_A^{-1} + N_B^{-1}}}$$

is asymptotically a standard normal random variable that can be used for hypothesis testing. It is straightforward to show that the power function for this test at level $\alpha$ is

$$\pi(\Delta\mu) = 1 - \Phi\left( z_\alpha - \frac{\Delta\mu}{\sigma \sqrt{N_A^{-1} + N_B^{-1}}} \right).^{40}$$

Section 4 showed that for a fixed $k$, $W_0^+$ and $W_0^-$ have flat regions and/or discontinuous jumps. These features have the potential to affect the power of both cardinal and ordinal tests of achievement gaps. To see why, consider ordinal testing in the case that $F_A \succ F_B$ such that $F_A(s) = F_B(s)\ \forall s \notin [\underline{s}, \bar{s}]$ and $W_0(s) = c,\ \forall s \in [\underline{s}, \bar{s}]$. Under these assumptions, the observed test score distribution for group A dominates the score distribution from group B, but the economically relevant score distributions of the two groups, $H_A$ and $H_B$, are equal. In this case, FOSD tests will always reject the null that $F_A \preceq F_B$ as the group sample sizes jointly tend to infinity. However, the economically relevant null is not whether $F_B$ dominates $F_A$ but whether $H_B$ dominates $H_A$. Since $H_B = H_A$ by construction, FOSD tests of the correctly weighted score distributions will never reject the null for arbitrarily large samples. This situation will also cause z-tests on the observed scores to lead researchers to the wrong conclusion; the observed difference in means will be positive while the true difference in means is 0.
The example in the previous paragraph is quite extreme. In all of the simulations and empirical estimates I have presented, $\hat{F}_A(s) \ne \hat{F}_B(s)$ almost everywhere. Furthermore, $W_0^+$ and $W_0^-$ will typically have non-flat regions precisely where $\hat{F}_A$ and $\hat{F}_B$ are most different. Nonetheless, stochastic dominance tests and z-tests (or t-tests) of mean differences will typically have different rejection rates depending on whether the observed scores or the true scores are used.

^39 That is, $f_A(s) = f_B(s - \delta)\ \forall s$.

^40 If the variances are estimated from the data, then t-tests should be used instead of z-tests. In practice, for group sample sizes larger than 50, t-tests and z-tests provide virtually identical power for a given level $\alpha$. All of these formulas hold exactly in the limit as $N_A$ and $N_B$ jointly go to $\infty$, or in the case that the score distributions are jointly normal and independent. However, the formulas will be very close approximations in even moderately sized samples.

Ordinal FOSD tests using $W_0^+$ and $W_0^-$ reject the null at different rates than tests using the observed scores only because $W_0^+$ and $W_0^-$ are not strictly monotone functions of the observed scores. However, there is an interpretation of $W_0^+$ and $W_0^-$ that avoids this problem. Consider an amendment to assumption (A2) stating that $W_0$ be strictly increasing everywhere on $[0,1]$ with derivative never less than $\varepsilon > 0$. Under this alternative version of (A2), it is straightforward to show that $W_0^-(s \mid k)$ and $W_0^+(s \mid k)$ as defined in theorems 4.1 to 4.5 are just the limits of $W_0^-(s \mid k, \varepsilon)$ and $W_0^+(s \mid k, \varepsilon)$ as $\varepsilon \to 0$. Since $\mathcal{B}^+$ and $\mathcal{B}^-$ are smooth functions of $W_0^+$ and $W_0^-$, the upper and lower bounds for $\Delta V$ can be thought of as the limits of the bounds using $W_0^-(s \mid k, \varepsilon)$ and $W_0^+(s \mid k, \varepsilon)$ as $\varepsilon \to 0$. For a very small value of $\varepsilon$, these bounds will be indistinguishable from each other. As long as $\varepsilon > 0$, the power of ordinal tests will be unchanged for any $k$. Put differently, there is a discontinuity in the power function of the ordinal tests when $\varepsilon$ hits 0. Please refer to appendix C for a formal demonstration of these various claims. Figure C.1 in that appendix plots $W_0^-(s \mid k, \varepsilon)$ and $W_0^+(s \mid k, \varepsilon)$ in the case that $\Delta f$ satisfies (A3).

Unlike ordinal tests, the power of z-tests using the correctly weighted sample means is a smooth function of $\varepsilon$ for any $k$. This fact has several important implications. First, it implies that the power of the z-test will generally be a function of $k$ for any $\varepsilon \ge 0$. Therefore, z-tests using the observed score distributions will either be too likely or too unlikely to reject the relevant null compared with the same test applied to the true test scale. Second, it implies that for $\varepsilon \approx 0$ and $k$ small, z-tests using the observed scores will have greater power than ordinal FOSD tests. However, as $k$ increases, the power of the z-tests applied to the true scores will decrease (or increase) depending on the sign of the observed gap and whether one looks at $\mathcal{B}^+$ or $\mathcal{B}^-$. At some point, the power of the cardinal tests for either $W_0^+$ or $W_0^-$ may fall below the power of the FOSD tests. Furthermore, as $k$ grows large, the difference between the power of the z-test applied to the observed scores and the power of the z-test applied to the true scores will widen. In contrast, the power of the ordinal test does not depend on $k$.

The application of the $\widehat{BD}$ statistic to testing achievement gap changes is only slightly more involved.
In a parallel working paper, I show that there are two conditions necessary to infer that the achievement gap between groups A and B narrowed unambiguously between periods $t$ and $t+1$: $F_{A,t} \succeq F_{A,t+1}$ and $F_{B,t+1} \succeq F_{B,t}$. In other words, group A's achievement needs to have declined unambiguously, while group B's achievement needs to have increased. If at least one of these stochastic dominance relationships is strict, then any increasing set of weights $W$ would assess a smaller achievement gap in $t+1$ than in $t$.^41 These stochastic dominance relationships can be tested using the same $\widehat{BD}$ statistics that I used to test cross-sectional gaps.

^41 As formulated, it is possible for either $F_{B,t} \succ F_{A,t}$ or $F_{B,t+1} \succ F_{A,t+1}$. I will usually study empirical settings where the “high” group A dominates the “low” group B in each cross section, but if this is not the case, nothing of importance changes. Instead of the gap narrowing, one would just say that B gained relative to A unambiguously.

The cardinal analysis for gap changes involves only a slight modification of the cross-sectional approach. Suppose now that $F_{A,t}$, $F_{B,t}$, $F_{A,t+1}$, and $F_{B,t+1}$ are identical except for location. Let $\Delta\mu_t$ denote the difference in means between group A and group B in time $t$ and suppose $\mu_{A,t} \ge \mu_{A,t+1}$ and $\mu_{B,t} \le \mu_{B,t+1}$ hold with at least one of the inequalities strict. These assumptions imply that the achievement gap unambiguously decreased between $t$ and $t+1$. As before, an appropriately chosen z-test is adequate to test the null that the gap increased against the alternative that it decreased.

7.2. Simulation Results. I simulate cross-sectional achievement gaps when $F_A = N(\mu_A, \sigma^2)$ and $F_B = N(\mu_B, \sigma^2)$ and $\mu_A \ge \mu_B$. If $\mu_A$ is strictly greater than $\mu_B$, then $F_A$ first-order stochastically dominates $F_B$, which implies that cardinal methods will correctly identify the sign of the achievement gap for any $k < 0.5$ and will never identify a negative gap for any $k$.^42 Since cardinal and ordinal methods will agree in the limit for any $k$, it is sufficient in this case to compare cardinal tests of $\mu_A \ge \mu_B$ against ordinal tests of FOSD. I use simulated data to estimate the power of the BD test and compare it to the theoretical z-test power.

Figure A.17 shows the simulated power of the BD test against the theoretical power of the z-test. The left panel shows that both of these powers increase as the sample sizes increase, holding $\Delta\mu$ fixed. The right panel plots both powers as a function of $\Delta\mu$ holding $N$ fixed at 500. For small $\Delta\mu$, neither test is very powerful, and both powers increase monotonically as $\Delta\mu$ increases. Strikingly, for a given pair $(N_A = N_B = N, \Delta\mu)$, the power of the z-test always lies above the power of the BD test.^43 When the observed test scores are cardinally comparable, cardinal methods are always more powerful. The figure also shows the power curves using test scores rescaled according to $W_0^-(\cdot \mid k = 0.1)$. The basic patterns are largely unchanged, but the tests using $W_0^-$ are uniformly less powerful than those using the raw scores. This is intuitive, as by construction $W_0^-$ narrows the true mean gap as much as possible given $k$. It is interesting to note that the drop-off in power as $k$ increases is much more dramatic for the BD tests than for the z-tests.
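The comparison described in section 7.2 can be sketched in a few lines of Python: simulate normal samples, record how often a BD-type test rejects, and set that rate against the closed-form z-test power from section 7.1. This is an illustrative sketch under stated assumptions, not the code behind figure A.17. The function names, $\sigma = 1$, the number of replications, and the seed are all hypothetical choices, and the BD-type statistic takes the supremum of $\hat{F}_B - \hat{F}_A$, which is an assumption about the orientation of the test.

```python
import numpy as np
from statistics import NormalDist


def bd_statistic(sample_a, sample_b):
    """One-sided statistic; large values are evidence that A dominates B."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    pooled = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, pooled, side="right") / a.size
    cdf_b = np.searchsorted(b, pooled, side="right") / b.size
    scale = np.sqrt(a.size * b.size / (a.size + b.size))
    return float(scale * np.max(cdf_b - cdf_a))


def z_test_power(delta_mu, sigma, n_a, n_b, alpha=0.05):
    """Theoretical power 1 - Phi(z_alpha - delta_mu / (sigma * sqrt(1/N_A + 1/N_B)))."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha)
    standard_error = sigma * (1.0 / n_a + 1.0 / n_b) ** 0.5
    return 1.0 - std_normal.cdf(z_alpha - delta_mu / standard_error)


def simulated_bd_power(delta_mu, n, reps=500, alpha=0.05, sigma=1.0, seed=0):
    """Share of simulated samples in which the BD-type test rejects."""
    rng = np.random.default_rng(seed)
    critical_value = np.sqrt(-0.5 * np.log(alpha))  # asymptotic c_alpha
    rejections = 0
    for _ in range(reps):
        group_a = rng.normal(delta_mu, sigma, n)  # shifted right by delta_mu
        group_b = rng.normal(0.0, sigma, n)
        if bd_statistic(group_a, group_b) > critical_value:
            rejections += 1
    return rejections / reps


# Setting used for figure A.18 in the text: delta_mu = 0.25, N = 200.
z_power = z_test_power(0.25, 1.0, 200, 200)
bd_power = simulated_bd_power(0.25, 200)
```

Repeating the last two lines over a grid of $\Delta\mu$ or $N$ values would trace out power curves of the kind the text describes, subject to simulation noise in the BD estimate.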
Figure A.18 compares the power of the z-test applied to $W_0^-(s \mid k)$ for different values of $k$ to the power of the BD test applied to the original test scores when $\Delta\mu = 0.25$ and $N = 200$. Applying the BD test to the raw scores is motivated by the re-conceptualization of $W_0^*(s \mid k)$ as $\lim_{\varepsilon \to 0} W_0^*(s \mid k, \varepsilon)$. The power of the BD test does not depend on $k$, while the power of the z-test applied to $W_0^-(s \mid k)$ decreases monotonically in $k$. For small values of $k$, the power of the z-test is very close to its power applied to the raw scores and is strictly above the power of the BD test. As $k$ increases from 0, these two powers get closer to each other, eventually crossing. This means that for values of $k$ close to 0, z-tests applied to the true scores will be more powerful than ordinal tests of FOSD. However, when $k$ is large, ordinal tests will actually be more powerful.

^42 When $k \ge 0.5$, $W_0^-$ is 0 on $[0, 0.5)$ and 1 on $[0.5, 1]$, resulting in a gap estimate of 0.

^43 Except for $\Delta\mu = 0$, in which case both have power equal to $\alpha$.

The simulation results for achievement gap changes yield essentially the same conclusions. If the observed test scores are truly cardinal, then cardinal tests will have greater power. Cardinal tests will continue to be superior for small values of $k$, but as $k$ grows, cardinal tests lose power. Ordinal tests will be more powerful in most cases given a sufficiently large $k$ provided that one adopts the “small $\varepsilon$” interpretation of $W_0^+$ and $W_0^-$.

The alert reader may have noticed something peculiar about this discussion. My claim is that when $k$ is large, the correctly weighted test score gaps/changes may be quite close to 0. For a given sample size, this means that as $k$ increases, the power of cardinal tests applied to the true scores to determine the sign of the achievement gap/change decreases. At the same time, under the small-$\varepsilon$ interpretation of $W_0^+$ and $W_0^-$, ordinal tests are unchanged for any $k$ such that the true and the observed gap/change have the same sign. But if the true mean difference in the scores is very close to 0, shouldn't cardinal tests on these scores accurately measure this difference? Why is it desirable for ordinal statistics to identify an arbitrarily small gap/change? The solution to this conundrum consists of two observations. First, the difference in power is driven by the fact that ordinal statistics only attempt to determine the sign of a given gap/change, while cardinal methods attempt to determine both the sign and magnitude of the gap/change. Second, $W_0 \in [0,1]$ is just a normalization. The economic scale of $W_0$ might be huge. For example, consider $W_0$ denominated in units of lifetime income. For such a weighting function, even a very small difference in the normalized scale might correspond to an economically significant difference in the un-normalized scale.

8. Conclusion and Extensions

This paper develops a method for assessing the sensitivity of standard achievement gap/change estimates using test-score data to cardinal deviations in the test scale. The method makes precise the intuitive idea that cardinal methods will provide mostly valid inference on achievement gaps/changes when the true scale and the observed scale are very close to each other and very incorrect inference when the two scales are very different.
The approach is readily interpretable and straightforward to apply in most real-world empirical scenarios.

I use my proposed method to investigate the cardinal sensitivity of standard achievement gap/change estimates in the NLSY and NELS/ELS data. I find that cross-sectional black/white and high-/low-income achievement gaps are usually robust to cardinal deviations in these data. In many cases, there is no rescaling of the test scores that would reverse the sign of the estimated gap, while in other cases the true scale would have to be quite different from the observed scale in order for the sign of the estimate to be misidentified. In contrast, achievement gap change estimates in these data are much less robust; even small deviations in the cardinality of the true scale relative to the observed scale are often sufficient to reverse the sign of the estimate. Not only might standard methods misidentify the sign of an achievement gap/change in the limit as the sample sizes tend to infinity, they will also have incorrect size and lower power than ordinal methods in finite samples if the test scale is incorrectly specified.

Cardinal statistical methods are easy to use and familiar to most researchers. If the observed test scale is close to the true scale, cardinal methods are preferable because they have greater power than ordinal approaches. This paper has shown that relying on such methods may lead one very far astray if the true scale and the observed test scale are sufficiently different from each other. Ultimately, the true scale of achievement is unknowable in most applied work. The researcher must use her own judgment about how to use test-score data. However, if my sensitivity method shows that a given conclusion using cardinal methods is quite sensitive to the (essentially arbitrary) test scale used, applied researchers may wish to abandon cardinal approaches and instead rely only on the scale-independent, ordinal content of the test scores.

Both the theoretical and empirical work presented here are quite preliminary, and each calls out for a number of extensions. The bounding analysis depends on the choice of distance measure. The sup norm is a plausible distance measure to use, and it yields tractable expressions for the worst-case score weighting functions. Nonetheless, other distance measures, such as the Wasserstein distance, may produce bounds that are easier to interpret. Empirically, it would be worthwhile to extend the sensitivity analysis to other achievement gaps/changes and other data sets. It would also be useful to work out more completely how to conduct valid inference on $k^*$. Finally, future work should investigate the applicability of the methods presented here to non-mean-based cardinal uses of test scores.

References

[1] Joseph Altonji, Prashant Bharadwaj, and Fabian Lange. Changes in the Characteristics of American Youth: Implications for Adult Outcomes. Journal of Labor Economics, 30(4):783–828, 2011.
[2] Garry Barrett and Stephen Donald. Consistent Tests for Stochastic Dominance. Econometrica, 71:71–104, 2003.
[3] Timothy Bond and Kevin Lang. The Evolution of the Black-White Test Score Gap in Grades K-3: The Fragility of Results. Review of Economics and Statistics, 95:1468–1479, 2013.
[4] Elizabeth Cascio and Douglas Staiger. Knowledge, Tests, and Fadeout in Education Intervention. NBER Working Papers, 18038, 2012.
[5] Charles Clotfelter, Helen Ladd, and Jacob Vigdor. The Academic Achievement Gap in Grades 3-8. The Review of Economics and Statistics, 91:398–419, 2009.
[6] Greg Duncan and Katherine Magnuson. The Role of Family Socioeconomic Resources in the Black-White Test Score Gap Among Young Children. Developmental Review, 87:365–399, 2006.
[7] Roland G. Fryer and Steven D. Levitt. Understanding the Black-White Test Score Gap in the First Two Years of School. The Review of Economics and Statistics, 86(2):447–464, 2004.
[8] Roland G. Fryer and Steven D. Levitt. The Black-White Test Score Gap Through Third Grade. American Law and Economics Review, 8:249–81, 2006.
[9] Eric Hanushek and Steven Rivkin. School Quality and the Black-White Achievement Gap. NBER Working Papers, 12651, 2006.
[10] Eric Hanushek and Steven Rivkin. The Distribution of Teacher Quality and Implications for Policy. Annual Review of Economics, 4:131–57, 2012.
[11] Caroline Hoxby. The Effects of Class Size on Student Achievement: New Evidence from Population Variation. Quarterly Journal of Economics, 115(4):1239–1285, 2000.
[12] Alan Krueger. Experimental Estimates of Education Production Functions. Quarterly Journal of Economics, 115(2):497–532, 1999.
[13] Kevin Lang. Measurement Matters: Perspectives on Education Policy from an Economist and School Board Member. Journal of Economic Perspectives, 24:167–181, 2010.

[14] Frederic Lord. The ‘Ability’ Scale in Item Characteristics Curve Theory. Psychometrika, 40:205–217, 1975.
[15] Derek Neal. Why Has Black-White Skill Convergence Stopped?, volume 1, chapter 9, pages 511–576. Elsevier, Amsterdam, 2006.
[16] Eric Nielsen. The Income-Achievement Gap and Adult Outcome Inequality. PhD thesis, University of Chicago, 2014.
[17] Stephen Raudenbush. What Are Value-Added Models Estimating and What Does This Imply for Statistical Practice? Journal of Educational and Behavioral Statistics, 29(1):121–129, 2004.
[18] Sean Reardon. Thirteen Ways of Looking at the Black-White Test Score Gap. CEPA Working Paper, Stanford University, 2007.
[19] Sean Reardon. The Widening Academic Achievement Gap Between the Rich and the Poor: New Evidence and Possible Explanations, chapter 5, pages 91–116. Russell Sage Foundation, New York, July 2011.
[20] Rolf Aaberge, Tarjei Havnes, and Magne Mogstad. A Theory for Ranking Distribution Functions. IZA Discussion Papers no 7738, 2013.
[21] D. Segall. Equating the CAT-ASVAB. In Computerized Adaptive Testing: From Enquiry to Operation. American Psychological Association, 1997.
[22] D. Segall. Chapter 18: Equating the CAT-ASVAB with the P&P-ASVAB. (from) CATBOOK, Computerized Adaptive Testing: From Enquiry to Operation. Technical report, United States Army Research Institute for the Behavioral and Social Sciences, 1999.
[23] S. Stevens. On the Theory of Scales of Measurement. Science, 103:677–680, 1946.
[24] Tim Kautz, James J. Heckman, Ron Diris, Bas ter Weel, and Lex Borghans. Fostering and Measuring Skills: Improving Cognitive and Non-Cognitive Skills to Promote Lifetime Success. OECD Report, 2014.

Appendix A. Figures

Figure A.1. $W_0$ Functions Satisfying (A2)

[Figure: $W_0(s)$ plotted against $s$ on $[0,1]$, with the point $(1,1)$ marked.]

Note: Plot shows five weighting functions consistent with (A2). The red dotted curve is the identity and is the weighting function assumed when achievement gaps/changes are estimated using differences in sample means. The other curves (purple solid, green dash-dot, orange dash-dot-dot, and blue dashed) demonstrate that $W_0$ can be convex, concave, discontinuous, and nondifferentiable and still satisfy (A2).

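Figure A.2. Examples of $\Delta f$'s Satisfying (A3) and (A4)

[Figure: two panels plotting $\Delta f$ against $s$. The (A3) panel marks a single zero $s^*$; the (A4) panel marks zeros $s_1^*, \ldots, s_6^*$ on $[0,1]$.]

Note: (A3) and (A4) do not require there to be a single point between consecutive zeros at which $\frac{\partial \Delta f}{\partial s} = 0$. This condition does hold for the $\Delta f$'s drawn as red dash-dot lines but not for those drawn as dashed green lines. Furthermore, as the solid magenta curves demonstrate, neither (A3) nor (A4) requires $\Delta f$ to be 0 at $s = 0$ and $s = 1$. (A3) and (A4) also do not require that the 0's be evenly spaced on $[0,1]$, as depicted above.

Figure A.3. $W_0^+(s \mid k)$, $s^* > k$ and $1 - s^* > k$

[Figure: $W_0^+(s)$ plotted against $s$ on $[0,1]$ together with the identity $I(s)$ and $\Delta f$. Labeled segments: $W_0^+(s) = s - k$ and $W_0^+(s) = s + k$, each at distance $k$ from $I(s)$; the zero $s^*$ of $\Delta f$ is marked.]

Note: The red dotted line represents the naïve weighting function. The green curve plots $W_0^+$ when $k < \min\{s^*, 1 - s^*\}$. For values of $s$ less than $k$ or greater than $1 - k$, $W_0^+$ is flat. $W_0^+$ increases 1-1 with $s$ on the interval $[k, 1 - k]$ except for the point $s^* = 0.5$, where $W_0^+$ jumps by $2k$.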
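The shape described in the note to figure A.3 can be written down directly. The Python sketch below is only an illustration of that description: the function name `w_plus` is hypothetical, the value assigned exactly at $s = s^*$ is an arbitrary choice, and the sketch assumes $k < \min\{s^*, 1 - s^*\}$ as in the figure.

```python
import numpy as np


def w_plus(s, k, s_star):
    """Weighting function with the shape described for figure A.3.

    Below s_star it follows s - k (floored at 0, hence flat for s < k);
    from s_star on it follows s + k (capped at 1, hence flat for s > 1 - k).
    Assumes k < min(s_star, 1 - s_star). The value at s == s_star itself
    is set to the upper branch, which is an arbitrary choice in this sketch.
    """
    s = np.asarray(s, dtype=float)
    lower_branch = np.clip(s - k, 0.0, 1.0)
    upper_branch = np.clip(s + k, 0.0, 1.0)
    return np.where(s < s_star, lower_branch, upper_branch)


# Example with the values shown in the figure note: s* = 0.5, and k = 0.1.
grid = np.linspace(0.0, 1.0, 1001)
weights = w_plus(grid, k=0.1, s_star=0.5)
max_distance_from_identity = np.max(np.abs(weights - grid))
```

By construction the function never strays more than $k$ from the identity, which is the sense in which the note's distance constraint binds on both sides of $s^*$.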

Figure A.4. $W_0^-(s \mid s_c, k)$ for Three Different Values of $s_c$

[Figure: three candidate $W_0^-(s \mid s_c, k)$ curves, for $s_c = s^* - k$, $s_c = s^*$, and $s_c = s^* + k$, plotted against $s$ on $[0,1]$ together with the identity $I(s) = s$ and $\Delta f$; the zero $s^*$ of $\Delta f$ is marked.]

Note: The function in green plots $W_0^-(s \mid s^*, k)$ when $s^* - k > 0$ and $s^* + k < 1$. In this case, the constraint that $D(I, W_0^-) \le k$ binds both above and below $s^*$. The purple dashed curve shows $W_0^-(s \mid s_c, k)$ for $s_c = s^* - k$ where $k$ is such that $s_c - k < 0$. In this case, $D(I, W_0^-)$ only binds above $s^*$. Symmetrically, the teal dash-dot curve plots $W_0^-(s \mid s_c, k)$ when $s_c = s^* + k$ and $k$ is such that $D(I, W_0^-)$ only binds below $s^*$.

Figure A.5. $W_0^+$ for $N = 2$ and $N = 3$

[Figure: two panels plotting candidate $W_0^+(s \mid k, s_2^+)$ curves against $s$ on $[0,1]$ together with the identity $I(s) = s$ and $\Delta f$. The first panel ($N = 2$) marks the zeros $s_1^*$ and $s_2^*$; the second panel ($N = 3$) marks $s_1^*$, $s_2^*$, and $s_3^*$. Curves are shown for $s_2^+ = s_2^* - k$, $s_2^+ = s_2^* + k$, and $s_2^+ \in (s_2^* - k, s_2^* + k)$.]

Note: The potential $W_0^+$'s are indexed by $W_0^+(s_2^*) \equiv s_2^+$. The dashed magenta curves depict the case that $s_2^+ = s_2^* - k$ while the teal dash-dot curves assume $s_2^+ = s_2^* + k$. The solid green curves show intermediate cases where $s_2^+$ lies between these two extremes.

Figure A.6. $W_0^-$ for $N = 2$ and $N = 3$

[Figure: two panels plotting candidate $W_0^-$ curves against $s$ on $[0,1]$ together with the identity $I(s) = s$ and $\Delta f$. The first panel ($N = 2$) shows $W_0^-(s \mid k, s_1^-)$ with zeros $s_1^*$ and $s_2^*$ marked; the second panel ($N = 3$) shows $W_0^-(s \mid k, s_1^-, s_3^-)$ with zeros $s_1^*$, $s_2^*$, and $s_3^*$ marked. Curves are shown for $s_i^- = s_i^* - k$, $s_i^- = s_i^* + k$, and $s_i^- \in (s_i^* - k, s_i^* + k)$.]

Note: The potential $W_0^-$'s are indexed by $W_0^-(s_1^*) \equiv s_1^-$ and $W_0^-(s_3^*) \equiv s_3^-$ (for $N = 3$). The magenta dashed curves depict the case that $s_i^- = s_i^* - k$, $i \in \{1, 3\}$, while the teal dash-dot curves set $s_i^- = s_i^* + k$. The solid green curves show intermediate cases where both values of $s_i^-$ lie between these two extremes.

Figure A.7. Black/White Achievement $\Delta f$'s, NELS/ELS

[Figure: four panels plotting $\Delta f$ against $s \in [0,1]$: Cross-Sectional $\Delta f$, Math (NELS90 and ELS02); Gap Change $\Delta f$, Math; Cross-Sectional $\Delta f$, Reading (NELS90 and ELS02); Gap Change $\Delta f$, Reading.]

Sources: U.S. Department of Education, National Education Longitudinal Study of 1988 (NELS:88), nces.ed.gov/surveys/nels88/ and Education Longitudinal Study of 2002 (ELS:02), nces.ed.gov/surveys/els2002/

Note: Curves estimated using Epanechnikov smoothing kernels on a grid of 5,000 points. Data cleaned as described in section 6 and appendix D. The math gap-change $\Delta f$ “wiggles” around 0 for low values of $s$. These wiggles complicate the use of $\Delta f$ in the bounding analysis, so I smooth the curve again prior to estimating $W_0^+$ and $W_0^-$. This additional layer of smoothing alters the final sensitivity estimates negligibly and greatly speeds up the computation.

Figure A.8. Black/White Achievement $\Delta f$'s, NLSY79 and NLSY97

[Figure: four panels plotting $\Delta f$ (vertical axis) against $s \in [0,1]$ (horizontal axis). Panel titles: Cross-Sectional $\Delta f$, Math (NLSY79 and NLSY97); Gap Change $\Delta f$, Math; Cross-Sectional $\Delta f$, Reading (NLSY79 and NLSY97); Gap Change $\Delta f$, Reading.]

Sources: Bureau of Labor Statistics, National Longitudinal Surveys of Youth, NLSY79 and NLSY97, www.bls.gov/nls/nlsy79.htm, and www.bls.gov/nls/nlsy97.htm

Note: Curves estimated using Epanechnikov smoothing kernels on a grid of 5,000 points. Data cleaned as described in section 6 and appendix D.

Figure A.9. Black/White Achievement Gap/Change Bounds, NELS/ELS

[Figure: four panels plotting $\Delta V$ (vertical axis) against $k \in [0,1]$ (horizontal axis). Panel titles: Gaps, Math; Gap Changes, Math; Gaps, Reading; Gap Changes, Reading. Legend entries in the gap panels: positive real, negative real, and observed, for 1990 and for 2002. Legend entries in the gap-change panels: observed, positive actual, negative actual.]

Sources: U.S. Department of Education, National Education Longitudinal Study of 1988 (NELS:88), nces.ed.gov/surveys/nels88/ and Education Longitudinal Study of 2002 (ELS:02), nces.ed.gov/surveys/els2002/

Note: Curves estimated using $\Delta f$'s calculated on a grid of 5,000 evenly spaced points and 50 evenly spaced values of $k$. The left-hand panels show the cross-sectional gaps for the NELS and ELS calculated such that the differences in the observed curves (perfectly horizontal) equal the observed gap changes in the right-hand panels. Data cleaned as described in section 6 and appendix D.

Figure A.10. Black/White Achievement Gap/Change Bounds, NLSY79 and NLSY97

[Figure: four panels plotting $\Delta V$ (vertical axis) against $k \in [0,1]$ (horizontal axis). Panel titles: Gaps, Math; Gap Changes, Math; Gaps, Reading; Gap Changes, Reading. Legend entries in the gap panels: positive real, negative real, and observed, for 1979 and for 1997. Legend entries in the gap-change panels: observed, positive actual, negative actual.]

Sources: Bureau of Labor Statistics, National Longitudinal Surveys of Youth, NLSY79 and NLSY97, www.bls.gov/nls/nlsy79.htm, and www.bls.gov/nls/nlsy97.htm

Note: Curves estimated using $\Delta f$'s calculated on a grid of 5,000 evenly spaced points and 50 evenly spaced values of $k$. The left-hand panels show the cross-sectional gaps for the NLSY79 and NLSY97 calculated such that the differences in the observed curves (perfectly horizontal) equal the observed gap changes in the right-hand panels. Data cleaned as described in section 6 and appendix D.

Figure A.11. Black/White Achievement Gap/Change Bounds Using Z-Scores, NELS/ELS

[Figure: four panels plotting $\Delta V$ (vertical axis) against $k \in [0,1]$ (horizontal axis). Panel titles: Gaps, Math; Gap Changes, Math; Gaps, Reading; Gap Changes, Reading. Legend entries in the gap panels: positive real, negative real, and observed, for 1990 and for 2002. Legend entries in the gap-change panels: observed, positive actual, negative actual.]

Sources: U.S. Department of Education, National Education Longitudinal Study of 1988 (NELS:88), nces.ed.gov/surveys/nels88/ and Education Longitudinal Study of 2002 (ELS:02), nces.ed.gov/surveys/els2002/

Note: Curves estimated using $\Delta f$'s calculated on a grid of 5,000 evenly spaced points and 50 evenly spaced values of $k$. The left-hand panels show the cross-sectional gaps for the NELS and ELS calculated such that the differences in the observed curves (perfectly horizontal) equal the observed gap changes in the right-hand panels. Data cleaned as described in section 6 and appendix D.

Figure A.12. High-/Low-Income Achievement $\Delta f$'s, NELS/ELS

[Figure: four panels plotting $\Delta f$ (vertical axis) against $s \in [0,1]$ (horizontal axis). Panel titles: Cross-Sectional $\Delta f$, Math (NELS90 and ELS02); Gap Change $\Delta f$, Math; Cross-Sectional $\Delta f$, Reading (NELS90 and ELS02); Gap Change $\Delta f$, Reading.]

Sources: U.S. Department of Education, National Education Longitudinal Study of 1988 (NELS:88), nces.ed.gov/surveys/nels88/ and Education Longitudinal Study of 2002 (ELS:02), nces.ed.gov/surveys/els2002/

Note: Curves estimated using Epanechnikov smoothing kernels on a grid of 5,000 points. Data cleaned as described in section 6 and appendix D.

Figure A.13. High-/Low-Income Achievement Gap/Change Bounds, NELS/ELS

[Figure: four panels plotting $\Delta V$ (vertical axis) against $k$ (horizontal axis; $k \in [0,0.5]$ in the math panels and $k \in [0,1]$ in the reading panels). Panel titles: Gaps, Math; Gap Changes, Math; Gaps, Reading; Gap Changes, Reading. Legend entries in the gap panels: positive real, negative real, and observed, for 1990 and for 2002. Legend entries in the gap-change panels: observed, positive actual, negative actual.]

Sources: U.S. Department of Education, National Education Longitudinal Study of 1988 (NELS:88), nces.ed.gov/surveys/nels88/ and Education Longitudinal Study of 2002 (ELS:02), nces.ed.gov/surveys/els2002/

Note: Curves estimated using $\Delta f$'s calculated on a grid of 5,000 evenly spaced points and 50 evenly spaced values of $k$. The left-hand panels show the cross-sectional gaps for the NELS and ELS calculated such that the differences in the observed curves (perfectly horizontal) equal the observed gap changes in the right-hand panels. Data cleaned as described in section 6 and appendix D.

Figure A.14. High-/Low-Income Achievement Gap/Change Bounds Using Z-Scores, NELS/ELS

[Figure: four panels plotting $\Delta V$ (vertical axis) against $k$ (horizontal axis; $k \in [0,0.5]$ in the math panels and $k \in [0,1]$ in the reading panels). Panel titles: Gaps, Math; Gap Changes, Math; Gaps, Reading; Gap Changes, Reading. Legend entries in the gap panels: positive real, negative real, and observed, for 1990 and for 2002. Legend entries in the gap-change panels: observed, positive actual, negative actual.]

Sources: U.S. Department of Education, National Education Longitudinal Study of 1988 (NELS:88), nces.ed.gov/surveys/nels88/ and Education Longitudinal Study of 2002 (ELS:02), nces.ed.gov/surveys/els2002/

Note: Curves estimated using $\Delta f$'s calculated on a grid of 5,000 evenly spaced points and 500 evenly spaced values of $k$. Left-hand panels show the cross-sectional gaps for the NELS and ELS calculated such that the differences in the observed curves (perfectly horizontal) equal the observed gap changes in the right-hand panels. Data cleaned as described in section 6 and appendix D.

Figure A.15. Large- and Small-$k$ $W_0^+$ and $W_0^-$, High-Low Income Math $\Delta f$, NELS/ELS (A3)

[Figure: two panels plotting $W_0$ (vertical axis) against $s \in [0,1]$ (horizontal axis), each showing $W_0^-$ and $W_0^+$. Panel titles: $W_0^+$ and $W_0^-$, $k=0.10$; $W_0^+$ and $W_0^-$, $k=0.4$.]

Sources: U.S. Department of Education, National Education Longitudinal Study of 1988 (NELS:88), nces.ed.gov/surveys/nels88/ and Education Longitudinal Study of 2002 (ELS:02), nces.ed.gov/surveys/els2002/

Note: Curves estimated using $\Delta f$'s calculated on a grid of 5,000 evenly spaced points. Data cleaned as described in section 6 and appendix D.

Figure A.16. Large- and Small-$k$ $W_0^+$ and $W_0^-$, Symmetric $\Delta f$ Satisfying (A3)

[Figure: one panel plotting $W_0$ (vertical axis) against $s \in [0,1]$ (horizontal axis). Panel title: $W_0^+$ and $W_0^-$, $k=0.4$ or $k=0.04$. Curves: $W_0^-(s \mid 0.04)$, $W_0^-(s \mid 0.40)$, $W_0^+(s \mid 0.04)$, $W_0^+(s \mid 0.40)$.]

Note: Curves estimated using $\Delta f$'s calculated on a grid of 5,000 evenly spaced points.

Figure A.17. Ordinal vs. Cardinal Power Using $I(s)$ and $W_0^-(s \mid k=0.1)$

[Figure: two panels plotting power (vertical axis, 0 to 1). Panel titles: Power When $\Delta\mu = 0.25$ (horizontal axis $N$, 0 to 500); Power When $N=500$ (horizontal axis $\Delta\mu$, 0 to 0.25). Legend entries: cardinal (z) power, ordinal power, cardinal $W_0^-$ power, ordinal $W_0^-$ power.]

Note: Plot shows cross-sectional power for z-tests and BD tests where the raw data are drawn from $F_A \sim N(0.25,1)$ and $F_B \sim N(0,1)$. The solid curve and dashed curve show the relationship between sample size and power for z-tests and BD tests when raw test scores are used. The circle line and triangle line show the corresponding powers when $W_0^-(s \mid 0.2)$ is used instead. The power of the BD testing approach falls very rapidly as $k$ increases. However, if $W_0^-(s \mid k=0.2, \varepsilon=0.0001)$ is used instead, the power of the BD test is essentially unchanged from the raw data case, while the power of the z-tests using the correctly weighted data is essentially unchanged from the $W_0^-(s \mid 0.2)$ case.

Figure A.18. Ordinal vs. Cardinal Power Using $W_0^-(s \mid k)$ When $N=200$

[Figure: one panel plotting power (vertical axis, 0.7 to 1) against $k$ (horizontal axis, 0 to 0.25). Legend entries: cardinal (z) power, cardinal (z) power $W_0^-$, ordinal power.]

Note: Plot shows cross-sectional power for z-tests and BD tests where the raw data are drawn from $F_A \sim N(0.2,1)$ and $F_B \sim N(0,1)$. The red dashed curve shows the estimated power of the BD tests applied to the raw test scores, while the blue dotted curve shows the power of the z-test using test scores weighted according to $W_0^-(s \mid k)$.
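The cardinal half of these power comparisons can be approximated by simulation. The sketch below estimates the power of a two-sided, two-sample z-test at the 5% level when scores are drawn from $N(\Delta\mu,1)$ and $N(0,1)$, as in the notes to figures A.17 and A.18. It assumes that $N$ is the sample size per group and that any rescaling is supplied as a hypothetical monotone function `weight` applied to the raw draws. The ordinal BD tests and the mapping from raw scores to $W_0^-$ are not implemented here.

```python
import math
import random


def z_test_rejects(a, b, crit=1.959963984540054):
    """Two-sided two-sample z-test of equal means at the 5% level."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)
    return abs(z) > crit


def simulated_power(n, delta_mu, reps=1000, weight=lambda x: x, seed=0):
    """Share of simulated samples in which the z-test rejects equal means.

    n        -- observations per group (an assumption about the figures' N)
    delta_mu -- true mean difference between groups A and B
    weight   -- hypothetical rescaling applied to every raw score
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        a = [weight(rng.gauss(delta_mu, 1.0)) for _ in range(n)]
        b = [weight(rng.gauss(0.0, 1.0)) for _ in range(n)]
        rejections += z_test_rejects(a, b)
    return rejections / reps
```

Calling `simulated_power(500, 0.25)` gives the raw-score power under these assumptions, and `simulated_power(500, 0.0)` approximates the size of the test. Passing a different `weight` shows how a rescaling with flat regions changes the cardinal test's power.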

Appendix B. Tables

Table 1. NLSY79 and NLSY97 Summary Statistics

| Variable | Survey | N | Mean | Median | S.D. |
|---|---|---|---|---|---|
| math | NLSY79 | 3,277 | 96.77 | 95 | 18.23 |
| math | NLSY97 | 2,833 | 98.74 | 99 | 18.82 |
| reading | NLSY79 | 3,277 | 94.19 | 98 | 19.32 |
| reading | NLSY97 | 2,833 | 93.41 | 98 | 20.39 |
| AFQT | NLSY79 | 3,277 | 142.57 | 146 | 26.94 |
| AFQT | NLSY97 | 2,833 | 142.88 | 147.4 | 28.11 |
| income | NLSY79 | 3,388 | $44,000 | $39,800 | $28,700 |
| income | NLSY97 | 3,570 | $54,700 | $43,100 | $49,500 |
| age | NLSY79 | 3,388 | 16.08 | 16 | 0.78 |
| age | NLSY97 | 3,570 | 15.76 | 16 | 0.72 |
| black | NLSY79 | 3,388 | 0.14 | 0 | 0.35 |
| black | NLSY97 | 3,570 | 0.15 | 0 | 0.36 |

Sources: Bureau of Labor Statistics, National Longitudinal Surveys of Youth, NLSY79 and NLSY97, www.bls.gov/nls/nlsy79.htm, and www.bls.gov/nls/nlsy97.htm

Note: Respondent ages are restricted to 15-17 as of ASVAB test date. All dollars have been converted to a 1997 basis using the CPI-U. The N shown for a variable is the sample size used in calculations involving that variable. Data cleaned as described in section 6 and appendix D.

Table 2. NELS/ELS Summary Statistics

| Variable | Survey | Wave | N | Mean | Median | S.D. | Missing | Imputed |
|---|---|---|---|---|---|---|---|---|
| math | NELS | 1990 | 14,410 | 44.03 | 44.31 | 13.57 | 777 | 0 |
| math | NELS | 1992 | 12,008 | 49.00 | 49.53 | 14.07 | 2,138 | 0 |
| reading | NELS | 1990 | 14,427 | 30.93 | 31.38 | 9.91 | 760 | 0 |
| reading | NELS | 1992 | 11,999 | 33.33 | 34.68 | 10.01 | 2,147 | 0 |
| age | NELS | 1990 | 15,187 | 16.13 | 16 | 0.68 | 0 | 0 |
| age | NELS | 1992 | 14,146 | 18.14 | 18 | 0.62 | 0 | 0 |
| black | NELS | 1990 | 15,187 | 0.12 | 0 | 0.32 | 0 | 0 |
| black | NELS | 1992 | 14,146 | 0.11 | 0 | 0.32 | 0 | 0 |
| female | NELS | 1990 | 15,187 | 0.51 | 1 | 0.50 | 0 | 0 |
| female | NELS | 1992 | 14,146 | 0.50 | 1 | 0.50 | 0 | 0 |
| math | ELS | 2002 | 14,934 | 44.62 | 44.79 | 13.57 | 0 | 800 |
| math | ELS | 2004 | 13,444 | 50.22 | 51.38 | 14.13 | 1,148 | 0 |
| reading | ELS | 2002 | 14,934 | 29.29 | 29.65 | 9.44 | 0 | 933 |
| reading | ELS | 2004 | NA | NA | NA | NA | NA | NA |
| age | ELS | 2002 | 14,934 | 15.67 | 16 | 0.61 | 0 | 0 |
| age | ELS | 2004 | 14,592 | 17.70 | 18 | 0.61 | 0 | 0 |
| black | ELS | 2002 | 14,592 | 0.14 | 0 | 0.35 | 0 | 0 |
| black | ELS | 2004 | 14,934 | 0.14 | 0 | 0.35 | 0 | 0 |
| female | ELS | 2002 | 14,934 | 0.50 | 0 | 0.50 | 0 | 7 |
| female | ELS | 2004 | 14,592 | 0.50 | 0 | 0.50 | 0 | 5 |

Sources: U.S. Department of Education, National Education Longitudinal Study of 1988 (NELS:88), nces.ed.gov/surveys/nels88/ and Education Longitudinal Study of 2002 (ELS:02), nces.ed.gov/surveys/els2002/

Note: Statistics shown for the NELS first-year follow-up (1990) and the ELS base year (2002). Respondent ages restricted to 15-17 as of survey date. Averages shown for non-missing, non-imputed observations using cross-sectional weights. NELS 1990 sample includes "freshened" observations. Data cleaned as described in section 6 and appendix D.

Table 3. NELS/ELS Income Variables

| NELS Income | Percentage, Full Sample | Percentage, Analysis Sample |
|---|---|---|
| none | 0.26 | 0.27 |
| less than $1,000 | 0.49 | 0.48 |
| $1,000-$2,999 | 1.07 | 1.13 |
| $3,000-$4,999 | 1.57 | 1.60 |
| $5,000-$7,499 | 2.68 | 2.82 |
| $7,500-$9,999 | 3.13 | 3.10 |
| $10,000-$14,999 | 7.26 | 7.48 |
| $15,000-$19,999 | 7.08 | 7.21 |
| $20,000-$24,999 | 10.17 | 10.44 |
| $25,000-$34,999 | 19.34 | 19.18 |
| $35,000-$49,999 | 21.98 | 21.59 |
| $50,000-$74,999 | 16.41 | 16.30 |
| $75,000-$99,999 | 4.07 | 4.03 |
| $100,000-$199,999 | 3.21 | 3.16 |
| $200,000 or more | 1.26 | 1.21 |

| ELS Income | Percentage, Full Sample | Percentage, Analysis Sample |
|---|---|---|
| none | 0.45 | 0.43 |
| less than $1,000 | 1.09 | 1.14 |
| $1,001-$5,000 | 1.73 | 1.78 |
| $5,001-$10,000 | 2.12 | 2.08 |
| $10,001-$15,000 | 4.22 | 4.27 |
| $15,001-$20,000 | 4.87 | 4.95 |
| $20,001-$25,000 | 6.53 | 6.47 |
| $25,001-$35,000 | 12.21 | 12.40 |
| $35,001-$50,000 | 19.69 | 19.65 |
| $50,001-$75,000 | 21.03 | 20.81 |
| $75,001-$100,000 | 13.14 | 13.09 |
| $100,001-$200,000 | 10.20 | 10.19 |
| $200,001 or more | 2.74 | 2.75 |

Sources: U.S. Department of Education, National Education Longitudinal Study of 1988 (NELS:88), nces.ed.gov/surveys/nels88/ and Education Longitudinal Study of 2002 (ELS:02), nces.ed.gov/surveys/els2002/

Note: Dollar ranges shown in survey-specific base-year real dollars (1988 for the NELS and 2002 for the ELS). The full sample columns show the cross-sectionally weighted percentages for the full range of ages in each survey base year. The analysis sample columns show the percentages of youth in the final sample used to construct the various $\Delta f$'s. Data cleaned as described in section 6 and appendix D.

Table 4. Cross-Sectional $k^*$'s

NELS/ELS

| Subject | Year | Comparison | $k^*$ | Crosses? |
|---|---|---|---|---|
| math | 1990 | black/white | 0.33 | Yes |
| math | 2002 | black/white | – | No |
| reading | 1990 | black/white | 0.32 | Yes |
| reading | 2002 | black/white | – | No |
| math | 1990 | income | – | No |
| math | 2002 | income | – | No |
| reading | 1990 | income | 0.38 | Yes |
| reading | 2002 | income | – | No |

NLSY

| Subject | Year | Comparison | $k^*$ | Crosses? |
|---|---|---|---|---|
| math | 1979 | black/white | – | No |
| math | 1997 | black/white | – | No |
| reading | 1979 | black/white | 0.35 | Yes |
| reading | 1997 | black/white | 0.40 | Yes |
| math | 1979 | income | 0.11 | Yes |
| math | 1997 | income | 0.33 | Yes |
| reading | 1979 | income | 0.13 | Yes |
| reading | 1997 | income | 0.20 | Yes |

Sources: Bureau of Labor Statistics, National Longitudinal Surveys of Youth, NLSY79 and NLSY97, www.bls.gov/nls/nlsy79.htm, and www.bls.gov/nls/nlsy97.htm; U.S. Department of Education, National Education Longitudinal Study of 1988 (NELS:88), nces.ed.gov/surveys/nels88/ and Education Longitudinal Study of 2002 (ELS:02), nces.ed.gov/surveys/els2002/

Note: $k^*$'s estimated using $\Delta f$'s calculated on an evenly spaced test-score grid of 5,000 points and $k$-grid of 1,000 points. Data cleaned as described in section 6 and appendix D.

Table 5. Gap-Change $k^*$'s

| Survey | Subject | Comparison | $k^*$ | Crosses? |
|---|---|---|---|---|
| NELS/ELS | math | black/white | 0.29 | Yes |
| NELS/ELS | reading | black/white | 0.28 | Yes |
| NELS/ELS | math | income | 0.08 | Yes |
| NELS/ELS | reading | income | 0.04 | Yes |
| NLSY79/97 | math | black/white | 0.11 | Yes |
| NLSY79/97 | reading | black/white | 0.12 | Yes |
| NLSY79/97 | math | income | 0.27 | Yes |
| NLSY79/97 | reading | income | 0.05 | Yes |

Sources: Bureau of Labor Statistics, National Longitudinal Surveys of Youth, NLSY79 and NLSY97, www.bls.gov/nls/nlsy79.htm, and www.bls.gov/nls/nlsy97.htm; U.S. Department of Education, National Education Longitudinal Study of 1988 (NELS:88), nces.ed.gov/surveys/nels88/ and Education Longitudinal Study of 2002 (ELS:02), nces.ed.gov/surveys/els2002/

Note: $k^*$'s estimated using $\Delta f$'s calculated on an evenly spaced test-score grid of 5,000 points and $k$-grid of 1,000 points. Data cleaned as described in section 6 and appendix D.

Appendix C. Proofs and Additional Theorems

For notational simplicity, define
$$B^+(W,x,y) \equiv \int_x^y \big(W(s)-s\big)\,\Delta f(s)\,ds \quad\text{and}\quad B^-(W,x,y) \equiv \int_x^y \big(s-W(s)\big)\,\Delta f(s)\,ds.$$

C.1. Proofs of the Main Theorems.

Proof. (theorem 4.4 and theorem 4.1) Let $\mathcal{W}_k^+$ denote the set of weighting functions satisfying (A2) and $D(I,W) \le k$ that have the form given in equation (4.5). Further, let $\mathcal{M}_k^+$ denote the set of weighting functions satisfying (A2) and $D \le k$ that differ from any $W_0^+ \in \mathcal{W}_k^+$ on at least one interval with positive measure. Suppose $\exists\, \tilde{W}_0 \in \mathcal{M}_k^+$ such that $\mathcal{B}^+(\tilde{W}_0) > \mathcal{B}^+(W_0^+)$ for all $W_0^+ \in \mathcal{W}_k^+$. There are two cases to consider: $N$ even and $N$ odd.

Suppose first that $N$ is even. Let $\{\tilde{s}_2, \tilde{s}_4, \ldots, \tilde{s}_N\}$ be the points satisfying $\tilde{W}_0(s_i^*) = \tilde{s}_i$ for even values of $i$. Consider $W_0^+(s \mid k, \tilde{s}_2, \tilde{s}_4, \ldots, \tilde{s}_N) \equiv \tilde{W}_0^+$. I claim that $\mathcal{B}^+(\tilde{W}_0^+) > \mathcal{B}^+(\tilde{W}_0)$. To see that this inequality follows, suppose that $\tilde{W}_0$ deviates somewhere on $[s_{i-1}^*, s_{i+1}^*]$ for $i$ even. Such a deviation implies that $\tilde{W}_0(s) \le \tilde{W}_0^+(s)$ on $[s_{i-1}^*, s_i^*]$ and $\tilde{W}_0(s) \ge \tilde{W}_0^+(s)$ on $[s_i^*, s_{i+1}^*]$ with at least one of these inequalities strict. Therefore, $B^+(\tilde{W}_0, s_{i-1}^*, s_{i+1}^*) < B^+(\tilde{W}_0^+, s_{i-1}^*, s_{i+1}^*)$, which implies that $\tilde{W}_0^+$ dominates $\tilde{W}_0$ on any interval other than $[0, s_1^*]$ on which $\tilde{W}_0$ does not correspond to some $W_0^+ \in \mathcal{W}_k^+$. To finish, consider $[0, s_1^*]$. Note that all $W_0^+ \in \mathcal{W}_k^+$ are identical on $[0, s_1^*]$, so if $\tilde{W}_0$ deviates on this interval it must be that $\tilde{W}_0(s) \ne \max\{s-k, 0\}$ on some $[s_L, s_H] \subseteq [0, s_1^*]$. Because all functions satisfying (A2) and $D(I,W) \le k$ are bounded from below by the maximum of 0 and $s-k$, $\tilde{W}_0(s) > W_0^+(s)$ for any $W_0^+ \in \mathcal{W}_k^+$ on $[s_L, s_H]$, which implies that $B^+(\tilde{W}_0, 0, s_1^*) < B^+(W_0^+, 0, s_1^*)$ for all $W_0^+ \in \mathcal{W}_k^+$, a contradiction.

Now consider the case that $N$ is odd and construct $\tilde{W}_0^+$ as before. The argument that $\tilde{W}_0^+$ dominates $\tilde{W}_0$ on $[0, s_{N-1}^*]$ is exactly analogous to the domination argument for $N$ even on $[0,1]$. $N$ being odd implies that $\Delta f > 0$ on $(s_N^*, 1)$. Note that all $W_0^+ \in \mathcal{W}_k^+$ are identical on $[s_N^*, 1]$, so if $\tilde{W}_0$ deviates on this interval it must be that $\tilde{W}_0(s) \ne \min\{s+k, 1\}$ on some $[s_L, s_H] \subseteq [s_N^*, 1]$. Because all functions satisfying (A2) and $D(I,W) \le k$ are bounded by the minimum of 1 and $s+k$, $\tilde{W}_0(s) < W_0^+(s)$ for any $W_0^+ \in \mathcal{W}_k^+$ on $[s_L, s_H]$, which implies that $B^+(\tilde{W}_0, s_N^*, 1) < B^+(W_0^+, s_N^*, 1)$ for all $W_0^+ \in \mathcal{W}_k^+$, a contradiction. $\square$

Proof. (theorem 4.5 and theorem 4.2) Let $\mathcal{W}_k^-$ denote the set of weighting functions satisfying (A2) and $D(I,W) \le k$ that can be written as in equation (4.6). Further, let $\mathcal{M}_k^-$ denote the set of weighting functions satisfying (A2) and $D \le k$ that differ from any $W_0^- \in \mathcal{W}_k^-$ on at least one interval with positive measure. Suppose $\exists\, \tilde{W}_0 \in \mathcal{M}_k^-$ such that $\mathcal{B}^-(\tilde{W}_0) > \mathcal{B}^-(W_0^-)$ for all $W_0^- \in \mathcal{W}_k^-$. There are two cases to consider: $N$ even and $N$ odd.

Suppose first that $N$ is odd. Let $\{\tilde{s}_1, \tilde{s}_3, \ldots, \tilde{s}_N\}$ be the points satisfying $\tilde{W}_0(s_i^*) = \tilde{s}_i$ for $i$ odd. Consider $W_0^-(s \mid k, \tilde{s}_1, \tilde{s}_3, \ldots, \tilde{s}_N) \equiv \tilde{W}_0^-$. I claim that $\mathcal{B}^-(\tilde{W}_0^-) > \mathcal{B}^-(\tilde{W}_0)$. To see this, suppose that $\tilde{W}_0$ deviates somewhere on $[s_{i-1}^*, s_{i+1}^*]$ for some odd $i$. Because $\tilde{W}_0$ and $\tilde{W}_0^-$ agree at $s_i^*$ and both satisfy (A2) and the distance restriction, this implies that $\tilde{W}_0(s) \le \tilde{W}_0^-(s)$ on $[s_{i-1}^*, s_i^*]$ and $\tilde{W}_0(s) \ge \tilde{W}_0^-(s)$ on $[s_i^*, s_{i+1}^*]$ with at least one of these inequalities strict. Therefore, $B^-(\tilde{W}_0, s_{i-1}^*, s_{i+1}^*) < B^-(\tilde{W}_0^-, s_{i-1}^*, s_{i+1}^*)$, implying that $\tilde{W}_0^-$ dominates $\tilde{W}_0$ on any interval such that $\tilde{W}_0$ does not correspond to some $W_0^- \in \mathcal{W}_k^-$, a contradiction.

Now consider the case that $N$ is even and construct $\tilde{W}_0^-$ as before. The argument that $\tilde{W}_0^-$ dominates $\tilde{W}_0$ on $[0, s_{N-1}^*]$ is exactly analogous to the domination argument for $N$ odd on $[0,1]$. $N$ being even implies that $\Delta f < 0$ on $(s_N^*, 1)$. Note that all $W_0^- \in \mathcal{W}_k^-$ are identical on $[s_N^*, 1]$, so if $\tilde{W}_0$ deviates on this interval it must be that $\tilde{W}_0(s) \ne \min\{s+k, 1\}$ on some $[s_L, s_H] \subseteq [s_N^*, 1]$. Because all functions satisfying (A2) and $D(I,W) \le k$ are bounded by the minimum of 1 and $s+k$, $\tilde{W}_0(s) < W_0^-(s)$ for any $W_0^- \in \mathcal{W}_k^-$ on $[s_L, s_H]$, which implies that $B^-(\tilde{W}_0, s_N^*, 1) < B^-(W_0^-, s_N^*, 1)$ for all $W_0^- \in \mathcal{W}_k^-$, a contradiction. $\square$

Proof. (theorem 4.3) Consider $\partial \mathcal{B}^+ / \partial k$. Suppose that $k < \min\{s^*, 1-s^*\}$ so that $W_0^+$ has the form given in equation 4.2. In this case, $\mathcal{B}^+$ may be written as
$$\mathcal{B}^+ = -\int_0^k s\,\Delta f(s)\,ds - \int_k^{s^*} k\,\Delta f(s)\,ds + \int_{s^*}^{1-k} k\,\Delta f(s)\,ds + \int_{1-k}^1 (1-s)\,\Delta f(s)\,ds.$$
Differentiating each of these integrals with respect to $k$ yields
$$\frac{\partial \mathcal{B}^+}{\partial k} = \int_{s^*}^{1-k} \Delta f(s)\,ds - \int_k^{s^*} \Delta f(s)\,ds.$$
Now consider $\partial \mathcal{B}^- / \partial k$ if $s_c > k$ and $s_c + k < 1$. In this case, $\mathcal{B}^-$ may be written as
$$\mathcal{B}^- = -\int_0^{s_c-k} k\,\Delta f(s)\,ds + \int_{s_c-k}^{s_c+k} (s-s_c)\,\Delta f(s)\,ds + \int_{s_c+k}^1 k\,\Delta f(s)\,ds.$$
Taking the derivative while noting that $s_c$ depends on $k$ yields
$$\frac{\partial \mathcal{B}^-}{\partial k} = \int_{s_c+k}^1 \Delta f(s)\,ds - \int_0^{s_c-k} \Delta f(s)\,ds - \int_{s_c-k}^{s_c+k} \frac{\partial s_c}{\partial k}\,\Delta f(s)\,ds. \qquad \square$$

C.2. Bounding Analysis Using Slope Restrictions.

This section derives worst-case bounds for the bias associated with using the observed test scale when $W_0$ is required to be strictly increasing. Very little of importance changes for the bounding analysis if the derivative of the true scale must be bounded away from 0. The functional forms of $W_0^+$ and $W_0^-$ under this new restriction are very slight modifications of their unconstrained counterparts. Furthermore, as the minimum allowable rate of change in $W_0$ declines to 0, these worst-case functions converge smoothly to those defined in section 4. This implies that $\mathcal{B}^+$ and $\mathcal{B}^-$ also converge smoothly to the values derived in the main body of the paper. Thus, for very small $\varepsilon$, the unconstrained biases will be approximately correct and yet the full ordinal information of the observed test scale will be preserved in the worst-case weighting functions.

Definition C.1. $W_0$ satisfies (A5) for $1 > \varepsilon > 0$ iff the following hold:
(i) $\frac{dW_0}{ds}$ exists everywhere on $[0,1]$ except at a finite number of points. Let $\mathcal{S}$ be the points in $[0,1]$ such that $\frac{dW_0}{ds}$ is not defined.
(ii) $\frac{dW_0}{ds} \ge \varepsilon$ for all $s \in [0,1] \setminus \mathcal{S}$.

Definition C.2. Let $\mathcal{W}_\varepsilon$ be the set of functions on $[0,1]$ that satisfy (A2) and (A5). Suppose that all component test-score distributions in $\Delta f$ satisfy (A1). The worst-case $W_0$'s satisfying (A2), (A5), and $D(I,W) \le k$ for a given distance restriction $k$ are defined by
$$W_0^+(s \mid k, \Delta f, \varepsilon) \equiv \arg\max_{W \in \mathcal{W}_\varepsilon \,\wedge\, D(I,W) \le k} \mathcal{B}^+(I, W, \Delta f),$$
$$W_0^-(s \mid k, \Delta f, \varepsilon) \equiv \arg\max_{W \in \mathcal{W}_\varepsilon \,\wedge\, D(I,W) \le k} \mathcal{B}^-(I, W, \Delta f).$$

Theorem C.3. Suppose that $\Delta f$ satisfies (A1) and (A3). Then there exists $s_c \in [s^*-k, s^*+k]$ such that
$$W_0^+(s \mid k, \Delta f, \varepsilon) = \begin{cases} \max\{s-k,\ \varepsilon s\}, & s \in [0, s^*) \\ \min\{s+k,\ \varepsilon s + (1-\varepsilon)\}, & s \in [s^*, 1] \end{cases}$$
$$W_0^-(s \mid k, \Delta f, \varepsilon) = \begin{cases} \min\{\varepsilon s + (s_c - \varepsilon s^*),\ s+k\}, & s \in [0, s^*) \\ \max\{\varepsilon s + (s_c - \varepsilon s^*),\ s-k\}, & s \in [s^*, 1] \end{cases}$$

Proof. The proofs for $W_0^+(s \mid k, \Delta f, \varepsilon)$ and $W_0^-(s \mid k, \Delta f, \varepsilon)$ are trivial modifications of the proofs of theorems 4.1 and 4.2, respectively. $\square$

Corollary C.4. Suppose that $\Delta f$ satisfies (A1) and (A3). Then, $\lim_{\varepsilon \downarrow 0} W_0^+(s \mid k, \Delta f, \varepsilon) = W_0^+(s \mid k, \Delta f)$ and $\lim_{\varepsilon \downarrow 0} W_0^-(s \mid k, \Delta f, \varepsilon) = W_0^-(s \mid k, \Delta f)$.

Theorem C.3 and corollary C.4 only derive $W_0^+$ and $W_0^-$ under (A3). The analysis is similar but more cumbersome for (A4) and is omitted for brevity. Figure C.1 below plots $W_0^-(s \mid k, \varepsilon)$ and $W_0^+(s \mid k, \varepsilon)$ in the case that $k$ is small enough that the distance restriction bites both above and below $s^*$. These weighting functions are exactly analogous to their non-slope-constrained counterparts except that the regions on the unconstrained curves which had slope 0 now have slope $\varepsilon$. This modification also implies that the kink points are slightly farther from $s^*$ compared to the unconstrained case with the same value of $s_c$.
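The closed forms in theorem C.3 are simple enough to evaluate directly. The following sketch codes them as plain functions and approximates $\mathcal{B}^+ = \int_0^1 (W(s)-s)\,\Delta f(s)\,ds$ with a trapezoid rule. The function names, the grid size, and the example $\Delta f(s) = 4s-2$ are hypothetical. The example corresponds to densities $2s$ and $2(1-s)$, which cross once at $s^* = 0.5$. The value of $s_c$ is taken as given rather than solved for.

```python
def w_plus(s, k, s_star, eps=0.0):
    """W_0^+(s | k, eps) from theorem C.3; eps = 0 gives the unconstrained form."""
    if s < s_star:
        return max(s - k, eps * s)
    return min(s + k, eps * s + (1.0 - eps))


def w_minus(s, k, s_star, s_c, eps=0.0):
    """W_0^-(s | k, eps) from theorem C.3 for a given s_c in [s* - k, s* + k]."""
    level = eps * s + (s_c - eps * s_star)
    if s < s_star:
        return min(level, s + k)
    return max(level, s - k)


def bias_plus(delta_f, k, s_star, eps=0.0, n=20000):
    """Trapezoid approximation of B^+ = integral of (W_0^+(s) - s) * delta_f(s)."""
    h = 1.0 / n
    total = 0.0
    for i in range(n + 1):
        s = i * h
        end_weight = 0.5 if i in (0, n) else 1.0
        total += end_weight * (w_plus(s, k, s_star, eps) - s) * delta_f(s)
    return total * h


def example_delta_f(s):
    """Hypothetical single-crossing density difference: 2s - 2(1 - s)."""
    return 4.0 * s - 2.0


unconstrained = bias_plus(example_delta_f, k=0.1, s_star=0.5)
constrained = bias_plus(example_delta_f, k=0.1, s_star=0.5, eps=1e-4)
```

For this hypothetical $\Delta f$ and $k < 0.5$, the derivative formula from the proof of theorem 4.3 evaluates to $\partial \mathcal{B}^+ / \partial k = (1-2k)^2$, so $\mathcal{B}^+(k) = [1-(1-2k)^3]/6$, which is about 0.0813 at $k = 0.1$. The numerical value in `unconstrained` should match this closely. The value in `constrained` should differ only negligibly, in line with corollary C.4.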

Figure C.1. $W_0^-(s \mid k, \varepsilon)$ and $W_0^+(s \mid k, \varepsilon)$, $\varepsilon > 0$

[Figure: two panels plotting the slope-constrained worst-case weighting functions against $s \in [0,1]$, together with the identity $I(s)=s$ and $\Delta f$ with its crossing point $s^*$. The upper panel shows the two branches of $W_0^-$ around $s_c$; the lower panel shows the two branches of $W_0^+$. The distance $k$ from $I(s)$ is marked in both panels.]

Appendix D. Data

The NELS first surveyed a nationally representative sample of eighth graders in the spring of 1988, with follow-up surveys in 1990, 1992, 1994, and 2000. I make use of the 1990 wave in order to keep the comparison groups consistent with my prior work on the income-achievement gap. The 1990 NELS wave consists mostly of 10th graders who were between the ages of 15 and 17 at the survey date. The ELS first surveyed a nationally representative sample of 10th graders in 2002, so all of my calculations compare this initial ELS wave to the first follow-up wave in the NELS.

Both the NELS and ELS contain data on household income, demographics, and achievement. Respondents in both surveys took comparable achievement tests in each survey wave. These tests covered similar content and followed a similar stratified design. Both assessments included some items in common, and both surveys report three-parameter logistic item response theory (IRT) scores in the 1988 base-year scale estimated using these items. If the IRT model is correctly specified, these base-year scale scores should be ordinally comparable between the two surveys. That is, if student $i$ has a higher score than student $j$, then student $i$ should have higher underlying achievement regardless of whether $i$ and $j$ were drawn from the same or different surveys.

The initial waves of the NELS and ELS collected data on household income. Unfortunately, these data are categorical, significantly complicating the construction of directly comparable income groups from both surveys. I discuss the various ways of attacking this problem in my other working papers. For this paper, these details are relatively unimportant, and I simply use one plausible definition out of many for "high-income" and "low-income." I define high-income youth as those from the top 20% of the household income distribution and low-income youth as those from the bottom 20%. I approximate these quintiles by selecting the range of income buckets such that the mass of the selected buckets is as close as possible to 0.2.⁴⁴ Unlike the NELS, the ELS imputes test scores, family income, and demographic variables. I drop imputed observations from the ELS sample. My other working paper documents that the inclusion or exclusion of these observations has relatively little bearing on the sign or magnitude of the estimated achievement gap changes.

The NLSY79 and NLSY97 are high-quality, nationally representative surveys that contain ordinally comparable achievement data along with detailed student demographic information. Almost all respondents near the start of each survey took the Armed Services Vocational Aptitude Battery (ASVAB). Following an extensive literature in economics using these data, I study the math and reading subscores of the Armed Forces Qualifying Test (AFQT), which itself is a subset of the ASVAB.⁴⁵ The ASVAB test format changed from pencil-and-paper to a computer-aided design between the NLSY79 and NLSY97. The military commissioned a study to determine how to compare scores from the new and old test formats. Segall[21] constructs a score crosswalk by equating percentiles on the two tests for a sample of military recruits who were randomly assigned to one version of the test or the other.⁴⁶ I use these crosswalked scores exclusively, as they should be ordinally comparable in the sense previously defined.

Both NLSY surveys collect extensive longitudinal data on each respondent's family, income, health, education, and employment history. I do not use the longitudinal component of these surveys here. I define high- and low-income respondents as those in the top and bottom quintiles of the base-year household income distribution, which is reported continuously. This income measure sums together all sources of income (wage, investment, business, etc.) for all household members. Since the youth I study are all younger than 18 years old, their total contribution to household income is typically negligible. Although I have not specifically assessed the robustness of my estimates to these data choices, I found in Nielsen[16] that ordinal income-achievement estimates using these data are not sensitive to plausible alternative income definitions.⁴⁷

⁴⁴ For example, suppose there are 8 ordered income categories with equal numbers of respondents in each bucket. Then, the high-income group would simply be the top two income buckets (containing the top 25% of the sample) and the low-income group would likewise be the bottom two buckets. In this case, both categories are somewhat larger than the target comparison groups.

⁴⁵ The ASVAB components feeding into the AFQT changed in 1989. Throughout, I will use the current definition that sets the math subscore to be the sum of the arithmetic reasoning and math knowledge ASVAB component scores. The definition for reading did not change in 1989.

⁴⁶ The crosswalk is provided courtesy of Altonji, Bharadwaj, and Lange[1] and is available at the following url: http://www.econ.yale.edu/~fl88/data.html. The crosswalk contains percentile-mapped scores for each component score of the ASVAB. Simply adding these scores together is not strictly valid because it ignores the covariance of the different ASVAB components. Fortunately, Segall[22] reports that summing the crosswalked scores or crosswalking the summed scores leads to virtually identical results.

⁴⁷ For example, I estimate similar income-achievement gap changes if I use parental wage income instead of total household income to define the high- and low-income categories.
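The bucket-selection rule used for the categorical NELS/ELS income data can be sketched as follows. This is a minimal illustration that assumes buckets are pooled cumulatively from one end of the income distribution and that ties are broken in favor of fewer buckets. The function name is hypothetical, and the example reproduces only the stylized case of footnote 44, not the actual survey categories.

```python
def pooled_bucket_count(shares, target=0.2):
    """Number of leading buckets whose combined share is closest to `target`.

    `shares` lists bucket masses starting from the extreme of interest:
    highest bucket first for the high-income group, lowest bucket first
    for the low-income group.
    """
    best_count, best_gap, running = 1, float("inf"), 0.0
    for count, share in enumerate(shares, start=1):
        running += share
        gap = abs(running - target)
        if gap < best_gap:  # strict inequality keeps the smaller pool on ties
            best_count, best_gap = count, gap
    return best_count


# Footnote 44's example: 8 ordered categories with equal numbers of respondents.
equal_shares = [1 / 8] * 8
high_income_buckets = pooled_bucket_count(list(reversed(equal_shares)))
low_income_buckets = pooled_bucket_count(equal_shares)
```

In the footnote's example a single bucket holds 12.5% of respondents and two buckets hold 25%, so two buckets are closer to the 20% target at each end of the distribution.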
