feds · October 17, 2021

A Dummy Test of Identification in Models with Bunching

Abstract

We propose a simple test of the main identification assumption in models where the treatment variable takes multiple values and has bunching. The test consists of adding an indicator of the bunching point to the estimation model and testing whether the coefficient of this indicator is zero. Although similar in spirit to the test in Caetano (2015), the dummy test has important practical advantages: it is more powerful at detecting endogeneity, and it also detects violations of the functional form assumption. The test does not require exclusion restrictions and can be implemented in many approaches popular in empirical research, including linear, two-way fixed effects, and discrete choice models. We apply the test to the estimation of the effect of a mother’s working hours on her child’s skills in a panel data context (James-Burdumy 2005).

Finance and Economics Discussion Series Federal Reserve Board, Washington, D.C. ISSN 1936-2854 (Print) ISSN 2767-3898 (Online) A Dummy Test of Identification in Models with Bunching Carolina Caetano, Gregorio Caetano, Hao Fe, and Eric Nielsen 2021-068 Please cite this paper as: Caetano, Carolina, Gregorio Caetano, Hao Fe, and Eric Nielsen (2021). “A Dummy Test of Identification in Models with Bunching,” Finance and Economics Discussion Series 2021-068. Washington: Board of Governors of the Federal Reserve System, https://doi.org/10.17016/FEDS.2021.068. NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. References in publications to the Finance and Economics Discussion Series (other than acknowledgement) should be cleared with the author(s) to protect the tentative character of these papers.

A Dummy Test of Identification in Models with Bunching∗

Carolina Caetano†, Gregorio Caetano†, Hao Fe††, Eric Nielsen†††

September 2021

JEL Codes: C12, C21, C23, C24

1 Introduction

Caetano (2015) introduced the idea that confounders tend to be discontinuous at bunching points. This presents the opportunity to detect endogeneity by testing whether the outcome is discontinuous at a bunching point. A growing literature applies this test; see, e.g., Rozenas et al. (2017), Erhardt (2017), Pang (2017), Bleemer (2018a), Bleemer (2018b), Ferreira et al. (2018), Lavetti and Schmutte (2018), Caetano and Maheshri (2018), De Vito et al. (2019), Caetano et al. (2019), Fe and Sanfelice (2020) and Caetano et al. (2021).

In this paper, we present a test similar in spirit to Caetano (2015)’s discontinuity test (henceforth CDT), but with some important advantages. A key advantage is that it is easy to apply: the test consists of simply adding an indicator variable (dummy) of a bunching point to the model and testing whether the parameter of the indicator is zero.
The only requirement is a rank condition (essentially, that there is bunching), so it extends the applicability of CDT to cases where the treatment variable is discrete or mixed. In fact, some papers have used variations of this approach in an informal attempt to implement CDT (e.g. Caetano and Maheshri 2018, Ferreira et al. 2018 and Caetano et al. 2019). Yet there has been no formal study of this test; providing one is among the aims of this paper.

∗ †: University of Georgia. ††: San Diego State University. †††: Federal Reserve Board of Governors. We would like to thank Marinho Bertanha, Brantly Callaway, Bruno Ferman, Leonard Goff, Guido Imbens, Pedro Sant’Anna, Tymon Sloczynski, Firmin Tchatoka as well as seminar participants in several conferences and institutions for helpful conversations and comments. The analysis and conclusions set forth here are those of the authors and do not indicate concurrence by other members of the Federal Reserve Board research staff, the Board of Governors, or the Federal Reserve System.

Another advantage of the dummy test is that it tests all the main identification assumptions of the model at the same time, while CDT tests only exogeneity. This is desirable, since when presenting results, it is preferable to report diagnostic statistics about whether all identifying assumptions are valid, rather than only a subset. In linear models, the dummy test has power to detect violations of both the exogeneity and the linearity conditions. In models with heterogeneous treatment effects, it additionally detects correlated random effects. In linear difference-in-differences models that are estimated with two-way fixed effects regressions, the dummy test detects violations of the “strong parallel trends” assumption, which includes the standard parallel trends assumption plus the uncorrelated treatment effects assumption (e.g., de Chaisemartin and d’Haultfoeuille 2020, Callaway et al. 2021). In nonlinear models, including those estimated by nonlinear regression, GMM, and Maximum Likelihood, it detects violations of the model’s specific exogeneity, functional form, and distributional assumptions.

The dummy test is also substantially more powerful than CDT even at detecting endogeneity. The lower power of CDT is due to more than its use of nonparametric estimators; it also stems from the split-sample nature of that test. In fact, we compare the dummy test to a parametric version of CDT, and the dummy test is more powerful.

The array of applications where the dummy test can be used is vast. Bunching is commonly found when the treatment variable is constrained to be above or below a threshold. Constraints can be natural (e.g. when the variable cannot be negative, such as the number of cigarettes smoked), or generated by laws or rules (e.g. minimum and maximum requirements, such as minimum schooling). Bunching is also frequently found at interior points, for example due to changes in policies at known thresholds (e.g. bunching at kinks in the US tax schedule).
Extensive lists of examples can be found in Caetano (2015) and Caetano et al. (2020) as well as in the public finance bunching literature (see, e.g., Kleven (2016) and Bertanha et al. (2021)). In discrete choice models, there are often product characteristics that are bunched at zero, such as the number of previous purchases of cars of a given brand (Train and Winston 2007), the quantity in foodstuffs of sugar, fat, gluten, carbohydrates, and salt (Harding and Lovenheim 2017), crime in a neighborhood (Caetano and Maheshri 2018), the number of venues of a given type in a neighborhood (e.g. cafes, stores, parks, see Caetano and Maheshri 2019), and the racial composition of schools and neighborhoods (Caetano and Maheshri 2021).

Notably, the dummy test can also be used to assess the validity of some popular selection-on-unobservables strategies such as difference-in-differences approaches using multi-way fixed effects. Examples of empirical papers using such strategies where the treatment variable takes multiple values and has bunching include Nunn (2008), Anderson and Sallee (2011), Forman et al. (2012), Imberman et al. (2012), and Dube and Vargas (2013), among many others.

We apply the dummy test to study the effect of maternal working hours on the skills of the child. In this literature all models are linear, so it is important to test all the main identification assumptions, not only exogeneity. In a panel setting (James-Burdumy 2005), we find evidence that year fixed effects together with a detailed list of control variables are not sufficient to identify the effect of interest, but family and year fixed effects with the same list of controls are, provided the panel is short enough. With a longer panel, the strong parallel trends assumption becomes invalid and the two-way fixed effects strategy is rejected, highlighting the fragility of this strategy in this context, and the importance of using tests such as the one we propose to guide the empirical approach.
Large parts of the empirical research in social and behavioral sciences rely on observational data and non-experimental identification strategies. This test contributes to a growing list of useful tools for sensitivity analyses in models with multi-valued treatments when experimental or quasi-experimental variation is not readily available or may be imperfect (e.g. Altonji et al. 2005, McCrary 2008, Oster 2019, de Chaisemartin and d’Haultfoeuille 2020, Callaway et al. 2021, D’Haultfoeuille et al. 2021). Since the dummy test does not require exclusion restrictions or special data structures, it can be used in the early stages of research as a diagnostic test to assess whether a different identification strategy should be used (perhaps necessitating longitudinal data, instrumental variables, or different identification assumptions such as those explored in some of the papers cited above).

The remainder of the paper is organized as follows. In Section 2, we formalize the test in the linear case, and discuss its size. In Section 3, we study the power of the test. In Section 4, we compare the dummy test with CDT, and we also discuss the results of a Monte Carlo experiment comparing the tests. We present our application in Section 5. In Section 6, we show how the test can be applied to nonlinear models, and we conclude in Section 7. The Appendix contains proofs and details, as well as various extensions, including a section on how interactions of the dummy with controls, and multiple bunching points, can be used to increase power.

2 Test Statistic and Size

For simplicity, we focus first on linear models (including heterogeneous treatment effect and difference-in-differences models). However, the ideas translate well to nonlinear models, as detailed in Section 6. We want to identify $\beta$ in the following equation:

$$Y = \beta X + \varepsilon, \tag{1}$$

where $Y$ is the outcome variable, $X$ is the explanatory variable of interest (a scalar), and $\varepsilon$ is the remainder (so this equation is without loss of generality).¹ Because we are concerned that $X$ and $\varepsilon$ are correlated, we may want to use controls.
Let the vector $Z$ include a constant and any controls we may wish to include. To estimate $\beta$, we intend to run an OLS regression of $Y$ on $X$ and $Z$. For $\hat{\beta}$ obtained from this regression to be consistent, one needs to assume

Assumption 1. $E[\varepsilon \mid X, Z] = E[\varepsilon \mid Z] = Z'\lambda$.

This assumption implies $\mathrm{Cov}(X, \varepsilon \mid Z) = 0$. It states that any confounder of $X$ can be absorbed by a linear combination of the elements of $Z$. $Z$ may include fixed effects, lagged measures of $Y$ and $X$, proxy variables (including generated regressors) and any other observed control variables. This setting is therefore rather general. In Appendix C.1, we show that this setting includes heterogeneous treatment effects models and difference-in-differences models that are estimated with multi-way fixed effects. There, we also use the potential outcomes notation, which may be more familiar to some readers.

Let $W = (X, Z')'$, and assume that

Assumption 2. $E[(W', 1(X = 0))'(W', 1(X = 0))]$ is invertible.

¹ Note that $\varepsilon$ does not need to be centered around zero, which is why a constant is not explicitly included in this equation.

Because $Z$ includes a constant, this rank condition implicitly requires that $0 < P(X = 0) < 1$, i.e., that $X$ varies and has a bunching point at $X = 0$. Note that there is no restriction on the support beyond Assumption 2. In particular, the distribution of $X$ may be discrete or mixed.

We propose testing Assumption 1 by adding $1(X = 0)$ to the regression of $Y$ on $X$ and $Z$, and testing whether the coefficient of $1(X = 0)$ is equal to zero. To increase power, it may also be desirable to add interactions of $1(X = 0)$ and functions of $Z$ instead. Also, if more than one bunching point is available, it may be advantageous to add more dummies. Here we focus on the simple case where a single dummy is added, and we provide details of these extensions in Appendix D.

Let the sample be $\{(Y_i, X_i, Z_i')'\}_{i=1}^{n}$, and define $y = (Y_1, \ldots, Y_n)'$, $d = (1(X_1 = 0), \ldots, 1(X_n = 0))'$, and $w$ the matrix with rows equal to $(X_i, Z_i')'$. For $n$ large enough, Assumption 2 guarantees that we can write the matrix inverses below. Define $M_w = I - w(w'w)^{-1}w'$, where $I$ is the $n \times n$ identity matrix. Then the coefficient of $d$ in a regression of $y$ onto $x$, $z$ and $d$ is

$$\hat{\theta} = (d'M_w d)^{-1} d'M_w y.$$

The dummy test statistic is simply the t-statistic for the null that the coefficient of $1(X = 0)$ is zero. Specifically, the test statistic is $\hat{\theta}/SE(\hat{\theta})$, where $SE(\hat{\theta})$ is the estimator of the standard deviation of $\hat{\theta}$, which will depend on the assumptions and method of estimation. Thus, at the $\alpha$ significance level, we reject the null hypothesis

$$H_0: \text{Assumption 1 holds}$$

if $|\hat{\theta}/SE(\hat{\theta})| > z_{1-\alpha/2}$, where $z_{1-\alpha/2}$ is the $(1 - \alpha/2) \cdot 100$th percentile of the standard normal distribution.

Technically, the dummy test is identical to a specification test in which we add an additional term, $1(X = 0)$, to the regression, and then test if the coefficient of this new term is equal to zero. Such tests are ubiquitous in practice.
Establishing that the size is asymptotically correct is simply a matter of proving the convergence in distribution of the OLS estimator of the dummy coefficient in the augmented regression, and the consistency of the corresponding standard error estimator. These convergence results have been established for a host of combinations of data structures and assumptions about the data. Instead of choosing a specific structure and repeating such results here, we refer the reader directly to the relevant papers. For classical cases, see White (1980) for cross-sectional data with independent but not identically distributed observations, and see Arellano et al. (1987) for panel data with clustered errors. Subsequently, many papers have established the asymptotic behavior of the OLS coefficients and standard errors under variations in the data structure and relaxations of the assumptions of the model. For example, asymptotic results for OLS regressions with generated covariates are established in Newey and McFadden (1994) and Newey (1994). The literature on spatial and panel data is also rich in different specifications, cluster definitions, variance models and covariance estimation techniques for which asymptotic results have been established (e.g. Lee 2007, Bester et al. 2011, Bonhomme and Manresa 2015, Bester et al. 2016, de Chaisemartin and d’Haultfoeuille 2018, de Chaisemartin and d’Haultfoeuille 2020).
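As an illustration, the test of this section can be sketched in a few lines of Python. The sketch assumes a cross-section with independent observations and uses Eicker-White (HC0) standard errors; other data structures call for the standard error estimators discussed above. The function names `dummy_test` and `theta_partialled_out`, and the simulated design, are hypothetical and serve only to make the example self-contained.

```python
import numpy as np
from statistics import NormalDist


def dummy_test(y, x, z, alpha=0.05):
    """Dummy test at the bunching point X = 0 (illustrative sketch).

    y: (n,) outcome; x: (n,) treatment; z: (n, k) controls including a constant.
    Returns (theta_hat, se, t_stat, reject).
    """
    d = (x == 0).astype(float)                # indicator of the bunching point
    regressors = np.column_stack([x, z, d])   # augmented design (w, d)
    coef, *_ = np.linalg.lstsq(regressors, y, rcond=None)
    resid = y - regressors @ coef

    # Eicker-White (HC0) sandwich covariance; an assumption of this sketch.
    bread = np.linalg.inv(regressors.T @ regressors)
    meat = (regressors * resid[:, None] ** 2).T @ regressors
    cov = bread @ meat @ bread

    theta_hat = coef[-1]
    se = np.sqrt(cov[-1, -1])
    t_stat = theta_hat / se
    critical = NormalDist().inv_cdf(1 - alpha / 2)   # z_{1 - alpha/2}
    return theta_hat, se, t_stat, abs(t_stat) > critical


def theta_partialled_out(y, x, z):
    """theta_hat = (d' M_w d)^{-1} d' M_w y, computed by residualizing on w."""
    d = (x == 0).astype(float)
    w = np.column_stack([x, z])
    # M_w v is the residual from regressing v on w.
    m_d = d - w @ np.linalg.lstsq(w, d, rcond=None)[0]
    m_y = y - w @ np.linalg.lstsq(w, y, rcond=None)[0]
    return (m_d @ m_y) / (m_d @ m_d)


# Made-up design with bunching at zero in which Assumption 1 holds.
rng = np.random.default_rng(0)
n = 5_000
control = rng.normal(size=n)
x = np.maximum(0.0, 1.0 + control + rng.normal(size=n))   # censored at zero
z = np.column_stack([np.ones(n), control])
y = 2.0 * x + 0.5 * control + rng.normal(size=n)

theta_hat, se, t_stat, reject = dummy_test(y, x, z)
print(f"theta_hat={theta_hat:.3f}, se={se:.3f}, t={t_stat:.2f}, reject={reject}")
```

The second function is included only to confirm numerically that the coefficient from the augmented regression coincides with the partialled-out expression for $\hat{\theta}$ given above.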

3 Test Power

In this section, we study the power of the test. Local power analyses and other determinants of power which depend on the variance of $\hat{\theta}$ follow trivially from the asymptotic results of the specific setting (e.g., see the papers cited at the end of the previous section). We focus instead on the magnitude of $\hat{\theta}$, which is the main determinant of power as the sample size increases. Specifically, we examine what $\hat{\theta}$ identifies when Assumption 1 does not hold.

We can write, without loss of generality,

$$E[\varepsilon \mid X, Z] = \Gamma(X, Z) + \Delta(Z)\,1(X = 0), \tag{2}$$

where $\Gamma(X, Z)$ is continuous in $X$ at $X = 0$ for all $Z$.² This equation categorizes violations of Assumption 1 as either continuous ($\Gamma(X, Z) \neq Z'\lambda$) or discontinuous at $X = 0$ ($\Delta(Z) \neq 0$). In Appendix B, we develop a model in which the bunching at $X = 0$ is generated by a constraint that $X$ cannot be negative. This example is rather general, and fits most applications where bunching is at one extreme of the support of $X$’s distribution. There, $\Gamma$ and $\Delta$ have a structural interpretation within the model.³

Let $\boldsymbol{\Gamma} = (\Gamma(X_1, Z_1), \ldots, \Gamma(X_n, Z_n))'$, $\boldsymbol{\Delta}_0 = (\Delta(Z_1)1(X_1 = 0), \ldots, \Delta(Z_n)1(X_n = 0))'$, and $\boldsymbol{\epsilon} = (\epsilon_1, \ldots, \epsilon_n)'$. Then the estimated coefficient of $1(X = 0)$ is

$$\hat{\theta} = (d'M_w d)^{-1} d'M_w \boldsymbol{\Gamma} + (d'M_w d)^{-1} d'M_w \boldsymbol{\Delta}_0 + (d'M_w d)^{-1} d'M_w \boldsymbol{\epsilon}.$$

The last term of $\hat{\theta}$ is asymptotically negligible.⁴ In Appendix A, we show that, under standard assumptions such as random sampling and the existence of moments,

$$\hat{\theta} \to_p \frac{E[\Gamma(0, Z) \mid X = 0] - \Gamma_0^*}{1 - d^*} + \frac{E[\Delta(Z) \mid X = 0] - \Delta_0^*}{1 - d^*}, \tag{3}$$

where, letting $m_{ZX} := \operatorname{plim}_{n \to \infty} n^{-1}\left(z'(I - x(x'x)^{-1}x')z\right)$, the term $\Gamma_0^* := E[Z \mid X = 0]'\, m_{ZX}^{-1} \operatorname{plim}_{n \to \infty} n^{-1} z'(I - x(x'x)^{-1}x')\boldsymbol{\Gamma}$ is the asymptotic limit of the predicted value of $\Gamma(X, Z)$ in a regression on $X$ and $Z$ at $X = 0$. Analogously, $\Delta_0^* := E[Z \mid X = 0]'\, m_{ZX}^{-1} E[Z \Delta(Z) 1(X = 0)]$ is the asymptotic limit of the predicted value of $\Delta(Z)1(X = 0)$ from a regression on $X$ and $Z$ at $X = 0$.

The power is therefore dependent on two factors. The first factor, $E[\Gamma(0, Z) \mid X = 0] - \Gamma_0^*$, depends on the continuous nonlinearities. If $\Gamma(X, Z) = \alpha X + Z'\lambda$ (i.e. there is no misspecification, but there is linear endogeneity, through $\alpha X$), then this term will be zero.⁵ In every other case, this term will be different from zero. As we show in Appendix B, nonlinearities in $\Gamma$ may appear because the treatment function is misspecified, or because there is nonlinear endogeneity. In particular, this term detects the action of confounders that only affect the outcome for values of $X$ away from the bunching point, since this tends to generate continuous nonlinearities in $E[Y \mid X, Z]$.⁶

² Any function $f(X, Z)$ can be written without loss of generality as the sum of a continuous function in $X$ at $X = 0$ and the discontinuity in $X$ at $X = 0$.

³ In particular, in that model, (a) if there is endogeneity, then $\Delta(Z) \neq 0$, and thus there is a discontinuity in the unobservables generated by the constraint; (b) a discontinuity in the treatment function will also affect $\Delta(Z)$; and (c) $\Gamma$ is affected by continuous nonlinearities both in the treatment function and, if there is endogeneity, in the effect of the confounder on the outcome.

⁴ This holds for any data structure, method, and choice of $Z$ under the assumptions that guarantee the consistency of that method. For example, if the $\epsilon_i$ are independent but not identically distributed and we use the Eicker-White standard errors, then the negligibility of the last term follows by White (1980)’s Theorem 1, under Assumptions 2-4 in that paper (replacing White (1980)’s $X_i$ and $\varepsilon_i$ with $(W_i', 1(X_i = 0))'$ and $\epsilon_i$, respectively, and noting that Assumption 1 in that paper holds by (2)).

⁵ Nevertheless, since $\alpha$ and $\Delta(Z)$ tend to be determined by the same factors, in this case the endogeneity will usually be detected by the second term in equation (3). This can be seen in the example in Appendix B and in the Monte Carlo study (Appendix F).

⁶ For example, suppose that $Y = \beta X + Z'\gamma + \delta\zeta + \varepsilon$, where $E[\zeta \mid X, Z] = 0$ if $X < 10$, and $E[\zeta \mid X, Z] = a(X - 10)$ if $X \geq 10$. Then $E[Y \mid X, Z]$ is a piecewise linear function of $X$ for $X > 0$, with a kink at $X = 10$. So, while the confounder $\zeta$ does not vary discontinuously at $X = 0$, the dummy test can still detect it because of nonlinearities.

The second factor, $E[\Delta(Z) \mid X = 0] - \Delta_0^*$, depends on the size of the discontinuities. Such discontinuities appear if the treatment function is discontinuous at $X = 0$ or if there is endogeneity and the unobservables are discontinuous at the bunching point. As argued by Caetano (2015), and shown in Appendix B for a constrained choice model, discontinuities in unobservables are ubiquitous.

The term $d^* := E[Z \mid X = 0]'\, m_{ZX}^{-1} E[Z 1(X = 0)]$ in the denominator is the asymptotic limit of the predicted value of $1(X = 0)$ from a regression on $X$ and $Z$ at $X = 0$. In Appendix A, we show that $0 < 1 - d^* \leq 1$ and thus that the differences in the numerators are further magnified.

To illustrate the sources of power of the dummy test, consider our application. We are interested in the effects of the number of hours a mother works during the first three years of the child’s life on the child’s skills. There is pronounced bunching of 25% of mothers at zero hours, which can be seen in Figure 1.

Figure 1: Evidence of Bunching in Maternal Working Hours
[Plot. Vertical axis: CDF, all sample (0 to 1). Horizontal axis: average hours per year working during the first three years of the child (0 to 2,500).]
Note: The figure shows the empirical cumulative distribution function of the mother’s average hours working per year during the first three years of the child’s life for our full sample (N = 3,383). Source: NLSY79. See Section 5 for details about the application.

The top left panel in Figure 2 shows the local linear fit of the expected verbal score of the child (our outcome variable) for each positive level of working hours of the mother, as well as the average test score among those who are bunched. The evident discontinuity at zero has only two possible (non-exclusive) explanations: the effect of working hours on skills is discontinuous at zero hours, or the confounders are discontinuous at zero hours. Indeed, the vast majority of observable confounders are discontinuous at zero hours.

The other panels in Figure 2 are constructed similarly to the top left panel already discussed. These panels show discontinuities in the mother’s Armed Forces Qualifying Test (AFQT) score, a pre-market measure of her academic skills, the presence of the spouse in the household in the year the child took the test, and the Home Observation Measurement of the Environment (HOME) score.⁷ Moreover, we find that those children bunched at zero are systematically negatively selected, in the sense that the observables that are positively correlated with $Y$ (verbal test scores) tend to be discontinuously lower at $X = 0$. This is consistent with what we found in the top left panel of Figure 2 for the outcome variable $Y$.

Figure 2: Evidence that ε is Discontinuous at Bunching Points
[Four panels, each plotted against average hours working during the first three years of the child (0 to 2,000). Panel titles: Child: Verbal Score; Mother: AFQT; Spouse: =1 if Present; Household: HOME Score.]
Note: This figure shows the local linear regression of observables on $X$ (average hours working per year during the child’s first three years) along with the 95% confidence interval. The bandwidth is 300 hours. At $X = 0$ and $X = 2{,}080$, the average along with the 95% confidence interval is also shown. N = 3,383. Source: NLSY79. See Section 5 for details about the application.

Because discontinuities at $X = 0$ are so prevalent in observables, we expect that they should also be prevalent in unobservables. Thus, if there is endogeneity, we expect $\Delta(Z) \neq 0$ in this application.

The models in this literature (e.g. James-Burdumy 2005 and the references therein) rely not only on exogeneity as the main identification assumption; they also assume that the model is linear.
Any continuous nonlinearities resulting from the nonlinearity of the true model are reflected in $\Gamma$. For instance, it is possible that one additional hour of work becomes more or less costly for the development of the child’s skill the longer the mother works. Additionally, it is possible that there are confounders that affect the outcome only after the mother works enough hours (e.g. quality of child care).

⁷ The HOME score measures the quality of the home environment of the child for cognitive and emotional development (Bradley and Caldwell 1984; Bradley et al. 1992).
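The two sources of power in equation (3) can be illustrated with a small simulation. The sketch below is not the Monte Carlo design of Appendix F; all designs and parameter values are made up for illustration, there are no controls beyond a constant, and HC0 standard errors are assumed. Case A generates bunching through a constraint $X \geq 0$ with an unobservable that drives both $X$ and $Y$, so that the unobservable is discontinuous at zero. Case B mimics the kink example of footnote 6, where a confounder matters only for $X \geq 10$. Case C is an exogenous linear benchmark.

```python
import numpy as np


def dummy_coefficient(y, x):
    """Coefficient and HC0 t-statistic of 1(X = 0) in an OLS of y on (1, x, 1(X = 0))."""
    d = (x == 0).astype(float)
    r = np.column_stack([np.ones_like(x), x, d])
    bread = np.linalg.inv(r.T @ r)
    coef = bread @ (r.T @ y)
    resid = y - r @ coef
    cov = bread @ ((r * resid[:, None] ** 2).T @ r) @ bread
    return coef[-1], coef[-1] / np.sqrt(cov[-1, -1])


rng = np.random.default_rng(1)
n = 20_000
beta = 1.0

# Case A: endogeneity plus a constraint X >= 0 (discontinuity in the unobservable).
# For X > 0, eta = X - 0.5 is linear in X; at X = 0, E[eta | X = 0] jumps down.
# In this design the population dummy coefficient is about -0.64.
eta = rng.normal(size=n)
x_a = np.maximum(0.0, 0.5 + eta)
y_a = beta * x_a + eta + rng.normal(size=n)

# Case B: continuous nonlinearity only (kink at X = 10, as in footnote 6 with a = 1).
# 25% of observations are bunched at zero; the rest are uniform on (0, 20).
# In this design the population dummy coefficient is about 2.5.
x_b = np.where(rng.random(n) < 0.25, 0.0, rng.uniform(0.0, 20.0, size=n))
zeta = np.maximum(x_b - 10.0, 0.0) + rng.normal(size=n)
y_b = beta * x_b + zeta + rng.normal(size=n)

# Case C: exogenous and linear, so the dummy coefficient should be close to zero.
y_c = beta * x_a + rng.normal(size=n)

theta_a, t_a = dummy_coefficient(y_a, x_a)
theta_b, t_b = dummy_coefficient(y_b, x_b)
theta_c, t_c = dummy_coefficient(y_c, x_a)

for label, theta, t in [("A", theta_a, t_a), ("B", theta_b, t_b), ("C", theta_c, t_c)]:
    print(f"case {label}: theta_hat={theta:.3f}, t={t:.2f}")
```

In Case B the confounder is continuous at $X = 0$, so a test relying only on discontinuities would have nothing to detect, while the dummy coefficient is pushed away from zero by the misfit of the linear specification.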

The linearity assumption also indirectly rules out the possibility that the effect of the hours the mother works is discontinuous at zero hours. Indeed, it is plausible that the effect is continuous, as working 0 hours per year in the first three years of the child should have a similar effect on the child’s skill at age 4 to working, say, 1 hour per year. In any case, a discontinuity in the treatment effect would affect $\Delta(Z)$, and thus be detected by the dummy test.

Figure 1 also shows bunching of 3% of the sample at 2,080 hours, which is equivalent to 40 hours per week for 52 weeks. This opens the possibility of a multiple dummy test, by including $1(X = 0)$ and $1(X = 2{,}080)$ in the regression, and performing a joint test of whether the coefficients of both dummies are equal to zero, as described in Appendix D.2. However, Figure 2 does not show a corresponding discontinuity in the outcome or observables at that threshold. This suggests that both the unobservables and the treatment effect are likely to be continuous at 2,080 hours. As discussed in Appendix D.2, in this instance, the multiple dummy test is advisable only if the amount of bunching at $X = 2{,}080$ is substantially larger than at $X = 0$, which is not the case here.

4 Comparison with Caetano (2015)’s Discontinuity Test

In this section, we compare the power of the dummy test and Caetano (2015)’s Discontinuity Test (CDT). CDT identifies the quantity $\lim_{x \downarrow 0} E[E[Y \mid X = 0, Z] - Y \mid X = x]$. In our context, this is equivalent to

$$\lim_{x \downarrow 0} E[\Delta(Z) \mid X = x], \tag{4}$$

provided certain conditions on $\Gamma$ hold (e.g. bounded above by an integrable function). Thus, the power of CDT comes entirely from the discontinuities. In contrast, the dummy test can detect both discontinuities and continuous nonlinearities ($\Gamma(X, Z) \neq \alpha X + Z'\lambda$).

Supposing $\Gamma(X, Z) = \alpha X + Z'\lambda$, the power of both tests depends entirely on the discontinuities. First, we consider the estimated quantities. CDT identifies an average of the discontinuities among the values of $Z$ near the bunching point.
The dummy test identifies a more complex quantity (the second term in equation (3)). On the one hand, it subtracts from the average of the discontinuities the part of those discontinuities which is linearly predicted by $X$ and $Z$. On the other hand, it divides this term by $(1 - d^*)$, a number between 0 and 1. Neither quantity always dominates the other.

In contrast, the standard errors of the estimators of the two quantities are very different. CDT uses nonparametric estimators, so the rate of convergence of its test statistic is much slower than that of the dummy test ($\sqrt{nh}$ vs. $\sqrt{n}$, where $h$ is the bandwidth in the local linear regression in CDT). Therefore, the resulting power of the dummy test will usually be larger because the standard errors of the estimators will tend to be much smaller.

The stark difference in power between CDT and the dummy test can be seen in the Monte Carlo simulations in Appendix F. The results there reflect what is expected from the theory: the dummy test detects continuous misspecification (while CDT does not), and has substantially more power to detect endogeneity.

The variance advantage of the dummy test over CDT is not only due to the use of parametric versus nonparametric estimators. To show this, we develop a parametric version of CDT. We refer to this test as the Linear CDT. In model (1), the Linear CDT estimates (4) in two steps:⁸

1. Regress $Y$ onto $Z$ using only observations such that $X = 0$; let the coefficients be $\hat{\lambda}$. Calculate $Q = Z'\hat{\lambda} - Y$ in the entire sample.

2. Regress $Q$ onto $X$ using only observations such that $X > 0$. The intercept of this regression is $\hat{\theta}_{LCDT}$.

If Assumption 1 holds, then the first step yields an estimator of $\lambda$, so $Q \approx -\beta X - \epsilon$, where $\epsilon = Y - E[Y \mid X, Z]$. Thus, step 2 consistently estimates the true intercept, zero.

The Linear CDT is identical to CDT with linear instead of nonparametric regressions in the first and second steps. The Linear CDT has power to detect misspecification as well as endogeneity, and does not suffer the loss of power from the nonparametric estimation. However, like CDT, it is still a split-sample test, as steps 1 and 2 are estimated on different subsamples of the data. In Appendix F.3, we consider the performance of the Linear CDT in our Monte Carlo study. Although it is substantially more powerful than CDT, it is less powerful than the dummy test.

5 Application

We showcase the test using the application in James-Burdumy (2005), which estimates the effect of maternal working hours on children’s skills. We assemble the same data, from the National Longitudinal Survey of Youth (NLSY),⁹ and we consider both the original sample and an extended sample augmented to include data from more recent survey rounds. In the notation of our paper, $Y$ is the child’s verbal test score (Peabody Picture Vocabulary Test), measured around age four, and $X$ is the yearly average number of working hours of the mother in the three years following the child’s birth.¹⁰

The NLSY allows us to observe many covariates that help control for confounders, but controlling only for these covariates might not be sufficient, thus leading to bias in the effect of interest. James-Burdumy (2005) improves on the previous literature by adding time-invariant family fixed effects to these detailed control specifications.
Intuitively, the paper aims to compare the test scores of two siblings whose mother worked different hours during their respective first three years of life. The siblings were born in different years, and so the test scores are observed in different calendar years, 1986 and 1988, depending on the sibling. Because family and year fixed effects are used, the identification strategy is a conditional (on observed controls) version of difference-in-differences, with two years and many groups (families). Naturally, there is still the concern that there are other confounders varying with the child within the family, such as factors affecting labor supply and test scores that may change across children during their first three years of life (e.g., spouse’s presence, hours of work, quality of child care).

We also consider an extended sample where we include siblings whose test scores are observed in any of the years 1986, 1988, 1990 and 1992. This version of difference-in-differences also compares siblings’ test scores when they are observed farther apart from each other, which extends the sample substantially, but may lead to further sources of bias due to violations of the strong parallel trends assumption. Indeed, not only do the unobservables of the family have more scope to change over time (a violation of the standard parallel trends assumption), but the treatment effects are more likely to be heterogeneous (a violation of the uncorrelated treatment effects assumption, which is also necessary for strong parallel trends). Fortunately, the dummy test detects both of these types of confounders (see Appendix C.1.2 for details).

⁸ We formalize the Linear CDT in Appendix E.

⁹ Specifically, we link maternal work history data during the first three years of a child’s life from the National Longitudinal Surveys of Youth 1979 (NLSY79) to the children’s skill measures from the Children of the National Longitudinal Surveys (CNLSY).

¹⁰ We start the period in the fourth month after the month of the birth of the child to avoid measurement error related to differences in maternity leave.

Table 1 presents the results for different versions of the dummy test, different samples, and different identification strategies. The first two columns show results for the original sample from James-Burdumy (2005), while the last two columns show the results for the extended sample. The specification labeled “Diff” refers to the one used by James-Burdumy (2005) with the most detailed set of controls, including year fixed effects.¹¹ The specification labeled “DiD” adds to these controls family fixed effects, which is the best specification in that paper.¹² In the first two rows of the table, we implement the univariate version of the test, while in the next two rows we implement the multivariate version (Appendix D.1) by allowing for heterogeneity in the coefficient of the dummy variable by whether the spouse is both present and has a high school degree. Specifically, instead of the dummy $1(X = 0)$, we add two dummies $1(X = 0, Z)$ and $1(X = 0, Z^c)$, where $Z$ indicates that the spouse is present and has a high school degree, and $Z^c$ represents all other possibilities, and perform a joint test of whether the two coefficients are equal to zero. We choose to use the presence and level of education of the spouse as the source of heterogeneity because a mother’s decision to work likely depends on whether there is a spouse present who is capable of earning enough money on their own to support the family.
The first two columns show that in the original context of James-Burdumy (2005) we strongly reject Assumption 1 for the Diff specification, but we do not reject Assumption 1 for the DiD specification. Thus, the dummy test suggests that James-Burdumy (2005)’s approach of considering a second difference in her analysis is key to controlling for most confounders.

In the next columns of the table, we extend the sample and conduct the analogous comparisons under presumably stronger assumptions (because of the wider range of comparison among siblings across years, as discussed above). Reassuringly, we reject even the DiD identification strategy in this case (at the 10% level in the multivariate test, with a p-value of 0.085; the univariate p-value is 0.119). The table also shows that, in this context, the multivariate test (bottom rows) tends to have a bit more power to detect violations of Assumption 1 than the univariate test (top rows). However, the conclusions are very similar irrespective of the version of the dummy test one uses.

¹¹ The list of controls includes the child’s gender, birth order, and age; the mother’s age at the child’s birth, highest education level, and average wage in the child’s first three years of life; whether the spouse is present; the spouse’s income and highest education level; the number of children in the household with ages 0-2, 3-5, 6-11 and above 12; and region of residence and survey year fixed effects.
¹² James-Burdumy (2005) also provides an IV approach, but argues that this DiD specification is the preferred one.
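As a rough illustration of the univariate test discussed above, the following is a minimal sketch, not the authors' implementation. It assumes simulated (hypothetical) data in which X is censored at zero, and it uses a classical homoskedastic F statistic built from restricted and unrestricted sums of squared residuals, whereas an application with family fixed effects would typically rely on cluster-robust inference. The function names `ols`, `dummy_test_f` and `simulate`, as well as all parameter values, are illustrative assumptions.

```python
import random

def ols(rows, y):
    """OLS by solving the normal equations; returns (coefficients, SSR)."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        piv = a[c][c]
        a[c] = [v / piv for v in a[c]]
        for r in range(k):
            if r != c:
                f = a[r][c]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[c])]
    b = [a[i][k] for i in range(k)]
    ssr = sum((yi - sum(bi * xi for bi, xi in zip(b, r))) ** 2
              for r, yi in zip(rows, y))
    return b, ssr

def dummy_test_f(x, z, y):
    """Classical F statistic for H0: the coefficient on 1(X = 0) is zero."""
    restricted = [[1.0, xi, zi] for xi, zi in zip(x, z)]
    unrestricted = [r + [1.0 if xi == 0 else 0.0]
                    for r, xi in zip(restricted, x)]
    _, ssr_r = ols(restricted, y)
    _, ssr_u = ols(unrestricted, y)
    dof = len(y) - len(unrestricted[0])
    return (ssr_r - ssr_u) / (ssr_u / dof)  # a single restriction (q = 1)

def simulate(rho, n=4000, seed=0):
    """Hypothetical data: X bunches at 0; rho != 0 makes X endogenous."""
    rng = random.Random(seed)
    x, z, y = [], [], []
    for _ in range(n):
        zi, eta, nu = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        xi = max(0.0, zi + eta)  # censoring at zero creates the bunching
        x.append(xi)
        z.append(zi)
        y.append(1.0 + 0.5 * xi + zi + rho * eta + nu)
    return x, z, y

f_exog = dummy_test_f(*simulate(rho=0.0))   # identification assumption holds
f_endog = dummy_test_f(*simulate(rho=1.0))  # the confounder eta enters Y
print(f"F (exogenous X):  {f_exog:.2f}")
print(f"F (endogenous X): {f_endog:.2f}")
```

In this sketch the statistic should be large only in the endogenous design, because the unobservable jumps at the bunching point. The multivariate version would add one dummy per covariate cell and test the coefficients jointly.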

Table 1: F-statistics and P-Values of Dummy Tests

Sample                      Original            Extended
Bunching Location           X=0                 X=0
Identification Strategy     Diff      DiD       Diff      DiD
Univariate
  F statistic               10.358    0.000     16.071    2.428
  p-value                   0.001     0.997     0.000     0.119
Multivariate
  F statistic               6.485     0.987     18.088    2.471
  p-value                   0.002     0.373     0.000     0.085
N                           1867      1172      3383      2545

Note: This table shows the F statistics and p-values of the univariate and multivariate dummy tests for different samples and identification strategies. The samples are either the original one used in James-Burdumy (2005), or the extended sample discussed in the main text. The identification strategies are either Differences (Diff) or Difference-in-Differences (DiD), where the former includes several controls plus year FEs (see Footnote 11), while the latter adds family FEs as well. In the multivariate case, we allow for heterogeneity in the coefficient of the dummy by whether the spouse is both present and has a high school degree.

We conclude that most of the sources of bias in the original sample and application seem to come from confounders that vary across families, which are absorbed by the family fixed effects. By focusing on siblings whose test scores were observed within a narrow range of years, James-Burdumy (2005) seems to have successfully controlled for confounders with the DiD identification strategy. By contrast, an extended version of that DiD identification strategy where siblings’ test scores are allowed to be observed within a wider range of years does not seem to be valid.

6 Nonlinear Models

The dummy test can be implemented in nonlinear models. If the model allows for the inclusion of the dummy and the identification of its coefficient under the null, the test can be performed. The gamut of such models is very large, and it is not possible, as far as we know, to characterize every identification strategy under the same conceptual umbrella.
In this section, we show how the dummy test can be applied to a wide range of nonlinear models that are estimated with extremum estimators. This includes most classical models which are estimated by Maximum Likelihood or GMM, such as nonlinear regression, probit, and discrete choice models, among others.

Assumption 3. Suppose that, under some condition A, the parameter $\gamma_0$ can be identified as
$$\gamma_0 = \arg\max_{\gamma \in \Lambda} M_0(Y, X, Z; \gamma).$$

This assumption states that $\gamma_0$ is identified as the argument which maximizes a function $M_0$ within a parameter set $\Lambda$ when condition A holds. We want to test whether condition A holds. We will now extend the model to a more general one which includes the dummies and nests the original model under condition A.

Assumption 4. Suppose that there exists a function $Q_0(Y, X, Z, 1(X = 0); \gamma, \delta)$ such that if condition A holds,
$$(\gamma_0, 0) = \arg\max_{\gamma \in \Lambda,\ \delta \in \Omega} Q_0(Y, X, Z, 1(X = 0); \gamma, \delta),$$
and
$$Q_0(Y, X, Z, 1(X = 0); \gamma, 0) = M_0(Y, X, Z; \gamma).$$

The following theorem establishes that one can test condition A by testing whether $\delta = 0$ using a t or F test, depending on whether $\delta$ is a scalar or a vector.

Theorem 6.1. Let $\hat{Q}_n$ be an estimator of $Q_0$. If Assumptions 3 and 4, as well as the conditions of Theorem 3.1 in Newey and McFadden (1994) hold,¹³ then if condition A holds,
$$\sqrt{n}\,\hat{V}^{-1/2} \begin{pmatrix} \hat{\gamma} - \gamma_0 \\ \hat{\delta} \end{pmatrix} \to_d N(0, I),$$
where $I$ is the identity matrix, and $\hat{V}$ is the consistent estimator of the asymptotic variance of the coefficients built using Theorem 4.1 in Newey and McFadden (1994).

The proof of Theorem 6.1 is trivial and is thus omitted. Note that the setup above applies to any specification test based on the inclusion of an additional variable into the model which should not be there if the identification assumption holds. We propose specifically the inclusion of the dummy because of the discontinuities in confounders that are often found at bunching points, as we discuss in Section 3.

In Appendix C.2, we discuss assumptions, power, and implementation details in the context of some well known models fitting this setting: standard nonlinear models which are estimated with GMM (Appendix C.2.1), probit (Appendix C.2.2), and discrete choice models (Appendix C.2.3).

7 Conclusion

We propose a simple test of identification when the treatment variable takes multiple values and has a bunching point. The test is easy to implement: it consists of adding a dummy of the bunching point to the model and testing if the coefficient of the dummy is equal to zero. To increase power, one may also interact the dummy with controls, or include dummies of additional bunching points.
The dummy test is similar in spirit to Caetano (2015)’s discontinuity test, but it is more powerful at detecting endogeneity, and it also detects misspecification. The test can be used to validate identification strategies or diagnose problems, and it has advantages over Caetano (2015)’s discontinuity test on both accounts.

The test can be naturally extended for a multivariate treatment vector with bunching points at all coordinates, as implemented in Caetano and Maheshri (2018) and Caetano et al. (2019). We conjecture that this test can also be extended to other contexts where bunching has been used for testing, analogously to what has been done by Caetano et al. (2016) for control function approaches and by Khalil and Yildiz (2019) for treatment variables without bunching.

¹³ Note that implicit in these conditions is often a requirement that $1(X = 0)$ is not a part of Z, that is, the model is continuous in X at X = 0. It is not always necessary for the model to be continuous if the discontinuity does not invalidate the rank condition on the extended model (e.g. the model is $Y = \beta X/(1 + \alpha 1(X = 0)) + \varepsilon$, and the extended model includes the dummy additively).

References

Altonji, J. G., Elder, T. E., and Taber, C. R. (2005). Selection on observed and unobserved variables: Assessing the effectiveness of Catholic schools. Journal of Political Economy, 113(1):151–184.
Anderson, S. T. and Sallee, J. M. (2011). Using loopholes to reveal the marginal cost of regulation: The case of fuel-economy standards. American Economic Review, 101(4):1375–1409.
Arellano, M. et al. (1987). Computing robust standard errors for within-groups estimators. Oxford Bulletin of Economics and Statistics, 49(4):431–434.
Berry, S., Levinsohn, J., and Pakes, A. (1995). Automobile prices in market equilibrium. Econometrica: Journal of the Econometric Society, pages 841–890.
Berry, S., Linton, O. B., and Pakes, A. (2004). Limit theorems for estimating the parameters of differentiated product demand systems. The Review of Economic Studies, 71(3):613–654.
Bertanha, M., McCallum, A. H., and Seegert, N. (2021). Better bunching, nicer notching. arXiv preprint arXiv:2101.01170.
Bester, C. A., Conley, T. G., and Hansen, C. B. (2011). Inference with dependent data using cluster covariance estimators. Journal of Econometrics, 165(2):137–151.
Bester, C. A., Conley, T. G., Hansen, C. B., and Vogelsang, T. J. (2016). Fixed-b asymptotics for spatially dependent robust nonparametric covariance matrix estimators. Econometric Theory, 32(1):154.
Bleemer, Z. (2018a). The effect of selective public research university enrollment: Evidence from California. Working Paper.
Bleemer, Z. (2018b). Top percent policies and the return to postsecondary selectivity. Available at SSRN 3272618.
Bonhomme, S. and Manresa, E. (2015). Grouped patterns of heterogeneity in panel data. Econometrica, 83(3):1147–1184.
Bradley, R. H. and Caldwell, B. M. (1984). The HOME inventory and family demographics. Developmental Psychology, 20(2):315.
Bradley, R. H., Caldwell, B. M., Brisby, J., Magee, M., Whiteside, L., and Rock, S. L. (1992). The HOME inventory: a new scale for families of pre- and early adolescent children with disabilities. Research in Developmental Disabilities, 13(4):313–333.
Caetano, C. (2015). A test of exogeneity without instrumental variables in models with bunching. Econometrica, 83(4):1581–1600.
Caetano, C., Caetano, G., and Nielsen, E. (2020). Correcting Endogeneity Bias in Models with Bunching. Working Paper.
Caetano, C., Caetano, G., and Nielsen, E. (2021). Should children do more enrichment activities? Leveraging bunching to correct for endogeneity. Working Paper.
Caetano, C., Rothe, C., and Yıldız, N. (2016). A discontinuity test for identification in triangular nonseparable models. Journal of Econometrics, 193(1):113–122.
Caetano, G., Kinsler, J., and Teng, H. (2019). Towards causal estimates of children’s time allocation on skill development. Journal of Applied Econometrics, 34(4):588–605.

Caetano, G. and Maheshri, V. (2018). Identifying Dynamic Spillovers of Crime with a Causal Approach to Model Selection. Quantitative Economics, 9(1):343–394.
Caetano, G. and Maheshri, V. (2019). Gender segregation within neighborhoods. Regional Science and Urban Economics, 77:253–263.
Caetano, G. and Maheshri, V. (2021). Explaining Recent Trends in US School Segregation. Technical report, Forthcoming.
Callaway, B., Goodman-Bacon, A., and Sant’Anna, P. (2021). Dose-response difference in differences: Identification. Working Paper.
Chow, Y. S. and Teicher, H. (1997). Probability Theory. Springer, New York.
de Chaisemartin, C. and D’Haultfœuille, X. (2020). Difference-in-differences estimators of intertemporal treatment effects. Available at SSRN 3731856.
de Chaisemartin, C. and d’Haultfoeuille, X. (2020). Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review, 110(9):2964–96.
de Chaisemartin, C. and d’Haultfoeuille, X. (2018). Fuzzy differences-in-differences. The Review of Economic Studies, 85(2):999–1028.
De Vito, A., Jacob, M., and Müller, M. A. (2019). Avoiding taxes to fix the tax code. Available at SSRN 3364387.
D’Haultfoeuille, X., Hoderlein, S., and Sasaki, Y. (2021). Nonparametric difference-in-differences in repeated cross-sections with continuous treatments. arXiv preprint arXiv:2104.14458.
Dubé, J.-P., Fox, J. T., and Su, C.-L. (2012). Improving the numerical performance of static and dynamic aggregate discrete choice random coefficients demand estimation. Econometrica, 80(5):2231–2267.
Dube, O. and Vargas, J. F. (2013). Commodity price shocks and civil conflict: Evidence from Colombia. The Review of Economic Studies, 80(4):1384–1421.
Erhardt, E. C. (2017). Microfinance beyond self-employment: Evidence for firms in Bulgaria. Labour Economics, 47:75–95.
Fe, H. and Sanfelice, V. (2020). How bad is crime for business? Evidence from consumer behavior. Center for Health Economics and Policy Studies Working Paper.
Ferreira, D., Ferreira, M. A., and Mariano, B. (2018). Creditor control rights and board independence. The Journal of Finance, 73(5):2385–2423.
Forman, C., Goldfarb, A., and Greenstein, S. (2012). The internet and local wages: A puzzle. American Economic Review, 102(1):556–75.
Harding, M. and Lovenheim, M. (2017). The effect of prices on nutrition: comparing the impact of product- and nutrient-specific taxes. Journal of Health Economics, 53:53–71.
Imbens, G. W. (2000). The role of the propensity score in estimating dose-response functions. Biometrika, 87(3):706–710.
Imberman, S. A., Kugler, A. D., and Sacerdote, B. I. (2012). Katrina’s children: Evidence on the structure of peer effects from hurricane evacuees. American Economic Review, 102(5):2048–82.

James-Burdumy, S. (2005). The effect of maternal labor force participation on child development. Journal of Labor Economics, 23(1):177–211.
Khalil, U. and Yildiz, N. (2019). A Test of Selection on Observables Assumption Using a Discontinuously Distributed Covariate. Working paper.
Kleven, H. J. (2016). Bunching. Annual Review of Economics, 8:435–464.
Lavetti, K. and Schmutte, I. M. (2018). Estimating compensating wage differentials with endogenous job mobility. Working paper.
Lee, L.-F. (2007). Identification and estimation of econometric models with group interactions, contextual factors and fixed effects. Journal of Econometrics, 140(2):333–374.
McCrary, J. (2008). Manipulation of the running variable in the regression discontinuity design: A density test. Journal of Econometrics, 142(2):698–714.
McFadden, D. et al. (1973). Conditional logit analysis of qualitative choice behavior. Institute of Urban and Regional Development, University of California.
Nevo, A. (2000). A practitioner’s guide to estimation of random-coefficients logit models of demand. Journal of Economics & Management Strategy, 9(4):513–548.
Newey, W. K. (1994). The asymptotic variance of semiparametric estimators. Econometrica, 62(6):1349–1382.
Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2111–2245.
Nunn, N. (2008). The long-term effects of Africa’s slave trades. The Quarterly Journal of Economics, 123(1):139–176.
Oster, E. (2019). Unobservable selection and coefficient stability: Theory and evidence. Journal of Business & Economic Statistics, 37(2):187–204.
Pang, J. (2017). Do subways improve labor market outcomes for low-skilled workers. Working paper, Syracuse University.
Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.
Rozenas, A., Schutte, S., and Zhukov, Y. (2017). The political legacy of violence: The long-term impact of Stalin’s repression in Ukraine. The Journal of Politics, 79(4):1147–1161.
Słoczyński, T. (2020). Interpreting OLS estimands when treatment effects are heterogeneous: Smaller groups get larger weights. Forthcoming, Review of Economics and Statistics.
Train, K. E. (2009). Discrete Choice Methods with Simulation. Cambridge University Press.
Train, K. E. and Winston, C. (2007). Vehicle choice behavior and the declining market share of US automakers. International Economic Review, 48(4):1469–1496.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4):817–838.

A Proofs of the claims in Section 3

First, we prove the statements made prior to equation (3). Let $\hat{m}_{ZX} = n^{-1}\left(z'(I - x(x'x)^{-1}x')z\right)$, then
$$(d'M_w d)^{-1} d'M_w \Delta_0 = \frac{d'\Delta_0 - \left(\frac{1}{n}\sum_{i=1}^{n} Z_i 1(X_i = 0)\right)' \hat{m}_{ZX}^{-1}\, z'\Delta_0}{d'd - \left(\frac{1}{n}\sum_{i=1}^{n} Z_i 1(X_i = 0)\right)' \hat{m}_{ZX}^{-1}\, z'd}.$$

Note that the coefficient vector of Z in a regression of a variable Q onto X and Z is $\hat{m}_{ZX}^{-1} z'M_x q$, where $M_x = I - x(x'x)^{-1}x'$ and $q = (Q_1, \dots, Q_n)'$. Moreover, the predicted value of such a regression at X = 0 is $\left(\frac{1}{n}\sum_{i=1}^{n} Z_i 1(X_i = 0)\right)' \hat{m}_{ZX}^{-1} z'M_x q$. However, note that since X is orthogonal to $1(X = 0)$, $M_x d = d$, and $M_x \Delta_0 = \Delta_0$. Therefore, the second term in the numerator is the prediction of $\Delta(Z)1(X = 0)$ in a regression onto X and Z at X = 0, and the second term in the denominator is the prediction of $1(X = 0)$ in a regression onto X and Z at X = 0. The denominator is a quadratic form, and by Assumption 2, for n large enough, it is positive with probability equal to one. Analogous interpretations can be made for the first term in (3).

Next, we prove equation (3) itself. The convergence in probability of $d'd/n$, $d'\Delta_0/n$, $d'z/n$, $z'\Delta_0/n$ and $\hat{m}_{ZX}$ can be shown using whichever laws of large numbers are applicable for the specific data structure and Z specification. For example, under the setting in White (1980) (i.e. independent but not identically distributed observations, and $E[|\Delta(Z_i)|^{2+\alpha}] \le A$, $E[\|Z_i\|^{2+\alpha}] \le A$ and $E[\|Z_i \Delta(Z_i)\|^{2+\alpha}] \le A$ for some $\alpha, A > 0$), the results are obtained by Brunk-Chung’s Strong Law of Large Numbers (see Chow and Teicher (1997), Theorem 10.1.3, for r = 1). By the continuous mapping theorem,
$$(d'M_w d)^{-1} d'M_w \Delta_0 \to_p \frac{E[\Delta(Z)|X = 0] - E[Z|X = 0]'\, m_{ZX}^{-1}\, E[Z\Delta(Z)1(X = 0)]}{1 - E[Z|X = 0]'\, m_{ZX}^{-1}\, E[Z 1(X = 0)]}.$$
The convergence of the first term to $(E[\Gamma(0,Z)|X = 0] - \Gamma_0^*)/(1 - d^*)$ is established analogously.

B A model of constrained choice

In this section, we present a model where X is the result of a constrained problem. We show that it is possible to interpret Γ and ∆ structurally. Importantly, this example provides intuition that endogeneity and nonlinearities typically affect both Γ and ∆, but in different ways.

Consider the case that X cannot be negative. The choice of X is assumed to result from a combination of observable, Z, and unobservable, η, factors under a constrained problem. The solution to this problem thus yields
$$X = \max\{0, h(Z, \eta)\}, \qquad (5)$$
where η is scalar and h is strictly monotonic in η (suppose it is increasing, for the sake of the exposition). Assume that $0 < P(h(Z,\eta) < 0) < 1$, so that the constraint is binding for a subset of the population. Define $X^* = h(Z, \eta)$ as the “latent” or “desired” choice absent the non-negativity constraint.

Let the model be
$$Y = g(X, Z) + m(Z)\eta + \nu,$$
where g is continuous in X at X = 0, and $E[\nu|X, Z, \eta] = 0$. Since the intention is to estimate the effect of X on Y by regressing Y onto X and Z, there are two concerning problems. First, there may be misspecification of the functional form: $g(X,Z) \neq \beta X + Z'\lambda$. Second, there may be endogeneity: $m(Z) \neq 0$.

Define the notation $D_f(Z) = f(0,Z) - \lim_{x \downarrow 0} f(x,Z)$, and assuming the limits below exist, we can write
$$E[Y|X,Z] = [g(X,Z) - D_g(Z)1(X = 0)] + m(Z)[h^{-1}(X;Z) - D_{h^{-1}}(Z)1(X = 0)]$$
$$\qquad + D_g(Z)1(X = 0) + m(Z)\left[E[h^{-1}(X^*;Z)|X^* \le 0, Z] - \lim_{x \downarrow 0} h^{-1}(x;Z)\right]1(X = 0).$$

Note that $E[h^{-1}(X^*;Z)|X^* \le 0, Z] - \lim_{x \downarrow 0} h^{-1}(x;Z) \le 0$ a.s., and $P(E[h^{-1}(X^*;Z)|X^* \le 0, Z] - \lim_{x \downarrow 0} h^{-1}(x;Z) < 0) > 0$, thus if g is continuous in X at zero, there will be a discontinuity in $E[Y|X,Z]$ if and only if $m(Z) \neq 0$ (i.e. if and only if there is endogeneity). If, additionally, g is discontinuous, this may increase or decrease the discontinuity in the outcome depending on whether the signs of both discontinuities are equal or different, respectively.

This model satisfies the structure in (2). Here, $\Gamma(X,Z) = -\beta X + [g(X,Z) - D_g(Z)1(X = 0)] + m(Z)[h^{-1}(X;Z) - D_{h^{-1}}(Z)1(X = 0)]$, which depends on the continuous nonlinearities in g, and the continuous nonlinearities in h if there is endogeneity. The discontinuous part is $\Delta(Z)1(X = 0) = D_g(Z)1(X = 0) + m(Z)\left[E[h^{-1}(X^*;Z)|X^* \le 0, Z] - \lim_{x \downarrow 0} h^{-1}(x;Z)\right]1(X = 0)$, which depends on the discontinuity of g at X = 0, and on whether there is endogeneity. Therefore, endogeneity affects both Γ (via the nonlinearity of h) and ∆. The nonlinearity in g affects Γ, and if additionally g is discontinuous, it also affects ∆.

To understand the power when there are no nonlinearities, suppose that $h(Z,\eta) = Z'\pi + \eta$, $g(X,Z) = \beta X + Z'\gamma$, and $m(Z) = \delta$. Then, $\Gamma(X,Z) = \delta X + Z'(\gamma - \pi\delta)$, and $\Delta(Z) = \delta E[X^*|X^* \le 0, Z]$. In this case, the first component of the power of equation (3) is equal to zero. The power then depends entirely on the discontinuity. That is, it depends on the magnitude of δ, and on how binding the constraint on the choice of X is, which is expressed by how negative $E[X^*|X^* \le 0, Z]$ is.
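The linear special case above can be checked numerically. The sketch below is an illustration under added assumptions, not part of the paper: Z contains only a constant, so that $X^* = \pi_0 + \eta$ with $\eta \sim N(0,1)$, in which case $\delta E[X^*|X^* \le 0] = \delta(\pi_0 - \phi(\pi_0)/\Phi(-\pi_0))$. The names `dummy_coefficient` and `expected_discontinuity` and all parameter values are hypothetical.

```python
import math
import random

def dummy_coefficient(x, y):
    """Coefficient on 1(X = 0) in an OLS regression of Y on (1, X, 1(X = 0)).

    The dummy fits the X = 0 group exactly, so the intercept and slope are
    determined by the X > 0 observations alone, and the dummy coefficient is
    the mean of Y at X = 0 minus the fitted intercept.
    """
    pos = [(xi, yi) for xi, yi in zip(x, y) if xi > 0]
    zero_y = [yi for xi, yi in zip(x, y) if xi == 0]
    mx = sum(p[0] for p in pos) / len(pos)
    my = sum(p[1] for p in pos) / len(pos)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in pos)
             / sum((xi - mx) ** 2 for xi, _ in pos))
    intercept = my - slope * mx
    return sum(zero_y) / len(zero_y) - intercept

def expected_discontinuity(delta, pi0):
    """delta * E[X* | X* <= 0] when X* ~ N(pi0, 1)."""
    pdf = math.exp(-0.5 * pi0 * pi0) / math.sqrt(2 * math.pi)
    cdf_neg = 0.5 * (1 + math.erf(-pi0 / math.sqrt(2)))  # P(X* <= 0)
    return delta * (pi0 - pdf / cdf_neg)

# Hypothetical parameter values for the simulation.
rng = random.Random(1)
beta, gamma, delta, pi0, n = 1.0, 0.0, 2.0, 0.5, 50000
x, y = [], []
for _ in range(n):
    eta, nu = rng.gauss(0, 1), rng.gauss(0, 1)
    xi = max(0.0, pi0 + eta)  # X = max{0, h(Z, eta)} with h = pi0 + eta
    x.append(xi)
    y.append(beta * xi + gamma + delta * eta + nu)

estimate = dummy_coefficient(x, y)
target = expected_discontinuity(delta, pi0)
print(f"estimated dummy coefficient: {estimate:.3f}")
print(f"delta * E[X*|X* <= 0]:       {target:.3f}")
```

Under these assumptions the estimated dummy coefficient should be close to the closed-form value, and it shrinks toward zero as δ goes to zero or as the constraint binds less tightly.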
C Examples of linear and nonlinear models

In this section we discuss some popular models, and how the dummy test may be applied in those contexts.

C.1 Linear models

In this section we show that heterogeneous treatment effect models, as well as difference-in-differences models estimated with two-way fixed effect regressions, fit the linear framework in equation (1), and discuss the meaning of Assumption 1 in those contexts. Therefore, the standard linear dummy test can be used.

C.1.1 Heterogeneous treatment effects

The linear setting from equation (1) includes a standard heterogeneous treatment effects model with multivalued treatment, when one wishes to identify the average treatment effect. Suppose that
$$Y_i = \beta_i X_i + U_i. \qquad (6)$$
This model can be written as equation (1), where $\beta = E[\beta_i]$ is the average treatment effect, and $\varepsilon_i = U_i + (\beta_i - \beta)X_i$. Supposing that the covariate vector $Z_i$ includes only a constant, Assumption 1 requires that both $U_i$ and $\beta_i$ are mean independent of $X_i$, which are the standard assumptions for identification of average treatment effects.¹⁴ If $Z_i$ includes other variables besides a constant, Assumption 1 is equivalent to $E[U_i|X_i,Z_i] = Z_i'\lambda$ and $E[\beta_i|X_i,Z_i] = E[\beta_i]$, which are the necessary conditions for the identification of average treatment effects from a regression of $Y_i$ on $X_i$ with controls.¹⁵

C.1.2 Difference-in-differences

The linear setting from equation (1) also applies to testing whether average treatment effects are identified by difference-in-differences in a two-way fixed effects regression. (In fact, the arguments below also hold analogously for one-way and multi-way fixed effects, as well as for pooled cross-section data with group fixed effects.) Here we discuss the standard case with two periods where nobody is treated in the first period, but the test can also be applied with multiple periods with or without staggered treatment adoption.

Suppose that the treatment variable of individual i in time t is $D_{i,t}$, which assumes multiple values, and the potential outcome under each treatment level d is $Y_{i,t}(d)$. We intend to run a regression of the observed outcome $Y_{i,t}$ onto $D_{i,t}$ using individual and time fixed-effects. This is equivalent to our model, where $X_{i,t} = D_{i,t}$ and $Z_{i,t} = (1, 1(i = 1), \dots, 1(i = N-1), 1(t = 1))'$.¹⁶

The potential outcome of treatment d is given by
$$Y_{i,t}(d) = \alpha + \beta_{i,t} d + \gamma_i + \delta_t + U_{i,t}, \qquad (7)$$
where $\gamma_i$ is an unobservable that does not vary in time, and $\delta_t$ is an unobservable which does not vary per individual. Here, the treatment effect of one additional unit is the same for all d, that is, $Y_{i,t}(d) - Y_{i,t}(d-1) = \beta_{i,t}$, and we are interested in identifying $\beta = E[\beta_{i,t}]$.

Define $\lambda = (\alpha + \gamma_N + \delta_0, \gamma_1 - \gamma_N, \dots, \gamma_{N-1} - \gamma_N, \delta_1 - \delta_0)'$. Then $\varepsilon_{i,t} = Z_{i,t}'\lambda + U_{i,t} + (\beta_{i,t} - \beta)D_{i,t}$. Assumption 1 in this context is therefore equivalent to $E[U_{i,t}|X_{i,t},Z_{i,t}] = 0$ (exogeneity) and $E[\beta_{i,t}|X_{i,t},Z_{i,t}] = E[\beta_{i,t}]$ (uncorrelated random effects).¹⁷ In this model, these conditions correspond to the “strong parallel trends” assumption discussed in Callaway et al. (2021) as the main necessary condition for the identification of average treatment effects in difference-in-differences models.¹⁸ See also related Assumptions 4, 5 and 7 in de Chaisemartin and d’Haultfoeuille (2020) for the case with discrete ordered treatment.

Note that the dummy test in the two-way fixed effects model is also a test of the linearity assumption in model (7). Indeed, if linearity does not hold, the two-way fixed effects regression cannot identify average treatment effects (see Corollary 2 in de Chaisemartin and D’Haultfœuille 2020 and Theorem 3 in Callaway et al. 2021). In this case, identification of interesting quantities may be done with other approaches, such as the ones suggested by the papers cited above, as well as by D’Haultfoeuille et al. (2021).

C.2 Nonlinear models

Here we discuss implementation details of some well known examples of nonlinear models within the setting discussed in Section 6. We use several results in Newey and McFadden (1994), which we abbreviate as N-MF.

C.2.1 Standard nonlinear model

Suppose that
$$Y = g(X, Z; \gamma_0) + U,$$
where g is a known function, and Y, X (a scalar) and Z are observable variables. The parameter $\gamma_0$ is identifiable if $E[U|X,Z] = 0$.¹⁹ If this condition does not hold, then without loss of generality $E[Y|X,Z] = g(X,Z;\gamma_0) + \Gamma(X,Z) + \Delta(Z)1(X = 0)$.

One can perform the test by including the dummy $1(X = 0)$ into the nonlinear or GMM regression of Y onto X and Z. Here, we propose testing the identification assumptions by estimating the model $Y = g(X,Z;\gamma_0) + \delta 1(X = 0) + \nu$ as if $E[\nu|X,Z] = 0$, and testing if δ is equal to zero. Barring very specific functional shapes of Γ, this test also has the power to detect $\Gamma \neq 0$ because nonlinearities, whether caused by misspecification or endogeneity, are at least partially absorbed by the dummy variable.

¹⁴ In this model, Assumption 1 is equivalent to ignorability of X (e.g., Rosenbaum and Rubin 1983 and Imbens 2000). In the notation of the potential outcomes model, $X_i = D_i$, $\beta_i = Y_i(1) - Y_i(0)$ and $U_i = Y_i(0)$. Then, $\beta = E[Y_i(1) - Y_i(0)]$ is the average treatment effect of one additional treatment unit, and $\varepsilon_i = Y_i(0) + ([Y_i(1) - Y_i(0)] - E[Y_i(1) - Y_i(0)])D_i$. Assumption 1 in this model is thus equivalent to exogeneity ($E[Y_i(0)|D_i,Z_i] = E[Y_i(0)]$) and uncorrelated treatment effects ($E[Y_i(1) - Y_i(0)|D_i,Z_i] = E[Y_i(1) - Y_i(0)]$).
¹⁵ The often assumed conditional ignorability condition (here equivalent to $E[U_i|X_i,Z_i] = E[U_i|Z_i]$, and $E[\beta_i|X_i,Z_i] = E[\beta_i|Z_i]$) is weaker than Assumption 1, but it is not sufficient for the identification of average treatment effects in linear regression models (see Słoczyński 2020).
¹⁶ Defining $Z_{i,t}$ in this manner is, of course, an abuse of notation. The vector of fixed effect dummies is not, technically, a random variable.
¹⁷ Here we prove only that Assumption 1 implies $E[U_{i,t}|X_{i,t},Z_{i,t}] = 0$, as, other than that, the statement is trivial. Note that $E[U_{i,t}|Z_{i,t}] = \sum_j a_j 1(j = i) + \sum_s b_s 1(s = t) + \sum_j \sum_s c_{js} 1(j = i)1(s = t) = a_i + b_t + \sum_j \sum_s c_{js} 1(j = i)1(s = t)$. Without loss of generality, assume $a_i = 0$ and $b_t = 0$ (since $\gamma_i$ and $\delta_t$ are in the model). If Assumption 1 holds, then $E[U_{i,t}|X_{i,t},Z_{i,t}] = E[U_{i,t}|Z_{i,t}] = Z_{i,t}'\lambda$ implies that $c_{it} = 0$, because the terms in the last sum are not a linear combination of the elements of $Z_{i,t}$.
¹⁸ The strong parallel trends assumption states that $E[Y_{i,t}(d) - Y_{i,t-1}(0)|D_{i,t} = d] = E[Y_{i,t}(d) - Y_{i,t-1}(0)]$ for all d. That strong parallel trends implies Assumption 1 under linearity is trivial given equation (7) and Footnote 17. Conversely, here we show the stronger result that Assumption 1 always implies strong parallel trends under model (1), not just in model (7):
$$E[Y_{i,t}(d) - Y_{i,t-1}(0)|D_{i,t} = d'] = E\big[E[Y_{i,t}(d)|D_{i,t} = d', Z_{i,t}] - E[Y_{i,t-1}(0)|D_{i,t} = d', Z_{i,t-1}]\,\big|\,D_{i,t} = d'\big]$$
$$= \beta d + E\big[E[\varepsilon_{i,t}|D_{i,t} = d', Z_{i,t}] - E[\varepsilon_{i,t-1}|D_{i,t} = d', Z_{i,t-1}]\,\big|\,D_{i,t} = d'\big]$$
$$= \beta d + E[Z_{i,t} - Z_{i,t-1}|D_{i,t} = d']'\lambda = \beta d + \delta_t - \delta_{t-1},$$
which does not vary with $d'$.
¹⁹ Plus some rank condition that is specific to the identification method. For example, if we intend to identify $\gamma_0$ with GMM, see Lemma 2.3 in N-MF.

If we intend to estimate $\gamma_0$ with a GMM regression, the test is correctly sized under the rank and regularity conditions in Lemma 2.3, and the regularity conditions in Theorems 3.4 and 4.5 in N-MF.

C.2.2 Probit model

Suppose that $Y = 1(\beta X + Z'\gamma + U > 0)$. In this model, β is identifiable if $U|X,Z \sim N(Z'\pi, \sigma^2)$²⁰ and is usually estimated with Maximum Likelihood. To test the identification conditions, we propose estimating instead the model
$$Y = 1(\beta X + Z'\gamma + \delta 1(X = 0) + U > 0), \qquad U|X,Z \sim N(Z'\pi, \sigma^2),$$
and testing whether $\delta \neq 0$. If Assumption 2 holds and X and Z have finite fourth moments, the test is correctly sized by Theorems 3.3 and 4.4 in N-MF (see example 1.2 on pages 2147 and 2159).

This test has the power to detect nonlinearities in the original model, as well as violations of the normality or homoskedasticity assumptions. Most importantly, it can detect discontinuities in confounders at X = 0. In the extreme case, suppose that linearity, normality and homoskedasticity hold, but that there is endogeneity which causes a discontinuity in U. That is, suppose $U|X,Z \sim N(\tau X + Z'\pi + \kappa 1(X = 0), \sigma^2)$. Then the estimator of the coefficient of the dummy in the extended model is, in fact, an estimator of κ.

C.2.3 Discrete choice model

Consider a standard selection on unobservables discrete choice model where individuals, indexed by i, choose an option j in choice set J in order to maximize their utility. Specifically, individuals solve the optimization problem
$$\max_{j \in J} V_{ij} = V(X_{ij}, Z_{ij}; \beta, \lambda) + \xi_j + \epsilon_{ij}, \qquad (8)$$
for a known function $V(\cdot)$, where $X_{ij}$ (a scalar) and $Z_{ij}$ are often understood as the characteristics of the product that individual i obtains if they choose option j, and β and λ are understood as the corresponding preference parameters. The term $\xi_j$ is often interpreted as the mean utility that individuals obtain from unobservable characteristics of option j.
We observe X, Z and the share of people who choose each alternative option, and we are interested in identifying β.²¹

One can identify β under an exogeneity assumption, $E[\epsilon_{ij}|X_{ij},Z_{ij},\xi_j] = E[\epsilon_{ij}|Z_{ij},\xi_j]$, functional form assumptions on V, and distributional assumptions about the idiosyncratic error, $\epsilon_{ij}$.²² For instance, it is common in applications to assume that V is linear and $\epsilon_{ij}$ is i.i.d. Extreme Value Type I. Estimation is usually done with simulation-assisted methods, as in Berry et al. (1995) or any of the alternative methods developed thereafter in this extensive literature (e.g. Dubé et al. 2012).

If the identification condition does not hold, then without loss of generality, we can write
$$E[\epsilon_{ij}|X_{ij},Z_{ij},\xi_j] = \Gamma(X_{ij},Z_{ij}) + \Delta(Z_{ij})1(X_{ij} = 0),$$
where Γ is continuous in $X_{ij}$ (see Footnote 2). If $X_{ij}$ has bunching at zero, this justifies our choice to test the exogeneity condition by including the dummy $1(X_{ij} = 0)$ in the model as an additive term in equation (8). More specifically we can use the augmented parametric function $\tilde{V}(X_{ij},Z_{ij},1(X_{ij} = 0);\beta,\lambda,\delta) = V(X_{ij},Z_{ij};\beta,\lambda) + \delta 1(X_{ij} = 0)$, then estimate δ jointly with the other parameters with the same estimation approach discussed in the last paragraph. The test is correctly sized under the assumptions of the specific estimation method used. For example, if using the method in Berry et al. (1995), the correct size follows from Theorem 2 and the discussion about standard errors that follows it in Berry et al. (2004), under their assumptions A1-A6 and B1-B5.

The dummy test detects violations of the exogeneity, functional form of V, and distributional assumptions. If the dummy test in this scenario is rejected, it motivates the use of different functional form or distributional assumptions, or the use of control function approaches such as those discussed in Chapter 13 of Train (2009).

If one wants to identify β in a model without the unobservable term $\xi_j$, as in McFadden et al. (1973), then estimation is less computationally burdensome. The dummy test in this setting can be done similarly by including the dummy additively into the utility specification. The correct size of the test follows from the results in Chapter 10 of Train (2009), which depend on the estimation method (e.g. maximum simulated likelihood, method of simulated moments or method of simulated scores). In this case, if the dummy test is rejected, it motivates the inclusion of $\xi_j$, as discussed above.

²⁰ Plus the rank condition that $E[(X,Z')'(X,Z')]$ is invertible. Note also that in the context of Theorem 6.1, $\gamma_0 = (\beta, \gamma' + \pi')'$ in this model.
²¹ With individual-level data on people’s choices, it is also common to allow for heterogeneous preferences, $\beta_i$, depending on individual-level observables. In this case, $\beta_i$ is modelled as a function of the elements of $Z_{ij}$ which do not vary with j, and the function V can then be reparameterized. Thus, for example, if $\beta_i = \alpha + \pi'Z_{1i}$ (for a subvector $Z_{1i}$ of $Z_{ij}$), then we can redefine $\beta = (\alpha, \pi')'$ (see Nevo 2000). Therefore, the heterogeneous preferences setting fits equation (8), and we can test the specification of V and $\beta_i$ jointly.
²² Plus a rank condition requiring that the aggregated market shares be continuously differentiable, and the matrix of the derivatives be invertible (see Assumption 2 in Berry et al. 2004). Note also that, in the context of Theorem 6.1, $\gamma_0 = (\beta, \lambda')'$ in this model.

D Extensions: using multiple dummies

Depending on the context, it is possible to extend the framework discussed above to use multiple dummies. Below, we consider two scenarios which are not mutually exclusive: using interactions with covariates and additional bunching points.

D.1 Allowing for heterogeneous effects of the dummy

It is clear in equation (2) that the discontinuities may vary with Z. For example, if $Z \in \{z_1, \dots, z_L\}$ has finite support, equation (2) can be rewritten as
$$E[\varepsilon|X,Z] = \Gamma(X,Z) + \sum_{l=1}^{L} \Delta(z_l) 1(X = 0, Z = z_l).$$
We can conceive a multivariate version of the dummy test in which, instead of adding only the dummy $1(X = 0)$ to the model, we add all the dummies $1(X = 0, Z = z_l)$, $l = 1, \dots, L$, and perform a joint F-test of whether the coefficients of these dummies are all equal to zero. Thus, in the model above, the coefficients of the dummies estimate the $\Delta(z_l)$. In a situation where $\Delta(z_l)$ differs enough across the $z_l$, this multivariate test will prove to be more powerful than the univariate test, as shown in the Monte Carlo simulations.

In general, when Z has arbitrary support $\mathcal{Z}$, partitioned into subsets $\mathcal{Z}_1, \dots, \mathcal{Z}_L$, we propose testing Assumption 1 by including the dummies $1(X = 0, Z \in \mathcal{Z}_1), \dots, 1(X = 0, Z \in \mathcal{Z}_L)$ in the regression and performing an F-test of whether the coefficients of the dummies are all equal to zero. The fundamental rank condition for this approach is

Assumption 5. $E[(W', 1(X = 0, Z \in \mathcal{Z}_1), \dots, 1(X = 0, Z \in \mathcal{Z}_L))'(W', 1(X = 0, Z \in \mathcal{Z}_1), \dots, 1(X = 0, Z \in \mathcal{Z}_L))]$ is invertible.

This rank condition is testable and indirectly requires that $0 < P(X = 0, Z \in \mathcal{Z}_l) < 1$ for all $l = 1, \dots, L$. As with the simple dummy test, this version is also technically identical to a specification test, so the same assumptions and results applicable in the simple case also apply here.

The heterogeneity of the discontinuities in the covariates can be leveraged not only by using multiple dummies, but also by interacting the dummy of the bunching point with functions of the controls. For example, letting $Z_1$ be a non-binary element in the vector Z, one may add $Z_1 \cdot 1(X = 0)$ and $Z_1^2 \cdot 1(X = 0)$ to the regression. Moreover, both of these approaches (interacting and multiple dummies) may be combined.

Although the use of multiple dummies can increase the power in some cases, it may also lead to a less powerful test if some of the $P(X = 0, Z \in \mathcal{Z}_l)$ are too close to 0 (Assumption 5), or if there is not enough heterogeneity in the correlation between confounders and Y at X = 0 across different values of Z.
Thus, in practice, a targeted division of the support on a few characteristics for which heterogeneity in the discontinuities is high is likely to better balance the gain in power from the heterogeneity with the possible loss in precision by the inclusion of another dummy.

D.2 Multiple bunching points

Figure 1 also shows evidence of some bunching in maternal hours worked at $X = 2{,}080$, which is the total number of yearly hours of someone who works 40 hours per week every week of the year. When there is a second bunching point, the dummy of that point can also be added to the regression, and a joint test that the coefficients of both dummies are equal to zero performed. This test is similar to the test in the previous section, and the technical results are established analogously. The approach can also be immediately extended to cases with more than two bunching points.

The power of the joint test is particularly high when the confounder has sufficiently different correlations with $Y$ at the different bunching points.23 However, there are some instances in which the single dummy test performs better. If the size of the discontinuity in the confounders at the second bunching point is small, or if there is little bunching, then the increase in power from detecting such a confounder may not compensate for the loss of power resulting from the inclusion of the additional dummy. To clarify this point, consider the modification of equation (2) for the case with the two bunching points in our application:

$$E[\varepsilon \mid X, Z] = \Gamma(X, Z) + \Delta_1(Z)\, 1(X = 0) + \Delta_2(Z)\, 1(X = 2{,}080).$$

The coefficient of each dummy depends on the magnitude of the $\Delta$ function which multiplies it. However, the variances of these estimators depend on both $P(X = 0)$ and $P(X = 2{,}080)$. The magnitude of $\Delta_2(Z)$ must be large enough to make it worthwhile to add a second dummy, particularly if $P(X = 2{,}080)$ is small. In our application, $P(X = 2{,}080) \approx 0.03$, which is small but sufficient for testing given our sample. Nevertheless, the empirical evidence is that $\Delta_2(Z)$ is very small. Figure 2 shows no discontinuities at $X = 2{,}080$ in any of the plots. In fact, as shown in the top left panel, there seems to be no discontinuity in the outcome (verbal score) at $X = 2{,}080$ either, which is direct evidence that $\Delta_2(Z) \approx 0$. This is in stark contrast to the discontinuities at $X = 0$. Therefore, we conclude that it is better in our application to test using only the bunching point at $X = 0$.

23 This is analogous to the previous extension, where more power is obtained when the confounder has sufficiently different correlations with $Y$ for different values of $Z$. In this case, the heterogeneity is along different values of $X$ instead of $Z$. See the Monte Carlo simulation results in Section F, specifically Panel (d) in Figure 3.

E Linear CDT details

In Section 4, we introduced a parametric version of Caetano (2015)'s Discontinuity Test, which we named Linear CDT. Here we provide the details.

Define $y_+ = (Y_1 1(X_1 > 0), \dots, Y_n 1(X_n > 0))'$, $y_0 = (Y_1 1(X_1 = 0), \dots, Y_n 1(X_n = 0))'$, and $x_+$, $z_+$ and $z_0$ the matrices with rows equal to $(1(X_i > 0), X_i)$, $1(X_i > 0) Z_i'$ and $1(X_i = 0) Z_i'$, respectively. Supposing the appropriate rank conditions hold, then

$$\hat{\theta}_{LCDT} = e_1'(x_+'x_+)^{-1}x_+'\left[z_+(z_0'z_0)^{-1}z_0'y_0 - y_+\right], \qquad (9)$$

where $e_1 = (1, 0)'$.

Let $\epsilon_+ = (\epsilon_1 1(X_1 > 0), \dots, \epsilon_n 1(X_n > 0))'$ and $\epsilon_0 = (\epsilon_1 1(X_1 = 0), \dots, \epsilon_n 1(X_n = 0))'$. If Assumption 1 holds, then

$$\hat{\theta}_{LCDT} = e_1'(x_+'x_+)^{-1}x_+'\left[z_+(z_0'z_0)^{-1}z_0'\epsilon_0 - \epsilon_+\right].$$

Let $\Sigma_+ = E[\epsilon_+\epsilon_+' \mid x, z]$, $\Sigma_0 = E[\epsilon_0\epsilon_0' \mid x, z]$, and $\Omega_0 = (z_0'z_0)^{-1}z_0'\Sigma_0 z_0(z_0'z_0)^{-1}$. Then,

$$Var(\hat{\theta}_{LCDT} \mid x, z) = e_1'(x_+'x_+)^{-1}x_+'\left(\Sigma_+ + z_+\Omega_0 z_+'\right)x_+(x_+'x_+)^{-1}e_1.$$

The first term ($e_1'(x_+'x_+)^{-1}x_+'\Sigma_+ x_+(x_+'x_+)^{-1}e_1$) is the Eicker-White variance of the constant in a regression of $z_+\lambda - y_+$ onto $x_+$ if $\lambda$ were known. The second term ($e_1'(x_+'x_+)^{-1}x_+'z_+\Omega_0 z_+'x_+(x_+'x_+)^{-1}e_1$) is the penalty due to the fact that $\lambda$ is in fact estimated, where $\Omega_0$ is the Eicker-White covariance matrix of the coefficients estimated in the first-step regression.

Let $SE(\hat{\theta}_{LCDT})$ be the square-root of the estimator of $Var(\hat{\theta}_{LCDT} \mid x, z)/n$. The test statistic is therefore $\hat{\theta}_{LCDT}/SE(\hat{\theta}_{LCDT})$. Under standard assumptions that allow the application of a Central Limit Theorem, such as the existence of moments and independence or stationarity of the data, this statistic can be compared to the standard normal distribution.

F Monte Carlo

We perform a set of Monte Carlo simulations to contrast the finite-sample properties of the dummy test, the multivariate dummy test discussed in Section D.1, and Caetano (2015)'s discontinuity test (CDT).

F.1 Set up

For each of the 5,000 iterations of the Monte Carlo, we draw $(Z, \eta, \epsilon)$ randomly $N$ times, where

$$(Z, \eta) \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 5 & 0.5 \\ 0.5 & 1 \end{pmatrix}\right),$$

and $\epsilon \sim N(0, 1)$. Next, we define $X$ and $Y$ as follows:

$$X^* = 1 + .5Z + \eta$$
$$X = \max\{X^*, 0\}$$
$$Y = 2 + X + \phi X^2 + 2Z + (\mu + \rho 1(Z \le 0) - \rho 1(Z > 0))\eta + \epsilon \qquad (10)$$

This specification yields bunching rates of around 25% for each iteration, which is approximately the bunching rate of maternal labor supply in the empirical application (Figure 1). In all cases, we compare the performance of three tests:

1. Univariate dummy test: we run a linear regression of $Y$ on $X$, $Z$ and $1(X = 0)$ and test whether the coefficient of $1(X = 0)$ is equal to zero. This is the test discussed in Sections 2-3.

2. Multivariate dummy test: we run a linear regression of $Y$ on $X$, $Z$, $1(X = 0, Z < 0)$ and $1(X = 0, Z \ge 0)$ and jointly test whether the coefficients of $1(X = 0, Z < 0)$ and $1(X = 0, Z \ge 0)$ are equal to zero. This is the test discussed in Appendix D.1.

3. CDT: we perform Caetano (2015)'s discontinuity test assuming $Z$ enters the equation linearly but allowing $X$ to enter the equation nonparametrically.24

F.2 Main results

We consider three sets of Monte Carlo experiments corresponding to different values of the parameters of equation (10), and show the results in Figure 3. For each of the three tests discussed above, we calculate the proportion of the 5,000 iterations for which we reject the null hypothesis at the 5% level of significance.
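The set up above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names (`simulate`, `wald_dummies_zero`, `rejection_rates`) are hypothetical, only the two dummy tests are covered (not the CDT), and the joint F-test is replaced by its asymptotic analogue, a heteroskedasticity-robust (HC1) Wald statistic compared with chi-square critical values. The number of replications is also reduced from 5,000 to keep the run short.

```python
import numpy as np

# 5% critical values of the chi-square distribution with q degrees of freedom.
CHI2_CRIT_5PCT = {1: 3.841, 2: 5.991}


def simulate(n, mu=0.0, rho=0.0, phi=0.0, rng=None):
    """Draw one sample of size n from the design in equation (10)."""
    rng = np.random.default_rng() if rng is None else rng
    cov = np.array([[5.0, 0.5], [0.5, 1.0]])
    z, eta = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    eps = rng.standard_normal(n)
    x = np.maximum(1.0 + 0.5 * z + eta, 0.0)          # bunching at X = 0
    slope = mu + rho * np.where(z <= 0, 1.0, -1.0)    # mu + rho*1(Z<=0) - rho*1(Z>0)
    y = 2.0 + x + phi * x**2 + 2.0 * z + slope * eta + eps
    return y, x, z


def wald_dummies_zero(y, regressors, dummies):
    """Robust Wald statistic for H0: all dummy coefficients are zero.

    Assumption of this sketch: HC1 covariance and a chi-square reference
    distribution stand in for the F-test described in the text.
    """
    n = y.shape[0]
    design = np.column_stack([np.ones(n)] + regressors + dummies)
    k, q = design.shape[1], len(dummies)
    xtx_inv = np.linalg.inv(design.T @ design)
    beta = xtx_inv @ design.T @ y
    resid = y - design @ beta
    meat = (design * resid[:, None] ** 2).T @ design
    vcov = n / (n - k) * xtx_inv @ meat @ xtx_inv
    b, v = beta[-q:], vcov[-q:, -q:]                  # dummies are the last q columns
    return float(b @ np.linalg.solve(v, b))


def rejection_rates(n=1000, reps=200, seed=0, **params):
    """Share of replications rejecting at 5%: (univariate, multivariate)."""
    rng = np.random.default_rng(seed)
    rej_uni = rej_multi = 0
    for _ in range(reps):
        y, x, z = simulate(n, rng=rng, **params)
        bunched = x == 0
        d = bunched.astype(float)
        d_neg = (bunched & (z < 0)).astype(float)
        d_pos = (bunched & (z >= 0)).astype(float)
        rej_uni += wald_dummies_zero(y, [x, z], [d]) > CHI2_CRIT_5PCT[1]
        rej_multi += wald_dummies_zero(y, [x, z], [d_neg, d_pos]) > CHI2_CRIT_5PCT[2]
    return rej_uni / reps, rej_multi / reps


for mu in (0.0, 0.6):
    print("mu =", mu, "rejection rates (univariate, multivariate):",
          rejection_rates(mu=mu))
```

Passing `phi` or `rho` instead of `mu` to `rejection_rates` gives the analogous exercise for functional form misspecification or heterogeneous endogeneity in equation (10).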
The first set of Monte Carlo simulations studies the size and power of the tests under no misspecification of the functional form of the effect of $X$ ($\phi = 0$) and under no heterogeneity in the effect of $\eta$ on $Y$ ($\rho = 0$). Thus, the OLS estimate of the coefficient of $X$ could be biased only because the baseline endogeneity parameter $\mu$ is different from zero in equation (10). The Monte Carlo results are shown in Panels (a) ($N = 1{,}000$) and (b) ($N = 10{,}000$) of Figure 3. The point at the far left of the plots in both panels ($\mu = 0$) shows the size of the tests (i.e., the rejection rate under the null of no endogeneity or misspecification). As expected, for all tests, this rate is close to 5% (dotted line in each plot), thus showing that all three tests are correctly sized. As $\mu$ increases, the rejection rates increase for each test, but more so for the dummy tests. With a sample of $N = 1{,}000$, both dummy tests reject the null 100% of the time at $\mu = 0.6$. In contrast, at that level of endogeneity, CDT rejects the null only 40% of the time.

24 We report results for the triangular kernel with bandwidths $h = 0.4$ for $N = 1{,}000$, and $h = 0.2$ for $N = 10{,}000$. On average, across all iterations, there are about 30 (150) observations in the bandwidth for $N = 1{,}000$ ($N = 10{,}000$). The triangular kernel is known to be the best kernel for boundary estimation, and the reported bandwidths are the largest that still yield a correctly sized test. Nevertheless, the relative performance of CDT and the dummy tests does not change using different kernels or bandwidths of any size. See, for example, Figure 4 in Appendix F.3 for the uniform kernel, infinite bandwidth case (which is equivalent to the Linear CDT).

The second set of Monte Carlo simulations assumes $\mu = \rho = 0$ (no endogeneity), and varies the influence of the quadratic term $X^2$, $\phi$, from negative (concave) to positive (convex) in equation (10). The results of this exercise can be seen in Panel (c) for $N = 1{,}000$. The rejection rates of CDT remain around 5%, as expected, since this test is designed to not detect misspecification. In contrast, the rejection rates of both dummy tests increase steeply as $\phi$ moves away from zero in either direction. Note that $\rho = 0$ in Panels (a)-(c), and as expected the multivariate dummy test performs a little worse than the univariate dummy test in these cases.

The third set of Monte Carlo simulations assumes $\phi = \mu = 0$ and varies $\rho$ in equation (10), thus allowing the influence of $\eta$ on $Y$ to be heterogeneous in $Z$. The results of this exercise can be seen in Panel (d) for $N = 1{,}000$, and, as expected, the multivariate dummy test performs better than the univariate dummy test.
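To see why the dummy tests gain power as $\mu$ grows in the first experiment, it may help to work out the discontinuity at $X = 0$ implied by equation (10) when $\phi = \rho = 0$. The derivation below is a sketch that follows from the stated design and is not taken from the paper; the symbols $s$, $c(Z)$, $a(Z)$, $f$ and $F$ (standard normal density and distribution function) are notation introduced here.

```latex
% Notation introduced for this sketch (not in the paper): s, c(Z), a(Z), f, F.
\begin{align*}
\eta \mid Z &\sim N(0.1\,Z,\; s^2), \qquad s^2 = 1 - 0.5^2/5 = 0.95,\\
X > 0:\quad \eta &= X - 1 - 0.5\,Z
  \;\Rightarrow\; E[Y \mid X, Z] = (2-\mu) + (1+\mu)X + (2 - 0.5\mu)Z,\\
X = 0:\quad E[\eta \mid X = 0, Z] &= E[\eta \mid \eta \le c(Z), Z]
  = 0.1\,Z - s\,\frac{f(a(Z))}{F(a(Z))},\\
c(Z) &= -1 - 0.5\,Z, \qquad a(Z) = \frac{-1 - 0.6\,Z}{s},\\
E[Y \mid X = 0, Z] - \lim_{x \downarrow 0} E[Y \mid X = x, Z]
  &= \mu\,\bigl(E[\eta \mid X = 0, Z] - c(Z)\bigr)
  = -\mu\, s\,\Bigl(a(Z) + \frac{f(a(Z))}{F(a(Z))}\Bigr).
\end{align*}
```

Because $a + f(a)/F(a) > 0$ for every real $a$ (the mean of a standard normal truncated above at $a$ lies below $a$), the jump is nonzero whenever $\mu \neq 0$ and scales linearly with $\mu$. For $X > 0$ the conditional mean stays linear in $(X, Z)$, so in this design the dummy coefficient is the only place where the endogeneity shows up.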

Figure 3: Monte Carlo Results

[Figure: four panels plotting the rejection rate (vertical axis, 0 to 1) of the Univariate Dummy Test, the Multivariate Dummy Test, and CDT. (a) No misspecification, no heterogeneity (N = 1,000); horizontal axis: Coefficient of Confounder (μ), 0 to 1. (b) No misspecification, no heterogeneity (N = 10,000); horizontal axis: Coefficient of Confounder (μ), 0 to 1. (c) Misspecification, no endogeneity (N = 1,000); horizontal axis: Coefficient of Non-linear Term of X (φ), -0.1 to 0.1. (d) No misspecification, heterogeneity (N = 1,000); horizontal axis: Coefficient of Heterogeneity of Confounder (ρ), 0 to 1.]

Note: Each panel compares the rejection rates of the null hypothesis that Assumption 1 holds for Caetano (2015)'s discontinuity test (CDT) as well as the univariate (Sections 2-3) and multivariate (Section D.1) dummy tests. These panels make different assumptions about the parameters of equation (10). Panels (a) and (b) assume no misspecification or heterogeneity in the endogeneity ($\phi = \rho = 0$), with $N = 1{,}000$ in Panel (a) and $N = 10{,}000$ in Panel (b). Panel (c) assumes only misspecification and no endogeneity ($\mu = \rho = 0$), with $N = 1{,}000$. Panel (d) assumes no misspecification and no baseline endogeneity ($\phi = \mu = 0$), but varies the degree of heterogeneity in the endogeneity ($\rho$), with $N = 1{,}000$. For each parameter value, we perform 5,000 Monte Carlo iterations.

F.3 Comparison with the Linear CDT

For completeness, we also compare the Linear CDT, discussed in Section 4 and Appendix E, with the dummy tests and CDT. The results can be seen in Figure 4, which considers the case with no misspecification or heterogeneity in the endogeneity ($\phi = \rho = 0$ in equation (10)). As expected, the Linear CDT is more powerful at detecting endogeneity than the CDT, because of its parametric rate of convergence, yet the Linear CDT is still less powerful than the dummy tests.
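For readers who want to experiment with the Linear CDT itself, equation (9) in Appendix E amounts to two least-squares steps, which the sketch below implements with NumPy. Several choices here are assumptions of this illustration rather than statements from the paper: the function name `linear_cdt` is hypothetical, a constant is appended to $Z$, the variance is a plug-in of the conditional variance formula in Appendix E with squared residuals in place of the error covariances and no further rescaling, and the data come from equation (10) with $\phi = \rho = 0$.

```python
import numpy as np


def linear_cdt(y, x, z):
    """Two-step Linear CDT: returns (theta_hat, standard error, t-statistic).

    Step 1 regresses Y on Z using only observations with X = 0.
    Step 2 regresses (Z @ lambda_hat - Y) on (1, X) using observations
    with X > 0; theta_hat is the constant of that regression.
    """
    pos, zero = x > 0, x == 0
    z_tilde = np.column_stack([np.ones_like(x), z])   # Z with a constant (assumption)
    z0, y0 = z_tilde[zero], y[zero]
    zp, yp = z_tilde[pos], y[pos]
    xp = np.column_stack([np.ones(pos.sum()), x[pos]])

    # Step 1: lambda_hat = (z0'z0)^{-1} z0'y0 and its residuals.
    z0tz0_inv = np.linalg.inv(z0.T @ z0)
    lam = z0tz0_inv @ z0.T @ y0
    res0 = y0 - z0 @ lam

    # Step 2: projection matrix a = (x+'x+)^{-1} x+' applied to z+ lambda_hat - y+.
    a = np.linalg.inv(xp.T @ xp) @ xp.T
    target = zp @ lam - yp
    coef = a @ target
    resp = target - xp @ coef
    theta = coef[0]

    # Plug-in variance: Eicker-White term plus the first-step penalty term.
    omega0 = z0tz0_inv @ ((z0 * res0[:, None] ** 2).T @ z0) @ z0tz0_inv
    term1 = (a * resp**2) @ a.T
    azp = a @ zp
    term2 = azp @ omega0 @ azp.T
    se = np.sqrt(term1[0, 0] + term2[0, 0])
    return theta, se, theta / se


def draw(n, mu, rng):
    """Equation (10) with phi = rho = 0."""
    z, eta = rng.multivariate_normal([0.0, 0.0], [[5.0, 0.5], [0.5, 1.0]], size=n).T
    x = np.maximum(1.0 + 0.5 * z + eta, 0.0)
    y = 2.0 + x + 2.0 * z + mu * eta + rng.standard_normal(n)
    return y, x, z


rng = np.random.default_rng(0)
for mu in (0.0, 0.6):
    theta, se, t_stat = linear_cdt(*draw(10_000, mu, rng))
    print(f"mu={mu}: theta_hat={theta:.3f}, se={se:.3f}, t={t_stat:.2f}")
```

Comparing the absolute t-statistic with 1.96 gives a two-sided test at the 5% level under the normal approximation described in Appendix E.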

Figure 4: Monte Carlo Results

[Figure: two panels plotting the rejection rate (vertical axis, 0 to 1) against the Coefficient of Confounder (μ), 0 to 1, for the Univariate Dummy Test, the Multivariate Dummy Test, CDT, and Linear CDT. (a) No misspecification, no heterogeneity (N = 1,000). (b) No misspecification, no heterogeneity (N = 10,000).]

Note: Each panel compares the rejection rates of the null hypothesis (that Assumption 1 holds) for Caetano (2015)'s discontinuity test (CDT), the Linear CDT (Section 4 and Appendix E) and the univariate (Sections 2-3) and multivariate (Section D.1) dummy tests. Both panels assume no misspecification or heterogeneity in the endogeneity ($\phi = \rho = 0$ in equation (10)), with $N = 1{,}000$ in Panel (a) and $N = 10{,}000$ in Panel (b).

Cite this document

APA

Carolina Caetano, Gregorio Caetano, Hao Fe, & Eric Nielsen (2021). A Dummy Test of Identification in Models with Bunching (FEDS 2021-068). Board of Governors of the Federal Reserve System, Finance and Economics Discussion Series. https://whenthefedspeaks.com/doc/feds_2021-068

BibTeX

@techreport{wtfs_feds_2021_068,
  author = {Carolina Caetano and Gregorio Caetano and Hao Fe and Eric Nielsen},
  title = {A Dummy Test of Identification in Models with Bunching},
  type = {Finance and Economics Discussion Series},
  number = {2021-068},
  institution = {Board of Governors of the Federal Reserve System},
  year = {2021},
  url = {https://whenthefedspeaks.com/doc/feds_2021-068},
  abstract = {We propose a simple test of the main identification assumption in models where the treatment variable takes multiple values and has bunching. The test consists of adding an indicator of the bunching point to the estimation model and testing whether the coefficient of this indicator is zero. Although similar in spirit to the test in Caetano (2015), the dummy test has important practical advantages: it is more powerful at detecting endogeneity, and it also detects violations of the functional form assumption. The test does not require exclusion restrictions and can be implemented in many approaches popular in empirical research, including linear, two-way fixed effects, and discrete choice models. We apply the test to the estimation of the effect of a mother's working hours on her child's skills in a panel data context (James-Burdumy 2005).},
}