feds · September 17, 2020

Correcting for Endogeneity in Models with Bunching

Abstract

We show that in models with endogeneity, bunching at the lower or upper boundary of the distribution of the treatment variable may be used to build a correction for endogeneity. We derive the asymptotic distribution of the parameters of the corrected model, provide an estimator of the standard errors, and prove the consistency of the bootstrap. An empirical application reveals that time spent watching television, corrected for endogeneity, has roughly no net effect on cognitive skills and a significant negative net effect on non-cognitive skills in children.

Finance and Economics Discussion Series
Divisions of Research & Statistics and Monetary Affairs
Federal Reserve Board, Washington, D.C.

Correcting for Endogeneity in Models with Bunching

Carolina Caetano, Gregorio Caetano, and Eric Nielsen

2020-080

Please cite this paper as: Caetano, Carolina, Gregorio Caetano, and Eric Nielsen (2020). “Correcting for Endogeneity in Models with Bunching,” Finance and Economics Discussion Series 2020-080. Washington: Board of Governors of the Federal Reserve System, https://doi.org/10.17016/FEDS.2020.080.

NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. References in publications to the Finance and Economics Discussion Series (other than acknowledgement) should be cleared with the author(s) to protect the tentative character of these papers.

Correcting for Endogeneity in Models with Bunching∗

Carolina Caetano (University of Georgia), Gregorio Caetano (University of Georgia), Eric Nielsen (Federal Reserve Board)

August 2020

Codes: C2, C21, C24

1 Introduction

When the treatment variable is constrained to be above (or below) a certain threshold, we often encounter bunching of observations at the threshold itself. In this paper, we show that this situation can be leveraged to build a correction for endogeneity. In some models, ranging from linear to some types of nonseparable, nonparametric structures, the correction consists of estimating a nuisance parameter and adding it to the model. In the linear case, for example, this translates to adding a generated control to the original regression. In the linear model and in some of the generalizations, the entire approach can be implemented with off-the-shelf, packaged software.

This type of bunching is often observed in variables constrained to be non-negative, as in the case of demand or inputs to production. Examples include behavioral variables like the consumption of vitamin supplements, cigarettes, alcohol, and coffee;¹ financial variables such as credit card debt, credit access, expenditure on ads, and bequests;² variables quantifying different uses of time such as exercising, working, doing homework, volunteering, and using social media;³ and even count data such as the number of children, the frequency of doctor visits, and the crime rate.⁴

∗ We thank seminar participants at various conferences and institutions. The analysis and conclusions set forth here are those of the authors and do not indicate concurrence by other members of the research staff, the Board of Governors, or the Federal Reserve System.
¹ Meta-analyses of studies estimating the effects of these variables on health outcomes include Shinton and Beevers (1989); Fawzi et al. (1993); Corrao et al. (2000); Hernán et al. (2002); Reynolds et al. (2003); Noordzij et al. (2005); Bischoff-Ferrari et al. (2005); Oken et al. (2008); Richardson et al. (2013).
² See, e.g., Joulfaian and Wilhelm (1994); Peek et al. (2003); Ekici and Dunn (2010); Bertrand et al. (2010); Brown et al. (2010); Melzer (2011); Kim and Ruhm (2012); Carman (2013); Boserup et al. (2016); Erixson (2017); Elinder et al. (2018).

More generally, one can apply the proposed method whenever Caetano (2015)’s test of exogeneity can be applied at the boundary. It is a natural solution whenever exogeneity is rejected by that test, or whenever the researcher wants to obtain point estimates that are robust under a selection-on-unobservables assumption. This test has been applied in a variety of settings in economics, political science, and finance.⁵

The approach of adding a generated covariate to account for endogeneity is well known, starting with the ubiquitous use of control functions, which require instrumental variables. Heckman (1979)’s correction is just as well known, and was developed for models with selection in the dependent variable. In contrast, our correction approach does not require instrumental variables, and there may not be a missing data problem.

This paper is related to the growing literature on the use of bunching points to infer causal parameters. Saez (2010) first leveraged bunching points in the context of the estimation of labor supply elasticities. Other seminal references in that field are Chetty et al. (2011) and Kleven and Waseem (2013). Kleven (2016) surveys this large literature. Some recent theoretical advancements include Blomquist et al. (2019) and Bertanha et al. (2020).

The asymptotic theory of our estimator uses the results in Chen et al. (2003) for extremum estimators with possibly infinite dimensional nuisance parameters. Several forms of stochastic equicontinuity conditions are required to establish that the objective functionals (squares of residuals using the estimated regressor) belong to Donsker classes, and for this we also use results in Pakes and Pollard (1989) and Andrews (1994).

The paper is laid out as follows. Section 2 proposes the correction strategy in the linear model and then discusses generalizations into nonlinear, semiparametric, and some types of nonparametric models.
Section 3 establishes the asymptotic normality of the coefficients of the corrected model in the linear case, provides an estimator of the asymptotic variance, and proves the consistency of the bootstrap. Section 4 details several strategies for the identification and estimation of the correction term. Section 5 uses the correction approach to estimate the effect of time spent watching television (TV) on children’s cognitive and noncognitive skills. The application also showcases how the main identifying assumptions may be tested or argued. Another application of our correction method can be seen in Caetano et al. (2020). Section 6 concludes, and the proofs are gathered in the Appendix. An online appendix discusses a real-data Monte Carlo study and additional implementation strategies.⁶

³ See, e.g., Luoh and Herzog (2002); Baum II (2003); James-Burdumy (2005); Ruhm (2004, 2008); Eren and Henderson (2011); Chatterji et al. (2013); Ermisch and Francesconi (2013); Bhutani et al. (2013); Holt et al. (2013); Bettinger et al. (2014); Boulianne (2015).
⁴ See, e.g., McDuffie et al. (1996); Black et al. (2005); Cohen (2008); Black et al. (2010).
⁵ See, e.g., Rozenas et al. (2017), Erhardt (2017), Pang (2017), Bleemer (2018a), Bleemer (2018b), Ferreira et al. (2018), Lavetti and Schmutte (2018), Caetano and Maheshri (2018), De Vito et al. (2019), Caetano et al. (2019) and Caetano et al. (2020).
⁶ The online appendix is available at https://bit.ly/3gCigdZ.

2 Correction Approach

The main idea behind our correction approach can be understood intuitively in the figure below. We are interested in estimating the effect of the treatment X on the outcome Y. Let X be constrained to be non-negative, and let the unobservable variable η be correlated with X conditional on controls Z. For concreteness, suppose that X represents a choice. As depicted in the left panel of Figure 1, observations with similar, positive values of X tend to have similar η. However, those at the threshold point X = 0 are particularly selected. Because of the constraint on X, many chose the corner solution because they wanted to choose a negative amount but could not (dashed part of the left panel). Therefore, the expected η among those who chose X = 0 may be quite different from the expected η among those who chose a small but positive X.

[Figure 1: Discontinuity in the Outcome Reflects Variation in Unobservables at the Boundary. Panel (a) Unobservables: E[η|X,Z] plotted against X, with E[η|X=0,Z] marked at X = 0. Panel (b) Outcome: E[Y|X,Z] plotted against X, with E[Y|X=0,Z] marked at X = 0.]

If η influences Y, then we have a problem of endogeneity. This means that, even holding Z constant, any variation in the outcome will reflect both changes in the treatment X and changes in the confounder η at the same time, as in the right panel of Figure 1. However, when we compare outcomes at X = 0 and X immediately above zero, the difference in X is negligible but the expected difference in η is large. Therefore, exactly at this location, the discontinuity in the outcome Y at X = 0 reflects changes in η without contamination from changes in X.

Caetano (2015)’s test looks at the discontinuity $E[Y \mid X = 0, Z] - \lim_{x \downarrow 0} E[Y \mid X = x, Z]$ to determine whether η is included in the outcome equation, and thus whether there is endogeneity. If we impose some structure on the model (e.g. linearity, partial linearity or some types of nonseparable and nonparametric structures), this same discontinuity can be leveraged further to reveal information about the treatment effects. Section 2.1 formalizes the approach in the context of a linear model. Section 2.2 then discusses possible generalizations.

Technical note: in this paper, all equations and results involving random variables should be read as holding almost surely. P denotes the probability, and details about implied probability spaces and conditional sigma-algebras should be self-evident and are thus omitted. The expectation E is assumed to exist wherever written. The support of the distribution of any random variable R is denoted supp(R). For brevity, we sometimes say “the support of R” to mean the support of the distribution of R. To keep identification arguments simple we will omit rank conditions when they are obvious.

2.1 Linear Model

Suppose that the structural equation is

$$Y = \beta X + Z'\gamma + \delta\eta + \varepsilon, \quad \text{where } E[\varepsilon \mid X, Z, \eta] = 0. \tag{1}$$

X is a scalar variable and we are interested in its effect on Y, β. The vector of controls, Z, may include a constant term. We observe Y, X and Z, and we do not observe η and ε. Suppose that there exists a latent, unobservable variable X∗ which depends on both Z and η:

$$X^* = Z'\pi + \eta. \tag{2}$$

The actual treatment, X, is equal to X∗ subject to the (binding) constraint

$$X = \max\{0, X^*\}, \quad \text{with } P(X^* < 0) > 0. \tag{3}$$

Note that this is not a censored model in the typical sense. The outcome Y is a function of the actual treatment X, which is observed, not the latent variable X∗.

As suggested in the discussion of Figure 1, an intuitive framework in which this model can be understood is that of choice and utility maximization. Suppose for example that the utility is written as a function of X, Z and η, and that X is the choice which maximizes utility conditional on the constraint (3). X∗ is the desired choice without this constraint.
The model’s key requirement is that some observations are at a “corner solution”: their desired choice in the unconstrained optimization would have been different from their actual choice in the constrained optimization. Note that this framework may help justify the model, but it is not necessary. The specific motivation for the three equations above is irrelevant for the validity of the approach.

Given equations (1), (2) and (3),

$$E[Y \mid X, Z] = (\beta + \delta) X + Z'(\gamma - \pi\delta) + \delta E[X^* \mid X^* \le 0, Z]\,\mathbf{1}(X = 0). \tag{4}$$
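The step from equations (1), (2) and (3) to equation (4) can be spelled out as follows. This is only a sketch of the intermediate algebra, written with the model's own symbols; it uses the fact that X = 0 exactly when X∗ ≤ 0, and it passes through the expression for E[η|X,Z] that the paper states in Remark 2.1.

```latex
\begin{align*}
X > 0:&\quad X = X^* = Z'\pi + \eta \;\Rightarrow\; \eta = X - Z'\pi, \\
X = 0:&\quad E[\eta \mid X = 0, Z] = E[X^* - Z'\pi \mid X^* \le 0, Z]
        = E[X^* \mid X^* \le 0, Z] - Z'\pi, \\
\text{hence}&\quad E[\eta \mid X, Z] = X - Z'\pi + E[X^* \mid X^* \le 0, Z]\,\mathbf{1}(X = 0), \\
E[Y \mid X, Z] &= \beta X + Z'\gamma + \delta E[\eta \mid X, Z] \\
  &= (\beta + \delta) X + Z'(\gamma - \pi\delta)
     + \delta E[X^* \mid X^* \le 0, Z]\,\mathbf{1}(X = 0).
\end{align*}
```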

If δ = 0, X is exogenous, and thus β is identifiable as in the standard linear model. If δ ≠ 0, X is endogenous. Then, whenever E[X∗|X∗ ≤ 0,Z] < 0 (which happens with positive probability because of (3)), the outcome will be discontinuous at X = 0. Let us rewrite equation (4) as

$$E[Y \mid X, Z] = \beta X + Z'(\gamma - \pi\delta) + \delta\big(X + E[X^* \mid X^* \le 0, Z]\,\mathbf{1}(X = 0)\big). \tag{5}$$

If we can identify E[X∗|X∗ ≤ 0,Z = z] for all z ∈ supp(Z|X = 0), then we can eliminate the endogeneity bias and identify β by adding the term X + E[X∗|X∗ ≤ 0,Z]1(X = 0) to the regression.

Correcting for endogeneity thus depends on the identification of E[X∗|X∗ ≤ 0,Z]. In essence, our approach transforms the problem of endogeneity into a problem of out-of-sample prediction. Because X∗ is observed whenever X∗ > 0, we can use the observed empirical distribution of X∗|Z for X∗ > 0 to predict this expectation. Although the out-of-sample nature of this prediction creates a great deal of difficulty, the fact that it is a prediction problem, rather than a causal identification problem, opens up a multitude of data-driven strategies that can be tailored to the particular empirical application.

To allow for such flexibility, in Section 3, we provide the asymptotic distribution, variance estimator, and consistency of the bootstrap for any estimator of E[X∗|X∗ ≤ 0,Z] satisfying higher-level conditions. In Section 4, we propose several options for the identification of E[X∗|X∗ ≤ 0,Z]. Finally, in Section 5 we provide guidance about how to test the linearity assumption as well as the assumptions needed to identify the expectation.

Remark 2.1. Can a correction be built without bunching? Consider the simplest alternative approach: specify X = Z′π + η, thereby removing any bunching structure from the model. In this case, E[Y|X,Z] = (β + δ)X + Z′(γ − πδ). It is impossible to separate β and δ in this equation.
Even if we could somehow identify π (for example, by supposing the strong exclusion restriction that E[η|Z] = 0), this would still be insufficient to identify δ, and thus β. It is possible to separate δ if there exists some form of nonlinearity in the relationship between X, Z and η. The simplest nonlinear specification is X = g(Z) + η. In this case, E[Y|X,Z] = (β + δ)X + Z′γ − δg(Z). In order to identify δ in this equation, g must be nonlinear and identifiable. To identify g, we would need to suppose that E[η|Z] = 0, or that there exists an instrument for Z in the first stage equation. In our model, E[η|X,Z] = X − Z′π + E[X∗|X∗ ≤ 0,Z]1(X = 0) is discontinuous, which allows us to identify δ without identifying the parameters governing the relationship between X and Z. Other sharp features such as bunching at interior points or kinks may also allow the construction of similar corrections without the need for requirements of independence of η and Z.
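The correction in equation (5) can be illustrated with a small simulation. The sketch below is not the authors' code: the parameter values, the function names (`simulate`, `truncated_mean`, `ols`), and the assumption that η is standard normal and independent of Z are all hypothetical choices made for illustration. Normality is used only because it gives a closed form for E[X∗|X∗ ≤ 0,Z], which is plugged in as an oracle value instead of being estimated.

```python
import numpy as np
from scipy.stats import norm


def simulate(n=200_000, beta=1.0, delta=2.0, seed=0):
    """Draw (Y, X, Z) from equations (1)-(3) with hypothetical parameter values."""
    rng = np.random.default_rng(seed)
    gamma = np.array([0.5, -0.3])   # hypothetical coefficients on Z in (1)
    pi = np.array([0.2, 1.0])       # hypothetical coefficients on Z in (2)
    Z = np.column_stack([np.ones(n), rng.normal(size=n)])  # constant + one control
    eta = rng.normal(size=n)        # assumption: eta ~ N(0, 1), independent of Z
    eps = rng.normal(size=n)
    x_star = Z @ pi + eta           # latent treatment, equation (2)
    X = np.maximum(0.0, x_star)     # observed treatment, equation (3)
    Y = beta * X + Z @ gamma + delta * eta + eps  # outcome, equation (1)
    return Y, X, Z, pi


def truncated_mean(Z, pi, sigma=1.0):
    """E[X* | X* <= 0, Z] when X* | Z ~ N(Z'pi, sigma^2) (normality is assumed)."""
    mu = Z @ pi
    # Ratio phi(mu/sigma) / Phi(-mu/sigma), computed in logs for numerical stability.
    ratio = np.exp(norm.logpdf(mu / sigma) - norm.logcdf(-mu / sigma))
    return mu - sigma * ratio


def ols(y, regressors):
    """Least-squares coefficients of y on the columns of regressors."""
    coef, *_ = np.linalg.lstsq(regressors, y, rcond=None)
    return coef


Y, X, Z, pi = simulate()
psi = truncated_mean(Z, pi)        # oracle value of the expectation
correction = X + psi * (X == 0)    # generated regressor from equation (5)

naive = ols(Y, np.column_stack([X, Z]))                  # ignores endogeneity
corrected = ols(Y, np.column_stack([X, Z, correction]))  # equation (5)

beta_naive = naive[0]
beta_hat, delta_hat = corrected[0], corrected[-1]
print(f"naive: {beta_naive:.3f}  corrected: {beta_hat:.3f}  delta: {delta_hat:.3f}")
```

In an application the expectation is unknown and has to be estimated from the observations with X > 0, which is the subject of Section 4.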

Remark 2.2. The sign of δ is identified even if E[X∗|X∗ ≤ 0,Z] is not identified. To see this, note that if equations (1), (2) and (3) hold,

$$E\Big[\,E[Y \mid X = 0, Z] - \lim_{x \downarrow 0} E[Y \mid X = x, Z]\,\Big|\,X = 0\Big] = \delta E[X^* \mid X^* \le 0]. \tag{6}$$

Since E[X∗|X∗ ≤ 0] < 0 (by (3)), the sign of δ is opposite to the sign of the expected discontinuity. We show in Section 4 how knowing the sign of δ can be useful for partial identification. Note that this result also provides the basis for a test of exogeneity in the spirit of Caetano (2015), since δE[X∗|X∗ ≤ 0] = 0 if and only if δ = 0. Implementation of this approach is simple, and we discuss it in Appendix A. In the online appendix (Section 2.1), we estimate the sign of δ in our application using this method. In particular, we strongly reject exogeneity.

Remark 2.3. (More than one bunching point) Additional bunching points may be used in conjunction with our approach. This affords both a correction and a test of the underlying assumptions of the correction method in the same regression. To do this, calculate the correction using the bunching point at one end of the support and then apply Caetano (2015)’s exogeneity test on the other bunching point (or apply Caetano and Maheshri (2018)’s method if there are multiple additional bunching points). For example, consider the problem of estimating the effect of maternal labor supply on children’s skills (e.g., James-Burdumy (2005)). Figure 2 shows the empirical c.d.f. of the treatment X = average number of weeks per year in which the mother worked in the three years following her child’s birth. There are clearly two bunching points, one at X = 0, and another at X = 52.

[Figure 2: Evidence of Bunching, Maternal Labor Supply. Vertical axis: Empirical Cumulative Distribution Function; horizontal axis: Average Weeks Working Per Year.]

Note: This figure shows the empirical c.d.f. for the average number of weeks per year in which the mother worked in the three years following her child’s birth.
Source: National Longitudinal Survey of Youth, 1979 cohort, sample of mothers whose children were born from 1979 to 2002.

A simple implementation of this test consists of running a regression of Y on X, Z, $X + \hat{E}[X^* \mid X^* \le 0, Z]\,\mathbf{1}(X = 0)$, and 1(X = 52). Then, the test of the exogeneity assumption after the correction is applied is equivalent to a t-test of whether the coefficient of 1(X = 52) is equal to zero. The asymptotic variance of the estimator of the coefficient of 1(X = 52), the variance estimator, and the consistency of the bootstrap critical values are established in Section 3 (see footnote 8).

2.2 Extensions

Linearity is not a fundamental requirement for this type of correction. We discuss several possible generalizations here, though the list of examples below is by no means exhaustive. Readers who are not interested in these extensions can skip directly to Section 3 without missing any notation or concept important to the understanding of the rest of the paper.

2.2.1 Linear Correlated Random Coefficients

Suppose that equations (1), (2) and (3) hold, but $\beta = \alpha_0 X + Z'\alpha$. This is equivalent to Garen (1984)’s model (see e.g. Chay and Greenstone (2005)), except that we do not exclude Z from the structural equation, and we do not require that Z and η are independent. Then,

$$E[Y \mid X, Z] = \alpha_0 X^2 + X Z'\alpha + Z'(\gamma - \pi\delta) + \delta\big(X + E[X^* \mid X^* \le 0, Z]\,\mathbf{1}(X = 0)\big).$$

If E[X∗|X∗ ≤ 0,Z] is identified, $\alpha_0$ and $\alpha$ are identified, and thus we can identify the treatment effects. Our estimation results also cover this model (see footnote 8 in Section 3).

Other models $\beta = g_1(X, Z; \alpha_1)$, where the function $g_1$ is known up to the finite parameter vector $\alpha_1$, may be identified analogously. If E[Y|X,Z] is linear in parameters, estimation is covered by the results in Section 3. Otherwise, this is a special case of the next model.

2.2.2 Nonlinear Correlated Random Effects

Suppose that the model is

$$Y = g_1(X, Z; \alpha_1) X + g_2(Z; \alpha_2)\eta + \varepsilon, \quad E[\varepsilon \mid X, Z, \eta] = 0,$$
$$X^* = h_1(Z; \kappa_1) + h_2(Z; \kappa_2)\eta,$$

and equation (3), where $g_1$, $g_2$, $h_1$ and $h_2$ are functions which are known up to the finite parameter vectors $\alpha_1$, $\alpha_2$, $\kappa_1$ and $\kappa_2$, and $h_2(Z; \kappa_2) \ne 0$. We are interested in identifying $g_1$. Then,

$$E[Y \mid X, Z] = g_1(X, Z; \alpha_1) X - \frac{g_2(Z; \alpha_2)\, h_1(Z; \kappa_1)}{h_2(Z; \kappa_2)} + \frac{g_2(Z; \alpha_2)}{h_2(Z; \kappa_2)}\big(X + E[X^* \mid X^* \le 0, Z]\,\mathbf{1}(X = 0)\big).$$

This expression can usually be simplified a great deal when the functions are specified in an application. Identification in this model can be established with the method of moments.

All the combinations of parameters necessary for the identification of the treatment effects of X on Y are identified (see an explicit identification argument for a more general model in Section 2.2.4). In fact, depending on the specific functional form of $g_1$, some (or all) of the elements of $\alpha_1$ may be identified. Estimation can be done with nonlinear regression or generalized method of moments.

2.2.3 Partially Linear Model

The previous case requires that all the functions be specified. We now show that it is possible to build a correction in semiparametric models. Suppose that the model is

$$Y = \beta X + g(Z) + \delta\eta + \varepsilon, \quad E[\varepsilon \mid X, Z, \eta] = 0,$$
$$X^* = h(Z) + \eta,$$

and equation (3), where g and h are not known. Then,

$$E[Y \mid X, Z] = (\beta + \delta) X + \big(g(Z) - \delta h(Z)\big) + \delta E[X^* \mid X^* \le 0, Z]\,\mathbf{1}(X = 0).$$

We can follow Robinson (1988)’s strategy for partially linear models:

$$Y - E[Y \mid Z] = \beta\big(X - E[X \mid Z]\big) + \delta\Big(X - E[X \mid Z] + E[X^* \mid X^* \le 0, Z]\big[\mathbf{1}(X = 0) - P(X = 0 \mid Z)\big]\Big) + \varepsilon.$$

If the terms $X - E[X \mid Z]$ and $E[X^* \mid X^* \le 0, Z][\mathbf{1}(X = 0) - P(X = 0 \mid Z)]$ are linearly independent, β and δ are identifiable. There are alternative identification methods for β in the equation above; see Härdle et al. (2000) for a detailed treatment of partially linear models. Estimation follows any of the methods available for partially linear models, substituting an estimate for E[X∗|X∗ ≤ 0,Z].

2.2.4 A Nonparametric Nonseparable Model

The following case gives an example of a nonparametric, nonseparable structure in which a correction can be built. Suppose that

$$Y = g_1(X, Z, \varepsilon) + g_2(Z, \varepsilon)\eta, \quad \varepsilon \perp\!\!\!\perp X, Z, \eta,$$
$$X^* = h(Z) + \eta,$$

and equation (3), where $g_1$, $g_2$ and h are unknown functions. In this model, we are interested in the identification of the expected partial effect of X on Y everywhere, $E\left[\frac{\partial g_1(X, Z, \varepsilon)}{\partial X} \,\middle|\, X, Z\right]$. It may be possible to weaken the structure further if the desired quantity is less ambitious, such as the average treatment effect $E\left[\frac{\partial g_1(X, Z, \varepsilon)}{\partial X}\right]$ or the average structural function $a(x) = E[g_1(x, Z, \varepsilon)]$. Given the model,

$$E[Y \mid X, Z] = E[g_1(X, Z, \varepsilon)] + E[g_2(Z, \varepsilon)] X + E[g_2(Z, \varepsilon)]\, E[X^* \mid X^* \le 0, Z]\,\mathbf{1}(X = 0).$$

Assuming the regularity conditions that guarantee the interchangeability of the order of the derivatives, expectations, and limits, then for X > 0,

$$\frac{\partial}{\partial X} E[Y \mid X, Z] = E\left[\frac{\partial}{\partial X} g_1(X, Z, \varepsilon) \,\middle|\, X, Z\right] + E[g_2(Z, \varepsilon)]. \tag{7}$$

At the same time, note that

$$E[Y \mid X = 0, Z] - \lim_{x \downarrow 0} E[Y \mid X = x, Z] = E[g_2(Z, \varepsilon)]\, E[X^* \mid X^* \le 0, Z]. \tag{8}$$

If E[X∗|X∗ ≤ 0,Z = z] is identifiable and strictly negative (that is, if P(X∗ < 0|Z = z) > 0), we can identify $E[g_2(z, \varepsilon)]$ from equation (8), and thus we can identify the partial derivatives $E\left[\frac{\partial}{\partial X} g_1(X, z, \varepsilon) \,\middle|\, X, Z = z\right]$ from equation (7).

Estimation in this example is much more involved than in the previous cases, requiring the nonparametric estimation of E[Y|X = 0,Z], the limit $\lim_{x \downarrow 0} E[Y \mid X = x, Z]$, as well as the first derivatives of E[Y|X,Z] with respect to X.

2.2.5 Probit Model with Endogeneity

All of the previous cases admit simple correction strategies which are based on the separation of η and X in the structural equation. Building a correction without this separability is much harder, and will usually require the identification of other moments of X∗|Z or perhaps of the entire distribution. We now give an example of how this may be done in a probit model with endogeneity. Let the model be

$$Y = \mathbf{1}(\beta X + Z'\gamma + \delta\eta + \varepsilon \ge 0), \quad \text{with } \varepsilon \mid X, Z, \eta \sim N(0, \sigma),$$

where equations (2) and (3) hold. Let Φ be the c.d.f. of the standard normal distribution. Then for X > 0,

$$P(Y = 1 \mid X, Z) = 1 - \Phi\big(-X(\beta + \delta)/\sigma - Z'(\gamma - \pi\delta)/\sigma\big),$$

which allows us to identify and estimate (β + δ)/σ and (γ − πδ)/σ by maximum likelihood using only observations with X > 0. At the same time, for X = 0,

$$P(Y = 1 \mid X = 0, Z) = 1 - \frac{1}{P(X = 0 \mid Z)} \int_{-\infty}^{0} \Phi\big(-Z'(\gamma - \pi\delta)/\sigma - x\delta/\sigma\big)\, P(X^* \le dx \mid Z).$$

If P(X∗ ≤ x|Z) is identified for all x ≤ 0, the integral above may be calculated.⁷ This allows us to identify (γ − πδ)/σ and δ/σ by maximum likelihood using only observations such that X = 0. Subtracting δ/σ from the previously identified (β + δ)/σ then allows us to identify β/σ.

Remark 2.4. It is not possible to correct for endogeneity in all types of nonseparable models. Our approach allows us a glimpse of how η affects Y separately from the effect of X only at the bunching point. It is only at that location that we can guarantee that the treatment will not vary while the unobservables will. Consider the following nonparametric nonseparable model:

$$Y = g(X, \eta) + \varepsilon, \quad E[\varepsilon \mid X, \eta] = 0,$$

where g is not known. Here, no matter how restrictive the equation that generates X∗ is, the bunching only allows us to learn something about ∂g(0,η)/∂η. This is not sufficient information to allow us to learn anything about g(X,η) for X > 0. Some form of regularity, be it some type of separability between η and X or a semiparametric structure, will be necessary for the identification of the treatment effects.

3 Asymptotic Theory for Estimation of the Corrected Model

We provide estimation results in the linear model.⁸ Estimation follows equation (5) and consists of an OLS regression of Y onto X, Z and the estimated correction term, $X + \hat{E}[X^* \mid X^* \le 0, Z]\,\mathbf{1}(X = 0)$. The coefficient of X is $\hat\beta$, and the coefficient of the correction term is $\hat\delta$. We provide general asymptotic results that can be adapted to different estimators of E[X∗|X∗ ≤ 0,Z], including estimators not discussed in this paper, provided they are uniformly consistent at a minimum $n^{1/4}$ rate. Moreover, there is no impediment to using different identification and estimation methods to obtain $\hat{E}[X^* \mid X^* \le 0, Z]$ for different values of Z.

⁷ This may have to be done numerically. However, if X∗|Z belongs to a very simple distributional family, such as uniform, or if instead of probit we have a more tractable model such as a logit, a closed form expression may be obtained.
⁸ Results for some extensions are proven identically. For the case with more than one bunching point (Remark 2.3 in Section 2.1), just substitute the first coordinate in W in all assumptions and proofs by $(X, \mathbf{1}(X = \bar{x}))'$, where $\bar{x}$ is the second bunching point, and β by $(\beta, \alpha_{\bar{x}})'$, where $\alpha_{\bar{x}}$ is the coefficient of $\mathbf{1}(X = \bar{x})$. For the linear correlated random coefficients model (Section 2.2.1), simply substitute the first coordinate in W in all assumptions and proofs by $(X^2, XZ')'$, and β by $(\alpha_0, \alpha')'$.

Assumption 1. Denote $\psi_0 = E[X^* \mid X^* \le 0, Z = \cdot]$, $\hat\psi = \hat{E}[X^* \mid X^* \le 0, Z = \cdot]$, and dim(Z) equal to the number of elements in Z. Suppose that

(i) The observations $\{(Y_i, X_i, Z_i')'\}_{i=1}^{n}$ are independent.

(ii) Denote $W = (X, Z', X + \psi_0(Z)\mathbf{1}(X = 0))'$, where $W_i$ is an observation of this vector and $W_{ij}$ is the j-th element of $W_i$. There exist constants α > 0 and ∆ < ∞ such that (a) $E[|\varepsilon_i^2|^{1+\alpha}] < \Delta$, $E[|\eta_i^2|^{1+\alpha}] < \Delta$, $E[|W_{ij} W_{ik}|^{1+\alpha}] < \Delta$, $E[|\eta_i^2 W_{ij} W_{ik}|^{1+\alpha}] < \Delta$ and $E[|W_{ij}^2 W_{ik} W_{il}|^{1+\alpha}] < \Delta$ for all i = 1,...,n and j,k,l = 1,...,dim(Z) + 2; (b) $\frac{1}{n}\sum_{i=1}^{n} E[W_i W_i']$, $\frac{1}{n}\sum_{i=1}^{n} E[\varepsilon_i^2 W_i W_i']$, and $\frac{1}{n}\sum_{i=1}^{n} E[\mathrm{Var}(\eta \mid X = 0, Z_i) W_i W_i']$ are nonsingular for n sufficiently large, with determinants bounded away from zero. Moreover, for each pair j,s = 1,...,dim(Z) + 2, either $W_{ij} W_{is}$ is constant for all i, or $\frac{1}{n}\sum_{i=1}^{n} \mathrm{Var}(W_{ij} W_{is}) \ge \alpha$ for n sufficiently large.

(iii) Let Θ be a compact subset of $\mathbb{R}^{\dim(Z)+2}$, and $\theta_0 := (\beta, (\gamma - \pi\delta)', \delta)' \in \mathrm{int}(\Theta)$.

(iv) $\hat\psi, \psi_0 \in \mathcal{H}$, a normed linear space with $\int_0^\infty \sqrt{\log N_{[]}(\varepsilon, \mathcal{H}, \|\cdot\|_{\mathcal{H}})}\, d\varepsilon < \infty$ (this implies that $\mathcal{H}$ is a P-Donsker class).

(v) Let $\mathcal{Z} = \mathrm{supp}(Z)$, then $\sup_{z \in \mathcal{Z}} \big|\hat\psi(z) - \psi_0(z)\big| = o_p(n^{-1/4})$.

(vi) $\sqrt{n}\, E[(\hat\psi(Z_i) - \psi_0(Z_i))\,\mathbf{1}(X_i = 0, X_j = 0)\, W_i W_j'] \to_d N(0, \Omega)$.

Theorem 3.1. If equations (1), (2) and (3), and Assumption 1 hold, then

$$\sqrt{n}(\hat\theta - \theta_0) \to_d N\big(0,\ \Sigma + \delta^2 E[WW']^{-1}\, \Omega\, E[WW']^{-1}\big),$$

where Σ is the asymptotic covariance of a regression of Y onto X, Z, and X + E[X∗|X∗ ≤ 0,Z]1(X = 0) (i.e., it is the asymptotic covariance if the true rather than the estimated expectation had been used in the regression).

The proof of this theorem is in Appendix B.1. It applies Theorem 2 in Chen et al. (2003). Assumption 1(iv) requires that the uniform entropy integral (as defined in Van der Vaart and Wellner (1996), Chapter 2.1) is finite. This is used to establish the stochastic equicontinuity condition 2.5’ in Chen et al. (2003) using Lemma 2.17 in Pakes and Pollard (1989). The same condition is also an indirect requirement of Assumption 2.6 in Chen et al. (2003) when we are allowing for an arbitrary estimator $\hat\psi(z)$. When a specific estimator is defined, Assumption 1(iv) often holds under more standard primitive conditions.
In practice, this condition requires that the expectation and its estimator are well behaved functions. Many of the function classes adopted in economics satisfy this condition. For example, if the expectation assumes a parametric form, or if it is Lipschitz-continuous on the parameter vector, or smooth, or belongs to the Sobolev class, or is of bounded variation, the condition is satisfied.

Assumption 1(vi) is used to verify Assumption 2.6 in Chen et al. (2003). Note that the expectation is taken with respect to $Z_i$ and $X_i$, conditional on the data that generated $\hat\psi$. Given Assumptions 1(iv) and (v), Assumption 1(vi) may be substituted by $\frac{1}{\sqrt{n}} \sum_{i=1}^{n} (\hat\psi(Z_i) - \psi_0(Z_i))\,\mathbf{1}(X_i = 0, X_j = 0)\, W_i W_j' \to_d N(0, \Omega)$ (this follows from Lemma 19.24 in Van der Vaart (1998)). When the expectation estimator converges weakly at the $\sqrt{n}$ rate, Assumption 1(vi) does not need to be verified because it always holds (see Remark 3.1).

Next, we present an estimator of the asymptotic variance. Let $\hat{w}$ be the matrix of regressors, with rows equal to $\hat{W}_i := (X_i, Z_i', X_i + \hat\psi(Z_i)\mathbf{1}(X_i = 0))'$. Let $\hat{D} = \mathrm{Diag}\{(Y_i - \hat{W}_i'\hat\theta)^2\}_{i=1}^{n}$ be the diagonal matrix of the square of the residuals. Finally, let $\hat{V}$ be a matrix with row i equal to $(\hat{C}_{i1}\mathbf{1}(X_i = 0, X_1 = 0), \ldots, \hat{C}_{in}\mathbf{1}(X_i = 0, X_n = 0))'$, where the $\hat{C}_{ij} = \hat{C}(Z_i, Z_j)$ are defined in Assumption 2 below. Then, an estimator of the asymptotic variance of $\hat\theta$ is

$$\hat{V}_\theta = \left(\frac{\hat{w}'\hat{w}}{n}\right)^{-1} \left(\frac{\hat{w}'\hat{D}\hat{w}}{n}\right) \left(\frac{\hat{w}'\hat{w}}{n}\right)^{-1} + \hat\delta^2 \left(\frac{\hat{w}'\hat{w}}{n}\right)^{-1} \left(\frac{\hat{w}'\hat{V}\hat{w}}{n^2}\right) \left(\frac{\hat{w}'\hat{w}}{n}\right)^{-1}.$$

Note that the first term is simply the Eicker-White covariance estimator in a regression of Y onto X, Z and $X + \hat\psi(Z)\mathbf{1}(X = 0)$. The second term is the penalty resulting from the fact that we are using an estimate instead of the true value $\psi_0(Z)$.

Assumption 2. Suppose that

(i) $\Omega = E[C_{ij}\,\mathbf{1}(X_i = 0, X_j = 0)\, W_i W_j']$, for some variables $C_{ij} = C(Z_i, Z_j)$,

(ii) $E[\|C_{ij} W_i W_j'\|]$, $E[\|C_{ij} W_i\|]$ and $E[|C_{ij}|]$ are bounded,

(iii) $\sup_{z, \tilde{z} \in \mathcal{Z}} \big|\hat{C}(z, \tilde{z}) - C(z, \tilde{z})\big| = o_p(1)$.

Theorem 3.2. If equations (1), (2), and (3), and Assumptions 1 and 2 hold, then

$$\hat{V}_\theta \to_p \Sigma + \delta^2 E[WW']^{-1}\, \Omega\, E[WW']^{-1}.$$

The proof of this theorem is in Appendix B.2. It uses a specific Strong Law of Large Numbers for U-Statistics when the data are independent but not identically distributed. Letting $\hat{V}_\beta$ be the first element in the matrix $\hat{V}_\theta$, the standard error of $\hat\beta$ is $\sqrt{\hat{V}_\beta / n}$. The first term in $\hat{V}_\beta$ can be obtained directly from packaged software as the square of the Eicker-White standard errors of $\hat\beta$ on a regression of Y onto X, Z and $X + \hat\psi(Z)\mathbf{1}(X = 0)$.
The second term in $\hat{V}_\beta$ is the first element in the second matrix in $\hat{V}_\theta$ divided by n.

When the expectation estimator converges weakly at the $\sqrt{n}$ rate, Assumption 2(i) does not need to be verified because it always holds (see Remark 3.1). Assumption 2(iii) requires consistent estimators of the $C_{ij}$, which are usually asymptotic covariances of the $\hat\psi(Z_i)$ for two different values of $Z_i$. Note that if the $\hat\psi(Z_i)$ are independent, then $C_{ij} = 0$ for all i ≠ j, and thus the asymptotic variance is not affected by the fact that the expectation is estimated.

The following remarks discuss simplifications of the previous theorems for two important special cases.
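Before turning to those special cases, the variance estimator $\hat{V}_\theta$ can be sketched in a few lines of matrix code. This is an illustrative sketch only: the function name `corrected_variance` and its argument names are hypothetical, and the matrix of $\hat{C}_{ij}$ values is taken as given, since its form depends on the estimator of the expectation chosen by the user (Assumption 2).

```python
import numpy as np


def corrected_variance(W_hat, resid, x_is_zero, C_hat, delta_hat):
    """Sketch of V_theta: Eicker-White term plus the generated-regressor penalty.

    W_hat     : (n, k) regressor matrix with rows (X_i, Z_i', correction_i)
    resid     : (n,) residuals Y_i - W_hat_i' theta_hat
    x_is_zero : (n,) boolean indicator of X_i = 0
    C_hat     : (n, n) matrix of C_hat(Z_i, Z_j), supplied by the user
    delta_hat : estimated coefficient on the correction term
    """
    n = W_hat.shape[0]
    bread = np.linalg.inv(W_hat.T @ W_hat / n)

    # First term: heteroskedasticity-robust sandwich, scaled by n.
    meat = (W_hat * resid[:, None] ** 2).T @ W_hat / n
    first = bread @ meat @ bread

    # Second term: entries C_hat_ij * 1(X_i = 0, X_j = 0), divided by n^2.
    mask = x_is_zero.astype(float)
    V = C_hat * np.outer(mask, mask)
    second = delta_hat ** 2 * bread @ (W_hat.T @ V @ W_hat / n ** 2) @ bread

    return first + second


def standard_error_beta(V_theta, n):
    """Standard error of beta_hat: sqrt of the first diagonal element over n."""
    return np.sqrt(V_theta[0, 0] / n)
```

When `C_hat` is a matrix of zeros, the result reduces to n times the usual Eicker-White covariance matrix, which matches the observation that the variance is unaffected when the estimates $\hat\psi(Z_i)$ are independent.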

Remark 3.1. ($\sqrt{n}$-rate of convergence) Suppose that $\hat{\psi}$ converges at the parametric rate to a Brownian Bridge, so that for all $z \in \mathcal{Z}$, there exists a normal random variable $\chi_z$ such that $\sqrt{n}(\hat{\psi}(z) - \psi_0(z)) \to_d \chi_z$. Then, Assumptions 1(vi) and 2(i) always hold, and $C_{ij} = \mathrm{Cov}(\chi_{Z_i}, \chi_{Z_j})$. The proof of this statement follows from the Functional Delta Method (Theorem 3.9.5 in Van der Vaart and Wellner (1996)) and is presented in Appendix B.3.

Remark 3.2. (Z with finite support) If $\mathrm{supp}(Z) = \{z_1, \dots, z_L\}$, and if $\hat{\psi}(z_l)$ uses only observations such that $Z_i = z_l$, then the assumptions of Theorems 3.1 and 3.2 may be simplified. Specifically, replace Assumptions 1 (iii)-(vi) with

(iii') Define $p_{0,l} = P(X = 0, Z = z_l)$. Then $p_{0,l} > 0$ for at least one $l = 1, \dots, L$.

(iv') $\sqrt{n}(\hat{\psi}(z_l) - \psi_0(z_l)) \to_d N(0, V_{z_l})$ for all $l = 1, \dots, L$ such that $p_{0,l} > 0$.

Define $W_{0,z_l} = (0, z_l', \psi_0(z_l))'$. The asymptotic variance can be written as

$$\Sigma + \delta^2 E[WW']^{-1} \left( \sum_{l=1}^{L} p_{0,l}^2 V_{z_l} W_{0,z_l} W_{0,z_l}' \right) E[WW']^{-1}.$$

The term $\Sigma$ can be estimated as in the general case, and the second term can be estimated by replacing $\delta$ with $\hat{\delta}$, $V_{z_l}$ with an estimator $\hat{V}_{z_l}$, and the other terms with sample equivalents. To establish the consistency of this variance estimator, Assumption 2 may be replaced with $\hat{V}_{z_l} \to_p V_{z_l}$ for all $l$ such that $p_{0,l} > 0$.

Now we establish that the ordinary nonparametric bootstrap can consistently estimate the distribution of $\sqrt{n}(\hat{\theta} - \theta_0)$ for an i.i.d. sample for a wide class of estimators $\hat{\psi}(Z)$.

Assumption 3. Suppose that

(i) The observations $\{(Y_i, X_i, Z_i')'\}_{i=1}^{n}$ are i.i.d.

(ii) $\mathcal{H}$ has a bounded envelope function.

(iii) $n^{1/4} \sup_{z \in \mathcal{Z}} |\hat{\psi}(z) - \psi_0(z)| = o_{a.s.}(1)$.

(iv) $\frac{1}{\sqrt{n}} \sum_{i=1}^{n} (\hat{\psi}(Z_i) - \psi_0(Z_i))^r 1(X_i = 0) R_i = o_{a.s.}(1)$, for $r = 1$ and $R_i = \varepsilon_i$, $\eta_i$, and $W_i$; and for $r = 2$ and $R_i = 1$.⁹

(v) Denote the bootstrap sample quantities with a “b,” let $\hat{\psi}^b = \hat{E}^b[X^*|X^* \le 0, Z = \cdot]$, and let $o_{p^b}(\cdot)$ and $O_{p^b}(\cdot)$ denote the $o_p(\cdot)$ and $O_p(\cdot)$ notation for the $P^b$-probability. Then,

(a) $n^{1/4} \sup_{z \in \mathcal{Z}} |\hat{\psi}^b(z) - \hat{\psi}(z)| = o_{p^b}(1)$.

(b) $\frac{1}{\sqrt{n}} \sum_{i=1}^{n} (\hat{\psi}^b(Z_i) - \hat{\psi}(Z_i))^r 1(X_i = 0) R_i = o_{p^b}(1)$, for $r = 1$ and $R_i = \varepsilon_i$, $\eta_i$, and $W_i$; and for $r = 2$ and $R_i = 1$.⁹

(c) $\sqrt{n} E\left[ [(\hat{\psi}^b(Z) - \hat{\psi}(Z)) - (\hat{\psi}(Z) - \psi_0(Z))] 1(X = 0) R \right] = o_{p^b}(1)$, for $R = Z$ and $R = \psi_0(Z)$.

Theorem 3.3. If equations (1), (2) and (3), and Assumptions 1 and 3 hold, then

$$\sqrt{n}(\hat{\theta}^b - \hat{\theta}) \to_d N\left(0, \Sigma + \delta^2 E[WW']^{-1} \Omega E[WW']^{-1}\right) \text{ in } P^b\text{-probability.}$$

The proof of this theorem is in Appendix B.4, and follows from Theorems 3 and B in Chen et al. (2003). Theorem B makes a requirement of almost sure stochastic equicontinuity (Assumption 2.5’ a.s. in that paper). Its direct translation to our context is expressed in footnote 9, which is a difficult condition to establish.¹⁰ Instead, we bypass the need for almost sure stochastic equicontinuity and prove that, in our context and given the other assumptions of Theorem 3.3, it may be substituted by two weaker conditions, Assumptions 3 (iv) and (vb).

4 Identifying E[X∗|X∗ ≤ 0,Z]

The identification of E[X∗|X∗ ≤ 0,Z] is a prediction exercise. In this sense it is a simpler problem than causal identification. The difficulty is that the prediction is out-of-sample, as we only observe the distribution of X∗ for X∗ > 0 and the bunching at X = 0. Nevertheless, there is a lot of information which, together with assumptions of varying degrees of generality, can be leveraged into partial or point identification of E[X∗|X∗ ≤ 0,Z], and therefore of β. In this section, we focus on providing guidance to practitioners and exposing the reader to many options that are available. We formalize much of the analysis below, but a comprehensive treatment of the material in this section is beyond the scope of this paper.

The following sections discuss three avenues of investigation into the identification of the expectation. In Section 4.1, we examine opportunities for partial identification. In Section 4.2, we discuss identification inside of parametric classes of distributions, where the parameters may be specified parametrically or nonparametrically. Finally, in Section 4.3, we discuss how to discretize Z and how to leverage the resulting discretized controls to improve both the identification and the estimation of the expectation.

⁹ Assumptions 3 (iv) and (vb) may be substituted by the almost sure stochastic equicontinuity condition: for all positive sequences $\tau_n = o(1)$ (it’s sufficient to prove this for decreasing sequences $\delta_n = o(n^{1/2-\alpha})$ for some $\alpha > 0$),

$$\sup_{\|\psi - \psi_0\|_{\mathcal{H}} \le \tau_n} \left| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left[ (\psi(Z_i) - \psi_0(Z_i)) V_i - E[(\psi(Z_i) - \psi_0(Z_i)) V_i] \right] \right| = o_{a.s.}(1).$$

¹⁰ Pötscher and Prucha (1994) and Jenish and Prucha (2009) discuss almost sure stochastic equicontinuity for establishing Uniform Laws of Large Numbers. The primitives explored in those papers are not sufficient to establish almost sure stochastic equicontinuity with a $\sqrt{n}$ denominator as required by Chen et al. (2003). We did not find other references of primitives of almost sure stochastic equicontinuity, and a general treatment of this condition in the lines of Pakes and Pollard (1989) is an open question.

Note that to predict E[X∗|X∗ ≤ 0,Z], we only need data on X and Z. A researcher may wish to predict the expectation using an entirely different dataset. There is no impediment to doing so, as the theorems in Section 3 do not make any restrictions on the data used to estimate the expectation.¹¹

¹¹ In fact, in some applications, other datasets may allow the observation of the entire distribution of X∗, thus enabling the direct identification of E[X∗|X∗ ≤ 0,Z]. For example, minimum schooling laws, minimum wages, and minimum working age change over time and by state, and 401K minimum contributions are mandatory in some jobs and not in others, which may provide an opportunity to identify E[X∗|X∗ ≤ 0,Z] from similar observations that are unconstrained.

For simplicity, we use the notation $F_{R|Z}(r)$ to denote $P(R \le r|Z)$, for $R = X$, $X^*$ and $\eta$. $F^{-1}_{R|Z}(q)$ denotes the quantile $q$ of $F_{R|Z}$. When they exist, $f_{R|Z}$ is the density of $F_{R|Z}$, and $f'_{R|Z}$ is the derivative of $f_{R|Z}$.

4.1 Partial Identification

Consider Figure 3, which shows the distribution of the treatment variable X in our application (hours per week watching TV) for two different values of the controls. We leave details about the data to Section 5. For now, we focus on the shape of the distribution. The main pieces of information we have are the amount of bunching at zero and the height and slope of the density as it reaches zero from the positive side.

Figure 3: Conditional Empirical Distributions of X. [Two panels. Horizontal axis: Hours Per Week Watching TV. Vertical axis: Empirical Distribution.]
Note: Each panel restricts the sample to a different value of the controls. The kernel density plot and histogram of X in the positive side are shown. The darker bar represents the proportion of observations at X = 0. Both the kernel bandwidth and the histogram bin width are equal to 2.

We begin by establishing an upper bound on E[X∗|X∗ ≤ 0,Z] that is fairly agnostic about the shape of the distribution of X∗|Z below zero.

Proposition 4.1. Suppose that the right derivative of $F_{X|Z}(x)$ at zero exists, and denote it $f_{X|Z}(0)_+$. Suppose also that in $(-\infty, 0)$, $f_{X^*|Z}(x)$ exists and $f_{X^*|Z}(x) \le f_{X|Z}(0)_+$. Then,

$$E[X^*|X^* \le 0, Z] \le -\frac{F_{X|Z}(0)}{2 f_{X|Z}(0)_+}. \qquad (9)$$

This proposition states that, if the density for X∗ < 0 is no higher than the density at X∗ = 0, then the uniform density which integrates to $F_{X|Z}(0)$ in the negative side, $f_{X|Z}(0)_+ 1(x \ge -F_{X|Z}(0)/f_{X|Z}(0)_+)$ in $(-\infty, 0]$, yields the highest possible value of the expectation. The proposition allows $f_{X^*|Z}(x)$ to be discontinuous anywhere, including at zero, and to not be monotonic. Its proof is a trivial application of the lemma in Appendix B.5.

Substituting an estimate of the upper bound in equation (9) in the main regression (equation (5)) yields an estimator of an asymptotic upper bound for β if δ > 0 and of an asymptotic lower bound for β if δ < 0. Since the sign of δ can be identified (Remark 2.2), this bound may be used strategically to obtain conservative conclusions.

Figure 3 reveals a clear bell shape on the positive side. Note that in the left panel of Figure 3, where bunching is smaller, there is even some evidence of an inflection point in the left tail of the empirical distribution of X∗|Z. In order to obtain more precise information about β, the temptation is to assume a normal, logistic, or at the very least a symmetric structure. The concern with such an approach is that there may be a steep descent to zero on the side we do not observe, as in the case of a beta or a gamma distribution.

A visual inspection of Figure 3 suggests that the probability of bunching is large enough to rule out a steep descent in both cases. The following proposition offers a condition which allows us to test the concavity of the density on the negative side, which is a common manifestation in distributions with steep descent.

Proposition 4.2. Suppose that $f_{X^*|Z}(x)$ exists in $(-\infty, 0]$, and is differentiable at $x = 0$. Denote by $f'_{X|Z}(0)_+$ the right derivative of $f_{X|Z}(x)$ at zero. Then $f'_{X^*|Z}(0) = f'_{X|Z}(0)_+$ is identifiable. Moreover, if $f'_{X^*|Z}(0) > 0$ and $f_{X^*|Z}(x)$ is concave in $\mathrm{supp}(X^*|X^* < 0, Z)$, then

$$f_{X|Z}(0)_+^2 - 2 f'_{X|Z}(0)_+ \cdot F_{X|Z}(0) \ge 0. \qquad (10)$$

This proposition is a corollary of Proposition 4.3. The rationale behind equation (10) can be understood in the left panel in Figure 4. The dotted red line is $f_{X|Z}(0)_+ + f'_{X|Z}(0)_+ \cdot x$. If $f_{X^*|Z}(x)$ is concave in $(-\infty, 0]$, the area below the dotted line must not be smaller than $F_{X|Z}(0)$. Confirming the visual inspection, our estimation of the quantities above rules out concavity in both panels of Figure 3.
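As an illustration of how these two results might be used, the following Python sketch evaluates the uniform-density bound behind Proposition 4.1 and the necessary condition for concavity in equation (10). The function names and the numerical inputs are hypothetical. In practice, `F0`, `f0_plus` and `fprime0_plus` would be estimates of $F_{X|Z}(0)$, $f_{X|Z}(0)_+$ and $f'_{X|Z}(0)_+$. The bound is computed as the conditional mean of the uniform density of height $f_{X|Z}(0)_+$ on $[-F_{X|Z}(0)/f_{X|Z}(0)_+, 0]$, that is, $-F_{X|Z}(0)/(2 f_{X|Z}(0)_+)$.

```python
def upper_bound_uniform(F0, f0_plus):
    """Upper bound on E[X*|X* <= 0, Z] from the uniform density.

    F0      : share of observations bunched at zero, F_{X|Z}(0)
    f0_plus : right limit of the density of X|Z at zero
    """
    if not 0.0 < F0 < 1.0:
        raise ValueError("F0 must lie strictly between 0 and 1")
    if f0_plus <= 0.0:
        raise ValueError("f0_plus must be positive")
    # Uniform density of height f0_plus on [-F0 / f0_plus, 0] has mean
    # equal to the midpoint of that interval.
    return -F0 / (2.0 * f0_plus)


def concavity_not_ruled_out(F0, f0_plus, fprime0_plus):
    """Check the necessary condition in equation (10).

    Returns False when the bunching mass exceeds the area under the
    tangent line f0_plus + fprime0_plus * x, which rules out concavity.
    Only meaningful when fprime0_plus > 0.
    """
    return f0_plus ** 2 - 2.0 * fprime0_plus * F0 >= 0.0


if __name__ == "__main__":
    # Hypothetical inputs: 20% bunching, density 0.1 and slope 0.05 at zero.
    print(upper_bound_uniform(0.2, 0.1))
    print(concavity_not_ruled_out(0.2, 0.1, 0.05))
```

With the hypothetical inputs above, the area under the tangent line is smaller than the bunching mass, so the check reports that concavity is ruled out.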

Figure 4: Concave Distribution of X∗|Z below zero. [Two panels: (a) Testing Concavity; (b) Bounds Under Concavity. Horizontal axis: X, X∗. Marked quantities: $F_{X|Z}(0)$, $f_{X|Z}(0)_+$, $-f_{X|Z}(0)_+/f'_{X|Z}(0)_+$, $\underline{E}(Z)$ and $\overline{E}(Z)$.]
Note: The left panel shows the distribution of X∗|Z. On the positive side, it is equal to the distribution of X|Z. The right panel shows a zoomed-in version around the negative side of the distribution of X∗|Z. Under concavity, it is possible to derive upper and lower bounds of the expectation.

Proposition 4.3. Suppose that $f_{X^*|Z}(x)$ exists in $(-\infty, 0]$, and is differentiable at $x = 0$, $f_{X^*|Z}(0) > 0$ and $f'_{X|Z}(0)_+$ is defined as in Proposition 4.2. If, moreover, $f_{X^*|Z}(x)$ is concave in $\mathrm{supp}(X^*|X^* < 0, Z)$, then $\underline{E}(Z) \le E[X^*|X^* \le 0, Z] \le \overline{E}(Z)$, where

$$\underline{E}(Z) = -\frac{2a}{3} \quad \text{and} \quad \overline{E}(Z) = \begin{cases} -\dfrac{a}{2}, & \text{if } f'_{X|Z}(0)_+ = 0, \\[2mm] -\dfrac{a}{3}\left( b(3-b) + b^{1/2}(b-2)^{3/2} \right), & \text{if } f'_{X|Z}(0)_+ \neq 0, \end{cases}$$

where

$$a = \frac{F_{X|Z}(0)}{f_{X|Z}(0)_+} \quad \text{and} \quad b = \frac{f_{X|Z}(0)_+^2}{F_{X|Z}(0) f'_{X|Z}(0)_+}.$$

The bounds can be seen in the right panel of Figure 4. Under concavity, the most negative value of the expectation, $\underline{E}(Z)$, is the one obtained under the linear density which integrates to $F_{X|Z}(0)$, illustrated by the dashed blue line.¹² The least negative value of the expectation, $\overline{E}(Z)$, is obtained under the density with derivative $f'_{X|Z}(0)_+$ which integrates to $F_{X|Z}(0)$, illustrated by the red dotted line.¹³ Note that the bounds also hold when $f'_{X|Z}(0)_+ < 0$. The proof of this proposition follows from the lemma in Appendix B.5, since the density’s concavity and continuity at 0 imply the necessary stochastic dominance relationships between those distributions.

¹² This density is equal to $[f_{X|Z}(0)_+ + (f_{X|Z}(0)_+^2 / 2F_{X|Z}(0)) \cdot x] 1(x \ge -2a)$ in $(-\infty, 0]$.

¹³ If $f'_{X|Z}(0)_+ = 0$, this density is equal to $f_{X|Z}(0)_+ 1(x \ge -a)$. Otherwise, this density is $[f_{X|Z}(0)_+ + f'_{X|Z}(0)_+ \cdot x] 1(x \ge -a(b - \sqrt{b(b-2)}))$ in $(-\infty, 0]$.
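The bounds in Proposition 4.3 can be evaluated numerically. The sketch below is one hypothetical way to do so. Rather than coding the closed form in terms of $a$ and $b$, it solves directly for the support of the linear density with slope $f'_{X|Z}(0)_+$ that integrates to $F_{X|Z}(0)$ and then computes its conditional mean, which should agree with the closed form whenever $b \ge 2$. The function name and the inputs are illustrative assumptions; in practice the inputs would be estimates.

```python
import math


def concavity_bounds(F0, f0_plus, fprime0_plus):
    """Return (lower, upper) bounds on E[X*|X* <= 0, Z] under concavity.

    F0           : bunching mass F_{X|Z}(0)
    f0_plus      : right limit of the density at zero
    fprime0_plus : right derivative of the density at zero
    """
    a = F0 / f0_plus
    # Linear density falling from f0_plus at 0 to zero at -2a.
    lower = -2.0 * a / 3.0
    if fprime0_plus == 0.0:
        # Flat density of height f0_plus on [-a, 0].
        return lower, -a / 2.0
    disc = f0_plus ** 2 - 2.0 * fprime0_plus * F0
    if disc < 0.0:
        raise ValueError("equation (10) fails: concavity is ruled out")
    # The density f0_plus + fprime0_plus * x on [-c, 0] has mass
    # f0_plus * c - fprime0_plus * c**2 / 2; c solves mass = F0.
    c = (f0_plus - math.sqrt(disc)) / fprime0_plus
    # Integral of x * density over [-c, 0], divided by the mass F0.
    upper = (-f0_plus * c ** 2 / 2.0 + fprime0_plus * c ** 3 / 3.0) / F0
    return lower, upper


if __name__ == "__main__":
    # Hypothetical inputs: 20% bunching, density 0.1 and slope 0.01 at zero.
    print(concavity_bounds(0.2, 0.1, 0.01))
```

Working from the mass condition avoids fractional powers of possibly negative numbers when the slope is negative, a case the text notes is also covered by the bounds.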

Substituting an estimate of the upper bound in equation (9) in the main regression (equation (5)), and then repeating the process using the lower bound instead, yields estimates of the implied bounds on β.

Similar arguments allow us to determine that if $f_{X^*|Z}(x)$ is convex in $\mathrm{supp}(X^*|X^* < 0, Z)$, then: (a) if $f'_{X|Z}(0)_+ > 0$, then $\underline{E}(Z)$ is an upper bound of E[X∗|X∗ ≤ 0,Z]. Thus, substituting an estimator of $\underline{E}(Z)$ in the regression yields a $\hat{\beta}$ estimator which is an asymptotic lower bound of β if δ > 0, and an asymptotic upper bound on β if δ < 0. (b) If $f'_{X|Z}(0)_+ \le 0$, then $\overline{E}(Z)$ is a lower bound of E[X∗|X∗ ≤ 0,Z]. Thus, substituting an estimator of $\overline{E}(Z)$ in the regression yields the opposite conclusions as (a).

We discuss how the bounds in this section may be estimated in Appendix A.

4.2 Identification Through Families of Distributions

To point-identify E[X∗|X∗ ≤ 0,Z], a natural approach is to suppose that X∗ belongs to some parametric family of distributions. It is possible to do this in varying degrees of flexibility, as we show below. To help clarify how the assumptions made to identify E[X∗|X∗ ≤ 0,Z] impact the original model, all assumptions henceforth are written with respect to η. Importantly, these assumptions are testable (see Remark 4.1).

4.2.1 Parametric Methods

Model 4.2.1. (Tobit) $\eta|Z \sim N(Z'\mu, \sigma^2)$. In this case,

$$E[X^*|X^* \le 0, Z] = Z'(\pi + \mu) - \sigma \lambda(-Z'(\pi + \mu)/\sigma),$$

where $\lambda(\cdot)$ is the inverse Mills ratio. Note that $X^*|Z \sim N(Z'(\pi + \mu), \sigma^2)$ together with equation (3) satisfy the conditions of the Tobit model. Thus, both $\pi + \mu$ and $\sigma$ can be identified as in Tobin (1958), and therefore the expectation is also identified. We never identify $\pi$ and $\mu$ separately, nor is doing so necessary. Appendix A shows how this model may be estimated.

The Tobit model assumes homoskedasticity. Turning to the application in Section 5, the dashed blue line in Figure 5 demonstrates the homoskedastic normal fit for two different values of the controls.
These panels demonstrate that homoskedasticity is clearly not a good assumption in this application. The variance of the treatment variable X (TV time) for the left panel is clearly higher than the corresponding variance for the right panel, yet the fitted normal distribution (dashed blue curve) does not reflect this.

Figure 5: Homoskedastic Tobit Fit. [Two panels. Horizontal axis: Hours Per Week Watching TV. Vertical axis: Empirical Distribution.]
Note: This figure adds to Figure 3 the fitted distribution from the homoskedastic Tobit model (Model 4.2.1), shown as the dashed blue curve.

In the online appendix, we show how to identify E[X∗|X∗ ≤ 0,Z] in the logistic, exponential and uniform distribution families by maximum likelihood. Other distribution families can be identified similarly. The homoskedasticity in Model 4.2.1 can also be relaxed by assuming that σ(Z) has a parametric functional form.

4.2.2 Semiparametric Methods

The parameters in the previous models may be nonparametrically identified.

Model 4.2.2. (Semiparametric Tobit) $\eta|Z \sim N(\mu(Z), \sigma^2(Z))$. In this case,

$$E[X^*|X^* \le 0, Z] = \sigma(Z) \left( \frac{Z'\pi + \mu(Z)}{\sigma(Z)} - \lambda\left( -\frac{Z'\pi + \mu(Z)}{\sigma(Z)} \right) \right),$$

where $\lambda(\cdot)$ is the inverse Mills ratio. We can identify $(Z'\pi + \mu(Z))/\sigma(Z) = -\Phi^{-1}(F_{X|Z}(0))$, where $\Phi$ is the c.d.f. of the standard normal distribution. We can also identify $\sigma(Z) = -E[X|X > 0, Z] / (\Phi^{-1}(F_{X|Z}(0)) - \lambda(-\Phi^{-1}(F_{X|Z}(0))))$. Therefore,

$$E[X^*|X^* \le 0, Z] = -E[X|X > 0, Z] \left( \frac{\Phi^{-1}(F_{X|Z}(0)) + \lambda(\Phi^{-1}(F_{X|Z}(0)))}{-\Phi^{-1}(F_{X|Z}(0)) + \lambda(-\Phi^{-1}(F_{X|Z}(0)))} \right).$$

The complicated expression above is just a function of $F_{X|Z}(0)$ and $E[X|X > 0, Z]$, which are both identifiable. In Appendix A, we show how these quantities may be estimated.

In the online appendix, we study identification in the cases in which η|Z has a logistic, exponential or uniform distribution, while allowing all the parameters of those distributions to be fully nonparametric functions of Z. In all these cases, we derive the formula of E[X∗|X∗ ≤ 0,Z] as a function of $F_{X|Z}(0)$ and $E[X|X > 0, Z]$. Identification using other distribution families can be achieved analogously.
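The two normal-family formulas above can be checked against each other numerically. The following Python sketch, which uses only the standard library, takes $\lambda(c) = \phi(c)/\Phi(c)$ as the inverse Mills ratio. The function names are hypothetical, and the inputs stand in for quantities that would be estimated in practice (the Tobit index and scale in Model 4.2.1, or $F_{X|Z}(0)$ and $E[X|X > 0, Z]$ in Model 4.2.2).

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def inverse_mills(c):
    """lambda(c) = phi(c) / Phi(c)."""
    return _STD.pdf(c) / _STD.cdf(c)


def tobit_expectation(mean, sigma):
    """E[X*|X* <= 0] when X* ~ N(mean, sigma^2), as in Model 4.2.1.

    `mean` plays the role of Z'(pi + mu) for a given value of Z.
    """
    return mean - sigma * inverse_mills(-mean / sigma)


def semiparametric_tobit_expectation(F0, mean_pos):
    """E[X*|X* <= 0, Z] from F_{X|Z}(0) and E[X|X > 0, Z], as in Model 4.2.2."""
    q = _STD.inv_cdf(F0)  # equals -(Z'pi + mu(Z)) / sigma(Z)
    return -mean_pos * (q + inverse_mills(q)) / (-q + inverse_mills(-q))


if __name__ == "__main__":
    # Hypothetical cell: latent mean 1 and standard deviation 2.
    m, s = 1.0, 2.0
    F0 = _STD.cdf(-m / s)                    # implied bunching share
    mean_pos = m + s * inverse_mills(m / s)  # implied E[X|X > 0]
    print(tobit_expectation(m, s))
    print(semiparametric_tobit_expectation(F0, mean_pos))
```

When the two inputs of the second function are generated from a normal latent variable, as in the usage example, both functions should return the same value up to rounding.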

4.3 Discrete/Discretized Z

In Section 4.3.1 we showcase the advantages that a finite support Z affords in our context, both in the identification and in the estimation of E[X∗|X∗ ≤ 0,Z]. Unfortunately, it is rare that Z has a finite support in practice. In Section 4.3.2, we show how Z may be discretized to leverage the advantages of finite support in cases with arbitrary Z.

4.3.1 Methods for Z with Finite Support

In this section, suppose that for all z ∈ supp(Z), P(Z = z) > 0.

Semiparametric Models with Finite Support Z

We begin by showing that if supp(Z) is a finite set, then the identification in the semiparametric models of Section 4.2.2 may be achieved by simpler methods which yield better estimators. Consider first the semiparametric Tobit model.

Model 4.3.1. (Semiparametric Tobit, discrete case) Suppose that Model 4.2.2 holds. Let $\alpha_z = z'\pi + \mu(z)$ and $\sigma_z = \sigma(z)$. This implies that $X^*|Z = z \sim N(\alpha_z, \sigma_z^2)$, where both the mean and variance depend arbitrarily on the value z. For a given z, this is a simple Tobit model with constant mean and variance. This means that we can identify $\alpha_z$ and $\sigma_z$ as in Tobin (1958). We show how this model is estimated in Appendix A.

Note that this approach also has computational advantages. We only need to estimate a 2-dimensional Tobit model for each value of Z in supp(Z). This is generally faster than estimating a (dim(Z)+1)-dimensional Tobit model, as in the homoskedastic Tobit case (Model 4.2.1).

Figure 6 is identical to Figure 5, but it shows the semiparametric Tobit fit with the estimation method implied by Model 4.3.1. The fit in Figure 6 is better than the homoskedastic Tobit fit in Figure 5. However, the upper tails of the distributions of X|Z appear to be too heavy for a normal fit. In each panel, in an effort to match the heavier tail, the fitted distribution ends up missing the location and height of the peak observed in the raw data. If the upper tails are any indication of what is happening in the lower tails, this suggests that this approach may underestimate the magnitude of E[X∗|X∗ ≤ 0,Z], thus overestimating the magnitude of δ. This is consistent with what we find in our empirical results in Section 5 (compare column (iv) to our preferred results in column (v) of Table 1).

Depending on the application, other distribution families may be more appropriate. In the online appendix, we provide the likelihood functions for the semiparametric logistic, exponential and uniform cases when Z has finite support.

Figure 6: Semiparametric Tobit Fit. [Two panels. Horizontal axis: Hours Per Week Watching TV. Vertical axis: Empirical Distribution.]
Note: This figure adds to Figure 3 the fitted distribution from the semiparametric Tobit model (Model 4.3.1), shown as the dashed blue curve.

Nonparametric Methods: Symmetry

A finite-support Z makes it possible to gain more than semiparametric simplicity. If P(X = 0|Z = z) ≤ 0.5, we can drop the assumption that η|Z belongs to a known distribution family. Instead, we can use pieces of the distribution of X|Z = z reflected to the negative side.

Model 4.3.2. (Conditional Tail Symmetry) Suppose that $F_{X|Z=z}(0) \le 0.5$. Assume that the tails of $F_{\eta|Z=z}$ below $-z'\pi$ and above the corresponding location on the positive side are symmetric, so the dashed areas on each of the plots in Figure 7 perfectly mirror each other. Specifically, if $a \le -z'\pi$, we suppose that

$$F_{\eta|Z=z}(a) = 1 - F_{\eta|Z=z}\left( F^{-1}_{\eta|Z=z}(1 - F_{\eta|Z=z}(-z'\pi)) - a - z'\pi \right).$$

Figure 7: Symmetric Density in the Tails. [Two panels showing $dF_{\eta|Z=z}$ and $dF_{X^*|Z=z}$. Marked points: $a$, $-z'\pi$, $F^{-1}_{\eta|Z=z}(1 - F_{\eta|Z=z}(-z'\pi))$ and $-z'\pi - a$ in the first panel; $x$, $0$, $F^{-1}_{X^*|Z=z}(1 - F_{X^*|Z=z}(0))$ and $-x$ in the second.]

Note that, as is clear from the figure, we do not assume symmetry in the “middle” of the distribution (between $-z'\pi$ and $z'\pi$). Also, the mean and variance of η|Z = z can assume any value.
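Under Model 4.3.2, the unobserved lower tail of X∗|Z = z mirrors the observed upper tail beyond the quantile $q = F^{-1}_{X|Z=z}(1 - F_{X|Z=z}(0))$, so that the expectation of interest equals $q - E[X|X \ge q, Z = z]$. A minimal sketch of a sample analogue for a single cell Z = z follows. The function name is hypothetical, and the choice of taking the largest observations to match the number of bunched ones is one simple assumption about how to form the empirical quantile, not necessarily the implementation used in the paper.

```python
def tail_symmetry_expectation(x):
    """Sample analogue of q - E[X | X >= q] for one cell Z = z.

    x : observed values of X in the cell, with bunching at exactly 0.
    """
    n = len(x)
    n_zero = sum(1 for v in x if v == 0)
    if n_zero == 0:
        raise ValueError("no bunched observations in this cell")
    if 2 * n_zero > n:
        raise ValueError("the method requires F(0) <= 0.5 in the cell")
    ordered = sorted(x)
    # The n_zero largest observations carry the same mass as the bunch,
    # so they play the role of the mirrored upper tail.
    upper_tail = ordered[n - n_zero:]
    q = upper_tail[0]  # empirical (1 - F(0)) quantile
    return q - sum(upper_tail) / len(upper_tail)


if __name__ == "__main__":
    # Hypothetical cell with two bunched observations out of ten.
    hours = [0, 0, 1, 2, 3, 4, 5, 6, 7, 9]
    print(tail_symmetry_expectation(hours))
```

In the hypothetical cell, the two largest values (7 and 9) mirror the two bunched observations, so the estimate is 7 minus their mean of 8.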

For all x < 0 it follows that

$$F_{X^*|Z=z}(x) = 1 - F_{X^*|Z=z}\left( F^{-1}_{X^*|Z=z}(1 - F_{X^*|Z=z}(0)) - x \right) = 1 - F_{X|Z=z}\left( F^{-1}_{X|Z=z}(1 - F_{X|Z=z}(0)) - x \right).$$

Therefore, if we calculate the expectation, we can identify the conditional expectation via a change of variables as

$$E[X^*|X^* \le 0, Z = z] = F^{-1}_{X|Z=z}(1 - F_{X|Z=z}(0)) - E\left[ X \,\middle|\, X \ge F^{-1}_{X|Z=z}(1 - F_{X|Z=z}(0)), Z = z \right].$$

We show how this model is estimated in Appendix A. Figure 8 is identical to Figures 5 and 6 but for the tail symmetry fit. Note that the fitted plots under tail symmetry imply a heavier tail than the fitted plots under normality in those figures.

Figure 8: Conditional Tail Symmetry Fit. [Two panels. Horizontal axis: Hours Per Week Watching TV. Vertical axis: Empirical Distribution.]
Note: This figure adds to Figure 3 the fitted distribution from the conditional tail symmetry model (Model 4.3.2), shown as the dashed blue curve.

In the online appendix we discuss identification and estimation under full symmetry. Full symmetry is easy to test, and it implies tail symmetry. The next remark discusses how the model may be tested.

Remark 4.1. (Testing the Distributional Assumption) The distribution of X∗|Z is observed when X∗ > 0. Therefore, one may test the model assumptions by comparing the empirical distribution of X|Z and the estimated distribution implied by the model for X > 0. The well-known two-sample Kolmogorov-Smirnov test is an option, but there are more powerful alternatives, such as the test developed by Goldman and Kaplan (2018).¹⁴ In our application, we test whether the distribution of X∗|Z is symmetric by comparing the empirical distribution of X|Z in $(0, \mathrm{med}(X|Z))$ and the mirror of the empirical distribution of X|Z in $(\mathrm{med}(X|Z), F^{-1}_{X|Z}(1 - F_{X|Z}(0)))$ using both Kolmogorov-Smirnov and Goldman and Kaplan (2018) tests. We also test the null hypothesis that the mean of X|Z in $(0, \mathrm{med}(X|Z))$ is the same as the mean of X|Z in the mirror image of $(\mathrm{med}(X|Z), F^{-1}_{X|Z}(1 - F_{X|Z}(0)))$. For each of these three tests, we fail to reject the null hypothesis for all clusters even at the 10% level of significance, using the Bonferroni critical values to avoid concerns about multiple testing.

¹⁴ This test is designed to have better power to detect deviations from the null at the extremities of distributions. This may be a concern in our setting when including the upper tail in the comparison of the distributions.

4.3.2 Discretizing Z: Hierarchical Clustering

When Z does not have a finite support, it may be “discretized” using a dimensionality reduction technique. Here we focus on clustering techniques because, in order to use the methods in the previous section, our goal is to reduce the size of the support, not necessarily the number of elements in Z. For a given number of elements in the support, clustering methods aim to minimize the loss of information.

Let $\{\hat{C}_1, \dots, \hat{C}_K\}$ be a finite partition of supp(Z) into sets, which we call clusters, and let $\hat{\mathbf{C}}_K = (1(Z \in \hat{C}_1), \dots, 1(Z \in \hat{C}_K))'$ be the cluster indicators. In the estimation of the expectation, we propose substituting Z with $\hat{\mathbf{C}}_K$, which has finite support. The estimator $\hat{E}[X^*|X^* \le 0, Z] = \hat{E}[X^*|X^* \le 0, \hat{\mathbf{C}}_K]$ is thus a two-step procedure in which first Z is discretized and then one of the methods in the previous section is applied.

In general, if E[X∗|X∗ ≤ 0,Z] is continuous, the ability of this estimator to approximate the expectation depends on how much information about Z is given by the cluster indicator vector $\hat{\mathbf{C}}_K$. Thus, it is desirable to choose a clustering method that minimizes the within-cluster variation in the values of Z. All unsupervised clustering methods in the statistical learning literature could in principle be used (e.g. k-means, k-medoids, self-organizing maps, and spectral clustering; see Hastie et al. (2009)).¹⁵ Below, we show results using hierarchical clustering for its simplicity, but similar results were also obtained with other unsupervised clustering methods.¹⁶

Figure 9 shows how the estimates of β in our application change as we increase the number of clusters K.
As K increases, the role of the discretization assumption diminishes. If our results were an artifact of the discretization, we should expect β estimates to change as K increases. Thus, if the estimates of β remain close to constant as the number of clusters increases, as we find in Figure 9, this raises our confidence in the assumption that the clusters adequately capture differences in the conditional distributions η|Z.

¹⁵ We can also envision supervised methods which leverage different values of Z for which there is less bunching to “train” predictions of E[X∗|X∗ ≤ 0,Z] for Z’s with more bunching.

¹⁶ Hierarchical clustering requires the choice of a dissimilarity measure and a linkage method. The reported results use the Gower measure and Ward’s linkage, but we also obtained similar results with other choices.

Figure 9: Uncorrected and Corrected Cognitive and Non-cognitive β Estimates. [Two panels. Horizontal axis: Number of Clusters (K). Vertical axes: Cognitive Estimates; Non-cognitive Estimates.]
Note: Shaded areas depict 90% confidence intervals for the corrected estimates using the tail symmetry method (Model 4.3.2) and different total numbers of clusters K. All standard errors are bootstrapped using 1,000 iterations.

There is a growing literature in economics using clustering techniques in panel settings (e.g. Bonhomme and Manresa (2015); Bonhomme et al. (2017)). In this paper, clusters are used only to improve the estimation of the expectation, not to control for unobserved heterogeneity. Clusters may in principle be used in two ways, both to improve the identification of the expectation and to control for remaining confounders that may be present in spite of the correction. For instance, Caetano et al. (2020) provides a robustness check where cluster fixed effects are also included in the corrected regression in order to allow control variables to enter the outcome equation more flexibly.

5 Application: The Effect of TV on Children’s Skills

In this section, we apply our method to estimate the effect of time spent watching TV on children’s skills¹⁷ using the 1997, 2002 and 2007 Waves of the Child Development Supplement from the Panel Study of Income Dynamics (CDS-PSID). A second application estimating the effect of enrichment activities on children’s skills can be found in Caetano et al. (2020). As both applications use the same sample, we refer the reader to that paper for details about the data, specification of controls, and definition of skills. Our analysis sample has a total of 4,330 observations.

We would like to estimate β in equation (1), where Y is either the cognitive or non-cognitive skills of the child, X is the number of hours the child spent watching TV in a typical week, and Z is a vector of controls which includes a constant as well as characteristics of the child, family, and environment.

Figure 10 shows the unconditional c.d.f. (left panel) and the empirical distribution (right panel) of X. About 5% of the sample is bunched at X = 0. Examples of conditional versions of the right panel are shown in Figure 3.

¹⁷ See for instance Zavodny (2006), Gentzkow and Shapiro (2008) and Munasib and Bhattacharya (2010) for recent empirical papers in this literature.

Figure 10: Unconditional Distribution of X. [Two panels. Horizontal axis: Hours Per Week Watching TV. Vertical axes: Cumulative Distribution Function (left); Empirical Distribution (right).]
Note: The left panel shows the cumulative distribution function of X. The right panel shows the kernel density estimate along with the histogram for X > 0 (bandwidth equal to 2). The darker bar is the proportion of observations with X = 0.

5.1 Main Results

Table 1 presents the main results of the estimation of β in equation (1) both with and without our endogeneity correction. Column (i) shows the raw associations between TV time and skills without controls. Time spent watching TV is negatively correlated with both cognitive and non-cognitive skills. Column (ii) adds controls but does not use the endogeneity correction. After controlling for observables, the estimates of β are closer to zero, but the cognitive estimate is still negative and highly significant.

In columns (iii)-(v), we show estimates of the corrected regression (equation (5)), where E[X∗|X∗ ≤ 0,Z] is estimated using different models. Column (iii) shows the results under the homoskedastic Tobit assumption (Model 4.2.1). The β estimates are positive and insignificant for cognitive skills, but negative and significant for non-cognitive skills. In column (iv), we relax the assumption of homoskedasticity while keeping the assumption of normality (Model 4.3.1) and find estimates that are very close to the homoskedastic case. Finally, we relax the assumption of normality in column (v) and assume only that the conditional η distributions are symmetric in the tails (Model 4.3.2). This assumption yields estimates that are a bit closer to zero, yet the non-cognitive estimate remains significant at 5%. Statistically, all three corrections yield similar results.

Table 1 also displays the estimates of δ for all correction methods. All the estimates are significant at 5% and are negative for cognitive skills and positive for non-cognitive skills.
Note that the bootstrapped standard errors of $\hat{\beta}$ in the corrected models (columns (iii)-(v) in Table 1) are much larger than the corresponding standard errors in the uncorrected model (column (ii) in Table 1). The Eicker-White standard errors of the corrected model (Σ, see Theorem 3.1) gravitate around 95% of the bootstrapped standard errors for all specifications, so the penalty due to the estimation of the expectation turns out to not be important in explaining the larger standard errors. Rather, the standard errors are larger because much of the raw variation in X is contaminated by variation from confounders, and thus in the uncorrected models X is predicting a large part of the error.

Table 1: Main Empirical Results

                       (i)           (ii)          (iii)       (iv)        (v)
                       Uncorrected   Uncorrected   Homosk.     Semipar.    Tail
                       No Controls   w/ Controls   Tobit       Tobit       Symmetry
Cognitive         β    -0.57**       -0.44**       1.54        1.52        0.94
                       (0.13)        (0.10)        (0.98)      (0.98)      (0.68)
                  δ                                -1.92**     -1.90**     -1.31**
                                                   (0.93)      (0.93)      (0.63)
Non-Cognitive     β    -0.28*        -0.10         -3.05**     -3.07**     -2.11**
                       (0.14)        (0.14)        (1.44)      (1.42)      (1.00)
                  δ                                2.85**      2.87**      1.91**
                                                   (1.40)      (1.38)      (0.95)

Note: N=4,330. Results are reported in terms of percentage points of the standard deviation of the outcome variable. For example, the results in the last column suggest that an increase of one hour per week watching TV leads to a reduction in non-cognitive skills of 2.11 percentage points of one standard deviation. Bootstrapped standard errors in parentheses (1,000 bootstrap samples). Columns (iii), (iv) and (v) show results for Models 4.2.1, 4.3.1 and 4.3.2, respectively. The corrected specifications use 10 clusters. See Figure 9 for a reproduction of the results in column (v) for different numbers of clusters. ** p < 0.05, * p < 0.1.

5.2 Supporting Evidence for Main Identifying Assumptions

This section provides a roadmap of the types of sensitivity analyses that we discussed throughout the paper to validate the correction approach. To keep it brief, we organize the discussion as a list of checks for each identifying assumption, and simply refer the reader to the relevant discussion in the text for details.

5.2.1 Linearity Assumption in Equations (1) and (2)

• Relax this assumption using a more flexible model specification, such as the models discussed in Section 2.2.
• Check the predicted residuals of the regression of Y on X and Z for X > 0. If equations (1) and (2) hold, the non-parametric fit of these residuals should be close to zero everywhere for X > 0, which is what we find in the application (see Section 2.1 in the online appendix).

• Estimate β for truncated samples (X ≤ x̄), then plot $\hat{\beta}$ for many values of x̄. Results should be stable, which is what we find in the application (see Section 2.2 in the online appendix).
• If there is more than one bunching point in the support of X, implement the exogeneity test in Remark 2.3.

5.2.2 Assumption for Identification of E[X∗|X∗ ≤ 0,Z]

• Report results under different assumptions, as in Table 1.
• Run Monte Carlo simulations using the application data comparing the results of different methods under different distributional assumptions (see Section 3 in the online appendix).
• Apply the tests and calculate bounds in Propositions 4.1, 4.2, and 4.3.
• Visual checks can be done by value of Z, if possible, or by groups of values of Z otherwise. See Figures 5, 6 and 8, which show the fit for two of the clusters used for the main results (Table 1).
• Test whether the empirical distribution and the fitted distribution are the same for X∗ > 0 (see Remark 4.1).
• Knowing the sign of δ without any assumption about the expectation (Remark 2.2) allows one to choose identification strategies for E[X∗|X∗ ≤ 0,Z] that are conservative in the context of the application. For example, if δ > 0 and we want to make the point that β < 0, it is preferable to err towards an overestimation of the magnitude of E[X∗|X∗ ≤ 0,Z] so that the remaining bias after correction is positive. In this case, if $\hat{\beta}$ is still negative, we could be confident in the conclusion that β is negative even if the correction is imperfect. This is in part why we report as our main results the tail symmetry estimates (column (v) of Table 1).
• If using clusters, examine how results change with the number of clusters used in the estimation of the expectation (Figure 9 in Section 4.3.2).

Of course, some of these checks can detect violations of both assumptions jointly (e.g., the last item from each section).
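The truncated-sample check listed in Section 5.2.1 can be illustrated on simulated data. The sketch below is only a toy example under stated assumptions. The data-generating process, the parameter values, and the helper names (`ols`, `simulate`, `beta_hat`) are all hypothetical. The generated control is taken to be X + E[X∗|X∗ ≤ 0,Z]1(X = 0), which is what a linear model with X∗ = Z′π + η and X = max(0, X∗) implies, and the expectation is evaluated with the oracle normal formula instead of being estimated.

```python
import math
import random
from statistics import NormalDist

STD = NormalDist()


def ols(rows, y):
    """Least-squares coefficients from the normal equations (Gauss-Jordan)."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[0.0] * (k + 1) for _ in range(k)]
    for r, yi in zip(rows, y):
        for i in range(k):
            for j in range(k):
                a[i][j] += r[i] * r[j]
            a[i][k] += r[i] * yi
    for col in range(k):
        piv = max(range(col, k), key=lambda i: abs(a[i][col]))
        a[col], a[piv] = a[piv], a[col]
        for i in range(k):
            if i != col:
                m = a[i][col] / a[col][col]
                for j in range(col, k + 1):
                    a[i][j] -= m * a[col][j]
    return [a[i][k] / a[i][i] for i in range(k)]


def simulate(n, beta=-0.5, delta=1.0, seed=0):
    """Hypothetical design: X* = 1 + 0.5 Z + eta, X = max(0, X*), eta ~ N(0, 1)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        eta = rng.gauss(0.0, 1.0)
        mean = 1.0 + 0.5 * z
        x = max(0.0, mean + eta)
        y = beta * x + 0.3 * z + delta * eta + rng.gauss(0.0, 0.5)
        # Oracle E[X*|X* <= 0, Z] under normality with sigma = 1.
        psi = mean - STD.pdf(-mean) / STD.cdf(-mean)
        data.append((y, x, z, psi))
    return data


def beta_hat(data, x_bar=math.inf, corrected=True):
    """Coefficient on X using only observations with X <= x_bar."""
    rows, ys = [], []
    for y, x, z, psi in data:
        if x > x_bar:
            continue
        row = [1.0, z, x]
        if corrected:
            # Generated control: X + E[X*|X* <= 0, Z] * 1(X = 0).
            row.append(x + (psi if x == 0.0 else 0.0))
        rows.append(row)
        ys.append(y)
    return ols(rows, ys)[2]


if __name__ == "__main__":
    sample = simulate(20000)
    for x_bar in (2.0, 3.0, 4.0, math.inf):
        print(x_bar, round(beta_hat(sample, x_bar), 3))
    print("uncorrected", round(beta_hat(sample, corrected=False), 3))
```

In this design the corrected estimates should stay close to the true value of −0.5 across truncation points, while the uncorrected estimate should be pulled upward by the endogeneity. Instability across values of x̄ in real data would point to a violation of linearity.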
6 Concluding Remarks

This paper shows how to leverage bunching at the lower (or upper) extremum of the distribution of the treatment variable to transform a problem of endogeneity into a problem of out-of-sample prediction. We examine several models in which this type of correction can be built. In a linear model, the correction consists of a generated regressor which is added to the original regression. We study the asymptotic behavior of the estimated coefficients of the corrected regression. We consider several ways in which the out-of-sample prediction might be done. Finally, we apply our correction to an empirical problem and showcase how the underlying assumptions of the method may be tested or argued.

The method developed in this paper opens up several paths for new research. Here we highlight a few: (1) Throughout the paper, we proposed several tests of the underlying assumptions. Although all are based on existing tests, the consequences of the use of estimated nuisance parameters should be studied. (2) The correction strategy for the probit with endogeneity (Section 2.2.5) indicates that this type of approach may be developed for some widely used models in the structural literature, such as discrete choice models. (3) Discretizing Z before estimating the expectation proved to be a useful approach in this application and in Caetano et al. (2020). The advantages/drawbacks of the use of clusters in this context need to be investigated further. (4) The interaction of the correction with existing methods is promising. We mentioned the potential combination of this method with Caetano (2015)’s test in Remark 2.3. The interaction of this approach with instrumental variables methods may also prove valuable. Our preliminary analyses indicate that the validity requirements of an IV may be substantially weakened when the correction term is added to the regression.

A Estimators

• Remark 2.2: This can be implemented in two steps. (1) Regress Y on X and Z using only observations with X > 0. Record the estimate of the coefficient of Z, $\hat{\alpha}_Z$. (2) Calculate the average of the residuals at X = 0, $\left( \sum_{i=1}^{n} 1(X_i = 0) \right)^{-1} \sum_{i=1}^{n} (Y_i - Z_i'\hat{\alpha}_Z) 1(X_i = 0)$. This is an estimator of δE[X∗|X∗ ≤ 0].
• Bounds from Section 4.1 for discrete/discretized Z: for each value Z assumes, say z, restrict the sample only to observations such that Z = z. Then, (1) calculate $\hat{F}_{X|Z=z}(0) = \left( \sum_{i=1}^{n} 1(Z_i = z) \right)^{-1} \sum_{i=1}^{n} 1(X_i = 0, Z_i = z)$, and (2) apply the method in Cattaneo et al. (2019) to estimate both the density and the derivative terms.
• Model 4.2.1: Estimation can be done with a Tobit regression of X onto Z with censoring below zero.
• Model 4.2.2: Note that $E[1(X = 0)|Z = z] = F_{X|Z=z}(0)$ and can be estimated as a nonparametric regression of 1(X = 0) onto Z at z, and $E[X|X > 0, Z = z] = (1 - F_{X|Z=z}(0))^{-1} E[X|Z = z]$, and $E[X|Z = z]$ can be estimated as a nonparametric regression of X onto Z at z. A standard Nadaraya-Watson kernel regression could be used, for example. Let $K(u)$ be a kernel function ($\int_{-\infty}^{\infty} K(u)\,du = 1$, and suppose if convenient that $K(u) \ge 0$ and $K(u) 1(|u| > 1) = 0$). Let $k_n(Z_i - z) = K((Z_i - z)/h_n) / \sum_{i=1}^{n} K((Z_i - z)/h_n)$, for some sequence $h_n \to 0$, $n h_n \to \infty$. Then

$$\hat{F}_{X|Z=z}(0) = \sum_{i=1}^{n} 1(X_i = 0) k_n(Z_i - z)$$

and

$$\hat{E}[X|X > 0, Z = z] = (1 - \hat{F}_{X|Z=z}(0))^{-1} \sum_{i=1}^{n} X_i k_n(Z_i - z).$$

Conditions for uniform convergence of such estimators can be verified in the existing literature. See, for example, Andrews (1995) and Hansen (2008). One could use different estimators, for example local polynomials (see e.g. Masry (1996)) or series estimators (see e.g. Song (2008)).
• Model 4.3.1: For each value Z assumes, say z, run a Tobit regression of X onto a constant with censoring below zero using only observations such that Z = z.
• Model 4.3.2: Substitute quantities in the expectation formula by sample equivalents.

B Proofs

B.1 Proof of Theorem 3.1

For any function $\psi \in \mathcal{H}$, and parameter $\theta \in \Theta$, define $M(\theta, \psi) = E[W_\psi (Y - W_\psi'\theta)]$, where $W_\psi = (X, Z', \psi(Z) 1(X = 0))'$, and note that $M(\theta_0, \psi_0) = 0$. Since it will be used several times, note that $W_\psi - W = (0, 0', (\psi(Z) - \psi_0(Z)) 1(X = 0))' = (\psi(Z) - \psi_0(Z)) 1(X = 0) e_\delta$, where $e_\delta = (0, \dots, 0, 1)'$ is the last $(\dim(Z) + 2) \times 1$ canonical vector. Define $M_n(\theta, \psi) = \frac{1}{n} \sum_{i=1}^{n} W_{i\psi} (Y_i - W_{i\psi}'\theta)$. Since this is a just-identified problem, $\hat{\theta}$ is chosen exactly as the solution to $\min_\theta \|M_n(\theta, \hat{\psi})\|$. Denote $\theta = (\theta_X, \theta_Z', \theta_E)'$. Define $\epsilon = Y - E[Y|X, Z] = \varepsilon + \delta(\eta - E[\eta|X, Z]) 1(X = 0)$. Finally, Chen et al. (2003) also define a matrix W, which in our case should be understood as the identity matrix (i.e. whenever W appears in Chen et al. (2003), substitute it for the identity matrix).

We show the asymptotic normality, using Theorem 2 in Chen et al. (2003).

• $\hat{\theta} - \theta_0 = o_p(1)$: We prove this directly.
0 p (cid:32) n (cid:33)−1 n 1 (cid:88) 1 (cid:88) θˆ−θ = W W(cid:48) W ((cid:15) −(W −W )(cid:48)θ ) 0 n iψˆ iψˆ n iψˆ i iψˆ i 0 i=1 i=1 29

(cid:32) n (cid:33)−1(cid:34) n n n 1 (cid:88) 1 (cid:88) 1 (cid:88) 1 (cid:88) = W W(cid:48) W (cid:15) + (W −W )(cid:15) − W (W −W )(cid:48)θ + n iψˆ iψˆ n i i n iψˆ i i n i iψˆ i 0 i=1 i=1 i=1 i=1 n (cid:35) 1 (cid:88) − (W −W )(W −W )(cid:48)θ n iψˆ i iψˆ i 0 i=1 The term 1 (cid:80)n W (cid:15) = o (1) by Assumption 1(ii) and Brunk-Chung’s Law of Large Numn i=1 i i a.s. bers (Chow and Teicher (1997), Theorem 10.1.3 for r = 1). (cid:12) (cid:12) The term (cid:12)1 (cid:80)n (W −W )(cid:15) (cid:12) ≤ ||ψˆ− ψ ||1 (cid:80)n |(cid:15) |. By Assumption 1(v), the first (cid:12)n i=1 iψˆ i i(cid:12) 0 n i=1 i term is o (1). The second term is O (1) by Assumption 1(ii) and Brunk-Chung’s Law of p a.s. Large Numbers. Remaining terms are shown to be o (1) analogously. For the last term, in Appendix B.2 p we show that 1 (cid:80)n W W(cid:48) = 1 (cid:80)n W W(cid:48) plus a matrix with terms that can be seen n i=1 iψˆ iψˆ n i=1 i i there. The terms inside the matrix are all similar to the terms above, in that they combine (ψˆ(Z ) − ψ (Z )) or the square of it, and other independent variables, and thus it can be i 0 i shown that they are all o (1) analogously to what we just did. In Appendix B.2 we also show p that 1 (cid:80)n W W(cid:48) → E[WW(cid:48)], which is full rank by Assumption 1(ii), which completes n i=1 i i a.s. the proof. • Assumption 2.1 is trivially satisfied. • Assumption2.2: Γ (θ,ψ ) = E[WW(cid:48)]byAssumption1(ii)andtheDominatedConvergence 1 0 Theorem, and is constant in θ. The requirements are thus satisfied by Assumption 1(ii). • Assumption2.3: First, wecalculatethederivativeΓ (θ,ψ )[ψ−ψ ].Letψ = ψ+t(ψ−ψ ). 2 0 0 t 0 By Assumption 1(ii) and the Dominated Convergence Theorem, (cid:20) (cid:21) Γ (θ,ψ )[ψ−ψ ] = E lim 1 (cid:0) −W(W −W)(cid:48)θ+(W −W)(Y −W(cid:48) θ) (cid:1) 2 0 0 t→0 t ψt ψt ψt = E(cid:2) (ψ(Z)−ψ (Z))1(X = 0)(0,−θ Z(cid:48),−W(cid:48)(θ−θ )−θ ψ (Z))(cid:48)(cid:3) , 0 E 0 E 0 which exists in all directions [ψ−ψ ] ∈ H. 
Next, 0 ||M(θ,ψ)−M(θ,ψ 0 )−Γ 2 (θ,ψ 0 )[ψ−ψ 0 ]|| = (cid:12) (cid:12) (cid:12) (cid:12) E(cid:2) −θ E (ψ(Z)−ψ 0 (Z))21(X = 0)e δ (cid:3)(cid:12) (cid:12) (cid:12) (cid:12) ≤ |θ |·||ψ−ψ ||2 E 0 H and for τ = o(1), and ||θ−θ || ≤ τ , and ||ψ−ψ || ≤ τ , n 0 n 0 n ||Γ (θ,ψ )[ψ−ψ ]−Γ (θ ,ψ )[ψ−ψ ]|| = 2 0 0 2 0 0 0 (cid:12) (cid:12) (cid:12) (cid:12) E(cid:2) −(ψ(Z)−ψ 0 (Z))1(X = 0)(0,(θ E −δ)Z(cid:48),W(cid:48)(θ−θ 0 )+(θ E −δ)ψ 0 (Z))(cid:48)(cid:3)(cid:12) (cid:12) (cid:12) (cid:12) ≤ sup ||ψ−ψ ||E[||W||](|θ −δ|+||θ−θ ||) ≤ 2∆·τ2, 0 E 0 n ||ψ−ψ0||≤τn 30

where the first inequality uses the triangle and Cauchy-Schwarz’s inequalities, and the second inequality is true by Assumption 1(ii).
• Assumption 2.4 is true by Assumption 1(v).
• Assumption 2.5 (we prove the stronger condition 2.5’): unfortunately, we cannot take advantage of Theorem 3 in Chen et al. (2003) because we would like to allow the data to be independent but not identically distributed. We cannot apply the result they mention on Remark 3(iii) either, because the last element of our function $m$ is not monotonic in $\psi$. We must therefore prove the stochastic equicontinuity condition directly. Let $\|\psi - \psi_0\| \le \tau_n$ and $\|\theta - \theta_0\| \le \tau_n$, with $\tau_n = o(1)$. Then,
$$\sqrt{n}\,\|M_n(\theta,\psi) - M(\theta,\psi) - M_n(\theta_0,\psi_0)\|$$
$$\le \left\|\frac{1}{\sqrt{n}}\sum_{i=1}^n (W_iW_i' - E[W_iW_i'])(\theta - \theta_0)\right\| + \left|\frac{1}{\sqrt{n}}\sum_{i=1}^n (\psi(Z_i) - \psi_0(Z_i))1(X_i = 0)\epsilon_i\right|$$
$$+ (\tau_n + \theta_E)\left\|\frac{1}{\sqrt{n}}\sum_{i=1}^n \left[(\psi(Z_i) - \psi_0(Z_i))1(X_i = 0)W_i' - E[(\psi(Z_i) - \psi_0(Z_i))1(X_i = 0)W_i']\right]\right\|$$
$$+ \theta_E\left|\frac{1}{\sqrt{n}}\sum_{i=1}^n \left[(\psi(Z_i) - \psi_0(Z_i))^2 1(X_i = 0) - E[(\psi(Z_i) - \psi_0(Z_i))^2 1(X_i = 0)]\right]\right|$$
The convergence of the $\sup_{\|\theta - \theta_0\| \le \delta_n}$ of the first term is established in Andrews (1994), equation (2.4), and for that we need only to show that $\frac{1}{\sqrt{n}}\sum_{i=1}^n (W_iW_i' - E[W_iW_i']) = O_p(1)$. Some of the terms $\sum_{i=1}^n W_{ij}W_{is}$ are constant, for example $W_{i1}W_{i,\dim(Z)+2} = X_i\psi_0(Z_i)1(X_i = 0) = 0$.
For the terms that are not constant, we show that Liapounov’s condition is satisfied. By Assumption 1(ii),
$$\frac{\sum_{i=1}^n E\left[|W_{ij}W_{is} - E[W_{ij}W_{is}]|^{2+\alpha}\right]}{\left(\sum_{i=1}^n \mathrm{Var}(W_{ij}W_{is})\right)^{1+\frac{\alpha}{2}}} \le \frac{\sum_{i=1}^n E\left[|W_{ij}W_{is}|^{2+\alpha}\right]}{(n\alpha)^{1+\frac{\alpha}{2}}} \le \frac{\Delta}{n^{\frac{\alpha}{2}}\alpha^{1+\frac{\alpha}{2}}} = o(1).$$
For the remaining three terms, we note that Lemma 2.17 in Pakes and Pollard (1989) can be directly extended from their case $f(\cdot,\theta)$ to our case $f(\cdot,\psi)$, and can be proven in the same way as theirs, as pointed out by Chen et al. (2003) in the proof of their Lemma 1.
Next, we show that $f(X_i,Z_i,\epsilon_i,\psi) = (\psi(Z_i) - \psi_0(Z_i))1(X_i = 0)\epsilon_i$ satisfies the conditions of Lemma 2.17 in Pakes and Pollard (1989). To see this, note that
$$|f(X_i,Z_i,\epsilon_i,\psi_1) - f(X_i,Z_i,\epsilon_i,\psi_2)| \le b(X_i,Z_i,\epsilon_i)\,\|\psi_1 - \psi_2\|,$$
where $b(X_i,Z_i,\epsilon_i) = 1(X_i = 0)\epsilon_i$, and thus $f$ is Lipschitz continuous in $\psi$. Assumption 1(ii) guarantees that $E[b(X_i,Z_i,\epsilon_i)^{2+\alpha}] \le \Delta$ for some $\alpha > 0$. Therefore, $f$ is $L_2(P)$-continuous in $\psi$. Finally, an identical argument to Chen et al. (2003)’s proof of Theorem 3, item (i) establishes that Hölder continuity of $f$ in $\psi$ combined with the finite uniform entropy of $\psi$ (Assumption 1(iv)) imply that $f$ belongs to a Euclidean class. A direct application of Lemma 2.17 concludes that $\sup_{\|\psi - \psi_0\| \le \delta_n}\left|\frac{1}{\sqrt{n}}\sum_{i=1}^n (\psi(Z_i) - \psi_0(Z_i))1(X_i = 0)\epsilon_i\right| = o_p(1)$.
For the last two terms, we can show the local uniform continuity in probability analogously. Note that the last term is not Lipschitz, but instead Hölder continuous with constant equal to 2, which is treated identically.
• Assumption 2.6:
$$\sqrt{n}\left(M_n(\theta_0,\psi_0) + \Gamma_2(\theta_0,\psi_0)[\hat\psi - \psi_0]\right) = \frac{1}{\sqrt{n}}\sum_{i=1}^n W_i\epsilon_i - \delta\sqrt{n}\,E[(\hat\psi(Z_i) - \psi_0(Z_i))1(X_i = 0)W_i],$$
Both terms are uncorrelated (because $\epsilon_i$ is mean independent of all $X_j$ and $Z_j$). White (1980) establishes (by Assumption 1(ii)) that the first term converges to a normally distributed random variable with zero mean and variance equal to the middle term in Eicker-White’s covariance matrix. The second term converges to $N(0,\Omega)$ by Assumption 1(vi). □

B.2 Proof of Theorem 3.2

We use the notation defined in the beginning of the previous section. The convergence in probability of the first term of $\hat V_\theta$ to $\Sigma$ is a consequence of Assumption 1(ii) and is established in White (1980). For the second term, $\hat\delta \to_p \delta$ proceeds from Theorem 3.1.
Next, we establish that $\left(\frac{\hat w'\hat w}{n}\right)^{-1} \to_p E[WW']^{-1}$. We can decompose
$$\frac{\hat w'\hat w}{n} = \frac{w'w}{n} + \frac{(\hat w - w)'w}{n} + \frac{w'(\hat w - w)}{n} + \frac{(\hat w - w)'(\hat w - w)}{n} = \frac{w'w}{n} + \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & \frac{1}{n}\sum_{i=1}^n Z_i(\hat\psi(Z_i) - \psi_0(Z_i))1(X_i = 0) \\ 0 & \frac{1}{n}\sum_{i=1}^n (\hat\psi(Z_i) - \psi_0(Z_i))Z_i'1(X_i = 0) & \frac{3}{n}\sum_{i=1}^n (\hat\psi(Z_i) - \psi_0(Z_i))^2 1(X_i = 0) \end{pmatrix}.$$
By Markov’s inequality,
$$P\left(\left|\frac{1}{n}\sum_{i=1}^n Z_i(\hat\psi(Z_i) - \psi_0(Z_i))1(X_i = 0)\right| > \tau\right) \le \sup_{\mathcal{Z}}|\hat\psi(z) - \psi_0(z)|\,\frac{1}{\tau n}\sum_{i=1}^n E[\|Z_i\|]. \quad (11)$$
By Assumption 1(v), the first term in (11) is $o_p(1)$. By Assumption 1(ii), the second term in (11) is bounded. The other terms in the matrix are shown to be $o_p(1)$ analogously.
The term $\frac{w'w}{n} \to_{a.s.} E[WW']$ by Assumption 1(ii) and Brunk-Chung’s Strong Law of Large Numbers (see Chow and Teicher (1997), Theorem 10.1.3, for $r = 1$).
Next, we decompose
$$\frac{\hat w'\hat V\hat w}{n^2} - \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n C_{ij}W_iW_j'1(X_i = 0, X_j = 0)$$
$$= \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n (\hat C_{ij} - C_{ij})(\hat W_i - W_i)(\hat W_j - W_j)'1(X_i = 0, X_j = 0)$$
$$+ \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n C_{ij}(\hat W_i - W_i)(\hat W_j - W_j)'1(X_i = 0, X_j = 0)$$
$$+ \frac{2}{n^2}\sum_{i=1}^n\sum_{j=1}^n (\hat C_{ij} - C_{ij})(\hat W_i - W_i)W_j'1(X_i = 0, X_j = 0)$$
$$+ \frac{2}{n^2}\sum_{i=1}^n\sum_{j=1}^n C_{ij}(\hat W_i - W_i)W_j'1(X_i = 0, X_j = 0)$$
$$+ \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n (\hat C_{ij} - C_{ij})W_iW_j'1(X_i = 0, X_j = 0)$$
Note that the sums $\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n |C_{ij}|1(X_i = 0, X_j = 0)$, $\frac{1}{n}\sum_{i=1}^n \|W_j\|1(X_i = 0, X_j = 0)$, $\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \|C_{ij}W_j\|1(X_i = 0, X_j = 0)$ and $\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \|W_iW_j'\|1(X_i = 0, X_j = 0)$ are all von Mises statistics, corresponding to U-statistics with kernel $h((X_i,Z_i')',(X_j,Z_j')') = |C_{ij}|1(X_i = 0, X_j = 0)$ in the first case, and analogously for the others. By Assumption 2(ii) and Assumption 1(ii), the U-statistic converges a.s. to the kernel mean (see Theorem 3.1.1 in Korolyuk and Borovskich (2013)). Since the U-statistic converges a.s. to a finite constant, the von Mises statistic also converges a.s. to that constant.
The decomposition is therefore $o_p(1)$ by the triangle inequality and Assumption 2(iii) and because $\sup_{i=1,\dots,n}\|\hat W_i - W_i\| \le \sup_{i=1,\dots,n}|\hat\psi(Z_i) - \psi_0(Z_i)| = o_p(1)$ as a consequence of Assumption 1(v).
Finally, we show that $\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n C_{ij}W_iW_j'1(X_i = 0, X_j = 0) \to_{a.s.} \Omega$. This is a von Mises statistic corresponding to the U-statistic with kernel function $h((X_i,Z_i')',(X_j,Z_j')') = C_{ij}W_iW_j'1(X_i = 0, X_j = 0)$. By Assumption 2(ii), the U-statistic converges a.s. to its mean, $\Omega$ (Theorem 3.1.1 in Korolyuk and Borovskich (2013) again). Since the U-statistic converges a.s. to a finite constant, the von Mises statistic does as well.
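For reference, the step from the U-statistic to the von Mises statistic, used twice in this proof, can be written out explicitly. The notation $\xi_i = (X_i, Z_i')'$, $U_n$ and $V_n$ is ours, and the display is only a sketch: it assumes that the diagonal average $\frac{1}{n}\sum_{i=1}^n h(\xi_i,\xi_i)$ is almost surely bounded, a condition that is not stated explicitly in the text.
$$V_n = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n h(\xi_i,\xi_j) = \frac{n-1}{n}\,U_n + \frac{1}{n}\cdot\frac{1}{n}\sum_{i=1}^n h(\xi_i,\xi_i), \qquad U_n = \frac{1}{n(n-1)}\sum_{i \neq j} h(\xi_i,\xi_j).$$
Under that assumption, $V_n - U_n = -\frac{1}{n}U_n + \frac{1}{n}\cdot\frac{1}{n}\sum_{i=1}^n h(\xi_i,\xi_i) \to 0$ a.s. whenever $U_n$ converges a.s. to a finite constant, so both statistics share the same almost sure limit.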
□

B.3 Proof of Remark 3.1

We show that when $\sqrt{n}(\hat\psi - \psi_0)$ converges in distribution to a Brownian Bridge, Assumption 1(vi) holds, and $\Omega = E[C_{ij}1(X_i = 0, X_j = 0)W_iW_j']$, for $C_{ij} = \mathrm{Cov}(\chi_{Z_i},\chi_{Z_j})$.
Define
$$\phi(T_n) = \int T_n(z)1(x = 0)(x, z', \psi_0(z)1(x = 0))'\,P(dx,dz),$$
then, $\sqrt{n}\,E[(\hat\psi(Z_i) - \psi_0(Z_i))1(X_i = 0)W_i] = \sqrt{n}(\phi(\hat\psi) - \phi(\psi_0))$. The Hadamard derivative of $\phi$ at $\psi_0$ is
$$\phi'_{\psi_0}(h) = \int h(z)1(x = 0)(x, z', \psi_0(z)1(x = 0))'\,P(dx,dz)$$

Therefore, by Assumption 1(iv) and the Functional Delta Method,
$$\sqrt{n}(\phi(\hat\psi) - \phi(\psi_0)) \to_d \int \chi_z 1(x = 0)(x, z', \psi_0(z)1(x = 0))'\,P(dx,dz).$$
Denote the limit random variable as $A$, and $w(x,z) = (x, z', \psi_0(z)1(x = 0))'$, then
$$A \sim N\left(0, \iint C(z,\tilde z)1(x = 0, \tilde x = 0)\,w(x,z)\tilde w(\tilde x,\tilde z)'\,P(dx,dz)P(d\tilde x,d\tilde z)\right).$$
(See e.g. Van der Vaart (1998) Example 22.11 for a similar calculation). □

B.4 Proof of Theorem 3.3

We show that the assumptions in Theorem B in Chen et al. (2003) hold. We use the notation defined in the beginning of Section B.1. Note that, in our case, $m(Y_i,X_i,Z_i,\theta,\psi(Z_i)) = W_{i\psi}(Y_i - W_{i\psi}'\theta)$.
• First, we show that $\hat\theta - \theta_0 = o_{a.s.}(n^{-1/4})$. We prove this directly. The decomposition is the same as in the first item in Appendix B.1:
$$\hat\theta - \theta_0 = \left(\frac{1}{n}\sum_{i=1}^n W_{i\hat\psi}W_{i\hat\psi}'\right)^{-1}\left[\frac{1}{n}\sum_{i=1}^n W_i\epsilon_i + \frac{1}{n}\sum_{i=1}^n (W_{i\hat\psi} - W_i)\epsilon_i - \frac{1}{n}\sum_{i=1}^n W_i(W_{i\hat\psi} - W_i)'\theta_0 - \frac{1}{n}\sum_{i=1}^n (W_{i\hat\psi} - W_i)(W_{i\hat\psi} - W_i)'\theta_0\right].$$
The term $\frac{1}{n}\sum_{i=1}^n W_i\epsilon_i = o_{a.s.}(n^{-1/2+\alpha})$ for some $\alpha > 0$ by Assumption 1(ii) and Marcinkiewicz-Zygmund Strong Law of Large Numbers (Chow and Teicher (1997) Theorem 5.3.2).
The term $\left|\frac{1}{n}\sum_{i=1}^n (W_{i\hat\psi} - W_i)\epsilon_i\right| \le \|\hat\psi - \psi_0\|\frac{1}{n}\sum_{i=1}^n |\epsilon_i|$. By Assumption 3(iii), the first term is $o_{a.s.}(n^{-1/4})$. The second term is $O_{a.s.}(1)$ by the Strong Law of Large Numbers.
The remaining two terms inside the brackets are shown to be $o_{a.s.}(n^{-1/4})$ analogously. For the last term, in Section B.2 we showed that $\frac{1}{n}\sum_{i=1}^n W_{i\hat\psi}W_{i\hat\psi}' = \frac{1}{n}\sum_{i=1}^n W_iW_i'$ plus a matrix with terms which can be seen there. The terms inside the matrix are all similar to the terms above, in that they combine $(\hat\psi(Z_i) - \psi_0(Z_i))$, or the square of it, and other i.i.d. variables, and thus it can be shown that they are all $o_{a.s.}(n^{-1/4})$ analogously to what we just did. In the previous section, we also showed that $\frac{1}{n}\sum_{i=1}^n W_iW_i' \to_{a.s.} E[WW']$, which is full rank by Assumption 1(ii) and completes the proof.
• Assumption 2.1 holds a.s. trivially.
• Assumption 2.4 holds a.s. by Assumption 3(v).
• Assumption 2.5’ holds a.s.: this assumption is used to show that $\|\nu_n(\hat\theta^*,\hat h^*) - \nu_n(\hat\theta,\hat h)\| = o_{p^*}(1)$. Instead of using the almost sure stochastic equicontinuity, we establish this result directly.

By the triangle inequality, the term is bounded above by $\|\nu_n(\hat\theta^*,\hat h^*) - \nu_n(\theta_0,h_0)\| + \|\nu_n(\hat\theta,\hat h) - \nu_n(\theta_0,h_0)\|$, and in our case, each of those terms is bounded above by four quantities identical to the bounds in the proof of Assumption 2.5 in Section B.1, except that $\psi$ and $\theta$ are substituted by $\hat\psi^b$ and $\hat\theta^b$, and $\hat\psi$ and $\hat\theta$ respectively.
The first quantity is bounded above by $\left\|\frac{1}{n^{1/2+c}}\sum_{i=1}^n (W_iW_i' - E[W_iW_i'])\right\| n^c\|\theta - \theta_0\|$ for some $0 < c \le 1/4$ and $\theta = \hat\theta^b$ or $\theta = \hat\theta$, respectively. By Assumption 1(ii) and Marcinkiewicz-Zygmund Law of Large Numbers (Chow and Teicher (1997) Theorem 5.3.2), the first term is $o_{a.s.}(1)$. We also proved (first point in this Section) that $\|\hat\theta - \theta_0\| = o_{a.s.}(n^{-1/4})$. Given Assumptions 3 (va) and (vc), all the assumptions of Theorem 3.1 hold with $\psi^b$ in place of $\hat\psi$ and $\hat\psi$ in place of $\psi_0$, and changing the probability measure from $P$ to $P^b$. Therefore, by the proof of Theorem 2 in Chen et al. (2003) (as mentioned in the proof of their Theorem B), $\|\hat\theta^b - \hat\theta\| = O_{p^b}(n^{-1/2})$. Therefore $\|\hat\theta^b - \theta_0\| = O_{p^b}(n^{-1/2})$, which completes the proof.
The result for the remaining quantities is implied directly by Assumptions 3 (iv) and (vb).
• Assumption 2.6 does not have “in probability” in its statement and thus holds by the proof in Section B.1.
• Assumption 2.2 with $\psi_0$ replaced by $\psi \in \mathcal{H}_\tau$: The derivative $\Gamma_1(\theta_0,\psi) = E[W_\psi W_\psi']$ exists, is continuous everywhere in $\theta$ (it does not depend on $\theta$), and is of full rank for $\tau$ sufficiently small by Assumption 1(ii). Note also that $\Gamma_1(\theta_0,\psi)$ is continuous in $\psi$ at $\theta = \theta_0$ and $\psi = \psi_0$, since
$$\|\Gamma_1(\theta_0,\psi) - \Gamma_1(\theta_0,\psi_0)\| \le E[\|(W_\psi - W)W_\psi' + W(W_\psi - W)'\|] \le \|\psi - \psi_0\|\,E[\|W_\psi\| + \|W\|] \le (2\Delta + \tau)\|\psi - \psi_0\|.$$
• Assumption 2.3 with $\psi_0$ replaced by $\psi \in \mathcal{H}_{\tau_n}$:
$$\Gamma_2(\theta,\psi)[\tilde\psi - \psi] = E\left[(\tilde\psi(Z) - \psi(Z))1(X = 0)\left(0, -\theta_E Z', -W'(\theta - \theta_0) - \theta_E\psi(Z)\right)'\right]$$
exists in all directions $[\tilde\psi - \psi] \in \mathcal{H}$. Moreover,
$$\left\|M(\theta,\tilde\psi) - M(\theta,\psi) - \Gamma_2(\theta,\psi)[\tilde\psi - \psi]\right\| = \left\|E\left[-\theta_E(\tilde\psi(Z) - \psi(Z))^2 1(X = 0)e_\delta\right]\right\| \le |\theta_E|\cdot\|\tilde\psi - \psi\|_{\mathcal{H}}^2,$$
and for $\tau_n = o(1)$, and $\|\theta - \theta_0\| \le \tau_n$, and $\|\tilde\psi - \psi\| \le \tau_n$
$$\left\|\Gamma_2(\theta,\psi)[\tilde\psi - \psi] - \Gamma_2(\theta_0,\psi)[\tilde\psi - \psi]\right\| = \left\|E\left[-(\tilde\psi(Z) - \psi(Z))1(X = 0)\left(0, (\theta_E - \delta)Z', W'(\theta - \theta_0) + (\theta_E - \delta)\psi(Z)\right)'\right]\right\|$$
$$\le \sup_{\|\tilde\psi - \psi\| \le \tau_n}\|\tilde\psi - \psi\|\,E[\|W_\psi\|]\left(|\theta_E - \delta| + \|\theta - \theta_0\|\right) \le 2(\Delta + \tau_n)\tau_n^2,$$
where the first inequality uses the triangle and Cauchy-Schwarz’s inequalities, and the second inequality uses the triangle inequality again and Assumption 1(ii).
• Assumption 2.4B holds by Assumption 3(va).
• We show Assumption 2.5’B by applying Theorem 3 in Chen et al. (2003). In our case, $l = \dim(Z) + 2$ and $m(Y,X,Z,\theta,\psi) = m_c(Y,X,Z,\theta,\psi)$, and $m_{lc}(Y,X,Z,\theta,\psi) = 0$, which automatically satisfies condition 3.2. Condition 3.3 is satisfied by Assumptions 1 (iii) and (iv). We show condition 3.1:
$$|m_c(Y,X,Z,\theta_1,\psi_1) - m_c(Y,X,Z,\theta_2,\psi_2)| = \left|(W_{\psi_1} - W_{\psi_2})Y - (W_{\psi_1}W_{\psi_1}' - W_{\psi_2}W_{\psi_2}')\theta_2 - W_{\psi_1}W_{\psi_1}'(\theta_1 - \theta_2)\right|$$
$$\le |\psi_1(Z) - \psi_2(Z)|\left(|Y| + \|\theta_1\|(2\|Z\| + |\psi_1(Z)| + |\psi_2(Z)|)\right) + \|\theta_1 - \theta_2\|\left(X^2 + 2|X|\cdot\|Z\| + \|Z\|^2 + 2\|Z\|\cdot|\psi_1(Z)| + \psi_1(Z)^2\right).$$
There exists $\Delta > 0$ such that by Assumption 1(iii), $\|\theta_1\| \le \Delta$ and by Assumption 3(ii), $|\psi_1(Z)| \le \Delta$. Therefore, $b(Y,X,Z) = \max\{|Y| + 2\Delta\|Z\| + 2\Delta^2, X^2 + 2(|X| + \Delta)\cdot\|Z\| + \|Z\|^2 + \Delta^2\}$. By Assumption 1(ii), $E[b(Y,X,Z)] < \infty$. Therefore, the condition holds with $s_{1j} = s_j = 1$.
• Assumption 2.6B: let $M_n^b(\hat\theta,\hat\psi)$ be equal to $M_n(\hat\theta,\hat\psi)$ as defined in Section B.1, except that it is calculated with the bootstrap sample. As stated in Chen et al. (2003) (p. 1596), from Giné and Zinn (1990) we know that the $P^b$-distribution of $\sqrt{n}(M_n^b(\hat\theta,\hat\psi) - M_n(\hat\theta,\hat\psi))$ approximates the distribution of $\sqrt{n}(M_n(\hat\theta,\hat\psi) - M(\hat\theta,\hat\psi))$, which is approximately the same as the distribution of $\sqrt{n}M_n(\theta_0,\psi_0)$ by condition 2.5’ shown in Section B.1.
Next, we show that the $P^b$-distribution of $\sqrt{n}\,\Gamma_2(\hat\theta,\hat\psi)[\hat\psi^b - \hat\psi]$ approximates the distribution of $\sqrt{n}\,\Gamma_2(\theta_0,\psi_0)[\hat\psi - \psi_0]$. Specifically, $\sqrt{n}(\Gamma_2(\hat\theta,\hat\psi)[\hat\psi^b - \hat\psi] - \Gamma_2(\theta_0,\psi_0)[\hat\psi - \psi_0]) = (0, A_n, B_n)$, where
$$A_n = -\delta\sqrt{n}\,E\left[[(\hat\psi^b(Z) - \hat\psi(Z)) - (\hat\psi(Z) - \psi_0(Z))]1(X = 0)Z'\right] - \sqrt{n}\,E\left[(\hat\theta_E - \delta)(\hat\psi^b(Z) - \hat\psi(Z))1(X = 0)Z'\right]$$
$$B_n = -\delta\sqrt{n}\,E\left[[(\hat\psi^b(Z) - \hat\psi(Z)) - (\hat\psi(Z) - \psi_0(Z))]1(X = 0)\psi_0(Z)\right] - \sqrt{n}\,E\left[(\hat\theta_E - \delta)(\hat\psi^b(Z) - \hat\psi(Z))1(X = 0)\psi_0(Z)\right] - \sqrt{n}\,E\left[\hat\theta_E(\hat\psi^b(Z) - \hat\psi(Z))(\hat\psi(Z) - \psi_0(Z))1(X = 0)\right].$$
By Assumption 3(vc), the first terms in $A_n$ and $B_n$ are $o_{p^b}(1)$. Now we discuss the second term in $A_n$. By Cauchy-Schwarz, its absolute value is bounded above by
$$E[n(\hat\theta_E - \delta)^4]^{1/4}\,E[\|Z\|^4]^{1/4}\,n^{1/4}E[(\hat\psi^b(Z) - \hat\psi(Z))^2]^{1/2}. \quad (12)$$
By the first point in this Section, $n^{1/4}\|\hat\theta - \theta_0\| = o_{a.s.}(1)$, and by Assumption 1(iii) and the Dominated Convergence Theorem, the first term in (12) converges to zero. The second term is bounded by Assumption 1(ii). Assumption 3(va) implies that the third term is $o_{p^b}(1)$. Therefore, the second term in $A_n$ is $o_{p^b}(1)$. The second and third terms in $B_n$ are also $o_{p^b}(1)$, and the proof is analogous, except that to bound the second term in $B_n$ we use Assumption 3(ii), and to bound the third term in $B_n$ we use Assumption 3(iii), the Continuous Mapping Theorem, and the Dominated Convergence Theorem.
Thus, $\sqrt{n}(M_n^b(\hat\theta,\hat\psi) - M_n(\hat\theta,\hat\psi) + \Gamma_2(\hat\theta,\hat\psi)[\hat\psi^b - \hat\psi])$ and $\sqrt{n}(M_n(\theta_0,\psi_0) + \Gamma_2(\theta_0,\psi_0)[\hat\psi - \psi_0])$ have the same asymptotic distribution. By the proof of Assumption 2.6 in Section B.1, this is the desired limit. □

B.5 Lemma: Establishing Stochastic Dominance in an Interval.

Let $0 < a \le b < \infty$, and suppose that (i) $f$ and $g$ are two non-negative functions; (ii) $g(x) \ge f(x)$ for all $x \in [0,a]$; and (iii) $\int_0^a g(x)\,dx = \int_0^b f(x)\,dx = \Lambda(b) < \infty$. Then $\int_0^b x f(x)\,dx \ge \int_0^a x g(x)\,dx$.
Proof. Let $\int_0^x f(u)\,du = \Lambda(x)$, and $\int_0^x g(u)\,du = G(x)$. By Leibniz rule, item (iii), and the Mean Value Theorem for integrals,
$$\int_0^b x f(x)\,dx - \int_0^a x g(x)\,dx = (b - a)(\Lambda(b) - \Lambda(c)) + \int_0^a (G(x) - \Lambda(x))\,dx,$$
where $c \in [a,b]$. By item (i), $\Lambda(b) \ge \Lambda(c)$, and by item (ii), the second term is non-negative.

References

Andrews, D. W. (1994). Asymptotics for semiparametric econometric models via stochastic equicontinuity. Econometrica: Journal of the Econometric Society, pages 43–72.
Andrews, D. W. (1995). Nonparametric kernel estimation for semiparametric models. Econometric Theory, pages 560–596.
Baum II, C. L. (2003). Does early maternal employment harm child development? An analysis of the potential benefits of leave taking. Journal of Labor Economics, 21(2):409–448.
Bertanha, M., McCallum, A. H., and Seegert, N. (2020). Better bunching, nicer notching. Working Paper.
Bertrand, M., Karlan, D., Mullainathan, S., Shafir, E., and Zinman, J. (2010). What’s advertising content worth? Evidence from a consumer credit marketing field experiment. The Quarterly Journal of Economics, 125(1):263–306.
Bettinger, E., Hægeland, T., and Rege, M. (2014). Home with mom: The effects of stay-at-home parents on children’s long-run educational outcomes. Journal of Labor Economics, 32(3):443–467.

Bhutani, S., Klempel, M. C., Kroeger, C. M., Aggour, E., Calvo, Y., Trepanowski, J. F., Hoddy, K. K., and Varady, K. A. (2013). Effect of exercising while fasting on eating behaviors and food intake. Journal of the International Society of Sports Nutrition, 10(1):50.
Bischoff-Ferrari, H. A., Willett, W. C., Wong, J. B., Giovannucci, E., Dietrich, T., and Dawson-Hughes, B. (2005). Fracture prevention with vitamin D supplementation: A meta-analysis of randomized controlled trials. JAMA, 293(18):2257–2264.
Black, S. E., Devereux, P. J., and Salvanes, K. G. (2005). The more the merrier? The effect of family size and birth order on children’s education. The Quarterly Journal of Economics, 120(2):669–700.
Black, S. E., Devereux, P. J., and Salvanes, K. G. (2010). Small family, smart family? Family size and the IQ scores of young men. Journal of Human Resources, 45(1):33–58.
Bleemer, Z. (2018a). The effect of selective public research university enrollment: Evidence from California. Research & Occasional Paper Series: CSHE. 11.18. Center for Studies in Higher Education.
Bleemer, Z. (2018b). Top percent policies and the return to postsecondary selectivity. Working Paper.
Blomquist, N. S., Newey, W. K., Kumar, A., and Liang, C.-Y. (2019). On bunching and identification of the taxable income elasticity. CENMAP Working Paper.
Bonhomme, S., Lamadon, T., and Manresa, E. (2017). Discretizing Unobserved Heterogeneity. Working Paper.
Bonhomme, S. and Manresa, E. (2015). Grouped patterns of heterogeneity in panel data. Econometrica, 83(3):1147–1184.
Boserup, S. H., Kopczuk, W., and Kreiner, C. T. (2016). The role of bequests in shaping wealth inequality: Evidence from Danish wealth records. American Economic Review, 106(5):656–61.
Boulianne, S. (2015). Social media use and participation: A meta-analysis of current research. Information, Communication & Society, 18(5):524–538.
Brown, J. R., Coile, C. C., and Weisbenner, S. J. (2010). The effect of inheritance receipt on retirement. The Review of Economics and Statistics, 92(2):425–434.
Caetano, C. (2015). A test of exogeneity without instrumental variables in models with bunching. Econometrica, 83(4):1581–1600.
Caetano, C., Caetano, G., and Nielsen, E. (2020). Should children do more enrichment activities? Leveraging bunching to correct for endogeneity. FEDS Working Paper No. 2020-036.
Caetano, G., Kinsler, J., and Teng, H. (2019). Towards causal estimates of children’s time allocation on skill development. Journal of Applied Econometrics, 34(4):588–605.

Caetano, G. and Maheshri, V. (2018). Identifying Dynamic Spillovers of Crime with a Causal Approach to Model Selection. Quantitative Economics, 9(1):343–394.
Carman, K. G. (2013). Inheritances, intergenerational transfers, and the accumulation of health. American Economic Review, 103(3):451–55.
Cattaneo, M. D., Jansson, M., and Ma, X. (2019). Simple local polynomial density estimators. Journal of the American Statistical Association, pages 1–7.
Chatterji, P., Markowitz, S., and Brooks-Gunn, J. (2013). Effects of early maternal employment on maternal health and well-being. Journal of Population Economics, 26(1):285–301.
Chay, K. Y. and Greenstone, M. (2005). Does air quality matter? Evidence from the housing market. Journal of Political Economy, 113(2):376–424.
Chen, X., Linton, O., and Van Keilegom, I. (2003). Estimation of semiparametric models when the criterion function is not smooth. Econometrica, 71(5):1591–1608.
Chetty, R., Friedman, J. N., Hilger, N., Saez, E., Schanzenbach, D. W., and Yagan, D. (2011). How Does your Kindergarten Classroom Affect your Earnings? Evidence from Project Star. The Quarterly Journal of Economics, 126(4):1593–1660.
Chow, Y. S. and Teicher, H. (1997). Probability Theory. Springer - New York.
Cohen, M. A. (2008). The effect of crime on life satisfaction. The Journal of Legal Studies, 37(S2):S325–S353.
Corrao, G., Rubbiati, L., Bagnardi, V., Zambon, A., and Poikolainen, K. (2000). Alcohol and coronary heart disease: A meta-analysis. Addiction, 95(10):1505–1523.
DeVito, A., Jacob, M., and Müller, M. A. (2019). Avoiding taxes to fix the tax code. Working Paper.
Ekici, T. and Dunn, L. (2010). Credit card debt and consumption: Evidence from household-level data. Applied Economics, 42(4):455–462.
Elinder, M., Erixson, O., and Waldenström, D. (2018). Inheritance and wealth inequality: Evidence from population registers. Journal of Public Economics, 165:17–30.
Eren, O. and Henderson, D. J. (2011). Are we wasting our children’s time by giving them more homework? Economics of Education Review, 30(5):950–961.
Erhardt, E. C. (2017). Microfinance beyond self-employment: Evidence for firms in Bulgaria. Labour Economics, 47:75–95.
Erixson, O. (2017). Health responses to a wealth shock: Evidence from a Swedish tax reform. The Journal of Population Economics, 30:1281–1336.
Ermisch, J. and Francesconi, M. (2013). The effect of parental employment on child schooling. Journal of Applied Econometrics, 28(5):796–822.

Fawzi, W. W., Chalmers, T. C., Herrera, M. G., and Mosteller, F. (1993). Vitamin A supplementation and child mortality: A meta-analysis. JAMA, 269(7):898–903.
Ferreira, D., Ferreira, M. A., and Mariano, B. (2018). Creditor control rights and board independence. The Journal of Finance, 73(5):2385–2423.
Garen, J. (1984). The returns to schooling: A selectivity bias approach with a continuous choice variable. Econometrica: Journal of the Econometric Society, pages 1199–1218.
Gentzkow, M. and Shapiro, J. M. (2008). Preschool television viewing and adolescent test scores: Historical evidence from the Coleman study. The Quarterly Journal of Economics, 123(1):279–323.
Giné, E. and Zinn, J. (1990). Bootstrapping general empirical measures. The Annals of Probability, pages 851–869.
Goldman, M. and Kaplan, D. M. (2018). Comparing distributions by multiple testing across quantiles or CDF values. Journal of Econometrics, 206(1):143–166.
Hansen, B. E. (2008). Uniform convergence rates for kernel estimation with dependent data. Econometric Theory, pages 726–748.
Härdle, W., Liang, H., and Gao, J. (2000). Partially linear models. Physica-Verlag Heidelberg.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer Science & Business Media.
Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47(1):153–161.
Hernán, M. A., Takkouche, B., Caamaño-Isorna, F., and Gestal-Otero, J. J. (2002). A meta-analysis of coffee drinking, cigarette smoking, and the risk of Parkinson’s disease. Annals of Neurology, 52(3):276–284.
Holt, K., Shehata, A., Strömbäck, J., and Ljungberg, E. (2013). Age and the effects of news media attention and social media use on political interest and participation: Do social media function as leveller? European Journal of Communication, 28(1):19–34.
James-Burdumy, S. (2005). The effect of maternal labor force participation on child development. Journal of Labor Economics, 23(1):177–211.
Jenish, N. and Prucha, I. R. (2009). Central limit theorems and uniform laws of large numbers for arrays of random fields. Journal of Econometrics, 150(1):86–98.
Joulfaian, D. and Wilhelm, M. O. (1994). Inheritance and labor supply. The Journal of Human Resources, 29(4):1205–1234.
Kim, B. and Ruhm, C. J. (2012). Inheritances, health and death. Health Economics, 21(2):127–144.
Kleven, H. J. (2016). Bunching. Annual Review of Economics, 8(1):435–464.

Kleven, H. J. and Waseem, M. (2013). Using Notches to Uncover Optimization Frictions and Structural Elasticities: Theory and Evidence from Pakistan. The Quarterly Journal of Economics, 128(2):669–723.
Korolyuk, V. S. and Borovskich, Y. V. (2013). Theory of U-statistics, volume 273. Springer Science & Business Media.
Lavetti, K. and Schmutte, I. M. (2018). Estimating compensating wage differentials with endogenous job mobility. Working paper.
Luoh, M.-C. and Herzog, A. R. (2002). Individual consequences of volunteer and paid work in old age: Health and mortality. Journal of Health and Social Behavior, pages 490–509.
Masry, E. (1996). Multivariate local polynomial regression for time series: Uniform strong consistency and rates. Journal of Time Series Analysis, 17(6):571–599.
McDuffie, R. S., Beck, A., Bischoff, K., Cross, J., and Orleans, M. (1996). Effect of frequency of prenatal care visits on perinatal outcome among low-risk women: A randomized controlled trial. JAMA, 275(11):847–851.
Melzer, B. T. (2011). The real costs of credit access: Evidence from the Payday lending market. The Quarterly Journal of Economics, 126(1):517–555.
Munasib, A. and Bhattacharya, S. (2010). Is the ‘idiot’s box’ raising idiocy? Early and middle childhood television watching and child cognitive outcome. Economics of Education Review, 29(5):873–883.
Noordzij, M., Uiterwaal, C. S., Arends, L. R., Kok, F. J., Grobbee, D. E., and Geleijnse, J. M. (2005). Blood pressure response to chronic intake of coffee and caffeine: A meta-analysis of randomized controlled trials.
Oken, E., Levitan, E., and Gillman, M. (2008). Maternal smoking during pregnancy and child overweight: Systematic review and meta-analysis. International Journal of Obesity, 32(2):201–210.
Pakes, A. and Pollard, D. (1989). Simulation and the asymptotics of optimization estimators. Econometrica: Journal of the Econometric Society, pages 1027–1057.
Pang, J. (2017). Do subways improve labor market outcomes for low-skilled workers. Working Paper, Syracuse University.
Peek, J., Rosengren, E. S., and Tootell, G. M. (2003). Identifying the macroeconomic effect of loan supply shocks. Journal of Money, Credit and Banking, pages 931–946.
Pötscher, B. M. and Prucha, I. R. (1994). Generic uniform convergence and equicontinuity concepts for random functions: An exploration of the basic structure. Journal of Econometrics, 60(1-2):23–63.
Reynolds, K., Lewis, B., Nolen, J. D. L., Kinney, G. L., Sathya, B., and He, J. (2003). Alcohol consumption and risk of stroke: A meta-analysis. JAMA, 289(5):579–588.

Richardson, T., Elliott, P., and Roberts, R. (2013). The relationship between personal unsecured debt and mental and physical health: A systematic review and meta-analysis. Clinical Psychology Review, 33(8):1148–1162.
Robinson, P. M. (1988). Root-N-Consistent Semiparametric Regression. Econometrica, 56(4):931–954.
Rozenas, A., Schutte, S., and Zhukov, Y. (2017). The political legacy of violence: The long-term impact of Stalin’s repression in Ukraine. The Journal of Politics, 79(4):1147–1161.
Ruhm, C. J. (2004). Parental employment and child cognitive development. The Journal of Human Resources, 39(1):155–192.
Ruhm, C. J. (2008). Maternal employment and adolescent development. Labour Economics, 15(5):958–983.
Saez, E. (2010). Do Taxpayers Bunch at Kink Points? American Economic Journal: Economic Policy, 2(3):180–212.
Shinton, R. and Beevers, G. (1989). Meta-analysis of relation between cigarette smoking and stroke. BMJ, 298(6676):789–794.
Song, K. (2008). Uniform convergence of series estimators over function spaces. Econometric Theory, pages 1463–1499.
Tobin, J. (1958). Estimation of Relationships for Limited Dependent Variables. Econometrica, 26(1):24–36.
Van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
Van der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence. In Weak convergence and empirical processes, pages 16–28. Springer.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4):817–838.
Zavodny, M. (2006). Does watching television rot your mind? Estimates of the effect on test scores. Economics of Education Review, 25(5):565–573.

ONLINE APPENDIX

Correcting for Endogeneity in Models with Bunching

Carolina Caetano (University of Georgia), Gregorio Caetano (University of Georgia), Eric Nielsen (Federal Reserve Board)

August 2020

This appendix supplements the paper by offering further analyses. In Section 1, we discuss alternative approaches for the identification of $E[X^*|X^* \le 0, Z]$ that were mentioned but not shown in the paper. In Section 2, we present two empirical checks of the linearity assumptions in equations (1) and (2) in the paper, which proved useful in the application in Section 5 as well as in Caetano et al. (2020). Finally, in Section 3, we present the results from a real-data Monte Carlo simulation.

1 Identifying $E[X^*|X^* \le 0, Z]$

1.1 Parametric Methods

We discuss how $E[X^*|X^* \le 0, Z]$ may be obtained in parametric models under some well-known distribution families. This section supplements the discussion in Section 4.2.1 in the paper.

Model 4.2.3 (Logistic) $\eta|Z \sim \text{Logistic}(Z'\kappa, \sigma)$ almost surely. In this case,

\[
E[X^*|X^* \le 0, Z] = Z'(\pi+\kappa) - \sigma \log\big(1+\exp(Z'(\pi+\kappa)/\sigma)\big).
\]

The log-likelihood function is

\[
L(\pi+\kappa,\sigma) = -\sum_{i=1}^{n} \Big[ 1(X_i = 0)\log\big(1+\exp(Z_i'(\pi+\kappa)/\sigma)\big) + 1(X_i > 0)\,(X_i - Z_i'(\pi+\kappa))/\sigma + 1(X_i > 0)\log\sigma + 2\cdot 1(X_i > 0)\log\big(1+\exp(-(X_i - Z_i'(\pi+\kappa))/\sigma)\big) \Big].
\]

Note that, similarly to the Tobit case, we only identify the sum $\pi+\kappa$ instead of $\pi$ and $\kappa$ separately, and this is sufficient to identify the expectation. Indeed, whenever $\eta$ can be written as $Z'\kappa+\zeta$, where $\zeta$ has a distribution with fixed location, we will only identify $\pi+\kappa$, but this will be sufficient. This is true in the Tobit case, where $\zeta \sim N(0,\sigma)$, and in the Logistic case, where $\zeta \sim \text{Logistic}(0,\sigma)$. These are both symmetric distributions where the location is equal to the mean.
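As an illustration of the censored logistic log-likelihood above, the following is a minimal Python sketch, not the authors' code. It assumes the last term is the usual logistic density term $\log(1+\exp(-(X_i - Z_i'(\pi+\kappa))/\sigma))$, and it takes the linear index $Z_i'(\pi+\kappa)$ as a precomputed input. The function names `log1pexp` and `censored_logistic_loglik` and the argument `index` are hypothetical.

```python
import math

def log1pexp(t):
    """Numerically stable log(1 + exp(t))."""
    return t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))

def censored_logistic_loglik(x, index, sigma):
    """Log-likelihood of X = max(0, X*) with X* | Z ~ Logistic(index, sigma).

    x     : observed treatment values (x_i >= 0)
    index : values of Z_i'(pi + kappa), one per observation (hypothetical input)
    sigma : logistic scale parameter, sigma > 0
    """
    total = 0.0
    for xi, ai in zip(x, index):
        if xi == 0:
            # Bunched observation: log P(X* <= 0) = -log(1 + exp(a_i / sigma))
            total -= log1pexp(ai / sigma)
        else:
            # Uncensored observation: log of the logistic density at x_i
            u = (xi - ai) / sigma
            total -= u + math.log(sigma) + 2.0 * log1pexp(-u)
    return total
```

In practice this function would be passed (with its sign flipped) to a numerical optimizer over the coefficients of the index and $\sigma$; the optimizer is left out here to keep the sketch self-contained.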

The uniform distribution is an example of a symmetric distribution with location determined by the extremum.

Model 4.2.4 (Uniform) $\eta|Z \sim U[Z'\kappa, Z'\mu]$ almost surely. In this case,

\[
E[X^*|X^* \le 0, Z] = \frac{1}{2} Z'(\pi+\kappa).
\]

The log-likelihood function is

\[
L(\pi+\kappa,\mu-\kappa) = \sum_{i=1}^{n} \Big[ 1(X_i = 0)\log\big(-Z_i'(\pi+\kappa)\big) - \log\big(Z_i'(\mu-\kappa)\big) \Big].
\]

The last example is of an asymmetric distribution where the location is set by the lower limit of the support. The model implies that the distribution of $X^*$ has support $[Z'(\pi+\kappa), \infty)$, with higher concentration towards the lower values of $X^*$.

Model 4.2.5 (Exponential) $\eta = Z'\kappa+\zeta$, where $\zeta|Z \sim \text{Exp}((Z'\mu)^{-1})$ almost surely. In this case,

\[
E[X^*|X^* \le 0, Z] = Z'\mu + Z'(\pi+\kappa)\,\frac{1+\exp\big(Z'(\pi+\kappa)/Z'\mu\big)}{1-\exp\big(Z'(\pi+\kappa)/Z'\mu\big)}.
\]

The log-likelihood function is

\[
L(\pi+\kappa,\mu) = \sum_{i=1}^{n} \Big[ 1(X_i = 0)\log\big(1-\exp(Z_i'(\pi+\kappa)/Z_i'\mu)\big) - 1(X_i > 0)\big(\log Z_i'\mu + (X_i - Z_i'(\pi+\kappa))/Z_i'\mu\big) \Big].
\]

1.2 Semiparametric Methods

We discuss how $E[X^*|X^* \le 0, Z]$ may be obtained in semiparametric models in which the distribution family is known, but the parameters are identified nonparametrically. This section supplements the discussion in Section 4.2.2 in the paper.

Model 4.2.6 (Semiparametric Logistic) $\eta|Z \sim \text{Logistic}(\kappa(Z), \sigma(Z))$ almost surely. In this case,

\[
E[X^*|X^* \le 0, Z] = \sigma(Z)\left(\frac{Z'\pi+\kappa(Z)}{\sigma(Z)} - \log\left(1+\exp\left(\frac{Z'\pi+\kappa(Z)}{\sigma(Z)}\right)\right)\right).
\]

We can identify $1+\exp((Z'\pi+\kappa(Z))/\sigma(Z)) = (F_{X|Z}(0))^{-1}$ and $\sigma(Z) = -E[X|X>0,Z]\,(1-F_{X|Z}(0))/\log(F_{X|Z}(0))$, and thus

\[
E[X^*|X^* \le 0, Z] = -E[X|X>0,Z]\left(\frac{(1-F_{X|Z}(0))\log(1-F_{X|Z}(0))}{F_{X|Z}(0)\log F_{X|Z}(0)}\right).
\]
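The final expression for Model 4.2.6 depends only on two observable quantities, $F_{X|Z}(0)$ and $E[X|X>0,Z]$, so a plug-in version is short. The sketch below is an illustration under the assumption that $Z$ is discrete, so that both quantities can be replaced by sample analogs within a cell of observations sharing the same value of $Z$. The function name `logistic_plugin_expectation` and the argument `x_cell` are hypothetical.

```python
import math

def logistic_plugin_expectation(x_cell):
    """Plug-in estimate of E[X* | X* <= 0, Z = z] under Model 4.2.6.

    x_cell : observed X values for units sharing the same value of Z
             (hypothetical input; must contain both zeros and positives)
    """
    n = len(x_cell)
    positives = [v for v in x_cell if v > 0]
    f0 = (n - len(positives)) / n               # sample analog of F_{X|Z}(0)
    if not 0.0 < f0 < 1.0:
        raise ValueError("need both bunched and positive observations")
    mean_pos = sum(positives) / len(positives)  # sample analog of E[X | X > 0, Z]
    ratio = ((1.0 - f0) * math.log(1.0 - f0)) / (f0 * math.log(f0))
    return -mean_pos * ratio
```

When exactly half of the cell is bunched at zero the ratio equals one, so the estimate is minus the mean of the positive observations, as symmetry of the logistic distribution around zero would suggest.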

Model 4.2.7 (Semiparametric Uniform) $\eta|Z \sim U[\kappa(Z), \mu(Z)]$ almost surely. In this case,

\[
E[X^*|X^* \le 0, Z] = \frac{1}{2}\big(Z'\pi+\kappa(Z)\big).
\]

We can identify $\frac{Z'\pi+\kappa(Z)}{\mu(Z)-\kappa(Z)} = -F_{X|Z}(0)$, $\frac{Z'\pi+\mu(Z)}{\mu(Z)-\kappa(Z)} = 1-F_{X|Z}(0)$, and $E[X|X>0,Z] = \frac{1}{2}(Z'\pi+\mu(Z))$; thus

\[
E[X^*|X^* \le 0, Z] = -E[X|X>0,Z]\,\frac{F_{X|Z}(0)}{1-F_{X|Z}(0)}.
\]

Model 4.2.8 (Semiparametric Exponential) $\eta = \kappa(Z)+\zeta$, where $\zeta|Z \sim \text{Exp}(\mu(Z)^{-1})$ almost surely. In this case,

\[
E[X^*|X^* \le 0, Z] = \mu(Z)\left(1+\frac{(Z'\pi+\kappa(Z))/\mu(Z)}{1-\exp\big((Z'\pi+\kappa(Z))/\mu(Z)\big)}\right).
\]

We can identify $(Z'\pi+\kappa(Z))/\mu(Z) = \log(1-F_{X|Z}(0))$ and $\mu(Z) = E[X|X>0,Z]$. Thus,

\[
E[X^*|X^* \le 0, Z] = -E[X|X>0,Z]\left(\frac{-\log(1-F_{X|Z}(0))}{F_{X|Z}(0)} - 1\right).
\]

1.3 Semi- and Nonparametric Methods for Discrete/Discretized Z

We discuss how $E[X^*|X^* \le 0, Z]$ may be obtained in the semiparametric models of the previous section when $Z$ is discrete or has been discretized. This section supplements the discussion in Section 4.3.1 in the paper. Throughout this section, assume that $\text{supp}(Z)$ is a finite set.

Model 4.3.3 (Semiparametric Logistic, discrete case) Suppose that Model 4.2.6 holds. Let $\alpha_z = z'\pi+\kappa(z)$ and $\sigma_z = \sigma(z)$. This implies that $X^*|Z=z \sim \text{Logistic}(\alpha_z, \sigma_z)$. In this case, the two parameters $\alpha_z$ and $\sigma_z$ can be identified and estimated with the log-likelihood function

\[
L(\alpha_z,\sigma_z) = -\sum_{i=1}^{n} \Big[ 1(X_i = 0, Z_i = z)\log\big(1+\exp(\alpha_z/\sigma_z)\big) + 1(X_i > 0, Z_i = z)\,(X_i-\alpha_z)/\sigma_z + 1(X_i > 0, Z_i = z)\log\sigma_z + 2\cdot 1(X_i > 0, Z_i = z)\log\big(1+\exp(-(X_i-\alpha_z)/\sigma_z)\big) \Big].
\]

Model 4.3.4 (Semiparametric Uniform, discrete case) Suppose that Model 4.2.7 holds. Let $\alpha_z = z'\pi+\kappa(z)$ and $\nu_z = \mu(z)-\kappa(z)$. This implies that $X^*|Z=z \sim U[\alpha_z, \alpha_z+\nu_z]$. In this case, the two parameters $\alpha_z$ and $\nu_z$ can be identified and estimated with the log-likelihood function

\[
L(\alpha_z,\nu_z) = \sum_{i=1}^{n} \Big[ 1(X_i = 0, Z_i = z)\log(-\alpha_z) - 1(Z_i = z)\log\nu_z \Big].
\]

Model 4.3.5 (Semiparametric Exponential, discrete case) Suppose that Model 4.2.8 holds. Let $\alpha_z = z'\pi+\kappa(z)$ and $\mu_z = \mu(z)$. This implies that $X^* = \alpha_Z+\zeta$ almost surely, where $\zeta|Z=z \sim \text{Exp}(\mu_z^{-1})$. In this case, the two parameters $\alpha_z$ and $\mu_z$ can be identified and estimated with the log-likelihood function

\[
L(\alpha_z,\mu_z) = \sum_{i=1}^{n} \Big[ 1(X_i = 0, Z_i = z)\log\big(1-\exp(\alpha_z/\mu_z)\big) - 1(X_i > 0, Z_i = z)\big(\log\mu_z + (X_i-\alpha_z)/\mu_z\big) \Big].
\]

Model 4.3.6 (Conditional Symmetry) Suppose that $F_{X|Z=z}(0) \le 0.5$. Assume that the distribution of $\eta|Z=z$ is symmetric around its mean, so that, letting $E[\eta|Z=z] = \mu_z$, $F_{\eta|Z=z}(a) = 1-F_{\eta|Z=z}(2\mu_z-a)$. This implies that the distribution of $X^*|Z=z$ is also symmetric around its mean/median $z'\pi+\mu_z$. Thus, for all $x < 0$, $F_{X^*|Z=z}(x) = 1-F_{X^*|Z=z}(2(z'\pi+\mu_z)-x) = 1-F_{X|Z=z}(2(z'\pi+\mu_z)-x)$. Because $F_{X|Z=z}(0) \le 0.5$, $z'\pi+\mu_z = \text{med}(X^*|Z=z) = \text{med}(X|Z=z)$ is identifiable, and thus so is $F_{X^*|Z=z}(x) = 1-F_{X|Z=z}(2\,\text{med}(X|Z=z)-x)$ for all $x$. Therefore, if we calculate the expectation, we can identify the conditional expectation via change of variables as

\[
E[X^*|X^* \le 0, Z=z] = 2\,\text{med}(X|Z=z) - E[X \,|\, X \ge 2\,\text{med}(X|Z=z), Z=z]. \qquad (1)
\]

To estimate this quantity, simply substitute the sample equivalents. Note that full symmetry is testable. The distribution of $X^*|Z=z$ is observed in $(0, 2\,\text{med}(X|Z=z)]$. Therefore, the equality between the functions $F_{X|Z=z}(x)$ and $1-F_{X|Z=z}(2\,\text{med}(X|Z=z)-x)$ for $x \in [0, \text{med}(X|Z=z)]$ can be tested, as discussed in Remark 4.1 in the paper.

2 Two Empirical Checks of the Linearity Assumption

We suggest two checks of whether the main conclusions of our application would change under violations of the linearity assumption in equations (1) and (2). These checks proved to be of practical value both in this application and in Caetano et al. (2020).

2.1 Plotting Residuals from the Uncorrected Regression for X > 0

By equation (4) in the paper, when $X > 0$, $E[Y|X,Z] = X'(\beta+\delta)+Z'(\pi-\gamma\delta)$. The expected value of the residuals of a regression of $Y$ on $X$ and $Z$ for observations with $X > 0$ should thus be equal to zero for each positive value of $X$. Figure 1 shows the local linear fit of the estimated residuals for cognitive (left panel) and non-cognitive (right panel) skills in our application, which are always close to zero on the positive side.

The points at zero in Figure 1 represent the average of the residuals of the same regression among observations with $X = 0$. This corresponds exactly to an estimator of $\delta E[X^*|X^* \le 0]$, as discussed in Remark 2.1 in the paper. The positive and strongly significant result in the left panel shows that $\delta < 0$ for cognitive skills. Analogously, the right panel shows that $\delta > 0$ for non-cognitive skills. Incidentally, this implies a rejection of exogeneity in both cases (p-value shown in the caption of each panel).

Figure 1: Evidence that Uncorrected Estimates Are Biased
[Two panels plotting the Cognitive Residual (left) and the Non-Cognitive Residual (right) against Hours Per Week Watching TV. P-value of Exogeneity Test: 0.000 in both panels.]
Note: Each panel shows a plot of the local linear estimator (bandwidth equal to 10) of the residuals from a regression of $Y$ onto $X$ and $Z$, estimated using only observations with $X > 0$, where $X$ represents the hours spent watching TV in a typical week. We also present the average of the residuals for $X = 0$ and show 90% confidence intervals everywhere. The caption shows the p-value of a test for whether the average residual at $X = 0$ is equal to zero. The p-value is calculated using 1,000 bootstrapped samples.

2.2 Sequenced Sample Truncation

To fix ideas, consider two scenarios. In scenario 1, we restrict our sample to observations with $X \in [0,1]$, and in scenario 2, we restrict our sample to observations with $X \in [0,50]$. We exploit the idea that, under violations of the linearity assumption, the effect of the confounders at $X = 0$ is likely to be more similar to the effect of the confounders at $X = 1$ than at $X = 50$. Similarly, the effect of an additional hour of TV at $X = 0$ is more likely to be similar to the effect of an additional hour at $X = 1$ than at $X = 50$. We build on this idea by first restricting the sample to reflect the first scenario and then progressively expanding the sample until it reaches the second scenario.
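The truncation exercise can be sketched in a few lines of Python. This is an illustration under simplifying assumptions and not the authors' implementation: a single scalar control `z`, a first-stage estimate `e_hat` of $E[X^*|X^* \le 0, Z_i]$ that is taken as given and held fixed across truncations, and a small hand-written least-squares solver in place of a statistics library. The names `ols`, `truncated_estimates`, `e_hat`, and `x_max_grid` are hypothetical.

```python
def ols(rows, y):
    """Least-squares coefficients via the normal equations (Gauss-Jordan)."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(r[p] * r[q] for r in rows) for q in range(k)]
         + [sum(r[p] * yi for r, yi in zip(rows, y))] for p in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(a[r][c]))  # partial pivoting
        a[c], a[piv] = a[piv], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]

def truncated_estimates(y, x, z, e_hat, x_max_grid):
    """Coefficient on X in the corrected regression for each cap x_max.

    The regressors are a constant, X, Z, and the generated control
    X + e_hat * 1(X = 0); e_hat is a fixed first-stage estimate per unit.
    """
    out = {}
    for x_max in x_max_grid:
        keep = [i for i in range(len(x)) if x[i] <= x_max]
        # (x[i] == 0) is a bool, which Python treats as 1 or 0 in arithmetic
        rows = [[1.0, x[i], z[i], x[i] + e_hat[i] * (x[i] == 0)] for i in keep]
        out[x_max] = ols(rows, [y[i] for i in keep])[1]
    return out
```

Calling `truncated_estimates(y, x, z, e_hat, range(5, 51))` would return one estimate per cap; confidence intervals, for example from a bootstrap, would have to be added separately.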
In Figure 2, we show how our main estimate $\hat\beta$ for cognitive (left panel) and non-cognitive skills (right panel) changes for different truncations of our sample (ranging from $X_{\max} = 5$ to $X_{\max} = 50$).¹ Since over 99% of our sample spends less than 50 hours per week watching TV, the estimates in the far right of each panel are almost identical to the corresponding estimates reported in Table 1 of the paper. This approach also allows us to assess the robustness of our conclusions to the elimination of television time outliers from our sample.

¹ We do not include the cases $X_{\max} = 1, 2, 3, 4$ in the plot because the confidence intervals are too large, which makes it hard to visualize what is happening in the rest of the plot. Nevertheless, the unreported results for these cases are consistent with the conclusions we draw from Figure 2. To keep everything else constant irrespective of $X_{\max}$, we keep $\hat E[X^*|X^* \le 0, Z]$ fixed across the different truncations using our preferred tail symmetry approach. Note that the identification of $E[X^*|X^* \le 0, Z]$ does not depend on the assumptions about equations (1) and (2) of the paper, which is what we are trying to test here.

Figure 2: Estimates for Different Sub-Samples of Data
[Two panels plotting the Cognitive Estimates (left) and the Non-cognitive Estimates (right) against $X_{\max}$.]
Note: Each panel shows the estimates of $\beta$ for cognitive (left panel) or non-cognitive (right panel) skills restricting the sample to only children who watch at most $X_{\max}$ hours of TV per week, for $X_{\max} = 5, \dots, 50$, with the $X_{\max} = 50$ restriction including over 99% of the sample. The 90% bootstrapped confidence intervals are also shown.

For lower values of $X_{\max}$, the sample is smaller and the confidence intervals are larger, suggesting small-sample variability. Nevertheless, for both types of skills, the qualitative conclusions are stable for all values of $X_{\max}$. For cognitive skills, the point estimates are stable for $X_{\max} \ge 15$, while for non-cognitive skills, the point estimates are stable for $X_{\max} \ge 8$.

3 Monte Carlo

We assess the performance of our method with an empirical Monte Carlo based on the data and application introduced in Section 5 of the paper. Section 3.1 explains how we calibrate the parameters of the data generating process, Section 3.2 describes how we use these calibrated parameters to draw random samples with different distributions of $\eta|Z$, and Section 3.3 reports and discusses the Monte Carlo results.

3.1 Parameter Calibration

To estimate the parameters, we fit the semiparametric Tobit model (Model 4.3.1 in the paper). All of the steps enumerated below are carried out on the sample used in Section 5 in the paper with 10 clusters defined as in the application.

1. For each cluster $k = 1, \dots, 10$ we run a Tobit regression of $X$ on a constant. Denote by $\hat\alpha_k$ the estimated constant and by $\hat\sigma^2_{\eta,k}$ the estimated variance from this Tobit regression. Let $e_k = (0,\dots,0,1,0,\dots,0)'$ be the $k$-th $10 \times 1$ canonical vector; then we estimate

\[
\hat E[X^*|X^* \le 0, \hat C_{10} = e_k] = \hat\alpha_k - \hat\sigma_{\eta,k}\,\lambda\left(-\frac{\hat\alpha_k}{\hat\sigma_{\eta,k}}\right).
\]

2. We run an OLS regression of $Y$ on $X$, $Z$, and $X + \hat E[X^*|X^* \le 0, \hat C_{10} = e_k]\,1(X = 0)$, and denote by $\hat\beta$, $\hat a$ and $\hat\delta$ the respective estimated coefficients. We denote by $\hat\sigma^2_{\epsilon,k}$ the variance (conditional on $\hat C_{10} = e_k$) of the residuals from this regression.

3. For each $k$, we estimate $\hat F_{X|\hat C_{10}=e_k}(0)$ by calculating the proportion of observations with $X_i = 0$ among all observations with $\hat C_{10i} = e_k$ (i.e., all observations in cluster $k$). We also estimate $\hat p_k = \hat P(\hat C_{10} = e_k)$.

4. We estimate the vector $\hat\gamma = \hat a + \hat\alpha\hat\delta$, where $\alpha = (\alpha_1, \dots, \alpha_{10})'$.

5. We estimate

\[
\hat\sigma^2_{\eta,H} = \frac{1}{\sum_{k=1}^{10}\hat p_k^2}\sum_{k=1}^{10}\hat p_k^2\,\hat\sigma^2_{\eta,k}, \qquad \sigma^2_{\eta,k,\text{Mix}} = \hat\sigma^2_{\eta,k} - 49.
\]

6. We calculate

\[
\hat\sigma^2_{\varepsilon,k} = \hat\sigma^2_{\epsilon,k} - \hat\delta^2\hat\sigma^2_{\eta,k}\left(1 + \frac{\hat\alpha_k}{\hat\sigma_{\eta,k}}\,\lambda\left(-\frac{\hat\alpha_k}{\hat\sigma_{\eta,k}}\right) - \lambda^2\left(-\frac{\hat\alpha_k}{\hat\sigma_{\eta,k}}\right)\right).
\]

3.2 Drawing the Monte Carlo Samples

With the parameters estimated in the previous section in hand, we next produce the Monte Carlo samples using the general scheme: for observation $j \in \{1, 2, \dots, N\}$:

1. $Z_j$ is a random draw from the entire sample of $Z_i$'s in the data set used in Section 5 of the paper. Suppose that $Z_j$ belongs to the cluster $k$ in the original analysis sample.

2. $\varepsilon_j$ is a random draw from a $N(0, \hat\sigma^2_{\varepsilon,k})$ distribution.

3. $\eta_j$ is a random draw from one of the distributions below, depending on the simulation:²

• Homoskedastic Normal: Draw $\eta_j$ from a $N(0, \hat\sigma^2_{\eta,H})$ distribution.

• Heteroskedastic Normal: Draw $\eta_j$ from a $N(0, \hat\sigma^2_{\eta,k})$ distribution.

• Heteroskedastic Logistic: Draw $\eta_j$ from a $\text{Logistic}(0, \sqrt{3}\,\hat\sigma_{\eta,k}/\pi)$, where $\pi$ is the mathematical constant pi.

• Heteroskedastic Triangular: Let $u_j$ be a draw from the Uniform[0,1] distribution. Then

\[
\eta_j = \begin{cases} \sqrt{3}\,\hat\sigma_{\eta,k}\big(2\sqrt{u_j} - \sqrt{2}\big), & \text{if } u_j < 0.5 \\ \sqrt{3}\,\hat\sigma_{\eta,k}\big(\sqrt{2} - 2\sqrt{1-u_j}\big), & \text{if } u_j \ge 0.5 \end{cases}
\]

• Heteroskedastic Uniform: Draw $\eta_j$ from a uniform distribution in the interval $\big[-\sqrt{3}\,\hat\sigma_{\eta,k}, \sqrt{3}\,\hat\sigma_{\eta,k}\big]$.

² Note that although we are drawing $\eta_j$ from distributions with zero means, we are not operating under a zero mean assumption on the confounder. The mean of the confounder may be different from zero and is incorporated implicitly through the $\hat\alpha_k$ in step 4.

• Heteroskedastic Symmetric Mixture Normal: Let $u_j$ be a draw from a Uniform[0,1] distribution. Let $\tilde\eta_j^{(1)}$ be a draw from $N(7, \sigma^2_{\eta,k,\text{Mix}})$ and $\tilde\eta_j^{(2)}$ be a draw from $N(-7, \sigma^2_{\eta,k,\text{Mix}})$. Then, we obtain $\eta_j$ as follows:

\[
\eta_j = \begin{cases} \tilde\eta_j^{(1)}, & \text{if } u_j < 0.5 \\ \tilde\eta_j^{(2)}, & \text{if } u_j \ge 0.5 \end{cases}
\]

4. We calculate $X_j = \max\{0, \hat\alpha_k + \eta_j\}$.

5. We calculate $Y_j = \hat\beta X_j + Z_j'\hat\gamma + \hat\delta\eta_j + \varepsilon_j$.

6. We keep $(Y_j, X_j, Z_j', \hat C_{10j}')$. This is one observation.

7. For each of the $M$ Monte Carlo samples of size $N$, indexed by $r$, we estimate $\hat\beta_r(s)$ separately for each identification strategy $s \in \{$uncorrected with controls, Tobit, semiparametric Tobit, tail symmetry$\}$. We then calculate the bias for each strategy as

\[
\text{Bias}(s) = M^{-1}\sum_{r}\big(\hat\beta_r(s) - \hat\beta\big),
\]

where $\hat\beta$ is the true value of the parameter in this Monte Carlo, obtained in step 2 of the previous section (this is equivalent to $\hat\beta$ in column (iv) in Table 1 of the paper). We calculate the standard deviation as

\[
\text{SD}(s) = \sqrt{(M-1)^{-1}\sum_{r}\big(\hat\beta_r(s) - \bar{\hat\beta}(s)\big)^2},
\]

where $\bar{\hat\beta}(s) = M^{-1}\sum_r \hat\beta_r(s)$. We set $M = 10{,}000$ and consider three sample sizes: $N = 500$, $N = 1{,}000$, and $N = 5{,}000$.

3.3 Monte Carlo Results

Table 1 presents the Monte Carlo results for non-cognitive skills. The results for cognitive skills yield similar conclusions and are therefore omitted for brevity. We report both the average bias and the standard deviation of the Monte Carlo estimates in percentage points of the standard deviation of the outcome variable, the same units as in Table 1 of the paper (Section 5).

Irrespective of the sample size $N$, and irrespective of the distribution of $\eta|Z$, the bias of the uncorrected approach with controls (column (ii)) is very large. In all rows, a t-test strongly rejects that the bias is equal to zero. In the next columns of the table, we show the results using the correction strategies proposed in the paper under different distributional assumptions on $\eta|Z$. It is immediate that all of the correction strategies yield much smaller biases than the uncorrected strategy irrespective of the sample size and the true distribution of $\eta|Z$.
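The simulation scheme of Section 3.2 can be sketched as follows. This is a minimal illustration rather than the authors' code: it covers only the draw of $\eta_j$, the censoring step $X_j = \max\{0, \hat\alpha_k + \eta_j\}$, and the bias and standard deviation summaries, and it assumes the calibrated values are supplied by the caller. The function names `draw_eta`, `simulate_x`, and `bias_and_sd` are hypothetical. Each distribution is scaled so that $\eta_j$ has mean zero and standard deviation `sigma`.

```python
import math
import random

def draw_eta(dist, sigma, rng):
    """One draw of eta with mean 0 and standard deviation sigma."""
    if dist == "normal":
        return rng.gauss(0.0, sigma)
    if dist == "uniform":
        return rng.uniform(-math.sqrt(3.0) * sigma, math.sqrt(3.0) * sigma)
    u = rng.random()
    if dist == "logistic":
        # Inverse CDF of Logistic(0, s) with scale s = sqrt(3) * sigma / pi
        u = min(max(u, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        return (math.sqrt(3.0) * sigma / math.pi) * math.log(u / (1.0 - u))
    if dist == "triangular":
        if u < 0.5:
            return math.sqrt(3.0) * sigma * (2.0 * math.sqrt(u) - math.sqrt(2.0))
        return math.sqrt(3.0) * sigma * (math.sqrt(2.0) - 2.0 * math.sqrt(1.0 - u))
    if dist == "mixture":
        # Equal-weight mixture of N(7, sigma^2 - 49) and N(-7, sigma^2 - 49);
        # requires sigma > 7
        sd_mix = math.sqrt(sigma ** 2 - 49.0)
        return rng.gauss(7.0 if u < 0.5 else -7.0, sd_mix)
    raise ValueError("unknown distribution: " + dist)

def simulate_x(alpha_k, eta):
    """Censored treatment X = max{0, alpha_k + eta}."""
    return max(0.0, alpha_k + eta)

def bias_and_sd(estimates, beta_true):
    """Monte Carlo bias and standard deviation of a list of estimates."""
    m = len(estimates)
    mean = sum(estimates) / m
    sd = math.sqrt(sum((b - mean) ** 2 for b in estimates) / (m - 1))
    return mean - beta_true, sd
```

A full replication would also draw $Z_j$ and $\varepsilon_j$, build $Y_j$, and re-estimate each correction strategy on every simulated sample before passing the resulting estimates to `bias_and_sd`.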

Table 1: Monte Carlo Results, Non-Cognitive Skills

                   (ii)              (iii)             (iv)              (v)
                   Uncorrected       Homoskedastic     Semiparametric    Conditional
                   w/ Controls       Tobit             Tobit             Tail Symmetry
                   Bias     SD       Bias     SD       Bias     SD       Bias     SD
N=500
Hom. Normal        3.10**   0.45     0.08     3.50     0.24     3.37     0.42     3.43
Het. Normal        3.10**   0.46     0.03     3.56     0.21     3.39     0.38     3.47
Het. Logistic      3.12**   0.47    -0.85     3.86    -0.59     3.69     0.20     3.15
Het. Triangular    3.41**   0.30     0.16     1.45     0.27     1.42     0.33     1.45
Het. Uniform       3.00**   0.43     1.49     3.01     1.55     2.84     0.84     4.63
Het. Mixt. Norm.   3.07**   0.45     0.44     3.48     0.56     3.32     0.45     3.71
N=1,000
Hom. Normal        3.09**   0.31    -0.00     2.42     0.08     2.37     0.17     2.38
Het. Normal        3.09**   0.31    -0.02     2.47     0.08     2.40     0.15     2.42
Het. Logistic      3.11**   0.32    -0.78     2.67    -0.64     2.61     0.12     2.18
Het. Triangular    3.40**   0.21     0.16     1.00     0.19     1.00     0.18     1.03
Het. Uniform       2.99**   0.30     1.55     2.08     1.57     2.01     0.60     3.57
Het. Mixt. Norm.   3.06**   0.31     0.36     2.35     0.42     2.30     0.19     2.59
N=5,000
Hom. Normal        3.09**   0.14     0.02     1.07     0.03     1.06     0.05     1.07
Het. Normal        3.09**   0.14    -0.00     1.09     0.03     1.09     0.05     1.09
Het. Logistic      3.12**   0.14    -0.79     1.17    -0.74     1.16     0.01     0.95
Het. Triangular    3.40**   0.09     0.16     0.44     0.13     0.44     0.03     0.46
Het. Uniform       3.00**   0.13     1.54*    0.91     1.53*    0.89     0.12     1.73
Het. Mixt. Norm.   3.07**   0.13     0.38     1.05     0.38     1.04     0.05     1.18

Note: The values are reported in percentage points of the standard deviation of the outcome variable, as in Table 1 of the paper. Columns (iii), (iv) and (v) show results for corrections using Models 4.2.1, 4.3.1 and 4.3.2 in the paper, respectively. Simulations are based on 10,000 drawn samples of size N using 10 clusters. For each panel representing a different sample size N, each row represents the results from a Monte Carlo assuming a different distribution of η|Z. **: significant at the 5% level. *: significant at the 10% level.

Although all three correction strategies perform substantially better than the uncorrected strategy, a comparison among them reveals interesting patterns.
First, the biases of the three correction methods are almost never significantly different from zero, including all instances in which the distribution assumption is wrong. The only exception is when the true distribution is the uniform and the correction strategies assume normality instead (columns (iii) and (iv)). Second, irrespective of the distribution of η|Z, the tail symmetry assumption yields biases that decline in magnitude as N grows. Third, both Tobit strategies perform better than the tail symmetry strategy under normality (first two rows in each panel), though the difference diminishes substantially when the sample increases. Finally, note that the standard errors reported by the uncorrected approach are much smaller than the standard errors reported by any of the corrected approaches. This mirrors what we find in the paper (Table 1 in Section 5).

References

Caetano, C., Caetano, G., and Nielsen, E. (2020). Should children do more enrichment activities? Leveraging bunching to correct for endogeneity. FEDS Working Paper No. 2020-036.

Cite this document

APA

Carolina Caetano, Gregorio Caetano, & Eric Nielsen (2020). Correcting for Endogeneity in Models with Bunching (FEDS 2020-080). Board of Governors of the Federal Reserve System, Finance and Economics Discussion Series. https://whenthefedspeaks.com/doc/feds_2020-080

BibTeX

@techreport{wtfs_feds_2020_080,
  author = {Carolina Caetano and Gregorio Caetano and Eric Nielsen},
  title = {Correcting for Endogeneity in Models with Bunching},
  type = {Finance and Economics Discussion Series},
  number = {2020-080},
  institution = {Board of Governors of the Federal Reserve System},
  year = {2020},
  url = {https://whenthefedspeaks.com/doc/feds_2020-080},
  abstract = {We show that in models with endogeneity, bunching at the lower or upper boundary of the distribution of the treatment variable may be used to build a correction for endogeneity. We derive the asymptotic distribution of the parameters of the corrected model, provide an estimator of the standard errors, and prove the consistency of the bootstrap. An empirical application reveals that time spent watching television, corrected for endogeneity, has roughly no net effect on cognitive skills and a significant negative net effect on non-cognitive skills in children.},
}