feds · May 21, 2018

Density Forecasts in Panel Data Models: A Semiparametric Bayesian Perspective

Abstract

This paper constructs individual-specific density forecasts for a panel of firms or households using a dynamic linear model with common and heterogeneous coefficients and cross-sectional heteroskedasticity. The panel considered in this paper features a large cross-sectional dimension N but short time series T. Due to the short T, traditional methods have difficulty in disentangling the heterogeneous parameters from the shocks, which contaminates the estimates of the heterogeneous parameters. To tackle this problem, I assume that there is an underlying distribution of heterogeneous parameters, model this distribution nonparametrically allowing for correlation between heterogeneous parameters and initial conditions as well as individual-specific regressors, and then estimate this distribution by pooling the information from the whole cross-section together. Theoretically, I prove that both the estimated common parameters and the estimated distribution of the heterogeneous parameters achieve posterior consistency, and that the density forecasts asymptotically converge to the oracle forecast. Methodologically, I develop a simulation-based posterior sampling algorithm specifically addressing the nonparametric density estimation of unobserved heterogeneous parameters. Monte Carlo simulations and an application to young firm dynamics demonstrate improvements in density forecasts relative to alternative approaches. Accessible materials (.zip)

Finance and Economics Discussion Series Divisions of Research & Statistics and Monetary Affairs Federal Reserve Board, Washington, D.C. Density Forecasts in Panel Data Models: A Semiparametric Bayesian Perspective Laura Liu 2018-036 Please cite this paper as: Liu, Laura (2018). “Density Forecasts in Panel Data Models: A Semiparametric Bayesian Perspective,” Finance and Economics Discussion Series 2018-036. Washington: Board of Governors of the Federal Reserve System, https://doi.org/10.17016/FEDS.2018.036. NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. References in publications to the Finance and Economics Discussion Series (other than acknowledgement) should be cleared with the author(s) to protect the tentative character of these papers.

Density Forecasts in Panel Data Models: ∗ A Semiparametric Bayesian Perspective Laura Liu† March 18, 2018 Abstract This paper constructs individual-specific density forecasts for a panel of firms or households using a dynamic linear model with common and heterogeneous coefficients and cross-sectional heteroskedasticity. The panel considered in this paper features a large cross-sectional dimension N but short time series T. Due to the short T, traditional methods have difficulty in disentangling the heterogeneous parameters from the shocks, which contaminates the estimates of the heterogeneous parameters. To tackle this problem, I assume that there is an underlying distributionofheterogeneousparameters,modelthisdistributionnonparametricallyallowingfor correlationbetweenheterogeneousparametersandinitialconditionsaswellasindividual-specific regressors, and then estimate this distribution by pooling the information from the whole crosssection together. Theoretically, I prove that both the estimated common parameters and the estimated distribution of the heterogeneous parameters achieve posterior consistency, and that the density forecasts asymptotically converge to the oracle forecast. Methodologically, I developasimulation-basedposteriorsamplingalgorithmspecificallyaddressingthenonparametric density estimation of unobserved heterogeneous parameters. Monte Carlo simulations and an application to young firm dynamics demonstrate improvements in density forecasts relative to alternative approaches. JEL Codes: C11, C14, C23, C53, L25 Keywords: Bayesian, Semiparametric Methods, Panel Data, Density Forecasts, Posterior Consistency, Young Firms Dynamics ∗Firstversion: November15,2016. Latestversion: https://goo.gl/8zZZwn. Iamindebtedtomyadvisors,Francis X. Diebold and Frank Schorfheide, for much help and guidance at all stages of this research project. I also thank the other members of my committee, Xu Cheng and Francis J. DiTraglia, for their advice and support. I further benefited from many helpful discussions with St´ephane Bonhomme, Benjamin Connault, Hyungsik R. Moon, and seminarparticipantsattheUniversityofPennsylvania,theFederalReserveBankofPhiladelphia,theFederalReserve Board, the University of Virginia, Microsoft, the University of California, Berkeley, the University of California, San Diego (Rady), Boston University, the University of Illinois at Urbana–Champaign, Princeton University, and Libera Universita` di Bolzano, as well as conference participants at the 26th Annual Meeting of the Midwest Econometrics Group, NBER-NSF Seminar on Bayesian Inference in Econometrics and Statistics, the 11th Conference on Bayesian Nonparametrics,MicroeconometricsClassof2017Conference,InteractionsWorkshop2017,andFirstItalianWorkshop ofEconometricsandEmpiricalEconomics: PanelDataModelsandApplications. Iwouldalsoliketoacknowledgethe Kauffman Foundation and the NORC Data Enclave for providing researcher support and access to the confidential microdata. All remaining errors are my own. This paper does not necessarily reflect the views of the Board of Governors or the Federal Reserve System. †Federal Reserve Board, laura.liu@frb.gov.

1 Introduction Panel data, such as a collection of firms or households observed repeatedly for a number of periods, are widely used in empirical studies and can be useful for forecasting individuals’ future outcomes, which is interesting and important in many applications. For example, PSID can be used to analyze income dynamics (Hirano, 2002; Gu and Koenker, 2017b), and bank balance sheet data can help conduct bank stress tests (Liu et al., 2017). This paper constructs individual-specific density forecasts using a dynamic linear panel data model with common and heterogeneous parameters and cross-sectional heteroskedasticity. In this paper, I consider young firm dynamics as the empirical application. For illustrative purposes, let us consider a simple dynamic panel data model as the baseline setup: y = βy + λ + u , u ∼ N (cid:0) 0,σ2(cid:1) , (1.1) it i,t−1 i it it (cid:124)(cid:123)(cid:122)(cid:125) (cid:124)(cid:123)(cid:122)(cid:125) (cid:124)(cid:123)(cid:122)(cid:125) performance skill shock where i = 1,··· ,N, and t = 1,··· ,T + 1. y is the observed firm performance such as the log it of employment,1 λ is the unobserved skill of an individual firm, and u is an i.i.d. shock. Skill i it is independent of the shock, and the shock is independent across firms and times. β and σ2 are common across firms, where β represents the persistence of the dynamic pattern and σ2 gives the size of the shocks. Based on the observed panel from period 0 to period T, I am interested in forecasting the future performance of any specific firm in period T +1, y .2 i,T+1 The panel considered in this paper features a large cross-sectional dimension N but short time series T.3 This framework is appealing to the young firm dynamics example because the number of observations for each young firm is restricted by its age.4 Good estimates of the unobserved skill λ s facilitate good forecasts of y s. Because of the short T, traditional methods have difficulty i i,T+1 in disentangling the unobserved skill λ from the shock u , which contaminates the estimates of λ . i it i The naive estimators that only utilize the firm-specific observations are inconsistent, even if N goes to infinity. 1Employment is one of the standard measures in the firm dynamics literature (Akcigit and Kerr, 2016; Zarutskie and Yang, 2015). 2In the main body of the paper, I consider a more general specification that accommodates many important features of real-world empirical studies, such as strictly exogenous and predetermined covariates, correlated random coefficients, and cross-sectional heteroskedasticity. 3Which T can be considered small depends on the dimension of individual heterogeneity (which can be multidimensional in the general model), the cross-sectional dimension, and the size of the shocks. There can still be a significant gain in density forecasts even when T exceeds 100 in simulations with fairly standard data generating processes. Roughly speaking, the proposed predictor would provide a sizable improvement as long as the time series for individual i is not informative enough to fully reveal its individual effects. 4Although I describe the econometric intuition using the young firm dynamics application as an example, the method can be applied to many other economic and financial analyses that feature panel data with relatively large N and small T, such as microeconomic panel surveys (e.g. PSID, NLSY, and Consumer Expenditure Survey (CE)), macroeconomic sectoral and regional panel data (e.g. Industrial Production (IP) and State and Metro Area Employment,Hours,andEarnings(SAE)),andfinancialinstitutionperformance(e.g.commercialbankandholdingcompany data). 1

To tackle this problem, I assume that λ is drawn from an underlying skill distribution f and i estimate this distribution by pooling the information from the whole cross-section. In terms of modeling f, the parametric Gaussian density misses many features in real-world data, such as asymmetricity, heavy tails, and multiple peaks. For example, as good ideas are scarce, the skill distribution of young firms may be highly skewed. In this sense, the challenge now is how we can model f more carefully and flexibly. Here I estimate f via a nonparametric Bayesian approach where the prior is constructed from a mixture model and allows for correlation between λ and the i initial condition y (i.e. a correlated random effects model). i0 Conditional on f, we can, intuitively speaking, treat it as a prior distribution and combine it with firm-specific data to obtain the firm-specific posterior. In a special case where the common parameters are set to β = 0 and σ2 = 1, the firm-specific posterior is characterized by Bayes’ theorem, p(y |λ )f(λ |y ) ´ i,1:T i i i0 p(λ |f,y ) = . (1.2) i i,0:T p(y |λ )f(λ |y )dλ i,1:T i i i0 i This firm-specific posterior helps provide a better inference about the unobserved skill λ of each i individualfirmandabetterforecastofthefirm-specificfutureperformance,thankstotheunderlying distribution f that integrates the information from the whole panel in an efficient and flexible way.5 It is natural to construct density forecasts based on the firm-specific posterior. In general, forecasting can be done in point, interval, or density fashion, with density forecasts giving the richest insight regarding future outcomes. By definition, a density forecast provides a predictive distribution of firm i’s future performance and summarizes all sources of uncertainties; hence, it is preferableinthecontextofyoungfirmdynamicsandotherapplicationswithlargeuncertaintiesand nonstandard distributions. In particular, for the dynamic panel data model as specified in equation (1.1), the density forecasts reflect uncertainties arising from the future shock u , individual i,T+1 heterogeneity λ , and estimation uncertainty of common parameters (cid:0) β,σ2(cid:1) and skill distribution i f. Moreover, once the density forecasts are obtained, one can easily recover the point and interval forecasts. The contributions of this paper are threefold. First, I establish the theoretical properties of the proposed Bayesian predictor when the cross-sectional dimension N tends to infinity. To begin, I provide conditions for identifying both the parametric component and the nonparametric component.6 Then, I prove that both the estimated common parameters and the estimated distribution of the individual heterogeneity achieve posterior consistency in strong topology. Compared with previous literature on posterior consistency, there are several challenges in the panel data framework: (1) a deconvolution problem disentangling unobserved individual effects and independent shocks, 5Notethatthisisonlyanintuitiveexplanationwhytheskilldistributionf iscrucial. Intheactualimplementation, the estimation of the correlated random effect distribution f, the estimation of common parameters (cid:0) β,σ2(cid:1) , and the inference of firm-specific skill λ are all done simultaneously. i 6In the baseline model, the parametric component is (cid:0) β,σ2(cid:1) , and the nonparametric component is f. 2

(2) an unknown common shock size in cross-sectional homoskedastic cases, (3) unknown individualspecificshocksizesincross-sectionalheteroskedasticcases,(4)strictlyexogenousandpredetermined variables (including lagged dependent variables) as covariates, and (5) correlated random effects7 addressed by flexible conditional density estimation. Building on the posterior consistency of the estimates, we can bound the discrepancy between the proposed density predictor and the oracle to be arbitrarily small, where the oracle predictor is an (infeasible) benchmark that is defined as the individual-specific posterior predictive distribution under the assumption that the common parameters and the distribution of the heterogeneous parameters are known. Second, I develop a posterior sampling algorithm specifically addressing nonparametric density estimation of the unobserved individual effects. For a random effects model, which is a special case where the individual effects are independent of the conditioning variables,8 the f part becomes a relatively simple unconditional density estimation problem. I adopt a Dirichlet Process Mixture (DPM)priorforf andconstructaposteriorsamplerbuildingontheblockedGibbssamplerproposed by Ishwaran and James (2001, 2002). For a correlated random effects model, I further adapt the proposed algorithm to the much harder conditional density estimation problem using a probit stickbreaking process prior suggested by Pati et al. (2013). Third, Monte Carlo simulations demonstrate improvements in density forecasts relative to alternative predictors with various parametric priors on f, evaluated by the log predictive score. An application to young firm dynamics also shows that the proposed predictor provides more accurate densitypredictions. Thebetterforecastingperformanceislargelyduetothreekeyfeatures(inorder ofimportance): thenonparametricBayesianprior, cross-sectionalheteroskedasticity, andcorrelated randomcoefficients. Theestimatedmodelalsohelpsshedlightonthelatentheterogeneitystructure of firm-specific coefficients and cross-sectional heteroskedasticity, as well as whether and how these unobserved heterogeneous features depend on the initial condition of the firms. Moreover, the method proposed in this paper is applicable beyond forecasting. Here estimating heterogeneous parameters is important because we want to generate good forecasts, but in other cases, the heterogeneous parameters themselves could be the objects of interest. For example, the technique developed here can be adapted to infer individual-specific treatment effects. Related Literature First, this paper contributes to the literature on individual forecasts in a panel data setup, and is closely related to Liu et al. (2017) and Gu and Koenker (2017a,b). Liu et al. (2017) focus on point forecasts. They utilize the idea of Tweedie’s formula to steer away from the complicated deconvolution problem in estimating λ and establish the ratio optimality i of point forecasts. Unfortunately, the Tweedie shortcut is not applicable to the inference of the 7In heteroskedastic cases, the terminologies “random effects” and “correlated random effects” also apply to individual-specific σ2, which is slightly different from the traditional definitions focusing on λ . i i 8Theconditioningsetcanincludeinitialvaluesofpredeterminedcovariatesandentiresequencesofstrictlyexogenous covariates. 3

underlying λ distribution and therefore not suitable for density forecasts. In addition, this paper i addresses cross-sectional heteroskedasticity where σ2 is an unobserved random quantity, while Liu i et al. (2017) incorporate cross-sectional and time-varying heteroskedasticity via a deterministic function of observed conditioning variables.9 Gu and Koenker (2017b) address the density estimation problem, but with a different method. This paper infers the underlying λ distribution via a full Bayesian approach (i.e. imposing a prior i on the λ distribution and updating the prior belief by the observed data), whereas they employ an i empirical Bayes procedure (i.e. picking the λ distribution by maximizing the marginal likelihood i of data). In principle, the full Bayesian approach is preferable for density forecasts, as it captures all kinds of uncertainties, including estimation uncertainty of the underlying λ distribution, which i has been omitted by the empirical Bayes procedure. In addition, this paper features correlated random effects allowing for cross-sectional heterogeneity (including cross-sectional heteroskedasticity) interacting with the initial conditions, whereas the Gu and Koenker (2017b) approach focuses on random effects models without this interaction. In their recent paper, Gu and Koenker (2017a) also compare their method with an alternative nonparametric Bayesian estimator featuring a Dirichlet Process (DP) prior under a set of fixed scale parameters. There are two major differences between their DP setup and the DPM prior used in this paper. First, the DPM prior provides continuous individual effect distributions, which is the case in many empirical setups. Second, unlike their set of fixed scale parameters, this paper incorporates a hyperprior for the scale parameter and updates it via the observed data, hence let the data choose the complexity of the mixture approximation, which can essentially be viewed as an“automatic”model selection.10 There have also been empirical works on the DPM model with panel data,11 such as Hirano (2002),BurdaandHarding(2013),Rossi(2014),andJensenetal.(2015),buttheyfocusonempirical studies rather than theoretical analysis. Hirano (2002) and Jensen et al. (2015) use linear panel modelswithsetupsbeingslightlydifferentfromthispaper. Hirano(2002)considersflexibilityinthe u distribution instead of the λ distribution. Jensen et al. (2015) assume random effects instead of it i correlated random effects. Burda and Harding (2013) and Rossi (2014) implement nonlinear panel data models via either a probit model or a logit model, respectively. Among others, Li and Vuong (1998), Delaigle et al. (2008), Evdokimov (2010), and Hu (2017) have studied the similar deconvolution problem and estimated the λ distribution in a frequentist i 9Note that, in this paper, the identification restriction to ensure unobserved random σ2 can only permit timei varying distribution for v while keeping zero mean and unit variance (see Remark 3.2 (iv)). However, considering it that this paper focuses on the scenarios with a short time dimension, lack of time-varying heteroskedasticity would not be a major concern. 10Section 5 shows the simulation results comparing the DP prior vs the DPM prior, where both incorporate a hyperprior for the scale parameter. 11ForearlierworksregardingfullBayesiananalyseswithparametricpriorsonλ ,seeLancaster(2002)(orthogonal i reparametrization and a flat prior), Chamberlain and Hirano (1999), Chib and Carlin (1999), Sims (2000) (Gaussian prior), and Chib (2008) (student-t and finite mixture priors). 4

way. Also see Compiani and Kitamura (2016) for a review of frequentist applications of mixture models. However, the frequentist approach misses estimation uncertainty, which matters in density forecasts, as mentioned previously. Second, in terms of asymptotic properties, this paper relates to the literature on posterior consistency of nonparametric Bayesian methods in density estimation problems. See the handbooks, Ghosh and Ramamoorthi (2003), Hjort et al. (2010), and Ghosal and van der Vaart (2017), for a more thorough review and discussion on posterior consistency in Bayesian nonparametric problems. In particular, Canale and De Blasi (2017) relax the tail conditions to accommodate multivariate location-scale mixtures for unconditional density estimation. To handle conditional density estimation, the mixing probabilities can be characterized by a multinomial choice model (Norets, 2010; Norets and Pelenis, 2012), a kernel stick-breaking process (Norets and Pelenis, 2014; Pelenis, 2014; Norets and Pati, 2017), or a probit stick-breaking process (Pati et al., 2013). I adopt the Pati et al. (2013) approach to offer a more coherent nonparametric framework that is more flexible in the conditional measure. This paper builds on these previous works and establishes the strong consistency for a multivariate conditional density estimator featuring infinite location-scale mixtures with a probit stick-breaking process. Then, this paper further takes into account the deconvolution and dynamic panel data structure, as well as obtains the convergence of the proposed predictor to the oracle predictor in strong topology. Third, the algorithms constructed in this paper build on the literature on the posterior sampling schemes for DPM models. The vast Markov chain Monte Carlo (MCMC) algorithms can be divided into two general categories. One is the P´olya urn style samplers that marginalize over the unknown distribution (Escobar and West, 1995; Neal, 2000). The other resorts to the stick-breaking process (Sethuraman,1994)anddirectlyincorporatestheunknowndistributionintothesamplingprocedure. This paper utilizes a sampler from the second category, the blocked Gibbs sampler by Ishwaran and James (2001, 2002), as a building block for the proposed algorithm. It incorporates truncation approximation and augments the data with auxiliary component probabilities, which breaks down thecomplexposteriorstructureandthusenhancesmixingpropertiesaswellasreducescomputation time.12 I further adapt the proposed algorithm to the conditional density estimation for correlated random effects using the probit stick-breaking process prior suggested by Pati et al. (2013). Last but not least, the empirical application in this paper also links to the young firm dynamics literature. Akcigit and Kerr (2016) document the fact that R&D intensive firms grow faster, and such boosting effects are more prominent for smaller firms. Robb and Seamans (2014) examine the role of R&D in capital structure and performance of young firms. Zarutskie and Yang (2015) present some empirical evidence that young firms experienced sizable setbacks during the recent recession, which may partly account for the slow and jobless recovery. See the handbook by Hall 12Robustness checks have been conducted with the more sophisticated slice-retrospective sampler (Dunson, 2009; Yau et al., 2011; Hastie et al., 2015), which does not involve hard truncation but is more complicated to implement. Results from the slice-retrospective sampler are comparable to the simpler truncation sampler. 5

andRosenberg(2010)forathoroughreviewonyoungfirminnovation. Theempiricalanalysisofthis paper builds on these previous findings. Besides more accurate density forecasts, we can also obtain the latent heterogeneity structure of firm-specific coefficients and cross-sectional heteroskedasticity. The rest of the paper is organized as follows. Section 2 introduces the general panel data model, the predictors for density forecasts, and the nonparametric Bayesian priors. Section 3 characterizes identification conditions and large sample properties. Section 4 proposes the posterior sampling algorithms. Section 5 examines the performance of the semiparametric Bayesian predictor using simulated data, and Section 6 applies the proposed predictor to the confidential microdata from the Kauffman Firm Survey and analyzes the empirical findings on young firm dynamics. Finally, Section 7 concludes and sketches future research directions. Notations, proofs, as well as additional algorithms and results are in the Appendix. 2 Model 2.1 General Panel Data Model The general panel data model with (correlated) random coefficients and potential cross-sectional heteroskedasticity can be specified as y = β(cid:48)x +λ(cid:48)w +u , u ∼ N (cid:0) 0,σ2(cid:1) (2.1) it i,t−1 i i,t−1 it it i where i = 1,··· ,N, and t = 1,··· ,T +1. Similar to the baseline setup in equation (1.1), y is the it observed individual outcome, such as young firm performance. The main goal of this paper is to estimate the model using the sample from period 1 to period T and forecast the future distribution of y for any individual i. In the remainder of the paper, I focus on the case where h = 1 (i.e. i,T+h one-period-ahead forecasts) for notation simplicity, but the discussion can be extended to multiperiod-ahead forecasts via either a direct or an iterated approach (Marcellino et al., 2006). w is a vector of observed covariates that have heterogeneous effects on the outcomes, with i,t−1 λ being the unobserved heterogeneous coefficients. w is strictly exogenous and captures the i i,t−1 key sources of individual heterogeneity. The simplest choice would be w = 1, where λ can be i,t−1 i interpreted as an individual-specific intercept, i.e. firm i’s skill level in the baseline model (1.1). Moreover, it is also helpful to include other key covariates of interest whose effects are more diverse cross-sectionally; for example, R&D activities may benefit the young firms in different magnitudes. Furthermore, the current setup can also take into account deterministic or stochastic aggregate effects;forexample,differentyoungfirmsmayresponddifferentlytothefinancialcrisis. Fornotation (cid:104) (cid:105)(cid:48) clarity, I partition w = 1,wA(cid:48) ,wI(cid:48) , where wA stands for a vector of aggregate variables, i,t−1 t−1 i,t−1 t−1 and wI is composed of individual-specific variables. i,t−1 x is a vector of observed covariates that have homogeneous effects on the outcomes, and i,t−1 β is the corresponding vector of common parameters. x can be either strictly exogenous or i,t−1 6

(cid:104) (cid:105)(cid:48) predetermined, which can be further denoted as x = xO(cid:48) ,xP(cid:48) , where xO is the strictly i,t−1 i,t−1 i,t−1 i,t−1 exogenous part and xP is the predetermined part. The one-period-lagged outcome y is a i,t−1 i,t−1 typical candidate for xP in the dynamic panel data literature, which captures the persistence i,t−1 structure. In addition, both xO and xP can incorporate other general control variables, such i,t−1 i,t−1 as firm characteristics as well as local and national economic conditions, which help control for other sources of variation and facilitate forecasts. In addition, we let xP∗ denote the subgroup i,t−1 of xP excluding lagged outcomes and decompose β = (cid:2) βO(cid:48),βP∗(cid:48),ρ (cid:3)(cid:48) , where (cid:0) βO(cid:48),βP∗(cid:48),ρ (cid:1) are i,t−1 (cid:16) (cid:17) the coefficients corresponding to xO(cid:48) ,xP∗(cid:48) ,y , respectively. Here, the distinction between i,t−1 i,t−1 i,t−1 homogeneous effects β(cid:48)x versus heterogeneous effects λ(cid:48)w reveals the latent nonstandard i,t−1 i i,t−1 structures for the key effects while avoiding the curse-of-dimensionality problem. u is an individual-time-specific shock characterized by zero mean and potential cross-sectional it heteroskedasticity σ2,13 with cross-sectional homoskedasticity being a special case where σ2 = σ2. i i In a unified framework, denote ϑ as the common parameters, h as the individual heterogeneity, i and f as the underlying distribution of h . i ϑ = (cid:0) β,σ2(cid:1) , h = λ , in cross-sectional homoskedastic cases, i i ϑ = β, h = (cid:0) λ ,σ2(cid:1) , in cross-sectional heteroskedastic cases. i i i We can define the conditioning set at period t to be c = (cid:0) xP ,xO ,wA ,wI (cid:1) . (2.2) i,t−1 i,0:t−1 i,0:T 0:T i,0:T Note that as xP is predetermined, the sequences of xP in the conditioning set c start from i,t−1 i,t−1 i,t−1 period 0 to period t − 1; xO , wA , and wI are strictly exogenous, so the conditioning set i,t−1 t−1 i,t−1 c contains their entire sequences. Moreover, we can define the part of c that is composed i,t−1 i,t−1 (cid:16) (cid:17) of individual-specific variables as c∗ = xP ,xO ,wI . Then, we can further denote i,t−1 i,0:t−1 i,0:T i,0:T (cid:16) (cid:17) D = {D }N ,D as a shorthand for the data sample used for estimation, which constitutes the i i=1 A conditioning set for posterior inference. D = c∗ contains the observed data for individual i, and i i,T D = wA contains the aggregate regressors with heterogeneous effects. A 0:T Asstressedinthemotivation,theunderlyingdistributionofindividualeffectsisthekeytobetter density forecasts. In the literature, there are usually two kinds of assumptions imposed on this distribution. One is the random coefficients model, where the individual effects h are independent i of the conditioning variables c , which include initial values of predetermined covariates and full i0 sequences of strictly exogenous covariates.14 The other is the correlated random coefficients model, where h and c could be correlated with each other. This paper considers both random coefficients i i0 13In many empirical applications, such as the young firm analysis in Section 6, risk may largely vary over the cross-section, which also contributes considerably to more precise density forecasts. 14In the baseline setup as a special case, the conditioning set is a singleton with the initial outcome y being the i0 only element. 7

and correlated random coefficients models while focusing on the latter. The random coefficients model is more parsimonious and easier to implement, but the correlated random coefficients model is more realistic for young firm dynamics as well as many other empirical setups,15 and random coefficients can be viewed as a special case of correlated random coefficients with zero dependence. In practice, it is not necessary to incorporate all initial values of the predetermined variables and the whole series of the strictly exogenous variables. It is more feasible to only take into account a subset of c or a function of c that is relevant for the specific analysis. i0 i0 Extension: Unbalanced Panels The above setup can be extended to unbalanced panels with randomly omitted observations, which incorporates more data into the estimation and elicits more information for the prediction. Conditional on the covariates, the common parameters, and the distributions of individual heterogeneities, y s are cross-sectionally independent, so the theoretical i,t+1 argument and numerical implementation are still valid in like manner. Let T denote the longest chain for individual i that has complete observations, from t to t . i 0i 1i That is, (y ,w ,x ) are observed for all t = t ,··· ,t . Then, I discard the unobserved it i,t−1 i,t−1 0i 1i periods and redefine the conditioning set at time t = 1,t ,··· ,t ,T +1 to be 0i 1i (cid:16) (cid:17) c = xP ,xO ,wA ,wI , (2.3) i,t−1 i,τi,t−1 i,τiT τiT i,τiT where the set for time periods τ = {0,t −1,··· ,t −1,T}∩{0,··· ,t−1}. Note that t can i,t−1 0i 1i i0 be 1, and t can be T, so this structure is also able to accommodate balanced panels. Accordingly, i1 (cid:16) (cid:17) the individual-specific component of c is c∗ = xP ,xO ,wI . i,t−1 i,t−1 i,τi,t−1 i,τiT i,τiT 2.2 Oracle and Feasible Predictors This subsection formally defines the infeasible optimal oracle predictor and the feasible semiparametric Bayesian predictor proposed in this paper. The kernel of both definitions relies on the conditional predictor, ˆ fcond (y|ϑ,f) = p(y|h ,ϑ,w ,x )· p(h |ϑ,f,D ,D )dh , (2.4) i,T+1 i iT iT i i A i (cid:124) (cid:123)(cid:122) (cid:125) (cid:124) (cid:123)(cid:122) (cid:125) futureshock individualheterogeneity which provides the density forecasts of y conditional on the common parameters ϑ, underlying i,T+1 distribution f, and individual i’s and aggregate data (D ,D ). The first term p(y|h ,ϑ,w ,x ) i A i iT iT captures individual i’s uncertainty due to the future shock u . The second term i,T+1 (cid:81)T p(y |h ,ϑ,w ,x )f(h |c ) p(h |ϑ,f,D ,D ) = ´ t=1 it i i,t−1 i,t−1 i i0 i i A (cid:81)T p(y |h ,ϑ,w ,x )f(h |c )dh t=1 it i i,t−1 i,t−1 i i0 i 15In the baseline setup, the correlated random coefficients model can be interpreted as saying that a young firm’s initial performance may reflect its underlying skill, which is a more sensible assumption. 8

is the individual-specific posterior. It characterizes individual i’s uncertainty due to heterogeneity that arises from insufficient time-series information to infer individual h . The common distribution i f helps in formulating this source of uncertainty and hence contributes to individual i’s density forecasts. The infeasible oracle predictor is defined as if we knew all the elements that can be consistently estimated. Specifically,theoracleknowsthecommonparametersϑ andtheunderlyingdistribution 0 f , but not the individual effects h . Then, the oracle predictor is formulated by plugging the true 0 i values (ϑ ,f ) into the conditional predictor in equation (2.4), 0 0 foracle(y) = fcond (y|ϑ ,f ). (2.5) i,T+1 i,T+1 0 0 In practice, (ϑ,f) are all unknown and need to be estimated, thus introducing another source of uncertainty. I adopt a conjugate prior for the common parameters ϑ (mulitvariate normal inverse gamma for cross-sectional homoskedastic cases and mulitvariate normal for cross-sectional heteroskedastic cases) in order to stay close to the linear regression framework. I resort to the nonparametric Bayesian prior (specified in the next subsection) to flexibly model the underlying distribution, which could better approximate the true distribution f , and the resulting feasible 0 predictor would be close to the oracle. Then, I update the prior belief using the observations from the whole panel and obtain the posterior. The semiparametric Bayesian predictor is constructed by integrating the conditional predictor over the posterior distribution of (ϑ,f),16 ˆ fsp (y) = fcond (y|ϑ,f) · dΠ(ϑ,f|D) dϑdf. (2.6) i,T+1 i,T+1 (cid:124) (cid:123)(cid:122) (cid:125) (cid:124) (cid:123)(cid:122) (cid:125) shock&heterogentity estimationuncertainty The conditional predictor reflects uncertainties due to future shock and individual heterogeneity, whereas the posterior of (ϑ,f) captures estimation uncertainty. Note that the inference of (ϑ,f) poolsinformationfromthewholecross-section; onceconditionedon(ϑ,f)andtheaggregateobservables D , individuals’ outcomes are independent across i, and only individual i’s data are further A needed for its density forecasts. 2.3 Nonparametric Bayesian Priors Aprioronthedistributionf canbeviewedasadistributionoverasetofdistributions. Amongother options, IchoosemixturemodelsforthenonparametricBayesianprior, becausemixturemodelscan effectively approximate a general class of distributions (see Section 3) while being relatively easy to implement (see Section 4). The specific functional form of the nonparametric Bayesian prior depends on whether f is characterized by a random coefficients model or a correlated random coefficients model. The correlated random coefficients setup is more involved but can be crucial in 16The superscript“sp”stands for“semiparametric”. 9

some empirical studies, such as the young firm dynamics application in this paper. Incross-sectionalheteroskedasticcases,Iincorporateanotherflexibleprioronthedistributionof σ2. Define l = log σ¯2(σ i 2−σ2) , where σ2 is some small positive number and σ¯2 is some large positive i i σ¯2−σ2 i number. Then, the support of fσ2 is bounded by (cid:2) σ2,σ¯2(cid:3) and thus satisfies the requirement for the 0 asymptotic convergence of the Bayesian estimates and density forecasts in Propositions 3.10, 3.13, and 3.14. This transformation ensures an unbounded support for l so that we can employ similar i prior structures to λ and l . Note that because λ and σ2 are independent with respect to each i i i i other, their mixture structures are completely separate. For a concise exposition, I define a generic variable z that can represent either λ or l, and then include z as a superscript to indicate whether a specific parameter belongs to the λ part or the l part. 2.3.1 Random Coefficients Model In the random coefficients model, the individual heterogeneity z (= λ or l ) is assumed to be indei i i pendent of the conditioning variables c , so the inference of the underlying distribution f can be i0 considered an unconditional density estimation problem. ThefirstcandidateistheDirichletProcess(DP),whichcastsadistributionoverasetofdiscrete distributions. We denote G ∼ DP (α,G ), where the base distribution G characterizes the center 0 0 of the DP, and the scale parameter α represents the precision (inverse-variance) of the DP. 17 However, considering the baseline model, imposing a DP prior on the distribution f means restricting firms’ skills to some discrete levels, which may not be very appealing for young firm dynamics as well as some other empirical applications. A natural extension is to assume z follows i a continuous parametric distribution f(z;θ) where θ are the parameters, and adopt a DP prior for the distribution of θ.18 Then, the parameters θ are discrete while the individual heterogeneity z enjoys a continuous distribution. This additional layer of mixture leads to the idea of the Dirichlet Process Mixture (DPM) model. For variables supported on the whole real space, like the individual heterogeneity z here, a typical choice of the kernel of f(z;θ) is a (multivariate) normal distribution with θ = (µ,Ω) being the mean and covariance matrix of the normal. With component label k, component probability p , and component parameters (µ ,Ω ), one k k k draw from the DPM prior can be written as an infinite location-scale mixture of (multivariate) normals, ∞ (cid:88) z ∼ p N (µ ,Ω ). (2.7) i k k k k=1 Different draws from the DPM prior are characterized by different combinations of {p ,µ ,Ω }, k k k and different combinations of {p ,µ ,Ω } lead to different shapes of f. That is why the DPM k k k 17See Appendix A for a formal definition of DP. 18Here and below, I suppress the superscript z in the parameters when there is no confusion. 10

prior is flexible enough to approximate many distributions. The component parameters (µ ,Ω ) are k k directly drawn from the DP base distribution G , which is chosen to be the conjugate multivariate- 0 normal-inverse-Wishart distribution (or a normal-inverse-gamma distribution if z is a scalar). The i component probability p is constructed via the stick-breaking process governed by the DP scale k parameter α. (µ ,Ω ) ∼ G , k k 0 (cid:89) p ∼ ζ (1−ζ ), where ζ ∼ Beta(1,α). (2.8) k k j k j<k Thestick-breakingprocessdistinguishestherolesofG andαinthattheformergovernscomponent 0 valueθ whilethelatterguidesthechoiceofcomponentprobabilityp . Fromnowon,forconciseness, k k I denote the p part in equation (2.8) as p ∼ SB(1,α), where the function name SB is the acronym k k for“stick-breaking”, and the two arguments are from the parameters of the Beta distribution for “stick length”ζ ’s. k One virtue of the nonparametric Bayesian framework is its ability to flexibly elicit the tuning parameter from the data. Namely, we can set up a relatively flexible hyperprior for the DP scale parameter, α ∼ Ga(aα,bα), and update it based on the observations. Roughly speaking, the DP 0 0 scale parameter α is linked to the number of unique components in the mixture density and thus determines and reflects the flexibility of the mixture density. Let K∗ denote the number of unique components. As derived in Antoniak (1974), we have E[K∗|α] ≈ αlog (cid:0)α+N(cid:1) , Var[K∗|α] ≈ α α (cid:2) log (cid:0)α+N(cid:1) −1 (cid:3) . α 2.3.2 Correlated Random Coefficients Model Toaccommodatethecorrelatedrandomcoefficientsmodelwheretheindividualheterogeneityz (= λ or l ) i i i can be potentially correlated with the conditioning variables c , it is necessary to consider a noni0 parametric Bayesian prior that is compatible with the much harder conditional density estimation problem. One issue is associated with the uncountable collection of conditional densities, and Pati et al. (2013) circumvent it by linking the properties of the conditional density to the corresponding ones of the joint density without explicitly modeling the marginal density of c . As suggested in i0 Pati et al. (2013), I utilize the Mixtures of Gaussian Linear Regressions (MGLR ) prior, a genx eralization of the Gaussian-mixture prior for conditional density estimation, and extend it to the multivariate setup. Conditioning on c , i0 ∞ z |c ∼ (cid:88) p (c )N (cid:16) µ (cid:2) 1,c(cid:48) (cid:3)(cid:48) ,Ω (cid:17) . (2.9) i i0 k i0 k i0 k k=1 Similar to the DPM prior, the component parameters can be directly drawn from the base distribution, θ = (µ ,Ω ) ∼ G . G is again specified as a conjugate form with a multivariate normal k k k 0 0 11

distribution for vec(µ ) and an inverse Wishart distribution for Ω (or a multivariate-normalk k inverse-gamma distribution if z is a scalar). Now the mixture probabilities are characterized by the i probit stick-breaking process (cid:89) p (c ) = Φ(ζ (c )) (1−Φ(ζ (c ))), (2.10) k i0 k i0 j i0 j<k where stochastic function ζ is drawn from the Gaussian process ζ ∼ GP (0,V ) for k = 1,2,···.19 k k k This setup has three key features: (1) component means are linear in c ; (2) component coi0 variance matrices are independent of c ; and (3) mixture probabilities are flexible functions of c . i0 i0 This framework is general enough to accommodate a broad class of conditional distributions due to the flexibility in the mixture probabilities. Intuitively, the current setup is similar to approximating the conditional density via Bayes’ theorem, but does not explicitly model the distribution of the conditioning variables c , and thus allows for more relaxed assumptions on it.20 i0 3 Theoretical Properties 3.1 Background Generally speaking, a Bayesian analysis starts with a prior belief and updates it with data. It is desirable to ensure that the prior belief does not dominate the posterior inference asymptotically. Namely, as more and more data have been observed, one would have weighed more on the data and less on prior, and the effect from the prior would have ultimately been washed out. For pure Bayesianswhohavedifferentpriorbeliefs,theasymptoticpropertiesensurethattheywilleventually agreeonsimilarpredictivedistributions(BlackwellandDubins,1962;DiaconisandFreedman,1986). Forfrequentistswhoperceivethatthereisanunknowntruedatageneratingprocess, theasymptotic propertiesactasfrequentistjustificationfortheBayesiananalysis—asthesamplesizeincreases, the updated posterior recovers the unknown truth. Moreover, the conditions for posterior consistency provide guidance in choosing better-behaved priors. In the context of infinite dimensional analysis such as density estimation, posterior consistency cannot be taken as given. On the one hand, Doob’s theorem (Doob, 1949) indicates that the Bayesian posterior will achieve consistency almost surely under the prior measure. On the other hand, the null set for the prior can be topologically large, and hence the true model can easily fall beyond the scope of the prior, especially in nonparametric analysis. Freedman (1963) gives a simple counterexample in the nonparametric setup, and Freedman (1965) further examines the combinations of the prior and the true parameters that yield a consistent posterior, and proves that such combinations are meager in the joint space of the prior and the true parameters. Therefore, 19See Appendix A for the definition of Gaussian process. 20See Appendix B.1 for a more detailed explanation. 12

for problems involving density estimation, it is crucial to find reasonable conditions on the joint behavior of the prior and the true density to establish the posterior consistency argument. In this section, I show the asymptotic properties of the proposed semiparametric Bayesian predictor when the time dimension T is fixed and the cross-sectional dimension N tends to infinity. Basically, under reasonably general conditions, the joint posterior of the common parameters and the individual effect distribution concentrates in an arbitrarily small region around the true data generating process, and the density forecasts concentrate in an arbitrarily small region around the oracle. 3.2 Identification To establish the posterior consistency argument, we first need to ensure identification of both the common parameters and the (conditional) distribution of individual effects. Here, I present the identification result in terms of the correlated random coefficients model with cross-sectional heteroskedasticity, where random coefficients and cross-sectional homoskedasticity can be viewed as special cases. Assumption 3.1. (Identification: General Model) 1. Model setup: (a) Conditional on wA , (cid:0) c∗ ,λ ,σ2(cid:1) are i.i.d. across i. 0:T i0 i i (b) For all t, conditional on (y ,c ), xP∗ is independent of (cid:0) λ ,σ2,β (cid:1) . it i,t−1 it i i (cid:16) (cid:17) (c) xO ,w are independent of (cid:0) λ ,σ2,β (cid:1) . i,0:T i,0:T i i (d) Conditioning on c , λ and σ2 are independent of each other. i0 i i (e) Let u = σ v . v is i.i.d. across i and t and independent of c . it i it it i,t−1 2. Identification: (a) The characteristic functions for λ |c and σ2|c are non-vanishing almost everywhere.21 i i0 i i0 (b) For all i, w has full rank d . i,0:T−1 w (c) After orthogonal forward differencing, (cid:32) T (cid:33)−1 T (cid:88) (cid:88) x˜ = x −w(cid:48) w w(cid:48) w x . i,t−1 i,t−1 i,t−1 i,s−1 i,s−1 i,s−1 i,s−1 s=t+1 s=t+1 Then, the matrix E(cid:2)(cid:80)T−dwx˜ x˜(cid:48) (cid:3) has full rank d . t=1 i,t−1 i,t−1 x Remark 3.2. (i) Condition 1-a characterizes the correlated random coefficients model, where there can be a potential correlation between the individual heterogeneity (cid:0) λ ,σ2(cid:1) and the conditioning i i variables c . Therefore, despite the conditional independence in condition 1-d, λ and σ2 can i0 i i potentially relate to each other through c . For example, a young firm’s initial performance may i0 reveal its underlying ability and risk. 21This assumption can be relaxed based on Evdokimov and White (2012). 13

(ii) For the random coefficients case, condition 1-a can be altered to“ (cid:0) λ ,σ2(cid:1) are independent of c i i i0 and i.i.d. across i”. Together with condition 1-d, it implies that (cid:0) λ ,σ2,c (cid:1) are independent among i i i0 one another. (iii)“v is i.i.d. across i and t”in condition 1-e implies that v is independent of (cid:0) λ ,σ2(cid:1) , because it it i i the individual effects λ and σ2 are time invariant. i i (iv) It is possible to allow some additional flexibility in the distribution of the shock u . For exit ample, the identification argument still holds as long as (1) conditional on c , v is i.i.d. across i,t−1 it i, (2) the distributions of v , fv(v |c ), have known functional forms, such that E[v |c ] = it t it i,t−1 it i,t−1 0, V[v |c ] = 1, and (3) the characteristic function for v |c is non-vanishing almost evit i,t−1 it i,t−1 erywhere. Nevertheless, as this paper studies panels with short time spans, time-varying shock distribution may not play a significant role. I will keep the normality assumption in the rest of this paper to streamline the arguments. Proposition 3.3. (Identification: General Model) Under Assumption 3.1, the common parameters β and the conditional distribution of individual effects, fλ(λ |c ) and fσ2(σ2|c ), are all identified. i i0 i i0 See Appendix B.2 for the proof. Assumption 3.1 and Proposition 3.3 are similar to Assumption 2.1-2.2 and Theorem 2.3 in Liu et al. (2017)22 except for the treatment of cross-sectional heteroskedasticity with σ2 being an unobserved random quantity (see the literature review for a more i detailed comparison). The rank condition supports the posterior consistency of β via orthogonal forward differencing. 23 λ is additively separable from the shocks, so I follow the original proof i based on the characteristic function (i.e. the Fourier transform). Note that unlike λ , σ2 interacts i i with the shocks in a multiplicative way. The Fourier transform is not suitable for disentangling products of random variables, so I resort to the Mellin transform (Galambos and Simonelli, 2004) to deliver the identification of fσ2. Example: Baseline Model For the baseline setup in equation (1.1), we can reduce Assumption 3.1 and establish the identification result based on a simpler set of assumptions as follows. Assumption 3.4. (Identification: Baseline Model) 1. (y ,λ ) are i.i.d. across i. i0 i 2. u is i.i.d. across i and t. it 3. The characteristic function for λ |y is non-vanishing almost everywhere. i i0 4. T ≥ 2. Intuitively speaking, taking young firm dynamics as the example, the second condition implies that skill is independent of shock (see Remark 3.2 (iii)) and that shock is independent across firms 22It is in turn based on early works such as Arellano and Bover (1995) and Arellano and Bonhomme (2012). 23The identification of common parameters in panel data models is standard in the literature. See textbooks and handbook chapters, Baltagi (1995), Arellano and Honor´e (2001), Arellano (2003), and Hsiao (2014). 14

and times, so skill and shock are intrinsically different and distinguishable. The third condition facilitates the deconvolution between the signal (skill) and the noise (shock) via Fourier transform. The last condition guarantees that the time span is long enough to distinguish persistence βy i,t−1 and individual effects λ . i Extension: UnbalancedPanels Fortheunbalancedpanelswithrandomlyomittedobservations as specified in Subsection 2.1, we have: Assumption 3.5. (Identification: Unbalanced Panels) For all i, 1. c is observed. i0 2. x and w are observed. iT iT 3. w has full rank d . i,(t0i−1):(t1i−1) w 4. After orthogonal forward differencing, (cid:32) (cid:88) t1i (cid:33)−1 (cid:88) t1i x˜ = x −w(cid:48) w w(cid:48) w x . i,t−1 i,t−1 i,t−1 i,s−1 i,s−1 i,s−1 i,s−1 s=t+1 s=t+1 Then, the matrix E(cid:2)(cid:80)t1i−dwx˜ x˜(cid:48) (cid:3) has full rank d . t=t0i i,t−1 i,t−1 x The first condition guarantees the existence of the initial conditioning set for the correlated random coefficients model. The second condition ensures that the covariates in the forecast equation are available in order to make predictions. The third and fourth conditions are the unbalanced panel counterparts of Assumption 3.1 (2-b,c). They guarantee that the observed chain is long and informative enough to distinguish different aspects of common effects and individual effects. Now we can obtain similar identification results for unbalanced panels under Assumptions 3.1 (except 2-b,c) and 3.5. 3.3 Posterior Consistency In this subsection, I establish the posterior consistency of the estimated common parameters ϑ and the estimated (conditional) distribution of individual effects fz. Note that the estimated individual effects z are not consistent because information is accumulated only along the cross-sectional i dimension but not along the time dimension. 3.3.1 Random Coefficients Model In the random coefficients model, fz is an unconditional distribution. Let Θ be the space of the parametric component ϑ, and let Fz be the set of densities on Rdz (with respect to Lebesgue measure) as the space of the nonparametric component fz. Hence, Θ = Rdx ×R+, f = fλ, and F = Fλ in cross-sectional homoskedastic cases, and Θ = Rdx, f = fλfl, and F = FλFl in cross-sectional heteroskedastic cases due to the independence between λ and σ2. Let Π(·,·) be a i i 15

joint prior distribution on Θ ×F with marginal priors being Πϑ(·) and Πf (·). The corresponding joint posterior distribution is denoted as Π(·,·|D) with the marginal posteriors indicated with superscripts. The posterior consistency results are established with respect to the strong topology on f, which is generated by the L -metric on integrable functions and is closely related to convergence 1 of the probability distribution function (pdf). Note that the strong topology is stronger than the weak topology, and convergence of the pdf is stronger than convergence in distribution or weak convergence. It in turn implies the convergence of quantiles. Definition 3.6. Theposteriorachievesstrong consistency at(ϑ ,f )ifforany(cid:15) > 0andanyδ > 0, 0 0 as N → ∞, Π((ϑ,f) : (cid:107)ϑ−ϑ (cid:107) < δ, (cid:107)f −f (cid:107) < (cid:15)|D) → 1 0 2 0 1 in probability with respect to the true data generating process. To give the intuition behind the posterior consistency argument, let us first consider a simpler scenario where we estimate the distribution of observables without the deconvolution and dynamic panel data structure. The following lemma restates Theorem 1 in Canale and De Blasi (2017). Note that space F is not compact, so we introduce a compact subset F (i.e. sieve) that asymptotically N approximates F and then regularize the asymptotic behavior of F instead of F. N Lemma 3.7. (Canale and De Blasi, 2017) The posterior is strongly consistent at f under two sufficient conditions: 0 1. Kullback-Leibler (KL) property: f is in the KL support of Π, i.e. for all (cid:15) > 0, 0 Π(f ∈ F : d (f ,f) < (cid:15)) > 0, KL 0 ´ where d (f ,f) = f log f0 is the KL divergence of f from f . KL 0 0 f 0 2. Sieve property: There exists F ⊂ F that can be partitioned as F = ∪ F such that, for N N j N,j any (cid:15) > 0, (a) For some β > 0, Π(Fc ) = O(exp(−βN)). N (b) For some γ > 0, (cid:80) (cid:112) N ((cid:15),F )Π(F ) = o (cid:0) exp (cid:0) (1−γ)N(cid:15)2(cid:1)(cid:1) , where N ((cid:15),F ) is j N,j N,j N,j the covering number of F by balls with a radius (cid:15).24 N,j ByBayes’Theorem,theposteriorprobabilityofthealternativeregionUc = {f ∈ F : (cid:107)f −f (cid:107) ≥ (cid:15)} 0 1 24Asthecoveringnumberincreasesexponentiallywiththedimensionofz,adirectadoptionofTheorem2inGhosal et al. (1999) would impose a strong tail restriction on the prior and exclude the case where the base distribution G 0 containsaninverseWishartdistributionforcomponentvariances. Hence,IfollowtheideaofGhosalandvanderVaart (2007)andCanaleandDeBlasi(2017), wheretheyrelaxtheassumptiononthecoveragebehaviorbyasummability condition of covering numbers weighted by their corresponding prior probabilities. 16

can be expressed as the ratio on the right hand side, ˆ N (cid:44)ˆ N Π(Uc|x ) = (cid:89) f(x i ) dΠ(f) (cid:89) f(x i ) dΠ(f). 1:N f (x ) f (x ) Uc 0 i F 0 i i=1 i=1 Intuitively speaking, for the numerator, the sieve property ensures that the sieve expands to the entire alternative region and puts an asymptotic upper bound on the number of balls that cover the sieve. As the likelihood ratio is small in each covering ball, the integration over the alternative region is still sufficiently small. For the denominator, the KL property implies that the prior of distributions puts positive weight on the true distribution, so the likelihood ratio integrated over the whole space is large enough. Therefore, the posterior probability of the alternative region is arbitrarily small. Lemma 3.7 establishes posterior consistency in a density estimation context. However, as mentioned in the introduction, there are a number of challenges in adapting to the dynamic panel data setting. The first challenge is, because we observe y rather than λ , to disentangle the unit i certainty generated from unknown cross-sectional heterogeneity λ and from independent shocks i u , i.e. a deconvolution problem.25 Second is to incorporate an unknown shock size σ2 in crossit sectional homoskedastic cases.26 Third is to address unknown individual-specific shock sizes σ2 in i cross-sectional heteroskedastic cases. Fourth is to take care of strictly exogenous and predetermined variables (including lagged dependent variables) as covariates. More specifically, in the dynamic panel data model, the posterior probability of the alternative region can be decomposed as Π((ϑ,f) : (cid:107)ϑ−ϑ (cid:107) ≥ δ or (cid:107)f −f (cid:107) ≥ (cid:15)|D) 0 2 0 1 =Πϑ((cid:107)ϑ−ϑ (cid:107) ≥ δ|D)+Π((cid:107)f −f (cid:107) ≥ (cid:15),(cid:107)ϑ−ϑ (cid:107) < δ|D), 0 2 0 1 0 2 and we want to show that the whole expression tends to zero as N goes to infinity. The first term is the marginal posterior probability of the finite dimensional common parameters. Its posterior consistency is relatively straightforward to obtain. When a candidate ϑ is far from the true ϑ , 0 we can employ orthogonal forward differencing to get rid of λ (see Appendix B.2), and then apply i the traditional posterior consistency argument to a linear regression of the“residues”. The second term accounts for the infinite dimensional underlying distribution when ϑ approaches ϑ but f is 0 separated from f . The following two paragraphs outline the intuition behind this part of the proof. 0 25Somepreviousstudies(Amewou-Atissoetal.,2003;Tokdar,2006)estimatedistributionsofquantitiesthatcanbe inferredfromobservablesgivencommoncoefficients. Forexample,inthelinearregressionproblemswithanunknown errordistribution,i.e.y =β(cid:48)x +u ,conditionalontheregressioncoefficientsβ,u =y −β(cid:48)x isinferrablefromthe i i i i i i data. However, here the target λ intertwines with u and cannot be easily inferred from the observed y . i it it 26Notethatwhenλ andu arebothGaussianwithunknownvariances,wecannotseparatelyidentifythevariances i it in the cross-sectional setting (T =1). This is no longer a problem if either of the distributions is non-Gaussian or if we work with panel data. 17

Basically, we can re-establish the two conditions in Lemma 3.7 by linking the distribution of the observables y with the underlying distribution of λ (and σ2). it i i The KL requirement ensures that the prior puts positive weight on the true distribution. To satisfytheKLrequirement,weneedsomejointassumptionsonthetruedistributionf andtheprior 0 Π. Compared to general nonparametric Bayesian modeling, the DPM structure (and the MGLR x structure for the correlated random coefficients model) offers more regularities on the prior Π and thus weaker assumptions on the true distribution f (see Assumptions 3.9 and 3.12). In terms 0 of deconvolution,27 the KL requirement is achieved through the convexity of the KL divergence. Intuitively, convolution with a common distribution would reduce the difference in f. In terms of the common parameters, the KL requirement is delivered via controlling the tail behavior of covariates, so the continuity at ϑ is preserved after integrating out covariates. 0 The sieve property guarantees that the data are informative enough to differentiate the true distribution from the alternatives. It relies on both the DPM setup and the strong topology characterized by the L -norm. In terms of deconvolution, (de)convolution preserves the L -norm as well 1 1 as the number of balls that cover the sieve. In terms of the common parameters, when ϑ is close to ϑ but f is far from f , we want to make sure that the deviation generated from ϑ is small enough 0 0 so that it cannot offset the difference in f. Now let us formally state the assumptions and main theorem for random coefficients models. Appendix B.3 provides the complete proof. Assumption 3.8. (Covariates) 1. wI is bounded. i,0:T ´ (cid:16) (cid:17) 2. Let c˜ = xP,xO . As C → 0, (cid:107)c˜ (cid:107)2p(c˜ |c \c˜ )dc˜ → 0. i0 ´ i0 i,0:T−1 (cid:13) (cid:13)2 (cid:16) (cid:107)c˜i0(cid:107) 2 ≥C i0 2 (cid:17) i0 i0 i0 i0 3. As C → 0, (cid:13)xP∗ (cid:13) p xP∗ |y ,c dxP∗ → 0. (cid:107)xP∗ (cid:107) ≥C(cid:13) i,t−1(cid:13) i,t−1 i,t−1 i,0:t−2 i,t−1 i,t−1 2 2 Considering that we have a regression model, the tail condition prevents a slight difference in β from obscuring the difference in f, which is satisfied when x exhibits a tail behavior similar to i,t−1 λ (see Assumption 3.9 (1-e) below). The boundedness of wI serves the same purpose for the i i,0:T heterogenous term λ(cid:48)w . i i,t−1 Assumption 3.9. (Nonparametric Bayesian Prior: Random Coefficients) 1. Conditions on fz: 0 (a) fz(z) is a continuous density. 0 (b) For some 0 < M < ∞, 0 < fz(z) ≤ M for all z. ´ 0 (cid:12) (cid:12) (c) (cid:12) fz(z)logfz(z)dz(cid:12) < ∞. ´ 0 0 (d) fz(z)log f 0 z(z) dz < ∞, where ϕ (z) = inf fz(z), for some δ > 0. 0 ϕ δ (z) ´ δ (cid:107)z(cid:48)−z(cid:107) 2 <δ 0 (e) For some η > 0, (cid:107)z(cid:107) 2(1+η) fz(z)dz < ∞. 2 0 27Here and below,“deconvolution”also refers to the multiplicative deconvolution for cross-sectional heteroskedasticity. 18

2. Conditions on Gz(µ,Ω): 0 (a) Gz 0 has full support on Rdz ×S, where S is the space of d z ×d z positive definite matrices with the spectral norm.28 (b) For some c , c , c > 0, r > (d −1)/2, and κ > d (d −1), for sufficiently large 1 2 3 z z z x > 0, Gz 0 ((cid:107)µ(cid:107) (cid:16) 2 > x) = (cid:17) O (cid:0) x−2(r+1)(cid:1) , Gz 0 (Λ 1 > x) = O(exp(−c 1 xc2)), Gz 0 (cid:0) Λ dz < x 1(cid:1) = O(x−c3), Gz 0 Λ Λ d 1 z > x = O(x−κ), where Λ 1 and Λ dz are the largest and smallest eigenvalues of Ω−1, respectively. First, the KL property is obtained based on conditions 1 and 2-a. Condition 1 ensures that the true distributionf iswell-behaved,andcondition2-aguaranteesthattheDPMpriorisgeneralenoughto 0 containthetruedistribution. Then,condition2-bfurtheraccountsforthesieveproperty. According to Corollary 1 in Canale and De Blasi (2017), condition 2-b holds for a multivariate-normal-inverse- WishartbasedistributionGz (oranormal-inverse-gammabasedistributionifz isascalar)aslongas 0 the degree of freedom of the inverse Wishart component νz > max{2d ,(2d +1)(d −1)}, where 0 z z z 2d controls the tail behavior of component mean µ and (2d +1)(d −1) regulates the eigenvalue z z z structure of component variance Ω. Proposition 3.10. (Consistency: Random Coefficients) Suppose we have: 1. Model: The random effects version of Assumption 3.1.29 2. Covariates: (x ,w ) satisfies Assumption 3.8. i,0:T i,0:T 3. Common parameters: ϑ is in the interior of supp (cid:0) Πϑ(cid:1) . 0 4. Distributions of individual effects: (a) fz and Πz satisfy Assumption 3.9. 0 (cid:16) (cid:17) (b) For cross-sectional heteroskedastic models, supp fσ2 is bounded above by some large 0 σ¯2 > 0. Then, the posterior is strongly consistent at (ϑ ,f ). 0 0 3.3.2 Correlated Random Coefficients Model For the correlated random coefficients model, the definitions and notations are parallel with those in the random coefficients model with a slight adjustment considering that f is now a conditional distribution. As in Pati et al. (2013), it is helpful to link the properties of the conditional density to the corresponding ones of the joint density without explicitly modeling the marginal density of c , which circumvents the difficulty associated with an uncountable set of conditional densities. i0 Let q (c ) be the true marginal density of the conditioning variables. Then, the induced q - 0 0 ´ ´ 0 (cid:2) (cid:3) integrated L -distance is defined as (cid:107)f −f (cid:107) ≡ |f(z|c )−f (z|c )|dz q (c )dc , and the 1 0 1 ´ ´ 0 0 0 0 0 0 (cid:104) (cid:105) induced q -integrated KL divergence is d (f ,f) ≡ f (z|c )log f0(z|c0) dz q (c )dc . Note 0 KL 0 0 0 f(z|c0) 0 0 0 28The spectral norm is induced by the L -norm on vectors, (cid:107)Ω(cid:107) =max (cid:107)Ωx(cid:107)2. 2 2 x(cid:54)=0 (cid:107)x(cid:107)2 29Or Assumptions 3.1 (except 2-b,c) and 3.5 for unbalanced panels. 19

that in both definitions, the conditioning variables c are integrated out with respect to the true q , 0 0 so this setup does not require estimating q and thus relaxes the assumption on the conditioning 0 set.30 The main assumptions and theorem for correlated random coefficients models are stated as follows. Assumption 3.11. (Conditioning set) Let C be the support of c i0 . C is a compact subset of Rdc0, and q 0 (c 0 ) > 0 for all c 0 ∈ C. The compactness fosters uniform behavior along the conditioning variables. This assumption is stronger than Assumption 3.8 (1 and 2) for random coefficients models. Assumption 3.12. (Nonparametric Bayesian Prior: Correlated Random Coefficients) 1. Conditions on fz: 0 (a) For some 0 < M < ∞, 0 < fz(z|c ) ≤ M for all (z,c ). ´ ´ 0 0 0 (b) ´ (cid:12) (cid:12) (cid:2) ´ f 0 z(z|c 0 )logf 0 z(z|c 0 )dz (cid:3) q 0 (c 0 )dc 0 (cid:12) (cid:12) < ∞. (c) (cid:104) fz(z|c )log f 0 z(z|c0) dz (cid:105) q (c )dc < ∞, where ϕ (z|c ) = inf fz(z|c ), for 0 0 ϕ δ (z|c0) 0 0 0 δ 0 (cid:107)z(cid:48)−z(cid:107) 2 <δ 0 0 some δ > 0. ´ ´ (cid:104) (cid:105) (d) For some η > 0, (cid:107)z(cid:107) 2(1+η) fz(z|c )dz q (c )dc < ∞. 2 0 0 0 0 0 (e) fz(·|·) is jointly continuous in (z,c ). 0 0 2. Conditions on Gz: Let d = d (d +1) be the dimension of vec(µ). 0 µ z c0 (a) Gz has full support on Rdµ ×S. 0 (b) Gz is absolutely continuous. 0 (c) For some c , c , c > 0, r > (d −1)/2, and κ > d (d −1), for sufficiently large x > 0, 1 2 3 µ z z Gz 0 ((cid:107)vec(µ)(cid:107) 2 (cid:16) > x) = (cid:17) O (cid:0) x−2(r+1)(cid:1) , Gz 0 (Λ 1 > x) = O(exp(−c 1 xc2)), Gz 0 (cid:0) Λ dz < x 1(cid:1) = O(x−c3), Gz Λ1 > x = O(x−κ). 0 Λ dz (cid:16) (cid:17) 3. Conditions on the stick-breaking process: Vz(c,c˜) = τzexp −Az(cid:107)c−c˜(cid:107)2 , τz > 0 fixed. k k 2 (a) The prior for Az has full support on R+. k (cid:16) (cid:17) (b) There exist β, γ > 0 and a sequence δ = O N−5/2(logN)2 such that P(Az > δ ) ≤ N k N exp (cid:0) −Nβh(β+2)/γlogh (cid:1) . (cid:16) (cid:17) (c) There exists an increasing sequence r N → ∞ and (r N )dc0 = o N1−γ(logN)−(dc0 +1) such that P(Az > r ) ≤ exp(−N). k N These conditions build on Pati et al. (2013) for posterior consistency under the conditional density topology and further extend it to multivariate conditional density estimation with infinite locationscale mixtures. The conditions on fz and Gz can be viewed as conditional density analogs of the 0 0 conditions in Assumption 3.9. Conditions 1, 2-a,b, and 3-a ensure the KL property, and conditions 30Denotethejointdensitiesf˜ (z,c )=f (z|c )·q (c ), f˜(z,c )=f(z|c )·q (c ),wheref˜andf˜ sharethesame 0 0 0 0 0 0 0 0 0 0 0 marginal density q , but different conditional densities f and f . Then, the induced q -integrated L -distance/KL 0 0 0 1 divergence of f with respect to f equals to the L -distance/KL divergence of f˜with respect to f˜. 0 1 0 20

2-c and 3-b,c address the sieve property. For Gz with a multivariate normal distribution on vec(µ) 0 and an inverse Wishart distribution on Ω (or an inverse gamma distribution if z is a scalar), the tail condition on vec(µ) automatically holds, and Corollary 1 in Canale and De Blasi (2017) is satisfied as long as the degree of freedom of the inverse Wishart component νz > (2d +1)(d −1). 0 z z Proposition 3.13. (Consistency: Correlated Random Coefficients) Suppose we have: 1. Model: Assumption 3.1.31 2. Covariates: (x ,w ) satisfy Assumption 3.8 (3) and Assumption 3.11. i,0:T i,0:T 3. Common parameters: ϑ is in the interior of supp (cid:0) Πϑ(cid:1) . 0 4. Distributions of individual effects: (a) fz and Πz satisfy Assumption 3.12. 0 (cid:16) (cid:17) (b) For cross-sectional heteroskedastic models, supp fσ2 is bounded above by some large 0 σ¯2 > 0. Then, the posterior is strongly consistent at (ϑ ,f ). 0 0 The proof in Appendix B.4 parallels the random effects case except that now both the KL property and the sieve property are constructed on the q -induced measure. 0 3.4 Density forecasts Once the posterior consistency results are obtained, we can bound the discrepancy between the proposed predictor and the oracle by the estimation uncertainties in ϑ and f, and then show the asymptotical convergence of the density forecasts to the oracle forecast (see Appendix B.5 for the detailed proof). Proposition 3.14. (Density Forecasts) Suppose we have: 1. For random coefficients models, conditions in Proposition 3.10. 2. For correlated random coefficients models, (a) Conditions in Proposition 3.13. (b) There exists q > 0 such that |q (c )| > q for all c ∈ C. 0 0 0 (cid:16) (cid:17) 3. In addition, for cross-sectional heteroskedastic models, supp fσ2 is bounded below by some 0 σ2 > 0. Then, for any i and any (cid:15) > 0, as N → ∞, (cid:16)(cid:13) (cid:13) (cid:12) (cid:17) P (cid:13)fcond −foracle(cid:13) < (cid:15)(cid:12)D → 1. (cid:13) i,T+1 i,T+1(cid:13) (cid:12) 1 31Or Assumptions 3.1 (except 2-b,c) and 3.5 for unbalanced panels. 21

Remark 3.15. (i) Requirement 3 ensures that the likelihood would not explode in cross-sectional heteroskedastic models. (ii) The asymptotic convergence of aggregate-level density forecasts can be derived by summing individual-specific forecasts over different subcategories. 4 Numerical Implementation Inthissection,Iproposeaposteriorsamplingprocedureforthegeneralpaneldatamodelintroduced in Subsection 2.1. The nonparametric Bayesian prior is specified in Subsection 2.3 and enjoys desirable theoretical properties as discussed in Section 3.32 4.1 Random Coefficients Model For the random coefficients model, I impose the Gaussian-mixture DPM prior on f. The posterior sampling algorithm builds on the blocked Gibbs sampler proposed by Ishwaran and James (2001, 2002). They truncate the number of components by a large K, and prove that as long as K is large enough, the truncated prior is“virtually indistinguishable”from the original one. Once truncation is conducted, it is possible to augment the data with latent component probabilities, which boosts numerical convergence and leads to faster code. To check the robustness regarding the truncation, I also implement the more sophisticated yet complicated slice-retrospective sampler (Dunson, 2009; Yau et al., 2011; Hastie et al., 2015), which does not truncate the number of components at a predetermined K (see Algorithm C.4 in the Appendix). The estimates and forecasts for the two samplers are almost indistinguishable, so I will only show the results generated from the simpler truncation sampler in this paper. Suppose the number of components is truncated at K. Then, the component probabilities are constructed via a truncated stick-breaking process governed by the DP scale parameter α.  ∼ SB(1,α), k < K, p k = 1− (cid:80)K−1p , k = K. j=1 j Note that due to the truncation approximation, the probability for component K is different from its infinite mixture counterpart in equation (2.8). Resembling the infinite mixture case, I denote the above truncated stick-breaking process as p ∼ TSB(1,α,K), where TSB stands for“truncated k stick-breaking”, the first two arguments are from the parameters of the Beta distribution, and the last argument is the truncated number of components. 32The hyperparameters are chosen in a relatively ignorant sense without inferring too much from the data except aligning the scale according to the variance of the data. See Appendix C.1 for details of the baseline model with random effects. 22

Below, the algorithms are stated for cross-sectional heteroskedastic models, while the adjustments for cross-sectional homoskedastic scenarios are discussed in Remark 4.2 (ii). For individual heterogeneity z = λ,l, let γz be individual i’s component affiliation, which can take values i {1,··· ,Kz}, Jz be the set of individuals in component k, i.e. Jz = {i : γz = k}, and nz be the k k i k number of individuals in component k, i.e. n = #J . Then, the (data-augmented) joint posterior k k for the model parameters is given by p({αz,{pz,µz,Ωz},{γz,z }},β|D) (4.1) k k k i i (cid:89) (cid:89) (cid:16) (cid:12) (cid:17) = p(y |λ ,l ,β,w ,x )· p z (cid:12)µz ,Ωz p(γz|{pz}) it i i i,t−1 i,t−1 i(cid:12) γz γz i k i i i,t z,i (cid:89) · p(µz,Ωz)p(pz|αz)·p(αz)·p(β), k k k z,k where z = λ,l, k = 1,··· ,Kz, i = 1,···N, and t = 1,··· ,T. The first block links observations to model parameters {λ ,l } and β. The second block links the individual heterogeneity z to the i i i underlying distribution fz. The last block formulates the prior belief on (β,f). The following Gibbs sampler cycles over the following blocks of parameters (in order): (1) component probabilities, αz,{pz}; (2) component parameters, {µz,Ωz}; (3) component memberships, k k k {γz}; (4) individual effects, {λ ,l }; and (5) common parameters, β. A sequence of draws from this i i i algorithm forms a Markov chain with the sampling distribution converging to the posterior density. Note that if the individual heterogeneity z were known, only step 5 would be sufficient to i recover the common parameters. If the mixture structure of fz were known (i.e. if (pz,µz,Ωz) k k k for all components were known), only steps 3 to 5 would be needed to first assign individuals to components and then infer z based on the specific component that individual i has been assigned i to. In reality, neither z nor its distribution fz is known, so I incorporate two more steps 1 and 2 i to model the underlying distribution fz. Algorithm 4.1. (Random Coefficients with Cross-sectional Heteroskedasticity)33 For each iteration s = 1,··· ,n , sim 1. Component probabilities: For z = λ,l, (a) Draw αz(s) from a gamma distribution p (cid:16) αz(s) (cid:12) (cid:12)p z(s−1) (cid:17) : Kz (cid:16) (cid:17) αz(s) ∼ Ga aαz +Kz −1,bαz −logp z(s−1) . 0 0 Kz (b) For k = 1,··· ,Kz, draw p z(s) from the truncated stick-breaking process k 33Below,IpresenttheformulasforthekeynonparametricBayesiansteps,andleavethedetailsofstandardposterior sampling procedures, such as drawing from a normal-inverse-gamma distribution or a linear regression, to Appendix C.3. 23

(cid:16)(cid:110) (cid:111)(cid:12) (cid:110) (cid:111)(cid:17) p p z(s) (cid:12)αz(s), n z(s−1) : k (cid:12) k   Kz p z(s) ∼ TSB1+n z(s−1) ,αz(s)+ (cid:88) n z(s−1) ,Kz . k k j j=k+1 (cid:16) (cid:17) 2. Componentparameters: Forz = λ,l, fork = 1,··· ,Kz, draw µ z(s) ,Ω z(s) fromamultivariatek k normal-inverse-Wishart distribution (or a normal-inverse-gamma distribution if z is a scalar) (cid:18) (cid:12) (cid:19) z(s) z(s)(cid:12)(cid:110) (s−1) (cid:111) p µ ,Ω (cid:12) z . k k (cid:12) i i∈Jz(s−1) k z(s) 3. Component memberships: For z = λ,l, for i = 1,···N, draw γ from a multinomial distri- (cid:16)(cid:110) (cid:111)(cid:12)(cid:110) (cid:111) (cid:17) i bution p γ z(s) (cid:12) p z(s) ,µ z(s) ,Ω z(s) ,z (s−1) : i (cid:12) k k k i Kz z(s) z(s) (cid:16) (s−1) z(s) z(s) (cid:17) (cid:88) γ = k, with probability p ∝ p φ z ; µ ,Ω , p = 1. i ik k i k k ik k=1 4. Individual-specific parameters: (s) (a) For i = 1,··· ,N, draw λ from a multivariate normal distribution (or a normal distri- (cid:16) i (cid:12) (cid:17) bution if λ is a scalar) p λ (s)(cid:12)µ λ(s) ,Ω λ(s) , (cid:0) σ2(cid:1)(s−1) ,β(s−1),D ,D . i (cid:12) γλ γλ i i A i i (s) (b) For i = 1,··· ,N, draw l via the random-walk Metropolis-Hastings approach, i (cid:16) (cid:12) (cid:17) p l (s)(cid:12)µ l(s) ,Ω l(s) ,λ (s) ,β(s−1),D ,D i (cid:12) γl γl i i A i i T ∝ φ (cid:16) l (s) ; µ l(s) ,Ω l(s) (cid:17)(cid:89) φ (cid:16) y ; λ (s)(cid:48) w +β(s−1)(cid:48)x ,σ2 (cid:16) l (s) (cid:17)(cid:17) , i γl γl it i i,t−1 i,t−1 i i i t=1 where σ2(l) = σ¯2−σ2 +σ2. Then, calculate (cid:0) σ2(cid:1)(s) based on σ2(l). 1+σ¯2exp(−l) i 5. Common parameters: Draw β(s) from a linear regression model with“known”variance (cid:16) (cid:12)(cid:110) (cid:111) (cid:17) p β(s)(cid:12) λ (s) , (cid:0) σ2(cid:1)(s) ,D . (cid:12) i i Remark 4.2. (i) With the above prior specification, all steps enjoy closed-form conditional posterior distributions except step 4-b for σ2, which does not exhibit a well-known density form. Hence, I i resort to the random-walk Metropolis-Hastings algorithm to sample σ2. In addition, I also incori porate an adaptive procedure based on Atchad´e and Rosenthal (2005) and Griffin (2016), which adaptively adjusts the random walk step size and keeps acceptance rates around 30%. Intuitively, whentheacceptancerateforthecurrentiterationistoohigh(low),theadaptivealgorithmincreases (decreases) the step size in the next iteration, and thus potentially raises (lowers) the acceptance rate in the next round. The change in step size decreases with the number of iterations completed, and the step size converges to the optimal value. See Algorithm C.1 in the Appendix for details. (ii) In cross-sectional homoskedastic cases, the algorithm would need the following changes: (1) in 24

steps 1 to 4, only λ is considered, and (2) in step 5, (cid:0) β(s),σ2(s)(cid:1) are drawn from a linear regression i (cid:16) (cid:12)(cid:110) (cid:111) (cid:17) model with“unknown”variance, p β(s),σ2(s)(cid:12) λ (s) ,D . (cid:12) i 4.2 Correlated Random Coefficients Model To account for the conditional structure in the correlated random coefficients model, I implement a multivariate MGLR prior as specified in Subsection 2.3, which can be viewed as the conditional x counterpart of the Gaussian-mixture prior. The conditioning set c is characterized by equation i0 (2.2) for balanced panels or equation (2.3) for unbalanced panels. The major computational difference from the random coefficients model in the previous subsection is that now the component probabilities become flexible functions of c . As suggested in i0 Pati et al. (2013), I adopt the following priors and auxiliary variables in order to take advantage of conjugacy as much as possible. First, the covariance function for Gaussian process V (c,c˜) is k specified as (cid:16) (cid:17) V (c,c˜) = exp −A (cid:107)c−c˜(cid:107)2 , k k 2 where A = C B . Let η = d + 1, then Bη follows the standard exponential distribution, i.e. k k k c0 k p (cid:0) Bη(cid:1) = exp (cid:0) −Bη(cid:1) , and C = k−2(3η+2)(logk)−1/η. This prior structure satisfies Pati et al. (2013) k k k Remark 5.12 that ensures strong consistency.34 Furthermore, it is helpful to introduce a set of auxiliary stochastic functions ξ (c ), k = 1,2,···, such that k i0 ξ (c ) ∼ N (ζ (c ),1), k i0 k i0 p (c ) = Prob(ξ (c ) ≥ 0, and ξ (c ) < 0 for all j < k). k i0 k i0 j i0 Notethattheprobitstick-breakingprocessdefinedinequation(2.10)canberecoveredbymarginalizing over {ξ (y )}. Finally, I blend the MGLR prior with Ishwaran and James (2001, 2002) k i0 x truncation approximation to simplify the numerical procedure while still retaining reliable results. DenoteN×1vectorsζ = [ζ (c ),ζ (c ),··· ,ζ (c )](cid:48)andξ = [ξ (c ),ξ (c ),··· ,ξ (c )](cid:48), k k 10 k 20 k N0 k k 10 k 20 k N0 (cid:16) (cid:17) as well as an N ×N matrix V with the ij-th element being (V ) = exp −A (cid:107)c −c (cid:107)2 . The k k ij k i0 j0 2 next algorithm extends Algorithm 4.1 to the correlated random coefficients scenario. Step 1 for component probabilities has been changed, while the rest of the steps are in line with those in Algorithm 4.1. Algorithm 4.3. (Correlated Random Coefficients with Cross-sectional Heteroskedasticity)35 For each iteration s = 1,··· ,n , sim 34Their p is the d in the notation of this paper, and their d, η , and η can be constructed as d +1, dc0 , and c0 1 c0 dc0 +1 1 , respectively. 2(dc0 +1) 35See Remark 4.2 (ii) for cross-sectional homoskedastic models. 25

1. Component probabilities: For z = λ,l, (a) For k = 1,··· ,Kz −1, draw A z(s) via the random-walk Metropolis-Hastings approach,36 k (cid:32) (cid:32) z(s) (cid:33)η(cid:33) (cid:16) (cid:12) (cid:17) (cid:16) (cid:17)η−1 A p A z(s)(cid:12)ζ z(s−1) ,{y } ∝ A z(s) exp − k k (cid:12) k i0 k C k (cid:16) (cid:16) (cid:17)(cid:17) ·φ ζ z(s−1) ; 0,exp −A z(s) (cid:107)c −c (cid:107)2 . k k i0 j0 2 (cid:16) (cid:17) (cid:16) (cid:17) Then, calculate V z(s) , where V z(s) = exp −A z(s) (cid:107)c −c (cid:107)2 . k k k i0 j0 2 ij (b) For k = 1,··· ,Kz − 1, and i = 1,··· ,N, draw ξ z(s) (c ) from a truncated normal (cid:16) (cid:12) (cid:17) k i0 distribution p ξ z(s) (c )(cid:12)ζ z(s−1) (c ),γ z(s−1) :37 k i0 (cid:12) k i0 i  (cid:16) (cid:17) (cid:16) (cid:17) z(s−1) z(s) z(s−1)    ∝ N ζ k (c i0 ),1 1 ξ k (c i0 ) < 0 , if k < γ i ,  (cid:16) (cid:17) (cid:16) (cid:17) ξ z(s) (c ) ∝ N ζ z(s−1) (c ),1 1 ξ z(s) (c ) ≥ 0 , if k = γ z(s−1) , k i0 k i0 k i0 i   (cid:16) (cid:17)  z(s−1) z(s−1) ∼ N ζ (c ),1 , if k > γ . k i0 i (cid:16) (cid:12) (cid:17) (c) Fork = 1,··· ,Kz−1,ζ z(s) fromamultivariatenormaldistributionp ζ z(s)(cid:12)V z(s) ,ξ z(s) : k k (cid:12) k k (cid:16) (cid:17) (cid:20) (cid:16) (cid:17)−1 (cid:21)−1 ζ z(s) ∼ N mζ,Σζ , where Σζ = V z(s) +I and mζ = Σζξ z(s) . k k k k k N k k k (d) For k = 1,··· ,Kz, and i = 1,··· ,N, the component probabilities p z(s) (c ) are fully k i0 z(s) determined by ζ : k  (cid:16) (cid:17) (cid:16) (cid:16) (cid:17)(cid:17) z(s) (cid:81) z(s) z(s) Φ ζ k (c i0 ) j<k 1−Φ ζ j (c i0 ) , if k < K, p (y ) = k i0 1− (cid:80)Kz−1p z(s) (c ), if k = K. j=1 k i0 2. Component parameters: For z = λ,l, for k = 1,··· ,Kz, (cid:18) (cid:12) (cid:19) (cid:16) z(s) (cid:17) z(s)(cid:12) z(s−1) (cid:110) (s−1) (cid:111) (a) Drawvec µ k fromamultivariatenormaldistributionp µ k (cid:12) (cid:12) Ω k , z i ,c i0 i∈Jz(s−1) . k z(s) (b) Draw Ω from an inverse Wishart distribution (or an inverse gamma distribution if z k (cid:18) (cid:12) (cid:19) z(s)(cid:12) z(s) (cid:110) (s−1) (cid:111) is a scalar) p Ω k (cid:12) (cid:12) µ k , z i ,c i0 i∈Jz(s−1) . k z(s) 3. Component memberships: For z = λ,l, for i = 1,···N, draw γ from a multinomial distrii 36The first term comes from the change of variables from Bη to A . k k 371(·) is an indicator function. 26

(cid:16)(cid:110) (cid:111)(cid:12)(cid:110) (cid:111) (cid:17) bution p γ z(s) (cid:12) p z(s) ,µ z(s) ,Ω z(s) ,z (s−1) ,c : i (cid:12) k k k i i0 Kz γ z(s) = k, with probability p ∝ p z(s) (c )φ (cid:16) z (s−1) ; µ z(s)(cid:2) 1,c(cid:48) (cid:3)(cid:48) ,Ω z(s) (cid:17) , (cid:88) p = 1. i ik k i0 i k i0 k ik k=1 4. Individual-specific parameters: (s) (a) For i = 1,··· ,N, draw λ from a multivariate normal distribution (or a normal distri- (cid:16) i (cid:12) (cid:17) bution if λ is a scalar) p λ (s)(cid:12)µ λ(s) ,Ω λ(s) , (cid:0) σ2(cid:1)(s−1) ,β(s−1),D ,D . i (cid:12) γλ γλ i i A i i (s) (b) For i = 1,··· ,N, draw l via the random-walk Metropolis-Hastings approach (cid:16) (cid:12) i (cid:17) p l (s)(cid:12)µ l(s) ,Ω l(s) ,λ (s) ,β(s−1),D ,D , then calculate (cid:0) σ2(cid:1)(s) based on σ2(l). i (cid:12) γl γl i i A i i i 5. Common parameters: Draw β(s) from a linear regression model with“known”variance (cid:16) (cid:12)(cid:110) (cid:111) (cid:17) p β(s)(cid:12) λ (s) , (cid:0) σ2(cid:1)(s) ,D . (cid:12) i i 5 Simulation In this section, I have conducted extensive Monte Carlo simulation experiments to examine the numericalperformanceoftheproposedsemiparametricBayesianpredictor. Subsection5.1describes the evaluation criteria for point forecasts and density forecasts. Subsection 5.2 introduces other alternative predictors. Subsection 5.3 considers the baseline setup with random effects. Subsection 5.4 extends to the general setup incorporating cross-sectional heterogeneity and correlated random coefficients. 5.1 Forecast Evaluation Methods As mentioned in the model setup in Subsection 2.1, this paper focuses on one-step-ahead forecasts, butasimilarframeworkcanbeappliedtomulti-period-aheadforecasts. Theforecastingperformance is evaluated along both the point and density forecast dimensions, with particular attention to the latter. Point forecasts are evaluated via the Mean Square Error (MSE), which corresponds to the quadratic loss function. Let yˆ denote the forecast made by the model, i,T+1 yˆ = βˆ(cid:48)x +λˆ(cid:48)w , i,T+1 iT i iT where λˆ and βˆ stand for the estimated parameter values. Then, the forecast error is defined as i eˆ = y −yˆ , i,T+1 i,T+1 i,T+1 with y being the realized value at time T +1. The formula for the MSE is provided in the i,T+1 27

following equation, 1 (cid:88) MSE = eˆ2 . N i,T+1 i The Diebold and Mariano (1995) test is further implemented to assess whether the difference in the MSE is significant. The accuracy of the density forecasts is measured by the log predictive score (LPS) as suggested in Geweke and Amisano (2010), 1 (cid:88) LPS = logpˆ(y |D), i,T+1 N i where y is the realization atT+1, andpˆ(y |D) represents the predictive likelihood with rei,T+1 i,T+1 specttotheestimatedmodelconditionalontheobserveddataD. Inaddition, exp(LPS −LPS ) A B gives the odds of future realizations based on predictor A versus predictor B. I also perform the Amisano and Giacomini (2007) test to examine the significance in the LPS difference. 5.2 Alternative Predictors In the simulation experiments, I compare the proposed semiparametric Bayesian predictor with alternatives. Different predictors can be interpreted as different priors on the distribution of λ . As i these priors are distributions over distributions, Figure 5.1 plots two draws from each prior – one in red and the other in black.38 The homogeneous prior (Homog) implies an extreme kind of pooling, which assumes that all firms share the same level of skill λ∗. It can be viewed as a Bayesian counterpart of the pooled OLS estimator. Because λ∗ is unknown beforehand, the corresponding subgraph plots two vertical lines representing two degenerate distributions with different locations. More rigorously, this prior is defined as λ ∼ δ , where δ is the Dirac delta function representing a degenerate distribution i λ∗ λ∗ P(λ = λ∗) = 1. The unknown λ∗ becomes another common parameter, similar to β, so I adopt a i multivariate-normal-inverse-gamma prior on (cid:0) [β,λ∗](cid:48),σ2(cid:1) . The flat prior (Flat) is specified as p(λ ) ∝ 1, an uninformative prior with the posterior mode i being the MLE estimate. Roughly speaking, given the common parameters, there is no pooling from the cross-section, so we learn firm i’s skill λ only using its own history. i Theparametricprior(Param)poolstheinformationfromthecross-sectionviaaparametricskill distribution, such as a Gaussian distribution with unknown mean and variance. The corresponding subgraph contains two curves with different means and variances. More explicitly, we have λ ∼ i N (cid:0) µ,ω2(cid:1) , where a normal-inverse-gamma hyperprior is further imposed on (cid:0) µ,ω2(cid:1) . This prior can be thought of as a limit case of the DPM prior when the scale parameter α → 0, so there is only one component, and (cid:0) µ,ω2(cid:1) are directly drawn from the base distribution G . The choice of the 0 38For easier illustration, here I consider the baseline model with univariate λ and homoskedasticity. i 28

Figure 5.1: Alternative Predictors The black and red lines represent two draws from each prior. hyperprior follows the suggestion by Basu and Chib (2003) to match the Gaussian model with the DPM model such that“the predictive (or marginal) distribution of a single observation is identical under the two models”(pp. 226-227). The nonparametric discrete prior (NP-disc) is modeled by a DP where λ follows a flexible i nonparametric distribution, but on a discrete support. This paper focuses on continuous f, which may be more sensible for the skill of young firms as well as other similar empirical studies. In this sense, the NP-disc predictor helps examine how much can be gained or lost from the continuity assumption and from the additional layer of mixture. In addition, NP-R denotes the proposed nonparametric prior for random effects/coefficients models, and NP-C for correlated random effects/coefficients models. Both of them are flexible priors on continuous distributions, and NP-C allows λ to depend on the initial condition of the i firms. The nonparametric predictors would reduce the estimation bias due to their flexibility while increasing the estimation variance due to their complexity. It is not transparent ex-ante which predictor performs better – the parsimonious parametric ones or the flexible nonparametric ones. Therefore, it is worthwhile to implement the Monte Carlo experiments and assess which predictor produces more accurate forecasts under which circumstances. 29

Table 5.1: Simulation Setup: Baseline Model (a) Dynamic Panel Data Model Law of motion y = βy +λ +u , u ∼ N (cid:0) 0,σ2(cid:1) it i,t−1 i it it Common parameters β = 0.8, σ2 = 1 0 0 4 Initial conditions y ∼ N (0,1) i0 Sample size N = 1000, T = 6 (b) Random Effects Degenerate λ = 0 i Skewed λ ∼ 1N (cid:0) 2, 1(cid:1) + 8N (cid:0) −1, 1(cid:1) i 9 2 9 4 2 Fat tail λ ∼ 1N (0,4)+ 4N (cid:0) 0, 1(cid:1) i 5 5 4 √ Bimodal λ ∼ (0.35N (0,1)+0.65N (10,1))/ 1+102·0.35·0.65 i 5.3 Baseline Model Let us first consider the baseline model with random effects. The specifications are summarized in Table 5.1. β is set to 0.8, as economic data usually exhibit some degree of persistence. σ2 equals 1/4, so 0 0 theroughmagnitudeofsignal-noiseratioisσ2/V(λ ) = 1/4. Theinitialconditiony isdrawnfrom 0 i i0 a standard normal distribution, which complies with the tail condition in Assumption 3.8. Choices of N = 1000 and T = 6 are comparable with the young firm dynamics application. There are four experiments with different true distributions of λ , f (·). As this subsection i 0 focuses on the simplest baseline model with random effects, λ is independent of y in all four i i0 experiments. The first experiment features a degenerate λ distribution, where all firms enjoy the i same skill level. Note that it does not satisfy Assumption 3.9 (1-a), which requires the true λ i distribution to be continuous. The purpose of this distribution is to learn how poorly things can go under the misspecification that the true λ distribution is completely off from the prior support. i The second and third experiments are based on skewed and fat tail distributions, which reflect more realistic scenarios in empirical studies. The last experiment portrays a bimodal distribution with asymmetric weights on the two components. I simulate 1,000 panel datasets for each setup and report the average statistics of these 1,000 repetitions. Forecasting performance, especially the relative rankings and magnitudes, is highly stable across repetitions. In each repetition, I generate 40,000 MCMC draws with the first 20,000 being discarded as burn-in. Based on graphical and statistical tests, the MCMC draws appear to converge to a stationary distribution. Both the Brook-Draper diagnostic and the Raftery-Lewis diagnostic yield desirable MCMC accuracy. See Figures D.1 to D.4 for trace plots, prior/posterior distributions, rolling means, and autocorrelation graphs of β, σ2, α, and λ . 1 Table 5.2 shows the forecasting comparison among alternative predictors. For each experiment, 30

Table 5.2: Forecast Evaluation: Baseline Model Degenerate Skewed Fat Tail Bimodal MSE* LPS*N MSE* LPS*N MSE* LPS*N MSE* LPS*N Oracle 0.25*** -725*** 0.29*** -798*** 0.29*** -804*** 0.27*** -766*** NP-R 0.8%*** -4*** 0.04%*** -0.3*** 0.08%*** -1*** 1.2%*** -6*** Homog 0.03%*** -0.2*** 32%*** -193*** 29%*** -187*** 126%*** -424*** Flat 21%*** -102*** 1.4%*** -7*** 0.3%*** -2*** 8%*** -38*** Param 0.8%*** -4*** 0.3%*** -1*** 0.1%*** -1.5*** 7%*** -34*** NP-disc 0.03%*** -0.2*** 31%*** -206*** 29%*** -205*** 7%*** -40*** The point forecasts are evaluated by MSE together with the Diebold and Mariano (1995) test. The performance of the density forecasts is assessed by the LPS and the Amisano and Giacomini (2007) test. Both performance statistics are further averaged over 1,000 Monte Carlo samples. For the oracle predictor, the table reports the exact values of MSE and LPS*N (i.e. LPS multiplied by the cross-sectional dimension N). For other predictors, the table reports the percentage deviations from the oracle MSE and difference with respecttotheoracleLPS*N.ThetestsareconductedwithrespecttoNP-R,withsignificancelevelsindicated by *: 10%, **: 5%, and ***: 1%. The entries in bold indicate the best feasible predictor in each column. point forecasts and density forecasts share comparable rankings. When the λ distribution is degeni erate, Homog and NP-disc are the best, as expected. They are followed by NP-R and Param, and Flat is considerably worse. When the λ distribution is non-degenerate, there is a substantial gain i in both point forecasts and density forecasts from employing the NP-R predictor. In the bimodal case, the NP-R predictor far exceeds all other competitors. In the skewed and fat tailed cases, the Flat and Param predictors are second best, yet still significantly inferior to NP-R. The Homog and NP-disc predictors yield the poorest forecasts, which suggests that their discrete supports are not able to approximate the continuous λ distribution, and even the nonparametric DP prior with i countably infinite support (NP-disc) is far from enough. Therefore, when researchers believe that the underlying λ distribution is indeed discrete, the i DPprior(NP-disc)isamoresensiblechoice; ontheotherhand, whentheunderlyingλ distribution i isactuallycontinuous,theDPMprior(ortheMGLR priorforthecorrelatedrandomeffectsmodel) x promotes better forecasts. In the empirical application to young firm dynamics, it would be more reasonable to assume continuous distributions of individual heterogeneity in levels, reactions to R&D, and shock sizes, and results show that the continuous nonparametric prior outperforms the discrete DP prior in terms of density forecasts (see Table 6.2). Toinvestigatewhyweobtainbetterforecasts, Figure5.2demonstratestheposteriordistribution of the λ distribution (i.e. a distribution over distributions) for experiments Skewed, Fat Tail, and i Bimodal. In each case, the subgraphs are constructed from the estimation results of one of the 1,000 repetitions, with the left subgraph given by the Param estimator and the right one by NP-R. In each subgraph, the black solid line represents the true λ distribution, f . The blue bands show i 0 the posterior distribution of f, Π(f|y ). 1:N,0:T 31

For the skewed λ distribution, the NP-R estimator better tracks the peak on the left and the i tail on the right. For the λ distribution with fat tails, the NP-R estimator accommodates the i slowly decaying tails, but is still not able to fully mimic the spiking peak. For the bimodal λ i distribution, it is not surprising that the NP-R estimator nicely captures the M-shape. In summary, the nonparametric prior flexibly approximates a vast set of distributions, which helps provide more preciseestimatesoftheunderlyingλ distributionsandconsequentlymoreaccuratedensityforecasts. i This observation confirms the connection between skill distribution estimation and density forecasts as revealed in Proposition 3.14. Ihavealsoconsideredvariousrobustnesschecks. Intermsofthesetup,Ihaverundifferentcrosssectional dimensions N = 100, 500, 1000, 105, different time spans T = 6, 10, 20, 50, different persistences β = 0.2, 0.5, 0.8, 0.95, different sizes of the i.i.d. shocks σ2 = 1/4 and 1 (affecting the signal-to-noise ratio), and different underlying λ distributions (including standard normal). i In general, the NP-R predictor is the overall best for density forecasts except when the true λ i comes from a degenerate distribution or a normal distribution. In the latter case, the parsimonious Param prior coincides with the underlying λ distribution but is only marginally better than the i NP-R predictor. Intuitively, in the language of young firm dynamics, the superiority of the NP-R predictor is more prominent when the time series for a specific firm i is not informative enough to reveal its skill but the whole panel can recover the skill distribution and hence firm i’s uncertainty due to heterogenous skill. That is, NP-R works better than the alternatives when N is not too small, T is not too long, σ2 is not too large, and the λ distribution is relatively non-Gaussian. i Furthermore, as the cross-sectional dimension N increases, the blue band in Figure 5.2 gets closer to the true f and eventually completely overlaps it (see Figure D.5), which resonates the posterior 0 consistency statement. In terms of estimators, I have also constructed the posterior sampler for more sophisticated priors, suchasthePitman-Yorprocesswhichallowsapowerlawtailforclusteringbehaviors, aswell as DPM with skew normal components which better accommodates asymmetric data generating processes. They provide some improvement in the corresponding situations, but call for extra computation efforts. 5.4 General Model Thegeneralmodelaccountsforthreekeyfeatures: (i)multidimensionalindividualheterogeneity,(ii) cross-sectional heteroskedasticity, and (iii) correlated random coefficients. The exact specification is characterized in Table 5.3. In terms of multidimensional individual heterogeneity, λ is now a 3-by-1 vector, and the cori (2) responding covariates are composed of the level, time-specific w , and individual-time-specific t−1 (3) w . i,t−1 In terms of correlated random coefficients, I adopt the conditional distribution following Dunson 32

Figure 5.2: f vs Π(f|y ) : Baseline Model 0 1:N,0:T (a) Skewed (b) Fat Tail (c) Bimodal 33

Table 5.3: Simulation Setup: General Model Law of motion y = βy +λ(cid:48)w +u , u ∼ N (cid:0) 0,σ2(cid:1) it i,t−1 i i,t−1 it it i Covariates w = [1,w (2) ,w (3) ](cid:48), i,t−1 t−1 i,t−1 (2) (3) where w ∼ N (0,1) and w ∼ Ga(1,1) t−1 i,t−1 Common parameters β = 0.8 0 Initial conditions y ∼ U (0,1) i0 Correlated random coefficients λ i |y i0 ∼ e−2yi0N (cid:0) y i0 v,0.12vv(cid:48)(cid:1) + (cid:0) 1−e−2yi0 (cid:1) N (cid:0) y i 4 0 v,0.22vv(cid:48)(cid:1) , where v = [1,2,−1](cid:48) (cid:104) (cid:105) Cross-sectional heteroskedasticity σ2|y ∼ 0.454(y +0.5)2·IG(51,50)+0.05 ·1 (cid:0) σ2 ≤ 106(cid:1) i i0 i0 i Sample size N = 1000, T = 6 and Park (2008) and Norets and Pelenis (2014). They regard it as a challenging problem because such conditional distribution exhibits rapid changes in its shape, which considerably restricts local sample size. The original conditional distribution in their papers is one-dimensional, and I expand ittoaccommodatethethree-dimensionalλ viaalineartransformationoftheoriginal. InFigure5.3 i panel (a), the left subgraph presents the joint distribution of λ and y , where λ is the coefficient i1 i0 i1 (1) on w = 1 and can be interpreted as the heterogeneous intercept. It shows that the shape of the i,t−1 joint distribution is fairly complex, containing many local peaks and valleys. The right subgraph showstheconditionaldistributionofλ giveny = 0.25, 0.5, 0.75. Wecanseethattheconditional i1 i0 distribution is involved as well and evolves with the conditioning variable y . i0 In addition, I also let the cross-sectional heteroskedasticity interact with the initial conditions, and the functional form is modified from Pelenis (2014) case 2. The modification guarantees that the σ2 distribution is continuous with a bounded support above zero (see Propositions 3.13 and i 3.14), and that the signal-to-noise ratio is not far from 1. Their joint and conditional distributions are depicted in Figure 5.3 panel (b). Due to cross-sectional heteroskedasticity and correlated random coefficients, the prior structures becomemorecomplicated. Table5.4describesthepriorsetupsofλ andl , withthepredictorlabels i i being consistent with the definitions in Subsection 5.2. Note that I further add the Homosk-NP-C predictor in order to examine whether it is practically relevant to model heteroskedasticity. Table 5.5 assesses the forecasting performance of these predictors. Considering point forecasts, Heterosk-NP-R and Heterosk-Param constitute the first tier, Heterosk-NP-disc and Heterosk-NP-C can be viewed as the second tier, Homosk-NP-C is the third tier, and Homog and Heterosk-Flat are markedly inferior. It is not surprising that more parsimonious estimators outperform Heterosk-NP- C in terms of point forecasts, though Heterosk-NP-C is correctly specified while the parsimonious ones are not. Nevertheless, the focus of this paper is density forecasting, where Heterosk-NP-C becomes the most accurate density predictor. Several lessons can be inferred from a more detailed comparison 34

Figure 5.3: DGP: General Model (a) p(λ |y ) i1 i0 (b) p (cid:0) σ2|y (cid:1) i i0 Table 5.4: Prior Structures Predictor λ prior σ2 prior i i Heterosk NP-C fλ ∼ MGLR fl ∼ MGLR x x Homog Point mass Point mass Homosk NP-C fλ ∼ MGLR Point mass x Heterosk Flat Uninformative Uninformative Param N IG NP-disc fλ ∼ DP fl ∼ DP NP-R fλ ∼ DPM fl ∼ DPM See Appendix C.5 for details of Heterosk-Param. 35

Table 5.5: Forecast Evaluation: General Model MSE* LPS*N Oracle 0.70*** -1150*** Heterosk NP-C 13.68%*** -74*** Homog 89.28%*** -503*** Homosk NP-C 20.84%*** -161*** Heterosk Flat 151.60%*** -515*** Param 11.30%*** -139*** NP-disc 13.08%*** -150*** NP-R 11.25%*** -93*** See the description of Table 5.2. Here the tests are conducted with respect to Heterosk-NP-C. among predictors. First, based on the comparison between Heterosk-NP-C and Homog/Homosk- NP-C, it is important to account for individual effects in both coefficients λ s and shock sizes σ2s. i i Second, comparing Heterosk-NP-C with Heterosk-Flat/Heterosk-Param, we see that the flexible nonparametric prior plays a significant role in enhancing density forecasts. Third, the difference between Heterosk-NP-C and Heterosk-NP-disc indicates that the discrete prior performs less satisfactorily when the underlying individual heterogeneity is continuous. Last, Heterosk-NP-R is less favorable than Heterosk-NP-C, which necessitates a careful modeling of the correlated random coefficient structure. 6 Empirical Application: Young Firm Dynamics 6.1 Background and Data To see how the proposed predictor works in real-world analysis, I applied it to provide density forecasts of young firm performance. Studies have documented that young firm performance is affected by R&D, recession, etc. and that different firms may react differently (Akcigit and Kerr, 2016; Robb and Seamans, 2014; Zarutskie and Yang, 2015). In this empirical application, I examine this type of firm-specific latent heterogeneity from a density forecasting perspective. To analyze firm dynamics, traditional cross-sectional data are not sufficient, whereas panel data are more suitable as they track the firms over time. In particular, it is desirable to work with a dataset that contains sufficient information on early firm innovation and performance, and spreads over the recent recession. The restricted-access Kauffman Firm Survey (KFS) is the ideal candidate for such purposes, as it offers the largest panel of startups (4,928 firms founded in 2004, nationally representative sample) and the longest time span (2004-2011, one baseline survey and seven followup annual surveys), together with detailed information on young firms. See Robb et al. (2009) for further description of the survey design.39 39Here I do not impose KFS sample weights on firms as the purpose of the current study is forecasting individual 36

6.2 Model Specification Iconsiderthegeneralmodelwithmultidimensionalindividualheterogeneityinλ andcross-sectional i heteroskedasticityinσ2. Followingthefirmdynamicsliterature,suchasAkcigitandKerr(2016)and i Zarutskie and Yang (2015), firm performance is measured by employment. From an economic point of view, young firms make a significant contribution to employment and job creation (Haltiwanger et al., 2012), and their struggle during the recent recession may partly account for the recent jobless recovery. Specifically, here y is chosen to be the log of employment denoted as logemp . I adopt it it the log of employment instead of the employment growth rate, as the latter significantly reduces the cross-sectional sample size due to the rank requirement for unbalanced panels. Below, I focus on the following model specification,40 logemp = βlogemp +λ +λ R&D +u , u ∼ N (cid:0) 0,σ2(cid:1) , it i,t−1 1i 2i i,t−1 it it i where R&D is given by the ratio of a firm’s R&D employment over its total employment, conit sidering that R&D employment has more complete observations compared with other innovation intensity gauges.41 The panel used for estimation spans from 2004 to 2010 with time dimension T = 6.42 The data for 2011 are reserved for pseudo out-of-sample forecast evaluation. The sample is constructed as follows. First, for any (i,t) combination where R&D employment is greater than the total employment, there is an incompatibility issue, so I set R&D = NA, which only affects 0.68% of it the observations. Then, I only keep firms with long enough observations according to Assumption 3.5, which ensures identification in unbalanced panels. This results in cross-sectional dimension N = 654. The proportion of missing values are (#missing obs)/(NT) = 6.27% . The descriptive statistics for logemp and R&D are summarized in Table 6.1, and the corresponding histograms it it are plotted in Figure 6.1, where both distributions are right skewed and may have more than one peak. Therefore, we anticipate that the proposed predictors with nonparametric priors would perform well in this scenario. 6.3 Results The alternative priors are similar to those in the Monte Carlo simulation except for one additional prior,Heterosk-NP-C/R,whereλ canbecorrelatedwithy whileσ2 isindependentwithrespectto i i0 i firm performance. Further extensions can easily incorporate weights into the estimation procedure. 40See Appendix D.2 for other setups. 41Ihavealsoexploredothermeasuresoffirmperformance(e.g. thelogofrevenue)andinnovationactivities(e.g.a binaryvariableonwhetherthefirmspendsanymoneyonR&D,numbersofintellectualproperties–patents,copyrights, or trademarks–owned or licensed by the firm). The relative rankings of density forecasts are generally robust across measures. 42Notethattheestimationsamplestartsfromperiod0(i.e.2004)andendsatperiodT (i.e.2010)withT +1=7 periods in total. 37

Table 6.1: Descriptive Statistics for Observable 10% mean med 90% std skew kurt log emp 0.41 1.44 1.34 2.63 0.86 0.82 3.58 R&D 0.05 0.22 0.17 0.49 0.18 1.21 4.25 Figure 6.1: Histograms for Observables y , by adopting an MGLR prior on λ and a DPM prior on l = log σ¯2(σ i 2−σ2) ,.43 The conditioning i0 x i i σ¯2−σ2 i set is chosen to be standardized y . The standardization ensures numerical stability in practice, as i0 the conditioning variables enter exponentially into the covariance function for the Gaussian process. ThefirsttwocolumnsinTable6.2characterizetheposteriorestimatesofthecommonparameter β. Inmostofthecases,theposteriormeansarearound0.5 ∼ 0.6,whichsuggeststhattheyoungfirm performance exhibits some degree of persistence, but not remarkably strong, which is reasonable as young firms generally experience more uncertainty. For Homog and NP-disc, their posterior means of β are much larger. This may arise from the fact that homogeneous or discrete λ structure is not i able to capture all individual effects, so these estimators may attribute the remaining individual effects to persistence and thus overestimate β. NP-R also gives a large estimate of β. The reason is similar – if the true data generating process is correlated random effects/coefficients, the random effects/coefficientsmodelwouldmisstheeffectsoftheinitialconditionandmisinterpretthemasthe persistence of the system. In all scenarios, the posterior standard deviations are relatively small, which indicates that the posterior distributions are very tight.44 The last two columns in Table 6.2 compare the forecasting performance. The Heterosk-NP-C/R 43It is possible to craft other priors according to the specific heterogeneity structure of the empirical problem at hand. Forexample,letλ correlatewithy whilesettingλ independentofy . Iwillleavethistofutureexploration. i1 i0 i2 i0 44Comparingwiththeliterature,theclosestoneisZarutskieandYang(2015)usingtraditionalpaneldatamethods, where the estimated persistence of log employment is 0.824 and 0.816 without firm fixed effects (Table 2) which is closetoHomog,and0.228withfirmfixedeffectsestimatedviaOLS(Table4)whichissmallerthanalltheestimates here. 38

Table 6.2: Parameter Estimation and Forecast Evaluation: Young Firm Dynamics β Forecast Mean Std MSE* LPS*N Heterosk NP-C/R 0.52 0.01 0.20*** -228*** Homog 0.89 0.02 8%*** -74*** Homosk NP-C 0.51 0.03 9%*** -52*** Heterosk Flat 0.50 0.00 102%*** -309*** Param 0.56 0.03 7%*** -52*** NP-disc 0.84 0.04 2%*** -20*** NP-R 0.74 0.04 3%*** -16*** NP-C 0.53 0.01 0.1%*** -5*** See the description of Table 5.2 for forecast evaluation. Here Heterosk-NP-C/R is the benchmark for both normalization and significance tests. predictoristhebenchmarkforallcomparisons. Intermsofpointforecasts,mostoftheestimatorsare comparable according to MSE, with only Flat performing significantly poorly. Intuitively, shrinkage in general leads to better forecasting performance, especially for point forecasts, whereas the Flat prior does not introduce any shrinkage to individual effects (cid:0) λ ,σ2(cid:1) . Conditional on the common i i parameter β, the Flat estimator of (cid:0) λ ,σ2(cid:1) is a Bayesian analog to individual-specific MLE/OLS i i that utilizes only the individual-specific observations, which is inadmissible under fixed T (Robbins, 1956; James and Stein, 1961; Efron, 2012). For density forecasts measured by LPS, the overall best is the Heterosk-NP-C/R predictor. The main message is similar to the Monte Carlo simulation of the general model in Subsection 5.4. In summary, it is crucial to account for individual effects in both coefficients λ s and shock sizes i σ2s through a flexible nonparametric prior that acknowledges continuity and correlated random i effects/coefficients when the underlying individual heterogeneity is likely to possess these features. Intuitively, the odds given by the exponential of the difference in LPS indicate that the future realizations are on average 12% more likely in Heterosk-NP-C/R versus Homog, 60% more likely in Heterosk-NP-C/R versus Heterosk-Flat, etc. Note that now both NP-R and NP-C are inferior to NP-C/R where the distribution of λ depends on the initial conditions but the distribution of σ2 i i does not.45 Figure 6.2 provides the histograms of the probability integral transformation (PIT). While LPS characterizes the relative ranks of predictors, PIT supplements LPS and can be viewed as an absolute evaluation on how well the density forecasts coincide with the true (unobserved) conditional forecasting distributions with respect to the current information set. In this sense, under the null hypothesis that the density forecasts coincide with the true data generating process, the probability 45This result cannot be directly compared to the Gibrat’s law literature (Lee et al., 1998; Santarelli et al., 2006), as the dependent variable here is the log of employment instead of employment growth. 39

Figure 6.2: PIT Red lines indicate the confidence interval. integral transforms are i.i.d. U (0,1) and the histogram is close to a flat line.46 In each subgraph, the two red lines indicate the confidence interval. We can see that, in NP-C/R, NP-C, and Flat, the histogram bars are mostly within the confidence band, while other predictors yield apparent inverse-U shapes. The reason might be that the other predictors do not take correlated random coefficients into account but instead attribute the subtlety of correlated random coefficients to the estimated variance, which leads to more diffused predictive distributions. Figure 6.3 shows the predictive distributions of 10 randomly selected firms. In terms of the Homog predictor, all predictive distributions share the same Gaussian shape paralleling with each other. Onthecontrary,intermsoftheNP-C/Rpredictor,itisclearthatthepredictivedistributions are fairly different in their center location, variance, and skewness. Figure 6.4 further aggregates the predictive distributions over sectors based on the two-digit NAICS codes (Table 6.3). It plots the predictive distributions of the log of the average employment within each sector. Comparing Homog and NP-C/R across sectors, we can see the following several patterns. First, NP-C/R predictive distributions tend to be narrower. The reason is that NP-C/R tailors to each individual firm while Homog prescribes a general model to all the firms, so NP-C/R yields more precise predictive distributions. Second, NP-C/R predictive distributions have longer right tails, whereas Homog ones are distributed in the standard bell shape. The long right tails in NP-C/R concur with the general intuition that good ideas are scarce. Finally, there is substantial heterogeneity in density forecasts across sectors. For sectors with relatively large average employment, e.g. construction (sector 23), Homog pushes the forecasts down and hence systematically underpredicts their future employment, while NP-C/R respects this source of heterogeneity and 46See Diebold et al. (1998) for details of PIT and Amisano and Geweke (2017) for formal PIT tests. 40

Figure 6.3: Predictive Distributions: 10 Randomly Selected Firms Table 6.3: Two-digit NAICS Codes Code Sector Code Sector 11 Agriculture, Forestry, Fishing 52 Finance and Insurance and Hunting 21 Mining, Quarrying, and Oil 53 Real Estate and Rental and Leasing and Gas Extraction 22 Utilities 54 Professional, Scientific, and Technical Services 23 Construction 56 Administrative and Support and Waste Management and Remediation Services 31-33 Manufacturing 61 Educational Services 42 Wholesale Trade 62 Health Care and Social Assistance 44-45 Retail Trade 71 Arts, Entertainment, and Recreation 48-49 Transportation and 72 Accommodation and Food Services Warehousing 51 Information 81 Other Services (except Public Administration) significantly lessens the underprediction problem. On the other hand, for sectors with relatively small average employment, e.g. retail trade (sector 44), Homog introduces an upward bias into the forecasts, while NP-C/R reduces such bias by flexibly estimating the underlying distribution of firm-specific heterogeneity. The latent heterogeneity structure is presented in Figure 6.5, which plots the joint distributions of the estimated individual effects and the conditional variable. In all the three subgraphs, the pairwise relationships among λ , λ , and standardized y are nonlinear and exhibit multiple i,level i,RD i0 components, which reassures the utilization of nonparametric prior with correlated random coefficients. Furthermore, λ , λ , and standardized y are positively correlated with each other, i,level i,RD i0 which roughly indicates that larger firms respond more positively to R&D activities within the KFS 41

Figure 6.4: Predictive Distributions: Aggregated by Sectors Subgraph titles are two-digit NAICS codes. Only sectors with more than 10 firms are shown. Figure 6.5: Joint Distributions of λˆ and Condition Variable i young firm sample.47 47The model here mainly serves the forecasting purpose, so we need to be careful with any causal interpretation. 42

7 Concluding Remarks This paper proposes a semiparametric Bayesian predictor, which performs well in density forecasts of individuals in a panel data setup. It considers the underlying distribution of individual effects and pools the information from the whole cross-section in a flexible and efficient way. Monte Carlo simulations and an empirical application to young firm dynamics show that the keys for the better density forecasts are, in order of importance, nonparametric Bayesian prior, cross-sectional heteroskedasticity, and correlated random coefficients. Moving forward, I plan to extend my research in the following directions. Theoretically, I will continue the Bayesian asymptotic discussion with rates of convergence, which will provide more insight into how N, T, d , and shock size affect the performance of the proposed semiparametric w Bayesianpredictor. Methodologically,Iwillexploresomevariationsofthecurrentsetup. First,some empirical studies may include a large number of covariates with potential heterogeneous effects (i.e. more variables included in w ), so it is both theoretically and empirically desirable to i,t−1 investigate a variable selection scheme in a high-dimensional nonparametric Bayesian framework. Chung and Dunson (2012) and Liverani et al. (2015) employ variable selection via binary switches, which may be adaptable to the panel data setting. Another possible direction is to construct a Bayesian-Lasso-type estimator coherent with the current nonparametric Bayesian implementation. Second, IwillconsiderpanelVAR(CanovaandCiccarelli,2013), ausefultooltoincorporateseveral variables for each of the individuals and to jointly model the evolution of these variables, allowing us to take more information into account for forecasting purposes and offer richer insights into the latent heterogeneity structure. Meanwhile, it is also interesting to incorporate extra crossvariable restrictions derived from economic theories and implement the Bayesian GMM method as proposed in Shin (2014). Third, I will experiment with nonlinear panel data models, such as the Tobit model, which helps accommodate firms’ endogenous exit choice. Such extensions would be numerically feasible, but require further theoretical work. A natural next step would be extending the theoretical discussion to the family of“generalized linear models”. 43

References Akcigit, U. and Kerr, W. R. (2016). Growth through heterogeneous innovations. Journal of Political Economy, forthcoming. Amewou-Atisso, M., Ghosal, S., Ghosh, J. K. and Ramamoorthi, R. V. (2003). Posterior consistency for semi-parametric regression problems. Bernoulli, 9 (2), 291–312. Amisano, G.andGeweke, J.(2017).Predictionusingseveralmacroeconomicmodels.The Review of Economics and Statistics, 99 (5), 912–925. — and Giacomini, R. (2007). Comparing density forecasts via weighted likelihood ratio tests. Journal of Business & Economic Statistics, 25 (2), 177–190. Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, pp. 1152–1174. Arellano, M. (2003). Panel Data Econometrics. Oxford University Press. —andBonhomme, S.(2012).Identifyingdistributionalcharacteristicsinrandomcoefficientspanel data models. The Review of Economic Studies, 79 (3), 987–1020. —andBover, O.(1995).Anotherlookattheinstrumentalvariableestimationoferror-components models. Journal of Econometrics, 68 (1), 29 – 51. — and Honore´, B. (2001). Panel data models: some recent developments. Handbook of econometrics, 5, 3229–3296. Atchade´, Y.F.andRosenthal, J.S.(2005).OnadaptiveMarkovchainMonteCarloalgorithms. Bernoulli, 11 (5), 815–828. Baltagi, B. (1995). Econometric Analysis of Panel Data. John Wiley & Sons, New York. Barron, A., Schervish, M. J. and Wasserman, L. (1999). The consistency of posterior distributions in nonparametric problems. Ann. Statist., 27 (2), 536–561. Basu, S. and Chib, S. (2003). Marginal likelihood and Bayes factors for Dirichlet process mixture models. Journal of the American Statistical Association, 98 (461), 224–235. Blackwell, D. and Dubins, L. (1962). Merging of opinions with increasing information. The Annals of Mathematical Statistics, 33 (3), 882–886. Burda, M. and Harding, M. (2013). Panel probit with flexible correlated effects: quantifying technology spillovers in the presence of latent heterogeneity. Journal of Applied Econometrics, 28 (6), 956–981. 44

Canale, A. and De Blasi, P. (2017). Posterior asymptotics of nonparametric location-scale mixtures for multivariate density estimation. Bernoulli, 23 (1), 379–404. Canova, F. and Ciccarelli, M. (2013). Panel Vector Autoregressive Models: A Survey. Working Paper Series, European Central Bank 1507, European Central Bank. Chamberlain, G.andHirano, K.(1999).Predictivedistributionsbasedonlongitudinalearnings data. Annales d’Economie et de Statistique, pp. 211–242. Chib, S. (2008). Panel data modeling and inference: a Bayesian primer. In The econometrics of panel data, Springer, pp. 479–515. — and Carlin, B. P. (1999). On MCMC sampling in hierarchical longitudinal models. Statistics and Computing, 9 (1), 17–26. Chung, Y. and Dunson, D. B. (2012). Nonparametric Bayes conditional distribution modeling with variable selection. Journal of the American Statistical Association. Compiani, G. and Kitamura, Y. (2016). Using mixtures in econometric models: a brief review and some new results. The Econometrics Journal, 19 (3), C95–C127. Delaigle, A., Hall, P.andMeister, A.(2008).Ondeconvolutionwithrepeatedmeasurements. The Annals of Statistics, pp. 665–685. Diaconis, P. and Freedman, D. (1986). On inconsistent Bayes estimates of location. The Annals of Statistics, pp. 68–87. Diebold, F. X., Gunther, T. A. and Tay, A. S. (1998). Evaluating density forecasts with applications to financial risk management. International Economic Review, 39 (4), 863–883. — and Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics, 13 (3). Doob, J. L. (1949). Application of the theory of martingales. Le calcul des probabilites et ses applications, pp. 23–27. Dunson, D. B.(2009).NonparametricBayeslocalpartitionmodelsforrandomeffects.Biometrika, 96 (2), 249–262. — and Park, J.-H. (2008). Kernel stick-breaking processes. Biometrika, 95 (2), 307–323. Efron, B. (2012). Large-scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, vol. 1. Cambridge University Press. 45

Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90 (430), 577–588. Evdokimov, K. (2010). Identification and estimation of a nonparametric panel data model with unobserved heterogeneity. — and White, H. (2012). Some extensions of a lemma of Kotlarski. Econometric Theory, 28 (4), 925–932. Freedman, D. A. (1963). On the asymptotic behavior of Bayes’ estimates in the discrete case. The Annals of Mathematical Statistics, pp. 1386–1403. — (1965). On the asymptotic behavior of Bayes estimates in the discrete case II. The Annals of Mathematical Statistics, 36 (2), 454–456. Galambos, J. and Simonelli, I. (2004). Products of Random Variables: Applications to Problems of Physics and to Arithmetical Functions. Marcel Dekker. Geweke, J.andAmisano, G.(2010).ComparingandevaluatingBayesianpredictivedistributions of asset returns. International Journal of Forecasting, 26 (2), 216–230. Ghosal, S., Ghosh, J. K., Ramamoorthi, R. et al. (1999). Posterior consistency of Dirichlet mixtures in density estimation. The Annals of Statistics, 27 (1), 143–158. — and van der Vaart, A. (2007). Posterior convergence rates of Dirichlet mixtures at smooth densities. Ann. Statist., 35 (2), 697–723. — and — (2017). Fundamentals of Nonparametric Bayesian Inference, vol. 44. Cambridge University Press. Ghosh, J. K. and Ramamoorthi, R. (2003). Bayesian Nonparametrics. Springer-Verlag. Griffin, J. E. (2016). An adaptive truncation method for inference in Bayesian nonparametric models. Statistics and Computing, 26 (1), 423–441. Gu, J. and Koenker, R. (2017a). Empirical bayesball remixed: Empirical bayes methods for longitudinal data. Journal of Applied Econometrics, 32 (3), 575–599. — and — (2017b). Unobserved heterogeneity in income dynamics: An empirical bayes perspective. Journal of Business & Economic Statistics, 35 (1), 1–16. Hall, B.H.andRosenberg, N.(2010).HandbookoftheEconomicsofInnovation,vol.1.Elsevier. Haltiwanger, J., Jarmin, R. S. and Miranda, J. (2012). Who creates jobs? Small versus large versus young. Review of Economics and Statistics, 95 (2), 347–361. 46

Hastie, D. I.,Liverani, S.andRichardson, S.(2015).SamplingfromDirichletprocessmixture models with unknown concentration parameter: mixing issues in large data implementations. Statistics and Computing, 25 (5), 1023–1037. Hirano, K.(2002).SemiparametricBayesianinferenceinautoregressivepaneldatamodels.Econometrica, 70 (2), 781–799. Hjort, N. L., Holmes, C., Mu¨ller, P. and Walker, S. G. (2010). Bayesian Nonparametrics. Cambridge University Press. Hsiao, C. (2014). Analysis of panel data. Cambridge university press. Hu, Y. (2017). The econometrics of unobservables: Applications of measurement error models in empiricalindustrialorganizationandlaboreconomics.JournalofEconometrics,200(2),154–168. Ishwaran, H.andJames, L. F.(2001).Gibbssamplingmethodsforstick-breakingpriors.Journal of the American Statistical Association, 96 (453), 161–173. — and — (2002). Approximate Dirichlet process computing in finite normal mixtures: smoothing and prior information. Journal of Computational and Graphical Statistics, 11 (3), 508–532. James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, Berkeley, Calif.: University of California Press, pp. 361–379. Jensen, M. J., Fisher, M. and Tkac, P. (2015). Mutual fund performance when learning the distribution of stock-picking skill. Kalli, M., Griffin, J. E. and Walker, S. G. (2011). Slice sampling mixture models. Statistics and Computing, 21 (1), 93–105. Lancaster, T. (2002). Orthogonal parameters and panel data. The Review of Economic Studies, 69 (3), 647–666. Lee, Y., Amaral, L. A. N., Canning, D., Meyer, M. and Stanley, H. E. (1998). Universal featuresinthegrowthdynamicsofcomplexorganizations.Physical Review Letters, 81(15), 3275. Li, T. and Vuong, Q. (1998). Nonparametric estimation of the measurement error model using multiple indicators. Journal of Multivariate Analysis, 65 (2), 139 – 165. Liu, L., Moon, H. R.andSchorfheide, F.(2017).Forecastingwithdynamicpaneldatamodels. Liverani, S.,Hastie, D.I.,Azizi, L.,Papathomas, M.andRichardson, S.(2015).PReMiuM: anRpackageforprofileregressionmixturemodelsusingDirichletprocesses.Journal of Statistical Software, 64 (7). 47

Llera, A. and Beckmann, C. (2016). Estimating an Inverse Gamma distribution. arXiv preprint arXiv:1605.01019. Marcellino, M., Stock, J. H. and Watson, M. W. (2006). A comparison of direct and iterated multistep AR methods for forecasting macroeconomic time series. Journal of Econometrics, 135 (1), 499–526. Neal, R. M.(2000).MarkovchainsamplingmethodsforDirichletprocessmixturemodels.Journal of Computational and Graphical Statistics, 9 (2), 249–265. Norets, A. (2010). Approximation of conditional densities by smooth mixtures of regressions. The Annals of Statistics, 38 (3), 1733–1766. —andPati, D.(2017).AdaptiveBayesianestimationofconditionaldensities.EconometricTheory, 33 (4), 980–1012. — and Pelenis, J. (2012). Bayesian modeling of joint and conditional distributions. Journal of Econometrics, 168 (2), 332–346. — and — (2014). Posterior consistency in conditional density estimation by covariate dependent mixtures. Econometric Theory, 30, 606–646. Papaspiliopoulos, O. and Roberts, G. O. (2008). Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika, 95 (1), 169–186. Pati, D., Dunson, D. B. and Tokdar, S. T. (2013). Posterior consistency in conditional distribution estimation. Journal of Multivariate Analysis, 116, 456–472. Pav, S. E. (2015). Moments of the log non-central chi-square distribution. arXiv preprint arXiv:1503.06266. Pelenis, J. (2014). Bayesian regression with heteroscedastic error density and parametric mean function. Journal of Econometrics, 178, 624–638. Robb, A., Ballou, J., DesRoches, D., Potter, F., Zhao, Z. and Reedy, E. (2009). An overview of the Kauffman Firm Survey: results from the 2004-2007 data. Available at SSRN 1392292. —andSeamans, R.(2014).TheroleofR&Dinentrepreneurialfinanceandperformance.Available at SSRN 2341631. Robbins, H.(1956).AnempiricalBayesapproachtostatistics.InProceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley and Los Angeles. 48

Rossi, P. E. (2014). Bayesian Non- and Semi-parametric Methods and Applications. Princeton University Press. Santarelli, E.,Klomp, L.andThurik, A. R.(2006).Gibrat’slaw: anoverviewoftheempirical literature. In Entrepreneurship, Growth, and Innovation, Springer, pp. 41–73. Sethuraman, J.(1994).AconstructivedefinitionofDirichletpriors.Statistica Sinica,pp.639–650. Shin, M. (2014). Bayesian GMM. Sims, C. A. (2000). Using a likelihood perspective to sharpen econometric discourse: Three examples. Journal of econometrics, 95 (2), 443–462. Tokdar, S. T. (2006). Posterior consistency of Dirichlet location-scale mixture of normals in density estimation and regression. Sankhy¯a: The Indian Journal of Statistics, pp. 90–110. Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics - Simulation and Computation, 36 (1), 45–54. Yau, C., Papaspiliopoulos, O., Roberts, G. O. and Holmes, C. (2011). Bayesian nonparametric hidden Markov models with applications in genomics. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73 (1), 37–57. Zarutskie, R. and Yang, T. (2015). How did young firms fare during the great recession? EvidencefromtheKauffmanFirmSurvey.InMeasuringEntrepreneurialBusinesses: CurrentKnowledge and Challenges, University of Chicago Press. 49

A Notations U (a,b) represents a uniform distribution with minimum value a and maximum value b. If a = 0 and b = 1, we obtain the standard uniform distribution, U (0,1). N (cid:0) µ,σ2(cid:1) stands for a Gaussian/normal distribution with mean µ and variance σ2. Its probability distribution function (pdf) is given by φ (cid:0) x;µ,σ2(cid:1) . When µ = 0 and σ2 = 1 (i.e. standard normal), we reduce the notation to φ(x). The corresponding cumulative distribution functions (cdf) are denoted as Φ (cid:0) x;µ,σ2(cid:1) and Φ(x), respectively. The same convention holds for multivariate normal, where N (µ,Σ), φ(x;µ,Σ), and Φ(x;µ,Σ) are for the distribution with the mean vector µ and the covariance matrix Σ. The gamma distribution is denoted as Ga(a,b) with pdf being f (x;a,b) = ba xa−1e−bx. Ga Γ(a) The according inverse gamma distribution is given by IG(a,b) with pdf being f (x;a,b) = IG ba x−a−1e−b/x. The Γ(·) in the denominators is the gamma function. Γ(a) The inverse Wishart distribution is a generalization of the inverse gamma distribution to multi-dimensional setups. Let Ω be a d×d positive definite matrix following an inverse Wishart distribution IW(Ψ,ν), then its pdf is f IW (Ω;Ψ,ν) = 2 ν 2 | d Ψ Γ | d ν 2 (ν 2 ) |Ω|−ν+ 2 d+1 e−1 2 tr(ΨΩ−1). When Ω is a scalar, the inverse Wishart distribution is reduced to an inverse gamma distribution with a = ν/2, b = Ψ/2. Let G be a distribution drawn from the Dirichlet Process (DP). Denote G ∼ DP (α,G ), 0 if for any partition (A ,··· ,A ), (G(A ),··· ,G(A )) ∼ Dir(αG (A ),··· ,αG (A )). Dir(·) 1 K 1 K 0 1 0 K stands for the Dirichlet distribution, which is a multivariate generalization of the Beta distribution. An alternative view of DP is given by the stick-breaking process, G = (cid:80)∞ p 1(θ = θ ), where k=1 k k θ ∼ G and p ∼ SB(1,α). 1(·) is an indicator function that equals 1 if the condition in the k 0 k parenthesis is satisfied and equals 0 otherwise. For a generic variable c which can be multi-dimensional, we define a Gaussian process ζ(c) ∼ GP (m(c),V (c,c˜)) as follows: for any finite set of {c ,c ,··· ,c }, [ζ(c ),ζ(c ),··· ,ζ(c )](cid:48) has a 1 2 n 1 2 n joint Gaussian distribution with the mean vector being [m(c ),m(c ),··· ,m(c )](cid:48) and the i,j-th 1 2 n entry of the covariance matrix being V (c ,c ), i,j = 1,··· ,N. i j I is an N ×N identity matrix. N In the panel data setup, for a generic variable z, which can be v, w ,x, or y, z is a d ×1 it z vector, and z = (z ,··· ,z ) is a d ×(t −t +1) matrix. i,t1:t2 it1 it2 z 2 1 (cid:107)·(cid:107) representstheL -norm,e.g.theEuclideannormforan-dimensionalvectorz = [z ,z ,··· ,z ](cid:48) p p 1 2 n (cid:113) is given by (cid:107)z(cid:107) = z2+···+z2, and the L -norm for an integrable function is given by (cid:107)f(cid:107) = ´ 2 1 n 1 1 |f(x)|dx. supp(·) denotes the support of a probability measure. vec(·) denotes matrix vectorization, and ⊗ is the Kronecker product. A-1

B Explanations and Proofs B.1 Intuition: MGLR Prior x Here we give some intuition why the MGLR Prior is general enough to accommodate a broad class x of conditional distributions. Define a generic variable z which can represent either λ or l. By Bayes’ theorem, f(z,c ) 0 f(z|c ) = . 0 f(c ) 0 The joint distribution in the numerator can be approximated by a mixture of normals ∞ f(z,c ) ≈ (cid:88) p˜ φ (cid:16) (cid:2) z(cid:48),c(cid:48)(cid:3)(cid:48) ; µ˜ ,Ω˜ (cid:17) , 0 k 0 k k k=1 whereµ˜ isa(d +d )-elementcolumnvector,andΩ˜ isa(d +d )×(d +d )covariancematrix. k z c0 k z c0 z c0 µ˜ = (cid:2) µ˜(cid:48) ,µ˜(cid:48) (cid:3)(cid:48) , k k,z k,c0 (cid:34) (cid:35) Ω˜ Ω˜ Ω˜ = k,zz k,zc0 . k Ω˜ Ω˜ k,c0z k,c0c0 Applying Bayes’ theorem again to the normal kernel for each component k, (cid:16) (cid:17) (cid:16) (cid:17) (cid:16) (cid:17) φ (cid:2) z(cid:48),c(cid:48)(cid:3)(cid:48) ; µ˜ ,Ω˜ = φ c ; µ˜ ,Ω˜ φ z; µ (cid:2) 1,c(cid:48)(cid:3)(cid:48) ,Ω , 0 k k 0 k,c0 k,c0c0 k 0 k (cid:104) (cid:105) where µ = µ˜ −Ω˜ Ω˜−1 µ˜ , Ω = Ω˜ −Ω˜ Ω˜−1 Ω˜(cid:48) . Combining all the steps k k,z k,zc0 k,c0c0 k,c0 k k,zz k,zc0 k,c0c0 k,zc0 above, the conditional distribution can be approximated as (cid:16) (cid:17) ∞ p˜ φ c ; µ˜ ,Ω˜ φ (cid:0) z; µ [1,c(cid:48)](cid:48),Ω (cid:1) (cid:88) k 0 k,c0 k,c0c0 k 0 k f(z|c ) ≈ 0 f(y ) 0 k=1 ∞ = (cid:88) p (c )φ (cid:16) z; µ (cid:2) 1,c(cid:48)(cid:3)(cid:48) ,Ω (cid:17) , k 0 k 0 k k=1 The last line is given by collecting marginals of c into p (c ) = p˜ k φ(c0;µ˜ k,c0 ,Ω˜ k,c0c0 ) . 0 k 0 f(c0) In summary, the current setup is similar to approximating the conditional density via Bayes’ theorem, but does not explicitly model the distribution of the conditioning variable c , and thus 0 allows for more relaxed assumptions on it. A-2

B.2 Identification Proof. (Proposition 3.3) Part2forcross-sectionalheteroskedasticityσ2isnew. Part3foradditiveindividual-heterogeneity i λ follows Liu et al. (2017), which is based on the early works such as Arellano and Bover (1995) i and Arellano and Bonhomme (2012). 1. Identify common parameters β First, let us perform orthogonal forward differencing, i.e. for t = 1,··· ,T −d , w (cid:32) T (cid:33)−1 T (cid:88) (cid:88) y˜ = y −w(cid:48) w w(cid:48) w y , (B.1) it it i,t−1 i,s−1 i,s−1 i,s−1 is s=t+1 s=t+1 (cid:32) T (cid:33)−1 T (cid:88) (cid:88) x˜ = x −w(cid:48) w w(cid:48) w x . (B.2) i,t−1 i,t−1 i,t−1 i,s−1 i,s−1 i,s−1 i,s−1 s=t+1 s=t+1 Then, β is identified given Assumption 3.1 (2-d) and the following moment conditions E (cid:88) x˜ (cid:0) y˜ −x˜(cid:48) β (cid:1) = 0. i,t−1 it i,t−1 t 2. Identify the distribution of shock sizes fσ2 After orthogonal forward differencing, define u˜ = y˜ −β(cid:48)x˜ , it it i,t−1 T (cid:88) −dw σˆ2 = u˜2 = σ2k2. i it i i t=1 where k2 ∼ χ2(T −d ) follows an i.i.d. chi-squared distribution with (T −d ) degrees of freedom. i w w Note that Fourier transform (i.e. characteristic functions) is not suitable for disentangling products of random variables, so I resort to the Mellin transform (Galambos and Simonelli, 2004). For a generic variable x, the Mellin transform of f(x) is specified as48 ˆ M (ξ) = xiξf(x)dx, x which exists for all ξ ∈ R. Considering that σ2|c and k2 are independent, we have i i0 i M (ξ|c) = M (ξ|c)M (ξ). σˆ2 σ2 k2 Note that the non-vanishing characteristic function of σ2 implies non-vanishing Mellin transform 48See the discussion on page 16 of Galambos and Simonelli (2004) for the generality of this specification. A-3

M (ξ|c) (almost everywhere), so it is legitimate to take the logarithm of both sides, σ2 logM (ξ|c) = logM (ξ|c)+logM (ξ). σˆ2 σ2 k2 Taking the second derivative with respect to ξ, we get ∂2 ∂2 ∂2 logM (ξ|c) = logM (ξ|c)− logM (ξ). ∂ξ∂ξ(cid:48) σ2 ∂ξ∂ξ(cid:48) σˆ2 ∂ξ∂ξ(cid:48) k2 The Mellin transform of chi-squared distribution M (ξ) is a known functional form. In addition, k2 we have logM (0|c) = logM (0|c)−logM (0) = 0, σ2 σˆ2 k2 ∂ ∂ ∂ logM (0|c) = logM (0|c)− logM (0) ∂ξ σ2 ∂ξ σˆ2 ∂ξ k2 = i (cid:0)E(cid:0) logσˆ2(cid:12) (cid:12)c (cid:1) −E(cid:0) logk2(cid:12) (cid:12)c (cid:1)(cid:1) . Based on Pav (2015), (cid:18) (cid:19) E(cid:0) logk2(cid:12) (cid:12)c (cid:1) = log2+ψ T −d w , 2 where ψ(·) is the derivative of the log of the Gamma function. Given logM (0|c), ∂ logM (0|c), and ∂2 logM (ξ|c), we can fully recover logM (ξ|c) σ2 ∂ξ σ2 ∂ξ∂ξ(cid:48) σ2 σ2 and hence uniquely determine fσ2. See Theorem 1.19 in Galambos and Simonelli (2004) for the uniqueness. 3. Identify the distribution of individual effects fλ Define ˚y = y −β(cid:48)x = λ(cid:48)w +u . i,1:T i,1:T i,0:T−1 i i,0:T−1 i,1:T Let Y˚ =˚y , W = w(cid:48) , Λ = λ and U = u . The above expression can be simplified as i,1:T i,0:T−1 i i,1:T Y˚ = WΛ+U. Denote F , F and F as the conditional characteristic functions for Y˚ , Λ and U, respectively. Y˚ Λ U Based on Assumption 3.1 (2-a), F and F are non-vanishing almost everywhere. Then, we obtain Λ U logF (cid:0) W(cid:48)ξ|c (cid:1) = logF (ξ|c)−logF (ξ|c), Λ Y˚ U where F is constructed from the observables and the common parameters identified in part 1, and Y˚ F is based on the fσ2 identified in part 2. Let ζ = W(cid:48)ξ and A = (W(cid:48)W)−1W(cid:48), then the second U W A-4

derivative of logF (ζ|c) is characterized by Λ ∂2 logF (ζ|c) = A (cid:18) ∂2 (cid:0) logF (ξ|c)−logF (ξ|c) (cid:1) (cid:19) A(cid:48) . ∂ζ∂ζ(cid:48) Λ W ∂ξ∂ξ(cid:48) Y˚ U W Moreover, logF (0|c) = 0, Λ ∂ (cid:16) (cid:12) (cid:17) logF (0|c) = iA E Y˚(cid:12)c , ∂ζ Λ W (cid:12) so we can pin down logΛ(ζ|c) and fλ. The proof for unbalanced panels follows in a similar manner. B.3 Posterior Consistency: Random Coefficients Model Proof. (Proposition 3.10) The proof follows ?, which is based on the early work by Barron et al. (1999) and Ghosal and van der Vaart (2007). Now we significantly extend the discussion to take care of the deconvolution and dynamic panel data structure. 1. Random coefficients: cross-sectional homoskedasticity The posterior probability of the alternative region can be decomposed as Π((ϑ,f) : (cid:107)ϑ−ϑ (cid:107) ≥ δ or (cid:107)f −f (cid:107) ≥ (cid:15)|D) 0 2 0 1 (cid:16) (cid:92) (cid:12) (cid:17) ≤Πϑ((cid:107)ϑ−ϑ (cid:107) ≥ δ|D)+Πf {(cid:107)f −f (cid:107) ≥ (cid:15)} Fc (cid:12)D 0 2 0 1 N(cid:12) (cid:16) (cid:92) (cid:12) (cid:17) +Π {(cid:107)f −f (cid:107) ≥ (cid:15)} F ,(cid:107)ϑ−ϑ (cid:107) < δ(cid:12)D . 0 1 N 0 2 (cid:12) It suffices to show that (a) for all δ > 0, Πϑ((cid:107)ϑ−ϑ (cid:107) ≥ δ|D) → 0, (b) for all (cid:15) > 0, 0 2 Πf ({(cid:107)f −f (cid:107) ≥ (cid:15)} (cid:84) Fc |D) → 0,and(c)forall(cid:15) > 0,Π({(cid:107)f −f (cid:107) ≥ (cid:15)} (cid:84) F ,(cid:107)ϑ−ϑ (cid:107) < δ((cid:15))|D) → 0 1 N 0 1 N 0 2 0. We let δ depend on (cid:15) in part (c) because part (a) holds for all δ > 0. (a) For all δ > 0, Πϑ((cid:107)ϑ−ϑ (cid:107) ≥ δ|D) → 0. 0 2 After orthogonal forward differencing in equations (B.1) and (B.2), the posterior of (cid:0) β,σ2(cid:1) is A-5

given by (cid:16) (cid:17) (cid:16) (cid:17) p (cid:0) β,σ2|D (cid:1) ∝ φ β;mβ,ψβσ2 f σ2;aσ2 ,bσ2 dΠϑ(cid:0) β,σ2(cid:1) , IG  −1 (cid:88) ψβ =  x˜ i,t−1 x˜(cid:48) i,t−1 , i,t   (cid:88) mβ = ψβ  x˜ i,t−1 y˜ it, i,t N (T −d ) aσ2 = w 2   bσ2 = 1 2  (cid:88) y˜ i 2 t −mβ(cid:48) (cid:16) ψβ (cid:17)−1 mβ . i,t Then, the traditional posterior consistency argument implies that (cid:0) β,σ2(cid:1)(cid:12) (cid:12)D converges to (cid:0) β 0 ,σ 0 2(cid:1) , given Assumption 3.1 (2-c) (E(cid:2)(cid:80) x˜ x˜(cid:48) (cid:3) has full rank) and Proposition 3.10 condition 3 (ϑ t i,t−1 i,t−1 0 is in the interior of supp (cid:0) Πϑ(cid:1) ). (b) For all (cid:15) > 0, Πf ({(cid:107)f −f (cid:107) ≥ (cid:15)} (cid:84) Fc |D) → 0. 0 1 N Based on Lemma 1 in Canale and De Blasi (2017), Assumption 3.9 (1, 2-a) ensures that the KL property holds for the distribution of λ, i.e. for all (cid:15) > 0, ˆ (cid:18) (cid:19) f (λ) Πf f ∈ F : f (λ)log 0 dλ < (cid:15) > 0. (B.3) 0 f(λ) Now,weneedtoestablishanalteredKLpropertyspecifiedontheobservables. First,theindividualspecific likelihood function is characterized as g (cid:0) y ,x∗ ,wI |D (cid:1) i,0:T i,0:T−1 i,0:T−1 A ˆ = (cid:89) p (cid:0) xP∗ |y ,c (cid:1) p(c∗ |D ) (cid:89) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) f(λ )dλ , (B.4) i,t−1 i,t−1 i,0:t−2 i0 A it i,t−1 i i,t−1 i i t t (cid:16) (cid:17) and g y ,x∗ ,wI |D corresponds to the true data generating process (cid:0) β ,σ2,f (cid:1) . 0 i,0:T i,0:T−1 i,0:T−1 A 0 0 0 Then, we would like to prove that for all (cid:15) > 0,  f ∈ F, (cid:0) β,σ2(cid:1) ∈ Rdx ×R+ :   ˆ (cid:16) (cid:17)  Π   g 0 (cid:0) y i,0:T ,x∗ i,0:T−1 ,w i I ,0:T−1 |D A (cid:1) log g 0 (cid:16) y i,0:T ,x∗ i,0:T−1 ,w i I ,0:T−1 |D A (cid:17) dy i,0:T dx∗ i,0:T−1 dw i I ,0:T−1 < (cid:15)    > 0, g y ,x∗ ,wI |D i,0:T i,0:T−1 i,0:T−1 A (B.5) A-6

The KL divergence of g with respect to g can be further decomposed as 0 ˆ (cid:16) (cid:17) g y ,x∗ ,wI |D g (cid:0) y ,x∗ ,wI |D (cid:1) log 0 i,0:T i,0:T−1 i,0:T−1 A dy dx∗ dwI 0 i,0:T i,0:T−1 i,0:T−1 A (cid:16) (cid:17) i,0:T i,0:T−1 i,0:T−1 g y ,x∗ ,wI |D i,0:T i,0:T−1 i,0:T−1 A ˆ ´ (cid:81) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) f (λ )dλ = g (cid:0) y ,x∗ ,wI |D (cid:1) log ´ t it 0 i,t−1 i i,t−1 0 0 i i dy dx∗ dwI 0 i,0:T i,0:T−1 i,0:T−1 A (cid:81) φ(y ;β(cid:48)x +λ(cid:48)w ,σ2)f(λ )dλ i,0:T i,0:T−1 i,0:T−1 ˆ ´ t it i,t−1 i i,t−1 i i (cid:81) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) f (λ )dλ = g (cid:0) y ,x∗ ,wI |D (cid:1) log ´ t it 0 i,t−1 i i,t−1 0 0 i i dy dx∗ dwI 0 i,0:T i,0:T−1 i,0:T−1 A (cid:81) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2 (cid:1) f(λ )dλ i,0:T i,0:T−1 i,0:T−1 ˆ ´ t it 0 i,t−1 i i,t−1 0 i i (cid:81) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) f(λ )dλ + g (cid:0) y ,x∗ ,wI |D (cid:1) log ´ t it 0 i,t−1 i i,t−1 0 i i dy dx∗ dwI , 0 i,0:T i,0:T−1 i,0:T−1 A (cid:81) φ(y ;β(cid:48)x +λ(cid:48)w ,σ2)f(λ )dλ i,0:T i,0:T−1 i,0:T−1 t it i,t−1 i i,t−1 i i where the first equality is given by crossing out common factors in the numerator and denominator. For the first term, define h(x) = xlogx, a(λ ) = (cid:81) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) f (λ ), A = ´ i t it 0 ´i,t−1 i i,t−1 0 0 i a(λ )dλ , b(λ ) = (cid:81) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) f(λ ), B = b(λ )dλ . We can rewrite the i i i t it 0 i,t−1 i i,t−1 0 i i i integral over λ as i ˆ ´ (cid:89) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) f (λ )dλ log ´ (cid:81) t φ (cid:0) y it ;β 0 (cid:48)x i,t−1 +λ(cid:48) i w i,t−1 ,σ 0 2(cid:1) f 0 (λ i )dλ i it 0 i,t−1 i i,t−1 0 0 i i (cid:81) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2 (cid:1) f(λ )dλ t ˆ t it 0 i,t−1 i i,t−1 0 i i (cid:18) (cid:19) (cid:18) (cid:19) A A b(λ ) f (λ ) i 0 i =A·log = B·g = B·g · dλ i B B B f(λ ) ˆ i (cid:18) (cid:19) f (λ ) 0 i ≤ b(λ )g dλ i i f(λ ) ˆ i = (cid:89) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) f (λ )log f 0 (λ i ) dλ , (B.6) it 0 i,t−1 i i,t−1 0 0 i f(λ ) i i t where the inequality is given by Jensen’s inequality. Then, further integrating the above expression (cid:16) (cid:17) over y ,x∗ ,wI , we have i,0:T i,0:T−1 i,0:T−1 ˆ ´ (cid:81) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) f (λ )dλ g (cid:0) y ,x∗ ,wI |D (cid:1) log ´ t it 0 i,t−1 i i,t−1 0 0 i i dy dx∗ dwI 0 i,0:T i,0:T−1 i,0:T−1 A (cid:81) φ(y ;β(cid:48)x +λ(cid:48)w ,σ2)f (λ )dλ i,0:T i,0:T−1 i,0:T−1 ˆ t it i,t−1 i i,t−1 0 i i ≤ (cid:89) p (cid:0) xP∗ |y ,c (cid:1) p(c∗ |D ) (cid:89) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) log f 0 (λ i ) dy dx∗ dw∗ i,t−1 i,t−1 i,0:t−2 i0 A it 0 i,t−1 i i,t−1 0 f(λ ) i,0:T i,0:T−1 i,0:T−1 i ˆ t t f (λ ) 0 i · f (λ )log dλ , 0 i i f(λ ) i where the inequality follows the above expression (B.6). According to the KL property of the dis- ´ (cid:110) (cid:111) tribution of λ in equation (B.3), for all (cid:15)(cid:48) > 0, there exists Gf = f ∈ F : f (λ)log f0(λ) dλ < (cid:15)(cid:48) (cid:15)(cid:48) 0 f(λ) (cid:16) (cid:17) such that f is in the interior of Gf, Πf Gf > 0, and the first term is less than (cid:15)(cid:48) for all f ∈ Gf. 0 (cid:15)(cid:48) (cid:15)(cid:48) (cid:15)(cid:48) A-7

For the second term, we first employ the convexity of KL divergence, ˆ ´ (cid:81) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) f(λ )dλ g (cid:0) y ,x∗ ,wI |D (cid:1) log ´ t it 0 i,t−1 i i,t−1 0 i i dy dx∗ dwI 0 i,0:T i,0:T−1 i,0:T−1 A (cid:81) φ(y ;β(cid:48)x +λ(cid:48)w ,σ2)f(λ )dλ i,0:T i,0:T−1 i,0:T−1 ˆ ´ t it i,t−1 i i,t−1 i i ≤ (cid:89) p (cid:0) xP∗ |y ,c (cid:1) p(c∗ |D ) ´ (cid:81) t φ (cid:0) y it ;β 0 (cid:48)x i,t−1 +λ(cid:48) i w i,t−1 ,σ 0 2(cid:1) f 0 (λ i )dλ i i,t−1 i,t−1 i,0:t−2 i0 A (cid:81) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2 (cid:1) f(λ )dλ t t it 0 i,t−1 i i,t−1 0 i i · (cid:89) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) f(λ )log (cid:81) t φ (cid:0) y it ;β 0 (cid:48)x i,t−1 +λ(cid:48) i w i,t−1 ,σ 0 2(cid:1) dλ dy dx∗ dwI . it 0 i,t−1 i i,t−1 0 i (cid:81) φ(y ;β(cid:48)x +λ(cid:48)w ,σ2) i i,0:T i,0:T−1 i,0:T−1 ˆ t t it i,t−1 i i,t−1 = (cid:89) p (cid:0) xP∗ |y ,c (cid:1) p(c∗ |D ) (cid:89) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) f (λ ) i,t−1 i,t−1 i,0:t−2 i0 A it 0 i,t−1 i i,t−1 0 0 i t t (cid:34)ˆ (cid:81) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) f(λ ) (cid:81) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) (cid:35) · ´ t it 0 i,t−1 i i,t−1 0 i log t it 0 i,t−1 i i,t−1 0 dλ (cid:81) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2 (cid:1) f(λ )dλ (cid:81) φ(y ;β(cid:48)x +λ(cid:48)w ,σ2) i t it 0 i,t−1 i i,t−1 0 i i t it i,t−1 i i,t−1 ·dλ dy dx∗ dwI . (B.7) i i,0:T i,0:T−1 i,0:T−1 Note that ˆ (cid:81) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) f(λ ) (cid:81) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) ´ t it 0 i,t−1 i i,t−1 0 i log t it 0 i,t−1 i i,t−1 0 dλ (cid:81) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2 (cid:1) f(λ )dλ (cid:81) φ(y ;β(cid:48)x +λ(cid:48)w ,σ2) i ˆ t it 0 i,t−1 i i,t−1 0 i i t it i,t−1 i i,t−1 φ (cid:0) λ ;m(β ),Σ (cid:0) σ2(cid:1)(cid:1) f(λ ) (cid:81) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) = ´ i 0 0 i log t it 0 i,t−1 i i,t−1 0 dλ , (B.8) φ (cid:0) λ ;m(β ),Σ (cid:0) σ2 (cid:1)(cid:1) f(λ )dλ (cid:81) φ(y ;β(cid:48)x +λ(cid:48)w ,σ2) i i 0 0 i i t it i,t−1 i i,t−1 where (cid:32) (cid:33)−1 m(β) = (cid:88) w w(cid:48) (cid:88) w (cid:0) y −β(cid:48)x (cid:1) , i,t−1 i,t−1 i,t−1 it i,t−1 t t (cid:32) (cid:33)−1 Σ (cid:0) σ2(cid:1) = σ2 (cid:88) w w(cid:48) . i,t−1 i,t−1 t A-8

The log of the ratio of normal distributions has an analytical form, (cid:81) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) log t it 0 i,t−1 i i,t−1 0 (cid:81) φ(y ;β(cid:48)x +λ(cid:48)w ,σ2) t it i,t−1 i i,t−1 (cid:18) (cid:19) = T (cid:0) logσ2−logσ2(cid:1) + 1 (cid:88)(cid:0) y −β(cid:48)x −λ(cid:48)w (cid:1)2 1 − 1 2 0 2 it i,t−1 i i,t−1 σ2 σ2 t 0 + (cid:88) (y it −β(cid:48)x i,t−1 −λ(cid:48) i w i,t−1 )2−(y it −β 0 (cid:48)x i,t−1 −λ(cid:48) i w i,t−1 )2 2σ2 t 0 (cid:18) (cid:19) = T (cid:0) logσ2−logσ2(cid:1) + 1 (cid:88)(cid:0) y −β(cid:48)x −λ(cid:48)w (cid:1)2 1 − 1 2 0 2 it i,t−1 i i,t−1 σ2 σ2 t 0 + (cid:88) (β(cid:48)x i,t−1 )2−(β 0 (cid:48)x i,t−1 )2−2(y it −λ(cid:48) i w i,t−1 )(β−β 0 )(cid:48)x i,t−1 . 2σ2 t 0 For all (cid:15)(cid:48) > 0, there exists Gσ2 = (cid:8) σ2 ∈ σ2[1,1+η) (cid:9) such that σ2 is on the boundary of Gσ2, (cid:15)(cid:48) 0 0 (cid:15)(cid:48) (cid:16) (cid:17) Πσ2 Gσ2 > 0, and the sum of the first two terms is less than (cid:15)(cid:48) = T log(1+η). Given that T is (cid:15)(cid:48) 2 finite and that w is bounded due to Assumption 3.8 (1), the last term is of order i,0:T−1 (cid:32) (cid:33) O (cid:0) σ2(cid:1)−1 (cid:107)β−β (cid:107) (cid:88)(cid:16) y2 +(cid:107)x (cid:107)2+(cid:107)λ (cid:107)2 (cid:17) . (B.9) 0 0 2 it i,t−1 2 i 2 t The term (cid:0) σ2(cid:1)−1 can be ignored in cross-sectional homoskedastic cases, but I keep it here so that 0 the derivation is more comparable to cross-sectional heteroskedastic cases. Considering that f(λ ) i is characterized by an infinite mixture of multivariate normals, we have φ (cid:0) λ ;m(β ),Σ (cid:0) σ2(cid:1)(cid:1) f(λ ) i 0 0 i =φ (cid:0) λ ;m(β ),Σ (cid:0) σ2(cid:1)(cid:1)(cid:88) p φ(λ ;µ ,Ω ) i 0 0 k i k k k = (cid:88) w (cid:0) β ,σ2(cid:1) φ (cid:0) λ ;m (cid:0) β ,σ2(cid:1) ,Σ (cid:0) σ2(cid:1)(cid:1) , k 0 0 i k 0 0 k 0 k where (cid:16) (cid:17) m (β) = Σ (cid:0) σ2(cid:1) Σ (cid:0) σ2(cid:1)−1 m(β)+Ω−1µ , k k k k Σ (cid:0) σ2(cid:1) = (cid:16) Σ (cid:0) σ2(cid:1)−1 +Ω−1 (cid:17)−1 , k k w (cid:0) β,σ2(cid:1) = p (cid:12) (cid:12)Σ k (cid:0) σ2(cid:1)(cid:12) (cid:12) exp (cid:32) − 1 (cid:32) µ(cid:48) k Ω− k 1µ k +m(β)(cid:48)Σ (cid:0) σ2(cid:1)−1 m(β) (cid:33)(cid:33) . k k (2π)dw/2|Ω ||Σ(σ2)| 2 −m (cid:0) β,σ2(cid:1)(cid:48) Σ (cid:0) σ2(cid:1)−1 m (cid:0) β,σ2(cid:1) k k k k A-9

Therefore, ˆ φ (cid:0) λ ;m(β ),Σ (cid:0) σ2(cid:1)(cid:1) f(λ ) h(λ ) = ´ i 0 0 i i φ (cid:0) λ ;m(β ),Σ (cid:0) σ2 (cid:1)(cid:1) f(λ )dλ i 0 0 i i = (cid:88) w k (cid:0) β 0 ,σ 0 2(cid:1) φ (cid:0) λ ;m (cid:0) β ,σ2(cid:1) ,Σ (cid:0) σ2(cid:1)(cid:1) (B.10) (cid:80) w (cid:0) β ,σ2 (cid:1) i k 0 0 k 0 k k k 0 0 is an infinite mixture of normals as well, and there exists Gf∗ ∈ Gf such that for all f ∈ Gf∗, (cid:15)(cid:48) (cid:15)(cid:48) (cid:15)(cid:48) ˆ (cid:32) (cid:33) (cid:16) (cid:17) (cid:88)(cid:16) (cid:17) h(λ )(cid:107)λ (cid:107)2dλ = O (cid:107)m(β )(cid:107)2 = O y2 +(cid:107)x (cid:107)2 . (B.11) i i 2 i 0 2 it i,t−1 2 t Combining expressions (B.9), (B.10), and (B.11), we see that (B.8) is of the order (cid:32) (cid:33) O (cid:0) σ2(cid:1)−1 (cid:107)β−β (cid:107) (cid:88)(cid:16) y2 +(cid:107)x (cid:107)2 (cid:17) . 0 0 2 it i,t−1 2 t Now let us proceed with the integration in expression (B.7). First, collecting terms related to y |c , iT i,T−1 ˆ (cid:16) (cid:17) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) O (cid:0) σ2(cid:1)−1 (cid:107)β−β (cid:107) y2 dy iT 0 i,T−1 i i,T−1 0 0 0 2 iT iT (cid:16) (cid:16) (cid:17)(cid:17) =O (cid:0) σ2(cid:1)−1 (cid:107)β−β (cid:107) (cid:107)x (cid:107)2+(cid:107)λ (cid:107)2 . (B.12) 0 0 2 i,T−1 2 i 2 Next, collecting terms related to xP∗ |y ,c , i,T−1 i,T−1 i,0:T−2 ˆ p (cid:0) xP i,t ∗ −1 |y i,t−1 ,c i,0:t−2 (cid:1) O (cid:16) (cid:0) σ 0 2(cid:1)−1 (cid:107)β−β 0 (cid:107) 2 (cid:13) (cid:13)xP i,T ∗ −1 (cid:13) (cid:13) 2 2 (cid:17) . Assumption 3.8 (3) ensures that for all (cid:15)(cid:48) > 0, there exists C > 0 such that ˆ (cid:13) (cid:13)xP i,T ∗ −1 (cid:13) (cid:13) 2 2 p (cid:0) xP i,T ∗ −1 |y i,T−1 ,c i,0:T−2 (cid:1) < (cid:15)(cid:48). (cid:107)xP∗ (cid:107) ≥C i,T−1 2 (cid:13) (cid:13) When (cid:13)xP∗ (cid:13) < C, for all (cid:15)(cid:48) > 0, there exists Gβ such that β is in the interior of Gβ , (cid:13) i,T−1(cid:13) (cid:15)(cid:48),T 0 (cid:15)(cid:48),T 2 (cid:16) (cid:17) Πβ Gβ > 0, and the integration with respect to xP∗ |y ,c is less than (cid:15)(cid:48). (cid:15)(cid:48),T i,T−1 i,T−1 i,0:T−2 Now the remaining terms in expression (B.12) are of order O (cid:16) (cid:0) σ 0 2(cid:1)−1 (cid:107)β −β 0 (cid:107) 2 (cid:16) y i 2 ,T−1 + (cid:13) (cid:13)xO i,T−1 (cid:13) (cid:13) 2 2 +(cid:107)λ i (cid:107)2 2 (cid:17)(cid:17) . As T is finite, continuing with t = T − 2,T − 3,··· ,2, we can employ the tail conditions of xP∗ |y ,c to obtain Gβ . Furthermore, when t = 1, Gβ is constructed via the tail i,t−1 i,t−1 i,0:t−2 (cid:15)(cid:48),t (cid:15)(cid:48),1 A-10

conditions of λ , xP, and xO in Assumption 3.9 (1-e) and Assumption 3.8 (2). Hence, the i i0 i,0:T−1 relevant set of β is charaterized by Gβ = (cid:84) Gβ , and we achieve the altered KL property specified (cid:15)(cid:48) t (cid:15)(cid:48),t on the observables in expression (B.5). Following Barron et al. (1999) Lemma 4, the altered KL property in expression (B.5) ensures that for all (cid:15)(cid:48) > 0, ˆ (cid:16) (cid:17) (cid:12)  g y ,x∗ ,wI |D (cid:12) P∞ 0  (cid:89) (cid:16) i,0:T i,0:T−1 i,0:T−1 A (cid:17) dΠ(ϑ,f) ≤ exp (cid:0) −N(cid:15)(cid:48)(cid:1) , infinitely often (cid:12) (cid:12)D A = 0, g y ,x∗ ,wI |D (cid:12) i≤N 0 i,0:T i,0:T−1 i,0:T−1 A (cid:12) (B.13) whereP∞ ischaracterizedbythetruedatageneratingprocesswhenN → ∞. BasedonAssumption 0 3.9(2-b),wecanobtainΠ(Fc ) = O(exp(−βN))forsomeβ > 0. Therefore,Πf ({(cid:107)f −f (cid:107) ≥ (cid:15)} (cid:84) Fc |D) → N 0 1 N 0 in P∞-probability. 0 (c) For all (cid:15) > 0, Π({(cid:107)f −f (cid:107) ≥ (cid:15)} (cid:84) F ,(cid:107)ϑ−ϑ (cid:107) < δ((cid:15))|D) → 0.49 0 1 N 0 2 Note that (cid:107)g−g (cid:107) 0 1 = ˆ (cid:12) (cid:12) (cid:12) (cid:89) p (cid:0) xP∗ |y ,c (cid:1) p(c∗ |D ) ˆ (cid:32) (cid:81) t φ (cid:0) y it ;β(cid:48)x i,t−1 +λ(cid:48) i w i,t−1 ,σ2(cid:1) f(λ i ) (cid:33) dλ (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) t i,t−1 i,t−1 i,0:t−2 i0 A − (cid:81) t φ (cid:0) y it ;β 0 (cid:48)x i,t−1 +λ(cid:48) i w i,t−1 ,σ 0 2(cid:1) f 0 (λ i ) i(cid:12) (cid:12) ·dy dx∗ dwI . i,0:T i,0:T−1 i,0:T−1 Let ˆ A ≡ (cid:89) p (cid:0) xP∗ |y ,c (cid:1) p(c∗ |D ) (cid:89) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) dy dx∗ dwI i,t−1 i,t−1 i,0:t−2 i0 A it i,t−1 i i,t−1 i,0:T i,0:T−1 i,0:T−1 ˆ t t · |f(λ )−f (λ )|dλ i 0 i i = (cid:107)f −f (cid:107) , 0 1 ˆ (cid:34)t−1 B ≡ (cid:89) p (cid:0) xP∗ |y ,c (cid:1) p(c∗ |D ) (cid:88) (cid:89) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) i,t−1 i,t−1 i,0:t−2 i0 A it i,t−1 i i,t−1 t t τ=1 · (cid:89) T φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) (cid:12) (cid:12) (cid:12) φ (cid:0) y it ;β(cid:48)x i,t−1 +λ(cid:48) i w i,t−1 ,σ2(cid:1) (cid:12) (cid:12) (cid:12) (cid:35) τ=t+1 it 0 i,t−1 i i,t−1 0 (cid:12) (cid:12) −φ (cid:0) y it ;β 0 (cid:48)x i,t−1 +λ(cid:48) i w i,t−1 ,σ 0 2(cid:1) (cid:12) (cid:12) ·dy dx∗ dwI ·f (λ )dλ . i,0:T i,0:T−1 i,0:T−1 0 i i Same as the iterated integral argument for the first term in part (b) based on the tail conditions in Assumption 3.8, we can establish that for all (cid:15)(cid:48) > 0, there exists δ > 0, such that B < (cid:15)(cid:48) as long as 49We let δ depend on (cid:15) in part (c) because part (a) holds for all δ>0. A-11

(cid:107)ϑ−ϑ (cid:107) < δ. Note that 0 2 (cid:107)f −f (cid:107) −(cid:15)(cid:48) < A−B ≤ (cid:107)g−g (cid:107) ≤ A+B < (cid:107)f −f (cid:107) +(cid:15)(cid:48), 0 1 0 1 0 1 then the distance between g and g is comparable to the distance between f and f . 0 0 More rigorously, let (cid:15) = (cid:15)/9 and F = {f ∈ F : (cid:107)f −f (cid:107) < (cid:15) = 9(cid:15) }. For small η ∈ (cid:0) 0, 1(cid:1) , 1 1 0 1 1 9 define δ ((cid:15)) such that B < η(cid:15) as long as (cid:107)ϑ−ϑ (cid:107) < δ ((cid:15)), then when f ∈ Fc, η 0 2 η 1 (cid:107)g−g (cid:107) > (cid:107)f −f (cid:107) −η(cid:15) > 8(cid:15) . 0 1 0 1 1 Let G be the space induced by f ∈ F and (cid:107)ϑ−ϑ (cid:107) < δ ((cid:15)) according to the likelihood function in 0 2 η equation (B.4), then the covering number N ((cid:15) ,G) ≤ N ((cid:15) (1−9η),F), 1 1 where G ∈ G induced by F ∈ F. Further define (cid:16) (cid:17) g y ,x∗ ,wI |D (cid:89) i,0:T i,0:T−1 i,0:T−1 A R (ϑ,f) = , N (cid:16) (cid:17) g y ,x∗ ,wI |D i≤N 0 i,0:T i,0:T−1 i,0:T−1 A and H as the event where N ˆ R (ϑ,f)dΠ(ϑ,f) ≥ exp (cid:0) −γ N(cid:15)2(cid:1) , γ = γ(1−9η)2, N 0 1 0 for the γ in Assumption 3.9 (2-b). Equation (B.13) implies that PN (H |D ) → 1, and hence 0 N A (cid:104) (cid:16) (cid:92) (cid:12) (cid:17)(cid:12) (cid:105) PN Π Fc F ,(cid:107)ϑ−ϑ (cid:107) < δ ((cid:15))(cid:12)D (cid:12)D 0 1 N 0 2 η (cid:12) (cid:12) A (cid:104) (cid:16) (cid:92) (cid:12) (cid:17) (cid:12) (cid:105) =PN Π Fc F ,(cid:107)ϑ−ϑ (cid:107) < δ ((cid:15))(cid:12)D 1(H )(cid:12)D +o (1). 0 1 N 0 2 η (cid:12) N (cid:12) A p NotethatbasedonGhosalandvanderVaart(2007)Corollary1,foranysetQwithinf (cid:107)g−g (cid:107) ≥ g∈Q 0 1 8(cid:15) ,50 for any γ ,γ > 0, there exists a test ϕ such that 1 1 2 N (cid:114) (cid:114) EN (ϕ |D ) ≤ γ 2 N ((cid:15) ,Q)exp (cid:0) −N(cid:15)2(cid:1) and supEN (1−ϕ |D ) ≤ γ 1 exp (cid:0) −N(cid:15)2(cid:1) . g0 N A γ 1 1 g N A γ 1 1 g∈Q 2 Let G be the induced set by F , and G be the induced set by F . Therefore, we can construct 1 1 N,j N,j 50The original Ghosal and van der Vaart (2007) Corollary 1 considers the Hellinger distance, which is defined as d (g,g )= (cid:113)´ (cid:0)√ g− √ g (cid:1)2. Note that d2 (g,g )≤(cid:107)g−g (cid:107) ≤2d (g,g ), so inf d (g,g )≥4(cid:15) . H 0 0 H 0 0 1 H 0 g∈Q H 0 1 A-12

tests {ϕ }, such that N,j (cid:104) (cid:16) (cid:92) (cid:12) (cid:17) (cid:12) (cid:105) PN Π Fc F ,(cid:107)ϑ−ϑ (cid:107) < δ ((cid:15))(cid:12)D 1(H )(cid:12)D 0 1 N,j 0 2 η (cid:12) N (cid:12) A ˆ (cid:20) (cid:12) (cid:21) ≤PN 0 (ϕ N,j |D A )+PN 0 (1−ϕ N,j ) R N (ϑ,f)1 (cid:16) F 1 c (cid:92) F N,j ,(cid:107)ϑ−ϑ 0 (cid:107) 2 < δ η ((cid:15)) (cid:17) dΠ(ϑ,f) (cid:12) (cid:12) (cid:12) D A exp (cid:0) γ 0 N(cid:15)2 1 (cid:1) ≤EN (ϕ |D )+ sup EN (1−ϕ |D )·Π(F ,(cid:107)ϑ−ϑ (cid:107) < δ ((cid:15)))exp (cid:0) γ N(cid:15)2(cid:1) g0 N,j A g∈Gc 1 (cid:84)GN,j g N,j A N,j 0 2 η 0 1 ≤EN (ϕ |D )+ sup EN (1−ϕ |D )·Πf (F )exp (cid:0) γ N(cid:15)2(cid:1) g0 N,j A g∈Gc 1 (cid:84)GN,j g N,j A N,j 0 1 (cid:114)γ (cid:114)γ ≤ 2,j N ((cid:15) ,G )exp (cid:0) −N(cid:15)2(cid:1) + 1,j Πf (F )exp (cid:0) −N(cid:15)2(1−γ ) (cid:1) γ 1 N,j 1 γ N,j 1 0 1,j 2,j (cid:113) = N ((cid:15) ,G )Πf (F ) (cid:0) exp (cid:0) −N(cid:15)2(cid:1) +exp (cid:0) −N(cid:15)2(1−γ ) (cid:1)(cid:1) . 1 N,j N,j 1 1 0 Thelastlineisgivenbyplugginginγ = N ((cid:15) ,G )andγ = Πf (F ). Finally, asN ((cid:15) ,G ) 1,j 1 N,j 2,j N,j 1 N,j (cid:16) (cid:17) less than N (cid:15)1 ,F , 1+η N,j (cid:104) (cid:16) (cid:92) (cid:12) (cid:17) (cid:12) (cid:105) PN Π Fc F ,(cid:107)ϑ−ϑ (cid:107) < δ ((cid:15))(cid:12)D 1(H )(cid:12)D 0 1 N 0 2 η (cid:12) N (cid:12) A   ≤O (cid:88)(cid:113) N ((cid:15) 1 ,G N,j )Πf (F N,j )exp (cid:0) −N(cid:15)2 1 (1−γ 0 ) (cid:1)  j   ≤O (cid:88)(cid:113) N ((cid:15) 1 (1−9η),F N,j )Πf (F N,j )exp (cid:0) −N(cid:15)2 1 (1−γ 0 ) (cid:1)  j (cid:16) (cid:16) (cid:17) (cid:17) =o exp (1−γ)(1−9η)2N(cid:15)2 exp (cid:0) −N(cid:15)2(1−γ ) (cid:1) 1 1 0 (cid:16) (cid:16) (cid:16) (cid:17)(cid:17)(cid:17) =o exp −N(cid:15)2 1−(1−9η)2 1 →0. The fourth line follows the summability condition of covering numbers as in Lemma 3.7 condition 2-b, which can be deduced from Assumption 3.9 (2-b). 2. Random coefficients: cross-sectional heteroskedasticity (a) For all δ > 0, Πϑ((cid:107)ϑ−ϑ (cid:107) ≥ δ|D) → 0. 0 2 A-13

After orthogonal forward differencing, the posterior of β is given by ˆ p(β|D) ∝ φ (cid:16) β;mβ,ψβσ2 (cid:17) fσ2(cid:0) σ2(cid:1) dΠβ,fσ2 (cid:16) β,fσ2 (cid:17) dσ2, i i i  −1 (cid:88) ψβ =  x˜ i,t−1 x˜(cid:48) i,t−1 , i,t   (cid:88) mβ = ψβ  x˜ i,t−1 y˜ it. i,t ´ Notethat φ (cid:0) β;mβ,ψβσ2(cid:1) fσ2(cid:0) σ2(cid:1) dΠfσ2 (cid:16) fσ2 (cid:17) dσ2 isascalemixtureofnormals, anditsvariance i i i is ψβEσ2 ≤ ψβσ¯2, σ¯2 is the upper bound specified in Proposition 3.10 condition 4-b. Therefore, i β|D converges to β given Assumption 3.1 (2-c) and Proposition 3.10 condition 3. 0 (b) For all (cid:15) > 0, Πf ({(cid:107)f −f (cid:107) ≥ (cid:15)} (cid:84) Fc |D) → 0. 0 1 N As λ and σ2 are independent, we have (cid:16) (cid:17) (cid:16) (cid:17) (cid:16) (cid:17) d (f ,f) = d fλfσ2 ,fλfσ2 = d fλ,fλ +d fσ2 ,fσ2 . KL 0 KL 0 0 KL 0 KL 0 In addition, since the KL divergence is invariant under variable transformations, (cid:16) (cid:17) (cid:16) (cid:17) d fσ2 ,fσ2 = d fl,fl . KL 0 KL 0 Assumption 3.9 (1, 2-a) ensures that the KL property holds for f. Now the individual-specific likelihood function is g (cid:0) y ,x∗ ,wI |D (cid:1) i,0:T i,0:T−1 i,0:T−1 A ˆ = (cid:89) p (cid:0) xP∗ |y ,c (cid:1) p(c∗ |D ) (cid:89) φ (cid:0) y ;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) f (cid:0) λ ,σ2(cid:1) dλ dσ2, i,t−1 i,t−1 i,0:t−2 i0 A it i,t−1 i i,t−1 i i i i i t t (B.14) and we want to prove the altered KL property specified on the observables in expression (B.5). As in the proof of part (1-b), similar convexity reasoning and tail conditions can be applied to bound the KL divergency of g with respect to g . Note that all bounds are proportional to (cid:0) σ2(cid:1)−1 , which 0 i is further integrated out via fσ2(cid:0) σ2(cid:1) . This integration exists due to the integrability of fl(l ). 0 i 0 i (c) For all (cid:15) > 0, Π({(cid:107)f −f (cid:107) ≥ (cid:15)} (cid:84) F ,(cid:107)ϑ−ϑ (cid:107) < δ((cid:15))|D) → 0.51 0 1 N 0 2 We can show that the distance between g and g is comparable to the distance between f and 0 f in a similar manner to part (1-c). The rest of the proof remains the same. 0 51We let δ depend on (cid:15) in part (c) because part (a) holds for all δ>0. A-14

B.4 Posterior Consistency: Correlated Random Coefficients Model Proof. (Proposition 3.13) The proof builds on Pati et al. (2013)’s study on univariate conditional density estimation and introduces two major extensions: (1) multivariate conditional density estimation based on locationscale mixture, and (2) deconvolution and dynamic panel data structure. Part (a) for common parameters is the same as the random coefficients cases. Parts (b) and (c) for the underlying distribution of individual heterogeneity need more careful treatment. First, we replace f(·) with its conditional counterpart f(·|c ) in the individual-specific i0 likelihoods in equations (B.4) and (B.14). Second, for z = λ (and l), Assumption 3.12 condition 1 (on fz), conditions 2-a,b (on Gz), and 0 0 condition 3-a (on stick breaking process) ensure the induced q -integrated KL property on the 0 conditional distribution of z , i.e. for all (cid:15) > 0, i ˆ ˆ (cid:18) (cid:20) (cid:21) (cid:19) f (z|c ) Πfz fz ∈ Fz : f (z|c )log 0 0 dz q (c )dc < (cid:15) > 0. 0 0 0 0 0 f(z|c ) 0 Pati et al. (2013) Theorem 5.3 proved it for univariate z. For multivariate z, we work with the spectral norm for the positive definite component covariance matrices and consider (cid:107)Ω(cid:107) ∈ [σ,σ¯] 2 as the approximating compact set in the proof of Pati et al. (2013) Lemma 5.5, Theorem 5.6, and Corollary 5.7. Third, Assumption 3.12 condition 2-c (on Gz) and conditions 3-b,c (on stick breaking process) 0 address the sieve property stated in Lemma 3.7 (2). Now the covering number is based on the ´ ´ induced q -integrated L -distance (cid:107)fz −fz(cid:107) ≡ (cid:2) |fz(z|c )−fz(z|c )|dz (cid:3) q (c )dc . Condition 0 1 0 1 0 0 0 0 0 0 2-c resembles the random coefficients cases while expands component means to include coefficients on c . Comparing to Pati et al. (2013) Theorem 5.10, condition 2-c here imposes weaker tail coni0 ditions on Gz and hence is able to accommodate multivariate normal inverse Wishart components. 0 Conditions3-b,constickbreakingprocessdirectlyfollowPatiet al.(2013)Remark5.12andLemma 5.15. The rest of the proof parallels the random coefficients cases. B.5 Density Forecasts Proof. (Proposition 3.14) 1. Random coefficients: cross-sectional homoskedasticity In this part, I am going to prove that for any i and any (cid:15) > 0, as N → ∞, (cid:16)(cid:13) (cid:13) (cid:12) (cid:17) P (cid:13)fcond −foracle(cid:13) < (cid:15)(cid:12)D → 1. (cid:13) i,T+1 i,T+1(cid:13) (cid:12) 1 A-15

Following the definitions in Section 2.2, we have ˆ (cid:12) (cid:12) (cid:12)fcond (y|ϑ,f)−foracle(y)(cid:12)dy (cid:12) i,T+1 i,T+1 (cid:12) ˆ ˆ ˆ (cid:12) (cid:12) (cid:12) (cid:12) = (cid:12) p(y|h i ,ϑ,w iT ,x iT )p(h i |ϑ,f,D i ,D A )dh i − p(y|h i ,ϑ 0 ,w iT ,x iT )p(h i |ϑ 0 ,f 0 ,D i ,D A )dh i(cid:12)dy (cid:12) (cid:12) ˆ ´ (cid:12) (cid:81) = (cid:12) (cid:12) p(y|h i , ´ ϑ, (cid:81) w iT ,x iT ) t p(y it |h i ,ϑ,w i,t−1 ,x i,t−1 )f(h i )dh i (cid:12) p(y |h ,ϑ,w ,x )f(h )dh ´ t it i i,t−1 i,t−1 i i (cid:81) (cid:12) − p(y|h i ,ϑ ´0 (cid:81) ,w iT ,x iT ) t p(y it |h i ,ϑ 0 ,w i,t−1 ,x i,t−1 )f 0 (h i )dh i(cid:12) (cid:12)dy (B.15) p(y |h ,ϑ ,w ,x )f (h )dh (cid:12) ˆ (cid:12) (cid:12) ´ φ (cid:0) y;β(cid:48)x + t λ(cid:48)w it , i σ2(cid:1) 0 φ (cid:0) i λ ,t− ;m 1 (β i,t ) − , 1 Σ (cid:0) 0 σ2(cid:1) i (cid:1) f(λ i )dλ = (cid:12) iT´ i iT i i i (cid:12) φ(λ ;m(β),Σ(σ2))f(λ )dλ (cid:12) i i i ´ φ (cid:0) y;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) φ (cid:0) λ ;m (β ),Σ (cid:0) σ2(cid:1)(cid:1) f (λ )dλ (cid:12) (cid:12) − 0 iT´ i iT 0 i i 0 i 0 0 i i(cid:12)dy. φ (cid:0) λ ;m(β ),Σ (cid:0) σ2 (cid:1)(cid:1) f (λ )dλ (cid:12) i 0 0 0 i i (cid:12) To obtain the last equality, we first rewrite (cid:81) t p (cid:0) y it (cid:12) (cid:12)λ i ,β,σ2,y i,t−1 (cid:1) as a distribution of λ i (cid:89) p (cid:0) y it (cid:12) (cid:12)λ i ,β,σ2,y i,t−1 (cid:1) = C (cid:0) β,σ2(cid:1) φ (cid:0) λ i ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) , t where (cid:32) (cid:33)−1 m(β) = (cid:88) w w(cid:48) (cid:88) w (cid:0) y −β(cid:48)x (cid:1) , i,t−1 i,t−1 i,t−1 it i,t−1 t t (cid:32) (cid:33)−1 Σ (cid:0) σ2(cid:1) = σ2 (cid:88) w w(cid:48) , i,t−1 i,t−1 t (cid:32) (cid:12) (cid:12)(cid:33)−1 C (cid:0) β,σ2(cid:1) = (cid:0) 2πσ2(cid:1)T− 2 dw (cid:12) (cid:12) (cid:88) w w(cid:48) (cid:12) (cid:12) A (cid:12) i,t−1 i,t−1(cid:12) (cid:12) (cid:12) t ·exp (cid:32) − 1 (cid:32)(cid:80) t (y it −β(cid:48)x i,t−1 )2 −m(β)(cid:48)(cid:0) Σ (cid:0) σ2(cid:1)(cid:1)−1 m(β) (cid:33)(cid:33) , 2 σ2 and then cross out the common factor in the numerator and denominator. Set ˆ A = φ (cid:0) λ ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) f(λ )dλ , i i i ˆ B(y) = φ (cid:0) y;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) φ (cid:0) λ ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) f(λ )dλ . iT i iT i i i with A and B (y) being the counterparts for the oracle predictor. Note that both A and B(y) are 0 0 A-16

positive by definition. Then, we want to make sure the following expression is arbitrarily small, ˆ ´ ´ (cid:12) (cid:12) (cid:12)B(y) B 0 (y)(cid:12) B 0 (y)dy·|A−A 0 | |B(y)−B 0 (y)|dy (cid:12) − (cid:12)dy ≤ + , (cid:12) A A (cid:12) A A A 0 0 and it is sufficient to establish the following four statements (with probability approaching one). (a) |A−A | < (cid:15)(cid:48) 0 |A−A | 0 ˆ (cid:12) (cid:12) ≤ (cid:12) (cid:12) φ (cid:0) λ i ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) (f(λ i )−f 0 (λ i ))dλ i (cid:12) (cid:12) (cid:12) (cid:12) ˆ (cid:12) (cid:12) + (cid:12) (cid:12) (cid:0) φ (cid:0) λ i ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) −φ (cid:0) λ i ;m(β 0 ),Σ (cid:0) σ 0 2(cid:1)(cid:1)(cid:1) f 0 (λ i )dλ i (cid:12) (cid:12). (cid:12) (cid:12) For the first term, ˆ (cid:12) (cid:12) (cid:12) (cid:12) φ (cid:0) λ i ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) (f(λ i )−f 0 (λ i ))dλ i (cid:12) (cid:12) (cid:12) (cid:12) ˆ ≤ φ (cid:0) λ ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) |f(λ )−f (λ )|dλ (B.16) i i 0 i i (cid:12) (cid:12) (cid:12)(cid:80) w w(cid:48) (cid:12) (cid:12) t i,t−1 i,t−1(cid:12) ≤ ·(cid:107)f −f (cid:107) . (2πσ2)dw/2 0 1 It is less than (cid:15)(cid:48)/2 with probability approaching one due to the posterior consistency of f and that φ (cid:0) λ ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) is a bounded function in λ (given that σ2 converges to σ2). For the second i i 0 term, ˆ (cid:12) (cid:12) (cid:12) (cid:12) (cid:0) φ (cid:0) λ i ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) −φ (cid:0) λ i ;m(β 0 ),Σ (cid:0) σ 0 2(cid:1)(cid:1)(cid:1) f 0 (λ i )dλ i (cid:12) (cid:12) (cid:12) (cid:12) ˆ ≤M (cid:12) (cid:12)φ (cid:0) λ i ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) −φ (cid:0) λ i ;m(β 0 ),Σ (cid:0) σ 0 2(cid:1)(cid:1)(cid:12) (cid:12)dλ i (cid:113) (cid:0) (cid:0) (cid:0) (cid:1)(cid:1)(cid:1) ≤M 2d φ(λ ;m(β),Σ(σ2)),φ λ ;m(β ),Σ σ2 KL i i 0 0 (cid:118) (cid:117) (cid:117) (cid:18) σ2 σ2(cid:19)  (cid:88) (cid:32) (cid:88) (cid:33)−1 (cid:88)  ≤M(cid:117) (cid:116) d w σ2 −1−ln σ2 +σ 0 −2(β−β 0 )(cid:48)  x i,t−1 w i (cid:48) ,t−1 w i,t−1 w i (cid:48) ,t−1 w i,t−1 x(cid:48) i,t−1 (β −β 0 ). 0 0 t t t where the second inequality follows Pinsker’s inequality that bounds the L -distance by the KL 1 divergence. As (cid:0) β,σ2(cid:1) enjoys posterior consistency, the last expression can be arbitrarily small. Therefore, the second term is less than (cid:15)(cid:48)/2 with probability approaching one. A-17

´ (b) |B(y)−B (y)|dy < (cid:15)(cid:48) 0 ˆ |B(y)−B (y)|dy 0 ˆ ˆ (cid:12) (cid:12) ≤ (cid:12) (cid:12) φ (cid:0) y;β(cid:48)x iT +λ(cid:48) i w iT ,σ2(cid:1) φ (cid:0) λ i ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) (f(λ i )−f 0 (λ i ))dλ i (cid:12) (cid:12)dy (cid:12) (cid:12) + ˆ (cid:12) (cid:12) (cid:12) ˆ (cid:32) φ (cid:0) y;β(cid:48)x iT +λ(cid:48) i w iT ,σ2(cid:1) φ (cid:0) λ i ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) (cid:33) f (λ )dλ (cid:12) (cid:12) (cid:12)dy. (cid:12) (cid:12) −φ (cid:0) y;β 0 (cid:48)x iT +λ(cid:48) i w iT ,σ 0 2(cid:1) φ (cid:0) λ i ;m(β 0 ),Σ (cid:0) σ 0 2(cid:1)(cid:1) 0 i i(cid:12) (cid:12) Similar to part (a), the first term is small due to the posterior consistency of f and σ2, ˆ ˆ (cid:12) (cid:12) (cid:12) (cid:12) φ (cid:0) y;β(cid:48)x iT +λ(cid:48) i w iT ,σ2(cid:1) φ (cid:0) λ i ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) (f(λ i )−f 0 (λ i ))dλ i (cid:12) (cid:12)dy (cid:12) (cid:12) ˆ ≤ φ (cid:0) y;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) φ (cid:0) λ ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) |f(λ )−f (λ )|dλ dy iT i iT i i 0 i i ˆ = φ (cid:0) λ ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) |f(λ )−f (λ )|dλ , i i 0 i i which is the same as expression (B.16) in part (a). Pinsker’s inequality together with the posterior consistency of (cid:0) β,σ2(cid:1) ensures a small second term, ˆ (cid:12) (cid:12) (cid:12) ˆ (cid:32) φ (cid:0) y;β(cid:48)x iT +λ(cid:48) i w iT ,σ2(cid:1) φ (cid:0) λ i ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) (cid:33) f (λ )dλ (cid:12) (cid:12) (cid:12)dy ˆ (cid:12) (cid:12) −φ (cid:0) y;β 0 (cid:48)x iT +λ(cid:48) i w iT ,σ 0 2(cid:1) φ (cid:0) λ i ;m(β 0 ),Σ (cid:0) σ 0 2(cid:1)(cid:1) 0 i i(cid:12) (cid:12) ≤ (cid:12) (cid:12)φ (cid:0) y;β(cid:48)x iT +λ(cid:48) i w iT ,σ2(cid:1) −φ (cid:0) y;β 0 (cid:48)x iT +λ(cid:48) i w iT ,σ 0 2(cid:1)(cid:12) (cid:12)φ (cid:0) λ i ;m(β 0 ),Σ (cid:0) σ 0 2(cid:1)(cid:1) f 0 (λ i )dλ i dy ˆ + φ (cid:0) y;β(cid:48)x iT +λ(cid:48) i w iT ,σ2(cid:1)(cid:12) (cid:12)φ (cid:0) λ i ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) −φ (cid:0) λ i ;m(β 0 ),Σ (cid:0) σ 0 2(cid:1)(cid:1)(cid:12) (cid:12)f 0 (λ i )dλ i dy (cid:115) σ2 σ2 ≤M −1−ln +σ−2(β −β )(cid:48)x x(cid:48) (β−β ) σ2 σ2 0 0 iT iT 0 0 0 (cid:118) (cid:117) (cid:117) (cid:18) σ2 σ2(cid:19)  (cid:88) (cid:32) (cid:88) (cid:33)−1 (cid:88)  +M(cid:117) (cid:116) d w σ2 −1−ln σ2 +σ 0 −2(β−β 0 )(cid:48)  x i,t−1 w i (cid:48) ,t−1 w i,t−1 w i (cid:48) ,t−1 w i,t−1 x(cid:48) i,t−1 (β−β 0 ). 0 0 t t t (c) There exists A > 0 such that A > A. 0 Let µλ and Vλ be the mean and variance of λ based on the true distribution f . The moment 0 0 i 0 condition in Assumption 3.9 (1-e) ensures the existence of both µλ and Vλ. Following Chebyshev’s 0 0 inequality, we have P (cid:18)(cid:113) (cid:0) λ −µλ (cid:1)(cid:48)(cid:0) Vλ (cid:1)−1(cid:0) λ −µλ (cid:1) > k (cid:19) ≤ d w . f0 i 0 0 i 0 k2 A-18

(cid:26) (cid:113) (cid:27) Define Kλ = λ : (cid:0) λ −µλ (cid:1)(cid:48)(cid:0) Vλ (cid:1)−1(cid:0) λ −µλ (cid:1) ≤ k . Then, i i 0 0 i 0 ˆ A = φ (cid:0) λ ;m(β ),Σ (cid:0) σ2(cid:1)(cid:1) f (λ )dλ 0 i 0 0 0 i i ˆ ≥ φ (cid:0) λ ;m(β ),Σ (cid:0) σ2(cid:1)(cid:1) f (λ )dλ i 0 0 0 i i λi∈Kλ (cid:18) (cid:19) ≥ 1− d w min φ (cid:0) λ ;m(β ),Σ (cid:0) σ2(cid:1)(cid:1) . k2 λi∈Kλ i 0 0 Intuitively, since φ (cid:0) λ i ;m(β 0 ),Σ (cid:0) σ 0 2(cid:1)(cid:1) and f 0 (λ i ) share the same support on Rdw, the integral is bounded below by some positive A. Moreover, we have |A−A | < (cid:15)(cid:48) from (a), then A > A −(cid:15)(cid:48) > 0 0 A−(cid:15)(cid:48). Therefore, A is also bounded below with probability approaching one. ´ (d) B (y)dy < ∞ 0 ˆ B (y)dy 0 ˆ = φ (cid:0) y;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) φ (cid:0) λ ;m(β ),Σ (cid:0) σ2(cid:1)(cid:1) f (λ )dλ dy 0 iT i iT 0 i 0 0 0 i i ˆ = φ (cid:0) λ ;m(β ),Σ (cid:0) σ2(cid:1)(cid:1) f (λ )dλ i 0 0 0 i i (cid:12) (cid:12)(cid:80) w w(cid:48) (cid:12) (cid:12) ˆ (cid:12) t i,t−1 i,t−1(cid:12) ≤ f (λ )dλ (cid:0) 2πσ2 (cid:1)dw/2 0 i i 0 (cid:12) (cid:12) (cid:12)(cid:80) w w(cid:48) (cid:12) (cid:12) t i,t−1 i,t−1(cid:12) = . (cid:0) 2πσ2 (cid:1)dw/2 0 2. Random coefficients: cross-sectional heteroskedasticity Now A and B(y) become ˆ A = C (cid:0) β,σ2(cid:1) φ (cid:0) λ ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) fλ(λ )fσ2(cid:0) σ2(cid:1) dλ , i i i i i i ˆ B(y) = φ (cid:0) y;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) C (cid:0) β,σ2(cid:1) φ (cid:0) λ ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) fλ(λ )fσ2(cid:0) σ2(cid:1) dλ dσ2. iT i iT i i i i i i i i (cid:16) (cid:17) Consider Proposition 3.14 condition 3 (supp fσ2 is bounded below by some σ2 > 0), the above 0 statements can be derived as follows. A-19

(a) |A−A | < (cid:15)(cid:48) 0 |A−A | 0 ˆ (cid:12) (cid:12) ≤ (cid:12) (cid:12) C (cid:0) β,σ i 2(cid:1) φ (cid:0) λ i ;m(β),Σ (cid:0) σ i 2(cid:1)(cid:1) (cid:16) fλ(λ i )fσ2(cid:0) σ i 2(cid:1) −f 0 λ(λ i )f 0 σ2(cid:0) σ i 2(cid:1) (cid:17) dλ i dσ i 2(cid:12) (cid:12) (cid:12) (cid:12) ˆ (cid:12) (cid:12) + (cid:12) (cid:12) C (cid:0) β,σ i 2(cid:1)(cid:0) φ (cid:0) λ i ;m(β),Σ (cid:0) σ i 2(cid:1)(cid:1) −φ (cid:0) λ i ;m(β 0 ),Σ (cid:0) σ i 2(cid:1)(cid:1)(cid:1) f 0 λ(λ i )f 0 σ2(cid:0) σ i 2(cid:1) dλ i dσ i 2(cid:12) (cid:12) (cid:12) (cid:12) ˆ (cid:12) (cid:12) + (cid:12) (cid:12) (cid:0) C (cid:0) β,σ i 2(cid:1) −C (cid:0) β 0 ,σ i 2(cid:1)(cid:1) φ (cid:0) λ i ;m(β 0 ),Σ (cid:0) σ i 2(cid:1)(cid:1) f 0 λ(λ i )f 0 σ2(cid:0) σ i 2(cid:1) dλ i dσ i 2(cid:12) (cid:12). (cid:12) (cid:12) The first term ˆ (cid:12) (cid:12) (cid:12) (cid:12) C (cid:0) β,σ i 2(cid:1) φ (cid:0) λ i ;m(β),Σ (cid:0) σ i 2(cid:1)(cid:1) (cid:16) fλ(λ i )fσ2(cid:0) σ i 2(cid:1) −f 0 λ(λ i )f 0 σ2(cid:0) σ i 2(cid:1) (cid:17) dλ i dσ i 2(cid:12) (cid:12) (cid:12) (cid:12) ˆ (cid:12) (cid:12) ≤ C (cid:0) β,σ2(cid:1) φ (cid:0) λ ;m(β),Σ (cid:0) σ2(cid:1)(cid:1)(cid:12)fλ(λ )fσ2(cid:0) σ2(cid:1) −fλ(λ )fσ2(cid:0) σ2(cid:1)(cid:12)dλ dσ2 (B.17) i i i (cid:12) i i 0 i 0 i (cid:12) i i ≤ (cid:0) 2πσ2(cid:1)−T/2 ·(cid:107)f −f (cid:107) . 0 1 The second term ˆ (cid:12) (cid:12) (cid:12) (cid:12) C (cid:0) β,σ i 2(cid:1)(cid:0) φ (cid:0) λ i ;m(β),Σ (cid:0) σ i 2(cid:1)(cid:1) −φ (cid:0) λ i ;m(β 0 ),Σ (cid:0) σ i 2(cid:1)(cid:1)(cid:1) f 0 λ(λ i )f 0 σ2(cid:0) σ i 2(cid:1) dλ i dσ i 2(cid:12) (cid:12) (cid:12) (cid:12) ˆ ≤M C (cid:0) β,σ i 2(cid:1)(cid:12) (cid:12)φ (cid:0) λ i ;m(β),Σ (cid:0) σ i 2(cid:1)(cid:1) −φ (cid:0) λ i ;m(β 0 ),Σ (cid:0) σ i 2(cid:1)(cid:1)(cid:12) (cid:12)f 0 σ2(cid:0) σ i 2(cid:1) dλ i dσ i 2 ˆ (cid:113) =M C (cid:0) β,σ2(cid:1) σ−2(β −β )(cid:48)V (β−β )fσ2(cid:0) σ2(cid:1) dσ2 i i 0 2 0 0 i i ˆ (cid:113) ≤M (cid:0) σ2(cid:1)−T−d 2 w+1 (β −β )(cid:48)V (β−β ) fσ2(cid:0) σ2(cid:1) dσ2 2 0 2 0 0 i i (cid:113) =M (cid:0) σ2(cid:1)−T−d 2 w+1 (β −β )(cid:48)V (β−β ), 2 0 2 0 where (cid:32) (cid:12) (cid:12)(cid:33)−1 M 2 = M (2π) T− 2 dw (cid:12) (cid:12) (cid:12) (cid:88) w i,t−1 w i (cid:48) ,t−1 (cid:12) (cid:12) (cid:12) , (cid:12) (cid:12) t (cid:32) (cid:33)−1 (cid:88) (cid:88) (cid:88) V = x w(cid:48) w w(cid:48) w x(cid:48) . 2 i,t−1 i,t−1 i,t−1 i,t−1 i,t−1 i,t−1 t t t A-20

The third term ˆ (cid:12) (cid:12) (cid:12) (cid:12) (cid:0) C (cid:0) β,σ i 2(cid:1) −C (cid:0) β 0 ,σ i 2(cid:1)(cid:1) φ (cid:0) λ i ;m(β 0 ),Σ (cid:0) σ i 2(cid:1)(cid:1) f 0 λ(λ i )f 0 σ2(cid:0) σ i 2(cid:1) dλ i dσ i 2(cid:12) (cid:12) (cid:12) (cid:12) ˆ = (cid:12) (cid:12)C (cid:0) β,σ i 2(cid:1) −C (cid:0) β 0 ,σ i 2(cid:1)(cid:12) (cid:12)φ (cid:0) λ i ;m(β 0 ),Σ (cid:0) σ i 2(cid:1)(cid:1) f 0 λ(λ i )f 0 σ2(cid:0) σ i 2(cid:1) dλ i dσ i 2 (B.18) ˆ ≤M (cid:12) (cid:12)C (cid:0) β,σ i 2(cid:1) −C (cid:0) β 0 ,σ i 2(cid:1)(cid:12) (cid:12)φ (cid:0) λ i ;m(β 0 ),Σ (cid:0) σ i 2(cid:1)(cid:1) f 0 σ2(cid:0) σ i 2(cid:1) dλ i dσ i 2 ˆ ≤M (cid:12) (cid:12)C (cid:0) β,σ i 2(cid:1) −C (cid:0) β 0 ,σ i 2(cid:1)(cid:12) (cid:12)f 0 σ2(cid:0) σ i 2(cid:1) dσ i 2 ˆ ≤M (cid:0) σ2(cid:1)−T− 2 dw |C 3 (β 0 )−C 3 (β)| fσ2(cid:0) σ2(cid:1) dσ2 2 i 2σ2 0 i i i ≤ 1 M (cid:0) σ2(cid:1)−T−d 2 w+2 |C (β )−C (β)|. 2 3 0 3 2 (cid:16) (cid:17) where C (β) = (cid:80) (y −β(cid:48)x )2−m(β) (cid:80) w w(cid:48) m(β) is continuous in β. ´3 t it i,t−1 t i,t−1 i,t−1 (b) |B(y)−B (y)|dy < (cid:15)(cid:48) 0 ˆ |B(y)−B (y)|dy 0 ˆ ˆ (cid:12) (cid:12) ≤ (cid:12) (cid:12) φ (cid:0) y;β(cid:48)x iT +λ(cid:48) i w iT ,σ i 2(cid:1) φ (cid:0) λ i ;m(β),Σ (cid:0) σ i 2(cid:1)(cid:1) C (cid:0) β,σ i 2(cid:1) (cid:16) fλ(λ i )fσ2(cid:0) σ i 2(cid:1) −f 0 λ(λ i )f 0 σ2(cid:0) σ i 2(cid:1) (cid:17) dλ i dσ i 2(cid:12) (cid:12)dy (cid:12) (cid:12) ˆ (cid:12) (cid:12) ˆ  φ (cid:0) y;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) φ (cid:16) λ ;mλ(β),Σλ(cid:0) σ2(cid:1) (cid:17)  (cid:12) (cid:12) + (cid:12) (cid:12) (cid:12) (cid:12)  −φ (cid:0) y;β iT 0 (cid:48)x iT + i λ iT (cid:48) i w iT i ,σ i 2(cid:1) φ i (cid:16) λ i i ;mλ i (β 0 i ),Σ i λ i (cid:0) σ i 2(cid:1) (cid:17)C (cid:0) β,σ i 2(cid:1) f 0 λ(λ i )f 0 σ2(cid:0) σ i 2(cid:1) dλ i dσ i 2(cid:12) (cid:12) (cid:12) (cid:12) dy ˆ ˆ (cid:12) (cid:12) + (cid:12) (cid:12) φ (cid:0) y;β 0 (cid:48)x iT +λ(cid:48) i w iT ,σ i 2(cid:1)(cid:0) C (cid:0) β,σ i 2(cid:1) −C (cid:0) β 0 ,σ i 2(cid:1)(cid:1) φ (cid:0) λ i ;m(β 0 ),Σ (cid:0) σ i 2(cid:1)(cid:1) f 0 λ(λ i )f 0 σ2(cid:0) σ i 2(cid:1) dλ i dσ i 2(cid:12) (cid:12)dy. (cid:12) (cid:12) The first term ˆ ˆ (cid:12) (cid:12) (cid:12) (cid:12) φ (cid:0) y;β(cid:48)x iT +λ(cid:48) i w iT ,σ i 2(cid:1) φ (cid:0) λ i ;m(β),Σ (cid:0) σ i 2(cid:1)(cid:1) C (cid:0) β,σ i 2(cid:1) (cid:16) fλ(λ i )fσ2(cid:0) σ i 2(cid:1) −f 0 λ(λ i )f 0 σ2(cid:0) σ i 2(cid:1) (cid:17) dλ i dσ i 2(cid:12) (cid:12)dy (cid:12) (cid:12) ˆ (cid:12) (cid:12) ≤ φ (cid:0) y;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) φ (cid:0) λ ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) C (cid:0) β,σ2(cid:1)(cid:12)fλ(λ )fσ2(cid:0) σ2(cid:1) −fλ(λ )fσ2(cid:0) σ2(cid:1)(cid:12)dλ dσ2dy iT i iT i i i i (cid:12) i i 0 i 0 i (cid:12) i i ˆ (cid:12) (cid:12) = φ (cid:0) λ ;m(β),Σ (cid:0) σ2(cid:1)(cid:1) C (cid:0) β,σ2(cid:1)(cid:12)fλ(λ )fσ2(cid:0) σ2(cid:1) −fλ(λ )fσ2(cid:0) σ2(cid:1)(cid:12)dλ dσ2, i i i (cid:12) i i 0 i 0 i (cid:12) i i A-21

which is the same as expression (B.17) in part (a). The second term ˆ (cid:12) (cid:12) ˆ  φ (cid:0) y;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) φ (cid:16) λ ;mλ(β),Σλ(cid:0) σ2(cid:1) (cid:17)  (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12)  −φ (cid:0) y;β iT 0 (cid:48)x iT + i λ iT (cid:48) i w iT i ,σ i 2(cid:1) φ i (cid:16) λ i i ;mλ i (β 0 i ),Σ i λ i (cid:0) σ i 2(cid:1) (cid:17)C (cid:0) β,σ i 2(cid:1) f 0 λ(λ i )f 0 σ2(cid:0) σ i 2(cid:1) dλ i dσ i 2(cid:12) (cid:12) (cid:12) (cid:12) dy ˆ (cid:12) (cid:12)φ (cid:0) y;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) φ (cid:16) λ ;mλ(β),Σλ(cid:0) σ2(cid:1) (cid:17) (cid:12) (cid:12) ≤ (cid:12) (cid:12) (cid:12) (cid:12) −φ (cid:0) y;β iT 0 (cid:48)x iT + i λ iT (cid:48) i w iT i ,σ i 2(cid:1) φ i (cid:16) λ i i ;mλ i (β 0 i ),Σ i λ i (cid:0) σ i 2(cid:1) (cid:17) (cid:12) (cid:12) (cid:12) (cid:12) C (cid:0) β,σ i 2(cid:1) f 0 λ(λ i )f 0 σ2(cid:0) σ i 2(cid:1) dλ i dσ i 2dy ˆ (cid:18)(cid:113) (cid:113) (cid:19) =M C (cid:0) β,σ2(cid:1) σ−2(β−β )(cid:48)x x(cid:48) (β−β )+ σ−2(β −β )(cid:48)V (β−β ) fσ2(cid:0) σ2(cid:1) dσ2 i i 0 iT iT 0 i 0 2 0 0 i i ˆ ≤M (cid:0) σ2(cid:1)−T−d 2 w+1 (cid:18)(cid:113) (β−β )(cid:48)x x(cid:48) (β −β )+ (cid:113) (β−β )(cid:48)V (β −β ) (cid:19) fσ2(cid:0) σ2(cid:1) dσ2 2 0 iT iT 0 0 2 0 0 i i =M (cid:0) σ2(cid:1)−T−d 2 w+1 (cid:18)(cid:113) (β−β )(cid:48)x x(cid:48) (β −β )+ (cid:113) (β−β )(cid:48)V (β −β ) (cid:19) . 2 0 iT iT 0 0 2 0 The third term ˆ ˆ (cid:12) (cid:12) (cid:12) (cid:12) φ (cid:0) y;β 0 (cid:48)x iT +λ(cid:48) i w iT ,σ i 2(cid:1)(cid:0) C (cid:0) β,σ i 2(cid:1) −C (cid:0) β 0 ,σ i 2(cid:1)(cid:1) φ (cid:0) λ i ;m(β 0 ),Σ (cid:0) σ i 2(cid:1)(cid:1) f 0 λ(λ i )f 0 σ2(cid:0) σ i 2(cid:1) dλ i dσ i 2(cid:12) (cid:12)dy (cid:12) (cid:12) ˆ ≤ φ (cid:0) y;β 0 (cid:48)x iT +λ(cid:48) i w iT ,σ i 2(cid:1)(cid:12) (cid:12)C (cid:0) β,σ i 2(cid:1) −C (cid:0) β 0 ,σ i 2(cid:1)(cid:12) (cid:12)φ (cid:0) λ i ;m(β 0 ),Σ (cid:0) σ i 2(cid:1)(cid:1) f 0 λ(λ i )f 0 σ2(cid:0) σ i 2(cid:1) dλ i dσ i 2dy ˆ = (cid:12) (cid:12)C (cid:0) β,σ i 2(cid:1) −C (cid:0) β 0 ,σ i 2(cid:1)(cid:12) (cid:12)φ (cid:0) λ i ;m(β 0 ),Σ (cid:0) σ i 2(cid:1)(cid:1) f 0 λ(λ i )f 0 σ2(cid:0) σ i 2(cid:1) dλ i dσ i 2 which is the same as expression (B.18) in part (a). (c) There exists A > 0 such that A > A. 0 Let l = log (cid:0) σ2−σ2(cid:1) , µl and Vl be the mean and variance of l based on the true distribution i i 0 0 i (cid:26) (cid:27) fl, and Kl = l : |l √i−µl 0 | ≤ k . Then, 0 i Vl 0 ˆ A = C (cid:0) β ,σ2(cid:1) φ (cid:0) λ ;m(β ),Σ (cid:0) σ2(cid:1)(cid:1) fλ(λ )fσ2(cid:0) σ2(cid:1) dλ dσ2 0 0 i i 0 i 0 i 0 i i i ˆ = C (cid:0) β ,expl +σ2(cid:1) φ (cid:0) λ ;m(β ),Σ (cid:0) expl +σ2(cid:1)(cid:1) fλ(λ )fl(l )dλ dl 0 i i 0 i 0 i 0 i i i ˆ ≥ C (cid:0) β ,expl +σ2(cid:1) φ (cid:0) λ ;m(β ),Σ (cid:0) expl +σ2(cid:1)(cid:1) fλ(λ )fl(l )dλ dl 0 i i 0 i 0 i 0 i i i λi∈Kλ,li∈Kl (cid:18) (cid:19)(cid:18) (cid:19) ≥ 1− d w 1− 1 min C (cid:0) β ,expl +σ2(cid:1) φ (cid:0) λ ;m(β ),Σ (cid:0) σ2(cid:1)(cid:1) , k2 k2 λi∈Kλ,li∈Kl 0 i i 0 0 where the second line is given by the change of variables, and the last line follows Chebyshev’s inequality on both λ and l . i i A-22

´ (d) B (y)dy < ∞ 0 ˆ B (y)dy 0 ˆ = φ (cid:0) y;β(cid:48)x +λ(cid:48)w ,σ2(cid:1) C (cid:0) β ,σ2(cid:1) φ (cid:0) λ ;m(β ),Σ (cid:0) σ2(cid:1)(cid:1) fλ(λ )fσ2(cid:0) σ2(cid:1) dλ dσ2dy 0 iT i iT i 0 i i 0 i 0 i 0 i i i ˆ = C (cid:0) β ,σ2(cid:1) φ (cid:0) λ ;m(β ),Σ (cid:0) σ2(cid:1)(cid:1) fλ(λ )fσ2(cid:0) σ2(cid:1) dλ dσ2 0 i i 0 i 0 i 0 i i i ˆ ≤ (cid:0) 2πσ2(cid:1)−T/2 fλ(λ )fσ2(cid:0) σ2(cid:1) dλ dσ2 0 i 0 i i i = (cid:0) 2πσ2(cid:1)−T/2 . 3. Correlated random coefficients (cid:13) (cid:13) Now we replace f(h ) with f(h |c ) in expression (B.15) that characterizes (cid:13)fcond −foracle(cid:13) . i i i0 (cid:13) i,T+1 i,T+1(cid:13) 1 Given Proposition 3.14 condition 2-b (q (c ) is bounded below by some q > 0), we have 0 0 ˆ ˆ ˆ (cid:20) (cid:21) (cid:20) (cid:21) q |f(z|c )−f (z|c )|dz < |f(z|c )−f (z|c )|dz q (c )dc < (cid:15), 0 0 0 0 0 0 0 0 0 so ˆ |f(z|c )−f (z|c )|dz < (cid:15)/q. 0 0 0 Therefore, we achieve the convergence of conditional distribution for any c and ensure that the 0 first term in part (a) is sufficiently small. The rest of the proof parallels the random coefficients scenarios. C Algorithms C.1 Hyperparameters Let us take the baseline model with random effects as an example, and the priors and hyperparameters for more complicated models can be constructed in a similar way. The prior for the common parameters takes a conjugate norma-inverse-gamma form, (cid:16) (cid:17) (cid:16) (cid:17) (cid:0) β,σ2(cid:1) ∼ N mβ,ψβσ2 IG aσ2 ,bσ2 . 0 0 0 0 A-23

The hyperparameters are chosen in a relatively ignorant sense without inferring too much from the data except aligning the scale according to the variance of the data. aσ2 = 2, (C.1) 0 bσ 0 2 = Eˆi (cid:16) V(cid:100)ar t i (y it ) (cid:17) · (cid:16) aσ 0 2 −1 (cid:17) = Eˆi (cid:16) V(cid:100)ar t i (y it ) (cid:17) , (C.2) mβ = 0.5, (C.3) 0 1 1 ψβ = = . (C.4) 0 bσ 0 2/ (cid:0) aσ 0 2 −1 (cid:1) Eˆi (cid:16) V(cid:100)ar t i (y it ) (cid:17) In equation (C.2) here and equation (C.5) below, Eˆt and V(cid:100)ar t stand for the sample mean and i i variance for firm i over t = 1,··· ,T, and Eˆi and V(cid:100)ar i are the sample mean and variance over the wholecross-sectioni = 1,··· ,N. Equation(C.2)ensuresthatonaveragethepriorandthedatahave asimilarscale. Equation(C.3)conjecturesthattheyoungfirmdynamicsarehighlylikelypersistent andstationary. Sincewedon’thavestrongpriorinformationinthecommonparameters,theirpriors are chosen to be not very restrictive. Equation (C.1) characterizes a rather less informative prior on σ2 with infinite variance, and Equation (C.4) assumes that the prior variance of β is equal to 1 on average. The hyperpriors for the DPM prior are specified as: (cid:16) (cid:17) (cid:16) (cid:17) G (cid:0) µ ,ω2(cid:1) = N mλ,ψλω2 IG aλ,bλ , 0 k k 0 0 k 0 0 α ∼ Ga(aα,bα). 0 0 Similarly, the hyperparameters are chosen to be: aλ 0 = 2, bλ 0 = V(cid:100)ar i(cid:16) Eˆ i t(y it ) (cid:17) · (cid:16) aλ 0 −1 (cid:17) = V(cid:100)ar i(cid:16) Eˆ i t(y it ) (cid:17) , (C.5) mλ = 0, ψλ = 1, 0 0 aα = 2, bα = 2. (C.6) 0 0 where bλ is selected to match the scale, while aλ, mλ, and ψλ yields a relatively ignorant and diffuse 0 0 0 0 prior. Following Ishwaran and James (2001, 2002), the hyperparameters for the DP scale parameter α in equation (C.6) allows for a flexible component structure with a wide range of component numbers. The truncated number of components is set to be K = 50, so that the approximation error is uniformly bounded by Ishwaran and James (2001) Theorem 2: (cid:13) (cid:13) (cid:18) K −1 (cid:19) (cid:13)fλ,K −fλ(cid:13) ∼ 4N exp − ≤ 2.10×10−18, (cid:13) (cid:13) 1 α at the prior mean of α (α¯ = 1) and cross-sectional sample size N = 1000. A-24

I have also examined other choices of hyperparameters, and results are not very sensitive to hyperparameters as long as the implied priors are flexible enough to cover the range of observables. C.2 Random-Walk Metropolis-Hastings When there is no closed-form conditional posterior distribution in some MCMC steps, it is helpful to employ the Metropolis-within-Gibbs sampler and use the random-walk Metropolis-Hastings (RWMH) algorithm for those steps. The adaptive RWMH algorithm below is based on Atchad´e and Rosenthal (2005) and Griffin (2016), who adaptively adjust the random walk step size in order to keep acceptance rates around certain desirable percentage. Algorithm C.1. (Adaptive RWMH) Let us consider a generic variable θ. For each iteration s = 1,··· ,n , sim 1. Draw candidate θ˜from the random-walk proposal density θ˜∼ N (cid:0) θ(s−1),ζ(s)Σ (cid:1) . 2. Calculate the acceptance rate (cid:32) (cid:33) p(θ˜|·) a.r.(θ˜|θ(s−1)) = min 1, , p(θ(s−1)|·) where p(θ|·) is the conditional posterior distribution of interest. 3. Accept the proposal and set θ(s) = θ˜ with probability a.r.(θ˜|θ(s−1)). Otherwise, reject the proposal and set θ(s) = θ(s−1). 4. Update the random-walk step size for the next iteration, (cid:16) (cid:16) (cid:17)(cid:17) logζ(s+1) = ρ logζ(s)+s−c a.r.(θ˜|θ(s−1))−a.r.(cid:63) , where 0.5 < c ≤ 1, a.r.(cid:63) is the target acceptance rate, and ρ(x) = min(|x|,x¯)·sgn(x), where x¯ > 0 is a very large number. Remark C.2. (i) In step 1, since the algorithms in this paper only consider RWMH on conditionally independent scalar variables, Σ is simply taken to be 1. (ii) In step 4, I choose c = 0.55, a.r.(cid:63) = 30% in the numerical exercises, following Griffin (2016). C.3 Details on Posterior Samplers C.3.1 Step 2: Component Parameters (cid:16) (cid:17) Random Coefficients Model For z = λ,l and k = 1,··· ,Kz, draw µ z(s) ,Ω z(s) from a k k multivariate-normal-inverse-Wishart distribution (or a normal-inverse-gamma distribution if z is a A-25

(cid:18) (cid:12) (cid:19) z(s) z(s)(cid:12)(cid:110) (s−1) (cid:111) scalar) p µ ,Ω (cid:12) z : k k (cid:12) i i∈Jz(s−1) k (cid:16) (cid:17) (cid:16) (cid:17) µ z(s) ,Ω z(s) ∼ N mz,ψzΩ z(s) IW(Ψz,νz), k k k k k k k mˆz = 1 (cid:88) z (s−1) , k z(s−1) i n k i∈Jz(s−1) k (cid:16) (cid:17)−1 ψz = (ψz)−1+n z(s−1) , k 0 k   mz = ψz(ψz)−1mz + (cid:88) z (s−1), k k 0 0 i  i∈Jz(s−1) k νz = νz +n z(s−1) , k 0 k Ψz = Ψz + (cid:88) (cid:16) z (s−1) (cid:17)2 +mz(cid:48)(ψz)−1mz −mz(cid:48)(ψz)−1mz. k 0 i 0 0 0 k k k i∈Jz(s−1) k Correlated Random Coefficients Model Due to the complexity arising from the conditional (cid:16) (cid:17) z(s) z(s) structure, I break the updating procedure for µ ,Ω into two steps. For z = λ,l, and k k k = 1,··· ,Kz, (cid:18) (cid:12) (cid:19) (cid:16) z(s) (cid:17) z(s)(cid:12) z(s−1) (cid:110) (s−1) (cid:111) (a)Drawvec µ k fromamultivariatenormaldistributionp µ k (cid:12) (cid:12) Ω k , z i ,c i0 i∈Jz(s−1) : k (cid:16) (cid:17) vec µ z(s) ∼ N (vec(mz),ψz), k k k mˆz,zc = (cid:88) z (s−1)(cid:2) 1,c(cid:48) (cid:3) , k i i0 i∈Jz(s−1) k mˆz,cc = (cid:88) (cid:2) 1,c(cid:48) (cid:3)(cid:48)(cid:2) 1,c(cid:48) (cid:3) , k i0 i0 i∈Jz(s−1) k mˆz = mˆz,zc(cid:0) mˆz,cc(cid:1)−1 , k k k (cid:20) (cid:16) (cid:17)−1 (cid:21)−1 ψz = (ψz)−1+mˆz,cc⊗ Ω z(s−1) , k 0 k k (cid:20) (cid:18) (cid:19) (cid:21) (cid:16) (cid:17)−1 vec(mz) = ψz (ψz)−1vec(mz)+ mˆz,cc⊗ Ω z(s−1) vec(mˆz) . k k 0 0 k k k z(s) (b) Draw Ω from an inverse Wishart distribution (or an inverse gamma distribution if z is a k A-26

(cid:18) (cid:12) (cid:19) z(s)(cid:12) z(s) (cid:110) (s−1) (cid:111) scalar) p Ω k (cid:12) (cid:12) µ k , z i ,c i0 i∈Jz(s−1) : k Ω z(s) ∼ IW(Ψz,νz), k k k νz = νz +n z(s−1) , k 0 k Ψz = Ψz + (cid:88) (cid:16) z (s−1) −µ z(s)(cid:2) 1,c(cid:48) (cid:3)(cid:48) (cid:17)(cid:16) z (s−1) −µ z(s)(cid:2) 1,c(cid:48) (cid:3)(cid:48) (cid:17)(cid:48) . k 0 i k i0 i k i0 i∈Jz(s−1) k C.3.2 Step 4: Individual-specific Parameters (s) For i = 1,··· ,N, draw λ from a multivariate normal distribution (or a normal distribution if λ (cid:16) (cid:12) i (cid:17) is a scalar) p λ (s)(cid:12)µ λ(s) ,Ω λ(s) , (cid:0) σ2(cid:1)(s−1) ,β(s−1),D ,D : i (cid:12) γλ γλ i i A i i (cid:16) (cid:17) λ (s) ∼ N mλ,Σλ , i i i Σλ = (cid:32) (cid:16) Ω λ(s) (cid:17)−1 + (cid:16) (cid:0) σ2(cid:1)(s−1) (cid:17)−1 (cid:88) t1i w w(cid:48) (cid:33)−1 , i γλ i i,t−1 i,t−1 i t=t0i mλ = Σλ (cid:32) (cid:16) Ω λ(s) (cid:17)−1 µ˜λ+ (cid:16) (cid:0) σ2(cid:1)(s−1) (cid:17)−1 (cid:88) t1i w (cid:16) y −β(s−1)(cid:48)x (cid:17) (cid:33) , i i γλ i i i,t−1 it i,t−1 i t=t0i where the conditional“prior”mean is characterized by  λ(s) µ , for the random coefficients model, µ˜λ = γ i λ i µ λ(s) [1,c(cid:48) ](cid:48), for the correlated random coefficients model. γλ i0 i A-27

C.3.3 Step 5: Common parameters Cross-sectional Homoskedasticity Draw (cid:0) β(s),σ2(s)(cid:1) from a linear regression model with (cid:16) (cid:12)(cid:110) (cid:111) (cid:17) “unknown”variance, p β(s),σ2(s)(cid:12) λ (s) ,D : (cid:12) i (cid:16) (cid:17) (cid:16) (cid:17) (cid:16) (cid:17) β(s),σ2(s) ∼ N mβ,ψβσ2(s) IG aσ2 ,bσ2 , ψβ = (cid:32) (cid:16) ψβ (cid:17)−1 + (cid:88) N (cid:88) t1i x x(cid:48) (cid:33)−1 , 0 i,t−1 i,t−1 i=1t=t0i mβ = ψβ (cid:32) (cid:16) ψβ (cid:17)−1 mβ + (cid:88) N (cid:88) t1i x (cid:16) y −λ (s)(cid:48) w (cid:17) (cid:33) , 0 0 i,t−1 it i i,t−1 i=1t=t0i NT aσ2 = aσ2 + 0 2 (cid:32) N T (cid:33) bσ2 = bσ2 + 1 (cid:88)(cid:88)(cid:16) y −λ (s)(cid:48) w (cid:17)2 +mβ(cid:48) (cid:16) ψβ (cid:17)−1 mβ −mβ(cid:48) (cid:16) ψβ (cid:17)−1 mβ . 0 2 it i i,t−1 0 0 0 i=1 t=1 Cross-sectional Heteroskedasticity Draw β(s) from a linear regression model with“known” (cid:16) (cid:12)(cid:110) (cid:111) (cid:17) variance, p β(s)(cid:12) λ (s) , (cid:0) σ2(cid:1)(s) ,D : (cid:12) i i (cid:16) (cid:17) β(s) ∼ N mβ,Σβ , Σβ = (cid:32) (cid:16) Σβ (cid:17)−1 + (cid:16) (cid:0) σ2(cid:1)(s) (cid:17)−1(cid:88) N (cid:88) t1i x x(cid:48) (cid:33)−1 , 0 i i,t−1 i,t−1 i=1t=t0i mβ = Σβ (cid:32) (cid:16) Σβ (cid:17)−1 mβ + (cid:16) (cid:0) σ2(cid:1)(s) (cid:17)−1(cid:88) N (cid:88) t1i x (cid:16) y −λ (s)(cid:48) w (cid:17) (cid:33) . 0 0 i i,t−1 it i i,t−1 i=1t=t0i Remark C.3. For unbalanced panels, the summations and products in steps 4 and 5 (Subsections C.3.2 and C.3.3) are instead over t = t ,··· ,t , the observed periods for individual i. 0i 1i C.4 Slice-Retrospective Sampler The next algorithm borrows the idea from some recent development in DPM sampling strategies (Dunson, 2009; Yau et al., 2011; Hastie et al., 2015), which integrates the slice sampler (Walker, 2007; Kalli et al., 2011) and the retrospective sampler (Papaspiliopoulos and Roberts, 2008). By addingextraauxiliaryvariables, thesamplerisabletoavoidhardtruncationinIshwaranandJames (2001,2002). Iexperimentwithittocheckwhethertheapproximationerrorduetotruncationwould significantly affect the density forecasts or not, and the results do not change much. The following algorithm is designed for the random coefficient case. A corresponding version for the correlated random coefficient case can be constructed in a similar manner. A-28

The auxiliary variables uz, i = 1,··· ,N, are i.i.d. standard uniform random variables, i.e. i uz ∼ U (0,1). Then, the mixture of components in equation (2.7) can be rewritten as i ∞ (cid:88) z ∼ 1(uz < pz )fz(z;θz), i ik k k=1 where z = λ,l. By marginalizing over uz, we can recover equation (2.7). Accordingly, we can define i the number of active components as Kz,A = max γz, i 1≤i≤N and the number of potential components (including active components) as     k  (cid:88)  Kz,P = min k : 1− pz j < min uz i .  1≤i≤N  j=1 Although the number of components is infinite literally, we only need to care about the components that can potentially be occupied. Therefore, Kz,P serves as an upper limit on the number of components that need to be updated at certain iteration. Here I suppress the iteration indicator s for exposition simplicity, but note that both Kz,A and Kz,P can change over iterations; this is indeed the highlight of this sampler. Algorithm C.4. (Slice-Retrospective: Random Coefficients with Cross-sectional Heteroskedasticity) For each iteration s = 1,··· ,n , steps 1-3 in Algorithm 4.1 are modified as follows: sim For z = λ,l, 1. Active components: (a) Number of active components: Kz,A = max γ z(s−1) . i 1≤i≤N (b) Component probabilities: for k = 1,··· ,Kz,A, draw pz∗ from the stick-breaking process (cid:16) (cid:12) (cid:110) (cid:111)(cid:17) k p {pz∗}(cid:12)αz(s−1), n z(s−1) : k (cid:12) k   Kz,A pz∗ ∼ SBn z(s−1) ,αz(s−1)+ (cid:88) n z(s−1) , k = 1,··· ,Kz,A. k k j j=k+1 (cid:18) (cid:12) (cid:19) (c) Component parameters: for k = 1,··· ,Kz,A, draw θz∗ from p θz∗ (cid:12) (cid:12) (cid:110) z (s−1) (cid:111) k k (cid:12) i i∈Jz(s−1) k as in Algorithm 4.1 step 2. (cid:110) (cid:111)Kz,A (cid:110) (cid:111)Kz,A (d) Label switching: jointly update p z(s) ,θ z(s) ,γz∗ based on pz∗,θz∗,γ z(s−1) by k k i k k i k=1 k=1 A-29

three Metropolis-Hastings label-switching moves: i. randomly select two non-empty components, switch their component labels (γz), while i leaving component parameters (θz) and component probabilities (pz) unchanged; k k ii. randomly select two adjacent components, switch their component labels (γz) and i component“stick lengths”(ζz), while leaving component parameters (θz) unchanged; k k iii. randomly select two non-empty components, switch their component labels (γz) and i component parameters (θz), as well as update their component probabilities (pz). k k Then, adjust Kz,A accordingly. (cid:16) (cid:12)(cid:110) (cid:111) (cid:17) 2. Auxiliaryvariables: fori = 1,··· ,N, drawu z(s) fromauniformdistributionp u z(s)(cid:12) p z(s) ,γz∗ : i i (cid:12) k i (cid:16) (cid:17) z(s) z(s) u ∼ U 0,p . i γz∗ i 3. DP scale parameter: (a) Draw the latent variable ξz(s) from a beta distribution p (cid:0) ξz(s) (cid:12) (cid:12)αz(s−1),N (cid:1) : (cid:16) (cid:17) ξz(s) ∼ Beta αz(s−1)+1,N . (b) Drawαz(s) fromamixtureoftwogammadistributionsp (cid:0) αz(s) (cid:12) (cid:12)ξz(s),Kz,A,N (cid:1) :Parametric Prior for Heteroskedastic σ2 i (cid:16) (cid:17) (cid:16) (cid:17) αz(s) ∼ pαz Ga aαz +Kz,A,bαz −logξz(s) + (cid:0) 1−pαz(cid:1) Ga aαz +Kz,A−1,bαz −logξz(s) , aαz +Kz,A−1 pαz = . N (cid:0) bαz −logξz(s) (cid:1) 4. Potential components: (a) Component probabilities: start with Kz∗ = Kz,A, (cid:16) (cid:17) i. if 1− (cid:80)Kz∗ p z(s) < min u z(s) , set Kz,P = Kz∗ and stop; j=1 j 1≤i≤N i (cid:16) (cid:17) ii. otherwise, letKz∗ = Kz∗+1, drawζz ∼ Beta (cid:0) 1,αz(s)(cid:1) , updatep z(s) = ζz (cid:81) 1−ζz , Kz∗ Kz∗ Kz∗ j<Kz∗ j and go to step (a-i). (b) Component parameters: for k = Kz,A+1,··· ,Kz,P, draw θ z(s) from the DP base distrik bution Gz. 0 z(s) 5. Component memberships: For i = 1,···N, draw γ from a multinomial distribution (cid:16)(cid:110) (cid:111)(cid:12)(cid:110) (cid:111) (cid:17) i p γ z(s) (cid:12) p z(s) ,µ z(s) ,Ω z(s) ,u z(s) ,z (s−1) : i (cid:12) k k k i i γ z(s) = k, with probability pz , k = 1,··· ,Kz,P, i ik Kz,P pz ∝ p z(s) φ (cid:16) z (s−1) ;µ z(s) ,Ω z(s) (cid:17) 1 (cid:16) u z(s) < p z(s) (cid:17) , (cid:88) pz = 1. ik k i k k i k ik k=1 A-30

The remaining part of the algorithm resembles steps 4 and 5 in Algorithm 4.1. Remark C.5. Note that: (i) Steps 1-b,c,d are sampling from “marginal” posterior of (pz,θz,γz) for the active components k k i with the auxiliary variables uzs being integrated out. Thus, extra caution is needed in dealing with i the order of the steps. (ii) The label switching moves 1-d-i and 1-d-ii are based on Papaspiliopoulos and Roberts (2008), and 1-d-iii is suggested by Hastie et al. (2015). All these label switching moves aim to improve numerical convergence. (iii) Step 3 for DP scale parameter αz follows Escobar and West (1995). It is different from step 1-a in Algorithm 4.1 due to the unrestricted number of components in the current sampler. (iv) Steps 4-a-ii and 4-b that update the potential components are very similar to steps 1-b and 1-c that update the active components—just take Jz as an empty set and draw directly from the prior. k (v) The auxiliary variable uz also appears in step 5 that updates component memberships. The i inclusionofauxiliaryvariableshelpsdetermineafinitesetofrelevantcomponentsforeachindividual i without mechanically truncating the infinite mixture. C.5 Parametric Specification of Heteroskedasticity For Heterosk-Param, we adopt an inverse gamma prior for σ2, i σ2 ∼ IG(a,b). i TheconjugatepriorsforshapeparameteraandscaleparameterbarebasedonLleraandBeckmann (2016) Sections 2.3.1 and 2.3.2: (cid:16) (cid:17) b ∼ Ga ab,bb , 0 0 (aa)−1−a(b)aca 0 p(a|b,aa,ba,ca) ∝ 0 . (C.7) 0 0 0 Γ(a)ba 0 Following Llera and Beckmann (2016), the hyperparameters are chosen as aa = 1, ba = ca = ab = 0 0 0 0 bb = 0.01, which specifies relatively uninformative priors for a and b. The corresponding segment 0 of the posterior sampler is given as follows. Algorithm C.6. (Parametric Specification: Cross-sectional Heteroskedasticity) For each iteration s = 1,··· ,n , sim 1. Shape parameter: Draw a(s) via the random-walk Metropolis-Hastings approach, (cid:16) (cid:12) (cid:110) (cid:111)(cid:17) (cid:16) (cid:17) p a(s)(cid:12)b(s−1), σ 2(s−1) = p a(s)|b(s−1),aa,ba,ca , (cid:12) i 1 1 0 A-31

which is characterized by the same kernel form as expression (C.7) with N log(aa) = log(aa)+ (cid:88) log(σ 2(s−1) ), 1 0 i i=1 ba = ba+N, 1 0 ca = ca+N. 1 0 (cid:16) (cid:12) (cid:110) (cid:111)(cid:17) 2. Scale parameter: Draw b(s) from a gamma distribution, p b(s)(cid:12)a(s), σ 2(s−1) : (cid:12) i (cid:16) (cid:17) b(s) ∼ Ga ab,bb , 1 1 ab = ab +Na(s), 1 0 N bb = bb + (cid:88)(cid:16) σ 2(s−1) (cid:17)−1 . 1 0 i i=1 2(s) 3. Heteroskedasticity: For i = 1,··· ,N, draw σ from an inverse gamma distribution, (cid:16) (cid:12) (cid:17) i p σ 2(s)(cid:12)a(s),b(s),λ (s) ,β(s−1),D ,D : i (cid:12) i i A 2(s) σ ∼ IG(a ,b ), i i i a = a(s)+T/2, i T b = b(s)+ 1 (cid:88)(cid:16) y −β(s−1)(cid:48)x −λ (s)(cid:48) w (cid:17)2 . i 2 it i,t−1 i i,t−1 t=1 D Simulations and Empirical Application D.1 Simulations A-32

Figure D.1: Convergence Diagnostics: β For each iteration s, rolling mean is calculated over the most recent 1000 draws. A-33

Figure D.2: Convergence Diagnostics: σ2 For each iteration s, rolling mean is calculated over the most recent 1000 draws. A-34

Figure D.3: Convergence Diagnostics: α For each iteration s, rolling mean is calculated over the most recent 1000 draws. A-35

Figure D.4: Convergence Diagnostics: λ 1 For each iteration s, rolling mean is calculated over the most recent 1000 draws. A-36

Figure D.5: f vs Π(f|y ) : Baseline Model, N = 105 0 1:N,0:T The black solid line represents the true λ distribution, f . The blue bands show the posterior distribution i 0 of f, Π(f|y ). 1:N,0:T D.2 Empirical application Other model specifications Followingtheyoungfirmdynamicsliterature, forthekeyvariables with potential heterogeneous effects (w ), I also examined the following two setups beyond the i,t−1 R&D setup in Section 6:52 (i) w = 1, which specifies the baseline model with λ being the individual-specific intercept. i,t−1 i (ii) w = [1, rec ](cid:48). rec is an aggregate dummy variable indicating the recent recession. It i,t−1 t−1 t is equal to 1 for 2008 and 2009, and is equal to 0 for other periods. Details on sample construction After the second step (based on Assumption 3.5 for unbalanced panels), we have the cross-sectional dimension N = 859 for the baseline specification, N = 794 with recession, and N = 677 with R&D. In order to compare forecasting performance across different setups, the sample is further restricted so that all three setups share exactly the same set of firms, and we are left with N = 654 firms. Additional results Common parameter β: In most cases, the posterior means are around 0.4 ∼ 0.6. Point Forecasts: Most of the estimators are comparable according to MSE, with only Flat performing poorly in all three setups. 52I do not jointly incorporate recession and R&D because such specification largely restricts the cross-sectional sample size due to the rank requirement for unbalanced panels. A-37

Density Forecasts: The overall best is the Heterosk-NP-C/R predictor in the R&D setup. Comparing setups, the one with recession produces the worst density forecasts (and point forecasts as well), so the recession dummy with heterogeneous effects does not contribute much to forecasting and may even incur overfitting. Table D.1: Common Parameter β Baseline Recession R&D Mean Std Mean Std Mean Std Heterosk NP-C/R 0.48 0.01 0.46 0.02 0.52 0.01 Homog 0.85 0.02 0.85 0.02 0.89 0.02 Homosk NP-C 0.37 0.02 0.88 0.02 0.51 0.03 Heterosk Flat 0.19 0.02 0.25 0.00 0.50 0.00 Param 0.48 0.03 0.26 0.03 0.56 0.03 NP-disc 0.55 0.02 0.79 0.02 0.84 0.04 NP-R 0.47 0.03 0.30 0.03 0.74 0.04 NP-C 0.38 0.02 0.40 0.06 0.53 0.01 Table D.2: Forecast Evaluation: Young Firm Dynamics Baseline Recession R&D MSE* LPS*N MSE* LPS*N MSE* LPS*N Heterosk NP-C/R 0.20*** -230*** 0.23*** -272*** 0.20*** -228*** Homog 10%*** -81*** -2%*** -41*** 8%*** -74*** Homosk NP-C 7%*** -66*** 2%*** -17*** 9%*** -52*** Heterosk Flat 22%*** -42*** 44%*** -701*** 102%*** -309*** Param 4%*** -60*** 35%*** -135*** 7%*** -52*** NP-disc 1%*** -9*** -7%*** -1*** 2%*** -20*** NP-R 1%*** -5*** 28%*** -63*** 3%*** -16*** NP-C 3%*** -6*** 3%*** -5*** 0.1%*** -5*** See the description of Table 5.2. Here Heterosk-NP-C/R is the benchmark for both normalization and significance tests. A-38

Cite this document

APA

Laura Liu (2018). Density Forecasts in Panel Data Models: A Semiparametric Bayesian Perspective (FEDS 2018-036). Board of Governors of the Federal Reserve System, Finance and Economics Discussion Series. https://whenthefedspeaks.com/doc/feds_2018-036

BibTeX

@techreport{wtfs_feds_2018_036,
  author = {Laura Liu},
  title = {Density Forecasts in Panel Data Models: A Semiparametric Bayesian Perspective},
  type = {Finance and Economics Discussion Series},
  number = {2018-036},
  institution = {Board of Governors of the Federal Reserve System},
  year = {2018},
  url = {https://whenthefedspeaks.com/doc/feds_2018-036},
  abstract = {This paper constructs individual-specific density forecasts for a panel of firms or households using a dynamic linear model with common and heterogeneous coefficients and cross-sectional heteroskedasticity. The panel considered in this paper features a large cross-sectional dimension N but short time series T. Due to the short T, traditional methods have difficulty in disentangling the heterogeneous parameters from the shocks, which contaminates the estimates of the heterogeneous parameters. To tackle this problem, I assume that there is an underlying distribution of heterogeneous parameters, model this distribution nonparametrically allowing for correlation between heterogeneous parameters and initial conditions as well as individual-specific regressors, and then estimate this distribution by pooling the information from the whole cross-section together. Theoretically, I prove that both the estimated common parameters and the estimated distribution of the heterogeneous parameters achieve posterior consistency, and that the density forecasts asymptotically converge to the oracle forecast. Methodologically, I develop a simulation-based posterior sampling algorithm specifically addressing the nonparametric density estimation of unobserved heterogeneous parameters. Monte Carlo simulations and an application to young firm dynamics demonstrate improvements in density forecasts relative to alternative approaches. Accessible materials (.zip)},
}