feds · June 22, 2023

Finite-State Markov-Chain Approximations: A Hidden Markov Approach

Abstract

This paper proposes a novel finite-state Markov chain approximation method for Markov processes with continuous support, providing both an optimal grid and transition probability matrix. The method can be used for multivariate processes, as well as non-stationary processes such as those with a life-cycle component. The method is based on minimizing the information loss between a Hidden Markov Model and the true data-generating process. We provide sufficient conditions under which this information loss can be made arbitrarily small if enough grid points are used. We compare our method to existing methods through the lens of an asset-pricing model, and a life-cycle consumption-savings model. We find our method leads to more parsimonious discretizations and more accurate solutions, and the discretization matters for the welfare costs of risk, the marginal propensities to consume, and the amount of wealth inequality a life-cycle model can generate.

Finance and Economics Discussion Series
Federal Reserve Board, Washington, D.C.
ISSN 1936-2854 (Print)
ISSN 2767-3898 (Online)

Finite-State Markov-Chain Approximations: A Hidden Markov Approach
Janssens, Eva F. and McCrary, Sean
2023-040

Please cite this paper as: Janssens, Eva F., and McCrary, Sean (2023). “Finite-State Markov-Chain Approximations: A Hidden Markov Approach,” Finance and Economics Discussion Series 2023-040. Washington: Board of Governors of the Federal Reserve System, https://doi.org/10.17016/FEDS.2023.040.

NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. References in publications to the Finance and Economics Discussion Series (other than acknowledgement) should be cleared with the author(s) to protect the tentative character of these papers.

Eva F. Janssens† and Sean McCrary‡

May 19, 2023

Keywords: Numerical methods, Kullback–Leibler divergence, misspecified model, earnings process

JEL classification codes: C63, C68, D15, E21

* Disclaimer: The views expressed in this paper are solely the responsibility of the authors and should not be interpreted as reflecting the views of the Board of Governors of the Federal Reserve System.

* Acknowledgements: Both authors are grateful for the invaluable comments from José-Víctor Ríos-Rull, Frank Kleibergen, Christian Stoltenberg, Robin Lumsdaine, as well as seminar and conference participants at the University of Zürich, University of Amsterdam, University of Michigan, University of Houston, University of Oxford, Federal Reserve Board, EEA/ESEM 2022, CEF 2022, CFE 2022, and the NBER SI 2022, especially Jesús Fernández-Villaverde, Frank Schorfheide, Borağan Aruoba, Mikkel Plagborg-Møller, Michael Wolf, Damian Kozbur, Florian Gunsilius, Elisabeth Proehl, and many others. Janssens is grateful to the Dutch Research Council for the NWO Research Talent Grant, project number 406.18.514 and to the Erasmus Trustfonds for the Professor Bruins Prize 2018, funding the research visit to University of Pennsylvania during which this paper was written, as well as to Frank Schorfheide for hosting this visit. We thank the Society of Computational Economics for the CEF 2022 Student Prize. Any errors are our own.

Contact information:
† Eva F. Janssens: Economist at the Board of Governors of the Federal Reserve System, e-mail: eva.f.janssens@frb.gov
‡ Sean McCrary: PhD student at the University of Pennsylvania, e-mail: smccrary@sas.upenn.edu

1 Introduction

Numerical methods to solve nonlinear dynamic stochastic models often rely on finite-state Markov chain approximations of continuous stochastic processes. The stochastic process is an important input for these models, and its finite-state Markov chain approximation should therefore resemble the continuous process as closely as possible. This paper proposes a novel full-information method that can be used for the discretization of continuous Markov processes. We show that our method results in more accurate solutions to an asset-pricing model, and a better characterization of earnings risk in a life-cycle consumption-saving model with non-linear non-Gaussian earnings processes.

Approximating a continuous stochastic process by a discrete Markov process, characterized by a grid of support points and a transition probability matrix, inherently comes down to picking a misspecified approximating model. Borrowing from the misspecified model literature, we therefore propose a finite-state Markov chain approximation method that minimizes the information loss between the misspecified process and the true continuous process. We assume that the misspecified process is a Hidden Markov Model (HMM), that is, each realization is equal to the sum of a state-dependent level (a grid point) and an error term. This state is unobserved, and the evolution of the unobserved state is governed by a discrete first-order Markov process (with a transition probability matrix). This effectively embeds a discrete Markov chain into a continuous support process via a continuous measurement error. This allows us to use the standard Kullback-Leibler (KL) divergence between the two processes as our measure of information loss.
Consequently, the practical implementation of our method is simple, because in this setting, minimizing the KL divergence is essentially quasi-maximum likelihood estimation, fitting a HMM on data simulated from the continuous support process.1 What is attractive about our approach is that it results in both an optimal grid and transition probability matrix, and can be applied to multivariate processes, in which case the optimal grid helps limit the curse-of-dimensionality issue posed by tensor grids by accounting for the dependency between variables.

1 As shown by Mevel and Finesso (2004) and later Douc and Moulines (2012), the maximum likelihood estimator of a misspecified HMM is consistent, in the sense that it minimizes the KL divergence between the model and the true process.

Our theoretical contribution is to prove that, under some assumptions, as the number of unobserved states (and thus grid points) becomes large, the information loss between the misspecified HMM and the true continuous stochastic process becomes arbitrarily small. This relates our paper to the literature on universal approximators, where we build on the result by Zeevi and Meir (1997) on (Gaussian) mixtures, and extend this to the non-i.i.d. setting of HMM's.2 Our proof provides insights into what properties of the process determine how many grid points are needed to obtain a certain information loss. For example, more persistent processes require a larger number of grid points, which is why finite-state Markov chain approximations of highly persistent processes tend to be less accurate.3

We evaluate the performance of our method in two economic applications: an asset pricing model and a life-cycle consumption-saving model. First, in our asset pricing model, we discretize dividend growth, which is assumed to follow an autoregressive (AR(1)) process with stochastic volatility, parametrized as in Bansal and Yaron (2004). As shown by De Groot (2015), this model has a closed-form solution. We use this solution as a benchmark to compare the performance of our method against the standards in the literature, and find our discretization captures higher-order moments of the true continuous process better, is more parsimonious, and results in more accurate model solutions. For example, we analyze the accuracy of these discretization methods for estimates of the certainty equivalent level of consumption (CE) and find that our method deviates 0.8-1.9% from the closed-form solution of De Groot (2015), while the comparison methods have deviations ranging from 4 to 12%. These results highlight the strength of a full-information approach, because for a non-linear object such as the CE, all information of the stochastic process matters. Second, we analyze the performance of our method through the lens of a life-cycle consumption-saving model.
In this application, we focus on two processes featuring life-cycle dependence: the process proposed in Guvenen, Karahan, Ozkan, and Song (2021) that features non-employment shocks, and innovations with positive skewness, and the non-parametric process in Arellano, Blundell, and Bonhomme (2017). These processes are considered to be at the frontier of the earnings dynamics literature (Altonji, Hynsjö, and Vidangos, 2022). Our discretizations better capture the excess skewness and kurtosis of the Guvenen et al. (2021) and Arellano et al. (2017) processes than commonly-used binning-based discretization methods.4 For the Guvenen et al. (2021) process specifically, the binning method fails to capture the long-run dynamics of non-employment.

2 The universal approximator property has also been shown to hold for Neural Networks, but to our knowledge also only in an i.i.d. setting, see, e.g., the seminal work by Hornik, Stinchcombe, and White (1989).
3 As discussed in/shown by Flodén (2008), Kopecky and Suen (2010) and Galindev and Lkhagvasuren (2010).
4 These binning methods are adapted from the textbook treatment of Adda and Cooper (2003).

In the life-cycle model, we find that the discretization method matters for various economic quantities of interest, including the welfare cost of risk, wealth inequality measures, and marginal propensities to consume (MPC). By failing to fully capture the excess kurtosis and skewness of the processes over the life-cycle, binning-based methods underestimate the welfare cost of risk and the amount of precautionary saving in the economy. For the Guvenen et al. (2021) process, the binning-based method underestimates the welfare cost of risk by 23 percentage points relative to our method, and for the Arellano et al. (2017) process, the difference is 3 percentage points. Discretizations also matter for the amount of wealth inequality a life-cycle model can generate. While it is known that life-cycle models struggle to match the wealth distribution in the data (De Nardi and Fella, 2017), we show that more accurate discretizations of the earnings process can generate more wealth inequality. Our discretization of the Arellano et al. (2017) process generates a Wealth Gini of 0.76, close to that of the United States (0.77-0.78), while binning results in a value that is 0.06 lower. Similarly, our discretization results in top 1% wealth shares close to those in the data, while the binning-based estimates underestimate this moment.

Our results on the importance of discretization methods for life-cycle model solutions extend to simpler processes. For a Gaussian AR(1) and mixture AR(1), we show the life-cycle model solutions differ significantly between discretization methods, although the differences do become smaller when a sufficiently large number of grid points is used. This is an important insight, given the low number of grid points the literature tends to use for these processes. Our solution, on the other hand, changes little when adding more grid points, because our method is more parsimonious and captures more information of the true process than the other discretization methods.
For these processes, the sensitivity of the marginal propensities to consume over the life-cycle stands out. Other discretization methods can underestimate the MPC's for younger age groups by as much as 20 percentage points when using a low number of grid points.

Finally, we compare the life-cycle implications across different stochastic processes. To our knowledge, this paper is the first to discretize the Guvenen et al. (2021) process and evaluate its implications in an incomplete markets model. Furthermore, representing the Arellano et al. (2017) and Guvenen et al. (2021) processes as discrete Markov chains allows for a consistent comparison between the two processes. We find the largest source of risk in the Guvenen et al. (2021) process comes from the probability of non-employment, which is a highly persistent state with rising persistence over the life-cycle. In contrast, most risk in the Arellano et al. (2017) process comes from the highest earnings state, which features a considerable probability of earnings loss next period, especially at younger ages, creating a strong precautionary savings motive among high earners in the model. These differences between earnings processes result in different dynamics in our life-cycle model. The risk of non-employment in the Guvenen et al. (2021) process generates a life-cycle profile for MPC's that is flatter than that of the Arellano et al. (2017) process, and results in a higher welfare cost of risk (0.69 instead of 0.19 in the model with Arellano et al. (2017)). The Arellano et al. (2017) process features more earnings inequality and consequently provides a better fit to the wealth distribution than the Guvenen et al. (2021) process.

The paper proceeds as follows. The next subsection discusses the related literature. Section 2 discusses our discretization method and theoretical contributions. Section 3 presents the asset pricing model with stochastic volatility. Section 4 discusses the life-cycle model applications. Section 5 concludes. Appendix A provides the proof of our Main Theorem. Appendix B provides details on the estimation of the HMM. Appendix C provides an additional application to the discretization of vector autoregressive processes.

Related literature. Several methods have been proposed to discretize stochastic processes. Most of these, such as Tauchen (1986), Rouwenhorst (1995), Tauchen and Hussey (1991), Duan and Simonato (2001), Terry and Knotek II (2011), and Gospodinov and Lkhagvasuren (2014), are designed for specific (linear) processes, such as AR(1) or VAR processes. Fella, Gallipoli, and Pan (2019) adapt the methods of Rouwenhorst (1995), Tauchen and Hussey (1991) and Adda and Cooper (2003) to processes with a life-cycle component, and analyze how they perform under settings where the innovations are drawn from a mixture of normals. Galindev and Lkhagvasuren (2010) adapt Rouwenhorst (1995) to a setting with highly-persistent correlated AR(1) shocks. Civale, Díez-Catalán, and Fazilet (2016) adapt the Tauchen (1986) method to accommodate autoregressive processes with innovations drawn from a normal mixture.
Unlike these methods, our method is applicable to any process, and provides both an optimal grid and transition probability matrix, while these methods typically take a grid as input, and/or assume equal-distant or equal-quantile grids.

Some discretization methods are applicable to a larger class of stochastic processes. Binning methods as in Judd (1998) and Adda and Cooper (2003), that discretize via a partition of the quantile space, are applicable to any stochastic process. However, binning methods only match one-step ahead transitions between bins and take the grid spacing as an input, while our discretization method looks at the full dynamics and provides an optimal grid. Farmer and Toda (2017) propose a method to refine discrete approximations by moment matching. Their method takes as inputs a grid, an initial transition probability matrix, and a set of moments to match, where the goal is to match these moments exactly, if possible, with a transition matrix that is close, measured through relative entropy, to the initial approximation. Our method, in contrast, can be seen as a full-information discretization method, rather than moment-matching, that does not rely on prior information (i.e., an initial discretization) to obtain identification.

For multivariate processes, most existing methods rely on tensor grids, which leads to a curse of dimensionality and is computationally unattractive. As stated by Gordon (2021), tensor grids are inefficient, because many of the grid points will rarely be visited. Gordon (2021) proposes the use of pruning and sparse grids for VAR models. Our method results in optimal grids that limit the curse-of-dimensionality issue when the variables are correlated, and is applicable to any type of process.

Our results relate to the literature on misspecified models (Gourieroux, Monfort, and Trognon, 1984; White, 1982), and, specifically, misspecified Hidden Markov Models (Douc and Moulines, 2012; Mevel and Finesso, 2004). The use of HMM's is prevalent in economics and machine learning,5 but, to our knowledge, the application of HMM's as a finite-state Markov chain approximation method for continuous stochastic processes is novel, as is our theoretical result on the ability of HMM's to approximate such processes.6 In the signal processing literature, Vidyasagar (2005), Finesso, Grassi, and Spreij (2010), and others, consider the problem of representing discrete state-space stationary processes as HMM's, but their results do not extend to continuous stochastic processes.

2 Discretization using a Hidden Markov Model

Let $y_{it} \in \mathbb{R}^k$, $i = 1,\dots,N$, $t = 1,\dots,T$, denote a random variable for which the data generating process is a discrete-time continuous-support Markov process. Denote its probability distribution by $f(\mathbf{y})$.
The objective is to approximate the distribution of $\mathbf{y}$ by a misspecified model, with probability distribution $p(\mathbf{y};\theta)$, by choosing parameter vector $\theta$ such that the relative entropy, also known as the information loss, between the approximating distribution and the true distribution is minimized. Minimizing information loss, which can be measured through the Kullback-Leibler (KL) divergence, is a common way to think about misspecified models and their consistency.

5 The interpretation of a HMM as a dimension reduction method for dependent data is common in the statistics and machine-learning literature (McLachlan, Lee, and Rathnayake, 2019), where a common application of HMM's is text processing. Applications of HMM's in econometrics include the detection of structural breaks (Song, 2014) and modeling of regime switches (starting with Quandt (1958), Goldfeld and Quandt (1973), and Hamilton (1990)). HMM's have also been used to approximate the dynamics of the latent state in non-linear state space models for the purpose of estimation, as in Kitagawa (1987), Langrock (2011), and Farmer (2021).
6 This is an intuition Lehéricy (2021) refers to but does not prove.

More precisely, let the relative entropy be defined as the logarithmic difference between the distributions $f(\mathbf{y})$ and $p(\mathbf{y};\theta)$, where the expectation is taken using the distribution $f(\mathbf{y})$, also known as the Kullback–Leibler (KL) divergence:

$$D_{KL}(f(\mathbf{y}) \,\|\, p(\mathbf{y};\theta)) = \int f(\mathbf{y}) \log \frac{f(\mathbf{y})}{p(\mathbf{y};\theta)} \, d\mathbf{y}. \tag{1}$$

Minimizing the KL divergence with respect to parameter vector $\theta$ requires taking the derivative of Equation (1) with respect to $\theta$ and setting it to zero:

$$\int \nabla_\theta \log p(\mathbf{y};\theta) \, f(\mathbf{y}) \, d\mathbf{y} = 0 \iff E_f\left[\nabla_\theta \log(p(\mathbf{y};\theta))\right] = 0.$$

Typically, $E_f(\cdot)$ is hard to evaluate, and can be replaced by an estimate, by simulating data from $f(\mathbf{y})$, and evaluating $\nabla_\theta \log(p(\cdot;\theta))$ in the simulated data. This is similar to a quasi-maximum likelihood approach, estimating a misspecified model using maximum likelihood estimation (Gourieroux et al., 1984; White, 1982).

2.1 Hidden Markov Model

As our approximating model, we propose using the following Hidden Markov Model. Denote the latent state by $x_{i,t}$, which lies in a finite discrete set $\{1,2,\dots,m\}$, evolving according to a first-order Markov process:7

$$y_{i,t} \mid x_{i,t} = \mu_t(x_{i,t}) + \mathrm{diag}(\sigma_t)\,\varepsilon_{i,t}, \qquad \varepsilon_{i,t} \sim N(0, I_k) \tag{2}$$

$$x_{i,t+1} \mid x_{i,t} \sim \Pi_{ij,t}. \tag{3}$$

The transition matrix $\Pi_t$ has stationary distribution $\delta_t = (\delta_{1,t}, \delta_{2,t}, \dots, \delta_{m,t})$. Parameter vector $\theta$ in Equation (1) thus consists of:

(i) the parameters in transition probability matrix $\Pi_t$, denoted by $\Pi_{ij,t}$. In the case that there is no time dependence, that is, $\Pi_t = \Pi$ for all $t = 1,\dots,T$, the number of parameters in $\Pi$ is $m \times m$, of which $m \times (m-1)$ are linearly independent, given that each row sums to one.

(ii) the grid $\mu_t$. When there is no time dependence, $\mu_t = \mu$ is an $m \times k$ matrix.

(iii) the variance of the error term $\sigma_t^2$. In the case that there is no time dependence, $\sigma_t^2 = \sigma^2$. If $y_{i,t} \in \mathbb{R}^k$ has $k > 1$, the variance is the diagonal matrix $\mathrm{diag}(\sigma_{t,1}, \dots, \sigma_{t,k})$.

These parameters $\theta = (\mu, \Pi, \sigma)$ result in a discretization of the process $f(\mathbf{y})$, where $\mu$ is the grid of the discretized process, and $\Pi$ governs the transitions between the $m$ states. The intuition behind this HMM is that it provides a time-varying (soft) clustering of the continuous variable $y$ into $m$ discrete states $x$ that each correspond to a grid point $\mu(x)$.

We consider time series settings where $N = 1$, as well as panels with $N \geq 2$. The inclusion of a panel dimension allows for the estimation of parameters that vary with $t$, for example, over the life-cycle.

7 Assuming Gaussianity for $\varepsilon_{i,t}$ is convenient, because we will be using the EM algorithm to estimate $\theta$, and, for Gaussian errors, the M step has a closed-form solution. In addition, the assumption of Gaussianity is used in our proof below.

2.2 Properties of the KL divergence

Given our objective of minimizing the information loss between the true and approximating process, two questions arise. First, whether there is a consistent estimator of the Hidden Markov model parameters in this setting. In the case of misspecified models, consistency is defined as whether the estimator converges to the value that minimizes the KL divergence. This has been shown to be true for misspecified Hidden Markov Models by Mevel and Finesso (2004), and later in a more general setting by Douc and Moulines (2012). The second question is whether, with a sufficient number of hidden states (and thus grid points), the information loss between the true and approximating process can be made arbitrarily small. We prove, under a set of assumptions, that the answer to the second question is positive.

The Main Theorem builds on the results of Zeevi and Meir (1997), who show that a mixture distribution with a sufficient number of components can approximate a large class of distribution functions arbitrarily well. We extend this result to the non-i.i.d. setting of continuous support Markov processes.
That is, we show that a Hidden Markov Model in levels (as in Assumption (A5)) can approximate any stationary Markov process satisfying Assumptions (A1)-(A4) arbitrarily well as long as enough hidden states are used for the approximation.

As in Zeevi and Meir (1997), denote

$$\mathcal{F}_{c,\eta} = \{ f \in \mathcal{F}_c \mid f \geq \eta > 0, \ \forall y \in \mathcal{Y} \}, \qquad \text{with } \mathcal{F}_c = \left\{ f \,\middle|\, f \in C_{\mathcal{Y}}, \ f \geq 0, \ \int f = 1 \right\}$$

where $\mathcal{F}_c$ is the class of continuous density functions with compact support $\mathcal{Y} \subset \mathbb{R}^k$ fixed and given. $\mathcal{F}_{c,\eta} \subset \mathcal{F}_c$ is bounded below over $\mathcal{Y}$ by some positive constant, denoted by $\eta$.

We impose the following assumptions on the true process $f(\mathbf{y})$ and approximating model $p(\mathbf{y};\theta)$:8

(A1) $\mathbf{y} = \{y_t\}_{t=1}^{T}$ has a data generating process characterized by $f(\mathbf{y})$, $y_t \in \mathbb{R}^k$, that is first-order Markov and stationary, that is,

$$f(y_t \mid y_{t-1}, \dots, y_1) = f(y_t \mid y_{t-1}),$$

and

$$f(y_{t+s} \mid y_{t+s-1}) = f(y_t \mid y_{t-1}) \quad \forall s \in \mathbb{N}.$$

(A2) $f(y_t \mid y_{t-1}) \in \mathcal{F}_{c,\eta}$.

(A3) $\log f(y_t \mid y_{t-1})$ and $f(y_t \mid y_{t-1})$ are differentiable in $y_{t-1} \in \mathcal{Y}$.

(A4) $\log f(y_t \mid y_{t-1})$ is locally Lipschitz continuous in $y_{t-1} \in \mathcal{Y}$.

(A5) $p(\mathbf{y};\theta_m)$ is characterized by:

$$y_t \mid x_t = \mu_m(x_t) + \mathrm{diag}(\sigma_m)\,\varepsilon_t, \qquad \varepsilon_t \sim N(0, I_k),$$

$$x_{t+1} \mid x_t \sim \Pi_{ij,m}$$

with parameters $\theta_m = (\mu_m, \Pi_m, \sigma_m)$, and $x_t \in \{1,\dots,m\}$ a latent state evolving according to a first-order Markov process with transition probability matrix $\Pi_m$. Denote the conditional distribution by $p(y_t \mid y_{t-1}, \dots, y_1; \theta_m) \in \mathcal{F}_{c,\eta}$.

The first-order Markov assumption (A1) is w.l.o.g., because any (finite) higher-order Markov process can be written as a (multivariate) first-order Markov process. Compared to the set-up of Section 2.1, Assumption (A5) omits time-variation in the parameters $\Pi$, $\mu$ and $\sigma$. Subscripts $m$ are used to indicate the number of states of the HMM ("grid points"), also referred to as the complexity/size of the approximating model.

8 Assumption (A4) is satisfied for some well-known processes. For example, straightforward algebra shows that for an AR(1) process, $f(y_t \mid y_{t-1}) = N(\rho y_{t-1}, \sigma^2)$ is Lipschitz, and $\log f(y_t \mid y_{t-1})$ is locally Lipschitz.

Main Theorem. Under assumptions (A1)-(A5), given a sufficiently large number of grid points $m$, there exists a set of grid points $\mu_m \in \mathcal{Y}$, variance $\sigma_m \geq \tau > 0$ and transition probability matrix $\Pi_m$, collected in $\theta_m = (\mu_m, \Pi_m, \sigma_m)$ such that the KL divergence between $f(\mathbf{y})$ and $p(\mathbf{y};\theta)$ on the compact subset $y \in \mathcal{Y} \subset \mathbb{R}^k$, given by

$$D_{KL}(f(\mathbf{y}) \,\|\, p(\mathbf{y};\theta)) = \int_{\mathcal{Y}} f(\mathbf{y}) \log \frac{f(\mathbf{y})}{p(\mathbf{y};\theta)} \, d\mathbf{y},$$

can be made arbitrarily small.

The full proof is given in Appendix A.

Sketch of proof. The first step of the proof consists of showing that the conditional distributions of our Hidden Markov Model, denoted by $p(y_t \mid y_{t-1}, \dots, y_1; \theta_m)$, are Gaussian mixtures, whose mixture weights converge to a row of the transition probability matrix $\Pi_m$ as $m$ becomes large and the filter $p(x_t \mid y_t, \dots, y_1; \theta_m)$ becomes better, such that $p(y_t \mid y_{t-1}, \dots, y_1; \theta_m)$ converges to $p^0(y_t \mid y_{t-1}; \theta_m) := \sum_{j=1}^{m} \Pi_{lj}\,\phi_j(y_t)$, where $\mu_m(l)$ denotes the closest grid point to $y_{t-1}$. The proof then applies the result of Zeevi and Meir (1997) to $m$ conditional distributions at the same time, that is, to $f(y_t \mid y_{t-1} = \mu_m(i))$ and $p^0(y_t \mid y_{t-1} = \mu_m(i); \theta_m)$, conditioning on $y_{t-1}$ being equal to one of $m$ grid points $\{\mu_m(i)\}_{i=1}^{m}$. This results in an additional term in the KL divergence compared to the Zeevi and Meir (1997) result, because in our setting, these $m$ conditional distributions $f(y_t \mid y_{t-1} = \mu_m(i))$ are approximated by $m$ Gaussian mixtures that all have the same location parameters $\mu_m$, as opposed to being freely chosen. However, we do have enough degrees of freedom for $m$ different sets of convex mixture weights, because the transition probability matrix has $m \times m$ elements. This is summarized in Lemmas 4 and 5 in the Appendix.

The rest of the proof consists of three parts. First, we show that the additional term in the KL divergence becomes arbitrarily small when $m$ is large. Second, we show that when the KL divergences of these specific $m$ conditional distributions are small, the KL divergences for all other potential realizations of $\{y_{t-k}\}_{k=1}^{t-1}$ within the compact set $\mathcal{Y}$ are also small. This follows because: (i) the true process is assumed stationary and Markovian, (ii) the local Lipschitz assumption ensures that the KL divergences on the compact set are well behaved and bounded, and (iii) $p(y_t \mid y_{t-1}, \dots, y_1; \theta_m)$ is Lipschitz in $y_{t-1}, \dots, y_1$ and as $m$ increases, the filter $p(x_t \mid y_t, \dots, y_1; \theta_m)$ becomes better, and $p(y_t \mid y_{t-1}, \dots, y_1; \theta_m)$ becomes approximately forgetting beyond $t-1$. Finally, we show that the KL divergence between $f(\mathbf{y})$ and $p(\mathbf{y};\theta)$ can be written as a function of the KL divergences between all conditional distributions, which concludes the proof.
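The first step of the proof sketch can be made concrete with a short numerical illustration. The following is a minimal sketch, not the authors' code, of the standard scaled forward recursion for a univariate Gaussian HMM without time dependence. It returns the log-likelihood and the one-step-ahead mixture weights, so that the conditional density of the HMM is the Gaussian mixture with those weights over the grid points. The function name `hmm_forward` and all demo values (grid, transition matrix, error standard deviation) are hypothetical choices made for illustration only.

```python
import numpy as np


def hmm_forward(y, mu, Pi, sigma, delta):
    """Scaled forward recursion for a univariate Gaussian HMM (illustrative).

    Returns the log-likelihood of y and, for each t, the predictive weights
    P(x_t = j | y_1, ..., y_{t-1}); the conditional density of y_t given the
    past is the Gaussian mixture with these weights and means mu.
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    Pi = np.asarray(Pi, dtype=float)
    pred = np.asarray(delta, dtype=float).copy()  # weights for the first observation
    T, m = y.size, mu.size
    # Gaussian emission densities phi_j(y_t), shape (T, m)
    dens = np.exp(-0.5 * ((y[:, None] - mu[None, :]) / sigma) ** 2)
    dens /= sigma * np.sqrt(2.0 * np.pi)
    weights = np.empty((T, m))
    loglik = 0.0
    for t in range(T):
        weights[t] = pred
        joint = pred * dens[t]      # P(x_t = j, y_t | past)
        c = joint.sum()             # conditional density of y_t given the past
        loglik += np.log(c)
        pred = (joint / c) @ Pi     # filter at t, pushed one step through Pi
    return loglik, weights


# Hypothetical demo: AR(1) data with rho = 0.5 and a hand-picked 3-point HMM.
rng = np.random.default_rng(0)
y_demo = np.zeros(200)
for t in range(1, 200):
    y_demo[t] = 0.5 * y_demo[t - 1] + rng.standard_normal()
mu_demo = np.array([-1.5, 0.0, 1.5])
Pi_demo = np.array([[0.90, 0.10, 0.00],
                    [0.05, 0.90, 0.05],
                    [0.00, 0.10, 0.90]])
delta_demo = np.array([0.25, 0.50, 0.25])  # stationary distribution of Pi_demo
loglik_demo, weights_demo = hmm_forward(y_demo, mu_demo, Pi_demo, 0.8, delta_demo)
```

Each row of `weights_demo` is a convex combination of the rows of `Pi_demo`, weighted by the filter from the previous period. As the filter concentrates on a single state, the weights approach the corresponding row of the transition matrix, which is the limit described in the sketch of proof.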

Estimation. The results of Mevel and Finesso (2004) allow us to extend our theorem by the following insight. Given that the Maximum Likelihood Estimator (MLE) of a misspecified HMM is consistent, we know it minimizes the KL divergence for a given grid size $m$. Therefore, we can use the Expectation-Maximization (EM) algorithm to obtain our grid points and transition probability matrix.9 We describe the estimation procedure in Appendix Section B.

9 This requires additional assumptions on the true stochastic process, including geometric ergodicity, and uniformly bounded moments of sufficiently high order.

Number of grid points. When selecting the number of grid points $m$, one faces a trade-off between parsimoniousness for computational efficiency and accuracy of the approximation. In theory, the discretized process becomes arbitrarily accurate as the dimension of the grid goes to infinity. In practice, the grid must always have a finite dimension. One advantage of full-information discretization is that we can assess the fit of the approximating model with a finite number of grid points, as this fit is quantified by the KL divergence. We propose using a scree plot with the KL-divergence on the $y$-axis, and the number of grid points on the $x$-axis, as visualized in Figure 1 for three different parameterizations of an AR(1) process. This allows a practitioner to visualize the gain in approximation accuracy from adding an additional grid point.

Figure 1: KL-divergence of the approximating model (in Equations (2)-(3)) versus the true process, where the true process is an AR(1) process $y_t = \rho y_{t-1} + \varepsilon_t$, $\varepsilon_t \sim N(0,1)$, for three values of $\rho$. $m$ is the number of grid points used for the discretization.

Although the Main Theorem does not state the rate of convergence (that is, the number of grid points needed to achieve a particular information loss), the proof does provide insights on what properties of the true process matter for how many grid points are needed to obtain a particular precision. This will, among other things, depend on the local Lipschitz coefficient of $\log f(y_t \mid y_{t-1})$ and $f(y_t \mid y_{t-1})$, as well as the size of the compact set $\mathcal{Y}$. One property that affects the number of grid points is the persistence of the stochastic process. For an AR(1) process, one can show these Lipschitz coefficients as well as the unconditional variance are increasing in persistence. Consequently, the more persistent a stochastic process, the more grid points are needed to achieve the same information loss. Figure 1 shows how the KL-divergence of our HMM approximating an AR(1) process approaches zero when increasing the number of grid points $m$, but does so more slowly when the AR(1) process is more persistent. As such, our results shed some light on why discretizing highly persistent AR(1) processes poses a challenge, as discussed in Flodén (2008), Galindev and Lkhagvasuren (2010), and Kopecky and Suen (2010).

2.3 Imposing structure through restrictions

One can impose additional structure on the discretized process by estimating the process under a set of restrictions. For example, one might prefer a discretization that matches certain conditional or unconditional moments of the stochastic process, or reflects the symmetry in the underlying stochastic process. In our EM estimation procedure, this can be done by modifying the M step.

For symmetric processes, a symmetry restriction can be imposed on $\mu$. In case of a process that is symmetric around zero and an odd number of grid points $m$, this means that:

$$\mu(\lceil m/2 \rceil) = 0, \quad \text{and} \quad \mu(\lceil m/2 \rceil - r) = -\mu(\lceil m/2 \rceil + r), \quad \text{for } r = 1, \dots, \lfloor m/2 \rfloor \tag{4}$$

Similarly, a process can also be symmetric in its dynamics, as reflected by the transition probability matrix. In that case, the restriction takes the form

$$\Pi_{i,j} = \Pi_{(m+1-i),(m+1-j)}. \tag{5}$$

For the specific restrictions in Equations (4)-(5), a closed-form solution is available for the M step.
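To make the restrictions in Equations (4)-(5) concrete, the snippet below is a minimal sketch that takes an unrestricted estimate of the grid and transition matrix and averages it with its mirror image, so that the result satisfies both restrictions exactly. This averaging step is an illustrative assumption and is not claimed to be the closed-form M step referred to in the text. The helper name `symmetrize` and the input values are hypothetical.

```python
import numpy as np


def symmetrize(mu, Pi):
    """Average an estimate with its mirror image to satisfy Eqs. (4)-(5).

    Illustrative only: this is a simple projection, not necessarily the
    restricted M step of the EM procedure.
    """
    mu = np.asarray(mu, dtype=float)
    Pi = np.asarray(Pi, dtype=float)
    # Eq. (4): mu(i) = -mu(m+1-i); for odd m the middle grid point becomes 0.
    mu_sym = 0.5 * (mu - mu[::-1])
    # Eq. (5): Pi[i, j] = Pi[m+1-i, m+1-j]; each row still sums to one.
    Pi_sym = 0.5 * (Pi + Pi[::-1, ::-1])
    return mu_sym, Pi_sym


# Hypothetical unrestricted estimates for m = 5 grid points.
mu_hat = np.array([-1.1, -0.4, 0.1, 0.5, 0.9])
rng = np.random.default_rng(1)
Pi_hat = rng.random((5, 5))
Pi_hat /= Pi_hat.sum(axis=1, keepdims=True)  # normalize rows to sum to one

mu_sym, Pi_sym = symmetrize(mu_hat, Pi_hat)
```

With these inputs the symmetrized grid is approximately (-1.0, -0.45, 0, 0.45, 1.0), and the rows of `Pi_sym` remain valid probability vectors because each is an average of two rows that sum to one.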
In other cases, one may want to introduce restrictions through penalty terms rather than hard restrictions. For example, one may want the discretized process to target certain moments. Denote a certain set of moment functions of the discretized process by $\mathcal{M}(p(\mathbf{y};\theta))$ and the moments of the continuous process by $\mathcal{M}(f(\mathbf{y}))$. In that case, instead of maximizing the log-likelihood of the simulated data $\mathbf{y}^{sim}$, maximize:

$$\log(\mathcal{L}(\theta \mid \mathbf{y}, \mathbf{x})) - \lambda\,\mathcal{D}\big(\mathcal{M}(f(\mathbf{y})), \mathcal{M}(p(\mathbf{y};\theta))\big) \tag{6}$$

where $\lambda \in \mathbb{R}_{+}$ is a scalar parameter and $\mathcal{D}(\cdot,\cdot)$ a distance measure of choice. $\lambda$ is chosen by the researcher. A higher $\lambda$ should be chosen if the researcher considers it more important that the discretization matches the moments $\mathcal{M}$. When using this penalty term, the M step is no longer analytically tractable and numerical optimization is necessary.

3 Application I: Asset Pricing Model with Stochastic Volatility

In this section, we evaluate the performance of our method in an asset pricing model where dividend growth features stochastic volatility. Most models that involve solving a dynamic stochastic optimization problem with a continuous-support process do not have a closed-form solution. As shown by De Groot (2015), however, the model we present below does have a closed-form solution for the price-dividend ratio and the conditional expected return on equity. The existence of an analytical solution gives us a benchmark against which to compare the model solved with our method and with other discretization methods.

First, we present the analytically tractable asset pricing model of De Groot (2015). Next, we demonstrate how to discretize the AR(1)-SV process using our method and two other methods, and analyze their respective performance at capturing various moments of the stochastic process. Finally, we assess how the numerical solution corresponding to each method differs relative to the analytical benchmark solution.

3.1 Analytically tractable asset pricing model with AR(1)-SV dividend growth

We use the Lucas tree asset pricing model of De Groot (2015). A representative agent maximizes the expected discounted stream of utility:

$$E_0 \sum_{t=0}^{\infty} \beta^t \frac{c_t^{1-\gamma}}{1-\gamma} \qquad \text{s.t.} \quad c_t + s_{t+1} p_t \leq (d_t + p_t)\, s_t,$$

where $c_t$ is consumption, and $s_t$ is an asset with price $p_t$ and dividends $d_t$. Parameter $\beta \in (0,1)$ denotes the discount factor and $\gamma$ is the coefficient of relative risk aversion.

The growth rate of dividends $y_t = \ln(d_t/d_{t-1})$ is assumed to follow an AR(1) process with stochastic volatility:[10]

$$y_t = \bar{y} + \rho\,(y_{t-1} - \bar{y}) + \sqrt{\eta_t}\,\varepsilon_t \tag{7}$$

$$\eta_t = \bar{\eta} + \rho_\eta\,(\eta_{t-1} - \bar{\eta}) + \omega\,\varepsilon_{\eta,t}, \tag{8}$$

with persistence in levels $\rho \in (-1,1)$, and $\varepsilon_t$ is i.i.d. $N(0,1)$. The random variable $\eta_t$ is the time-varying conditional variance of dividend growth. Parameter $\rho_\eta \in (-1,1)$ is the persistence of the stochastic volatility process, and $\varepsilon_{\eta,t}$ is also i.i.d. $N(0,1)$.

[10] Note that for this specification of the AR(1)-SV process, $\eta_t$ can become negative, in which case $\sqrt{\eta_t}$ is imaginary. In the parametrization we use, taken from Bansal and Yaron (2004), the probability of a negative value for $\eta$ is very small, and in our long sample of simulations, it does not occur.

Market clearing, $s_t = 1$, implies that $c_t = d_t$. Defining the price-dividend ratio as $v_t := p_t/d_t$, the first-order condition of the representative agent's maximization problem is given by:

$$v_t = \mathbb{E}_t\left[\beta \left(\frac{d_{t+1}}{d_t}\right)^{1-\gamma} (v_{t+1} + 1)\right]. \tag{9}$$

De Groot (2015) derives a closed-form solution for the price-dividend ratio $v_t$ and the conditional expected return on equity, which is defined as:

$$\mathbb{E}_t R^e_{t+1} = \mathbb{E}_t\left[\frac{d_{t+1} + p_{t+1}}{p_t}\right]. \tag{10}$$

Details on the analytical solution of De Groot (2015) and the discretized solution are provided in Appendix Section D.

We are interested in how well each method captures $\mathbb{E}_t R^e_{t+1}$, in addition to $v_t$, because of its non-linear dependence on $v_t$, which is itself approximated. The approximation errors will compound in a non-trivial way, and we are interested in how accurate the discretization methods are when these errors accumulate.

Another object economists care about is the welfare cost of risk. In this application, we measure this using the certainty equivalent consumption (CE). Define

$$V(d) = u(d) + \beta\, \mathbb{E}[V(d') \mid d],$$

where $V(d)$ is the value to the household of being in state $d$, and $d$ is the level of aggregate dividends. $V(d)$ reflects the present discounted value of the risky dividend (i.e., consumption) stream. One could ask what the certainty equivalent level of consumption is that would make the household indifferent between the risky consumption stream and a certain (constant) level of consumption. We denote that constant value by $x(d)$, which is the solution to:

$$V(d) = \frac{u(x(d))}{1-\beta}.$$

We solve for $x(1)$ numerically by simulation using the true stochastic process for dividend growth and the discretized processes. Lower values of $x$ indicate a higher willingness to pay, so to the extent the discretizations fail to capture risk, they will overstate $x$ relative to the true value.

Calibration. The parametrization used for the results in the tables below is based on the estimates of the stochastic volatility process in Bansal and Yaron (2004), annualized as in De Groot (2015), that is, $\gamma = 1.5$, $\rho_\eta = 0.855$, $\omega = 7.4 \times 10^{-5}$, $\bar{\eta} = 0.0012$, $\beta = 0.95$, $\rho = 0.868$, $\bar{y} = 0.0179$. We choose risk aversion $\gamma$ and the discount factor $\beta$ such that the price-dividend ratio is finite and stable.[11]

3.2 Discretizing the AR(1)-SV process of De Groot (2015)

The process of Equations (7)-(8) is multivariate, which is why we discretize over both the levels $y_t$ and variances $\eta_t$. We compare our discretization method with the method of Farmer and Toda and the binning method of Adda and Cooper (2003), both using their standard configurations.[12] Both methods use a tensor grid for multivariate processes.

Figure 2 visualizes the KL divergence of our discretization for different choices of grid size $m$ relative to the true AR(1)-SV process. The figure also visualizes the KL divergences of the two other discretization methods. The likelihoods for these methods are computed by interpreting the transition probability matrix and grid of the different discretization methods as parameters $\Pi$ and $\mu$ in our HMM framework, re-estimating the variance of the approximation error. For the discretization methods that rely on tensor grids, we use a three-grid-point discretization for $\eta_t$ and vary the number of grid points for $y_t$ from three to eleven.
The figure shows our method is more parsimonious; to capture the same amount of information as we do with 15 grid points, the Farmer and Toda method needs 27 grid points, and the binning method needs more than 33. This is due both to our method being a full-information method and to our method not relying on tensor grids but rather using an optimally chosen grid.

Figure 2: KL-divergence of the approximating model likelihood versus the likelihood of the true process for the AR(1)-SV process in Equations (7)-(8), for different discretization methods and different grid sizes $m$.
Notes: We only visualize a selected number of grid points, because the other methods rely on a tensor grid, and cannot be computed for all choices of $m$. For those methods, we fix the dimension of $\eta$ at three, and vary the dimension of $y$, and $m$ is the product of both dimensions.

Figure 3 illustrates how our method optimally chooses the grid points for a multivariate process like the AR(1)-SV process.[13] Figure 3(a) shows the optimal grid for $m = 9$ grid points, and shows how our method assigns the grid points in the tails to have higher variances than in the center. This is consistent with the intuition behind an AR(1)-SV process, as it is more likely that a high value of $y_t$ is accompanied by a high realization of the variance $\eta_t$. As $m$ becomes larger, our optimal grid adds what we call 'double' or 'triple' states. These are grid points with similar levels for $y$, but different values for the variance $\eta$. These grid points will have different dynamics to next period's state despite having the same level of $y$, as will be reflected by differences in the rows of the transition probability matrix for these states.

Table 1 computes several statistics to compare the performance of our method and the existing methods at capturing moments of $y$. As can be seen, the Farmer and Toda (2017) method does well at the mean and variance, as these are the moments they target, while we do well at higher order moments such as the skewness and kurtosis. The Mean Squared Forecast Error (MSFE) of the other methods is 30-40% larger than ours, which supports the claim that our method gives an agent a better process to make forecasts with.[14]

[11] De Groot (2015) provides parameter restrictions such that the price-dividend ratio is finite, see Appendix D.

[12] We use the codes provided on the personal website of A. A. Toda, available at https://alexisakira.github.io/discretization/, for the implementation of the Farmer and Toda method. We adapt the Farmer and Toda method for this specification of an AR(1)-SV, set to match the first two conditional moments in each grid point.

[13] In Appendix C, we show the optimal grids for Vector Autoregressions, another multivariate process.
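The discretization methods compared in this section all start from the AR(1)-SV process in Equations (7)-(8), and our method in particular only needs a simulated sample from it. The following is a minimal simulation sketch, not the authors' code. The function name `simulate_ar1_sv`, the initialization at the unconditional means, and the clamp of $\eta_t$ at zero before taking the square root are assumptions made for illustration; the default parameter values follow the calibration reported in Section 3.1, with 0.868 read as the persistence in levels and 0.855 as the persistence of volatility.

```python
import numpy as np


def simulate_ar1_sv(T, y_bar=0.0179, rho=0.868, eta_bar=0.0012,
                    rho_eta=0.855, omega=7.4e-5, seed=0):
    """Simulate T periods of the AR(1)-SV process in Equations (7)-(8).

    Returns the level series y and the conditional-variance series eta.
    """
    rng = np.random.default_rng(seed)
    y = np.empty(T)
    eta = np.empty(T)
    # Assumption: start both series at their unconditional means.
    y[0], eta[0] = y_bar, eta_bar
    for t in range(1, T):
        # Equation (8): AR(1) law of motion for the conditional variance.
        eta[t] = eta_bar + rho_eta * (eta[t - 1] - eta_bar) \
            + omega * rng.standard_normal()
        # Assumption: clamp at zero so the square root stays real
        # (footnote 10 notes negative values are very unlikely here).
        vol = np.sqrt(max(eta[t], 0.0))
        # Equation (7): AR(1) in levels with time-varying volatility.
        y[t] = y_bar + rho * (y[t - 1] - y_bar) + vol * rng.standard_normal()
    return y, eta


if __name__ == "__main__":
    y_sim, eta_sim = simulate_ar1_sv(200_000)
    print(y_sim.mean(), y_sim.var(), eta_sim.min())
```

A long simulated sample of this kind is the only input a simulation-based discretization needs; the pair `(y_sim, eta_sim)` plays the role of the bivariate series that is discretized jointly.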
[14] The mean squared forecast error (MSFE) of the approximating model measures the one-step-ahead forecasting error that the agent makes. For this statistic, we assume that an agent assigns the grid point closest to the current realization of $y_t$ for forecasting $y_{t+1}$. Define $\mathrm{MSFE} = \frac{1}{T}\sum_{t=1}^{T}(y_t - \hat{y}_t)^2$, where $\hat{y}_t = \sum_j \Pi_{ij}\cdot\mu(x_t = j)$ and $i = \arg\min_{i \in \{1,\dots,m\}} |y_{t-1} - \mu(x_{t-1} = i)|$.

Figure 3: Visualisation of the optimal grid for grid sizes $m = (9, 18)$ compared to a tensor grid, for the AR(1)-SV process as in Equations (7)-(8). Panels: (a) $m = 9$; (b) $m = 18$; (c) Tensor: $m_y = 3$, $m_\eta = 3$; (d) Tensor: $m_y = 6$, $m_\eta = 3$.
Notes: The grids for the AR(1)-SV process are two-dimensional. The $y$-axis depicts the variance, while the positioning on the $x$-axis of the diamonds depicts the level of $y$.

Table 1: Comparison for an AR(1) process with stochastic volatility as in Equations (7)-(8) (based on a simulation of $T = 200{,}000$) parametrized as in Bansal and Yaron (2004). $m = 15$ ($m_y = 5$, $m_\eta = 3$ for Farmer-Toda and binning).

Statistic                        Janssens-McCrary   Farmer-Toda   Binning
Dev. uncond. mean y              0.018              0.018         0.018
% dev. uncond. variance y        −15.4              0.525         −23.7
Abs. dev. uncond. skewness y     <0.001             0.044         0.023
% dev. uncond. kurtosis y        −9.42              −19.2         −37.5
% dev. autocorrelation y         3.35               −0.053        −5.66
% abs. dev. cond. mean y         0.003              0.004         0.002
% abs. dev. cond. variance y     33.3               26.3          26.0
Abs. dev. cond. skewness y       1.16               0.580         0.543
% abs. dev. cond. kurtosis y     51.3               81.1          18.5
MSFE y                           0.0013             0.0018        0.0017

3.3 Accuracy of the model solutions

To compare the relative performance of our method versus existing methods at solving the asset pricing model, we compute moments of the discrete solutions ($\hat{M}$) and the analytical benchmark ($M$). To assess the accuracy of the different solutions, we compute the following summary statistic:

$$\log_{10}\big(|\hat{M}/M - 1|\big).$$

Lower values of $\log_{10}(|\hat{M}/M - 1|)$ indicate the moments of the discrete model are closer to those of the benchmark. The results of this analysis are summarized in Table 2. Overall, our method always performs best at the mean, and, almost always, at the variance of both statistics. The differences in accuracy between discretization methods for the mean expected return on equity are larger than the differences in accuracy for the mean price-dividend ratio, because of the accumulation of approximation errors through a non-linear transformation.

Table 2: Accuracy of asset pricing model solutions for the price-dividend ratio $v_t$ and the conditional expected return on equity $\mathbb{E}_t R^e_{t+1}$. Entries in the method columns are $\log_{10}(|\hat{M}/M - 1|)$.

Moment                     M       Janssens-McCrary   Farmer-Toda   Binning
                                   m = 9              m = 3x3       m = 3x3
Mean v_t                   18.10   −1.67*             −1.51         −1.13
Variance v_t               9.61    −1.33*             −0.29         −0.07
Mean E_t(R^e_{t+1})        1.08    −3.10*             −2.37         −2.67
Variance E_t(R^e_{t+1})    0.01    −0.64*             −0.49         −0.28
                                   m = 15             m = 5x3       m = 5x3
Mean v_t                   18.10   −2.77*             −2.23         −1.29
Variance v_t               9.61    −0.65              −2.33*        −0.19
Mean E_t(R^e_{t+1})        1.08    −4.51*             −2.44         −2.73
Variance E_t(R^e_{t+1})    0.01    −0.64*             −0.49         −0.28

Notes: Comparison of the mean and variance of a simulated time series from the discretized model solutions (denoted by $\hat{M}$) and the analytical closed-form model solution (denoted by $M$), where the relative accuracy of the solution for moment $M$ is measured by $\log_{10}(|\hat{M}/M - 1|)$. The lower (more negative) this value is, the closer this moment of the simulated time series of the discrete model solution is to the moment of the time series from the exact model solution. Lowest values are marked with an asterisk.
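The two accuracy statistics used in this section, the relative-accuracy measure $\log_{10}(|\hat{M}/M - 1|)$ and the MSFE of footnote 14, can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, `grid` is assumed to hold the $y$-level $\mu(x = j)$ of each state, `P` the transition probability matrix $\Pi$, and the average is taken over the $T-1$ periods for which a lagged observation exists.

```python
import numpy as np


def log10_rel_error(m_hat, m):
    """Accuracy statistic log10(|M_hat / M - 1|); more negative is better."""
    return np.log10(np.abs(np.asarray(m_hat, dtype=float) / m - 1.0))


def msfe(y, grid, P):
    """One-step-ahead mean squared forecast error of a discretization.

    y    : simulated series from the true process, shape (T,)
    grid : y-level of each of the m states, shape (m,)
    P    : m x m transition probability matrix (rows sum to one)
    """
    y = np.asarray(y, dtype=float)
    grid = np.asarray(grid, dtype=float)
    P = np.asarray(P, dtype=float)
    # The agent assigns the grid point closest to the lagged realization.
    nearest = np.abs(y[:-1, None] - grid[None, :]).argmin(axis=1)
    # Forecast: expected grid value next period given the assigned state.
    forecast = P[nearest] @ grid
    return np.mean((y[1:] - forecast) ** 2)
```

For example, a discretized moment that is 10% above its benchmark value gives `log10_rel_error(1.1, 1.0)` of approximately −1, so each additional unit of decrease corresponds to one more correct decimal digit.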

Table 3: Accuracy of asset pricing model solutions for the certainty equivalent of consumption (CE): true value of CE compared to those following from three different methods. The lower the percentage deviation, the closer the solution of the discretized model is to the truth. Different grid sizes are presented.

CE (true)   Janssens-McCrary   Farmer-Toda   Binning
            % dev              % dev         % dev
            m = 9              m = 3x3       m = 3x3
1.65        0.76%*             8.28%         5.41%
            m = 15             m = 5x3       m = 5x3
1.65        1.93%*             12.22%        3.95%

Notes: Lowest values are marked with an asterisk. Average is taken over 50 simulations of the CE.

In Table 3, we analyze the accuracy of the different methods when computing the CE for two different grid sizes. As follows from Table 3, our method produces the most accurate estimates of the CE, with deviations of 0.8-2% from the truth. The other two methods are at best 4% away from the truth, and at worst 12%, underestimating the amount of consumption the household is willing to give up to remove risk.

4 Application II: Life-cycle Model

In this section, we evaluate the quantitative implications of different discretization methods for consumption, wealth and welfare using an incomplete markets life-cycle model. While simple, this model forms the basis for most of the heterogeneous agent quantitative macro literature. We expect that our results on the importance of accurate discretizations also hold in richer models. In addition, this application demonstrates how our discretization method can be applied to non-linear non-Gaussian processes with life-cycle dynamics where the grids and transition probability matrices are allowed to vary by age, using our adapted algorithm laid out in Appendix Section B.2.

We first discuss the life-cycle model we will use in our analysis. Next, we discuss the different stochastic processes, our performance at discretizing these processes, and what the implications are for the model solutions, using ours and existing methods.
4.1 Model and calibration

We begin by discussing the model environment, followed by the household optimization problem, and the details of the calibration.

Environment. We consider a partial equilibrium life-cycle version of the canonical incomplete-markets model without aggregate uncertainty. Households live up to $T$ periods, where the first $t < T_r$ are spent working, and the remaining periods are spent in retirement. Working households supply one unit of labor inelastically with pre-tax earnings $\mathfrak{e}_t$ that evolve stochastically as described in more detail below. Retired households receive pension $b$ and survive with probability $s_t$ each period. Asset markets are incomplete. Agents can borrow and save using an uncontingent bond, at risk-free interest rate $r$, up to an exogenous borrowing limit $\underline{a}$.

Household problem. At every age, agents choose consumption $c$ and saving $a'$ subject to the budget constraint, which depends on the current state of assets $a$ and earnings $\mathfrak{e}$. During their working life ($t < T_r$), households solve the following optimization problem:

$$V_t(a, \mathfrak{e}) = \max_{c,\,a'} \Big\{ u(c) + \beta\, \mathbb{E}_t V_{t+1}(a', \mathfrak{e}') \Big\} \quad \text{s.t.} \quad c + a' = \tau(\mathfrak{e}) + (1+r)a, \quad a' \ge \underline{a},$$

where earnings satisfy

$$\mathfrak{e}_t = g_t\, y_t.$$

That is, earnings in levels $\mathfrak{e}_t$ are the product of a common deterministic age component $g_t$ and an idiosyncratic stochastic component $y_t$ that evolves according to a (possibly age-dependent) Markov transition matrix $\Pi_t$. The specification for the deterministic component of earnings $g_t$ is taken from Guvenen et al. (2021).

Retired households solve the following problem:

$$V_t(a) = \max_{c,\,a'} \Big\{ u(c) + \beta s_t V_{t+1}(a') \Big\} \quad \text{s.t.} \quad c + a' = b + (1+r)a, \quad a' \ge \underline{a}.$$

Calibration. Agents enter the model at age 25 and work until age $T_r = 65$ (60 for the ABB process), after which they can be retired up to 25 years. If agents reach age $T = T_r + 25$, they die with certainty. The exact year of death after retirement is stochastic, and the survival probabilities are taken from the Social Security Administration actuarial life table. Retirement benefit $b$ is chosen to match the 45% replacement rate of average earnings, which is a good approximation of the system in the United States (Mitchell and Phillips, 2006).

Utility has CRRA form:

$$u(c) = \frac{c^{1-\gamma}}{1-\gamma}.$$

The coefficient of relative risk aversion $\gamma$ is set to 2. The risk-free rate $r$ is 4% and the borrowing limit $\underline{a}$ is 12% of average earnings, which De Nardi, Fella, and Paz-Pardo (2020) find is roughly the ratio of credit card limits to income in the Survey of Consumer Finances. The discount factor $\beta$ is calibrated to match a wealth-to-income ratio of 3.1 for the working age population, and this will be re-calibrated for each process, and for each discretization method.

Following Benabou (2002), the labor income tax function has the form:

$$\tau(y) = (1-\chi)\, y^{1-\mu}. \tag{11}$$

The parameters $\chi$ and $\mu$ govern the level and progressivity of the tax function. Following Krueger and Wu (2021), we set the progressivity parameter to 0.1327, and the level parameter to 0.1575. The calibration is summarized in Table 4.

Table 4: Calibration of the life-cycle model parameters

Parameter          Description                 Value    Motivation
$\gamma$           Risk aversion               2.0      De Nardi et al. (2020)
$b$                Retirement benefits         0.45     Mitchell and Phillips (2006)
$r$                Risk-free interest rate     0.04     De Nardi et al. (2020)
$\underline{a}$    Borrowing limit             -0.12    De Nardi et al. (2020)
$\mu$              Income tax progressivity    0.1327   Krueger and Wu (2021)
$\chi$             Income tax level            0.1575   Krueger and Wu (2021)
W/I                Wealth-to-income ratio      3.1      De Nardi et al. (2020)

Model statistics. When presenting the model solution, we report several statistics, such as correlations and standard deviations of consumption, asset holdings and earnings over the life cycle. In addition, we compute three other statistics. First, we compute the certainty equivalent value (CEV). This is the fraction of lifetime consumption an individual would be willing to give up to live in a world without risk.[15] The CEV is commonly used to evaluate policy experiments, so it is important to know its sensitivity to the discretization method.

[15] Let $c^1$ be the sequence of consumption arising in an economy with risk and $c^0$ be the sequence of consumption without risk. The CEV is defined in terms of welfare $W$ as $W\big((1-CEV)\,c^0\big) = W\big(c^1\big)$ (Wu and Krueger, 2021).

Second, we report the partial insurance to persistent income shocks coefficient as in Blundell, Pistaferri, and Preston (2008):

$$\psi^{P}_{BPP} = 1 - \frac{\operatorname{cov}(\Delta c_{it},\, y_{i,t+1} - y_{i,t-2})}{\operatorname{cov}(\Delta y_{it},\, y_{i,t+1} - y_{i,t-2})}.$$

This statistic measures the extent to which consumption responds to unpredictable persistent changes in income. In practice, it is used to validate the predictions of life-cycle models against data. Third, we use the model solution to compute the Marginal Propensity to Consume out of transitory income shocks (MPC). We compute the MPC as the change in consumption divided by the (unexpected) increase in cash-on-hand. We study MPCs over the life-cycle and across the wealth distribution. MPCs are a common object of interest when studying fiscal policy.

4.2 Discretizing Guvenen, Karahan, Ozkan and Song (2021)

Stochastic process. The first earnings process we consider is the process proposed by Guvenen et al. (2021). This earnings process is given by:[16]

$$
\begin{aligned}
y_t^i &= (1-\nu_t^i)\, e^{(z_t^i + \varepsilon_t^i)} \\
z_t^i &= \rho z_{t-1}^i + \eta_t^i \\
z_0^i &\sim N(0, \sigma_{z_0}) \\
\eta_t^i &\sim \begin{cases} N(\mu_{\eta,1}, \sigma_{\eta,1}) & \text{with prob. } p_z \\ N(\mu_{\eta,2}, \sigma_{\eta,2}) & \text{with prob. } 1-p_z \end{cases} \\
\varepsilon_t^i &\sim \begin{cases} N(\mu_{\varepsilon,1}, \sigma_{\varepsilon,1}) & \text{with prob. } p_\varepsilon \\ N(\mu_{\varepsilon,2}, \sigma_{\varepsilon,2}) & \text{with prob. } 1-p_\varepsilon \end{cases} \\
\nu_t^i &\sim \begin{cases} 0 & \text{with prob. } 1-p_\nu(t, z_t^i) \\ \min\{1, \exp(\lambda)\} & \text{with prob. } p_\nu(t, z_t^i) \end{cases}
\end{aligned} \tag{12}
$$

where $p_\nu$ is given by

$$p_\nu(t, z_t^i) = \frac{e^{\xi_t^i}}{1 + e^{\xi_t^i}}, \quad \text{where } \xi_t^i \equiv a + bt + c z_t^i + d z_t^i t.$$

[16] We leave out the non-stochastic elements of the income level, such as the fixed effect. Following Guvenen et al. (2021), we use the following parametrization: $\rho = 0.959$, $p_z = 0.407$, $\mu_{\eta,1} = -0.085$, $\mu_{\eta,2} = 0.085\,p_z/(1-p_z)$, $\sigma_{\eta,1} = 0.364$, $\sigma_{\eta,2} = 0.069$, $p_\varepsilon = 0.13$, $\mu_{\varepsilon,1} = 0.271$, $\mu_{\varepsilon,2} = -0.271\,p_\varepsilon/(1-p_\varepsilon)$, $\sigma_{\varepsilon,1} = 0.285$, $\sigma_{\varepsilon,2} = 0.037$, $\lambda = 0.0001$. We have $(a, b, c, d) = (-3.353, -0.859, -5.034, -2.895)$.

Here $y_t^i$ is the earnings level of individual $i$ at time $t$, $z_t^i$ is the persistent component of earnings, $\varepsilon_t^i$ is the transitory component and $\nu_t^i$ is a non-employment shock. The process is essentially a persistent-transitory earnings process, where the main features are: (i) the fat-tailed innovations to the persistent and transitory component, and (ii) the non-employment shocks $\nu_t$.

Discretization. For our discretization, we use a multivariate discretization on $\log(y_t^i + 1)$ and $z_t^i$ jointly. We allow the grid and transition probabilities of our discretization to be age-dependent. We use twelve grid points, because, as one can see below in the moment analysis, twelve grid points capture the main features of the process well.

The resulting optimal age-dependent grids are visualized in Figure 4. Figure 4b shows that the grid points have a positive trend in age, capturing the increase in earnings dispersion over the life-cycle. Figure 4a shows that the discretization method generates a grid with multiple non-employment states. Having multiple states with an earnings level of zero generates heterogeneous job-finding probabilities, that is, non-employment states that differ in terms of their persistence. This is visualized in Figure E1 in the Appendix, depicting the age-dependent transition probability matrix. The first three rows represent the zero-earnings states, and by looking at the diagonal, we can see that these states indeed differ in terms of their persistence, and that this persistence changes over the life-cycle. Furthermore, Figure E1 shows how in the Guvenen et al. (2021) process, non-employment becomes highly persistent towards the end of working life.[17]

To the best of our knowledge, our paper is the first to discretize the process in Guvenen et al. (2021). We compare ourselves against a binning method. Standard binning methods, however, do not work for the Guvenen et al. (2021) process, because of the large number of zeros generated by the process.
We adapt Adda and Cooper (2003) by adding a zero earnings state and then use standard binning on the observations $y_{it} > 0$. We allow both the grid and transition probability matrix to vary by age.

[17] It should be noted that Guvenen et al. (2021) do not differentiate between unemployment and non-employment, which explains why these transition probabilities out of the zero-earnings states are different from those we know from the unemployment duration literature.
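The adapted binning benchmark described above (one zero-earnings state plus standard binning on positive earnings) can be sketched for a single pair of adjacent ages as follows. This is a simplified illustration under stated assumptions, not the exact Adda and Cooper (2003) procedure or the authors' code: the function names are hypothetical, bins are equal-probability intervals taken from sample quantiles, each grid point is the sample mean within its bin, and rows with no observations fall back to a uniform distribution.

```python
import numpy as np


def bin_with_zero_state(y, m):
    """Assign observations to m states: state 0 is zero earnings and
    states 1..m-1 are equal-probability bins over positive earnings."""
    y = np.asarray(y, dtype=float)
    positive = y > 0
    # m - 2 interior quantile cut points give m - 1 bins over y > 0.
    cuts = np.quantile(y[positive], np.linspace(0.0, 1.0, m)[1:-1])
    states = np.zeros(y.shape, dtype=int)
    states[positive] = 1 + np.searchsorted(cuts, y[positive], side="right")
    # Grid: zero for state 0, within-bin sample mean otherwise
    # (a bin left empty by ties would produce NaN here).
    grid = np.zeros(m)
    for s in range(1, m):
        grid[s] = y[states == s].mean()
    return states, grid


def transition_matrix(s_now, s_next, m):
    """Estimate transition probabilities by counting observed moves."""
    counts = np.zeros((m, m))
    np.add.at(counts, (s_now, s_next), 1.0)
    # Assumption: states never visited get a uniform row.
    P = np.full((m, m), 1.0 / m)
    seen = counts.sum(axis=1) > 0
    P[seen] = counts[seen] / counts[seen].sum(axis=1, keepdims=True)
    return P
```

Calling `bin_with_zero_state` separately on the simulated cross-section at each age, and `transition_matrix` on each pair of adjacent ages, yields the age-dependent grids and transition matrices. Because the single zero state pools all non-employed observations, such a scheme cannot distinguish non-employment spells by persistence, which is the limitation the comparison in this section highlights.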

Figure 4: Visualisations of the optimal grid of the discretization of the stochastic process in Guvenen et al. (2021) with $m = 12$. Note that panel (b) only shows ten lines, because there are three grid points at zero. Panels: (a) Optimal grid at age $t = 25$; (b) Optimal grid over the life-cycle.

Figure 5 visualizes the unconditional moments of the earnings levels and arc-changes in earnings of the Guvenen et al. (2021) process over the life-cycle, and the extent to which the discretized processes can replicate these moments.[18] As these figures show, our discretization method captures the unconditional moments of the earnings levels well, and does so better than the binning method. The binning method performs similarly to our method at the moments on arc-changes. In Figure 5c, the non-employment dynamics over the life-cycle are visualized for the two different discretizations. Our discretization is able to capture the life-cycle profile of the two- and three-period-ahead conditional non-employment probabilities better than the binning method. The binning method by construction performs well at the one-period-ahead persistence of non-employment, but fails to capture the longer-run non-employment dynamics.

[18] Arc-changes are an important statistic in the Guvenen et al. (2021) paper.

Figure 5: Age-dependent moments, for two different discretizations of the stochastic process by Guvenen et al. (2021). $m = 12$ for both methods. Panels: (a) Unconditional moments of earnings; (b) Unconditional moments of arc changes in earnings; (c) Non-employment dynamics.
Notes: The solid red line represents the Guvenen et al. (2021) process, the solid black line is our discretization method, and the blue dash-dot line is the binning method. Arc changes are defined as $\frac{y_{i,t+1} - y_{i,t}}{(y_{i,t+1} + y_{i,t})/2}$.

4.3 Discretizing Arellano et al. (2017)

Stochastic process. Next, we consider the nonparametric earnings process in Arellano et al. (2017). As in Arellano et al. (2017), let $y_{it}$ be pre-tax labor earnings. Decompose $\log y_{it}$ as follows:

$$\log y_{it} = \eta_{it} + \varepsilon_{it}, \quad i = 1,\dots,N, \quad t = 1,\dots,T,$$

where $\eta_{it}$ denotes the persistent component and $\varepsilon_{it}$ denotes the transitory component. The transitory component is mean zero and is independent over time and from the persistent component. The persistent component $\eta_{it}$ follows a general first-order Markov process, with its $\tau$th conditional quantile given $\eta_{i,t-1}$ by $Q_t(\eta_{i,t-1}, \tau)$ for each $\tau \in (0,1)$, that is, without loss of generality:

$$\eta_{it} = Q_t(\eta_{i,t-1}, u_{it}), \quad (u_{it} \mid \eta_{i,t-1}, \eta_{i,t-2}, \dots) \sim \text{Uniform}(0,1), \quad t = 2,\dots,T.$$
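The quantile representation above also gives a direct way to simulate such a process: draw $u_{it}$ uniformly on $(0,1)$ and evaluate the conditional quantile function. The sketch below illustrates this inverse-transform step only. The function names are hypothetical, and the quantile function used in the example is a linear Gaussian AR(1), chosen purely because its conditional quantile is known in closed form; it is not the flexible specification that Arellano et al. (2017) estimate.

```python
import numpy as np
from statistics import NormalDist


def simulate_quantile_markov(Q, eta0, T, seed=0):
    """Simulate eta_t = Q(t, eta_{t-1}, u_t) with u_t ~ Uniform(0, 1).

    Q is any callable (t, eta_prev, tau) -> conditional tau-quantile.
    """
    rng = np.random.default_rng(seed)
    eta = np.empty(T)
    eta[0] = eta0
    for t in range(1, T):
        # Keep u strictly inside (0, 1) so quantile functions stay finite.
        u = min(max(rng.uniform(), 1e-12), 1.0 - 1e-12)
        eta[t] = Q(t, eta[t - 1], u)
    return eta


def gaussian_ar1_quantile(rho, sigma):
    """Illustrative Q: conditional quantile of a Gaussian AR(1),
    Q(eta_prev, tau) = rho * eta_prev + sigma * Phi^{-1}(tau)."""
    std_normal = NormalDist()

    def Q(t, eta_prev, tau):
        # t is unused here; an age-dependent Q would condition on it.
        return rho * eta_prev + sigma * std_normal.inv_cdf(tau)

    return Q


if __name__ == "__main__":
    path = simulate_quantile_markov(gaussian_ar1_quantile(0.9, 0.1), 0.0, 40)
    print(path[:5])
```

Nonlinear persistence arises when the slope of `Q` in `eta_prev` varies with `tau` and `eta_prev`; in the Gaussian example that slope is the constant `rho`, which is the special case the nonparametric specification relaxes.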

This model allows for nonlinear dynamics of earnings, and in particular, generates nonlinear persistence. Arellano et al. (2017) estimate this model non-parametrically, approximating $Q$ using low-order products of Hermite polynomials and limiting time-dependence to age-dependence, that is,

$$Q_t(\eta_{i,t-1}, \tau) = Q(\eta_{i,t-1}, \text{age}_{it}, \tau).$$

Discretization. Our method only requires a simulated sample from the true stochastic process and, therefore, can be applied to non-parametric processes like Arellano et al. (2017). We focus on the discretization of $\eta_{it}$, because the transitory component $\varepsilon_{it}$ is i.i.d. The simulated values from the stochastic process are noisy, so we follow Arellano et al. (2017) in truncating the simulations at four age-dependent standard deviations around the mean.[19]

We allow the grids and transition probability matrices to vary by age, as visualized in Figure E2 in the Appendix. The grids are more dispersed than those of the Guvenen et al. (2021) process. In addition, while for the Guvenen et al. (2021) process most age-dependence in the transition probabilities is at the low-earnings states, for the Arellano et al. (2017) process this is at the high-earnings states. For example, the highest earnings state becomes more persistent from age 35 onwards.

We compare the performance of our discretization method with the method De Nardi et al. (2020) propose to discretize the Arellano et al. (2017) process.[20] In particular, their method adapts Adda and Cooper (2003) and uses simulation-based binning, adding additional bins in the tails of the process. Their discretization for $\eta_{it}$ uses 18 grid points, and we follow them in this choice. For details we refer to their paper. In what follows below, we refer to this adaptation of binning as "tail-binning".

Figure 6 visualizes the moments of the persistent component $\eta_t$ and first-differences $\Delta\eta_t$ for the Arellano et al. (2017) process, our discretization and the tail-binning discretization.
Our discretization method does a good job at capturing the first four unconditional moments of the levels of $\eta_t$. The tail-binning method misses the gradual increase in skewness and kurtosis over the life cycle, and instead catches up by rapidly increasing around age 45-50. Our method does better at capturing the skewness and excess kurtosis of the first-differences of $\eta_t$, but does still miss out on some of the excess kurtosis the process exhibits.

[19] For the simulations from their earnings process, we use the publicly-available codes that accompany their publication.

[20] Note that their discretization originally was applied to a re-estimated version of Arellano et al. (2017) that uses after-tax earnings, so our results are not directly comparable.
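The moment comparison described above relies on age-by-age cross-sectional moments of a simulated panel, for both the levels and the first differences. A minimal sketch of that computation is given below. The function name and the panel layout (one row per individual, one column per age) are assumptions for illustration, and the kurtosis returned is the raw fourth standardized moment, so excess kurtosis is obtained by subtracting 3.

```python
import numpy as np


def moments_by_age(panel):
    """Cross-sectional moments at each age.

    panel : array of shape (N individuals, T ages)
    Returns a dict of arrays of length T: mean, variance,
    skewness, and (raw) kurtosis.
    """
    x = np.asarray(panel, dtype=float)
    mean = x.mean(axis=0)
    centered = x - mean
    variance = (centered ** 2).mean(axis=0)
    std = np.sqrt(variance)
    skewness = (centered ** 3).mean(axis=0) / std ** 3
    kurtosis = (centered ** 4).mean(axis=0) / variance ** 2
    return {"mean": mean, "variance": variance,
            "skewness": skewness, "kurtosis": kurtosis}


def first_difference_moments(panel):
    """Same moments for the within-individual first differences."""
    return moments_by_age(np.diff(np.asarray(panel, dtype=float), axis=1))
```

Applying both functions to a panel simulated from the continuous process and to a panel simulated from each discretized Markov chain gives the age profiles that a figure like Figure 6 compares.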

Figure 6: Moments of $\eta_t$ and $\Delta\eta_t$ for the process of Arellano et al. (2017). The red line is data simulated from the Arellano et al. (2017) process, the black line follows from our discretization method, and the blue dotted line is based on the tail-binning method.

4.4 Life-cycle model with the processes of Guvenen et al. (2021) and Arellano et al. (2017)

Next, we illustrate the importance of the choice of the discretization method for the earnings process through the lens of the life-cycle model. We use the discretizations of the persistent component of the earnings processes as presented above, and separately add a three-grid-point equal-quantile discretization of the transitory component to the model. Figure 7 plots how assets and consumption develop over the life cycle of an individual using the two different discretization methods. We observe that for both the Guvenen et al. (2021) and Arellano et al. (2017) process, the discretization method has economically meaningful implications for the development of mean consumption over the life-cycle, for the mean MPCs, and for the mean saving rate. In addition, the variance of consumption growth over the life-cycle is also sensitive to the discretization method used; discretizing the Guvenen et al. (2021) process using the binning method results in a larger variance of consumption growth over the life-cycle than our discretization method does. For the Arellano et al. (2017) process, what stands out is the difference in MPCs over the life-cycle between the two methods; our method generates higher MPCs for younger individuals (around 0.7) than the tail-binning method (around 0.6).

Table 5 summarizes several key statistics of the life-cycle model, and how these vary for the two different processes and their different discretizations. For the Guvenen et al. (2021) process, the most notable difference between the discretizations is in the welfare cost of risk (CEV). For our discretization, the CEV is 0.69, while the binning-method based solution implies a CEV that is considerably lower (0.46). We believe this is mainly driven by the binning method discretization understating the amount of longer-term non-employment risk. For the Arellano et al. (2017) process, the CEV estimates also differ between discretization methods; our method implies a CEV of 0.19, and tail-binning results in a CEV estimate of 0.16. Given that the CEV measures the welfare gain in applications with policy experiments, the sensitivity of CEVs to the choice of discretization method is an important finding.

Figure 7: Simulations from the life-cycle model for two different discretizations of the earnings process of Guvenen et al. (2021) and Arellano et al. (2017). Panels: (a) Guvenen et al. (2021) for $m = 12$; (b) Arellano et al. (2017) for $m = 18$.
Notes: Solid line represents our discretization method, and the dashed lines are the binning methods.
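For reference, the CEV discussed above has a closed form under one additional assumption that is made here only for illustration: welfare $W$ is expected discounted CRRA utility with $\gamma \neq 1$, so that $W$ is homogeneous of degree $1-\gamma$ in the consumption sequence. Starting from the definition $W((1-CEV)\,c^0) = W(c^1)$:

```latex
W\big((1-CEV)\,c^0\big) = (1-CEV)^{1-\gamma}\, W\big(c^0\big) = W\big(c^1\big)
\quad\Longrightarrow\quad
CEV = 1 - \left(\frac{W(c^1)}{W(c^0)}\right)^{\frac{1}{1-\gamma}} .
```

With $\gamma = 2$, as in the calibration of Table 4, this reduces to $CEV = 1 - W(c^0)/W(c^1)$. Both welfare values are negative when $\gamma > 1$, and $W(c^1) < W(c^0)$ when risk is costly, so the resulting CEV lies between 0 and 1.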

Table5: Summarystatisticscomputedfromsimulationsfromthelife-cyclemodelfortwodifferentdiscretizations oftheearningsprocessesofGuvenenetal.(2021)andArellanoetal.(2017). Model+Guvenen Model+ABB Method Janssens-McCrary Binning Janssens-McCrary Tail-Binning St.dev.(log𝑐 ) 0.77 0.74 0.46 0.41 𝑖𝑡 St.dev.(Δlog𝑐 ) 0.17 0.19 0.10 0.11 𝑖𝑡 Corr(log𝑐 ,log𝑦 ) 0.91 0.90 0.95 0.93 𝑖𝑡 𝑖𝑡 Corr(Δlog𝑐 ,Δlog𝑦 ) 0.75 0.82 0.78 0.77 𝑖𝑡 𝑖𝑡 CEV 0.69 0.46 0.19 0.16 𝜓𝑃 0.51 0.47 0.66 0.67 BPP MeanMPC 0.22 0.23 0.22 0.21 Discountfactor 𝛽 0.94 0.94 0.97 0.97 Table 6 summarizes several wealth inequality measures as found in the data (obtained from Krueger, Mitman, and Perri (2016)) and compares them to those computed from the lifecycle model solutions. We find that the discretization method matters for the amount of wealth inequality a life-cycle model can generate. Using binning to discretize both earnings processes results in less wealth inequality than when using our method. Most likely this is because binning misses out on the skewness and excess kurtosis present in the process. The differences between methods are largest for the Arellano et al. (2017) process. When using ourdiscretizationmethodfortheArellanoetal.(2017)process,themodelmatchesthewealth distribution of the data fairly well. For example, our discretization results in a wealth Gini indexof0.76,closetothe0.77-0.78inthedata(tail-binning: 0.7),andatop1%wealthshareof 34.5% (tail binning: 27.6), which is actually larger than the 30.9-33.5% reported by Krueger et al.(2016)). Theabilityofourmodelsolutiontomatchtheseaspectsofthewealthdistribution – without targeting it – is notable, given that the literature has documented that simple lifecycle models like this one typically struggle matching the right tail of the empirical wealth distribution(DeNardiandFella,2017). Comparing the earnings process of Guvenen et al. (2021) with Arellano et al. 
(2017) in the contextofalife-cyclemodel,wefindthatwhileboththemeanMPCandthemeanMPC’sover the wealth distribution are comparable between processes, the Guvenen et al. (2021) process implies a flatter MPC profile over the life-cycle than the Arellano et al. (2017) process. This is becausethepresenceofthenon-employmentshockintheGuvenenetal.(2021)processcreates a strong precautionary savings motive for younger generations, resulting in lower MPC’s for youngagesthanintheArellanoetal.(2017)process(0.35insteadof0.75forthe25yearolds). This large downside risk in the Guvenen et al. (2021) also result in a higher CEV (0.69) than 28

Table6: Wealthinequalitymeasures. DatafromKruegeretal.(2016). Data Model+Guvenen Model+ABB %Shareheldby: PSID,06 SCF,07 Janssens-McCrary Binning Janssens-McCrary Tail-binning Q1 -0.9 -0.2 -0.7 -0.6 -0.4 -0.3 Q2 0.8 1.2 0.9 1.4 1.5 2.3 Q3 4.4 4.6 6.4 7.6 7.1 9.5 Q4 13.0 11.9 19.6 21.1 15.7 19.2 Q5 82.7 82.5 73.0 70.6 76.0 69.3 T1% 30.9 33.5 10.9 8.9 34.5 27.6 Gini 0.77 0.78 0.73 0.69 0.76 0.70 in the Arellano et al. (2017) process (0.19). While the Guvenen et al. (2021) process by and large focuses on downside earnings risk, the Arellano et al. (2017) process features a longer right-tail, resulting in more wealth inequality (as in Table 6) than the Guvenen et al. (2021) process. 4.5 Canonicalstochasticprocesses To illustrate that discretization methods matter beyond the setting of highly non-linear processes like the ones presented above, this section considers the discretization of two simpler persistent-transitory earnings processes in the context of a life-cycle model. Both processes characterize the persistent component as an AR(1) process. The first process uses Gaussian innovations both to the persistent and transitory component (referred to as AR(1) below). The second process has Gaussian mixture innovations for both the persistent and transitory component (henceforth referred to as AR(1)-M). Specifically, for the second process, we use a simplified version of the Guvenen et al. (2021) process in Equation (12), disregarding the non-employment shock, with the same parameters. We parametrize the AR(1) process such that it has the same autocorrelation and variance as the AR(1)-M. In Equations, the AR(1)-M 29

process is given by:[21]

$$
\begin{aligned}
y_t^i &= e^{(z_t^i + \varepsilon_t^i)} \\
z_t^i &= \rho z_{t-1}^i + \eta_t^i \\
\eta_t^i &\sim \begin{cases} N(\mu_{\eta,1}, \sigma_{\eta,1}) & \text{with prob. } p_z \\ N(\mu_{\eta,2}, \sigma_{\eta,2}) & \text{with prob. } 1 - p_z \end{cases} \\
\varepsilon_t^i &\sim \begin{cases} N(\mu_{\varepsilon,1}, \sigma_{\varepsilon,1}) & \text{with prob. } p_\varepsilon \\ N(\mu_{\varepsilon,2}, \sigma_{\varepsilon,2}) & \text{with prob. } 1 - p_\varepsilon. \end{cases}
\end{aligned}
\tag{13}
$$

For the AR(1) persistent-transitory process with Gaussian innovations, we use $\eta_t^i \sim N(0, \sigma_\eta^2)$ and $\varepsilon_t^i \sim N(0, \sigma_\varepsilon^2)$, where $\sigma_\eta^2 = p_z \sigma_{\eta,1}^2 + (1 - p_z)\sigma_{\eta,2}^2 + p_z \mu_{\eta,1}^2 + (1 - p_z)\mu_{\eta,2}^2$, and similarly for $\sigma_\varepsilon^2$. For both processes, we only discretize the persistent component and separately add a three-grid-point equal-quantile discretization of the transitory component to the model.

Comparison with other methods. We compare our discretization method to the methods of Rouwenhorst (1995), Tauchen (1986), and Farmer and Toda (2017) for the AR(1) process, and to Farmer and Toda (2017) and the binning method of Judd (1998)/Adda and Cooper (2003) for the AR(1)-M process.[22]

[21] Parameters: $\rho = 0.959$, $p_z = 0.407$, $\mu_{\eta,1} = -0.085$, $\mu_{\eta,2} = 0.085\,p_z/(1 - p_z)$, $\sigma_{\eta,1} = 0.364$, $\sigma_{\eta,2} = 0.069$, $p_\varepsilon = 0.13$, $\mu_{\varepsilon,1} = 0.271$, $\mu_{\varepsilon,2} = -0.271\,p_\varepsilon/(1 - p_\varepsilon)$, $\sigma_{\varepsilon,1} = 0.285$ and $\sigma_{\varepsilon,2} = 0.037$.

[22] All implementations of these methods are standard, except for the grid width we use in the Farmer and Toda (2017) method. For the AR(1) process, we use a grid width equal to $\max\{3, 1.2\log(m - 1)\}$ times the standard deviation of the process. This grid width is based on the proposal of Flodén (2008), and we find that this choice works better in this setting than the width of $\sqrt{m - 1}$ that Farmer and Toda (2017) propose. For the AR(1)-M process, we use $\max\{4, 1.2\log(m - 1)\}$. The Farmer-Toda method is set to match the first four conditional moments at each grid point.

Figure 8 presents the information loss of each discretization relative to the true process. To compute the information loss, we interpret the transition probability matrix and grid of the different discretization methods as parameters $\Pi$ and $\mu$ in our HMM framework, and then re-estimate the variance of the approximation error. This results in a likelihood for each discretization. We compute this statistic for different grid sizes $m$. Given that our method minimizes information loss, it is no surprise that our method results in the lowest losses. Figure 8 shows that the Farmer and Toda method is, for larger grids, closest to ours in terms of information loss, and that the differences in information loss between ours and the alternative methods are large. For the AR(1) process, we achieve the same information loss as the Farmer and Toda method with 27 grid points using only 19. We achieve the same information loss as the Tauchen method with 27 using only 15 grid points, and the same loss as the Rouwenhorst method at 27 using only 13 grid points. For the AR(1)-M, we achieve the same information loss as the Farmer and Toda method at 39 grid points using only 25 grid points, and the same information loss as the binning method at 39 using only 11 grid points. Because the Rouwenhorst method is dominated by the Farmer and Toda method and the Tauchen method for larger grids, we drop this method in the analysis that follows below.

Figure 8: KL divergence of the approximating model likelihood versus the likelihood of the true process for an AR(1) and AR(1)-M process, for different discretization methods and different grid sizes $m$. Panels: (a) AR(1); (b) AR(1)-M.

Table 7 summarizes some other statistics of the discretized processes, namely the unconditional and conditional moments of the distribution. The Farmer-Toda method is based on moment matching, which is why it tends to perform well on most moments. However, the method is implemented such that when it cannot match a moment at one of the grid points, that specific moment restriction is dropped for that grid point. This is why there are cases in which it does not match all moments even when targeting them, and ours or the Binning/Tauchen method may perform better at matching those moments. One statistic where we consistently outperform the other methods is the Mean Squared Forecast Error (MSFE), that is, the forecast errors an agent would make when using the discretized process to forecast the true process.

Implications in a life-cycle model. Next, we evaluate how the choice of the discretization method affects the solutions of the life-cycle model, where we focus on a selected number of statistics given in Table 8 and Figure 9.
Our main conclusion from Table 8 is that for these two stochastic processes, the choice of the discretization method can matter for the model solution, particularly when using a low number of grid points. For example, with an AR(1) process, the standard deviation of log consumption, the welfare cost of risk (CEV), the wealth Gini index, and the wealth share of the top 20% vary in economically significant ways across the different discretization methods when using a grid size of $m = 7$. When increasing the grid size to $m = 17$, the differences between the methods are smaller. In general, the solutions that follow from our discretization method change little when adding grid points. For the AR(1)-M process, we look at a larger $m$, because it requires more grid points to get to the same information loss (as shown in Figure 8b). At $m = 17$, the solutions differ, particularly for the Gini index, the mean MPC’s and the top 1% wealth share, but the solutions across discretizations are more similar at $m = 31$. We think the sensitivity of model solutions at low $m$ is an important insight, because it is common in the literature to use discretizations of AR(1) processes with few grid points.

Table 7: Summary statistics on unconditional and conditional moments for different discretizations of the AR(1) and AR(1)-M stochastic processes.

| | AR(1), m=7: JM | FT | T | AR(1), m=17: JM | FT | T | AR(1)-M, m=17: JM | FT | Bin | AR(1)-M, m=31: JM | FT | Bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Abs. dev. uncond. mean | <0.01 | <0.01 | <0.01 | <0.01 | <0.01 | 0.01 | 0.01 | <0.01 | <0.01 | 0.01 | <0.01 | 0.01 |
| % dev. uncond. var. | 5.33 | 18.4 | 56.6 | 0.09 | <0.01 | 11.7 | 2.99 | 0.70 | 7.3 | 2.51 | 0.70 | 4.04 |
| % dev. uncond. skew. | 0.03 | <0.01 | 0.04 | 0.02 | 0.01 | 0.03 | 7.88 | 26.3 | 20.2 | 0.81 | 8.80 | 15.3 |
| % dev. uncond. kurt. | 10.8 | 61.8 | 11.3 | 3.62 | 1.45 | 7.83 | 3.06 | 6.36 | 18.8 | 1.34 | 1.03 | 12.7 |
| % dev. autocor. | 0.75 | 0.07 | 1.65 | 1.28 | 0.03 | 0.09 | 0.55 | <0.01 | 0.69 | 0.26 | 0.01 | 0.29 |
| Ave. abs. dev. cond. mean | 1.07 | <0.01 | 1.49 | 1.19 | <0.01 | 0.25 | 0.01 | <0.01 | <0.01 | 0.01 | <0.01 | 0.01 |
| Ave. % dev. cond. var. | 14.1 | 79.6 | 4.17 | 24.7 | <0.01 | 16.1 | 19.1 | 8.74 | 19.1 | 26.1 | 8.74 | 13.0 |
| Ave. % dev. cond. skew. | 1.12 | 3.47 | 1.45 | 0.74 | 1.38 | 0.14 | 1.01 | 0.60 | 1.11 | 1.13 | 1.40 | 1.20 |
| Ave. % dev. cond. kurt. | 336 | 1628 | 265 | 349 | 907 | 2.55 | 122 | 132 | 84.0 | 167 | 485 | 84.0 |
| MSFE | 0.09 | 0.12 | 0.12 | 0.07 | 0.07 | 0.07 | 0.07 | 0.08 | 0.08 | 0.06 | 0.07 | 0.07 |

Notes: For the skewness moments of the AR(1) process, these are the absolute deviations rather than the % deviation. For the conditional moments, the average is computed across grid points.

Table 8: Summary statistics computed from simulations from the life-cycle model solved for an AR(1) and AR(1)-M earnings process discretized using different methods.

| | Model + AR(1), m=7: JM | FT | T | Model + AR(1), m=17: JM | FT | T | Model + AR(1)-M, m=17: JM | FT | Bin | Model + AR(1)-M, m=31: JM | FT | Bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| St.dev($\log c_{it}$) | 0.70 | 0.79 | 0.91 | 0.72 | 0.72 | 0.76 | 0.71 | 0.71 | 0.68 | 0.71 | 0.71 | 0.70 |
| St.dev($\Delta\log c_{it}$) | 0.14 | 0.15 | 0.15 | 0.14 | 0.15 | 0.16 | 0.14 | 0.15 | 0.15 | 0.14 | 0.15 | 0.15 |
| Corr($\log c_{it}, \log y_{it}$) | 0.96 | 0.95 | 0.97 | 0.97 | 0.96 | 0.96 | 0.96 | 0.96 | 0.95 | 0.96 | 0.96 | 0.96 |
| Corr($\Delta\log c_{it}, \Delta\log y_{it}$) | 0.78 | 0.82 | 0.82 | 0.79 | 0.80 | 0.82 | 0.89 | 0.89 | 0.89 | 0.89 | 0.89 | 0.89 |
| CEV | 0.37 | 0.42 | 0.46 | 0.38 | 0.38 | 0.39 | 0.40 | 0.40 | 0.38 | 0.40 | 0.40 | 0.39 |
| $\psi^{P}_{BPP}$ | 0.51 | 0.55 | 0.46 | 0.52 | 0.51 | 0.52 | 0.51 | 0.51 | 0.52 | 0.51 | 0.51 | 0.52 |
| Mean MPC | 0.26 | 0.21 | 0.26 | 0.25 | 0.25 | 0.25 | 0.21 | 0.15 | 0.20 | 0.20 | 0.18 | 0.21 |
| Gini index | 0.78 | 0.83 | 0.81 | 0.79 | 0.78 | 0.78 | 0.75 | 0.75 | 0.72 | 0.75 | 0.75 | 0.73 |
| Q5 Wealth Share | 0.80 | 0.87 | 0.85 | 0.82 | 0.80 | 0.81 | 0.76 | 0.76 | 0.73 | 0.76 | 0.76 | 0.75 |
| T1% Wealth Share | 0.12 | 0.19 | 0.16 | 0.15 | 0.14 | 0.14 | 0.12 | 0.13 | 0.09 | 0.12 | 0.12 | 0.10 |
| Discount factor $\beta$ | 0.95 | 0.93 | 0.93 | 0.95 | 0.94 | 0.94 | 0.95 | 0.94 | 0.94 | 0.95 | 0.94 | 0.94 |

Notes: JM stands for Janssens-McCrary, FT stands for the Farmer-Toda method, Bin refers to the binning method, and T stands for the Tauchen method.

Figure 9: Mean Marginal Propensities to Consume for three different discretizations of AR(1) and AR(1)-M processes in a life-cycle model, computed across the wealth distribution and by age. Panels: (a) AR(1), $m = 7$; (b) AR(1), $m = 17$; (c) AR(1)-M, $m = 17$; (d) AR(1)-M, $m = 31$.
Notes: The solid black line is our discretization method, and the dashed line is the Farmer-Toda method. For the AR(1), the dash-dot line is the Tauchen method; for the AR(1)-M, it is the binning method.

Figure 9 visualizes the mean MPC’s across the life cycle and the wealth distribution for the different AR(1) and AR(1)-M discretizations. We again see that the choice of the discretization method matters, mostly for low $m$. Interestingly, for the AR(1)-M, the aggregate statistics on consumption and earnings in Table 8 are similar for both $m = 17$ and $m = 31$, and appear insensitive to the choice of discretization. However, the life-cycle profile of MPC’s and the MPC’s across the wealth distribution differ significantly for $m = 17$, and even with $m = 31$ the choice of discretization matters. For some age groups, the mean MPC’s can vary as much as 20% between methods. Compared with the other discretization methods, the MPC’s that follow from our method change less when adding more grid points, in line with our method being more parsimonious by capturing a larger fraction of the information using fewer grid points.
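To make the variance-matching step and the kind of comparison reported in Table 7 concrete, the sketch below computes the innovation variance of the persistent AR(1)-M component from the footnote 21 parameters, builds a textbook Tauchen (1986) discretization of the variance-matched Gaussian AR(1), and reports the percentage deviation of its unconditional variance. This is an illustration only, not the implementation used in the paper: the function names, the grid width of three unconditional standard deviations, and the least-squares solve for the stationary distribution are assumptions, so the printed numbers need not reproduce Table 7.

```python
import math
import numpy as np

# Persistent-innovation parameters as listed in footnote 21.
RHO = 0.959
P_Z = 0.407
MU_1, SIG_1 = -0.085, 0.364
MU_2, SIG_2 = 0.085 * P_Z / (1.0 - P_Z), 0.069


def mixture_moments(p, mu1, sig1, mu2, sig2):
    """Mean and variance of a two-component Gaussian mixture."""
    mean = p * mu1 + (1.0 - p) * mu2
    second = p * (sig1**2 + mu1**2) + (1.0 - p) * (sig2**2 + mu2**2)
    return mean, second - mean**2


def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def tauchen(m, rho, sigma_eta, width=3.0):
    """Tauchen (1986) grid and transition matrix for z' = rho * z + eta.

    `width` (in unconditional standard deviations) is an assumed choice.
    """
    sigma_z = sigma_eta / math.sqrt(1.0 - rho**2)
    grid = np.linspace(-width * sigma_z, width * sigma_z, m)
    step = grid[1] - grid[0]
    P = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            lo = (grid[j] - rho * grid[i] - step / 2.0) / sigma_eta
            hi = (grid[j] - rho * grid[i] + step / 2.0) / sigma_eta
            if j == 0:
                P[i, j] = norm_cdf(hi)          # all mass below the first midpoint
            elif j == m - 1:
                P[i, j] = 1.0 - norm_cdf(lo)    # all mass above the last midpoint
            else:
                P[i, j] = norm_cdf(hi) - norm_cdf(lo)
    return grid, P


def stationary(P):
    """Stationary distribution: solve pi P = pi together with sum(pi) = 1."""
    m = P.shape[0]
    A = np.vstack([P.T - np.eye(m), np.ones((1, m))])
    b = np.zeros(m + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi


eta_mean, eta_var = mixture_moments(P_Z, MU_1, SIG_1, MU_2, SIG_2)
true_var = eta_var / (1.0 - RHO**2)  # unconditional variance of z

for m in (7, 17):
    grid, P = tauchen(m, RHO, math.sqrt(eta_var))
    pi = stationary(P)
    disc_var = pi @ grid**2 - (pi @ grid) ** 2
    pct_dev = 100.0 * abs(disc_var - true_var) / true_var
    print(f"m={m}: % deviation of unconditional variance = {pct_dev:.2f}")
```

Because the mixture means in footnote 21 are constructed to offset each other, `eta_mean` is zero up to rounding, which is why the variance formula in the text contains no mean-correction term.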

Comparison between an AR(1), AR(1)-M and the Guvenen et al. (2021) process. Finally, we use this model to ask what the differences are between an AR(1) and AR(1)-M process in a life-cycle context, that is, what excess skewness and kurtosis imply for a life-cycle model. Consider Table 8 and the largest choice of $m$. Most notably, the Gaussian mixture leads to a larger correlation between consumption and income changes. In addition, we see a decrease of the mean MPC, and a decrease in wealth inequality. The intuition behind these results is that the income distribution in the economy with the mixture distribution is less unequal, because the mixture process is skewed towards lower incomes. This results in a less unequal wealth distribution. In addition, the increased left-tail risk increases the correlation between consumption and income changes, and it lowers MPC’s because of a stronger precautionary savings motive. Comparing this with the Guvenen et al. (2021) process, which features non-employment shocks in addition to Gaussian mixture innovations, we see that non-employment shocks substantially increase the CEV compared to the AR(1)-M process (from 0.40 to 0.69), lower the wealth Gini index from 0.75 to 0.73, and increase mean MPC’s from 0.20 to 0.22 because about 2% more people live hand-to-mouth.

5 Conclusion

This paper proposes a novel finite-state Markov chain approximation method, based on minimizing the information loss between the true stochastic process and a Hidden Markov Model. A finite-state Markov chain approximation is inherently a misspecified model, and the objective of minimizing the KL divergence is standard in the misspecified model literature. We show that this is a consistent approach in our setting in the sense that under some assumptions, using enough hidden states, the information loss between the approximating Hidden Markov Model and the true stochastic process can be made arbitrarily small.
Our discretization method is applicable to a large class of stochastic processes and provides both an optimally selected grid and transition probability matrix. This optimal grid is especially powerful in the case of correlated multivariate processes, as it avoids the use of tensor grids.

We apply and compare our method in two applications. The first application is an asset-pricing model with stochastic volatility, which, as shown by De Groot (2015), has a closed-form analytical solution. This analytical solution is our benchmark when comparing the solutions based on different discretization methods, and we find our method results in numerical solutions closer to this benchmark. The second application evaluates the effect of the choice of the discretization method on the solutions that follow from a life-cycle model with a variety of different earnings processes, including Guvenen et al. (2021) and Arellano et al. (2017). We find that the discretization method matters for, among other things, the welfare cost of risk, the marginal propensity to consume, and wealth inequality measures.

Discretized stochastic processes have many more applications than the ones we use to benchmark our method. The econometric literature has shown stochastic processes featuring non-linearities, excess skewness and kurtosis provide a better description of the data. Our method provides a tool for the use of richer statistical processes in structural economic models.

References

Adda, J., and Cooper, R. W. (2003). Dynamic Economics: Quantitative Methods and Applications. MIT Press.
Altonji, J. G., Hynsjö, D. M., and Vidangos, I. (2022, May). “Individual Earnings and Family Income: Dynamics and Distribution” (Working Paper No. 30095). National Bureau of Economic Research. Retrieved from http://www.nber.org/papers/w30095 doi: 10.3386/w30095
Arellano, M., Blundell, R., and Bonhomme, S. (2017). “Earnings and Consumption Dynamics: A Nonlinear Panel Data Framework”. Econometrica, 85(3), 693–734.
Bansal, R., and Yaron, A. (2004). “Risks for the Long Run: A Potential Resolution of Asset Pricing Puzzles”. Journal of Finance, 59(4), 1481–1509.
Benabou, R. (2002). “Tax and Education Policy in a Heterogeneous-Agent Economy: What Levels of Redistribution Maximize Growth and Efficiency?”. Econometrica, 70(2), 481–517.
Blundell, R., Pistaferri, L., and Preston, I. (2008). “Consumption Inequality and Partial Insurance”. American Economic Review, 98(5), 1887–1921.
Civale, S., Díez-Catalán, L., and Fazilet, F. (2016). “Discretizing a Process with Non-Zero Skewness and High Kurtosis”. Available at SSRN 2636485.
De Groot, O. (2015). “Solving Asset Pricing Models with Stochastic Volatility”. Journal of Economic Dynamics and Control, 52, 308–321.
De Nardi, M., and Fella, G. (2017). “Saving and Wealth Inequality”. Review of Economic Dynamics, 26, 280–300.
De Nardi, M., Fella, G., and Paz-Pardo, G. (2020). “Nonlinear Household Earnings Dynamics, Self-Insurance, and Welfare”. Journal of the European Economic Association, 18(2), 890–926.
Do, M. N. (2003). “Fast Approximation of Kullback-Leibler Distance for Dependence Trees and Hidden Markov Models”. IEEE Signal Processing Letters, 10(4), 115–118.
Douc, R., and Moulines, E. (2012). “Asymptotic Properties of the Maximum Likelihood Estimation in Misspecified Hidden Markov Models”. The Annals of Statistics, 40(5), 2697–2732.
Duan, J.-C., and Simonato, J.-G. (2001). “American Option Pricing under GARCH by a Markov Chain Approximation”. Journal of Economic Dynamics and Control, 25(11), 1689–1718.
Farmer, L. E. (2021). “The Discretization Filter: A Simple Way to Estimate Nonlinear State Space Models”. Quantitative Economics, 12(1), 41–76.
Farmer, L. E., and Toda, A. A. (2017). “Discretizing Nonlinear, Non-Gaussian Markov Processes with Exact Conditional Moments”. Quantitative Economics, 8(2), 651–683.
Fella, G., Gallipoli, G., and Pan, J. (2019). “Markov-Chain Approximations for Life-Cycle Models”. Review of Economic Dynamics, 34, 183–201.
Finesso, L., Grassi, A., and Spreij, P. (2010). “Approximation of Stationary Processes by Hidden Markov Models”. Mathematics of Control, Signals, and Systems, 22(1), 1–22.
Flodén, M. (2008). “A Note on the Accuracy of Markov-Chain Approximations to Highly Persistent AR(1) Processes”. Economics Letters, 99(3), 516–520.
Galindev, R., and Lkhagvasuren, D. (2010). “Discretization of Highly Persistent Correlated AR(1) Shocks”. Journal of Economic Dynamics and Control, 34(7), 1260–1276.
Goldfeld, S. M., and Quandt, R. E. (1973). “A Markov Model for Switching Regressions”. Journal of Econometrics, 1(1), 3–15.
Gordon, G. (2021). “Efficient VAR Discretization”. Economics Letters, 204, 109872.
Gospodinov, N., and Lkhagvasuren, D. (2014). “A Moment-Matching Method for Approximating Vector Autoregressive Processes by Finite-State Markov Chains”. Journal of Applied Econometrics, 29(5), 843–859.
Gourieroux, C., Monfort, A., and Trognon, A. (1984). “Pseudo Maximum Likelihood Methods: Theory”. Econometrica, 681–700.
Guvenen, F., Karahan, F., Ozkan, S., and Song, J. (2021). “What do Data on Millions of US Workers Reveal About Lifecycle Earnings Dynamics?”. Econometrica, 89(5), 2303–2339.
Hamilton, J. D. (1990). “Analysis of Time Series Subject to Changes in Regime”. Journal of Econometrics, 45(1-2), 39–70.
Hornik, K., Stinchcombe, M., and White, H. (1989). “Multilayer Feedforward Networks are Universal Approximators”. Neural Networks, 2(5), 359–366.
Judd, K. (1998). Numerical Methods in Economics. MIT Press.
Kitagawa, G. (1987). “Non-Gaussian State-Space Modeling of Nonstationary Time Series”. Journal of the American Statistical Association, 82(400), 1032–1041.
Kopecky, K. A., and Suen, R. M. (2010). “Finite State Markov-Chain Approximations to Highly Persistent Processes”. Review of Economic Dynamics, 13(3), 701–714.
Krueger, D., Mitman, K., and Perri, F. (2016). “Macroeconomics and Household Heterogeneity”. In Handbook of Macroeconomics (Vol. 2, pp. 843–921). Elsevier.
Krueger, D., and Wu, C. (2021). “Consumption Insurance against Wage Risk: Family Labor Supply and Optimal Progressive Income Taxation”. American Economic Journal: Macroeconomics, 13(1), 79–113.
Langrock, R. (2011). “Some Applications of Nonlinear and Non-Gaussian State-Space Modelling by Means of Hidden Markov Models”. Journal of Applied Statistics, 38(12), 2955–2970.
Lehéricy, L. (2021). “Nonasymptotic Control of the MLE for Misspecified Nonparametric Hidden Markov Models”. Electronic Journal of Statistics, 15(2), 4916–4965.
McLachlan, G. J., Lee, S. X., and Rathnayake, S. I. (2019). “Finite Mixture Models”. Annual Review of Statistics and Its Application, 6, 355–378.
Mevel, L., and Finesso, L. (2004). “Asymptotical Statistics of Misspecified Hidden Markov Models”. IEEE Transactions on Automatic Control, 49(7), 1123–1132.
Mitchell, O. S., and Phillips, J. W. (2006). “Social Security Replacement Rates for Alternative Earnings Benchmarks”. Benefits Quarterly, 4, 37–47.
Quandt, R. E. (1958). “The Estimation of the Parameters of a Linear Regression System Obeying Two Separate Regimes”. Journal of the American Statistical Association, 53(284), 873–880.
Rouwenhorst, K. G. (1995). “Asset Pricing Implications of Equilibrium Business Cycle Models”. In T. F. Cooley (Ed.), Frontiers of Business Cycle Research (pp. 294–330). Princeton University Press.
Song, Y. (2014). “Modelling Regime Switching and Structural Breaks with an Infinite Hidden Markov Model”. Journal of Applied Econometrics, 29(5), 825–842.
Tauchen, G. (1986). “Finite State Markov-chain Approximations to Univariate and Vector Autoregressions”. Economics Letters, 20(2), 177–181.
Tauchen, G., and Hussey, R. (1991). “Quadrature-Based Methods for Obtaining Approximate Solutions to Nonlinear Asset Pricing Models”. Econometrica, 371–396.
Terry, S. J., and Knotek II, E. S. (2011). “Markov-chain Approximations of Vector Autoregressions: Application of General Multivariate-Normal Integration Techniques”. Economics Letters, 110(1), 4–6.
Vidyasagar, M. (2005). “The Realization Problem for Hidden Markov Models: The Complete Realization Problem”. In Proceedings of the 44th IEEE Conference on Decision and Control (pp. 6632–6637).
White, H. (1982). “Maximum Likelihood Estimation of Misspecified Models”. Econometrica, 1–25.
Wu, C., and Krueger, D. (2021). “Consumption Insurance Against Wage Risk: Family Labor Supply and Optimal Progressive Income Taxation”. American Economic Journal: Macroeconomics, 13(1), 79–113.
Zeevi, A. J., and Meir, R. (1997). “Density Estimation through Convex Combinations of Densities: Approximation and Estimation Bounds”. Neural Networks, 10(1), 99–109.

A Proof of Main Theorem

A.1 Preliminaries, notation and existing results

As in Zeevi and Meir (1997), denote $\mathcal{F}_{c,\eta} = \{f \in \mathcal{F}_c \mid f \geq \eta > 0, \ \forall y \in \mathcal{Y}\}$, where

$$\mathcal{F}_c = \left\{ f \ \Big|\ f \in C_{\mathcal{Y}}, \ f \geq 0, \ \int f = 1 \right\}$$

is the class of continuous density functions with compact support $\mathcal{Y} \subset \mathbb{R}^k$ fixed and given. $\mathcal{F}_{c,\eta} \subset \mathcal{F}_c$ is bounded below over $\mathcal{Y}$ by some positive constant, denoted by $\eta$.

We impose the following assumptions on the true process $f(\mathbf{y})$ and approximating model $p(\mathbf{y}, \theta)$:

(A1) $\mathbf{y} = \{y_t\}_{t=1}^{T}$ has a data generating process characterized by $f(\mathbf{y})$, $y_t \in \mathbb{R}^k$, that is first-order Markov and stationary, that is,
$$f(y_t \mid y_{t-1}, \ldots, y_1) = f(y_t \mid y_{t-1}),$$
and
$$f(y_{t+l} \mid y_{t+l-1}) = f(y_t \mid y_{t-1}) \quad \forall l \in \mathbb{N}.$$

(A2) $f(y_t \mid y_{t-1}) \in \mathcal{F}_{c,\eta}$.

(A3) $\log f(y_t \mid y_{t-1})$ and $f(y_t \mid y_{t-1})$ are differentiable in $y_{t-1} \in \mathcal{Y}$.

(A4) $\log f(y_t \mid y_{t-1})$ is locally Lipschitz continuous in $y_{t-1} \in \mathcal{Y}$.

(A5) $p(\mathbf{y}; \theta_m)$ is characterized by:
$$y_t \mid x_t = \mu_m(x_t) + \operatorname{diag}(\sigma_m)\,\varepsilon_t, \qquad \varepsilon_t \sim N(0, I_k),$$
$$x_{t+1} \mid x_t \sim \Pi_{ij,m}$$
with parameters $\theta_m = (\mu_m, \Pi_m, \sigma_m)$, and $x_t \in \{1, \ldots, m\}$ a latent state evolving according to a first-order Markov process with transition probability matrix $\Pi_m$. Denote the conditional distribution by $p(y_t \mid y_{t-1}, \ldots, y_1; \theta_m) \in \mathcal{F}_{c,\eta}$.

Denote the $L_p$ distance between two functions by

$$d_p(f, g) := \left( \int |f(x) - g(x)|^p \, dx \right)^{1/p}$$

and the $l_p$ distance between two vectors as $d_p(x, x') = \left(|x_1 - x_1'|^p + \ldots + |x_d - x_d'|^p\right)^{1/p}$, for $p \geq 1$.

We denote the class of basic densities that we use in our approximation class as

$$\Phi_{\eta,\tau} = \left\{ \phi_\sigma \in \Phi_\eta \ \Big|\ \phi_\sigma = \sigma^{-k} \phi\!\left(\frac{\cdot - \mu}{\sigma}\right), \ \mu \in \mathcal{Y}, \ \sigma \in \mathbb{R} \text{ s.t. } \sigma \geq \tau > 0 \right\}$$

with $\Phi_\eta = \{\phi \in \Phi \mid \phi \geq \eta > 0, \ \forall y \in \mathcal{Y}\}$ and $\Phi = \{\phi \mid \phi \in C(\mathbb{R}^k), \ \phi > 0, \ \int \phi = 1\}$ the class of continuous densities. Note $\Phi_{\eta,\tau} \subset \Phi_\eta \subset \Phi$. The approximation class is given by

$$\mathcal{G}_m = \left\{ f_m^{\theta} \ \Big|\ f_m^{\theta}(\cdot) = \sum_{i=1}^{m} \alpha_i \phi_\sigma(\cdot\,; \theta_i), \ \phi_\sigma \in \Phi_{\eta,\tau}, \ \alpha_i > 0, \ \sum_{i=1}^{m} \alpha_i = 1 \right\}$$

and we write $\theta = (\mu, \sigma)$, where $\mu = [\mu(1), \ldots, \mu(m)]$. That is, we consider a mixture distribution where all functions have the same scale parameter but a different location. Unlike Zeevi and Meir (1997), we will strictly refer to $\phi$ as the Gaussian probability density function, which falls into the class of functions they consider. For multivariate distributions, we have $\mu(i) = (\mu^1(i), \ldots, \mu^k(i))$, and $\sigma = (\sigma_1, \ldots, \sigma_k)$, such that $\phi_\sigma$ is the product of $k$ independent Gaussian pdf’s, also known as a product kernel.

Define $\gamma$ such that $\eta = \frac{1}{\gamma^2}$.

Lemma 1 (Eq. 14 in Zeevi and Meir, 1997). For $g, f$ s.t. $g, f \geq \frac{1}{\gamma^2} > 0$,

$$D^{KL}(f \| g) \leq \gamma^2 d_2^2(f, g).$$

That is, for densities $f$ and $g$ that are bounded below by $\frac{1}{\gamma^2}$, the KL divergence is bounded from above by the squared $L_2$ norm between $f$ and $g$ multiplied by $\gamma^2$.
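As a numerical illustration of Lemma 1, the sketch below evaluates both sides of the bound for two densities on the compact support $[0, 1]$ that are bounded away from zero. The particular densities, the grid, and the helper `integrate` are assumptions chosen for illustration; the check is a sanity test of the inequality on one example, not a proof.

```python
import numpy as np

# Compact support Y = [0, 1], discretized on a fine grid (illustrative choice).
y = np.linspace(0.0, 1.0, 20001)


def integrate(h):
    """Trapezoidal rule on the grid y."""
    return float(np.sum((h[1:] + h[:-1]) * np.diff(y)) / 2.0)


# Two illustrative densities that are bounded below on [0, 1].
f = 1.0 + 0.8 * np.sin(2.0 * np.pi * y)
g = 1.0 + 0.5 * np.cos(2.0 * np.pi * y)
f = f / integrate(f)  # renormalize to remove discretization error
g = g / integrate(g)

eta = min(f.min(), g.min())   # common lower bound of f and g
gamma_sq = 1.0 / eta          # gamma^2 defined by eta = 1 / gamma^2

kl = integrate(f * np.log(f / g))   # D_KL(f || g)
d2_sq = integrate((f - g) ** 2)     # squared L2 distance d_2^2(f, g)

print(f"KL(f||g)       = {kl:.6f}")
print(f"gamma^2 d_2^2  = {gamma_sq * d2_sq:.6f}")
```

In this example the bound is far from tight, which is consistent with its role in the appendix as a device for transferring $L_2$ approximation rates to the KL divergence.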

Lemma 2 (Petersen, 1983 as in Zeevi and Meir, 1997). Let $1 \leq p < \infty$ and let $\phi \in L_1(\mathbb{R}^k)$, $\int \phi = 1$. Letting $\phi_\sigma(x) = \sigma^{-k}\phi(x/\sigma)$, then for any $f \in L_p(\mathbb{R}^k)$, we have $\phi_\sigma * f \to f$ in $L_p(\mathbb{R}^k)$ as $\sigma \to 0$, where

$$(\phi_\sigma * f)(x) := \int \phi_\sigma(x - y) f(y) \, dy.$$

Here, $L_1(\mathbb{R}^k)$ and $L_p(\mathbb{R}^k)$ denote the space of measurable functions for which $\|f\|_1 < \infty$ and $\|f\|_p < \infty$, respectively. If we define $\bar{f} := f * \phi_\sigma$, Lemma 2 implies that $\forall \varepsilon > 0$ and $f \in \mathcal{F}_{c,\eta}$, there exists an $\bar{f}$ such that

$$d_2^2(f, \bar{f}) \leq \varepsilon. \tag{A.1}$$

Corollary 1 (Zeevi and Meir, 1997). Function $\bar{f}$ belongs to the closure of the convex hull of $\Phi_{\eta,\tau}$.

Lemma 3 (Barron, 1993 as in Zeevi and Meir, 1997). If $\bar{f}$ is in the closure of the convex hull of a set $G$ in Hilbert space, with $\|g\|_2 \leq b \ \forall g \in G$, then $\forall m \geq 1$ and $\forall c > (b^2 - \|\bar{f}\|_2^2)$, $\exists$ a function $f_m^0$ in the convex hull of $m$ points in $G$ s.t.

$$d_2^2(\bar{f}, f_m^0) \leq \frac{c}{m}.$$

Corollary 2 (Zeevi and Meir, 1997). For any $f \in \mathcal{F}_{c,\eta}$ and $\varepsilon > 0$, there exists a convex combination $f_m^0$ in $\mathcal{G}_m$ s.t.

$$d_2^2(f, f_m^0) \leq \varepsilon + \frac{c}{m}.$$

Note that Corollary 2 follows directly from the triangle inequality and Equation (A.1) and Lemma 3. One of the implications of Corollary 2 is that the Gaussian mixture model is a universal approximator in the $L_2$ norm.

Combining Corollary 2 with Lemma 1, we have:

$$D^{KL}_{\mathcal{Y}}(f \| f_m^0) \leq \gamma^2 \varepsilon + \gamma^2 \frac{c}{m}. \tag{A.2}$$

Note that although the KL divergence is not a strict metric and does not generally satisfy the triangle inequality, the $d_2^2$ distance function does. We can use the relationship between the $d_2^2$ metric and the KL divergence laid out by Lemma 1 in Lemma 4 below.
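The convergence in Lemma 2 can be illustrated numerically in one dimension. The sketch below smooths a triangular density with Gaussian kernels of shrinking bandwidth and reports the squared $L_2$ distance $d_2^2(f, \phi_\sigma * f)$. The test density, the grid, and the bandwidth values are assumptions for illustration; the grid-based convolution only approximates the integral in the lemma.

```python
import numpy as np

# Symmetric grid with an odd number of points so that x = 0 sits at the centre.
x = np.linspace(-4.0, 4.0, 1601)
dx = x[1] - x[0]

# Illustrative target: the triangular density on [-1, 1].
f = np.maximum(1.0 - np.abs(x), 0.0)


def smooth(density, sigma):
    """Grid approximation of (phi_sigma * density)(x) with a Gaussian phi."""
    kernel = np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return np.convolve(density, kernel, mode="same") * dx


def l2_sq(a, b):
    """Squared L2 distance between two functions sampled on the grid."""
    return float(np.sum((a - b) ** 2) * dx)


sigmas = (0.5, 0.2, 0.05)
errors = [l2_sq(f, smooth(f, s)) for s in sigmas]

for s, e in zip(sigmas, errors):
    print(f"sigma={s}: d_2^2(f, phi_sigma * f) = {e:.6f}")
```

The remaining error concentrates around the kinks of the triangular density, which is where a smooth kernel average differs most from the function itself.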

A.2 Bound on the KL divergence of a Gaussian Mixture in a Given Grid

In Lemma 4, we provide an upper bound on the $L_2$ norm and KL divergence between a Gaussian mixture and a function $f$, when the Gaussian mixture takes a choice of grid points $\tilde{\mu}_m$ and variance $\tilde{\sigma}_m$ that may not be the same as $\mu_m^0$ and $\sigma_m^0$ of Corollary 2.

Lemma 4. Let $\phi_\sigma$ denote the Gaussian distribution function (or product of $k$ independent Gaussian distribution functions), $f_m^0$ and $f$ are as defined in Corollary 2, and $\tilde{f}_m$ is the same function as $f_m^0$, characterized by $(\alpha_m^0, \mu_m^0, \sigma_m^0)$, except it is evaluated in a different $\mu$ and variance $\sigma$, with elements denoted by $\tilde{\mu}_m(i) \in \mathcal{Y} \subset \mathbb{R}^k$, and $\tilde{\sigma}_m \geq \tau > 0$, but with the same mixture weights $\alpha_m^0$. Then

$$d_2^2(f, \tilde{f}_m) \leq \varepsilon + \frac{c}{m} + \frac{1}{4}\left( \max_i \left\{ (\mu_m^0(i) - \tilde{\mu}_m(i))' \tilde{\Sigma}_m^{-1} (\mu_m^0(i) - \tilde{\mu}_m(i)) \right\} + \operatorname{tr}(\tilde{\Sigma}_m^{-1} \Sigma_m^0) - k + \ln \frac{|\tilde{\Sigma}_m|}{|\Sigma_m^0|} \right)$$

and

$$D^{KL}_{\mathcal{Y}}(f \| \tilde{f}_m) \leq \gamma^2 \left( \varepsilon + \frac{c}{m} + \frac{1}{4}\left( \max_i \left\{ (\mu_m^0(i) - \tilde{\mu}_m(i))' \tilde{\Sigma}_m^{-1} (\mu_m^0(i) - \tilde{\mu}_m(i)) \right\} + \operatorname{tr}(\tilde{\Sigma}_m^{-1} \Sigma_m^0) - k + \ln \frac{|\tilde{\Sigma}_m|}{|\Sigma_m^0|} \right) \right)$$

with $\gamma$, $\varepsilon$ and $c$ given in Lemma 1, 2 and 3, respectively, and $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_k)$.

Proof. We use the $L_1$-$L_2$ norm inequality, $d_2(f_m^0, \tilde{f}_m) \leq d_1(f_m^0, \tilde{f}_m)$, and Pinsker’s inequality, $d_1(f_m^0, \tilde{f}_m) \leq \sqrt{\tfrac{1}{2} D^{KL}\left(f_m^0 \| \tilde{f}_m\right)}$. Given that we are comparing two Gaussian mixtures with the same mixture weights, from Do (2003), we obtain the following upper bound:

$$D^{KL}\left(f_m^0 \| \tilde{f}_m\right) \leq \sum_{i=1}^{m} \alpha_i D^{KL}\left(f_m^{0,i} \| \tilde{f}_m^{i}\right),$$

where we denote the $i$th component of the mixture distribution with superscript $i$. By properties of the Gaussian distribution, we have:

$$D^{KL}\left(f_m^{0,i} \| \tilde{f}_m^{i}\right) = \frac{1}{2}\left\{ (\mu_m^0(i) - \tilde{\mu}_m(i))' \tilde{\Sigma}_m^{-1} (\mu_m^0(i) - \tilde{\mu}_m(i)) + \operatorname{tr}(\tilde{\Sigma}_m^{-1} \Sigma_m^0) - k + \ln \frac{|\tilde{\Sigma}_m|}{|\Sigma_m^0|} \right\}.$$

Using that $\sum_i \alpha_i = 1$:

$$
\begin{aligned}
d_2^2(f_m^0, \tilde{f}_m) &\leq \frac{1}{4} \max_i D^{KL}(f_m^{0,i} \| \tilde{f}_m^{i}) \\
&\leq \frac{1}{4}\left( \max_i \left\{ (\mu_m^0(i) - \tilde{\mu}_m(i))' \tilde{\Sigma}_m^{-1} (\mu_m^0(i) - \tilde{\mu}_m(i)) \right\} + \operatorname{tr}(\tilde{\Sigma}_m^{-1} \Sigma_m^0) - k + \ln \frac{|\tilde{\Sigma}_m|}{|\Sigma_m^0|} \right).
\end{aligned}
$$
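The closed-form KL divergence between two Gaussian components used in this proof can be checked numerically. The sketch below implements the formula for diagonal covariance matrices and compares it with a grid-based integral in one dimension. It assumes the arguments are variances (the diagonal entries of $\Sigma_m^0$ and $\tilde{\Sigma}_m$); the function name and the test values are illustrative.

```python
import numpy as np


def gaussian_kl(mu0, var0, mu1, var1):
    """KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) ) in closed form.

    Arguments are means and variances per dimension (assumed diagonal case).
    """
    mu0, var0, mu1, var1 = map(np.atleast_1d, (mu0, var0, mu1, var1))
    k = mu0.size
    diff = mu0 - mu1
    quad = np.sum(diff**2 / var1)                          # (mu0-mu1)' S1^{-1} (mu0-mu1)
    trace = np.sum(var0 / var1)                            # tr(S1^{-1} S0)
    logdet = np.sum(np.log(var1)) - np.sum(np.log(var0))   # ln(|S1| / |S0|)
    return 0.5 * (quad + trace - k + logdet)


def log_normal_pdf(x, mu, var):
    return -0.5 * (x - mu) ** 2 / var - 0.5 * np.log(2.0 * np.pi * var)


# One-dimensional check against numerical integration on a wide grid.
mu_a, var_a = 0.3, 0.5**2
mu_b, var_b = 0.0, 0.8**2
x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]
log_pa = log_normal_pdf(x, mu_a, var_a)
log_pb = log_normal_pdf(x, mu_b, var_b)
kl_numeric = float(np.sum(np.exp(log_pa) * (log_pa - log_pb)) * dx)
kl_closed = float(gaussian_kl(mu_a, var_a, mu_b, var_b))

print(f"closed form: {kl_closed:.8f}")
print(f"numerical:   {kl_numeric:.8f}")
```

Working with log densities inside the integrand avoids taking the logarithm of values that underflow to zero in the tails.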

Combining this with $d_2^2(f, f_m^0) \leq \varepsilon + \frac{c}{m}$ from Corollary 2, and using the triangle inequality for the $L_2$ norm and Lemma 1, we conclude. □

As long as $\sigma_m^0$ and $\tilde{\sigma}_m$ go to zero at the same rate, and the distance between $\mu_m^0$ and $\tilde{\mu}_m$ goes to zero, the expressions in Lemma 4 will converge to those in Corollary 2 and Equation (A.2).

A.3 $m$ Gaussian Mixtures

Lemma 5 extends Lemma 4, applying Lemma 4 to $m$ conditional distributions at the same time.

Lemma 5. Let $\tilde{\mu}_m(i) \in \mathcal{Y} \subset \mathbb{R}^k$, $i = 1, \ldots, m$ and $\tilde{\sigma}_m \geq \tau$ be given. Let $f^i \in \mathcal{F}_{c,\eta}$, for $i = 1, \ldots, m$, denote $m$ distributions and let $f_m^{0,i}$, $i = 1, \ldots, m$, be as in Corollary 2. Let $\tilde{f}_m^i$, $i = 1, \ldots, m$, be the same as $f_m^{0,i}$, but all with the same location parameters $\tilde{\mu}_m$ and scale $\tilde{\sigma}_m \geq \tau$, but their own mixture weights $\alpha_m^{0,i}$. For every $\varepsilon > 0$, there exists $m > 0$ and an $m \times m$ matrix with mixture weights $A_m = [\alpha_m^{0,1}, \ldots, \alpha_m^{0,m}]$ such that for all $j = 1, \ldots, m$

$$D^{KL}_{\mathcal{Y}}(f^j \| \tilde{f}_m^j) \leq \gamma^2 \left( \varepsilon^{\max} + \frac{c^{\max}}{m} + \frac{1}{4} \max_l \left\{ \max_i \left\{ (\mu_m^{0,l}(i) - \tilde{\mu}_m(i))' \tilde{\Sigma}_m^{-1} (\mu_m^{0,l}(i) - \tilde{\mu}_m(i)) \right\} + \operatorname{tr}(\tilde{\Sigma}_m^{-1} \Sigma_m^{0,l}) - k + \ln \frac{|\tilde{\Sigma}_m|}{|\Sigma_m^{0,l}|} \right\} \right) \tag{A.3}$$

and

$$d_2^2(f^j, \tilde{f}_m^j) \leq \varepsilon^{\max} + \frac{c^{\max}}{m} + \frac{1}{4} \max_l \left\{ \max_i \left\{ (\mu_m^{0,l}(i) - \tilde{\mu}_m(i))' \tilde{\Sigma}_m^{-1} (\mu_m^{0,l}(i) - \tilde{\mu}_m(i)) \right\} + \operatorname{tr}(\tilde{\Sigma}_m^{-1} \Sigma_m^{0,l}) - k + \ln \frac{|\tilde{\Sigma}_m|}{|\Sigma_m^{0,l}|} \right\} \tag{A.4}$$

where $\varepsilon^{\max} = \max_l \varepsilon^l$ and $c^{\max} = \max_l c^l$, where $c^l$, $\mu_m^{0,l}$, $\sigma_m^{0,l}$ and $\alpha_m^{0,l}$ are as in Corollary 2:

$$d_2^2(f^l, f_m^{0,l}) \leq \varepsilon^l + \frac{c^l}{m} \tag{A.5}$$

for each $l = 1, \ldots, m$.

Proof. Equation (A.3) follows from applying Corollary 2 to conditional distribution $f^i$, each $i = 1, \ldots, m$, such that Equation (A.5) holds for each of these distributions, except with a

different $\varepsilon^i$ and $c^i$, and at different grids and variances (denoted $\mu_m^{0,i}$ and $\sigma_m^{0,i}$ respectively) for each of the $i = 1, \ldots, m$ conditional distributions. This results in $m$ sets of $m$ mixture weights $\alpha_m^0$.

In addition to showing Equation (A.5) holds, where we use a different grid $\mu_m^{0,l}$ and variance $\sigma_m^{0,l}$ to fit each conditional distribution $l = 1, \ldots, m$, we need to argue this result also goes through when evaluating the KL divergence of these distributions all in the same grid $\tilde{\mu}_m$ and variance $\tilde{\sigma}_m$. For this, we use Lemma 4. This gives us

$$D^{KL}_{\mathcal{Y}}(f^l \| \tilde{f}_m^l) \leq \gamma^2 \left( \varepsilon^l + \frac{c^l}{m} + \frac{1}{4}\left( \max_i \left\{ (\mu_m^{0,l}(i) - \tilde{\mu}_m(i))' \tilde{\Sigma}_m^{-1} (\mu_m^{0,l}(i) - \tilde{\mu}_m(i)) \right\} + \operatorname{tr}(\tilde{\Sigma}_m^{-1} \Sigma_m^{0,l}) - k + \ln \frac{|\tilde{\Sigma}_m|}{|\Sigma_m^{0,l}|} \right) \right)$$

for each $l$, and similarly for the $L_2$ norm, such that we can take the maximum over all these $l = 1, \ldots, m$ to get the expression in Equation (A.3), which then holds for each conditional that is evaluated in one of the $j = 1, \ldots, m$ grid points. □

A.4 Properties of the HMM

Lemma 6. If $p(\mathbf{y}; \theta)$ is as in Assumption (A5) and $\{\mu_m(i)\}_{i=1}^{m}$ and $\sigma_m \geq \tau > 0$ are such that $\exists\, l \in \{1, \ldots, m\}$ s.t. $\phi_i(y_{t-1}) < \eta(\sigma_m)/(m - 1)$ if $i \neq l$, for each $y_{t-1} \in \mathcal{Y}$, and $\sum_{i=1}^{m} \phi_i(y_{t-1}) = K(\sigma_m)$, where $\eta(\sigma_m) \to 0$ as $\sigma_m \to 0$ and $K$ is non-increasing in $\sigma_m$, then for $h \geq 1$, $\log p(y_t \mid y_{t-1}, \ldots, y_{t-h}, \ldots, y_1; \theta)$ is Lipschitz continuous in $y_{t-h}$, and for $h \geq 2$, the Lipschitz constant goes to zero as $m$ grows large.

Proof. First of all, $\log p(y_t \mid y_{t-1}, \ldots, y_1; \theta)$ is everywhere differentiable in $y_{t-1}$. Therefore, to show Lipschitz continuity, we have to show its derivative is bounded.

$$
\begin{aligned}
p(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1; \theta) &= \sum_{j=1}^{m} \left[ p(y_t \mid x_t = j) \sum_{i=1}^{m} \big( P(x_t = j \mid x_{t-1} = i)\, P(x_{t-1} = i \mid y_{t-1}, y_{t-2}, \ldots, y_1; \theta) \big) \right] \\
&= \sum_{j=1}^{m} \left[ \phi_j(y_t) \sum_{i=1}^{m} \big( \Pi_{ij}\, P(x_{t-1} = i \mid y_{t-1}, y_{t-2}, \ldots, y_1) \big) \right]
\end{aligned}
$$

and $p(y_1) = \sum_{j=1}^{m} \left[ \phi_j(y_1)\, \delta_{1j} \right]$.
Here

$$P(x_t = i \mid y_t, y_{t-1}, \ldots, y_1) = \frac{\phi_i(y_t) \sum_{j=1}^{m} \Pi_{ji}\, P(x_{t-1} = j \mid y_{t-1}, \ldots, y_1)}{\sum_{i=1}^{m} \phi_i(y_t) \sum_{j=1}^{m} \Pi_{ji}\, P(x_{t-1} = j \mid y_{t-1}, \ldots, y_1)} := \frac{A_{it}}{B_t}$$

where $P(x_1 = i \mid y_1) = \delta_{1i}\phi_i(y_1) / \sum_{i=1}^{m} \delta_{1i}\phi_i(y_1)$.

We need to evaluate $\partial \log p(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1) / \partial y_{t-h}$. For $h \geq 1$, we have:

$$\frac{\partial \log p(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1; \theta)}{\partial y_{t-h}} = \frac{1}{p(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1; \theta)} \frac{\partial p(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1; \theta)}{\partial y_{t-h}}, \tag{A.6}$$

where

$$\frac{\partial p(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1; \theta)}{\partial y_{t-h}} = \sum_{i=1}^{m} \phi_i(y_t) \sum_{j=1}^{m} \Pi_{ji} \frac{\partial P(x_{t-1} = j \mid y_{t-1}, \ldots, y_1)}{\partial y_{t-h}} \tag{A.7}$$

with, for $h = 1$:

$$\frac{\partial P(x_{t-1} = i \mid y_{t-1}, \ldots, y_1)}{\partial y_{t-1}} = \frac{B_{t-1}\, \phi_i'(y_{t-1}) \sum_{j=1}^{m} \Pi_{ji}\, P(x_{t-2} = j \mid y_{t-2}, \ldots, y_1) - A_{i,t-1} \sum_{l=1}^{m} \phi_l'(y_{t-1}) \sum_{j=1}^{m} \Pi_{jl}\, P(x_{t-2} = j \mid y_{t-2}, \ldots, y_1)}{B_{t-1}^2} \tag{A.8}$$

The expression in Equation (A.6) is bounded and therefore Lipschitz in $y_{t-1}$. First of all, $\frac{1}{p(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1)}$ is bounded from below and finite. $A_{it}$ and $B_t$ are finite, and $\phi'(\cdot)$ is bounded because the Gaussian distribution itself is Lipschitz continuous, so boundedness of the expressions follows.

For $h \geq 2$:

$$\frac{\partial P(x_{t-1} = i \mid y_{t-1}, \ldots, y_1)}{\partial y_{t-h}} = \frac{B_{t-1}\, \phi_i(y_{t-1}) \sum_{j=1}^{m} \Pi_{ji} \frac{\partial P(x_{t-2} = j \mid y_{t-2}, \ldots, y_1)}{\partial y_{t-h}} - A_{i,t-1} \sum_{l=1}^{m} \phi_l(y_{t-1}) \sum_{j=1}^{m} \Pi_{jl} \frac{\partial P(x_{t-2} = j \mid y_{t-2}, \ldots, y_1)}{\partial y_{t-h}}}{B_{t-1}^2} \tag{A.9}$$

and, as this is recursive, we need the expression for $\partial P(x_1 = i \mid y_1) / \partial y_1$, which is given by:

$$\frac{\partial P(x_1 = i \mid y_1)}{\partial y_1} = \frac{\delta_{1i}\phi_i'(y_1) \sum_{j=1}^{m} \delta_{1j}\phi_j(y_1) - \delta_{1i}\phi_i(y_1) \sum_{j=1}^{m} \delta_{1j}\phi_j'(y_1)}{\left( \sum_{j=1}^{m} \delta_{1j}\phi_j(y_1) \right)^2}$$

Define $C_{i,t-1} := \phi_i(y_{t-1}) \sum_{j=1}^{m} \Pi_{ji} \frac{\partial P(x_{t-2} = j \mid y_{t-2}, \ldots, y_1)}{\partial y_{t-h}}$ and $D_{t-1} = \sum_i C_{i,t-1}$. Denote $C_{\text{low},i} = \frac{\eta(\sigma_m)}{m-1} \sum_{j=1}^{m} \Pi_{ji} \frac{\partial P(x_{t-2} = j \mid y_{t-2}, \ldots, y_1)}{\partial y_{t-h}}$ and $A_{\text{low},i} = \frac{\eta(\sigma_m)}{m-1} \sum_{j=1}^{m} \Pi_{ji}\, P(x_{t-2} = j \mid y_{t-2}, \ldots, y_1) < \frac{\eta(\sigma_m)}{m-1}$. We rewrite Equation (A.9) as $(B_{t-1} C_{i,t-1} - A_{i,t-1} D_{t-1}) / B_{t-1}^2$.

By our assumptions, there are two cases. If we are in the case that $i$ and $y_{t-1}$ are such that $\phi_i(y_{t-1}) < \eta(\sigma_m)/(m - 1)$, we have $B_{t-1} C_{i,t-1} < B_{t-1} C_{\text{low},i}$ and $A_{i,t-1} D_{t-1} < A_{\text{low},i} D_{t-1}$. Both $A_{\text{low},i}$ and $C_{\text{low},i}$ are decreasing in $m$, so in this case Equation (A.9) is decreasing in $m$. On the other hand, if $i$ is such that $\phi_i(y_{t-1}) > (K - \eta(\sigma_m))$, we have

$$
\begin{aligned}
B_{t-1} C_{i,t-1} - A_{i,t-1} D_{t-1} &= (B_{t-1} - A_{i,t-1} + A_{i,t-1})\, C_{i,t-1} - A_{i,t-1} (D_{t-1} - C_{i,t-1} + C_{i,t-1}) \\
&= (B_{t-1} - A_{i,t-1})\, C_{i,t-1} - A_{i,t-1} (D_{t-1} - C_{i,t-1}),
\end{aligned}
$$

with $B_{t-1} - A_{i,t-1} < \eta(\sigma_m) \sum_{k \neq i} \sum_{j=1}^{m} \Pi_{jk}\, P(x_{t-2} = j \mid y_{t-2}, \ldots, y_1)$ and $D_{t-1} - C_{i,t-1} < \eta(\sigma_m) \sum_{k \neq i} \sum_{j=1}^{m} \Pi_{jk} \frac{\partial P(x_{t-2} = j \mid y_{t-2}, \ldots, y_1)}{\partial y_{t-h}}$. Both terms in the numerator are decreasing towards zero in $m$. Note that $B_{t-1}$ is bounded by $K(\sigma_m)$. Thus, in both cases, the derivative in Equation (A.9) decreases in $m$, so the Lipschitz coefficient of $\log p(y_t \mid y_{t-1}, \ldots, y_1; \theta_m)$ with respect to $y_{t-h}$, $h \geq 2$, is decreasing in $m$. □

This result is related to Le Gland and Mevel (2000), who show that Hidden Markov Models have exponential forgetting, which in this context means that $\partial p(y_t \mid y_{t-1}, \ldots, y_1; \theta) / \partial y_{t-h}$ declines in $h$ at an exponential rate. However, for our result, exponential forgetting is not sufficient, because we need the Lipschitz constant not only to decline as the history lies further in the past, but also to become smaller as $m$ grows larger, which is what we showed with Lemma 6. Intuitively, this result says that as the number of states grows large enough, and the filter becomes better, our HMM becomes approximately first-order Markov.
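The filtering recursion $P(x_t = i \mid y_t, \ldots, y_1) = A_{it}/B_t$ and the predictive density $B_t = p(y_t \mid y_{t-1}, \ldots, y_1; \theta)$ that appear in the proof of Lemma 6 can be sketched as a forward filter for a scalar Gaussian HMM. The function name, the three-state example, and the simulated data below are assumptions for illustration; the sketch also omits numerical safeguards (such as working in logs) that a production implementation would want.

```python
import numpy as np


def hmm_filter(y, mu, sigma, Pi, delta):
    """Forward filter for a scalar Gaussian HMM.

    y: observations; mu: state-dependent means (the grid); sigma: common
    standard deviation; Pi: row-stochastic transition matrix; delta: initial
    state distribution. Returns filtered probabilities P(x_t = i | y_1..y_t)
    and the log-likelihood sum_t log p(y_t | y_{t-1}, ..., y_1).
    """
    y = np.asarray(y, dtype=float)
    T, m = y.size, mu.size
    filt = np.empty((T, m))
    loglik = 0.0
    pred = delta  # P(x_1 = i) before observing y_1
    for t in range(T):
        phi = np.exp(-0.5 * ((y[t] - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
        A = phi * pred        # A_it: emission density times predicted state probability
        B = A.sum()           # B_t = p(y_t | y_{t-1}, ..., y_1)
        filt[t] = A / B
        loglik += np.log(B)
        pred = filt[t] @ Pi   # sum_j Pi_ji * P(x_t = j | y_1..y_t)
    return filt, loglik


# Illustrative three-state example with data simulated from the HMM itself.
rng = np.random.default_rng(0)
mu = np.array([-1.0, 0.0, 1.0])
sigma = 0.5
Pi = np.array([[0.90, 0.10, 0.00],
               [0.05, 0.90, 0.05],
               [0.00, 0.10, 0.90]])
delta = np.full(3, 1.0 / 3.0)

T = 200
states = np.empty(T, dtype=int)
states[0] = rng.choice(3, p=delta)
for t in range(1, T):
    states[t] = rng.choice(3, p=Pi[states[t - 1]])
y = mu[states] + sigma * rng.standard_normal(T)

filt, loglik = hmm_filter(y, mu, sigma, Pi, delta)
print("log-likelihood:", loglik)
```

Evaluating the same function on the grid and transition matrix produced by another discretization method, treated as $\mu$ and $\Pi$, yields the likelihood that underlies the information-loss comparison described in the main text, up to the re-estimation of $\sigma$.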
Corollary 3. Under the assumptions of Lemma 6, the Hellinger distance between the conditional distribution $p(y_t \mid y_{t-1}, \ldots, y_1; \theta)$ and the Gaussian mixture $p^0(y_t \mid y_{t-1}; \theta) := \sum_{j=1}^{m} \phi_j(y_t)\Pi_{lj}$ with mixture weights $\{\Pi_{lj}\}_{j=1}^{m}$, with $l$ as in Lemma 6, approaches zero as $m$ becomes large.

Note that $p(y_t \mid y_{t-1}, \ldots, y_1; \theta)$ is a Gaussian mixture with convex mixture weights $\sum_{i=1}^{m} \left(\Pi_{ij} P(x_{t-1} = i \mid y_{t-1}, y_{t-2}, \ldots, y_1)\right)$, $j = 1, \ldots, m$. From Lemma 6, the $l_2$-norm between the mixture weights $\sum_{i=1}^{m} \left(\Pi_{ij} P(x_{t-1} = i \mid y_{t-1}, y_{t-2}, \ldots, y_1)\right)$ and $\Pi_{lj}$ goes to zero when $m$ is large. From this, it follows that $d_2\left(p(y_t \mid y_{t-1}, \ldots, y_1; \theta),\, p^0(y_t \mid y_{t-1}; \theta)\right)$ is decreasing in $m$.

A.5 The KL divergence is a function of all conditional KL divergences

Lemma 7. Under assumptions (A1) and (A5), if $D^{KL}_{\mathcal{Y}}\left(f(y_t \mid y_{t-1}) \,\|\, p(y_t \mid y_{t-1}, \ldots, y_1; \theta_m)\right)$ is bounded and can be made arbitrarily small for any sequence $\{y_k\}_{k=1}^{t-1}$ for all $t$, then $D^{KL}_{\mathcal{Y}}\left(f(\mathbf{y}) \,\|\, p(\mathbf{y}; \theta_m)\right)$ is also bounded and can be made arbitrarily small by picking $m$ large.

Proof. The first-order Markov assumption on the true DGP of $\mathbf{y}$ implies $f(y_t \mid y_{t-1}, \ldots, y_1) = f(y_t \mid y_{t-1})$, such that we can write

$$f(\mathbf{y}) = f(y_1) \prod_{t=2}^{T} f(y_t \mid y_{t-1}),$$

where $f(y_1)$ denotes some initial distribution. Hidden Markov Models do not satisfy the Markov property for $\mathbf{y}$. We have

$$p(\mathbf{y}; \theta) = p(y_1; \theta) \prod_{t=2}^{T} p(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1; \theta),$$

with $p(y_1; \theta)$ again the initial distribution. The KL divergence for $T$ observations is given by

$$\int_{\mathcal{Y}} f(\mathbf{y}) \log\left(\frac{f(\mathbf{y})}{p(\mathbf{y}; \theta)}\right) d\mathbf{y} = \int \cdots \int f(y_1) \prod_{t=2}^{T} f(y_t \mid y_{t-1}) \log\left(\frac{f(y_1) \prod_{t=2}^{T} f(y_t \mid y_{t-1})}{p(y_1 \mid \theta)\, p(y_2 \mid y_1; \theta) \cdots p(y_T \mid y_{T-1}, \ldots, y_1; \theta)}\right) dy_T\, dy_{T-1} \ldots dy_1.$$

Taking the logarithm of the ratio turns the two products into a sum of log-ratios, and integrating each summand over the observations it does not depend on shows that the KL divergence can be written as:

$$\int_{\mathcal{Y}} f(\mathbf{y}) \log\left(\frac{f(\mathbf{y})}{p(\mathbf{y}; \theta)}\right) d\mathbf{y} = D^{KL}_{\mathcal{Y}}\left(f(y_1) \,\|\, p(y_1 \mid \theta)\right) + \sum_{t=2}^{T} \int_{\mathcal{Y}} f(\mathbf{y}_{1:t-1})\, D^{KL}_{\mathcal{Y}}\left(f(y_t \mid y_{t-1}) \,\|\, p(y_t \mid y_{t-1}, \ldots, y_1; \theta)\right) d\mathbf{y}_{1:t-1}.$$

Note that $f(\mathbf{y}_{1:t-1})$ integrates to 1 and $D^{KL}_{\mathcal{Y}}$ is non-negative. This implies that if $D^{KL}_{\mathcal{Y}}\left(f(y_t \mid y_{t-1}) \,\|\, p(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1; \theta)\right) \to 0$ for all $y_t, \ldots, y_1$, and all $t > 1$, then $D^{KL}_{\mathcal{Y}}\left(f(\mathbf{y}) \,\|\, p(\mathbf{y}; \theta)\right) \to 0$. □

A.6 Proof of Main Theorem

Main Theorem. Under assumptions (A1)-(A5), given a sufficiently large number of grid points $m$, there exist a set of grid points $\mu_m \in \mathcal{Y}$, variance $\sigma_m \geq \tau > 0$ and transition probability matrix $\Pi_m$, collected in $\theta_m = (\mu_m, \Pi_m, \sigma_m)$, such that the KL divergence between $f(\mathbf{y})$ and $p(\mathbf{y}; \theta)$ on the compact subset $y \in \mathcal{Y} \subset \mathbb{R}^k$, given by

$$D^{KL}_{\mathcal{Y}}\left(f(\mathbf{y}) \,\|\, p(\mathbf{y}; \theta)\right) = \int_{\mathcal{Y}} f(\mathbf{y}) \log \frac{f(\mathbf{y})}{p(\mathbf{y}; \theta)}\, d\mathbf{y},$$

can be made arbitrarily small.

Proof. By Corollary 3, as $m$ becomes large, $d_2\left(p(y_t \mid y_{t-1}, \ldots, y_1; \theta_m),\, p^0(y_t \mid y_{t-1}; \theta)\right)$ goes to zero. Next, we apply Lemma 5 to the $m$ conditional distribution functions $f(y_t \mid y_{t-1} = \mu_m(i))$ and $p^0(y_t \mid y_{t-1} = \mu_m(i); \theta_m)$ for $i = 1, \ldots, m$. By Lemma 5, the $L_2$ norm between these $m$ conditional distributions is bounded and becomes arbitrarily small as $m$ becomes large. This holds for a grid $\mu_m$ on $\mathcal{Y}$ for which the grid points get closer together as $m$ grows larger, and $\sigma_m \geq \tau > 0$ approaches zero. By the triangle inequality,

$$d_2\left(f(y_t \mid y_{t-1} = \mu_m(i)),\, p(y_t \mid y_{t-1} = \mu_m(i), \ldots, y_1; \theta_m)\right) \leq d_2\left(p^0(y_t \mid y_{t-1} = \mu_m(i); \theta_m),\, p(y_t \mid y_{t-1} = \mu_m(i), \ldots, y_1; \theta_m)\right) + d_2\left(p^0(y_t \mid y_{t-1} = \mu_m(i); \theta_m),\, f(y_t \mid y_{t-1} = \mu_m(i))\right).$$

Together with Lemma 1, this implies that $D^{KL}_{\mathcal{Y}}\left(f(y_t \mid y_{t-1} = \mu_m(i)) \,\|\, p(y_t \mid y_{t-1} = \mu_m(i), \ldots, y_1; \theta_m)\right)$ approaches zero when $m$ becomes large.

Next, we show that when the KL divergence of the distribution conditional on $y_{t-1}$ being one of the $m$ grid points, i.e., $y_{t-1} = \mu_m(i)$, becomes arbitrarily small as $m$ becomes large, then the KL divergence of distributions conditional on any $y_{t-1}, y_{t-2}, \ldots, y_1$ in the compact set $\mathcal{Y}$ also becomes small. By Assumptions (A3)-(A4), and Lemma 6, we have

$$D^{KL}_{\mathcal{Y}}\left(f(y_t \mid y_{t-1} = y) \,\|\, p(y_t \mid y_{t-1} = y, y_{t-2}, \ldots, y_1; \theta_m)\right) \leq D^{KL}_{\mathcal{Y}}\left(f(y_t \mid y_{t-1} = \mu_m(i)) \,\|\, p(y_t \mid y_{t-1} = \mu_m(i), y_{t-2}, \ldots, y_1; \theta_m)\right) + O\left(K_p |y - \mu_m(i)|,\, K_f |y - \mu_m(i)|,\, K_{\log f} |y - \mu_m(i)|\right).$$

Here $K_p$ denotes the Lipschitz coefficient of $p(y_t \mid y_{t-1}, \ldots, y_1; \theta_m)$ in $y_{t-1}$, $K_f$ denotes the Lipschitz coefficient of $f(y_t \mid y_{t-1})$ in $y_{t-1}$, and $K_{\log f}$ the Lipschitz coefficient of $\log f(y_t \mid y_{t-1})$ in $y_{t-1}$. Note here that the relevant $\mu_m(i)$ to consider is the one closest to $y$. $O\left(K_p |y - \mu_m(i)|,\, K_f |y - \mu_m(i)|,\, K_{\log f} |y - \mu_m(i)|\right)$ denotes some function increasing in each of its arguments. These three terms converge to zero as the grid points move closer together, because then the maximum distance $|y - \mu_m(i)|$ also goes to zero, so $O(\cdot)$ will also converge to zero.

By Lemma 6, if $D^{KL}_{\mathcal{Y}}\left(f(y_t \mid y_{t-1}) \,\|\, p(y_t \mid y_{t-1}, \{y_{t-k}\}_{k=2}^{t-1}; \theta_m)\right)$ can be made arbitrarily small for $m$ large enough, then $D^{KL}_{\mathcal{Y}}\left(f(y_t \mid y_{t-1}) \,\|\, p(y_t \mid y_{t-1}, \{\tilde{y}_{t-k}\}_{k=2}^{t-1}; \theta_m)\right)$ is arbitrarily small for any other sequence $\{\tilde{y}_{t-k}\}_{k=2}^{t-1}$, because $\log p(y_t \mid y_{t-1}, \{y_{t-k}\}_{k=2}^{t-1}; \theta_m)$ is Lipschitz continuous in $\{y_{t-k}\}_{k=2}^{t-1}$ with a coefficient that goes to zero as $m$ becomes large. This implies that $D^{KL}_{\mathcal{Y}}\left(f(y_t \mid y_{t-1}) \,\|\, p(y_t \mid y_{t-1}, \{\tilde{y}_{t-k}\}_{k=2}^{t-1}; \theta_m)\right)$ goes to zero when $m$ becomes large, for all such sequences and all $t \geq 2$.

For the initial distribution, the parameters $\delta_{1j}$ function as mixture weights, where $p(y_1) = \sum_{j=1}^{m} \phi_j(y_1)\delta_{1j}$ is also a mixture of Gaussians. Applying Lemma 4 shows this KL divergence is also bounded and can be made arbitrarily small.

Applying Lemma 7 to the conditional KL divergences concludes the proof. □

B Estimation procedures

B.1 Estimation of HMMs using the EM algorithm

We first discuss the general procedure we use for the estimation of the HMM. We omit the panel data dimension and assume all parameters are constant. Let $\phi_j(y_t) = P(y_t \mid x_t = j)$ denote the density of $y_t$ conditional on $x_t$ being in state $j$. That is,

$$\phi_j(y_t) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2\sigma^2}(y_t - \mu(j))^2\right), \tag{B.1}$$

if $k = 1$, or $\det(2\pi\Sigma)^{-1/2} \exp\left(-\tfrac{1}{2}(y_t - \mu(j))'\,\Sigma^{-1}(y_t - \mu(j))\right)$ for $k > 1$, where $\Sigma = \operatorname{diag}(\sigma^2)$. It will be useful to think of the following matrix form for the observation densities:

$$\boldsymbol{\Phi}(y_t) = \begin{pmatrix} \phi_1(y_t) & & 0 \\ & \ddots & \\ 0 & & \phi_m(y_t) \end{pmatrix}, \tag{B.2}$$

that is, $\boldsymbol{\Phi}$ is an $m \times m$ diagonal matrix with the observation densities as diagonal elements.

Denote bold variables $\mathbf{y} = \{y_1, y_2, \ldots, y_T\}$ and $\mathbf{x} = \{x_1, x_2, \ldots, x_T\}$ as realizations of this random process. The complete data likelihood of the model in Equation (2) is given by

$$\mathcal{L}(\theta \mid \mathbf{y}, \mathbf{x}) = p(\mathbf{y}, \mathbf{x} \mid \theta) = p(\mathbf{y} \mid \mathbf{x}, \theta)\, p(\mathbf{x} \mid \theta), \tag{B.3}$$

and the maximum likelihood estimator is given by

$$\theta^* = \arg\max_{\theta} \mathcal{L}(\theta \mid \mathbf{y}, \mathbf{x}). \tag{B.4}$$

If the latent states $\mathbf{x}$ were observed, the log-likelihood would be straightforward to maximize. This is because the log-likelihood is given by

$$\log(\mathcal{L}(\theta \mid \mathbf{y}, \mathbf{x})) = \log(p(\mathbf{y} \mid \mathbf{x}, \theta)) + \log(p(\mathbf{x} \mid \theta)), \tag{B.5}$$

and, conditional on $\mathbf{x}$, the parameters $\Pi$ do not influence $\mathbf{y}$ and, similarly, the parameters $(\boldsymbol{\mu}, \sigma)$ do not matter for $\mathbf{x}$. Together, this implies the log-likelihood is given by

$$\log(\mathcal{L}(\theta \mid \mathbf{y}, \mathbf{x})) = \log(p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\mu}, \sigma)) + \log(p(\mathbf{x} \mid \Pi)). \tag{B.6}$$

That is, the parameters governing the observation equation and state transition equation could be solved for separately, given $\mathbf{x}$. Intuitively, if the states $\mathbf{x}$ are observed, one could estimate $\Pi$ using only data on transitions from $\mathbf{x}$, estimate $\mu(j)$ by averaging the $y_t$ that are observed when $x_t$ is in state $j$, and then estimate $\Sigma$ using the sample variance of the observations $\mathbf{y}$ demeaned by the estimates of $\boldsymbol{\mu}$.

In practice, the latent states $\mathbf{x}$ are unobservable, but we can use the EM algorithm to maximize the likelihood. The EM algorithm iterates between updating the posterior distribution over the latent states $p_{\mathbf{x}} = p(\mathbf{x} \mid \mathbf{y}, \theta)$, taking the parameters and observations $(\mathbf{y}, \theta)$ as fixed in the E step, and updating the parameters $\theta^{(i)} \to \theta^{(i+1)}$, taking the latent states and observations $(p_{\mathbf{x}}, \mathbf{y})$ as fixed in the M step.

We now describe the E-step. Let $y^t = (y_1, y_2, \ldots, y_t)$, i.e., the observed values up to time $t$. Similarly, let $y^T_{t+1} = (y_{t+1}, y_{t+2}, \ldots, y_T)$, i.e., the observed values from time $t+1$ to $T$. The forward probabilities $\alpha_t(j)$ are given by

$$\alpha_t(j) = p\left(y^t, x_t = j \mid \theta\right) \tag{B.7}$$

and the backward probabilities $\beta_t(k)$ are given by

$$\beta_t(k) = p\left(y^T_{t+1} \mid x_t = k, \theta\right). \tag{B.8}$$

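These are defined recursively as:

$$\alpha_1(j) = \delta_{1,j}\,\phi_j(y_1), \qquad \beta_T(k) = 1, \tag{B.9}$$

$$\alpha_{t+1}(j) = \left(\sum_{k=1}^{m} \alpha_t(k)\Pi_{kj}\right)\phi_j(y_{t+1}), \qquad \beta_t(k) = \sum_{j=1}^{m} \Pi_{kj}\,\phi_j(y_{t+1})\,\beta_{t+1}(j),$$

or in matrix form

$$\alpha_t = \alpha_{t-1}\,\Pi\,\boldsymbol{\Phi}(y_t) \quad \text{and} \quad \beta_t' = \Pi\,\boldsymbol{\Phi}(y_{t+1})\,\beta_{t+1}'. \tag{B.10}$$

Using these probabilities, we can define the probability of being in state $k$ at time $t$, and observing $\mathbf{y}$, as

$$p(\mathbf{y}, x_t = k \mid \theta) = \alpha_t(k)\,\beta_t(k). \tag{B.11}$$

This leads to a posterior probability of being in state $k$, given by

$$\gamma_t(k) = p(x_t = k \mid \mathbf{y}, \theta) = \frac{p(\mathbf{y}, x_t = k \mid \theta)}{p(\mathbf{y} \mid \theta)} = \frac{p(\mathbf{y}, x_t = k \mid \theta)}{\sum_{j=1}^{m} p(\mathbf{y}, x_t = j \mid \theta)} = \frac{\alpha_t(k)\,\beta_t(k)}{\sum_{j=1}^{m} \alpha_t(j)\,\beta_t(j)}. \tag{B.12}$$

We can also define the posterior transition probability between state $k$ at time $t$ and state $j$ at time $t+1$ as

$$\xi_t(k, j) = p(x_{t+1} = j, x_t = k \mid \mathbf{y}, \theta) \propto \beta_{t+1}(j)\,\phi_j(y_{t+1})\,\Pi_{kj}\,\alpha_t(k), \tag{B.13}$$

where the last line follows from the definition of $\gamma_t(k)$ from above.

At last, the M step is given by

$$\mu_l(j) = \frac{\sum_{t=1}^{T} y_{l,t}\, p(x_t = j \mid \mathbf{y}, \theta)}{\sum_{t=1}^{T} p(x_t = j \mid \mathbf{y}, \theta)} = \frac{\sum_{t=1}^{T} y_{l,t}\,\gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} \tag{B.14}$$

$$(\sigma_l)^2 = \frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{m} (y_{l,t} - \mu_l(j))^2\,\gamma_t(j) \tag{B.15}$$

$$\Pi_{qj} = \frac{\sum_{t=2}^{T} p(x_t = j, x_{t-1} = q \mid \mathbf{y}, \theta)}{\sum_{t=2}^{T} p(x_{t-1} = q \mid \mathbf{y}, \theta)} = \frac{\sum_{t=2}^{T} \xi_{t-1}(q, j)}{\sum_{t=2}^{T} \gamma_{t-1}(q)}, \tag{B.16}$$

for $l = 1, \ldots, k$, with subscripts denoting the different elements of the vector $y_t = (y_{1t}, \ldots, y_{kt})$.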
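The recursions in Equations (B.9)-(B.16) can be illustrated with a short NumPy sketch. It is not the authors' code: it assumes the univariate case ($k = 1$) with a single time series, holds the initial probabilities `delta` fixed within the step, and rescales the forward and backward probabilities each period to avoid numerical underflow, a standard device that leaves $\gamma_t$ and $\xi_t$ unchanged. The function names `gaussian_densities` and `em_step` are hypothetical.

```python
import numpy as np

def gaussian_densities(y, mu, sigma):
    """T x m matrix whose (t, j) entry is phi_j(y_t) from Equation (B.1), k = 1."""
    z = (y[:, None] - mu[None, :]) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))

def em_step(y, mu, sigma, Pi, delta):
    """One EM iteration: scaled forward-backward E-step, then the M-step.

    Returns the updated (mu, sigma, Pi), the posteriors gamma, and the
    log-likelihood evaluated at the *input* parameters.
    """
    T, m = len(y), len(mu)
    phi = gaussian_densities(y, mu, sigma)

    # Forward recursion (B.9), renormalized each period (assumption: scaling
    # is added here for numerical stability and is not in the text).
    alpha = np.zeros((T, m))
    scale = np.zeros(T)
    alpha[0] = delta * phi[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ Pi) * phi[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward recursion (B.9) with the same scaling constants.
    beta = np.ones((T, m))
    for t in range(T - 2, -1, -1):
        beta[t] = Pi @ (phi[t + 1] * beta[t + 1]) / scale[t + 1]

    # Posterior state probabilities (B.12) and transition probabilities (B.13).
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = alpha[:-1, :, None] * Pi[None, :, :] * (phi[1:] * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)

    # M-step (B.14)-(B.16).
    mu_new = (gamma * y[:, None]).sum(axis=0) / gamma.sum(axis=0)
    sigma_new = np.sqrt((gamma * (y[:, None] - mu_new[None, :])**2).sum() / T)
    Pi_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]

    loglik = np.log(scale).sum()
    return mu_new, sigma_new, Pi_new, gamma, loglik
```

Calling `em_step` repeatedly, feeding the returned parameters back in, should produce a non-decreasing sequence of log-likelihood values, which is a convenient check on any implementation of these formulas.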

Given the updated transition matrix $\Pi$ we can update the stationary probabilities as

$$\delta = \mathbf{1}'(I_m - \Pi + U)^{-1}. \tag{B.17}$$

Here $U$ is an $m \times m$ matrix of ones.

Note that this setting can be adapted to allow for the discretization of age-dependent earnings processes, with age-dependent transition probabilities and grid placement. In this case, the asymptotics depend on $N$ and a different transition probability matrix and grid are estimated for every age group. However, in practice this implies the estimation of many parameters, which is why, for the estimation of models with rich life-cycle dynamics, we will use an iterative adaptation of this algorithm. This algorithm is described in Appendix Section B.2. The idea behind this algorithm is that it only uses data on two time periods at a time, but passes the estimates for the filtered states on to the next time period.

B.2 A multi-step EM algorithm for HMM

In this subsection, we outline the multi-step EM algorithm we use for the estimation of the HMM in case of life-cycle dynamics, where the transition matrix $\Pi_t$ and grid $\mu_t$ are allowed to vary over the life-cycle. The large number of parameters to be estimated here requires $N$ to be large, and the EM algorithm has to converge for many parameters. A multi-step algorithm provides more stability.

Assume a panel of $y_{it} \in \mathbb{R}^k$, $t = 1, \ldots, T$ and $i = 1, \ldots, N$. Assume a given grid size $m$.

Initialization:

• Estimate a Gaussian Mixture Model on $y_{i1}$, $i = 1, \ldots, N$. This gives a grid for the first time period and iteration, $\mu_1^1$, stationary probabilities $\delta_1^1$ and the filtered probabilities $\alpha_1^1$. Set iteration $j = 1$.

We have a forward and backward step. For the forward step, set $t = 1$ and:

• Estimate the HMM of Section B for $(y_{it}, y_{it+1})$, $i = 1, \ldots, N$, restricting the grid of time period $t$ to $\mu_t^j$, the stationary probabilities of time period $t$ to $\delta_t^j$, and the forward probabilities to $\alpha_t^j$ (except for $t = 1$, in which case they follow from Equation (B.9)). For $j > 1$, also restrict the backward probabilities for $t+1$ to those obtained from the backward step, $\beta_{t+1}^{j-1}$, else set them to 1. Estimate and store the grid $\mu_{t+1}^j$, the transition probability matrix $\Pi_t^j$, stationary probabilities $\delta_{t+1}^j$, and forward probabilities $\alpha_{t+1}^j$. Set $t = t+1$ and repeat up until and including $t = T-1$.

For the backward step, set $t = T$ and:

• Estimate the HMM of Section B for $(y_{it-1}, y_{it})$, $i = 1, \ldots, N$, restricting the grid of time period $t$ to $\mu_t^j$, the stationary probabilities of time period $t$ to $\delta_t^j$, the forward probabilities to $\alpha_t^j$, and the backward probabilities to $\beta_t^j$ (for $t < T$). When $t = T$, all of these (except the backward probabilities) come from the last time period of the forward step. Estimate and store the grid $\mu_{t-1}^j$, the transition probability matrix $\Pi_{t-1}^j$, stationary probabilities $\delta_{t-1}^j$, and backward probabilities $\beta_{t-1}^j$. Set $t = t-1$ and repeat up until and including $t = 2$.

One can iterate multiple times between the forward and backward step until they stabilize. In that case, update $j = j+1$.

C Discretization of a VAR process

In this Appendix, we demonstrate the performance of our method for discretizing a bivariate VAR model of the form

$$y_{1,t} = \beta_{11}\, y_{1,t-1} + \beta_{12}\, y_{2,t-1} + \varepsilon_{1,t} \tag{C.1}$$

$$y_{2,t} = \beta_{21}\, y_{1,t-1} + \beta_{22}\, y_{2,t-1} + \varepsilon_{2,t}, \tag{C.2}$$

where $\varepsilon_t \sim N(0, \Sigma)$.

We consider two different parametrizations but keep the grid size fixed to $m = 25$ to show how our discretization method optimally selects the grid. The optimal grids are visualized in Figure C1. As opposed to a tensor grid, our optimal grid incorporates the structure of the process into the grid. For example, in a VAR model where both variables are positively correlated ($\beta_{12} = \beta_{21} > 0$), if $y_1$ is large, $y_2$ is also likely large. Figure C1b shows how this is reflected in our optimal grid, while a standard tensor grid as in Figure C1c does not reflect this dependence.

Table C1 summarizes the performance of our discretization compared to the discretization of Farmer and Toda (2017) for two different parametrizations of the VAR model in Equation (C.1). In their discretization, Farmer and Toda (2017) target the first four conditional moments.

Figure C1: Visualisation of the optimal grid for two different parametrizations of the data generating process in Equation (C.1), $m = 25$. Panels: (a) JM grid: uncorrelated $y_1, y_2$; (b) JM grid: positively correlated $y_1, y_2$; (c) Tensor grid. Both axes run from -2 to 2 in every panel.

Notes: For both parametrizations, $\Sigma = \operatorname{diag}(0.1)$. Panel (a)/(c): $\beta_{11} = 0.7$, $\beta_{12} = 0$, $\beta_{21} = 0$, $\beta_{22} = 0.7$. Panel (b)/(c): $\beta_{11} = 0.7$, $\beta_{12} = 0.2$, $\beta_{21} = 0.2$, $\beta_{22} = 0.7$. JM stands for Janssens-McCrary.

As Table C1 shows, the Farmer and Toda (2017) discretization outperforms ours in the first two conditional and unconditional moments, but for higher-order moments, our method tends to be closer to the true process. Our method also has a smaller mean squared forecast error.

Table C1: Comparison for the VAR model in Equation (C.1) for $m = 25$ ($m_{y_1} = 5$, $m_{y_2} = 5$ for Farmer-Toda).

Method                               Janssens-McCrary   Farmer-Toda

Parametrization 1
Abs. dev. uncond. mean y                   0.103          < 0.001
% dev. uncond. variance y                 -0.042            0.083
% dev. autocorrelation y                  -0.264            0.111
Abs. dev. uncond. skewness y              -0.046            0.009
% dev. uncond. kurtosis y                  0.097           -0.049
Abs. dev. correlation(y_1, y_2)            0.059            0.007
Abs. dev. cond. mean y                     0.041          < 0.001
% abs. dev. cond. variance y              37.8             23.3
% abs. dev. cond. skewness y               0.466            0.256
% abs. dev. cond. kurtosis y              49.3             12.7
MSFE y                                     0.133            0.109

Parametrization 2
Abs. dev. uncond. mean y                   0.144          < 0.001
% dev. uncond. variance y                 -0.122            0.034
% dev. autocorrelation y                  -0.110            0.074
Abs. dev. uncond. skewness y               0.278           -0.046
% dev. uncond. kurtosis y                 -0.030           -0.015
Abs. dev. correlation(y_1, y_2)            0.093           -0.007
Abs. dev. cond. mean y                     0.030          < 0.001
% abs. dev. cond. variance y              38.3              9.86
% abs. dev. cond. skewness y               0.609            0.337
% abs. dev. cond. kurtosis y              81.7             43.4
MSFE y                                     0.211            0.120

Notes: Parametrization 1: $\beta_{11} = 0.7$, $\beta_{12} = 0.2$, $\beta_{21} = 0.0$, $\beta_{22} = 0.7$. Parametrization 2: $\beta_{11} = 0.7$, $\beta_{12} = 0.2$, $\beta_{21} = 0.2$, $\beta_{22} = 0.7$. The statistics average over $y_1$ and $y_2$.
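Several of the unconditional statistics reported in Table C1 are functions of a discretization's grid and transition matrix alone. The following NumPy sketch shows one way such moments could be computed for a univariate grid; it is an illustration under stated assumptions, not the procedure used to produce the table (which may, for instance, rely on simulation). The stationary distribution uses the formula in Equation (B.17); the function names `stationary_distribution` and `chain_moments` are hypothetical.

```python
import numpy as np

def stationary_distribution(Pi):
    """Stationary probabilities via delta = 1'(I - Pi + U)^{-1}, Equation (B.17)."""
    m = Pi.shape[0]
    A = np.eye(m) - Pi + np.ones((m, m))
    # delta A = 1'  is equivalent to  A' delta' = 1, solved without an explicit inverse.
    return np.linalg.solve(A.T, np.ones(m))

def chain_moments(grid, Pi):
    """Unconditional moments of a finite-state Markov chain on a scalar grid."""
    delta = stationary_distribution(Pi)
    mean = delta @ grid
    dev = grid - mean
    var = delta @ dev**2
    return {
        "mean": mean,
        "variance": var,
        "skewness": delta @ dev**3 / var**1.5,
        "kurtosis": delta @ dev**4 / var**2,
        # Pi @ dev is the expected next-period deviation conditional on each state.
        "autocorrelation": delta @ (dev * (Pi @ dev)) / var,
    }
```

For a symmetric two-state chain with grid $(-1, 1)$ and a probability of 0.9 of staying in the current state, this gives a mean of 0, a variance of 1, and a first-order autocorrelation of 0.8. Deviations of these quantities from the moments of the continuous process are what rows such as "% dev. uncond. variance" measure.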

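D Asset Pricing Model with Stochastic Volatility

D.1 A closed-form solution

From De Groot (2015), we obtain closed-form expressions for the asset pricing model with stochastic volatility presented in Equations (7)-(8). The solution for the price-dividend ratio is given by:

$$v_t = \sum_{i=1}^{\infty} \beta^i \exp\left(B_i y_t + C_i \bar{\eta} + D_i (\eta_t - \bar{\eta}) + H_i\right),$$

where

$$B_i = \left(\frac{1-\gamma}{1-\rho}\right)\rho(1-\rho^i)$$

$$C_i = \frac{1}{2}\left(\frac{1-\gamma}{1-\rho}\right)^2 \left(i - 2\rho\,\frac{1-\rho^i}{1-\rho} + \rho^2\,\frac{1-\rho^{2i}}{1-\rho^2}\right)$$

$$D_i = \frac{\rho_\eta}{2}\left(\frac{1-\gamma}{1-\rho}\right)^2 \left(\phi_1 + \phi_2\,\rho_\eta\,\rho_\eta^{i-1} + \phi_3\,\rho^{i-1} + \phi_4\,\rho^{2(i-1)}\right)$$

$$H_i = F_i\,\omega^2$$

where

$$F_i = \frac{1}{8}\left(\frac{1-\gamma}{1-\rho}\right)^4 \Bigg( i\phi_1^2 + \phi_2^2\,\frac{1-\rho_\eta^{2i}}{1-\rho_\eta^2} + \phi_3^2\,\frac{1-\rho^{2i}}{1-\rho^2} + \phi_4^2\,\frac{1-\rho^{4i}}{1-\rho^4} + 2\phi_1\phi_2\,\frac{1-\rho_\eta^i}{1-\rho_\eta} + 2\phi_1\phi_3\,\frac{1-\rho^i}{1-\rho} + 2\phi_1\phi_4\,\frac{1-\rho^{2i}}{1-\rho^2} + 2\phi_2\phi_3\,\frac{1-(\rho_\eta\rho)^i}{1-\rho_\eta\rho} + 2\phi_2\phi_4\,\frac{1-(\rho_\eta\rho^2)^i}{1-\rho_\eta\rho^2} + 2\phi_3\phi_4\,\frac{1-\rho^{3i}}{1-\rho^3} \Bigg)$$

and

$$\phi_1 = \frac{1}{1-\rho_\eta}, \qquad \phi_2 = \frac{-\rho_\eta(\rho_\eta+\rho)(1-\rho)^2}{(\rho^2-\rho_\eta)(\rho-\rho_\eta)(1-\rho_\eta)}, \qquad \phi_3 = \frac{-2\rho^2}{\rho-\rho_\eta}, \qquad \phi_4 = \frac{\rho^4}{\rho^2-\rho_\eta}.$$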

The conditional expected return on equity is defined as

$$E_t R^e_{t+1} = E_t\left(\frac{d_{t+1} + p_{t+1}}{p_t}\right) = \frac{E_t \exp(y_{t+1}) + E_t\, v_{t+1}\exp(y_{t+1})}{v_t}.$$

The solution to this expression gives that

$$E_t \exp(y_{t+1}) = \exp\left(\rho y_t + \frac{1}{2}\bar{\eta} + \frac{\rho_\eta}{2}(\eta_t - \bar{\eta}) + \frac{1}{8}\omega^2\right)$$

and

$$E_t\, v_{t+1}\exp(y_{t+1}) = \sum_{i=1}^{\infty} \beta^i \exp\left((B_i+1)\rho y_t + \left(C_i + \frac{1}{2}(B_i+1)^2\right)\bar{\eta} + \frac{1}{2}(B_i+1)^2\rho_\eta(\eta_t - \bar{\eta}) + \left(F_i + \frac{1}{2}\left(\frac{1}{2}(B_i+1)^2 + D_i\right)^2\right)\omega^2\right).$$

As shown by De Groot (2015), there is a parameter restriction that guarantees a finite price-dividend ratio:

$$\beta \exp\left(\frac{1}{2}\left(\frac{1-\gamma}{1-\rho}\right)^2 \bar{\eta} + \frac{(1-\gamma)^4}{8(1-\rho)^4(1-\rho_\eta)^2}\,\omega^2\right) < 1.$$

We chose our parametrization of $\beta$ and $\gamma$ such that this condition is satisfied.

D.2 A discretized solution

Instead of solving the model using the continuous-support process in Equations (7)-(8), one can discretize the stochastic process and obtain approximate solutions for the price-dividend ratio, the conditional expected return on equity, and other objects of interest. If $y_t$ follows a discrete-state-space first-order Markov process with states $y_s$, $s \in \{1, \ldots, m\}$, and transition probability matrix $\Pi$ with elements $\Pi_{ss'} = P(y_{t+1} = y_{s'} \mid y_t = y_s)$, then we can rewrite Equation (9) as

$$v(y_s) = \beta \sum_{s'=1}^{m} \exp((1-\gamma)y_{s'})\,(v(y_{s'}) + 1)\,\Pi_{ss'}$$

which solves to

$$v = \left(I_m - \beta\,\Pi\,\operatorname{diag}(\exp((1-\gamma)y))\right)^{-1} \beta\,\Pi\,\exp((1-\gamma)y), \tag{D.1}$$

where $m$ denotes the number of discrete states of $y_t$, $y$ is an $m \times 1$ vector with all the levels $y_t$ attains, and $v$ is an $m \times 1$ vector with the price-dividend ratio at each discrete realization of $y$. Similarly, for the vector of conditional expected returns on equity at each value of the grid $y_s$, denoted $R^e(y_s)$, we have

$$R^e(y_s) = \left(\sum_{s'} \Pi_{ss'} \exp(y_{s'})\,(1 + v(y_{s'}))\right) \Big/ v(y_s). \tag{D.2}$$
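Equations (D.1) and (D.2) amount to one linear solve and one matrix-vector product. The sketch below illustrates this with NumPy; it is a minimal example under stated assumptions rather than the authors' implementation. The function names are hypothetical, the grid is taken to be scalar, and the payoff in the return formula uses next-period dividend growth $\exp(y_{s'})$, following the definition $E_t[(d_{t+1} + p_{t+1})/p_t]$ in Section D.1.

```python
import numpy as np

def price_dividend_ratio(grid, Pi, beta, gamma):
    """Solve (I - beta * Pi * diag(exp((1-gamma) y))) v = beta * Pi * exp((1-gamma) y)."""
    m = len(grid)
    growth = np.exp((1.0 - gamma) * grid)
    # Pi * growth[None, :] scales column s' by growth[s'], i.e. Pi @ diag(growth).
    A = np.eye(m) - beta * Pi * growth[None, :]
    # A finite, positive solution requires the spectral radius of
    # beta * Pi @ diag(growth) to be below one (not checked here).
    return np.linalg.solve(A, beta * (Pi @ growth))

def expected_return(grid, Pi, v):
    """Conditional expected gross return on equity at each grid point."""
    payoff = np.exp(grid) * (1.0 + v)  # next-period dividend growth times (1 + v')
    return (Pi @ payoff) / v
```

Using `np.linalg.solve` instead of forming the matrix inverse explicitly is a numerical-stability choice; the result is the same vector $v$ as in Equation (D.1).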

E Age-dependent transition probabilities and grids

Figure E1: Visualisation of the age-dependent transition probabilities for a discretization of the stochastic process in Guvenen et al. (2021), with $m = 12$ grid points. The order of the matrix corresponds with a sorted (low-to-high) earnings grid, where the three lowest states are zero-earnings states. Age on the x-axis of all figures.

Figure E2: Visualisation of (selected) age-dependent transition probabilities and the age-dependent grid of the $m = 18$ discretization of the stochastic process in Arellano et al. (2017). Age on the x-axis of all figures. Panels: (a) Transition probabilities of the five top earnings states; (b) Age-dependent grid (in logs).
