feds · August 31, 2007

Combining Forecasts From Nested Models

Abstract

Motivated by the common finding that linear autoregressive models forecast better than models that incorporate additional information, this paper presents analytical, Monte Carlo, and empirical evidence on the effectiveness of combining forecasts from nested models. In our analytics, the unrestricted model is true, but as the sample size grows, the data generating process converges to the restricted model. This approach captures the practical reality that the predictive content of variables of interest is often low. We derive MSE-minimizing weights for combining the restricted and unrestricted forecasts. Monte Carlo and empirical analyses verify the practical effectiveness of our combination approach.

Finance and Economics Discussion Series Divisions of Research & Statistics and Monetary Affairs Federal Reserve Board, Washington, D.C. Combining Forecasts From Nested Models Todd E. Clark and Michael W. McCracken 2007-43 NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. References in publications to the Finance and Economics Discussion Series (other than acknowledgement) should be cleared with the author(s) to protect the tentative character of these papers.

Combining Forecasts From Nested Models ∗
Todd E. Clark, Federal Reserve Bank of Kansas City
Michael W. McCracken, Board of Governors of the Federal Reserve System
January 2007

JEL Nos.: C53, C52
Keywords: forecast combination, predictability, forecast evaluation

∗ Clark (corresponding author): Economic Research Dept.; Federal Reserve Bank of Kansas City; 925 Grand; Kansas City, MO 64198; todd.e.clark@kc.frb.org. McCracken: Board of Governors of the Federal Reserve System; 20th and Constitution N.W.; Mail Stop #61; Washington, D.C. 20551; michael.w.mccracken@frb.gov. Portions of this paper were written while Michael McCracken was on the economics department faculty of the University of Missouri–Columbia. We gratefully acknowledge excellent research assistance from Taisuke Nakata and helpful comments from Jan Groen, David Hendry, Jim Stock, seminar participants at the Deutsche Bundesbank and Federal Reserve Bank of Kansas City, and participants at the Bank of England Workshop on Econometric Forecasting Models and Methods, the 2005 World Congress of the Econometric Society, NBER Summer Institute, and Stanford's SITE workshop on economic forecasting.
The views expressed herein are solely those of the authors and do not necessarily reflect the views of the Federal Reserve Bank of Kansas City, Board of Governors, Federal Reserve System, or any of its staff.

1 Introduction

Forecasters are well aware of the so-called principle of parsimony: “simple, parsimonious models tend to be best for out-of-sample forecasting...” (Diebold (1998)). Although an emphasis on parsimony may be justified on various grounds, parameter estimation error is one key reason. In many practical situations, estimating additional parameters can raise the forecast error variance above what might be obtained with a simple model. Such is clearly true when the additional parameters have population values of zero. But the same can apply even when the population values of the additional parameters are non-zero, if the marginal explanatory power associated with the additional parameters is low enough. In such cases, in finite samples the additional parameter estimation noise may raise the forecast error variance more than including information from additional variables lowers it. For example, simulation evidence in Clark and McCracken (2006) shows that even though the true model relates inflation to the output gap, in finite samples a simple AR model for inflation will often forecast as well as or better than the true model.¹

As this discussion suggests, parameter estimation noise creates a forecast accuracy tradeoff. Excluding variables that truly belong in the model could adversely affect forecast accuracy. Yet including the variables could raise the forecast error variance if the associated parameters are estimated sufficiently imprecisely. In light of such a tradeoff, combining forecasts from the unrestricted and restricted (or parsimonious) models could improve forecast accuracy. Such combination could be seen as a form of shrinkage, which various studies, such as Stock and Watson (2003), have found to be effective in forecasting. Accordingly, this paper presents analytical, Monte Carlo, and empirical evidence on the effectiveness of combining forecasts from nested models.
¹ Clark and West (2006a,b) obtain a similar result for some other applications.

Our analytics are based on models we characterize as “weakly” nested: the unrestricted model is the true model, but as the sample size grows large, the data generating process (DGP) converges to the restricted model. This analytic approach captures the practical reality that the predictive content of some variables of interest is often quite low. Although we focus the presented analysis on nested linear models, our results could be generalized to nested nonlinear models.

Under the weak nesting specification, we derive weights for combining the forecasts from estimates of the restricted and unrestricted models that are optimal in the sense of minimizing the forecast mean square error (MSE). We then characterize the settings under which the combination forecast will be better than the restricted or unrestricted forecasts. In the special case in which the coefficients on the extra variables in the unrestricted model are of a magnitude that makes the restricted and unrestricted models equally accurate, the MSE-minimizing forecast is a simple, equally-weighted average of the restricted and unrestricted forecasts.

In the Monte Carlo and empirical analysis, we show our proposed approach of combining forecasts from nested models to be effective for improving accuracy. To ensure the practical relevance of our results, we base our Monte Carlo experiments on DGPs calibrated to empirical applications, and, in our empirical work, we consider a range of applications. In the applications, our proposed combination approaches work well compared to related alternatives, consisting of Bayesian-type estimation with priors that push certain coefficients toward zero and Bayesian model averaging of the restricted and unrestricted models.

Our results build on much prior work on forecast combination. Research focused on non-nested models ranges from the early work of Bates and Granger (1969) to recent contributions such as Stock and Watson (2003) and Elliott and Timmermann (2004).² Combination of nested model forecasts has been considered only occasionally, in such studies as Goyal and Welch (2003) and Hendry and Clements (2004). Forecasts based on Bayesian model averaging as applied in such studies as Wright (2003) and Jacobson and Karlsson (2004) could also combine forecasts from nested models. Of course, such Bayesian methods of combination are predicated on model uncertainty. In contrast, our paper provides a theoretical rationale for nested model combination in the absence of model uncertainty.

² See Timmermann (2006) for a more complete survey of the extensive combination literature.

The paper proceeds as follows. Section 2 provides theoretical results on the possible gains from combination of forecasts from nested models, including the optimal combination weight. In section 3 we present Monte Carlo evidence on the finite sample effectiveness of our proposed forecast combination methods. Section 4 compares the effectiveness of the forecast methods in a range of empirical applications. Section 5 concludes. Additional theoretical details are presented in Appendix 1.

2 Theory

We begin by using a simple example to illustrate our essential ideas and results. We then proceed to the more general case. After detailing the necessary notation and assumptions, we provide an analytical characterization of the bias-variance tradeoff, created by weak predictability, involved in choosing among restricted, unrestricted, and combined forecasts. In light of that tradeoff, we then derive the optimal combination weights.

2.1 A simple example

Suppose we are interested in forecasting $y_{t+1}$ using a simple model relating $y_{t+1}$ to a constant and a strictly exogenous, scalar variable $x_t$. Suppose, however, that the predictive content of $x_t$ for $y_{t+1}$ may be weak. To capture this possibility, we model the population relationship between $y_{t+1}$ and $x_t$ using local-to-zero asymptotics, such that, as the sample size grows large, the predictive content of $x_t$ shrinks to zero (assume that, apart from the local element, the model fits in the framework of the usual classical normal regression model, with homoskedastic errors, etc.):

$$y_{t+1} = \beta_0 + \frac{\beta_1}{\sqrt{T}}\, x_t + u_{t+1}, \quad E(x_t u_{t+1}) = 0, \quad E(u_{t+1}^2) = \sigma^2. \tag{1}$$

In light of $x$'s weak predictive content, the forecast from an estimated model relating $y_{t+1}$ to a constant and $x_t$ (henceforth, the unrestricted model) could be less accurate than a forecast from a model relating $y_{t+1}$ to just a constant (the restricted model). Whether that is so depends on the “signal” and “noise” associated with $x_t$ and its estimated coefficient. Under the local asymptotics incorporated in the DGP (1), the signal-to-noise ratio is proportional to $\beta_1^2 \sigma_x^2 / \sigma^2$. Given $\sigma_x^2$ and $\sigma^2$ (or $\beta_1$), higher values of the coefficient on $x$ (or the variance of $x$) raise the signal relative to the noise; given the other parameters, a higher residual variance $\sigma^2$ increases the noise, reducing the signal-to-noise ratio.

In light of the tradeoff considerations described in the introduction, a combination of the unrestricted and restricted model forecasts could be more accurate than either of the individual forecasts. We consider a combined forecast that puts a weight of $\alpha_t^{*}$ on the restricted model forecast and $1 - \alpha_t^{*}$ on the unrestricted model forecast. We then analytically determine the weight $\alpha_t^{*}$ that yields the forecast with lowest expected squared error in period $t+1$. As we establish more formally below, the (estimated) MSE-minimizing combination weight $\alpha_t^{*}$ is a function of the signal-to-noise ratio:

$$\hat{\alpha}_t^{*} = \left[ 1 + \frac{(\sqrt{t}\,\hat{b}_1)^2\, \hat{\sigma}_x^2}{\hat{\sigma}^2} \right]^{-1}, \tag{2}$$

where $\hat{b}_1$ denotes the coefficient on $x$ ($\sqrt{t}\,\hat{b}_1$ corresponds to an estimate of the local population coefficient $\beta_1$), $\hat{\sigma}_x^2$ denotes the variance of $x$, and $\hat{\sigma}^2$ denotes the residual variance, all estimated at time $t$ (for forecasting at $t+1$).³ As this result indicates, if the predictive content of $x$ is such that the signal-to-noise ratio equals 1, then $\hat{\alpha}_t^{*} = .5$: the MSE-minimizing forecast is a simple average of the restricted and unrestricted model forecasts.

³ Clements and Hendry (1998) derive a similar result, for the combination of a forecast based on the unconditional mean and a forecast based on an AR(1) model without intercept, the model assumed to generate the data.

2.2 The general case: environment

In the general case, the possibility of weak predictors is modeled using a sequence of linear DGPs of the form (Assumption 1)

$$y_{T,j+\tau} = x'_{T,2,j}\beta_T^{*} + u_{T,j+\tau} = x'_{T,1,j}\beta_1^{*} + x'_{T,22,j}(T^{-1/2}\beta_{22}^{*}) + u_{T,j+\tau}, \tag{3}$$

$$E x_{T,2,j} u_{T,j+\tau} \equiv E h_{T,j+\tau} = 0 \ \text{ for all } j = 1,\ldots,t, \quad t = T-P+1,\ldots,T,$$

where $P$ denotes the number of predictions considered. Note that we allow the dependent variable $y_{T,j+\tau}$, the predictors $x_{T,2,j}$ and the error term $u_{T,j+\tau}$ to depend upon $T$, the final forecast origin. We make this explicit in the notation to emphasize that as the overall sample size is allowed to increase in our asymptotics, this parameterization affects their marginal distributions. While this is obvious for $y_{T,j+\tau}$, it is also true for $x_{T,2,j}$ if lagged values of the dependent variable are used as predictors. As such, our analytical results are based upon assumptions made on the triangular array $\{\{y_{T,j}, x'_{T,2,j}\}_{j=1}^{T+\tau}\}_{T \geq 1}$.

For a fixed value of $T$, our forecasting agent observes the sequence $\{y_{T,j}, x'_{T,2,j}\}_{j=1}^{t}$ sequentially at each forecast origin $t = T-P+1,\ldots,T$. Forecasts of the scalar $y_{T,t+\tau}$, $\tau \geq 1$, are generated using a ($k \times 1$, $k = k_1 + k_2$) vector of covariates $x_{T,2,t} = (x'_{T,1,t}, x'_{T,22,t})'$, linear parametric models $x'_{T,i,t}\beta_i^{*}$, $i = 1,2$, and a combination of the two models, $\alpha_t x'_{T,1,t}\beta_1^{*} + (1-\alpha_t) x'_{T,2,t}\beta_2^{*}$. The parameters are estimated using OLS (Assumption 2) and hence $\hat{\beta}_{i,t} = \arg\min_{\beta_i} t^{-1}\sum_{j=1}^{t-\tau}(y_{T,j+\tau} - x'_{T,i,j}\beta_i)^2$, $i = 1,2$, for the restricted and unrestricted models, respectively. We denote the loss associated with the $\tau$-step ahead forecast errors as $\hat{u}^2_{T,i,t+\tau} = (y_{T,t+\tau} - x'_{T,i,t}\hat{\beta}_{i,t})^2$, $i = 1,2$, and $\hat{u}^2_{T,W,t+\tau} = (y_{T,t+\tau} - \alpha_t x'_{T,1,t}\hat{\beta}_{1,t} - (1-\alpha_t) x'_{T,2,t}\hat{\beta}_{2,t})^2$ for the restricted, unrestricted, and combined forecasts, respectively.

The following additional notation will be used. Let $H_{T,i}(t) = (t^{-1}\sum_{j=1}^{t-\tau} x_{T,i,j} u_{T,j+\tau}) = (t^{-1}\sum_{j=1}^{t-\tau} h_{T,i,j+\tau})$, $B_{T,i}(t) = (t^{-1}\sum_{j=1}^{t-\tau} x_{T,i,j} x'_{T,i,j})^{-1}$, and $B_i = \lim_{T\to\infty}(E x_{T,i,j} x'_{T,i,j})^{-1}$ for $i = 1,2$. For $U_{T,j} = (h'_{T,2,j+\tau}, \mathrm{vec}(x_{T,2,j} x'_{T,2,j})')'$, let $V = \sum_{l=-\tau+1}^{\tau-1}\Omega_{11,l}$, where $\Omega_{11,l}$ is the upper block-diagonal element of $\Omega_l$ defined below. For any ($m \times n$) matrix $A$ with elements $a_{i,j}$ and column vectors $a_j$, let: $\mathrm{vec}(A)$ denote the ($mn \times 1$) vector $[a'_1, a'_2, \ldots, a'_n]'$; $|A|$ denote the max norm; and $\mathrm{tr}(A)$ denote the trace. Let $\sup_t = \sup_{T-P+1 \leq t \leq T}$ and let $\Rightarrow$ denote weak convergence. Finally, we define a variable selection matrix and a coefficient vector that appear directly in our key combination results: $J = (I_{k_1 \times k_1}, 0_{k_1 \times k_2})'$ and $\delta = (0_{1 \times k_1}, \beta_{22}^{*\prime})'$.

To derive our general results, we need two more assumptions (in addition to our assumptions (1 and 2) of a DGP with weak predictability and OLS-estimated linear forecasting models).

Assumption 3: (a) $T^{-1}\sum_{j=1}^{[rT]} U_{T,j} U'_{T,j-l} \Rightarrow r\Omega_l$ where $\Omega_l = \lim_{T\to\infty} T^{-1}\sum_{j=1}^{T} E(U_{T,j} U'_{T,j-l})$ for all $l \geq 0$, (b) $\Omega_{11,l} = 0$ all $l \geq \tau$, (c) $\sup_{T-P+1 \geq 1,\, s \leq T} E|U_{T,s}|^{2q} < \infty$ for some $q > 1$, (d) $U_{T,j} - EU_{T,j} = (h'_{T,2,j+\tau}, \mathrm{vec}(x_{T,2,j} x'_{T,2,j} - E x_{T,2,j} x'_{T,2,j})')'$ is a zero mean triangular array satisfying Theorem 3.2 of De Jong and Davidson (2000).

Assumption 4: For $s \in (1-\lambda_P, 1]$, (a) $\alpha_t \Rightarrow \alpha(s) \in [0,1]$, (b) $\lim_{T\to\infty} P/T = \lambda_P \in (0,1)$.

Assumption 3 imposes three types of conditions. First, in (a) and (c) we require that the observables, while not necessarily covariance stationary, are asymptotically mean square stationary with finite second moments. We do so in order to allow the observables to have marginal distributions that vary as the weak predictive ability strengthens along with the sample size but are ‘well-behaved’ enough that, for example, sample averages converge in probability to the appropriate population means. Second, in (b) we impose the restriction that the $\tau$-step ahead forecast errors are MA($\tau-1$). We do so in order to emphasize the role that weak predictors have on forecasting without also introducing other forms of model misspecification. Finally, in (d) we impose the high level assumption that, in particular, $h_{T,2,j+\tau}$ satisfies Theorem 3.2 of De Jong and Davidson (2000). By doing so we not only ensure (results needed in Appendix 1) that certain weighted partial sums converge weakly to standard Brownian motion, but also allow ourselves to take advantage of various results pertaining to convergence in distribution to stochastic integrals.

Our final assumption is unique: we permit the combining weights to change with time. In this way, we allow the forecasting agent to balance the bias-variance tradeoff differently across time as the increasing sample size provides stronger evidence of predictive ability. Finally, we impose the requirement that $\lim_{T\to\infty} P/T = \lambda_P \in (0,1)$ and hence the duration of forecasting is finite but non-trivial.

2.3 Theoretical results on the tradeoff

Our characterization of the bias-variance tradeoff associated with weak predictability is based on $\sum_{t=T-P+1}^{T}(\hat{u}^2_{T,2,t+\tau} - \hat{u}^2_{T,W,t+\tau})$, the difference in the (normalized) MSEs of the unrestricted and combined forecasts. In Appendix 1, we provide a general characterization of the tradeoff, in Theorem 1. But in the absence of a closed form solution for the limiting distribution of the loss differential (the distribution provided in Appendix 1), we proceed in this section to focus on the mean of this loss differential. From the general case proved in Appendix 1, we first establish the expected value of the loss differential, in the following corollary.

Corollary 1:

$$E\sum_{t=T-P+1}^{T}(\hat{u}^2_{T,2,t+\tau} - \hat{u}^2_{T,W,t+\tau}) \to \int_{1-\lambda_P}^{1} E\xi_W(s)\,ds = \int_{1-\lambda_P}^{1} (1-(1-\alpha(s))^2)\, s^{-1}\, \mathrm{tr}((-JB_1J' + B_2)V)\,ds - \int_{1-\lambda_P}^{1} \alpha^2(s)\, \delta' B_2^{-1}(-JB_1J' + B_2)B_2^{-1}\delta\,ds.$$

This decomposition implies that the bias-variance tradeoff depends on: (1) the duration of forecasting ($\lambda_P$), (2) the dimension of the parameter vectors (through the dimension of $\delta$), (3) the magnitude of the predictive ability (as measured by quadratics of $\delta$), (4) the forecast horizon (via $V$, the long-run variance of $h_{T,2,t+\tau}$), and (5) the second moments of the predictors ($B_i = \lim_{T\to\infty}(E x_{T,i,t} x'_{T,i,t})^{-1}$).

The first term on the right-hand side of the decomposition can be interpreted as the pure “variance” contribution to the mean difference in the unrestricted and combined MSEs. The second term can be interpreted as the pure “bias” contribution. Clearly, when $\delta = 0$ and thus there is no predictive ability associated with the predictors $x_{T,22,t}$, the expected difference in MSE is positive so long as $\alpha(s) \neq 0$. Since the goal is to choose $\alpha(s)$ so that $\int_{1-\lambda_P}^{1} E\xi_W(s)\,ds$ is maximized, we immediately reach the intuitive conclusion that we should always forecast using the restricted model and hence set $\alpha(s) = 1$. When $\delta \neq 0$, and hence there is predictive ability associated with the predictors $x_{T,22,t}$, forecast accuracy is maximized by combining the restricted and unrestricted model forecasts. The following corollary provides the optimal combination weight. Note that, to simplify notation in the presented results, from this point forward we omit the subscript $T$ from the predictors, so that, e.g., $x_{T,22,t}$ is simply denoted $x_{22,t}$.
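As a brief check on the two cases just described, the integrand of Corollary 1 can be examined pointwise in $s$. The shorthand $N$ and $S$ below is introduced only for this sketch and is not part of the paper's notation; the sketch also assumes $N > 0$, and $S > 0$ whenever $\delta \neq 0$.

```latex
% Shorthand assumed for this sketch only:
%   N = tr((-J B_1 J' + B_2) V)                               (variance term, N > 0)
%   S = \delta' B_2^{-1} (-J B_1 J' + B_2) B_2^{-1} \delta    (bias term, S \ge 0)
\begin{align*}
f(\alpha) &= \bigl(1-(1-\alpha)^2\bigr)\, s^{-1} N - \alpha^2 S, \\
f'(\alpha) &= 2(1-\alpha)\, s^{-1} N - 2\alpha S, \qquad
f''(\alpha) = -2 s^{-1} N - 2S < 0, \\
\delta = 0 \ (S = 0): \quad & f'(\alpha) = 2(1-\alpha)\, s^{-1} N \ge 0 \ \text{on } [0,1]
  \ \Rightarrow\ \text{maximum at } \alpha = 1, \\
\delta \neq 0 \ (S > 0): \quad & f'(0) = 2 s^{-1} N > 0, \quad f'(1) = -2S < 0
  \ \Rightarrow\ \text{maximum at some } 0 < \alpha < 1 .
\end{align*}
```

Because $f$ is strictly concave, the sign of $f'$ at the endpoints is enough to locate the maximizer: at a corner when there is no bias term, and strictly inside the unit interval otherwise.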

Corollary 2: The pointwise optimal combining weights satisfy

$$\alpha^{*}(s) = \left[ 1 + s\, \frac{\beta'_{22}\left(E x_{22,t} x'_{22,t} - E x_{22,t} x'_{1,t}(E x_{1,t} x'_{1,t})^{-1} E x_{1,t} x'_{22,t}\right)\beta_{22}}{\mathrm{tr}((-JB_1J' + B_2)V)} \right]^{-1}. \tag{4}$$

The optimal combination weight is derived by maximizing the arguments of the integrals in Corollary 1 that contribute to the average expected mean square differential over the duration of forecasting, hence our “pointwise optimal” characterization of the weight. In particular, the results of Corollary 2 follow from maximizing

$$(1-(1-\alpha(s))^2)\, s^{-1}\, \mathrm{tr}((-JB_1J' + B_2)V) - \alpha^2(s)\, \delta' B_2^{-1}(-JB_1J' + B_2)B_2^{-1}\delta \tag{5}$$

with respect to $\alpha(s)$ for each $s$.

As is apparent from the formula in Corollary 2, the combining weight is decreasing in the marginal ‘signal to noise’ ratio

$$s\,\beta'_{22}\left(E x_{22,t} x'_{22,t} - E x_{22,t} x'_{1,t}(E x_{1,t} x'_{1,t})^{-1} E x_{1,t} x'_{22,t}\right)\beta_{22} \,/\, \mathrm{tr}((-JB_1J' + B_2)V).$$

As the marginal ‘signal’, $s\,\beta'_{22}(E x_{22,t} x'_{22,t} - E x_{22,t} x'_{1,t}(E x_{1,t} x'_{1,t})^{-1} E x_{1,t} x'_{22,t})\beta_{22}$, increases, we place more weight on the unrestricted model and less on the restricted one. Conversely, as the marginal ‘noise’, $\mathrm{tr}((-JB_1J' + B_2)V)$, increases, we place more weight on the restricted model and less on the unrestricted model. Finally, as forecasting moves forward in time and the estimation sample (represented by $s$) increases, we place increasing weight on the unrestricted model.

In the special case in which the signal-to-noise ratio equals 1, the optimal combination weight is 1/2. That is, for a given time period $s$, when

$$s\,\beta'_{22}\left(E x_{22,t} x'_{22,t} - E x_{22,t} x'_{1,t}(E x_{1,t} x'_{1,t})^{-1} E x_{1,t} x'_{22,t}\right)\beta_{22} = \mathrm{tr}((-JB_1J' + B_2)V), \tag{6}$$

and hence the restricted and unrestricted models are expected to be equally accurate, $\alpha^{*}(s) = 1/2$.

A bit more algebra establishes the determinants of the size of the benefits to combination. If we substitute $\alpha^{*}(s)$ into (5), we find that $E\xi_W^{*}(s)$ takes the easily interpretable form

$$\frac{\mathrm{tr}((-JB_1J' + B_2)V)^2}{s\left(s\,\beta'_{22}\left(E x_{22,t} x'_{22,t} - E x_{22,t} x'_{1,t}(E x_{1,t} x'_{1,t})^{-1} E x_{1,t} x'_{22,t}\right)\beta_{22} + \mathrm{tr}((-JB_1J' + B_2)V)\right)}. \tag{7}$$

This simplifies even more in the conditionally homoskedastic case, in which $\mathrm{tr}((-JB_1J' + B_2)V) = \sigma^2 k_2$. In either case, it is clear that we expect the optimal combination to provide the most benefit when the marginal ‘noise’, $\mathrm{tr}((-JB_1J' + B_2)V)$, is large or when the marginal ‘signal’, $s\,\beta'_{22}(E x_{22,t} x'_{22,t} - E x_{22,t} x'_{1,t}(E x_{1,t} x'_{1,t})^{-1} E x_{1,t} x'_{22,t})\beta_{22}$, is small. And again, we obtain the result that, as the estimation sample grows, any benefits from combination vanish as the parameter estimates become increasingly accurate.

Note, however, that the term $\beta'_{22}(E x_{22,t} x'_{22,t} - E x_{22,t} x'_{1,t}(E x_{1,t} x'_{1,t})^{-1} E x_{1,t} x'_{22,t})\beta_{22}$ is a function of the local-to-zero parameters $\beta_{22}$. Moreover, note that these optimal combining weights are not presented relative to an environment in which agents are forecasting in ‘real time’. Therefore, for practical use, we suggest a transformed formula. Let $\hat{B}_i$ and $\hat{V}$ denote estimates of $B_i$ and $V$, respectively, based on data through period $t$. If we let $T^{1/2}\hat{\beta}_{22}$ denote an estimate of the local-to-zero parameter $\beta_{22}^{*}$ and set $s = t/T$, we obtain the following real time estimate of the pointwise optimal combining weight:⁴

$$\hat{\alpha}_t^{*} = \left[ 1 + t\, \frac{\hat{\beta}'_{22}\left(t^{-1}\sum_{j=1}^{t-\tau} x_{22,j} x'_{22,j} - \left(t^{-1}\sum_{j=1}^{t-\tau} x_{22,j} x'_{1,j}\right)\hat{B}_1\left(t^{-1}\sum_{j=1}^{t-\tau} x_{1,j} x'_{22,j}\right)\right)\hat{\beta}_{22}}{\mathrm{tr}((-J\hat{B}_1J' + \hat{B}_2)\hat{V})} \right]^{-1}. \tag{8}$$

In doing so, though, we acknowledge that the parameter estimates are not consistent for the local-to-zero parameters on which our theoretical derivations (Corollary 2) are based. The local-to-zero asymptotics allow us to derive closed-form solutions for the optimal combination weights, but require knowledge of local-to-zero parameters that cannot be estimated consistently. We therefore simply use rescaled OLS magnitudes to estimate (inconsistently) the assumed local-to-zero values and subsequent optimal combining weights. Below we use Monte Carlo experiments and empirical examples to determine whether the estimated quantities perform well enough to be a valuable tool for forecasting.
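To make the real-time estimate (8) concrete, the following is a minimal sketch for the one-step-ahead case ($\tau = 1$), using the sample second moments and a heteroskedasticity-robust $\hat{V}$ built from restricted-model residuals. The function name `combination_weight`, the array layout, and the simulated data are illustrative assumptions rather than the authors' code; longer horizons would require a HAC estimate of $V$, which is not shown.

```python
import numpy as np


def combination_weight(y, x1, x22):
    """Estimated weight on the restricted forecast, in the spirit of eq. (8).

    One-step-ahead case only (tau = 1). Hypothetical helper, not the authors' code.
    y   : (t,) predictand
    x1  : (t, k1) restricted-model regressors (include the constant)
    x22 : (t, k2) extra regressors that enter only the unrestricted model
    """
    t = len(y)
    # Pair x_j with y_{j+1}, j = 1, ..., t-1.
    X1, X22, target = x1[:-1], x22[:-1], y[1:]
    X2 = np.hstack([X1, X22])
    k1, k = X1.shape[1], X2.shape[1]

    # OLS estimates of the restricted and unrestricted models.
    beta1 = np.linalg.lstsq(X1, target, rcond=None)[0]
    beta2 = np.linalg.lstsq(X2, target, rcond=None)[0]
    b22 = beta2[k1:]

    # Inverse second-moment matrices, scaled by t.
    B1 = np.linalg.inv(X1.T @ X1 / t)
    B2 = np.linalg.inv(X2.T @ X2 / t)
    JB1J = np.zeros((k, k))
    JB1J[:k1, :k1] = B1  # J B1 J' places B1 in the upper-left block

    # V-hat from restricted-model residuals (heteroskedasticity-robust form).
    u1 = target - X1 @ beta1
    V = (X2 * (u1 ** 2)[:, None]).T @ X2 / t

    noise = np.trace((B2 - JB1J) @ V)
    M = X22.T @ X22 / t - (X22.T @ X1 / t) @ B1 @ (X1.T @ X22 / t)
    signal = t * (b22 @ M @ b22)
    return 1.0 / (1.0 + signal / noise)


# Illustrative data: a constant plus one candidate predictor.
rng = np.random.default_rng(0)
t = 200
const = np.ones((t, 1))
x = rng.normal(size=(t, 1))
e = rng.normal(size=t)
y_strong = e.copy()
y_strong[1:] += 2.0 * x[:-1, 0]  # x strongly predicts next-period y

w_strong = combination_weight(y_strong, const, x)  # close to 0
w_none = combination_weight(e, const, x)           # x has no predictive content
```

With a strong predictor nearly all weight goes to the unrestricted forecast; with an irrelevant predictor the weight on the restricted forecast is larger but still estimated with noise, which is the concern taken up in the text that follows.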
Conceptually, our proposed combination (8) might be seen as a variant of a Stein rule estimator.⁵ With conditionally homoskedastic, 1-step ahead forecast errors, the signal-to-noise ratio in our combination coefficient $\hat{\alpha}_t$ is the conventional F-statistic for testing the null of coefficients of 0 on the $x_{22}$ variables. With additional (and strong) assumptions of normality and strict exogeneity of the regressors, the F-statistic has a non-central F distribution, with a mean that is a linear function of the population signal-to-noise ratio. Based on that mean, the population-level signal-to-noise ratio can be alternatively estimated as F-statistic $-\,1$. A combination forecast based on this estimate is exactly the same as the forecast that would be obtained by applying conventional Stein rule estimation to the unrestricted model.

⁴ We estimate $B_i$ with $\hat{B}_i = (t^{-1}\sum_{j=1}^{t-\tau} x_{i,j} x'_{i,j})^{-1}$, where $x_{i,t}$ is the vector of regressors in the forecasting model (supposing the MSE stationarity assumed in the theoretical analysis). At a forecast horizon ($\tau$) of one period, we estimate $V$ using $\hat{V} = t^{-1}\sum_{j=1}^{t-\tau} \hat{u}^2_{1,j} x_{2,j} x'_{2,j}$. At longer forecast horizons, we similarly compute $V$ with the Newey and West (1987) estimator (again, using the residual from the restricted model) and $2(\tau-1)$ lags. In all cases, we use the restricted model residual in computing $V$, in light of the evidence in such studies as Godfrey and Orme (2004) that imposing such restrictions improves the small sample properties of heteroskedasticity-robust variances.

⁵ Our optimal, but infeasible, combining weights are closely related to the minimum-MSE estimator provided in Theil (1971). Our results primarily differ in that we permit serially correlated and conditionally heteroskedastic errors, and don't require strict exogeneity of the regressors.

This Stein rule result suggests an alternative estimate of the optimal combination coefficient $\alpha_t^{*}$ with potentially better small sample properties. Specifically, based on (i) the equivalence of the directly estimated signal-to-noise ratio and the conventional F-statistic result and (ii) the centering of the F distribution at a linear transform of the population signal-to-noise ratio, we might consider replacing the signal-to-noise ratio estimate in (8) with the signal-to-noise ratio estimate less 1. However, under this estimation approach, the combination forecast could put a weight of more than 1 on the restricted model and a negative weight on the unrestricted. As a result, we might consider a truncation that bounds the weight between 0 and 1:

$$\hat{\alpha}_t^{*} = \left[ 1 + \max\left(0,\ \frac{\text{signal}}{\text{noise}} - 1\right) \right]^{-1}, \tag{9}$$

where the signal/noise term is the same as that in the baseline estimator (8). In light of potential concerns about the small sample properties of the estimator (8), we include a forecast combination based on (9) in our Monte Carlo and empirical analyses.

More generally, in cases in which the marginal predictive content of the $x_{22}$ variables is small or modest, a simple average forecast might be more accurate than our proposed estimated combinations based on (8) or (9). With $\beta_{22}$ coefficients sized such that the restricted and unrestricted models are nearly equally accurate, the population-level optimal combination weight will be close to 1/2. As a result, forecast accuracy could be enhanced by imposing a combination weight of 1/2 instead of estimating it, in light of the potential for noise in the combination coefficient estimate.
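The three weighting rules discussed above can be compared side by side for a given estimated signal-to-noise ratio. This is a small sketch; the function names and the constant `SIMPLE_AVERAGE` are hypothetical labels chosen for illustration, and the ratio is taken as given rather than estimated.

```python
def baseline_weight(ratio):
    """Weight on the restricted forecast from eq. (8), given signal/noise."""
    return 1.0 / (1.0 + ratio)


def stein_weight(ratio):
    """Truncated weight from eq. (9): uses (signal/noise - 1), floored at 0."""
    return 1.0 / (1.0 + max(0.0, ratio - 1.0))


def untruncated_weight(ratio):
    """Same adjustment without the max(0, .) bound, shown only for contrast."""
    return 1.0 / (1.0 + (ratio - 1.0))


SIMPLE_AVERAGE = 0.5  # fixed weight, nothing to estimate

# A ratio below 1 shows why the truncation matters: without it the weight on
# the restricted model exceeds 1 and the unrestricted weight turns negative.
low = 0.5
weights_low = (baseline_weight(low), stein_weight(low), untruncated_weight(low))

# At a ratio of exactly 1 the baseline rule coincides with the simple average.
weight_at_one = baseline_weight(1.0)
```

At a ratio of 0.5, the untruncated rule puts a weight of 2 on the restricted forecast, while (9) caps it at 1. The simple average needs no estimate of the ratio at all.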
A parallel result is well-known in the non-nested combination literature: simple averages are often more accurate than estimated optimal combinations.

Our proposed combination (8) might also be expected to have some relationship to Bayesian methods. In the very simple case of the example of section 2.1, the proposed combination forecast corresponds to a forecast from an unrestricted model with Bayesian posterior mean coefficients estimated with a prior mean of 0 and variance proportional to the signal-noise ratio.⁶ More generally, our proposed combination could correspond to the Bayesian model averaging considered in such studies as Wright (2003), Koop and Potter (2004), and Stock and Watson (2005). Indeed, in the scalar environment of Stock and Watson (2005), setting their weighting function to $t\text{-stat}^2/(1 + t\text{-stat}^2)$ yields our combination forecast. In the more general case, there may be some prior that makes a Bayesian average of the restricted and unrestricted forecasts similar to the combination forecast based on (8). Note, however, that the underlying rationale for Bayesian averaging is quite different from the combination rationale developed in this paper. Bayesian averaging is generally founded on model uncertainty. In contrast, our combination rationale is based on the bias-variance tradeoff associated with parameter estimation error, in an environment without model uncertainty.

⁶ Specifically, using a prior variance of the signal-noise ratio times the OLS variance yields a posterior mean forecast equivalent to the combination forecast.

3 Monte Carlo Evidence

We use Monte Carlo simulations of several multivariate data-generating processes to evaluate the finite-sample performance of the combination methods described above. In these experiments, the DGPs relate the predictand $y$ to lagged $y$ and lagged $x$, with the coefficients on lagged $x$ set at various values. Forecasts of $y$ are generated with the combination approaches considered above. Performance is evaluated using simple summary statistics of the distribution of each forecast's MSE: the average MSE across Monte Carlo draws and the probability of equaling or beating the restricted model's forecast MSE.

3.1 Experiment design

In light of the considerable practical interest in the out-of-sample predictability of inflation (see, for example, Stock and Watson (1999, 2003), Atkeson and Ohanian (2001), Orphanides and van Norden (2005), and Clark and McCracken (2006)), we present results for DGPs based on estimates of quarterly U.S. inflation models. In particular, we consider models based on the relationship of the change in core PCE inflation to (1) lags of the change in inflation and the output gap, (2) lags of the change in inflation, the output gap, and food and energy price inflation, and (3) lags of the change in inflation and five common business cycle factors, estimated as in Stock and Watson (2005).⁷ We consider various combinations of forecasts from an unrestricted model that includes all variables in the DGP with forecasts from a restricted model that takes an AR form (that is, a model that drops from the unrestricted model all but the constant and lags of the dependent variable).

⁷ See Section 4's description of the applications for data details. The DGP coefficients are based on models estimated with quarterly data from 1961:Q1 through 2006:Q2. For convenient scaling of the DGP parameters, the common factors estimated from the data were multiplied by 10 prior to the estimation of the regression models underlying the DGP specifications.

For each experiment, we conduct 10,000 simulations. With quarterly data in mind, we evaluate forecast accuracy over forecast periods of various lengths: $P$ = 1, 20, 40, and 80. In our baseline results, the size of the sample used to generate the first (in time) forecast at horizon $\tau$ is $80 - \tau + 1$ (the estimation sample expands as forecasting moves forward in time). In light of the potential for forecast combination to yield larger gains with smaller model estimation samples, we also report selected results for experiments in which the size of the sample used to generate the first (in time) forecast at horizon $\tau$ is $40 - \tau + 1$.

The first DGP, based on the empirical relationship between the change in core inflation ($\Delta y_t$) and the output gap ($x_{1,t}$), takes the form

$$\Delta y_t = -.40\,\Delta y_{t-1} - .18\,\Delta y_{t-2} - .09\,\Delta y_{t-3} - .04\,\Delta y_{t-4} + b_{11} x_{1,t-1} + u_t$$

$$x_{1,t} = 1.15\, x_{1,t-1} - .05\, x_{1,t-2} - .20\, x_{1,t-3} + v_{1,t} \tag{10}$$

$$\mathrm{var}\begin{pmatrix} u_t \\ v_{1,t} \end{pmatrix} = \begin{pmatrix} .72 & \\ .02 & .57 \end{pmatrix}.$$

We consider experiments with two different settings of $b_{11}$, the $x_1$ coefficient, which corresponds to our theoretical construct $\beta_{22}/\sqrt{T}$. The baseline value of $b_{11}$ is the one that, in population, makes the null and alternative models equally accurate (in expectation, at the 1-step ahead horizon) in the first forecast period, period $T-P+2$; this is the value that satisfies (6). Given the population moments implied by the DGP parameterization, this value is $b_{11} = .042$. The second setting we consider is the empirical value: $b_{11} = .10$.
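For readers who want to experiment with DGP 1, the following sketch simulates one artificial sample from (10). The function name `simulate_dgp1`, the Gaussian shocks, the burn-in length, and the seed handling are assumptions made for illustration; the text does not state how the innovations are drawn.

```python
import numpy as np


def simulate_dgp1(n, b11, burn=200, seed=0):
    """Simulate n observations of (dy_t, x_{1,t}) from DGP 1, eq. (10).

    Hypothetical helper: Gaussian shocks and a burn-in period are assumptions.
    """
    rng = np.random.default_rng(seed)
    # Innovation covariance of (u_t, v_{1,t}) as given in (10).
    cov = np.array([[0.72, 0.02],
                    [0.02, 0.57]])
    shocks = rng.multivariate_normal(np.zeros(2), cov, size=n + burn)
    u, v = shocks[:, 0], shocks[:, 1]

    dy = np.zeros(n + burn)
    x = np.zeros(n + burn)
    for s in range(4, n + burn):
        # Output gap: AR(3).
        x[s] = 1.15 * x[s - 1] - 0.05 * x[s - 2] - 0.20 * x[s - 3] + v[s]
        # Change in inflation: AR(4) plus the lagged output gap.
        dy[s] = (-0.40 * dy[s - 1] - 0.18 * dy[s - 2] - 0.09 * dy[s - 3]
                 - 0.04 * dy[s - 4] + b11 * x[s - 1] + u[s])
    # Drop the burn-in so the start-up zeros do not matter.
    return dy[burn:], x[burn:]


# Baseline setting (b11 = .042) and empirical setting (b11 = .10).
dy_base, x_base = simulate_dgp1(20000, b11=0.042)
dy_emp, x_emp = simulate_dgp1(20000, b11=0.10)
```

A long simulated sample such as this can be used to check population moments numerically, for example the variance of the output gap implied by its AR(3) process.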
11 11 The second DGP, based on estimated relationships among inflation (∆y ), the output t gap (x ), and food and energy price inflation (x ), takes the form: 1,t 2,t ∆y = .47∆y .24∆y .15∆y .10∆y +b x +b x +b x +u t t 1 t 2 t 3 t 4 11 1,t 1 21 2,t 1 22 2,t 2 t − − − − − − − − − − − x = 1.15x .05x .20x +v (11) 1,t 1,t 1 1,t 2 1,t 3 1,t − − − − − x = .06x +.40x +.28x .13x +v 2,t 1,t 1 2,t 1 2,t 3 2,t 4 2,t − − − − − u .62 t var v = .03 .57 . 1,t     v .06 .06 .70 2,t −     As with DGP 1, we consider experiments with two settings of the set of b coefficients, ij which correspond to the elements of β /√T. One setting is based on empirical estimates: 22 b = .07, b = .27, b = .10. We take as the baseline experiment one in which all of these 11 21 22 11

empirical values of the $b_{ij}$ coefficients are multiplied by a constant less than one, such that, in population, the null and alternative models are expected to be equally accurate (at the 1-step ahead horizon) in (the first) forecast period $T - P + 2$. In our baseline experiments, this multiplying constant is .370.

The third DGP, based on estimated relationships among inflation ($\Delta y_t$) and five business cycle factors estimated as in Stock and Watson (2005) ($x_{i,t}$, $i = 1, \ldots, 5$), takes the form:

$$
\begin{aligned}
\Delta y_t &= -.40\Delta y_{t-1} - .19\Delta y_{t-2} - .10\Delta y_{t-3} - .04\Delta y_{t-4} + \sum_{i=1}^{5} b_{i1}x_{i,t-1} + u_t, \quad \operatorname{var}(u_t) = .67 \\
x_{i,t} &= \sum_{j=1}^{4} a_{ij}x_{i,t-j} + v_{i,t}, \quad i = 1, \ldots, 5.
\end{aligned}
\tag{12}
$$

As with DGPs 1 and 2, we consider experiments with two different settings of the set of $b_{ij}$ coefficients. One setting is based on empirical estimates: $b_{11} = .04$, $b_{21} = .09$, $b_{31} = .16$, $b_{41} = .04$, $b_{51} = .08$.$^{8}$ We take as the baseline experiment one in which all of these empirical values of the $b_{ij}$ coefficients are multiplied by a constant less than one, such that, in population, the null and alternative models are expected to be equally accurate (at the 1-step horizon) in forecast period $T - P + 2$. In our baseline experiments, this multiplying constant is .748.

3.2 Forecast approaches

Following practices common in the literature from which our applications are taken (see, e.g., Stock and Watson (2003)), direct multi-step forecasts one and four steps ahead are formed from various combinations of estimates of the following forecasting models:

$$
y^{(\tau)}_{t+\tau} - y_t = \delta_0 + \delta_1\Delta y_t + \delta_2\Delta y_{t-1} + \delta_3\Delta y_{t-2} + \delta_4\Delta y_{t-3} + u_{1,t+\tau}
\tag{13}
$$

$$
y^{(\tau)}_{t+\tau} - y_t = \gamma_0 + \gamma_1\Delta y_t + \gamma_2\Delta y_{t-1} + \gamma_3\Delta y_{t-2} + \gamma_4\Delta y_{t-3} + \Gamma_{22}'x_{22,t} + u_{2,t+\tau},
\tag{14}
$$

where $y^{(\tau)}_{t+\tau} = (1/\tau)\sum_{s=1}^{\tau} y_{t+s}$ and $y^{(1)}_{t+1} \equiv y_{t+1}$. In the actual inflation data underlying the DGP specification, $y^{(\tau)}_{t+\tau}$ corresponds to the average annual rate of price increase from period $t$ to $t+\tau$.
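To make the left-hand side of (13) and (14) concrete, the following sketch computes $y^{(\tau)}_{t+\tau} - y_t$ for every origin $t$ at which the full $\tau$-period average is available. The helper name `direct_target` and the toy series are hypothetical, introduced only for this illustration.

```python
import numpy as np

def direct_target(y, tau):
    """Return y^(tau)_{t+tau} - y_t for t = 0, ..., len(y) - tau - 1."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # csum[k] holds y[0] + ... + y[k-1], so a difference of two entries
    # gives the sum of y[t+1], ..., y[t+tau] without an inner loop.
    csum = np.concatenate(([0.0], np.cumsum(y)))
    t = np.arange(n - tau)
    avg_future = (csum[t + tau + 1] - csum[t + 1]) / tau
    return avg_future - y[t]

# Toy inflation series (made-up numbers).
y = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
one_step = direct_target(y, tau=1)   # simply y[t+1] - y[t]
two_step = direct_target(y, tau=2)   # average of next two values minus y[t]
```

With $\tau = 1$ the target reduces to the one-period change in $y$, which matches the definition $y^{(1)}_{t+1} \equiv y_{t+1}$.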
Across DGPs 1-3, the vector $x_{22,t}$ consists of, respectively, (1) $(x_{1,t})$, (2) $(x_{1,t}, x_{2,t}, x_{2,t-1})'$, and (3) $(x_{1,t}, x_{2,t}, x_{3,t}, x_{4,t}, x_{5,t})'$.

We examine the accuracy of forecasts from: (1) OLS estimates of the restricted model (13); (2) OLS estimates of the unrestricted model (14); (3) the “known” optimal linear

$^{8}$The coefficients of the AR models for the factors are as follows, in order from lags 1 to 4: factor 1: .81, -.18, .19, -.19; factor 2: .80, -.05, .16, -.18; factor 3: -.36, .16, .22, .12; factor 4: .31, .08, .39, .01; and factor 5: .25, .15, .24, .05. The residual variances of the five factors are as follows, in order for factors 1 through 5: 6.36, 2.35, .92, 2.08, 1.62.

combination of the restricted and unrestricted forecasts, using the weight implied by equation (4) and population moments implied by the DGP; (4) the estimated optimal linear combination of the restricted and unrestricted forecasts, using the weight given in (8) and estimated moments of the data; (5) the estimated optimal linear combination using the Stein rule-variant weight given in (9); and (6) a simple average of the restricted and unrestricted forecasts (as noted above, weights of 1/2 are optimal if the signal associated with the x variables equals the noise, making the models equally accurate).

3.3 Simulation results

In our Monte Carlo comparison of methods, we primarily base our evaluation on average MSEs over a range of forecast samples. For simplicity, in presenting average MSEs, we only report actual average MSEs for the restricted model (13). For all other forecasts, we report the ratio of a forecast’s average MSE to the restricted model’s average MSE. To capture potential differences in MSE distributions, we also present some evidence on the probabilities of equaling or beating the restricted model.

3.3.1 Results for signal = noise experiments

We begin with the case in which the coefficients $b_{ij}$ (elements of $\beta_{22}$) on the lags of $x_{it}$ (elements of $x_{22}$) in the DGPs (10)-(12) are set such that, at the 1-step ahead horizon, the restricted and unrestricted model forecasts for period $T - P + 2$ are expected to be equally accurate, because the signal and noise associated with the $x_{it}$ variables are equalized as of that period. In this setting, the optimally combined forecast should, on average, be more accurate than either the restricted or unrestricted forecasts. Note, however, that the models are scaled to make only 1-step ahead forecasts equally accurate. At the 4-step ahead forecast horizon, the restricted model may be more or less accurate than the unrestricted, depending on the DGP. The average MSE results reported in Table 1 confirm the theoretical implications.
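The reporting convention used in the tables (MSE relative to the restricted model) and the mechanics of a combined forecast can be sketched as follows. This is only an illustration: the function names are hypothetical, the numbers are made up, and the convention that $\alpha$ is the weight on the restricted forecast is an assumption inferred from the discussion of how the weight moves with the signal-to-noise ratio.

```python
import numpy as np

def combine(f_restricted, f_unrestricted, alpha):
    """Weighted combination; alpha is the weight on the restricted
    forecast (assumed convention), 1 - alpha on the unrestricted."""
    f_r = np.asarray(f_restricted, dtype=float)
    f_u = np.asarray(f_unrestricted, dtype=float)
    return alpha * f_r + (1.0 - alpha) * f_u

def mse_ratio(actual, forecast, f_restricted):
    """MSE of `forecast` divided by the MSE of the restricted forecast."""
    actual = np.asarray(actual, dtype=float)
    mse = np.mean((actual - np.asarray(forecast, dtype=float)) ** 2)
    mse_r = np.mean((actual - np.asarray(f_restricted, dtype=float)) ** 2)
    return mse / mse_r

# Made-up example in which the two models err in opposite directions.
actual = np.array([1.0, 2.0, 3.0])
f_r = np.array([0.0, 2.0, 4.0])
f_u = np.array([2.0, 2.0, 2.0])
simple_average = combine(f_r, f_u, alpha=0.5)
ratio_unrestricted = mse_ratio(actual, f_u, f_r)
ratio_average = mse_ratio(actual, simple_average, f_r)
```

In this contrived case the two models are equally accurate, so the unrestricted ratio is one, while the equally weighted average has a lower MSE than either forecast because the errors offset.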
Consider first the 1–step ahead horizon. With all three DGPs, the ratio of the unrestricted model’s average MSE to the restricted model’s average MSE is close to 1.000 for all forecast samples. At the 4-step ahead horizon, for all DGPs the ratio of the unrestricted model’s average MSE to the restricted model’s average MSE is generally above 1.000. The unrestricted model fares especially poorly relative to the restricted in the case of DGP 3, in which the unrestricted model includes five more variables than the restricted. In general,

in all cases, the MSE ratios for 4-step ahead forecasts from the unrestricted model tend to fall as P rises, reflecting the increase in the precision of the x coefficient ($\Gamma_{22}$) estimates that occurs as forecasting moves forward in time and the model estimation sample grows.

A combination of the restricted and unrestricted forecasts has a lower average MSE, with the gains generally increasing in the number of variables omitted from the restricted model and the forecast horizon. At the 1-step horizon, using the known optimal combination weight $\alpha^*_t$ yields P = 20 MSE ratios of .994, .983, and .974 for, respectively, DGPs 1, 2, and 3. At the 4-step horizon, the forecast based on the known optimal combination weight has P = 20 MSE ratios of .986, .962, and .973 for DGPs 1-3.$^{9}$ Not surprisingly, having to estimate the optimal combination weight tends to slightly reduce the gains to combination. For example, in the case of DGP 2 and P = 20, the MSE ratio for the estimated optimal combination forecast is .989, compared to .983 for the known optimal combination forecast. Using the Stein rule-based adjustment to the optimal combination estimate (based on equation (9)) has mixed consequences, sometimes faring a bit worse than the directly estimated optimal combination forecast (based on equation (8)) and sometimes a bit better. To use the same DGP 2 example, the P = 20 MSE ratio for the Stein version of the estimated optimal combination is .990, compared to .989 for the directly estimated optimal combination. However, in the case of 4-step ahead forecasts for DGP 3 with the P = 20 sample, the MSE ratios of the known $\alpha^*_t$, estimated $\hat{\alpha}^*_t$, and Stein-adjusted $\hat{\alpha}^*_t$ are, respectively, .973, .991, and .985.

In the Table 1 experiments, the simple average of the restricted and unrestricted forecasts is consistently a bit more accurate than the estimated optimal combination forecast.
For example, for DGP 3 and the P = 20 forecast sample, the MSE ratio of the simple average forecast is .974 for both 1-step and 4-step ahead forecasts, compared to the estimated optimal combination forecasts’ MSE ratios of .982 (1-step) and .991 (4-step). There are two reasons a simple average fares so well. First, with the DGPs parameterized to make signal = noise for one-step ahead forecasts for period $T - P + 2$, the theoretically optimal combination weight is 1/2. Of course, as forecasting moves forward in time, the theoretically optimal combination weight declines, because as more and more data become available for estimation, the signal-to-noise ratio rises (e.g., in the case of DGP 3, the known optimal

$^{9}$Compared to the restricted model, the gains to combination are a bit larger with DGP 2 than DGP 3. However, consistent with our theory, when the combination forecast is compared to the unrestricted forecast, the gains to combination are (considerably) larger for DGP 3 than DGP 2.

weight for the forecast of the 80th observation in the prediction sample is about .33). But the decline is gradual enough that only late in a long forecast sample would noticeable differences emerge between the theoretically optimal combination forecast and the simple average. A second reason is that, in practice, the optimal combination weight may not be estimated with much precision. As a result, imposing a fixed weight of 1/2 is likely better than trying to estimate a weight that is not dramatically different from 1/2.

3.3.2 Results for signal > noise experiments

In DGPs with larger $b_{ij}$ ($\beta_{22}$) coefficients (specifically, coefficient values set to those obtained from empirical estimates of inflation models), the signal associated with the $x_{it}$ ($x_{22}$) variables exceeds the noise, such that the unrestricted model is expected to be more accurate than the restricted model. In this setting, too, our asymptotic results imply the optimal combination forecast should be more accurate than the unrestricted model forecast, on average. However, relative to the accuracy of the unrestricted model forecast, the gains to combination should be smaller than in DGPs with smaller $b_{ij}$ coefficients.

The results for DGPs 1-3 reported in Table 2 confirm these theoretical implications. At the 1-step ahead horizon, the unrestricted model’s average MSE is about 5-6 percent lower than the restricted model’s MSE in DGP 1 and 3 experiments and roughly 15 percent lower in DGP 2 experiments. At the 4-step ahead horizon, the unrestricted model is more accurate than the restricted by about 12, 28, and 4 percent for DGPs 1, 2, and 3, respectively. Combination using the known optimal combination weight $\alpha^*_t$ improves accuracy further, more so for DGP 3 (for which the unrestricted forecasting model is largest) than DGPs 1 and 2 and more so for the 4-step ahead horizon than the 1-step horizon. Consider, for example, the forecast sample P = 1.
For DGP 2, the known optimal combination forecast’s MSE ratios are .839 (1-step) and .716 (4-step), compared to the unrestricted forecast’s MSE ratios of, respectively, .845 and .723. For DGP 3, the known optimal combination forecast’s MSE ratios are .924 (1-step) and .919 (4-step), compared to the unrestricted forecast’s MSE ratios of, respectively, .947 and .971. Consistent with our theoretical results, the gains to combination seem to be larger under conditions that likely reduce parameter estimation precision (more variables and residual serial correlation created by the multi-step forecast horizon).

Similarly, the gains to combination (gains relative to the unrestricted model’s forecast) rise as the estimation sample gets smaller. Table 3 reports results for the same DGPs used in

Table 2, but for the case in which the initial estimation sample is 40 observations instead of 80. With the smaller estimation sample, DGP 2 simulations yield known optimal combination MSE ratios of .882 (1-step) and .807 (4-step), compared to the unrestricted forecast’s MSE ratios of, respectively, .908 and .851. For DGP 3, the known optimal combination forecast’s MSE ratios are .960 (1-step) and .959 (4-step), compared to the unrestricted forecast’s MSE ratios of, respectively, 1.064 and 1.146.

Again, not surprisingly, having to estimate the optimal combination weight tends to slightly reduce the gains to combination. For instance, in Table 2’s results for DGP 2 and P = 1, the 4-step ahead MSE ratio for the estimated optimal combination forecast is .723, compared to .716 for the known optimal combination forecast. Using the Stein rule-based adjustment to the optimal combination estimate (based on equation (9)) typically reduces forecast accuracy a bit more (to an MSE ratio of .732 in the same example), but not always: the adjustment often improves forecast accuracy with DGP 3 and a small estimation sample (Table 3).

Imposing simple equal weights in averaging the unrestricted and restricted model forecasts sometimes slightly improves upon the estimated optimal combination but other times reduces accuracy. In Table 2’s results for DGPs 1 and 2, the estimated optimal combination is always more accurate than the simple average. For example, with DGP 2 and the 4-step horizon, the P = 20 MSE ratio of the estimated optimal combination forecast is .725, compared to the simple average forecast’s MSE ratio of .767. But for DGP 3, the simple average is often slightly more accurate than the estimated optimal combination. For instance, at the 4-step horizon and with P = 20, the optimal combination and simple average forecast MSE ratios are, respectively, .928 and .919.
As these results suggest, the merits of imposing equal combination weights over estimating weights depend on how far the true optimal weight is from 1/2 (which depends on the population size and precision of the model coefficients) and the precision of the estimated combination weight. In cases in which the known optimal weight is relatively close to 1/2 (DGP 3, 1-step forecast, Table 2), the simple average performs quite similarly to the known optimal forecast, and better than the estimated optimal combination. In cases in which the known optimal weight is far from 1/2 (DGP 2, 1-step forecast, Table 2), the simple average is dominated by the known optimal forecast and, in turn, the estimated optimal combination. Consistent with such reasoning, reducing the initial estimation sample generally improves the accuracy of the simple average forecast relative to the estimated optimal combination. For example, Table 3 shows that, with DGP 2 and the 4-step horizon, the P = 20 MSE ratio of the simple average forecast is .789, compared to the estimated optimal combination forecast’s MSE ratio of .775 (in Table 2, the corresponding figures are .767 and .725).

3.3.3 Distributional results

In addition to helping to lower the average forecast MSE, combination of restricted and unrestricted forecasts helps to tighten the distribution of relative accuracy (specifically, the MSE relative to the MSE of the restricted model). The results in Table 4 indicate that combination, especially simple averaging, often increases the probability of equaling or beating the MSE of the restricted model, often by more than it lowers average MSE (note that, to conserve space, the table omits results for DGP 1). For instance, with DGP 2 parameterized such that signal = noise for forecasting 1-step ahead to period $T - P + 2$, the frequency with which the unrestricted model’s MSE is less than or equal to the restricted model’s MSE is 47.2 percent for P = 20. The frequency with which the known optimal combination forecast’s MSE is below the restricted model’s MSE is 57.4 percent. Although the estimated combination does not fare as well (probability of 51.4 percent), a simple average fares even better, beating the MSE of the restricted model in 58.1 percent of the simulations. Note also that, by this distributional metric, using the Stein variant of the combination weight estimate often offers a material advantage over the direct approach to estimating the combination weight. In the same example, the Stein-based combination forecast has a probability of 55.3 percent, compared to the 51.4 percent for the directly estimated combination forecast.
By this probability metric, the simple average (and, to a lesser extent, the optimal combination based on the Stein rule estimate) also fares well in other experiments. Consider, for example, the experiments with the signal > noise version of DGP 3, a forecast horizon of 4 steps, and P = 20. In this case, the probability the unrestricted model yields an MSE less than or equal to the restricted model’s MSE is 54.3 percent. The probabilities for the estimated optimal combination, Stein-estimated optimal combination, and simple average are, respectively, 62.9, 64.9, and 69.1 percent. Again, averaging, especially simple averaging, greatly improves the probability of beating the accuracy of the restricted model forecast.
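The two summary measures used in this section, the ratio of average MSEs and the frequency of equaling or beating the restricted model, can be computed from per-simulation MSEs along the following lines. The function names and the four made-up draws are hypothetical and serve only to show that the two measures can move differently.

```python
import numpy as np

def average_mse_ratio(mse_forecast, mse_restricted):
    """Ratio of the average MSE across simulations to the restricted
    model's average MSE across the same simulations."""
    return float(np.mean(mse_forecast) / np.mean(mse_restricted))

def prob_equal_or_beat(mse_forecast, mse_restricted):
    """Share of simulations in which the forecast's MSE is less than or
    equal to the restricted model's MSE."""
    mse_forecast = np.asarray(mse_forecast, dtype=float)
    mse_restricted = np.asarray(mse_restricted, dtype=float)
    return float(np.mean(mse_forecast <= mse_restricted))

# Made-up MSEs from four simulation draws.
mse_combo = np.array([0.9, 1.0, 1.2, 0.8])
mse_restr = np.array([1.0, 1.0, 1.0, 1.0])
ratio = average_mse_ratio(mse_combo, mse_restr)
prob = prob_equal_or_beat(mse_combo, mse_restr)
```

Here the average MSE falls only slightly relative to the restricted model, yet the forecast equals or beats the restricted model in three of the four draws.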

4 Empirical Applications

To evaluate the empirical performance of our proposed forecast methods compared to some related alternatives (described below), we consider the widely studied problem of forecasting inflation with Phillips curve models. In particular, we examine forecasts of quarterly core PCE (U.S.) inflation. In light of the potential for the benefits of forecast combination to rise as the number of variables and, in turn, overall parameter estimation imprecision increases, we consider a range of applications, including between one and five predictors of core inflation. In a first application, patterned on analyses in such studies as Stock and Watson (1999, 2003), Orphanides and van Norden (2005), and Clark and McCracken (2006), the unrestricted forecasting model includes lags of inflation and the output gap. In a second application, the unrestricted forecasting model is augmented to include lags of food and energy price inflation, following Gordon’s (1998) approach of including supply shock measures in the Phillips curve. In another set of applications, patterned on such studies as Brave and Fisher (2004), Stock and Watson (2002, 2005), and Boivin and Ng (2005), the unrestricted forecast model includes lags of inflation and 1, 2, 3, or 5 common business cycle factors, estimated as in Stock and Watson (2005).

This section proceeds by detailing the data and forecasting models, describing some additional forecast methods included for comparison, and presenting the results.

4.1 Data and model details

Inflation is measured in annualized percentage terms (as 400 times the log change in the price index).$^{10}$ The output gap is measured as the log of real GDP less the log of CBO’s estimate of potential GDP. Following Gordon (1998), the food and energy price inflation variable is measured as overall PCE inflation less core PCE inflation.
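The three data transformations just described are simple enough to state in code. The sketch below uses hypothetical function names and made-up index values; it is meant only to pin down the arithmetic, not to reproduce the actual data handling.

```python
import numpy as np

def annualized_inflation(price_index):
    """400 times the log change in a quarterly price index."""
    p = np.asarray(price_index, dtype=float)
    return 400.0 * np.diff(np.log(p))

def output_gap(real_gdp, potential_gdp):
    """Log of real GDP less the log of potential GDP."""
    return np.log(np.asarray(real_gdp, dtype=float)) - np.log(
        np.asarray(potential_gdp, dtype=float))

def food_energy_inflation(overall_inflation, core_inflation):
    """Overall PCE inflation less core PCE inflation."""
    return np.asarray(overall_inflation, dtype=float) - np.asarray(
        core_inflation, dtype=float)

# Made-up index levels growing 1 percent per quarter.
core_index = np.array([100.0, 101.0, 102.01])
overall_index = np.array([100.0, 101.5, 103.0225])
core = annualized_inflation(core_index)
overall = annualized_inflation(overall_index)
relative = food_energy_inflation(overall, core)
gap = output_gap([102.0], [100.0])
```

A 1 percent quarterly rise in the index maps to roughly 3.98 percent at an annual rate under the log-change convention.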
The common factors are estimated with the principal component approach of Stock and Watson (2002, 2005), using a data set of 127 monthly series nearly identical to Stock and Watson’s (2005).$^{11}$ Following the specifications of Stock and Watson (2005), we first transformed the data for stationarity, screened for outliers, and standardized the data, and then computed principal

$^{10}$Data on actual real GDP and the PCE price indexes are taken from the FAME database of the Federal Reserve Board of Governors. Data on the CBO’s estimate of potential output are taken from the CBO’s website. The data used to estimate business cycle factors are from a variety of sources, including FAME, the Conference Board, and the BEA’s website.

$^{11}$Due to changes in data availability, in a few cases we were unable to obtain continuous series for variables used by Stock and Watson (2005).

components at the monthly frequency. Following Stock and Watson (2005) and Boivin and Ng (2005), the factors are estimated recursively for each month of the forecast sample, applying the factor estimation algorithm to data through the given month. Quarterly data on factors used in model estimation ending in quarter t are within-quarter averages of monthly factors estimated with data from the beginning of the sample through the last month of quarter t.

Following the basic approach of Stock and Watson (1999, 2003), among others, we treat inflation as having a unit root, and forecast a measure of the direct multi-step change in inflation as a function of lags of the change in quarterly inflation and lags of other variables. In particular, using the notation of the last section, we make $y$ the log difference of the quarterly core PCE price index (scaled by 400 to make $y$ an annualized percentage change); $\Delta y$ is then the change in quarterly inflation. The predictand is $y^{(\tau)}_{t+\tau} - y_t$, where $y^{(\tau)}_{t+\tau}$ denotes the average annual rate of price change from $t$ to $t+\tau$. The x variables denote the output gap, relative food and energy price inflation, and the set of common factors included in the model (with the number ranging from 1 to 5). The restricted model is autoregressive: the multi-step change in inflation is a function of just lags of the one-period change in inflation. The unrestricted model adds lags of x variables to the set of regressors.

In particular, the competing forecasting models take the forms of section 3’s equations (13) and (14). All models include four lags of the change in inflation ($\Delta y_t$). For the output gap and the factors, the models use one lag. For food-energy inflation, the models include two lags. The forecasting models are estimated with data starting in 1961:Q1. The parameters of the forecasting models are re-estimated with added data as forecasting moves forward through time (that is, our forecasting scheme is the so-called recursive scheme).
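A minimal sketch of such a recursive (expanding-window) scheme is given below, assuming the target and regressors are stored in arrays indexed by forecast origin. The function name `recursive_forecasts` and its arguments are hypothetical; the only substantive point is that, at origin $t$, a $\tau$-step direct regression can use only those observations whose $\tau$-period-ahead target has already been realized.

```python
import numpy as np

def recursive_forecasts(target, regressors, tau, first_origin):
    """Expanding-window OLS forecasts for a direct tau-step regression.

    target[s]     : predictand dated by origin s (e.g. y^(tau)_{s+tau} - y_s)
    regressors[s] : predictors known at origin s (1-D or 2-D array)
    Requires first_origin - tau + 1 to be at least the number of
    coefficients so the first regression is identified.
    """
    target = np.asarray(target, dtype=float)
    X = np.column_stack([np.ones(len(target)), regressors])
    forecasts = {}
    for t in range(first_origin, len(target)):
        # Origins s <= t - tau have a target that is observed by time t.
        last = t - tau + 1
        beta, *_ = np.linalg.lstsq(X[:last], target[:last], rcond=None)
        forecasts[t] = float(X[t] @ beta)
    return forecasts

# Toy check with an exactly linear relationship (made-up data).
x = np.arange(20, dtype=float)
fc = recursive_forecasts(2.0 + 3.0 * x, x, tau=4, first_origin=10)
```

Passing only the lagged inflation changes as `regressors` would correspond to the restricted model, and adding the x variables to the unrestricted model.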
The forecast sample is 1985:Q1 (1985:Q4 for four-step ahead forecasts) through 2006:Q2. We report results (MSEs) for forecast horizons of one quarter and one year.

4.2 Additional forecast methods

Because our proposed forecast combination methods correspond to a form of shrinkage, for comparison we supplement our results to include not only our proposed methods but also some alternative shrinkage forecasts based on Bayesian methods. Doan, Litterman, and Sims (1984) suggest that conventional Bayesian estimation (specifically, the prior) provides a flexible method for balancing the tradeoff between signal and parameter estimation noise.

Accordingly, one alternative forecast is obtained from the unrestricted forecasting model (for a given application) estimated with generalized ridge regression, which is similar to and under some implementations identical to conventional BVAR estimation. Consistent with the spirit of our proposed combination approaches, which try to limit the effects of sampling noise in the coefficients of the x variables, the ridge estimator pushes the coefficients on the x variables toward zero by imposing informative prior variances on the associated coefficients (note that the tightness of the prior increases with the number of lags of x included). The ridge estimator allows very large variances on the coefficients of the intercept and lagged inflation terms. In the case of the 1-step ahead model, our generalized ridge estimator is exactly the same as the conventional Bayesian or BVAR estimator of Litterman (1986), except that we use flat priors on the intercept and lagged inflation terms.$^{12}$ We apply the same priors to the 4-step ahead model (based on some experimentation to ensure the prior setting worked well).

We report a second alternative forecast constructed by applying Bayesian model averaging (BMA) to the restricted and unrestricted models, following the BMA approach of Fernandez, Ley, and Steel (2001). In particular, we first estimate the models imposing a simple g-prior (but with a flat prior on intercepts), and then average the models based on posterior probabilities calculated as in Fernandez, Ley, and Steel (2001). Based on Wright’s (2003) findings on forecasting inflation with BMA methods, we set the g-prior coefficient ($g_{0j}$ in the notation of Fernandez, et al., or $1/\phi$ in Wright’s notation) at .20.

4.3 Results

In very broad terms, the results in Table 5 seem reasonably reflective of the overall literature on forecasting U.S. inflation in data since the mid-1980s: the variables included in the unrestricted model but not the restricted only sometimes improve forecast accuracy.
Across the 12 columns of Table 5 (covering six applications and two forecast horizons), the restricted model’s MSE is lower than the unrestricted model’s in six cases, sometimes slightly (e.g., 1-year ahead forecasts from the model with five factors) and sometimes dramatically (e.g., 1-year ahead forecasts from the model with the output gap and food-energy inflation).

Combining forecasts with our proposed methods significantly improves upon the accuracy of the unrestricted model’s forecast, by enough that, in each column, at least one of the average forecasts is more accurate than the restricted model’s forecast. For every application and horizon, our estimated optimal combination forecast has a lower MSE than the unrestricted model. For example, in the three factor application (lower block, middle), the optimal combination forecast has a 1-year ahead MSE ratio of .791, while the unrestricted model has an MSE ratio of .879. Consistent with our theoretical results, the advantage of the combination forecast over the unrestricted forecast tends to rise as the number of x variables in the unrestricted model increases (with the increase in the number of variables tending to lower the signal-noise ratio) and as the forecast horizon increases. For example, in the same (three factor) application, the 1-quarter ahead MSE ratios of the unrestricted and optimal combination forecasts are, respectively, .950 and .935, closer together than for the 1-year ahead horizon. In the five factor application, the 1-year ahead MSE ratios of the unrestricted and optimal combination forecasts are 1.001 and .834, farther apart than in the three factor application.

Estimating the optimal combination weight with our proposed Stein rule-based approach yields a consistent, modest improvement in forecast accuracy. In all columns of Table 5, the optimal combination forecast based on the Stein-estimated weight (9) has a lower MSE than does the optimal combination based on the baseline approach (8). In the same three factor application, at the 1-year horizon the optimal combination based on the Stein-estimated weight has an MSE ratio of .781, compared to the directly estimated optimal combination forecast’s MSE ratio of .791. With five factors and the 1-year horizon, the Stein-estimated optimal combination’s MSE ratio is .807, while the directly estimated optimal combination forecast’s MSE ratio is .834.

$^{12}$In the notation of Litterman, we use the following parameter settings in determining the prior variances: $\lambda = .2$ and $\theta = .5$. In some supplemental analysis, we verified that this generalized ridge forecast was at least as good as a similar ridge forecast that shrinks all coefficients, in line with conventional BVAR estimation. Note that we describe the estimator as generalized ridge rather than BVAR because, in the multi-step case, the estimator is not a proper Bayesian estimator.
In most cases, imposing equal weights in combining the restricted and unrestricted model forecasts further improves forecast accuracy, sometimes substantially. As a result, in many cases, the simple average forecast is the best forecast of all considered. For instance, in the output gap and food-energy inflation application, the 1-year ahead MSE ratios of the simple average and Stein-estimated combination forecasts are .871 (the lowest among all forecasts) and 1.020, respectively. In the application with two factors, the 1-year ahead MSE ratios are .854 (again, the lowest among all forecasts) and .866 for, respectively, the simple average and Stein-estimated combination forecasts. In some cases, though, the simple average is only slightly better than or worse than our proposed Stein rule-based

approach. For example, in the three factor application, the simple average forecast’s 1-year ahead MSE ratio is .782, compared to the Stein-estimated combination forecast’s MSE ratio of .781.

In these applications, our proposed combinations clearly dominate Bayesian model averaging and are generally about as good as or better than ridge regression.$^{13}$ For example, in the two factor application, the 1-year ahead MSE ratio is .866 for the Stein-estimated combination, .854 for the simple average, and .891 for the ridge regression forecast. In the same application, though, the 1-quarter ahead MSE ratios are virtually identical, at .953, .950, and .950, respectively. In the application with the output gap and food-energy inflation as predictors, the 1-quarter ahead MSE ratios of the Stein-estimated, simple average, and ridge regression forecasts are, respectively, 1.060, .983, and 1.032. However, in all cases, the BMA forecast is less accurate than the Stein-estimated combination and simple average forecasts. For instance, in the two factor application, the BMA forecast has MSE ratios of .971 and .934 at the 1-quarter and 1-year ahead horizons (compared, e.g., to the Stein-estimated combination MSE ratios of .953 and .866).

5 Conclusion

As reflected in the principle of parsimony, when some variables are truly but weakly related to the variable being forecast, having the additional variables in the model may detract from forecast accuracy, because of parameter estimation error. Focusing on such cases of weak predictability, we show that combining the forecasts of the parsimonious and larger models can improve forecast accuracy. We first derive, theoretically, the optimal combination weight and combination benefit. In the special case in which the coefficients on the variables of interest are of a magnitude that makes the restricted and unrestricted models equally accurate, the MSE-minimizing forecast is a simple, equally-weighted average of the restricted and unrestricted forecasts.
A range of Monte Carlo experiments and empirical examples show our proposed approach of combining forecasts from nested models to be effective in practice.

$^{13}$Of course, it is possible that alternative specifications of Bayesian/ridge estimation and BMA could improve upon those we have considered (although we did experiment with some alternatives, none of which beat those for which we have reported results). At a minimum, though, our proposed combination approaches would seem likely to at least remain competitive with such alternative Bayesian approaches.

6 Appendix 1: Theory Details

Note that, in the notation below, $W(\cdot)$ denotes a standard $(k \times 1)$ Brownian motion.

Theorem 1:

$$
\begin{aligned}
\sum_{t=T-P+1}^{T}(\hat{u}^2_{2,t+\tau} - \hat{u}^2_{W,t+\tau}) \rightarrow_d\;
&\Big\{-2\int_{1-\lambda_P}^{1}\alpha(s)s^{-1}W'(s)V^{1/2}(-JB_1J'+B_2)V^{1/2}dW(s) \\
&\quad + \int_{1-\lambda_P}^{1}(1-(1-\alpha(s))^2)s^{-2}W'(s)V^{1/2}(-JB_1J'+B_2)V^{1/2}W(s)ds\Big\} \\
&+ 2\Big\{-\int_{1-\lambda_P}^{1}\alpha(s)\delta'B_2^{-1}(-JB_1J'+B_2)V^{1/2}dW(s) \\
&\quad + \int_{1-\lambda_P}^{1}\alpha^2(s)s^{-1}\delta'B_2^{-1}(-JB_1J'+B_2)B_2^{-1}JB_1J'V^{1/2}W(s)ds \\
&\quad + \int_{1-\lambda_P}^{1}\alpha(s)(1-\alpha(s))s^{-1}\delta'B_2^{-1}(-JB_1J'+B_2)V^{1/2}W(s)ds\Big\} \\
&+ \Big\{-\int_{1-\lambda_P}^{1}\alpha^2(s)\delta'B_2^{-1}(-JB_1J'+B_2)B_2^{-1}\delta\,ds\Big\}.
\end{aligned}
$$

Proof of Theorem 1: The proof is provided in two stages. In the first stage we provide an asymptotic expansion. In the second we apply a functional central limit theorem and a weak convergence to stochastic integrals result, both from de Jong and Davidson (2000).

In the first stage we show that

$$
\begin{aligned}
&\sum_{t=T-P+1}^{T}(\hat{u}^2_{T,2,t+\tau} - \hat{u}^2_{T,W,t+\tau}) \\
&= \Big\{-2\sum_{t=T-P+1}^{T}\alpha_t(T^{-1/2}h'_{T,2,t+\tau})(-JB_1J'+B_2)(T^{1/2}H_{T,2}(t)) \\
&\qquad + T^{-1}\sum_{t=T-P+1}^{T}(1-(1-\alpha_t)^2)(T^{1/2}H'_{T,2}(t))(-JB_1J'+B_2)(T^{1/2}H_{T,2}(t))\Big\} \\
&\quad + 2\Big\{-\sum_{t=T-P+1}^{T}\alpha_t\delta'B_2^{-1}(-JB_1J'+B_2)(T^{-1/2}h_{T,2,t+\tau}) \\
&\qquad + T^{-1}\sum_{t=T-P+1}^{T}\alpha_t^2\delta'B_2^{-1}(-JB_1J'+B_2)B_2^{-1}JB_1J'(T^{1/2}H_{T,2}(t)) \\
&\qquad + T^{-1}\sum_{t=T-P+1}^{T}\alpha_t(1-\alpha_t)\delta'B_2^{-1}(-JB_1J'+B_2)(T^{1/2}H_{T,2}(t))\Big\} \\
&\quad + \Big\{-T^{-1}\sum_{t=T-P+1}^{T}\alpha_t^2\delta'B_2^{-1}(-JB_1J'+B_2)B_2^{-1}\delta\Big\} + o_p(1).
\end{aligned}
\tag{15}
$$

To do so first note that straightforward algebra reveals that

$$
\begin{aligned}
&\sum_{t=T-P+1}^{T}(\hat{u}^2_{2,t+\tau} - \hat{u}^2_{W,t+\tau}) \\
&= \Big\{-2\sum_{t=T-P+1}^{T}\alpha_t(T^{-1/2}h'_{T,2,t+\tau})(-JB_1(t)J'+B_2(t))(T^{1/2}H_{T,2}(t)) \\
&\qquad + T^{-1}\sum_{t=T-P+1}^{T}(1-(1-\alpha_t)^2)(T^{1/2}H'_{T,2}(t))B_2(t)x_{T,2,t}x'_{T,2,t}B_2(t)(T^{1/2}H_{T,2}(t)) \\
&\qquad - T^{-1}\sum_{t=T-P+1}^{T}\alpha_t^2(T^{1/2}H'_{T,2}(t))JB_1(t)J'x_{T,2,t}x'_{T,2,t}JB_1(t)J'(T^{1/2}H_{T,2}(t)) \\
&\qquad - 2T^{-1}\sum_{t=T-P+1}^{T}\alpha_t(1-\alpha_t)(T^{1/2}H'_{T,2}(t))B_2(t)x_{T,2,t}x'_{T,2,t}JB_1(t)J'(T^{1/2}H_{T,2}(t))\Big\} \\
&\quad + 2\Big\{-\sum_{t=T-P+1}^{T}\alpha_t\delta'B_2^{-1}(t)(-JB_1(t)J'+B_2(t))(T^{-1/2}h_{T,2,t+\tau}) \\
&\qquad + T^{-1}\sum_{t=T-P+1}^{T}\alpha_t^2\delta'B_2^{-1}(t)(-JB_1(t)J'+B_2(t))x_{T,2,t}x'_{T,2,t}JB_1(t)J'(T^{1/2}H_{T,2}(t)) \\
&\qquad + T^{-1}\sum_{t=T-P+1}^{T}\alpha_t(1-\alpha_t)\delta'B_2^{-1}(t)(-JB_1(t)J'+B_2(t))x_{T,2,t}x'_{T,2,t}B_2(t)(T^{1/2}H_{T,2}(t))\Big\} \\
&\quad + \Big\{-T^{-1}\sum_{t=T-P+1}^{T}\alpha_t^2\delta'B_2^{-1}(t)(-JB_1(t)J'+B_2(t))x_{T,2,t}x'_{T,2,t}(-JB_1(t)J'+B_2(t))B_2^{-1}(t)\delta\Big\}.
\end{aligned}
\tag{16}
$$

We must then show that each bracketed term from (15) corresponds to that in (16). For brevity we will show this in detail only for the first bracketed term. The second and third follow from similar arguments.

Consider the first bracketed term in (16). If we add and subtract $-JB_1J'+B_2$ in the first component, and rearrange terms we obtain

$$
\begin{aligned}
&-2\sum_{t=T-P+1}^{T}\alpha_t(T^{-1/2}h'_{T,2,t+\tau})(-JB_1(t)J'+B_2(t))(T^{1/2}H_{T,2}(t)) \\
&= -2\sum_{t=T-P+1}^{T}\alpha_t(T^{-1/2}h'_{T,2,t+\tau})(-JB_1J'+B_2)(T^{1/2}H_{T,2}(t)) \\
&\quad - 2T^{-1/2}\sum_{t=T-P+1}^{T}\alpha_t[(T^{1/2}H'_{T,2}(t)) \otimes (T^{-1/2}h'_{T,2,t+\tau})]\operatorname{vec}(T^{1/2}[(-JB_1(t)J'+B_2(t)) - (-JB_1J'+B_2)]).
\end{aligned}
$$

The first right-hand side term is the desired result. For the second right-hand side term first note that Assumptions 3 and 4 suffice for each of $\alpha_t$, $T^{1/2}H'_{T,2}(t)$ and $\operatorname{vec}(T^{1/2}[(-JB_1(t)J'+B_2(t)) - (-JB_1J'+B_2)])$ to converge weakly. Applying Theorem 3.2 of de Jong and Davidson (2000) then implies that the second right-hand side term is $o_p(1)$ and the proof is complete.

For the second, third and fourth components of the first bracketed term note that adding and subtracting $B_2$, $B_2^{-1}$, $B_1$ and $B_1^{-1}$ provides

$$
\begin{aligned}
&T^{-1}\sum_{t=T-P+1}^{T}(1-(1-\alpha_t)^2)(T^{1/2}H'_{T,2}(t))B_2(t)x_{T,2,t}x'_{T,2,t}B_2(t)(T^{1/2}H_{T,2}(t)) \\
&= T^{-1}\sum_{t=T-P+1}^{T}(1-(1-\alpha_t)^2)(T^{1/2}H'_{T,2}(t))B_2(T^{1/2}H_{T,2}(t)) \\
&\quad + 2T^{-1}\sum_{t=T-P+1}^{T}(1-(1-\alpha_t)^2)(T^{1/2}H'_{T,2}(t))(B_2(t)-B_2)(T^{1/2}H_{T,2}(t)) \\
&\quad + T^{-1}\sum_{t=T-P+1}^{T}(1-(1-\alpha_t)^2)(T^{1/2}H'_{T,2}(t))B_2(x_{T,2,t}x'_{T,2,t}-B_2^{-1})B_2(T^{1/2}H_{T,2}(t)) \\
&\quad + 2T^{-1}\sum_{t=T-P+1}^{T}(1-(1-\alpha_t)^2)(T^{1/2}H'_{T,2}(t))B_2(x_{T,2,t}x'_{T,2,t}-B_2^{-1})(B_2(t)-B_2)(T^{1/2}H_{T,2}(t)) \\
&\quad + T^{-1}\sum_{t=T-P+1}^{T}(1-(1-\alpha_t)^2)(T^{1/2}H'_{T,2}(t))(B_2(t)-B_2)(x_{T,2,t}x'_{T,2,t}-B_2^{-1})(B_2(t)-B_2)(T^{1/2}H_{T,2}(t)) \\
&\quad + 2T^{-1}\sum_{t=T-P+1}^{T}(1-(1-\alpha_t)^2)(T^{1/2}H'_{T,2}(t))(B_2(t)-B_2)B_2^{-1}(B_2(t)-B_2)(T^{1/2}H_{T,2}(t)),
\end{aligned}
\tag{17}
$$

$$
\begin{aligned}
&T^{-1}\sum_{t=T-P+1}^{T}\alpha_t^2(T^{1/2}H'_{T,2}(t))JB_1(t)J'x_{T,2,t}x'_{T,2,t}JB_1(t)J'(T^{1/2}H_{T,2}(t)) \\
&= T^{-1}\sum_{t=T-P+1}^{T}\alpha_t^2(T^{1/2}H'_{T,2}(t))JB_1J'(T^{1/2}H_{T,2}(t)) \\
&\quad + 2T^{-1}\sum_{t=T-P+1}^{T}\alpha_t^2(T^{1/2}H'_{T,2}(t))JB_1J'B_2^{-1}J(B_1(t)-B_1)J'(T^{1/2}H_{T,2}(t)) \\
&\quad + T^{-1}\sum_{t=T-P+1}^{T}\alpha_t^2(T^{1/2}H'_{T,2}(t))JB_1J'(x_{T,2,t}x'_{T,2,t}-B_2^{-1})JB_1J'(T^{1/2}H_{T,2}(t)) \\
&\quad + 2T^{-1}\sum_{t=T-P+1}^{T}\alpha_t^2(T^{1/2}H'_{T,2}(t))JB_1J'(x_{T,2,t}x'_{T,2,t}-B_2^{-1})J(B_1(t)-B_1)J'(T^{1/2}H_{T,2}(t)) \\
&\quad + T^{-1}\sum_{t=T-P+1}^{T}\alpha_t^2(T^{1/2}H'_{T,2}(t))J(B_1(t)-B_1)J'(x_{T,2,t}x'_{T,2,t}-B_2^{-1})J(B_1(t)-B_1)J'(T^{1/2}H_{T,2}(t)) \\
&\quad + T^{-1}\sum_{t=T-P+1}^{T}\alpha_t^2(T^{1/2}H'_{T,2}(t))J(B_1(t)-B_1)J'B_2^{-1}J(B_1(t)-B_1)J'(T^{1/2}H_{T,2}(t)),
\end{aligned}
\tag{18}
$$

$$
\begin{aligned}
&T^{-1}\sum_{t=T-P+1}^{T}\alpha_t(1-\alpha_t)(T^{1/2}H'_{T,2}(t))B_2(t)x_{T,2,t}x'_{T,2,t}JB_1(t)J'(T^{1/2}H_{T,2}(t)) \\
&= T^{-1}\sum_{t=T-P+1}^{T}\alpha_t(1-\alpha_t)(T^{1/2}H'_{T,2}(t))JB_1J'(T^{1/2}H_{T,2}(t)) \\
&\quad + T^{-1}\sum_{t=T-P+1}^{T}\alpha_t(1-\alpha_t)(T^{1/2}H'_{T,2}(t))B_2(x_{T,2,t}x'_{T,2,t}-B_2^{-1})JB_1J'(T^{1/2}H_{T,2}(t)) \\
&\quad + T^{-1}\sum_{t=T-P+1}^{T}\alpha_t(1-\alpha_t)(T^{1/2}H'_{T,2}(t))B_2(x_{T,2,t}x'_{T,2,t}-B_2^{-1})J(B_1(t)-B_1)J'(T^{1/2}H_{T,2}(t)) \\
&\quad + T^{-1}\sum_{t=T-P+1}^{T}\alpha_t(1-\alpha_t)(T^{1/2}H'_{T,2}(t))(B_2(t)-B_2)B_2^{-1}JB_1J'(T^{1/2}H_{T,2}(t)) \\
&\quad + T^{-1}\sum_{t=T-P+1}^{T}\alpha_t(1-\alpha_t)(T^{1/2}H'_{T,2}(t))(B_2(t)-B_2)B_2^{-1}J(B_1(t)-B_1)J'(T^{1/2}H_{T,2}(t)) \\
&\quad + T^{-1}\sum_{t=T-P+1}^{T}\alpha_t(1-\alpha_t)(T^{1/2}H'_{T,2}(t))J(B_1(t)-B_1)J'(T^{1/2}H_{T,2}(t)) \\
&\quad + T^{-1}\sum_{t=T-P+1}^{T}\alpha_t(1-\alpha_t)(T^{1/2}H'_{T,2}(t))(B_2(t)-B_2)(x_{T,2,t}x'_{T,2,t}-B_2^{-1})J(B_1(t)-B_1)J'(T^{1/2}H_{T,2}(t)) \\
&\quad + T^{-1}\sum_{t=T-P+1}^{T}\alpha_t(1-\alpha_t)(T^{1/2}H'_{T,2}(t))(B_2(t)-B_2)(x_{T,2,t}x'_{T,2,t}-B_2^{-1})JB_1J'(T^{1/2}H_{T,2}(t)).
\end{aligned}
\tag{19}
$$

Note that the weighted sum of the first right-hand side term of each of (17)-(19) gives us

$$
\begin{aligned}
&T^{-1}\sum_{t=T-P+1}^{T}(1-(1-\alpha_t)^2)(T^{1/2}H'_{T,2}(t))B_2(T^{1/2}H_{T,2}(t)) \\
&\quad - T^{-1}\sum_{t=T-P+1}^{T}\alpha_t^2(T^{1/2}H'_{T,2}(t))JB_1J'(T^{1/2}H_{T,2}(t)) \\
&\quad - 2T^{-1}\sum_{t=T-P+1}^{T}\alpha_t(1-\alpha_t)(T^{1/2}H'_{T,2}(t))JB_1J'(T^{1/2}H_{T,2}(t)) \\
&= T^{-1}\sum_{t=T-P+1}^{T}(1-(1-\alpha_t)^2)(T^{1/2}H'_{T,2}(t))(-JB_1J'+B_2)(T^{1/2}H_{T,2}(t)),
\end{aligned}
$$

the second right-hand side term in (15). We must therefore show that all of the remaining right-hand side terms in (17)-(19) are $o_p(1)$. The proof of each is very similar. For example, taking the absolute value of the fifth right-hand side term in (17) provides

$$
\begin{aligned}
&\Big|T^{-1}\sum_{t=T-P+1}^{T}(1-(1-\alpha_t)^2)(T^{1/2}H'_{T,2}(t))(B_2(t)-B_2)(x_{T,2,t}x'_{T,2,t}-B_2^{-1})(B_2(t)-B_2)(T^{1/2}H_{T,2}(t))\Big| \\
&\leq k^4\Big(\sup_t|T^{1/2}H_{T,2}(t)|\Big)^2\Big(\sup_t|B_2(t)-B_2|\Big)^2\Big(T^{-1}\sum_{t=T-P+1}^{T}|x_{T,2,t}x'_{T,2,t}-B_2^{-1}|\Big).
\end{aligned}
$$
Since Assumptions 3 and 4 suffice for $T^{-1}\sum_{t=T-P+1}^{T}|x_{T,2,t}x'_{T,2,t}-B_2^{-1}| = O_p(1)$, $\sup_t |T^{1/2}H_{T,2}(t)| = O_p(1)$ and $\sup_t |B_2(t)-B_2| = o_p(1)$ we obtain the desired result.

For the second stage of the proof we show that the expansion in (15) converges in distribution to the term provided in the Theorem. To do so recall that Assumption 4 implies $\alpha_t \Rightarrow \alpha(s)$. Also, Assumptions 3 (a) - (d) imply $T^{1/2}H_{T,2}(t) \Rightarrow s^{-1}V^{1/2}W(s)$. Continuity then provides the desired results for the second contribution to the first bracketed term, for the second and third contributions to the second bracketed term and the third bracketed term.

The remaining two contributions (the first in each of the first two bracketed terms), are each weighted sums of increments $h_{T,2,t+\tau}$. Consider the first contribution to the second bracketed term. Since this increment satisfies Assumption 3 (d) and has an associated long-run variance $V$, we can apply Theorem 4.1 of de Jong and Davidson (2000) directly to obtain the desired convergence in distribution

$$-\sum_{t=T-P+1}^{T}\alpha_t\,\delta'B_2^{-1}(-JB_1J'+B_2)(T^{-1/2}h_{T,2,t+\tau}) \;\to_d\; -\int_{1-\lambda_P}^{1}\alpha(s)\,\delta'B_2^{-1}(-JB_1J'+B_2)V^{1/2}\,dW(s).$$

For the first contribution to the first bracketed term additional care is needed. Again, since the increments satisfy Assumption 3 (d) with long-run variance $V$ we can apply Theorem 4.1 of de Jong and Davidson (2000) to obtain

$$-2\sum_{t=T-P+1}^{T}\alpha_t (T^{-1/2}h'_{T,2,t+\tau})(-JB_1J'+B_2)(T^{1/2}H_{T,2}(t)) \;\to_d\; -2\int_{1-\lambda_P}^{1}\alpha(s)s^{-1}W'(s)V^{1/2}(-JB_1J'+B_2)V^{1/2}\,dW(s) + \Lambda.$$

Note the addition of the drift term $\Lambda$. To obtain the desired result we must show that this term is zero. A detailed proof is provided in Lemma A6 of Clark and McCracken (2005), albeit under the technical conditions provided in Hansen (1992) rather than those provided here. Rather than repeat the proof we provide an intuitive argument. Note that $H_{T,2}(t) = t^{-1}\sum_{s=1}^{t-\tau}h_{T,2,s+\tau}$. In particular note the range of summation. Since Assumption 3 (b) maintains that the increments of the stochastic integral $h_{T,2,t+\tau}$ form an MA($\tau-1$) we find that $h_{T,2,t+\tau}$ is uncorrelated with every element of $H_{T,2}(t)$. Since $\Lambda$ captures the contribution to the mean of the limiting distribution due to covariances between the increments $h_{T,2,t+\tau}$ and the elements of $H_{T,2}(t)$ we know that $\Lambda = 0$ and the proof is complete.

Proof of Corollary 1: First note that the assumptions, and notably Assumption 3 (c), suffice for uniform integrability of the difference in MSEs and hence the limit of the expectation converges to the expectation of the limit.
Second, note that both the second bracketed term and the first component of the first bracketed term are zero mean and moreover, the third bracketed term is nonstochastic. Taking expectations of the limit we then obtain

$$\begin{aligned}
E\left\{\int_{1-\lambda_P}^{1}\xi_W(s)\right\}
&= \left\{0 + \int_{1-\lambda_P}^{1}(1-(1-\alpha(s))^2)s^{-2}E[W'(s)V^{1/2}(-JB_1J'+B_2)V^{1/2}W(s)]\,ds\right\} \\
&\qquad+ \{0\} - \int_{1-\lambda_P}^{1}\alpha^2(s)\,\delta'B_2^{-1}(-JB_1J'+B_2)B_2^{-1}\delta\,ds \\
&= \int_{1-\lambda_P}^{1}(1-(1-\alpha(s))^2)s^{-2}\,\mathrm{tr}(E[W(s)W'(s)](-JB_1J'+B_2)V)\,ds \\
&\qquad- \int_{1-\lambda_P}^{1}\alpha^2(s)\,\delta'B_2^{-1}(-JB_1J'+B_2)B_2^{-1}\delta\,ds \\
&= \int_{1-\lambda_P}^{1}(1-(1-\alpha(s))^2)s^{-1}\,\mathrm{tr}((-JB_1J'+B_2)V)\,ds \\
&\qquad- \int_{1-\lambda_P}^{1}\alpha^2(s)\,\delta'B_2^{-1}(-JB_1J'+B_2)B_2^{-1}\delta\,ds.
\end{aligned}$$

Proof of Corollary 2: We obtain our pointwise optimal combining weight by maximizing, for each fixed $s$, the argument of the integral in Corollary 1. That is we choose $\alpha(s)$ to maximize

$$(1-(1-\alpha(s))^2)s^{-1}\,\mathrm{tr}((-JB_1J'+B_2)V) - \alpha^2(s)\,\delta'B_2^{-1}(-JB_1J'+B_2)B_2^{-1}\delta. \tag{20}$$

Differentiating (20) with respect to $\alpha$ we obtain

$$\begin{aligned}
\mathrm{FOC}_\alpha &: \; 2(1-\alpha(s))s^{-1}\,\mathrm{tr}((-JB_1J'+B_2)V) - 2\alpha(s)\,\delta'B_2^{-1}(-JB_1J'+B_2)B_2^{-1}\delta \\
\mathrm{SOC}_\alpha &: \; -2s^{-1}\,\mathrm{tr}((-JB_1J'+B_2)V) - 2\,\delta'B_2^{-1}(-JB_1J'+B_2)B_2^{-1}\delta.
\end{aligned}$$

Setting the FOC to zero and solving for $\alpha(s)$ provides the formula from the Corollary. The SOC is negative at this solution and we obtain the desired result.
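The first-order condition above can be checked numerically. The following is a minimal sketch, assuming the two matrix expressions are collapsed to positive scalars: `trace_term` stands in for tr((−JB_1J′+B_2)V) and `quad_term` for δ′B_2^{-1}(−JB_1J′+B_2)B_2^{-1}δ. The function names and the numeric inputs are illustrative assumptions, not quantities taken from the paper.

```python
def objective(alpha, s, trace_term, quad_term):
    """Integrand (20) for a fixed s, with scalar stand-ins for the matrix terms."""
    return (1.0 - (1.0 - alpha) ** 2) * trace_term / s - alpha ** 2 * quad_term


def first_order_condition(alpha, s, trace_term, quad_term):
    """Derivative of (20) with respect to alpha."""
    return 2.0 * (1.0 - alpha) * trace_term / s - 2.0 * alpha * quad_term


def optimal_alpha(s, trace_term, quad_term):
    """Root of the FOC: (1 - alpha) * trace_term / s = alpha * quad_term."""
    scaled_trace = trace_term / s
    return scaled_trace / (scaled_trace + quad_term)


if __name__ == "__main__":
    # Hypothetical inputs chosen only to exercise the formulas.
    s, trace_term, quad_term = 0.8, 2.0, 0.5
    alpha_star = optimal_alpha(s, trace_term, quad_term)
    print(alpha_star, first_order_condition(alpha_star, s, trace_term, quad_term))
```

Under these assumptions the weight on the restricted forecast moves toward one as `quad_term` shrinks toward zero, and falls as it grows relative to `trace_term / s`.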

References

Atkeson, A., and Ohanian, L.E. (2001), “Are Phillips Curves Useful for Forecasting Inflation?” Quarterly Review, Federal Reserve Bank of Minneapolis, 25, 2-11.

Bates, J.M., and Granger, C.W.J. (1969), “The Combination of Forecasts,” Operations Research Quarterly, 20, 451-468.

Boivin, J., and Ng, S. (2005), “Understanding and Comparing Factor-Based Forecasts,” International Journal of Central Banking, 1, 117-151.

Brave, S., and Fisher, J.D.M. (2004), “In Search of a Robust Inflation Forecast,” Economic Perspectives, Federal Reserve Bank of Chicago, Fourth Quarter, 12-31.

Clark, T.E., and McCracken, M.W. (2005), “Evaluating Direct Multistep Forecasts,” Econometric Reviews, 24, 369-404.

Clark, T.E., and McCracken, M.W. (2006), “The Predictive Content of the Output Gap for Inflation: Resolving In-Sample and Out-of-Sample Evidence,” Journal of Money, Credit, and Banking, 38, 1127-1148.

Clark, T.E., and West, K.D. (2006a), “Using Out-of-Sample Mean Squared Prediction Errors to Test the Martingale Difference Hypothesis,” Journal of Econometrics, 135, 155-186.

Clark, T.E., and West, K.D. (2006b), “Approximately Normal Tests for Equal Predictive Accuracy in Nested Models,” Journal of Econometrics, forthcoming.

Clements, M.P., and Hendry, D.F. (1998), Forecasting Economic Time Series, Cambridge, U.K.: Cambridge University Press.

de Jong, R.M., and Davidson, J. (2000), “The Functional Central Limit Theorem and Weak Convergence to Stochastic Integrals I: Weakly Dependent Processes,” Econometric Theory, 16, 621-642.

Diebold, F.X. (1998), Elements of Forecasting, Cincinnati, OH: South-Western College Publishing.

Doan, T., Litterman, R., and Sims, C. (1984), “Forecasting and Conditional Prediction Using Realistic Prior Distributions,” Econometric Reviews, 3, 1-100.

Elliott, G., and Timmermann, A. (2004), “Optimal Forecast Combinations Under General Loss Functions and Forecast Error Distributions,” Journal of Econometrics, 122, 47-79.

Fernandez, C., Ley, E., and Steel, M.F.J. (2001), “Benchmark Priors for Bayesian Model Averaging,” Journal of Econometrics, 100, 381-427.

Godfrey, L.G., and Orme, C.D. (2004), “Controlling the Finite Sample Significance Levels of Heteroskedasticity-Robust Tests of Several Linear Restrictions on Regression Coefficients,” Economics Letters, 82, 281-287.

Gordon, R.J. (1998), “Foundations of the Goldilocks Economy: Supply Shocks and the Time-Varying NAIRU,” Brookings Papers on Economic Activity (no. 2), 297-346.

Goyal, A., and Welch, I. (2003), “Predicting the Equity Premium with Dividend Ratios,” Management Science, 49, 639-654.

Hansen, B.E. (1992), “Convergence to Stochastic Integrals for Dependent Heterogeneous Processes,” Econometric Theory, 8, 489-500.

Hendry, D.F., and Clements, M.P. (2004), “Pooling of Forecasts,” Econometrics Journal, 7, 1-31.

Jacobson, T., and Karlsson, S. (2004), “Finding Good Predictors for Inflation: A Bayesian Model Averaging Approach,” Journal of Forecasting, 23, 479-496.

Koop, G., and Potter, S. (2004), “Forecasting in Dynamic Factor Models Using Bayesian Model Averaging,” Econometrics Journal, 7, 550-565.

Litterman, R.B. (1986), “Forecasting with Bayesian Vector Autoregressions — Five Years of Experience,” Journal of Business and Economic Statistics, 4, 25-38.

Newey, W.K., and West, K.D. (1987), “A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix,” Econometrica, 55, 703-708.

Orphanides, A., and van Norden, S. (2005), “The Reliability of Inflation Forecasts Based on Output Gap Estimates in Real Time,” Journal of Money, Credit, and Banking, 37, 583-601.

Stock, J.H., and Watson, M.W. (1999), “Forecasting Inflation,” Journal of Monetary Economics, 44, 293-335.

Stock, J.H., and Watson, M.W. (2002), “Macroeconomic Forecasting Using Diffusion Indexes,” Journal of Business and Economic Statistics, 20, 147-162.

Stock, J.H., and Watson, M.W. (2003), “Forecasting Output and Inflation: The Role of Asset Prices,” Journal of Economic Literature, 41, 788-829.

Stock, J.H., and Watson, M.W. (2005), “An Empirical Comparison of Methods for Forecasting Using Many Predictors,” manuscript, Harvard University.

Theil, H. (1971), Principles of Econometrics, New York: John Wiley Press.

Timmermann, A. (2006), “Forecast Combinations,” in Handbook of Forecasting, eds. G. Elliott, C.W.J. Granger, and A. Timmermann, North Holland, 135-196.

Wright, J.H. (2003), “Forecasting U.S. Inflation by Bayesian Model Averaging,” manuscript, Board of Governors of the Federal Reserve System.

Table 1. Monte Carlo Results from Signal = Noise Experiments: Average MSEs
(for restricted model, average MSE; for other forecasts, ratio of average MSE to restricted model’s average MSE)

DGP 1

| method/model | horizon 1, P=1 | horizon 1, P=20 | horizon 1, P=40 | horizon 1, P=80 | horizon 4, P=1 | horizon 4, P=20 | horizon 4, P=40 | horizon 4, P=80 |
|---|---|---|---|---|---|---|---|---|
| restricted | .773 | .775 | .771 | .764 | .818 | .816 | .808 | .796 |
| unrestricted | 1.004 | 1.002 | 1.000 | .998 | 1.029 | 1.011 | 1.006 | .998 |
| opt. combination: known $\alpha^*_t$ | .995 | .994 | .994 | .993 | .995 | .986 | .985 | .982 |
| opt. combination: $\hat{\alpha}^*_t$ | .999 | .998 | .997 | .996 | 1.007 | .996 | .993 | .989 |
| opt. combination: Stein $\hat{\alpha}^*_t$ | .999 | .998 | .998 | .997 | 1.005 | .996 | .994 | .991 |
| simple average | .995 | .994 | .994 | .993 | .992 | .984 | .983 | .981 |

DGP 2

| method/model | horizon 1, P=1 | horizon 1, P=20 | horizon 1, P=40 | horizon 1, P=80 | horizon 4, P=1 | horizon 4, P=20 | horizon 4, P=40 | horizon 4, P=80 |
|---|---|---|---|---|---|---|---|---|
| restricted | .678 | .678 | .677 | .672 | .635 | .632 | .627 | .620 |
| unrestricted | 1.009 | 1.003 | .999 | .993 | 1.004 | 1.004 | .996 | .983 |
| opt. combination: known $\alpha^*_t$ | .984 | .983 | .982 | .980 | .959 | .962 | .960 | .956 |
| opt. combination: $\hat{\alpha}^*_t$ | .992 | .989 | .987 | .984 | .972 | .974 | .971 | .964 |
| opt. combination: Stein $\hat{\alpha}^*_t$ | .993 | .990 | .989 | .987 | .975 | .976 | .973 | .967 |
| simple average | .984 | .982 | .982 | .980 | .958 | .960 | .959 | .956 |

DGP 3

| method/model | horizon 1, P=1 | horizon 1, P=20 | horizon 1, P=40 | horizon 1, P=80 | horizon 4, P=1 | horizon 4, P=20 | horizon 4, P=40 | horizon 4, P=80 |
|---|---|---|---|---|---|---|---|---|
| restricted | .780 | .752 | .747 | .740 | .843 | .822 | .817 | .805 |
| unrestricted | 1.012 | 1.009 | 1.003 | .994 | 1.058 | 1.050 | 1.037 | 1.020 |
| opt. combination: known $\alpha^*_t$ | .974 | .974 | .973 | .970 | .972 | .973 | .971 | .967 |
| opt. combination: $\hat{\alpha}^*_t$ | .983 | .982 | .980 | .976 | .993 | .991 | .987 | .980 |
| opt. combination: Stein $\hat{\alpha}^*_t$ | .985 | .983 | .981 | .978 | .987 | .985 | .983 | .978 |
| simple average | .974 | .974 | .973 | .970 | .974 | .974 | .972 | .967 |

Notes:
1. DGPs 1–3 are defined in, respectively, equations (10), (11), and (12). In all experiments, the $b_{ij}$ coefficients are scaled such that the null and alternative models are (in population) expected to be equally accurate in the first forecast period. For DGP 1, $b_{11}$ = .042. For DGP 2, $b_{11}$ = .026, $b_{21}$ = .100, $b_{22}$ = .037. For DGP 3, $b_{11}$ = .026, $b_{21}$ = .06, $b_{31}$ = .106, $b_{41}$ = .026, $b_{51}$ = .053.
2. The forecasting approaches are defined as follows. The restricted forecast is obtained from OLS estimates of the model omitting x terms (equation (13)). The unrestricted forecast is obtained from OLS estimates of the full model (equation (14)). The opt. combination: known $\alpha^*_t$ forecast is computed as $\alpha^*_t$ × restricted + (1 − $\alpha^*_t$) × unrestricted, with $\alpha^*_t$ computed according to (4), using the known features of the DGP. The opt. combination: $\hat{\alpha}^*_t$ forecast is $\hat{\alpha}^*_t$ × restricted + (1 − $\hat{\alpha}^*_t$) × unrestricted, with $\hat{\alpha}^*_t$ computed according to (8), using moments estimated from the data. The opt. combination: Stein $\hat{\alpha}^*_t$ forecast is $\hat{\alpha}^*_t$ × restricted + (1 − $\hat{\alpha}^*_t$) × unrestricted, with $\hat{\alpha}^*_t$ computed according to (9). Finally, the simple average forecast is .5 × restricted + .5 × unrestricted.
3. P defines the number of observations in the forecast sample. The size of the sample used to generate the first (in time) forecast at horizon τ is 80 − τ + 1 (the estimation sample expands as forecasting moves forward in time).
4. The table entries are based on averages of forecast MSEs across 10,000 Monte Carlo simulations.
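The combination rule in note 2 and the MSE ratios reported in the table can be illustrated with a short sketch. It assumes the forecasts are already available as plain lists and uses a single fixed weight `alpha`; the function names and the toy inputs are hypothetical. Unlike the table, which reports ratios of MSEs averaged across Monte Carlo draws, this computes the ratio within one sample.

```python
def combine(restricted, unrestricted, alpha):
    """alpha * restricted + (1 - alpha) * unrestricted, element by element."""
    return [alpha * r + (1.0 - alpha) * u for r, u in zip(restricted, unrestricted)]


def mse(forecasts, actuals):
    """Mean squared forecast error over the forecast sample."""
    return sum((f - a) ** 2 for f, a in zip(forecasts, actuals)) / len(actuals)


def mse_ratio(forecasts, restricted, actuals):
    """MSE of a forecast relative to the restricted model's MSE."""
    return mse(forecasts, actuals) / mse(restricted, actuals)


if __name__ == "__main__":
    # Toy numbers for illustration only.
    restricted = [1.0, 2.0]
    unrestricted = [3.0, 4.0]
    actuals = [2.0, 3.0]
    simple_average = combine(restricted, unrestricted, 0.5)
    print(simple_average, mse_ratio(simple_average, restricted, actuals))
```

A ratio below one indicates that the forecast has a lower MSE than the restricted benchmark in that sample.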

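Table 2. Monte Carlo Results from Signal > Noise Experiments: Average MSEs
(for restricted model, average MSE; for other forecasts, ratio of average MSE to restricted model’s average MSE)

DGP 1

| method/model | horizon 1, P=1 | horizon 1, P=20 | horizon 1, P=40 | horizon 1, P=80 | horizon 4, P=1 | horizon 4, P=20 | horizon 4, P=40 | horizon 4, P=80 |
|---|---|---|---|---|---|---|---|---|
| restricted | .813 | .813 | .809 | .802 | .958 | .955 | .946 | .932 |
| unrestricted | .955 | .954 | .953 | .950 | .898 | .882 | .878 | .871 |
| opt. combination: known $\alpha^*_t$ | .952 | .952 | .951 | .949 | .889 | .875 | .872 | .867 |
| opt. combination: $\hat{\alpha}^*_t$ | .957 | .955 | .954 | .952 | .895 | .881 | .878 | .872 |
| opt. combination: Stein $\hat{\alpha}^*_t$ | .960 | .958 | .957 | .954 | .903 | .888 | .884 | .877 |
| simple average | .959 | .959 | .958 | .958 | .896 | .890 | .889 | .887 |

DGP 2

| method/model | horizon 1, P=1 | horizon 1, P=20 | horizon 1, P=40 | horizon 1, P=80 | horizon 4, P=1 | horizon 4, P=20 | horizon 4, P=40 | horizon 4, P=80 |
|---|---|---|---|---|---|---|---|---|
| restricted | .811 | .805 | .803 | .798 | .967 | .948 | .940 | .929 |
| unrestricted | .845 | .845 | .841 | .837 | .723 | .729 | .724 | .714 |
| opt. combination: known $\alpha^*_t$ | .839 | .840 | .837 | .834 | .716 | .721 | .718 | .710 |
| opt. combination: $\hat{\alpha}^*_t$ | .843 | .843 | .840 | .836 | .723 | .725 | .722 | .713 |
| opt. combination: Stein $\hat{\alpha}^*_t$ | .847 | .845 | .841 | .837 | .732 | .732 | .728 | .718 |
| simple average | .865 | .866 | .866 | .865 | .765 | .767 | .767 | .764 |

DGP 3

| method/model | horizon 1, P=1 | horizon 1, P=20 | horizon 1, P=40 | horizon 1, P=80 | horizon 4, P=1 | horizon 4, P=20 | horizon 4, P=40 | horizon 4, P=80 |
|---|---|---|---|---|---|---|---|---|
| restricted | .834 | .803 | .798 | .790 | .968 | .942 | .934 | .921 |
| unrestricted | .947 | .944 | .939 | .931 | .971 | .966 | .956 | .940 |
| opt. combination: known $\alpha^*_t$ | .924 | .924 | .922 | .918 | .919 | .920 | .917 | .910 |
| opt. combination: $\hat{\alpha}^*_t$ | .930 | .928 | .926 | .921 | .929 | .928 | .923 | .915 |
| opt. combination: Stein $\hat{\alpha}^*_t$ | .936 | .932 | .929 | .924 | .935 | .932 | .928 | .920 |
| simple average | .928 | .927 | .927 | .925 | .918 | .919 | .917 | .913 |

Notes:
1. DGPs 1–3 are defined in, respectively, equations (10), (11), and (12). For DGP 1, $b_{11}$ = .042. For DGP 2, $b_{11}$ = .07, $b_{21}$ = .27, $b_{22}$ = .10. For DGP 3, $b_{11}$ = .04, $b_{21}$ = .09, $b_{31}$ = .16, $b_{41}$ = .04, $b_{51}$ = .08.
2. See the notes to Table 1.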

Table 3. Signal > Noise Experiments with Small Estimation Sample: Average MSEs
(for restricted model, average MSE; for other forecasts, ratio of average MSE to restricted model’s average MSE)

DGP 1

| method/model | horizon 1, P=1 | horizon 1, P=20 | horizon 1, P=40 | horizon 1, P=80 | horizon 4, P=1 | horizon 4, P=20 | horizon 4, P=40 | horizon 4, P=80 |
|---|---|---|---|---|---|---|---|---|
| restricted | .854 | .853 | .839 | .823 | 1.061 | 1.023 | .998 | .966 |
| unrestricted | .988 | .970 | .964 | .958 | .970 | .938 | .918 | .900 |
| opt. combination: known $\alpha^*_t$ | .971 | .961 | .957 | .954 | .925 | .909 | .897 | .886 |
| opt. combination: $\hat{\alpha}^*_t$ | .978 | .967 | .962 | .958 | .924 | .913 | .902 | .891 |
| opt. combination: Stein $\hat{\alpha}^*_t$ | .980 | .971 | .967 | .962 | .925 | .919 | .909 | .898 |
| simple average | .968 | .962 | .961 | .959 | .908 | .902 | .897 | .894 |

DGP 2

| method/model | horizon 1, P=1 | horizon 1, P=20 | horizon 1, P=40 | horizon 1, P=80 | horizon 4, P=1 | horizon 4, P=20 | horizon 4, P=40 | horizon 4, P=80 |
|---|---|---|---|---|---|---|---|---|
| restricted | .876 | .852 | .837 | .820 | 1.071 | 1.024 | .998 | .970 |
| unrestricted | .908 | .882 | .868 | .855 | .851 | .801 | .775 | .749 |
| opt. combination: known $\alpha^*_t$ | .882 | .865 | .856 | .847 | .807 | .773 | .755 | .736 |
| opt. combination: $\hat{\alpha}^*_t$ | .888 | .870 | .860 | .850 | .804 | .775 | .758 | .741 |
| opt. combination: Stein $\hat{\alpha}^*_t$ | .896 | .878 | .866 | .854 | .819 | .791 | .773 | .751 |
| simple average | .886 | .877 | .873 | .870 | .807 | .789 | .782 | .775 |

DGP 3

| method/model | horizon 1, P=1 | horizon 1, P=20 | horizon 1, P=40 | horizon 1, P=80 | horizon 4, P=1 | horizon 4, P=20 | horizon 4, P=40 | horizon 4, P=80 |
|---|---|---|---|---|---|---|---|---|
| restricted | .841 | .837 | .827 | .812 | 1.022 | 1.009 | .988 | .960 |
| unrestricted | 1.064 | 1.019 | .993 | .967 | 1.146 | 1.087 | 1.047 | 1.002 |
| opt. combination: known $\alpha^*_t$ | .960 | .950 | .941 | .932 | .959 | .952 | .943 | .929 |
| opt. combination: $\hat{\alpha}^*_t$ | .980 | .963 | .951 | .939 | .994 | .976 | .961 | .942 |
| opt. combination: Stein $\hat{\alpha}^*_t$ | .972 | .962 | .953 | .942 | .968 | .962 | .953 | .940 |
| simple average | .957 | .947 | .940 | .934 | .964 | .951 | .941 | .929 |

Notes:
1. DGPs 1–3 are defined in, respectively, equations (10), (11), and (12). For DGP 1, $b_{11}$ = .042. For DGP 2, $b_{11}$ = .07, $b_{21}$ = .27, $b_{22}$ = .10. For DGP 3, $b_{11}$ = .04, $b_{21}$ = .09, $b_{31}$ = .16, $b_{41}$ = .04, $b_{51}$ = .08.
2. P defines the number of observations in the forecast sample. The size of the sample used to generate the first (in time) forecast at horizon τ is 40 − τ + 1 (rather than 80 − τ + 1 as in the baseline experiments).
3. See the notes to Table 1.
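The estimation-sample sizes described in note 2 can be sketched as follows. The sketch assumes that the expanding sample gains exactly one observation per forecast, which is an interpretation of "expands as forecasting moves forward in time" rather than something the notes state; the function name and arguments are hypothetical.

```python
def estimation_sample_sizes(initial_obs, horizon, num_forecasts):
    """Sample size behind each forecast in an expanding (recursive) scheme.

    The first forecast uses initial_obs - horizon + 1 observations; each later
    forecast is assumed to add one observation.
    """
    first = initial_obs - horizon + 1
    return [first + i for i in range(num_forecasts)]


if __name__ == "__main__":
    # Small-sample design (40) versus the baseline design (80), horizon 4.
    print(estimation_sample_sizes(40, 4, 3))
    print(estimation_sample_sizes(80, 4, 3))
```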

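Table 4: Monte Carlo Probabilities of Equaling or Beating Restricted Model’s MSE, DGPs 2 and 3

DGP 2: signal = noise

| method/model | horizon 1, P=1 | horizon 1, P=20 | horizon 1, P=40 | horizon 1, P=80 | horizon 4, P=1 | horizon 4, P=20 | horizon 4, P=40 | horizon 4, P=80 |
|---|---|---|---|---|---|---|---|---|
| unrestricted | .501 | .472 | .483 | .524 | .509 | .495 | .493 | .525 |
| opt. combination: known $\alpha^*_t$ | .521 | .574 | .627 | .698 | .535 | .576 | .600 | .660 |
| opt. combination: $\hat{\alpha}^*_t$ | .515 | .514 | .547 | .613 | .530 | .538 | .554 | .605 |
| opt. combination: Stein $\hat{\alpha}^*_t$ | .642 | .553 | .554 | .592 | .646 | .597 | .579 | .601 |
| simple average | .521 | .581 | .639 | .727 | .539 | .593 | .628 | .706 |

DGP 2: signal > noise

| method/model | horizon 1, P=1 | horizon 1, P=20 | horizon 1, P=40 | horizon 1, P=80 | horizon 4, P=1 | horizon 4, P=20 | horizon 4, P=40 | horizon 4, P=80 |
|---|---|---|---|---|---|---|---|---|
| unrestricted | .557 | .803 | .901 | .977 | .589 | .785 | .869 | .951 |
| opt. combination: known $\alpha^*_t$ | .567 | .834 | .926 | .986 | .600 | .810 | .893 | .963 |
| opt. combination: $\hat{\alpha}^*_t$ | .570 | .843 | .935 | .989 | .609 | .836 | .916 | .975 |
| opt. combination: Stein $\hat{\alpha}^*_t$ | .574 | .849 | .939 | .990 | .618 | .844 | .921 | .977 |
| simple average | .598 | .921 | .980 | .999 | .635 | .900 | .965 | .995 |

DGP 3: signal = noise

| method/model | horizon 1, P=1 | horizon 1, P=20 | horizon 1, P=40 | horizon 1, P=80 | horizon 4, P=1 | horizon 4, P=20 | horizon 4, P=40 | horizon 4, P=80 |
|---|---|---|---|---|---|---|---|---|
| unrestricted | .497 | .461 | .473 | .519 | .481 | .424 | .412 | .417 |
| opt. combination: known $\alpha^*_t$ | .524 | .609 | .652 | .736 | .521 | .564 | .590 | .642 |
| opt. combination: $\hat{\alpha}^*_t$ | .517 | .549 | .584 | .670 | .505 | .506 | .513 | .555 |
| opt. combination: Stein $\hat{\alpha}^*_t$ | .608 | .564 | .583 | .664 | .614 | .554 | .540 | .562 |
| simple average | .524 | .616 | .671 | .770 | .517 | .556 | .587 | .653 |

DGP 3: signal > noise

| method/model | horizon 1, P=1 | horizon 1, P=20 | horizon 1, P=40 | horizon 1, P=80 | horizon 4, P=1 | horizon 4, P=20 | horizon 4, P=40 | horizon 4, P=80 |
|---|---|---|---|---|---|---|---|---|
| unrestricted | .521 | .621 | .684 | .796 | .512 | .543 | .571 | .648 |
| opt. combination: known $\alpha^*_t$ | .541 | .715 | .794 | .899 | .540 | .644 | .696 | .786 |
| opt. combination: $\hat{\alpha}^*_t$ | .539 | .702 | .778 | .890 | .537 | .629 | .677 | .773 |
| opt. combination: Stein $\hat{\alpha}^*_t$ | .568 | .710 | .791 | .900 | .587 | .649 | .688 | .787 |
| simple average | .554 | .779 | .867 | .955 | .552 | .691 | .758 | .861 |

Notes:
1. The table entries are frequencies (percentages of 10,000 Monte Carlo simulations) with which each forecast approach yields a forecast MSE less than or equal to the restricted model’s MSE.
2. See the notes to Tables 1 and 2.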
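The frequencies defined in note 1 amount to counting, across simulation draws, how often a forecast's MSE is no larger than the restricted model's. A minimal sketch, assuming the per-draw MSEs are held in two equal-length lists; the function name and the four-draw example are hypothetical.

```python
def beat_frequency(candidate_mses, restricted_mses):
    """Share of draws in which the candidate MSE is <= the restricted MSE."""
    wins = sum(1 for c, r in zip(candidate_mses, restricted_mses) if c <= r)
    return wins / len(restricted_mses)


if __name__ == "__main__":
    # Four made-up simulation draws; ties count in favor of the candidate.
    candidate = [1.0, 0.9, 1.2, 1.0]
    restricted = [1.0, 1.0, 1.0, 1.1]
    print(beat_frequency(candidate, restricted))
```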

Table 5. Application Results: 1985-2006 Forecasts of Core PCE Inflation
(RMSE for restricted forecast, and MSE ratios for other forecasts)

| method/model | output gap, 1Q | output gap, 1Y | output gap & food-energy inflation, 1Q | output gap & food-energy inflation, 1Y | 1 factor, 1Q | 1 factor, 1Y |
|---|---|---|---|---|---|---|
| restricted | .632 | .516 | .632 | .516 | .632 | .516 |
| unrestricted | .980 | 1.044 | 1.150 | 1.380 | .979 | 1.034 |
| opt. combination: $\hat{\alpha}^*_t$ | .976 | .990 | 1.073 | 1.081 | .977 | .986 |
| opt. combination: Stein $\hat{\alpha}^*_t$ | .976 | .984 | 1.060 | 1.020 | .977 | .978 |
| simple average | .973 | .906 | .983 | .871 | .977 | .950 |
| Ridge regression | .978 | 1.018 | 1.032 | 1.135 | .976 | .976 |
| BMA | .995 | 1.006 | 1.085 | 1.155 | .999 | 1.024 |

| method/model | 2 factors, 1Q | 2 factors, 1Y | 3 factors, 1Q | 3 factors, 1Y | 5 factors, 1Q | 5 factors, 1Y |
|---|---|---|---|---|---|---|
| restricted | .632 | .516 | .632 | .516 | .632 | .516 |
| unrestricted | .965 | .971 | .950 | .879 | 1.136 | 1.001 |
| opt. combination: $\hat{\alpha}^*_t$ | .954 | .879 | .935 | .791 | 1.040 | .834 |
| opt. combination: Stein $\hat{\alpha}^*_t$ | .953 | .866 | .934 | .781 | 1.021 | .807 |
| simple average | .950 | .854 | .936 | .782 | .963 | .794 |
| Ridge regression | .950 | .891 | .933 | .807 | .955 | .815 |
| BMA | .971 | .934 | .954 | .846 | 1.065 | .914 |

Notes:
1. The forecasting models take the forms given in equations (13) and (14). In the first application, the unrestricted model includes just one lag of the output gap, defined as the log ratio of actual GDP to the CBO’s estimate of potential GDP. In the second application, the unrestricted model includes one lag of the output gap and two lags of relative food and energy price inflation, calculated as overall PCE inflation less core PCE inflation. In the remaining applications, the unrestricted model includes one lag of common business cycle factors (with the number of factors varying from 1 to 5 across applications) estimated as in Stock and Watson (2005).
2. The first six forecast approaches are defined in the notes to Table 1. The BMA forecast is a Bayesian average of the forecasts from the restricted and unrestricted models, implemented with the averaging approach recommended by Fernandez, Ley, and Steel (2001), with the difference that these results are based on a g-prior coefficient setting of 1/5. The Ridge regression forecast is obtained from a generalized ridge estimator which shrinks the $\beta_{22}$ coefficients (but not the other coefficients) of the unrestricted model toward 0 based on conventional Minnesota prior settings described in section 4.2.

Cite this document
APA
Todd E. Clark and Michael W. McCracken (2007). Combining Forecasts From Nested Models (FEDS 2007-43). Board of Governors of the Federal Reserve System, Finance and Economics Discussion Series. https://whenthefedspeaks.com/doc/feds_2007-43
BibTeX
@techreport{wtfs_feds_2007_43,
  author = {Todd E. Clark and Michael W. McCracken},
  title = {Combining Forecasts From Nested Models},
  type = {Finance and Economics Discussion Series},
  number = {2007-43},
  institution = {Board of Governors of the Federal Reserve System},
  year = {2007},
  url = {https://whenthefedspeaks.com/doc/feds_2007-43},
  abstract = {Motivated by the common finding that linear autoregressive models forecast better than models that incorporate additional information, this paper presents analytical, Monte Carlo, and empirical evidence on the effectiveness of combining forecasts from nested models. In our analytics, the unrestricted model is true, but as the sample size grows, the data generating process converges to the restricted model. This approach captures the practical reality that the predictive content of variables of interest is often low. We derive MSE-minimizing weights for combining the restricted and unrestricted forecasts. Monte Carlo and empirical analyses verify the practical effectiveness of our combination approach.},
}