Forecasting with Sufficient Dimension Reductions
Abstract
Factor models have been successfully employed in summarizing large datasets with few underlying latent factors and in building time series forecasting models for economic variables. When the objective is to forecast a target variable y with a large set of predictors x, the construction of the summary of the xs should be driven by how informative on y it is. Most existing methods first reduce the predictors and then forecast y in independent phases of the modeling process. In this paper we present an alternative and potentially more attractive alternative: summarizing x as it relates to y, so that all the information in the conditional distribution of y|x is preserved. These y-targeted reductions of the predictors are obtained using Sufficient Dimension Reduction techniques. We show in simulations and real data analysis that forecasting models based on sufficient reductions have the potential of significantly improved performance.
Finance and Economics Discussion Series Divisions of Research & Statistics and Monetary Affairs Federal Reserve Board, Washington, D.C. Forecasting with Sufficient Dimension Reductions Alessandro Barbarino and Efstathia Bura 2015-074 Please cite this paper as: Barbarino, Alessandro, and Efstathia Bura (2015). “Forecasting with Sufficient Dimension Reductions,” Finance and Economics Discussion Series 2015-074. Washington: Board of Governors of the Federal Reserve System, http://dx.doi.org/10.17016/FEDS.2015.074. NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. References in publications to the Finance and Economics Discussion Series (other than acknowledgement) should be cleared with the author(s) to protect the tentative character of these papers.
Forecasting with Su¢ cient Dimension Reductions Alessandro Barbarino and Efstathia Bura (cid:3) y September 14, 2015 Abstract Factor models have been successfully employed in summarizing large datasets with few underlying latent factors and in building time series forecasting models for economic variables. When the objective is to forecast a target variable y with a large set of predictors x, the constructionofthesummaryofthexsshouldbedrivenbyhowinformativeony itis. Mostexisting methods (cid:133)rst reduce the predictors and then forecast y in independent phases of the modeling process. In this paper we present an alternative and potentially more attractive alternative: summarizing x as it relates to y, so that all the information in the conditional distribution of y x is preserved. These y-targeted reductions of the predictors are obtained using Su¢ cient Dij mension Reduction techniques. We show in simulations and real data analysis that forecasting models based on su¢ cient reductions have the potential of signi(cid:133)cantly improved performance. JEL Classification Number: C32, C53, C55, E17 Keywords: Forecasting, Factor Models, Principal Components, Partial Least Squares, Dimension Reduction, Di⁄usion Index. 1 Introduction Methods able to identify and estimate a condensed latent structure that summarizes a large set of variables with few (cid:147)factors(cid:148)attracted attention early on in Statistics (Hotelling, 1933 [40]). The Economics and Econometrics literature caught on with investigations ranging from the estimation of the underlying factors that drive the economy, as in Geweke (1977) [34] and Sargent and Sims (1977) [51], to the description of asset prices as in Chamberlain and Rothschild (1983) [16], to applications in labor markets as in Engle and Watson (1981) [31]. In this body of work, the common thread is the usage of factor models and the focus is on identifying the common latent sources of correlation of a set of given variables without explicit reference to a target variable. InpartduetotheavailabilityofricherdatasetsandalsototheseminalworkofStockandWatson (2002a) [54], factor models under the form of Dynamic Factor Models (DFM) have resurfaced in the Econometric literature in the past 15 years and are now a standard tool for both measuring comovement and forecasting time series. In contrast to the early applications of factor models and their use to measure comovement, a distinctive feature of applying DFM in forecasting is the inherent targeted nature of the process. Departing from being simply a tool to identify and estimate a latent structure in multivariate data, DFMs aim to 1) reduce the dimension of a large (cid:3)Alessandro Barbarino, Research and Statistics, Federal Reserve Board. Address: 20th St. & C St. NW Washington, DC 20551 USA. Email: alessandro.barbarino@frb.gov yEfstathia Bura, Department of Statistics, George Washington University. Address: 801 22nd St. NW Washington, DC 20052 USA. Email: ebura@gwu.edu 1
panel of data to a su¢ ciently (cid:147)lean(cid:148)factor structure and 2) exploit such structure to forecast a target variable y. Targeting comes into the picture only after a condensed latent structure is estimated and is resolved by postulating a linear relationship between the target variable y and the factors. The reduction step has so far been largely disconnected from the targeting step, likely a legacy of the origin of factor models. Su¢ cient Dimension Reduction (SDR), a parallel, yet so far unrelated, collection of novel tools for reducing the dimension of multivariate data in regression problems without losing inferential information on the distribution of a target variable y, has emerged over the last twenty (cid:133)ve years in Statistics. SDR focuses on (cid:133)nding su¢ cient (in a statistical sense) reductions of a potentially large set of explanatory variables with the aim of modeling a given target response y. In SDR, the reduction and the targeting are obtained simultaneously by exploiting the concept of statistical su¢ ciency. SDR aims to identify and estimate a su¢ cient function of the regressors, R(x), that preserves the information in the conditional distribution of y x. SDR also o⁄ers a powerful toolbox j to analyze the link between the target y and the panel of regressors x in contrast to the potentially unjusti(cid:133)ed assumption that y depends linearly on some factors as in the DFM literature. Since SDR methods preserve the information of the conditional distribution of y x it should j prove superior to current practice in producing density forecasts although we do not explore this aspect in this paper. SDR is not the only approach that allows for simultaneous reduction and targeting, and the potential gains that can be obtained from linking the two modeling steps have been already acknowledged in the Econometrics literature. Bai and Ng (2008) [4] and De Mol, Giannone and Reichlin (2008) [25] explore the e⁄ectiveness of RIDGE regression, a penalty based regression, and other shrinkage estimators including the LASSO and execute their papers on the canvas laid out by Stock and Watson (2002b and 2002b) [54] [55]. In more recent contributions, Kelly and Pruit (2014)[43] and Groen and Kapetanios (2014) [35] explore variants of partial least squares (PLS) in order to obtain simultaneous reduction and targeting. In contrast to the innovative (in the Econometrics literature) statistical learning tools explored by these authors, SDR methods ensure the preservation of all the statistical information contained in the data as encapsulated in the conditional distribution of y x. SDR methods also allow to selectively preserve targeted statistical j information regarding the conditional distribution of y x, such as the conditional mean, the conj ditional variance, or both, and clearly specify the assumptions required for the proper extraction of such information. In contrast, RIDGE regression is constrained by the forecasting functional form of the model, whereas PLS builds predictors that are little understood and their reliability ultimately depends on the particular application at hand using, quite arbitrarily, the covariance of the response and the original predictors. Although the SDR methodology o⁄ers a potentially powerful source of new methods for the econometrician, several hurdles need be overcome. SDR has been developed and tested with statistical modeling in mind and its e⁄ectiveness has not been tested and proven in macro-forecasting. Most importantly, SDR has been developed for cross-sectional applications and adaptations are necessary for a successful application in forecasting and econometrics, analogously to the contributions of Stock and Watson (2002a and 2012b)[54] [55], Bai and Ng (2003) [2] and Forni, Hallin, Lippi and Reichlin (2005) [32] that enabled the application of principal components to a time series setting. Thispapertakesa(cid:133)rststabatintroducingSDRtechniquestomacro-forecastingbyestablishing a connection between the SDR and DFM bodies of literature, by extending some key SDR results to a time series setting and by o⁄ering a (cid:133)rst assessment of the e⁄ectiveness of SDR methods in a real-world forecasting application. In order to draw analogies and highlight di⁄erences with results 2
intheDFMliteratureweconductourreal-worldforecastingexperimentwithalargepanelofmacro variables as in Stock and Watson (2002a) [54] and Stock and Watson (2002b) [55] although our data source is the novel repository FRED-MD maintained by the St. Louis Fed and documented by McCracken and Ng (2015) [49]. The task of conducting extensive Monte-Carlo simulations drawing fromStockandWatson(2002a)[54]andDozetal. (2012)[28]inordertocomparetheperformance of competing methods is deferred to a companion paper (see Barbarino and Bura (2015) [7]). InSection2welistthechallengesofmacro-forecastinginadatarichenvironmentanddescribing the speci(cid:133)c solution adopted in the DFM framework. We next propose an alternative forecasting frameworkbasedonSDRmethodsandcontrastitwiththeDFMframeworkprovidingaconnection between the two. In Section 3 we review shrinkage estimators that have been proposed within the DFM literature and that are tested in the empirical application. Section 4 contains the conceptual motivation for targeted reductions. Section 5 is a detailed exposition of the principles of SDR and our proposal for an SDR-based forecasting framework, including extensions to a time-series setting of sliced inverse regression (SIR), the SDR method we choose to present and apply in the empirical Section. Finally, as in Stock and Watson (2002a and 2002b) [54] [55] Section 6 contains the description and results of a horserace between the estimators reviewed in the paper on a large panel of macro variables in which the focus is on forecasting accuracy in predicting in(cid:135)ation and industrial production in an out-of-sample forecasting experiment. We (cid:133)nd that SIR has similar or superior performance to PCR and always superior to PLS. The last Section concludes. Coverage of likelihood-based SDR methods is outside the scope of this exploratory paper and deferred to future work. 2 Forecasting with a Large Set of Explanatory Variables A large set of p explanatory variables x is available to forecast a single variable y using a t t (p 1) (1 1) (cid:2) (cid:2) sample of size T. The most immediate approach to the problem, as described, for instance, in the survey of Stock and Watson (2006) [57], is to consider all regressors to be potentially useful in modeling y and use OLS to estimate the model t y = (cid:12) x +(cid:13) w +" (2.1) t+h 0 t 0 t t+h where w may contain additional regressors such as lags of y 1. Notice that although in this sett t up all regressors are potentially useful in forecasting, they enter the model through just one linear combinationnamely(cid:12) x +(cid:13) w . Morethanonelinearcombinationsorprojectionsoftheregressors 0 t 0 t may instead be necessary to model y in order to preserve all the information that the covariates t x and w carry about y . We will revisit this point later on in the paper. t t t+h Estimation in (2.1) via OLS can be problematic when p is large relative to T, or variables in x t are nearly collinear, as is the case in the macro forecasting literature (see, e.g., Stock and Watson (2006) [57]). The variance of the prediction is of order p=T; so when p is large other estimation methods may dominate OLS, even under assumptions that guarantee that the OLS estimator is unbiased. Moreover, when p > 3, the OLS estimator is not even admissible under the mean square error (MSE) criterion,2 a striking result by James and Stein (1961) that inaugurated research in shrinkage estimation. 1TheassumedlinearityofE(y t+h )inthelinearcombinations(cid:12)0 x t and(cid:13)0 w t andthepresumedlackofdependence ofthelatterfromtheerror" areimportantassumptionsthatleadtoordinaryleastsquares(OLS)astheestimator t+h of choice for the parameters (cid:12) and (cid:13). The presence of the lags of y is meant to capture the dynamics in y and to t t avoid that such dynamics might impart non-orthogonality between the error and the regressors. 2AlthoughamongunbiasedestimatorstheOLSestimator(cid:18) hasminimummeansquarederror,otherestimators OLS b 3
Whenthenumberofpredictorsexceedsthenumberofobservations(p > T),theOLSparameter estimates are not identi(cid:133)able and multiple estimators of the parameter vector are solutions to the OLS problem. In this case, di⁄erent shrinkage estimators, such as RIDGE and LASSO, can be viewed as speci(cid:133)c criteria-based choices in the space of OLS-consistent solutions. Furthermore, for typical datasets encountered in macro and (cid:133)nance forecasting, even if p << T, collinearity or ill-conditioning of the X matrix, which collects all the observations on the vector t (T p) (cid:2) x , makes OLS predictions very unstable. This is likely to occur if the set of explanatory variables t contains an index and several sub-indexes, e.g. industrial production (IP) along with its subindexes such as manufacturing IP or mining IP, or when variables are linked by identities or tight relationships, such as the inclusion of assets linked by arbitrage conditions. 2.1 The DFM Forecasting Framework Pioneering results in Stock and Watson in a series of papers (1999, 2002a, 2002b) [53][54][55], relaunched the idea of working around the impossibility and/or lack of desirability of using OLS for estimation when p is large by: (i) Positing that the set of explanatory variables x is, up to idiosyncratic noise, driven by a t small r < p set of latent factors f with t (r 1) (cid:2) x = (cid:3)f +u (2.2) t t t where f and u are independent although u can be serially and cross-sectionally correlated. t t t This setup generalizes the case of a classic factor structure. (ii) Assuming that the forecasting model is y = (cid:21)f +(cid:13) w +" (2.3) t+h 0 t 0 t t+h The primary goal of considering the factors f is to introduce structure that reduces the hight dimensionality of the problem. The linear factor structure implies that although p is potentially large, the information content of the regressors is drastically reduced to r. However the factors are latent variables hence they need to be estimated. The complexity induced by the high dimensionality of the problem is traded o⁄ with the need of estimating additional (cid:133)ctional unobserved variables f . Stock and Watson [54] provide conditions under which the factors are estimated via t Principal Component Analysis (PCA) of the variance-covariance matrix of the observed predictors. In fact, most estimators of the factors proposed in the literature turn out to be linear functions of the explanatory variables, vx , where the matrix v of coe¢ cients or weights corresponds to the 0 t particular method, e.g. PCR, RIDGE or PLS. The rank of the column space of v is the dimension of the problem detected by that particular estimation method. In this sense there is no unique dimension of the information contained in the explanatory variables. Rather the dimension of the problem is estimator-dependent. When the factors are estimated via PCA, as in Stock and Watson (2002a) [54], v is a matrix whose columns are the leading principal components of (cid:6) = (X X(cid:22) )(X X(cid:22) )=T. If the T T 0 T T (cid:0) (cid:0) (cid:18), some also linear, have uniformly smaller mean square error, by trading o⁄bias for variance : b MSE (cid:18);(cid:18) MSE (cid:18) OLS ;(cid:18) (cid:20) for all (cid:18) with strict inequality for some (cid:18). Hence(cid:16)the O(cid:17)LS estima(cid:16)tor is no(cid:17)t admissible under the MSE criterion. b b 4
dynamic principal components of Forni, Lippi Hallin and Reichlin (2005) [32] are used, v also captures a summary of the dynamics in x as it contains the eigenvectors of the autocovariances t (cid:6)(k) = (X X(cid:22) )(X X(cid:22) )=(T k). In (Quasi-) Maximum Likelihoods methods as in Doz T k T k 0 T T (cid:0) (cid:0) (cid:0) (cid:0) (cid:0) etal. (2012)[28]orBanburaandModugno(2014)[6], thealgorithmusedtocomputethelikelihood heavily exploits the linearity of the underlying system and v is derived from the Kalman Filter. A main theoretical objective in the DFM literature has been to show that asymptotically, for both T and p ; the chosen v, is such that vx consistently estimates the factors. 0 t ! 1 The forecasting equation (2.3) is secondary since attention is shifted to reducing the dimension of the set of explanatory variables, a process assumed to unveil the data generating process driving x . The practical goal of forecasting the target y via link equation (2.3) relies on the assumption t t that the same factors that determine the marginal distribution of x also determine the conditional t distribution of y and in the same functional form; that is, linearly. t The assumption of a factor structure and the ancillary role of the forecasting equation in Stock and Watson (2002a) [54] was adopted in the ensuing literature. For instance, Doz et al. (2012) [28] focus on the performance of di⁄erent estimators in identifying the factors and show no interest in the forecasting accuracy of their estimators. The assumption that the underlying DGP has a linear factor structure, while convenient, imposes restrictions on the conditional distribution of y given x, which are di¢ cult to pin down. In a follow-up to this paper (Barbarino and Bura (2015) [7]), we show by means of Monte-Carlo simulations and formally by exploiting recent results in Leeb (2013) [44] and, more importantly, the extension in Steinberger and Leeb (2015) [52] that a linear factor structure underlying the generation of both y and x coupled with the assumption of joint normality of the factors implies that a linear model is the correct speci(cid:133)cation for the conditional mean of y given x, which is a rather restrictive model. The formal result is stated in the following proposition. Proposition 1 Let y denote a response random variable and x a random p-vector of explanatory variables. Suppose that the response can be written as y = (cid:11) 0 f where (cid:11) R r is unknown and that 2 the set of explanatory variables can be written as x = (cid:12)f where (cid:12) is such that: (p r) (cid:2) (i) for each (cid:11), the conditional mean of (cid:11) 0 f given (cid:12) 0 f = x is linear in x R p; 2 (ii) for each (cid:11) the conditional variance of (cid:11) 0 f given (cid:12) 0 f = x is constant in x R p. 2 Then, y can be decomposed into the sum of a linear function of x and a remainder or error term, as follows, y = cx+u (2.4) 0 where c = (cid:12)(cid:11) R p is an unknown parameter, E(u x) = 0 and var(u x) is constant. 2 j j Proof. See Barbarino and Bura (2015) [7]. A key point to note is that conditions (i) and (ii) in Proposition 1 are satis(cid:133)ed for any (cid:12), if f is normally distributed, the assumption made in practice in the DFM literature especially likelihood-based but also non-parametric. Moreover, the implied model (2.4) is a standard linear model with the error term uncorrelated with the predictors so that the solution to the normal equations in the population is the OLS, c = (xx) 1xy. For a random sample, regressing OLS 0 (cid:0) 0 y on X using OLS yields c^ = (XX) 1XY; which is optimal, in terms of statistical OLS 0 (cid:0) 0 (T 1) (T p) (cid:2) (cid:2) 5
e¢ ciency, provided the sample size is larger than the number of predictors, i.e. p < T. Therefore the DFM model assumptions (2.2) and (2.3) imply that the forecasting model is approximately linear in the explanatory variables with errors uncorrelated with the predictors. As a consequence it should not come as a surprise that shrinkage methods such as RIDGE perform well in our simulations. Although ([52]) is valid when (y ;x ) is a random sample for t = 1;:::;T, on the t t basis of our simulations, we conjecture that the results in Proposition 1 are approximately true in a factor model, under the set of assumptions commonly made in the DFM literature where both the response and the predictors are potentially autocorrelated time series. 2.2 The SDR Forecasting Framework IncontrasttoDFM,su¢ cientdimensionreduction(SDR)methodsdepartfromthepervasivelinear factor assumption. As we show in greater detail in a companion paper (Barbarino and Bura (2015) [7]) and summarize in the following paragraph, SDR methods thrive when the relationship between the response and the predictors contains non-linearities whereas they do not have comparative advantage when a linear factor structure is the true DGP. SDR works directly with observables and sidestep the assumption of a factor structure. Thus, instead of imposing an arti(cid:133)cial latent factor structure on the panel x , SDR methods directly t seek to identify how many and which functions of the explanatory variables are needed to fully describe the conditional distribution function F (y x ), or its features, such as the conditional t+h t j mean. SDR aims to identify and estimate functions of the predictors, R(x ); that are called t reductions because they preserve all the information that x carries about y in the sense that t t+h F (y x ) = F (y R(x )). Obviously only if such functions are fewer than p do they represent t+h t t+h t j j proper reductions. The reductions can be either linear or nonlinear functions. In this paper we focus on momentbased and linear SDR methods that obtain linear reductions in order to draw a more pertinent comparison with the DFM literature. Linear SDR methods lay down conditions under which it is possible to identify the number of and the linear combinations of the explanatory variables needed to (cid:147)adequately(cid:148)model y . They also provide estimation algorithms. More formally, linear SDR t methods provide the means to estimate a matrix v : p d, 0 d p, such that R(x ) = vx . t 0 t (cid:2) (cid:20) (cid:20) Our proposed SDR-based forecasting framework is based on the following two conditions: (i) The linearity condition E x vx = Avx (2.5) t 0 t 0 t j for any invertible matrix A and a p (cid:2)d full ra(cid:3)nk matrix v with 0 d p3 (cid:2) (cid:20) (cid:20) (ii) The forward model y = g(vx ;" ) (2.6) t+h 0 t t+h Moment-based SDR methods place conditions on the marginal distribution of x , such as the t linearity assumption (2.5), which is a (cid:133)rst moment condition and analogous to the linear factor structure (2.2) of DFM.4 However, in contrast with the DFM setup, no dependence on underlying factors is postulated. 3The rank of v is the structural dimension of the regression and the case d=0 signi(cid:133)es that y is independent t+h of x . t 4Within the SDR literature the term (cid:147)moment-based(cid:148)catalogues estimators conceptually distinct from SDR (cid:147)likelihood-based(cid:148)methods that require assumptions on the distribution of x y . Likelihood-based SDR methods t t+h j canbecomparedtolikelihoodbasedestimationmethodsforDFMhoweverwedonotpursuesuchcomparisoninthis paper for clarity purposes. 6
The second equation (2.6) speci(cid:133)es the forward model and is analogous to the link equation in DFM, although the SDR framework allows more (cid:135)exibility admitting a general g( ) instead (cid:1) of a linear function. Linear SDR methods are powerful tools that can determine the number of linear combinations of the explanatory variables x needed to model the response y and provide t t consistent estimators without the need to specify the functional form of the forward model; that is, without specifying the exact relationship between y and vx . Linear SDR replaces a large t 0 t number of explanatory variables by a few of their linear combinations without loss of inferential information; their number d(= rank(v)) indicates the dimension of the regression problem. In our experience, fewer than the number of PCs are needed in order to generate a comparable MSFE, which is expected as the SDR estimator is targeted to y. As a result, the forecaster can concentrate on the estimation of g( ) with the option of also using non-parametric regression since the number (cid:1) of predictors is signi(cid:133)cantly reduced. How restrictive is the linearity condition? (cid:150)The linearity condition is satis(cid:133)ed for any v by any elliptically contoured vector of predictors. Importantly, Hall and Li (1993) [36] showed that, as the cross-section gets large and p , such condition is ever more likely to be satis(cid:133)ed provided ! 1 that the dimension of the problem d remains small relative to p. More recently, Steinberger and Leeb (2015) [52] showed that as d=logp becomes smaller, the discrepancy between E[x vx ] and t 0 t j Avx in (2.5) goes to zero and the linearity condition holds approximately. 0 t Comparison of SDR and DFM Assumptions (cid:150)As outlined in Proposition 1, the assumption of a linear factorial structure along with normality imply that a linear model is the correctly speci(cid:133)ed model. By contrast moment-based SDR requires the linearity condition (2.5) hold for the linear projections vx that satisfy the general forward regression model F(y x) = F(y vx). Therefore, 0 0 j j DFM assumes more restrictive conditions than SDR on the marginal distribution of x. Dimension of the Regression (cid:150)The SDR framework allows for more general models to describe the relationship between y and x. As a consequence, for example, the (cid:133)nding that on the same dataset four PCA-estimated factors produce the same MSFE as two SDR predictors is not contradictory. The PC-based DFM framework ignores non-linearities in the DGP so that a larger number of factors is required to approximate non-linearities in an incorrectly speci(cid:133)ed forward regression. This is analogous to the e⁄ect of dynamic misspeci(cid:133)cation as shown in Bai and Ng (2007) [3] where one needs a larger set of static factors in order to approximate a given set of (true) dynamic factors. In their setting a larger set of static factors approximate a polynomial in the lag operator. In our setting, a larger number of static factors is needed to approximate non-linearities which are captured by SDR methods. Remark 1 When SDR (cid:133)nds that two or more linear combinations are needed (d 2) in the (cid:21) forward model (2.6) it means that the forward regression function contains non-linear functions of the linear combinations vx . Thus, a correct speci(cid:133)cation of the forward regression entails 0 t non-linearities that the DFM framework misses. Robustness to Model Misspeci(cid:133)cation (cid:150)SDR techniques frequently yield very few derived predictors, which allows non-parametric estimation of the relationship between the response y and t the (linear combinations of) the explanatory variables x . As a result, the prohibitive curse of t dimensionality problem in non-parametric regression is circumvented . Issues in Large p Problems (cid:150)SDRtechniquesaredataintensiverelativetoshrinkageestimators such as Principal Components (PC), RIDGE or Partial Least Squares (PLS). As we will see, when the number of predictors p is larger than the sample size, direct application of SIR is not possible 7
since rank((cid:6)) min(T 1;p) and (cid:6) is not invertible. Moreover, when T and p are of the (cid:20) (cid:0) same order, or when the columns of X are highly correlated, (cid:6) is ill-conditioned and its inverse is numerically unstable resulting in non-robust estimation. In this sense SDR methods su⁄er from the same limitations as OLS. Therefore we also evaluate a regularized version of our estimators, following Chiaromonte and Martinelli (2002) [22] and Li and Li (2004) [46] who used principal components as an intermediate step in order to eliminate PC directions in which the random vector x is degenerate. Bernard-Michel et al. (2011) show that preprocessing the data with PC in order to eliminate degenerate projections of x and then applying SIR is a special case of their Gaussian Regularized SIR, where a Gaussian prior is introduced on the unknown parameters of the inverse regression. In a companion paper (Barbarino and Bura (2015) [7]), we show that even though regularization can be a useful approach in large p settings, a more appealing work-around to the ill-conditioning or non-invertibility of (cid:6) is the extension of SDR methods using Krylov subspaces. Di¢ culties in Estimating the Forward Model (cid:150)SDR approaches obtain optimal results in an interactive modeling setting. That is, the sequence of modeling steps in SDR is to (1) reduce and estimate the d(< p) SDR-derived predictors, and (2) obtain visual assessments via scatterplots of the response versus the SDR-predictors to form a forward regression model g( ) that best describes (cid:1) the data at hand. This process cannot be carried out in an automatic fashion so that g( ) be (cid:1) recursively estimated prior to the computation of the out-of-sample forecast. Instead, we simply use the linear SDR predictors, which are linear combinations of x ; with no further transformations t as input to a linear forward model for the response. As a result, when the estimated dimension of the problem turns out to be greater than one, the forecasting horserace will penalize the SDR estimator by means of misspeci(cid:133)cation of g( ). Such di¢ culty is not faced by the applied forecaster (cid:1) as they only need SDR predictors for one-shot forecast (even when repeated over time) in which case they can easily evaluate the presence of non-linearities and decide on the appropriate modeling of g( ). (cid:1) 3 Dimension Reduction via Linear Combinations Several estimators in the literature form linear combinations of the explanatory variables vx as a 0 t data reduction step before (cid:133)tting the model used for prediction. In this Section, we provide a brief review of some such widely used estimators that are also used in the empirical Section. We start with OLS, move on to principal component regression (PCR), which is the prevalent method in dynamic factor analysis, and RIDGE regression. The last leg of our tour provides the fundamentals of the partial least squares (PLS) algorithm. We cast these methods in a shared framework of minimization of an objective function, which is what distinguishes individual methods, and discuss how the resulting estimators exploit the eigen-structure of the data matrix X. A more in-depth treatment under a common unifying framework is discussed in Barbarino and Bura (2015) [7]. To motivate the discussion about the features and relative drawbacks and advantages of di⁄erent methods, we start o⁄from a simple data generating model, the multivariate normal distribution. The Normal Data Generating Process (cid:150)The simplest DGP that implies 1-dimensional linear reduction is a Normal DGP where the predictors and the response are jointly normal: x (cid:22) (cid:6) (cid:27) N x ; xy (cid:18) y (cid:19) (cid:24) (cid:18)(cid:18) (cid:22) y (cid:19) (cid:18) (cid:27) 0xy (cid:27)2 y (cid:19)(cid:19) Under this assumption the best predictor under quadratic loss is the linear regression function, E(y x) = (cid:22) +(cid:12) (x (cid:22) ) y 0 x j (cid:0) 8
with (cid:12) = (cid:6) 1(cid:27) . Therefore, in a Normal DGP the relationship between x and y is entirely (cid:0) xy and exhaustively encapsulated in one linear combination of the predictors. We note that even a small departure from this model may result in linear reductions of the predictors no longer being exhaustive (see Bura and Forzani (2015) [14]). Importantly, as shown in Proposition 1, the same DGP is implied by a normal and linear factor structure. We consider various competing ways of estimating the population parameter (cid:12) next. Ordinary Least Squares (cid:150)The OLS coe¢ cient is the solution to the following maximization problem: maxCorr2 y;x(cid:12) (3.1) 0 (cid:12) f g (cid:0) (cid:1) OLSselectsoneandonlyone linearcombinationwiththepropertythatitmaximizesthecorrelation between the response and x(cid:12). Assuming that (cid:6) is full rank, the unique solution to (3.1) is 0 (cid:12) = (cid:6) 1(cid:27) (3.2) OLS (cid:0) xy and the OLS prediction of the response y at an observed x is t0 t0 y = x (cid:12) OLS 0t0 OLS Principal Component Regression (PCR) (cid:150)PCR operates in two steps. First, the linear combinations that maximize the variance of x and that are mutually orthogonal are the solution to the following maximization problem max Var(x(cid:12) ) (3.3) 0 k ck f g (cid:12)0k (cid:12) k =1 k 1 f (cid:12)0k (cid:6)(cid:12) i =0 gi= (cid:0) 1 Amaximumofpsuchlinearcombinations,calledprincipalcomponents,canbeextracted. Secondly, y is regressed on the (cid:133)rst M p PCs. The number of PCs, M, is a meta parameter chosen by the (cid:20) user. The solution to (3.3) is (cid:12) (M) = (cid:6) (M)(cid:27) (3.4) PCR (cid:0)PCR xy If M = p, then (cid:12) (M) = (cid:12) . The pseudo-inverse (cid:6) (M) used to compute the solution, PCR OLS (cid:0)PCR which can be shown to be a Moore-Penrose inverse, will depend on M. Notice that (cid:6) (M) is (cid:0)PCR an estimate of a truncated (cid:6) 1 by retaining only the (cid:133)rst M components that explain a speci(cid:133)ed (cid:0) amount of variance in the predictors. Therefore, PCA reduces the dimension of the input predictor vectorxby(cid:133)ndingorthogonallinearcombinationsthatmaximizevar(ax). Suchoperationentirely 0 disregardstheresponsey,sothatalthoughthexprincipalcomponentscontainasmuchinformation as the entire predictor vector, they are ordered in relevance to x and not to y. Targeting enters only in the second stage through the term (cid:27) in (3.4). The PCR prediction is xy y = x (cid:12) PCR 0t0 PCR RIDGE Regression (cid:150)RIDGE regression has been reviewed in the macro-forecasting literature by De Mol et al. (2008) [25] in connection to Bayesian regression. The RIDGE estimator picks one and only one linear combination of the data as it can be shown that RIDGE regression minimizes the least squares criterion on the sphere with radius a: 9
p maxCorr2 y;x(cid:12) subject to (cid:12)2 a (3.5) 0 j (cid:12) (cid:20) f g j=1 (cid:0) (cid:1) X The solution to (3.6) is (cid:12) ((cid:21)) = ((cid:6)+(cid:21)I) 1(cid:27) (3.6) RR (cid:0) xy where (cid:21) denotes the Lagrange multiplier in the constrained maximization (3.5) and is a function of a. It is also a meta parameter playing a role similar to M in PCR. When (cid:21) = 0 the maximization is thesameasinOLS.As(cid:21) > 0;thesolutiondeviatesfromtheOLSsolutionanditcanbeshownthat thepenalizationworkstoshrinkmorethosedirectionswithsmallestvariancealthoughRIDGEdoes not truncate directions in such draconian way as PCR. The relationship with the OLS coe¢ cient (3.2) is (cid:12) RR ((cid:21)) = I+(cid:21)(cid:6) (cid:0) 1 (cid:0) 1 (cid:6) (cid:0) 1(cid:27) xy = I+(cid:21)(cid:6) (cid:0) 1 (cid:0) 1 (cid:12) OLS The RIDGE prediction at x t0 is(cid:0) (cid:1) (cid:0) (cid:1) y = x (cid:12) ((cid:21)) RR 0t0 RR Partial Least Squares (cid:150)PLS, an increasingly popular method of dimension reduction, followed a peculiar trajectory in Econometrics. Originally developed by H. Wold [63] in the mid 70s, it did not gain much traction in Econometrics and swiftly fell into oblivion. By contrast, it garnered a lot of attention in Chemometrics, a (cid:133)eld that produced a large volume of PLS studies in the late 80s and early 90s (see, for example, the instructive work of Helland (1988) [38]). Only recently has the method resurfaced in Econometrics with the work of Kelly and Pruitt (2014) [43] and Groen and Kapetanios [35] within macro-forecasting applications as PLS handles well cases in which p > T. The PLS algorithm induces a simultaneous bi-linear decomposition of both the target variable and the panel of regressors5. That is, factors f and loadings q and p are generated i i i (T 1) (p 1) (1 1) (cid:2) (cid:2) (cid:2) at each step so that the factors are orthogonal and the following decompositions are carried out concurrently x = q f +q f +:::+q f +E 10 1 20 2 u0 u u y = p f +p f +:::+p f +e 01 1 02 2 0u u u Thesu¢ xudenotesthelaststepoftheprocedure. Thealgorithmalwaysconvergesinthesense that after p steps the factors will be identically zero. Using the recursive formulas for the factors and the loadings, one can show that PLS prediction admits a linear form similar to the predictions of the other estimators at x : t0 y = x (cid:12) (u) PLS 0t0 PLS where 1 (cid:12) PLS (u) = W u W u0 (cid:6)W u (cid:0) W u0 (cid:27) xy (3.7) The matrix W u = (w 1 ;:::;w u ) is obtained a(cid:0)fter u rec(cid:1)ursions of the algorithm by stacking the weights generated at each step. Such weights are initialized with w = (cid:27) , and, for u > 1, 1 xy 1 w u = (cid:27) xy (cid:6)W u 1 W u0 1 (cid:6)W u 1 (cid:0) W u0 (cid:27) xy (cid:0) (cid:0) (cid:0) (cid:0) generates the subsequent weights. Notice that(cid:0)the weights ar(cid:1)e (cid:147)weighted(cid:148)covariances of the predictors and the response. 5In the appendix we show how the decomposition naturally leads to the factorial structure. 10
4 Reductions as Eigen-Decompositions In order to set the stage for the sequel, we now focus on the di⁄erent targets of the eigendecompositions the methods in Section 3 entail. Eigen-Decompositions Underpinning PCR and PLS (cid:150)PCR targets the extraction of derived inputs that maximize the variance of the explanatory variables. The PCs are left eigenvectors from the eigen-decomposition of var(x). PLS has (initial) target cov(x;y), which reveals its targeted nature and that, by focusing on the principal directions of cov(x;y), it is hard-wired to extract linear signals.6 Eigen-Decomposition Underpinning SDR (cid:150)SDR methods that will be introduced in the next Section carry out an eigen-decomposition of a target, called the seed or kernel, in order to isolate directions of principal variation in relevance to y. The SDR method used in the empirical applications, sliced inverse regression (SIR), is based on the eigen-decomposition of var[E(x y)]. To gain intuition of why such target works, we resort to j a classical probabilistic identity satis(cid:133)ed by any random vector x with (cid:133)nite second moment and conditioning random variable or vector y, var(x) = var[E(x y)] +E[var(x y)] (4.1) j j identi(cid:133)ed by PCA identi(cid:133)ed by linearSDR noise | {z } | {z } | {z } Suppose the range of y is sliced in non-overlapping bins and x y is the restriction of x in the bins j de(cid:133)ned by the slices of y. The right hand side of (4.1) obtains that the variance of x can be split into two parts: var[E(x y)] or between slice variation in x, and (cid:15) j E[var(x y)] or within slice variation. (cid:15) j In ANOVA, the (cid:133)rst summand is the signal that y carries about x since it represents variation of the average value of x associated with di⁄erent values of y from the overall x mean. The second element represents noise, i.e. deviations of x from its overall average across bins, hence unrelated to y. From this perspective, since PCA performs an eigenanalysis of var(x), noise in E[var(x y)] j may attenuate or suppress the signal in var[E(x y)] and result in PCs that are little related to y. j PLS targets cov(x;y) and can potentially suppress non-linear signal. By contrast, a method that focuses on the eigen-analysis of var[E(x y)] produces derived inputs ordered according to their j importance with respect to y and has the capacity to preserve non-linear signals. As we will see next,centeringonthesignalandignoringthenoiseiswhatsu¢ cientdimensionreductioningeneral and, in particular, sliced inverse regression is designed to do. 4.1 Su¢ ciency and Inverse Reductions De(cid:133)nition 2 A reduction R : R p R q, where q p, is su¢ cient if it satis(cid:133)es y x y R(x) or ! (cid:20) j (cid:24) j equivalently F (y x) = F (y R(x)) (4.2) j j 6Tobeprecise,wheny isascalar,cov(x;y)isavectoranditseigen-decompositionreturnsthevectoritself. HoweveritisusefultothinkofthePLSalgorithmasasequenceofeigen-decompositionswhencomparingthemethodology with PCA. When y is multivariate the PLS algorithm entails eigen-decomposition of the matrix cov(x;y). 11
Aconsequenceofthede(cid:133)nitionofsu¢ ciencyisthat,since(4.2)canbewrittenasF (y x;R(x)) = j F (y R(x)) we have j y x R ?? j Consequently, R(x) is called a forward reduction. Although the term (cid:147)su¢ cient(cid:148)was originally coined to highlight the information preserving role of R(x), it turns out that there is a speci(cid:133)c link with the Fisherian concept of statistical su¢ ciency (see Cook (2007)) [19] as we will see shortly. Before doing so, we introduce the concept of inverse reduction. De(cid:133)nition 3 A function R : R p R q, where q p, is an inverse reduction if ! (cid:20) d x (R(x);y) = x R(x) (4.3) j j Ifonetreatsy asaparameter, (4.3)statesthatR(x)isasu¢ cientstatisticfory anditcontains all information x contains about y. Thus, it is a su¢ cient reduction for the forward regression of y on x. Proposition 2 provides the formal statement and proof of this fact. 0 Proposition 2 Assume that the random vector y;x0 has a joint distribution and let R(x) be a measurable function of the predictor vector. Th(cid:16)en, (cid:17) d F (y x) = F (y R(x)) i⁄ x (R(x);y) = x R(x) j j j j Proof. Denote R(x) with R. Assume F (y x) = F (y R(x)) so that y x R and F (y;x R) = j j ?? j j F (x R)F (y R). Therefore, j j F (y;x R) F (y;x;R) F (y;x;R) F (x R) = j = = = F (x y;R) j F (y R) F (y R)F (R) F (y;R) j j j To prove the reverse statement we start with the de(cid:133)nition of conditional distribution of y (x;R) j F(y;x;R) F(x y;R)F(y;R) F(y x;R) = = j j F(x;R) F(x R)F(R) j d Using the condition x (R(x);y) = x R(x); which is equivalent to F(x y;R) = F(x R) and simj j j j plifying, one obtains F(y x;R) = F(y R). j j Proposition 2 sheds light on why inverse regression is a powerful tool for the identi(cid:133)cation of su¢ cient reductions of the predictors: if a function R(x) is a su¢ cient statistic for the inverse regression, it is also a su¢ cient reduction for the forward regression. This implies that the econometrician is free to choose the most convenient way to determine a su¢ cient reduction, either from the forward or inverse regression. An immediate advantage of inverse regression is that it treats eachpredictorseparatelyinsteadoftreatingthepanelasablock. Thatis,alargep-dimensionalforward regression (potentially non-linear) problem is split in p univariate regression problems, which are easily modeled if y is univariate (or has a small dimension) even if p is large. Furthermore, inverse regression allows a plethora of estimation methods, also non-parametric, where the curse of dimensionality would make modeling of the forward regression practically impossible. Therefore, the method can result in signi(cid:133)cantly more accurate estimation than a linear forward regression model. Most importantly, inverse regression accomplishes another goal in connecting su¢ cient reductions with the classical concept of a su¢ cient statistic: the (cid:147)parameter(cid:148)to be estimated and predicted is the whole time-series y . t 12
4.2 Linear Reductions and Moment-Based SDR Even though su¢ cient reductions need not be linear, moment-based SDR was developed under the requirement that R(x ) be linear. In linear SDR, R(x ) is a projection P x onto a lowert t t dimensional subspace of R p that incurs no loss of information about the condit S ional distribution S F (y x ) or selected features thereof. If the mean squared error loss is used to evaluate the t+h t j accuracy of the forecast, the conditional mean E(y x ) is of interest and the goal is to (cid:133)nd a ret+h t j duction R(x ) such that E(y x ) =E(y R(x )). In density forecasting, the whole conditional t t+h t t+h t j j distribution is the target and a reduction R(x ) such that F (y x ) = F (y R(x )) is sought. t t+h t t+h t j j In the rest of this Section we suppress subscripts keeping in mind that y is used in place of y t+h and x in place of x . t In this Section we focus the discussion on the identi(cid:133)cation (and peripherally to existence and uniqueness) of linear su¢ cient reductions and show how to exploit inverse regression to identify them. We require at the very outset the reduction be linear: Condition 1 SupposethereductionR(x)issu¢ cientandalinearfunctionofx;thatis, itsatis(cid:133)es F (y x) = F (y R(x)) (4.4) j j and R(x) = ax 0 for some p d matrix a: (cid:2) Notice that the de(cid:133)nition of su¢ ciency implies that we can only identify the subspace spanned by a linear reduction, span(a), rather than a per se, since F (y ax) = F (y bx) for all matrices 0 0 j j a and b such that span(a) = span(b). A subspace spanned by the columns of a matrix a with F (y x) = F (y ax) is called a dimension reduction subspace (DRS). 0 j j Existence and Uniqueness (cid:150)A linear reduction, although a trivial one, always exists, since one can always set R(x) = x = I x. For the same reason a DRS is not generally unique. SDR(cid:146)s p objective is to identify a minimal reduction, that is a DRS with minimum dimension as well as conditions that insure existence and uniqueness. Uniqueness and minimality are jointly guaranteed byfocusingontheintersectionofallDRS;suchintersection,ifitisitselfaDRS,iscalledthecentral subspace. The latter exists under reasonably mild conditions on the marginal distribution of x, such as convexity of its support. We refer to Cook (1998)[17] for more details and henceforth restrict attention to those regressions for which a central subspace exists. The identi(cid:133)cation of a minimal su¢ cient reduction or, equivalently, the identi(cid:133)cation of a basis for the central subspace requires moment conditions on the marginal distribution of the predictor vector x. Condition 2 (Linear Design Condition) There exists a full rank p d matrix v such that (cid:2) E x vx = Avx (4.5) 0 0 j for an invertible matrix A. (cid:2) (cid:3) In general, the linearity condition (4.5) on the marginal distribution of the predictors is di¢ cult to verify as it requires knowledge of v. Nevertheless, it is satis(cid:133)ed for all v R p (cid:2) d if the predictors 2 have an elliptically contoured distribution [See Eaton (1986)[30]]. The elliptically contoured family 13
of distributions includes the multivariate normal and Student(cid:146)s t distributions. Moreover, Steinberger and Leeb (2015) [52] showed that under comparatively mild conditions on the distribution of x the condition (4.5) is likely to be satis(cid:133)ed as p grows and the cross-section becomes large. Speci(cid:133)cally, they showed that if a random p-vector x has a Lebesgue density, the mean of certain functions of x is bounded and that certain moments of x are close to what they would be in the Gaussian case [see the bounds (b1) and (b2) in Th. 2.1, Steinberger and Leeb (2015)[52]], then the conditional mean of x given vx is linear in vx for a p d matrix v, as p and d is either 0 0 (cid:2) ! 1 (cid:133)xed or grows very slowly at the rate d=logp 0. An appealing feature of these results is that ! they rely on bounds that can be estimated from data. Steinberger and Leeb(cid:146)s result is of fundamental importance in SDR since it ascertains that the linearity condition (4.5) is satis(cid:133)ed by a large class of predictor distributions. Thus, (cid:133)rst-moment SDR estimators, such as SIR in the ensuing Section 4.3, can be widely used to estimate basis elements of the column space of v in the reduction R(x ) = vx . t 0 t The following lemma links the linearity condition with inverse regression and points to a means to (cid:133)nd the reduction. Lemma 1 Assume R(x) = vx satis(cid:133)es (4.4), that is, it is a su¢ cient reduction, and the linearity 0 condition (4.5) is satis(cid:133)ed for v. Then (cid:6) 1[E(x y) E(x)] span(v) (cid:0) j (cid:0) 2 where (cid:6) = cov(x). Equivalently, span (cid:6) 1[E(x y) E(x)] span(v) (cid:0) j (cid:0) (cid:18) (cid:0) (cid:1) Proof. See Corollary 10.1 in Cook (1998)[17] and Theorem 3.1 in Li (1991)[45]. Lemma 1 obtains that the centered and scaled inverse regression function (cid:147)lives(cid:148)in a subspace, the inverse regression subspace, spanned by the columns of v. That is, as y varies in R, the random vector (cid:6) 1[E(x y) E(x)] is contained in a subspace that is spanned by the columns of (cid:0) j (cid:0) v. Therefore, in order to identify the su¢ cient reduction we need to identify a basis that generates the subspace that contains (cid:6) 1[E(x y) E(x)] as y varies. The following proposition provides (cid:0) j (cid:0) the answer. Proposition 3 The column space of the matrix (cid:6) 1var(E(x y)) spans the same subspace as the (cid:0) j subspace spanned by (cid:6) 1[E(x y) E(x)]. That is, (cid:0) j (cid:0) span (cid:6) 1Var(E(x y) ) = span (cid:6) 1[E(x y) E(x)] span(v) (cid:0) (cid:0) j j (cid:0) (cid:18) (cid:0) (cid:1) (cid:0) (cid:1) Proof. SeeProposition11.1inCook(1998)[17],anextensionofProposition2.7inEaton(1983)[29], and Lemma 1. Lemma 1 and Proposition 3 draw a link between the distribution of the data and the subspace that we wish to identify. Notice that in general the column space of (cid:6) 1var(E(x y)) provides only (cid:0) j partialcoverageofthecentralsubspacesincetheinverseregressionsubspacecanbeapropersubset of the central subspace. Underadditionalconditionsonecanshowthatmoreexhaustivecapturingofthecentralsubspace is possible. Other inverse regression moments, such as E((cid:6) Var(E(x y))2; also live in the central (cid:0) j subspace under additional conditions on the marginal distribution of the predictors (Cook and Weisberg (1991) [23]). In order not to clutter the present exposition, we focus only on the (cid:133)rst inverse regression moment E(X Y) in order to introduce SDR methodology to the econometrics j 14
literature via the simple and widely used Sliced Inverse Ression (SIR, Li (1991)[45]).7 In general, linear moment-based SDR methods provide a way of identifying the number and coe¢ cients (up to rotations, as in the DFM literature) of the linear combinations of the predictors in the forward forecasting equation. A feature of SDR methods, which can be viewed both as an advantage and a downside, is that they are silent regarding the functional form of the forward regression. They obtain the linear combinations of x that are needed in the forecasting equation in order to t adequately reduce x . When the number of SDR directions is 1 or 2, a plot of the response versus t the reduction(s) can visually inform forward regression modeling. Dimension 2 or larger indicates that the forward model involves non-linear functions of the reductions. It is important to note that SDR methods do not remove the need to model the response but rather reduce signi(cid:133)cantly the complexity of modeling and uncover the structural dimension of the forward regression problem, i.e. how many derived linear combinations of the original predictors su¢ ce to completely explain y. 4.3 Sliced Inverse Regression Several estimators have been proposed in order to estimate the central subspace. We focus on the (cid:133)rst and most widely used: Sliced Inverse Regression (SIR) proposed by Li (1991)[45]. SIR is a semiparametric method for (cid:133)nding dimension reduction subspaces in regression. It is based on the results of Section 4.3 and uses a sample counterpart8 to (cid:6) 1Var(E(x y)), the population object (cid:0) j that lives in the subspace generated by the coe¢ cient matrix of the reduction. The name derives from using the inverse regression of x on the sliced response y to estimate the reduction. For a univariatey, themethodisparticularlyeasytoimplement, SIR(cid:146)sstepfunctionsbeingaverysimple nonparametric approximation to E(x y). j Implementation of SIR (cid:150)In order to estimate M = var(E(x y)), the range of the observed rej sponsesY = (y ;:::;y ) isdividedinJ disjointslicesS ;:::;S whoseunionistherangeofY. We 1 T 0 1 J denotetheoverallsamplemeanofthesamplepredictormatrixXbyX(cid:22) = ( T x =T;:::; T x =T), t=1 t1 t=1 tp 0 and for j = 1;:::;J, we let X(cid:22) = X =n , where n is the number of y (cid:146)s in slice S . The coj yt Sj t j j P t j P variance matrix of x is estimated by th 2 e sample covariance matrix (cid:6) = T (X X(cid:22))(X X(cid:22))=T, P t=1 t (cid:0) t (cid:0) 0 and the SIR seed M with J P n b M = j (X(cid:22) X(cid:22))(X(cid:22) X(cid:22)) j j 0 T (cid:0) (cid:0) j=1 X c The spectral value decomposition of M yields its d left eigenvectors u ;:::;u that correspond 1 d to its d largest eigenvalues, (cid:21)^ > (cid:21)^ > ::: > (cid:21)^ . The matrix B = (cid:6)(cid:0) 1 (u ;:::;u ) = (b ;:::;b ) 1 2 d 1 d 1 d estimates v in R(x) = vx of Lemma 1c. The SIR predictors to replace Xb in thebforward regression 0 are the columns of the T d matrix XB = (Xb 1 ;:::;Xb d ). Tbhe nubmberbof SIRbdirections, d, is (cid:2) typically estimated using asymptotic weighted chi-square tests (Bura and Cook (2001a)[12], Bura and Yang (2011)[15]), information criterbia such as AIC and BIC, or permutation tests (Yin and Cook (2001)[24]. We note that these tests are valid under the assumption of iid draws from the joint distribution of (y;x), which is typically not the case for econometric data. How SIR works: SIR (cid:133)nds the directions of maximum variance between slices, with T data points collapsed in J slice means clustered according to y labels (slices). In the extreme case of 7For instance, although SIR may not exhaustively identify the central subspace, it can be shown that SIR is exhaustive when xy is multivariate normal with constant variance-covariance matrix (See Cook (2007)[19]). 8Striclyspeakin j g,SIR,throughslicing,forcesadiscretizationofyratherthantheactualrealizationsofy,however it is possible to show that the space spanned by the slices is a subset of the central subspace. 15
J = T, i.e. when each slice corresponds to a single y observation, M becomes (cid:6), the sample covariance of x, and SIR is identical to PCA. However, for J < T, the variance (noise) of the components within the same slice is suppressed in favor of their signal, which makes SIR much more e¢ cient in identifying x components (projections) targeted to y. 4.3.1 Statistical Properties In Proposition 4;we show that the SIR directions are consistent estimators of directions in the central subspace for all x satisfying the linear design condition (4.5) and conditional distributions t of x y , h = 1;2;::: with (cid:133)nite second moments. t t+h j Proposition 4 Assume that the time series x and x (y = s), s = 1;:::;J, t = 1;:::, h = t t t+h j 0;1;:::, arebothcovariance-statonarywithabsolutelysummableautocovariances, i.e. (cid:27) (l) < 1l= j jj j ; (cid:27) (l) < ;j = 1;:::;p: Then, the SIR directions are consistent estimat(cid:0)o1rs of di- 1 rection 1l= s(cid:0)i1n j the jj j c y e t+ n h tral j sub 1 space for all x satisfying the linear design condition (4.5). P t P Proof. SIRisbasedonthecovariancematrixM = cov(E(x y )), t = 1;:::;T. Ify isdiscrete h t t+h t j and (cid:133)nite, we can assume y 1;2;:::;J without loss of generality. Let p = Prob(y = s) t s t+h 2 f g and m = E(x y = s), s = 1;:::;J. Then, s t t+h j J cov(E(x y )) = p (m (cid:22))(m (cid:22)) t t+h s s s 0 j (cid:0) (cid:0) s=1 X As a result of the second order stationarity with absolutely summable autocovariances of x and t x (y = s), s = 1;:::;J, t = 1;:::, h = 0;1;:::, the sample moments X(cid:22) and m = X(cid:22) = t t+h s s j X =n , where n is the number of y (cid:146)s equal to s, are both consistent as T;n . Also, yt+h=s t s s t s ! 1 p^ s = n s =T p s . Therefore, b P ! J M = p^ (m X(cid:22))(m X(cid:22)) p M h s s s 0 h (cid:0) (cid:0) ! s=1 X c b b as it is a continuous function of consistent estimators. Consequently, the eigenvectors of M , h u , k = 1;:::;p, converge to the corresponding eigenvectors of M . Moreover, since the sample k h 1 covariance matrix (cid:6) is consistent for (cid:6), the SIR predictors (cid:6)(cid:0) u , k = 1;:::;d are consistentcfor k tbhe d columns of v in the su¢ cient reduction R(x ) = vx . Notation and results for stationary t 0 t and ergodic time sebries that we use are provided in AppendixbD.b When y is continuous, it is replaced with a discrete version y~based on partitioning the observed range of Y into J (cid:133)xed, non-overlapping slices. Since y x vx yields that y~ x vx, we have 0 0 ?? j ?? j S S . In particular, provided that J is su¢ ciently large, S S , and there is no loss of in Y~ foj x rm (cid:18) ati Y on j x when y is replaced by y~. y~ j x (cid:25) y j x Under more restrictive assumptions on the processes x and x (y = s), s = 1;:::;J, t = t t t+h j 1;:::, h = 0;1;:::, it can also be shown that their sample means are approximately normally distributed for large T (see Appendix D). Under the same assumptions we can then obtain that M isasymptoticallynormalfollowingsimilarargumentsasBuraandYang(2011)[15]whoobtained h the asymptotic distribution of M when the data are iid draws from the joint distribution of (y;x). c c 16
4.3.2 Inverse Regression as Extraction of Targeted Factors Most SDR methodology is based on inverse regression. In general, inverse regression focuses attention on the set of p inverse regressions x = a+Bf (y)+e (4.6) where y is substituted with f (y) that contains functions of y whose choice re(cid:135)ects di⁄erent inverse regression based SDR methods. Such functions play the role of (cid:147)observed factors(cid:148)and in practice, in addition to contemporaneous and lagged values of y, may contain various functions of y such as polynomials. For example, SIR e⁄ectively approximates f (y) with step functions; parametric inverseregression(PIR)(BuraandCook(2001b)[13])andprincipal(cid:133)ttedcomponents(PFC,Cook and Forzani (2008)[20]) approximate f (y) with continuous functions of the response. These three SDR methods essentially analyze and extract the (cid:133)rst few PCs of the space of the (cid:133)tted values in (4.6). The term f (y) plays the role of a factor structure, but, in contrast with the DFM, it is observable. Intuitively the inverse regression approach replaces x with its projection on f (y) and in so doing it extracts its (cid:147)targeted(cid:148)factor structure. 5 Empirical Application 5.1 Data and Out-of-Sample Forecasting Exercise Next we want to put our estimators to work comparing them in a classical pseudo out-of-sample macro-forecasting horserace against the parsimonious AR(4). We pick in(cid:135)ation (CPIAUCSL) and industrial production as our targets (INDPRO). Then we look for a large set of regressors in order to feed into our models as much information on the macroeconomy as possible. We would like to adopt a (cid:147)standard(cid:148)dataset containing a large number of US macro variables however the various forecasting studies that use large panels of US macro variables, aside from a core set of agreed upon macro variables, is quite inhomogenous regarding the speci(cid:133)c set of non-core variables to be used in forming the initial dataset from which to choose or combine the variables. Given that one of the tasks of the forecasting exercise is choosing the most useful variables to forecast a given target variable, the choice of the initial dataset seems indeed crucial. The next paragraphs highlights the rugged landscape of data sources used in some of the most important studies in the DFM macro-forecasting literature and our (cid:133)nal choice. Settling on a Shared Data Source (cid:150)A problem faced by any researcher in the macro-forecasting (cid:133)eld is the lack of comparability across studies due to the multitude of data sources and data vintages used in the literature, a phenomenon that has been pervasive until recently. Luckily an initiative spearheaded by McCracken and Ng and documented in McCracken and Ng (2015) [49] has set out on the project to impose some discipline in the current and future production of macroforecasting studies. One outcome of the project has been the creation of a dataset of 132 macro variables called FRED-MD and updated in real time by the same sta⁄that maintains the popular FRED database.9 We embraced their initiative and adopted FRED-MD as our dataset of choice although the dataset comes with some limitations that we discuss to some extent below and more in detail in the appendix. Alternative Data Sources in Other DFM Studies (cid:150)FRED-MD has fewer variables than the quarterly dataset of 144 variables used by Stock and Watson in [58], apparently the most exploited 9The data can be dowloaded from Michael McCracken(cid:146)s website at the St. Louis Fed at the address: https://research.stlouisfed.org/econ/mccracken/fred-databases/ 17
dataset in quarterly studies. It has also fewer variables than the dataset of 143 variables used by Stock and Watson in [60]. The latter is a quarterly study but the dataset posted by Mark Watson contains monthly variables. Finally our dataset contains fewer variables than the 149 regressors used by Stock and Watson in [54] or the 215 series used by Stock and Watson in [55].10 In turn, both studies draw upon the work in the seminal NBER working paper [53] by Stock and Watson. The recently published quarterly study [60] by Stock and Watson on shrinkage methods uses 143 series but in the dataset posted online by Mark Watson one can (cid:133)nd only 108 monthly variables and 79 quarterly variables. Although we have chosen the most up-to-date dataset available online at the cost of omitting some variables given the data intensity of our statistical techniques we also tried to run our pseudo-out-of-sample forecasts using some of the richer datasets mentioned above, and we have noticed a slight deterioration of forecasting performance when using our dataset of choice signalling that either some of the variables that we do not include might be marginally helpful in improving the forecast or that data revisions played a role.11 De(cid:133)nitely the inclusion of the great recession and the subsequent recovery, a period not covered by the mentioned studies, appears to impart a substantive deterioration in the forecasting performance of the estimators that wereview. Thefollowingtablesummarizesseveralrelevantstudiesandthesalientcharacteristicsof their dataset and statistical methods used and McCracken and Ng (2015) [49] have an informative chronology of the evolution of the large panel of macro datasets. Table 1: Summary of Datasets Characteristics of Several Relevant Studies in the DFM literature Study # Var. Freq. Online Data Span Meth. Stock and Watson (1998) [53] 224 m no 1959-1997 PCR, VAR Stock and Watson (2002) [54] 149 m no 1959-1999 PCR, VAR Stock and Watson (2002) [55] 215 m no 1959-1999 PCR, VAR Bai and Ng (2008) [4] 132 m no 1959-2003 SPC, EN Stock and Watson (2005) [56] 132 m yes 1959-2003 VAR, SVAR, PCR Boivin and Ng (2006) [10] 147 m no 1960-1998 WPCR Stock and Watson (2008) [58] 144 q yes 1959-2006 Split-Sample PCR PCR, IC, PT Stock and Watson (2012) [60] 108 m, q yes 1960-2008 BMA, EB, B Jurado et al. (2015) [42] 135 m yes 1959-2010 PCR FRED-MD (cid:150)The dataset is called FRED-MD (where MD stands for monthly dataset) and it contains a balanced panel with monthly data from 1960m1 to 2014m12, covering 54 years totaling 648 monthly observations.. We choose to work with monthly data since the companion quarterly dataset FRED-QD is not available yet. Moreover, our SDR forecasting procedure is quite data intensive and monthly data seem to be a better workbench to test our estimators at this juncture. The dataset is described in McCracken and Ng (2015) [49] along with a discussion of some data adjustments needed to construct the panel. We note that a major shortcoming of the dataset is that core CPI and non-farm payroll employment, two of the most watched series by forecasters and FED o¢ cials have not been included so far. FRED-MD contain very limited real-time vintages making real-time forecasting unfeasible at the moment. 10The series in this study came from the DRI/McGraw-Hill Basic Economics database, formerly named Citibase. 11We are currently working on retrieving an updated dataset with real-time vintages with about 150 regressors as used by Stock and Watson in some of their work. 18
Manipulation of FRED-MD (cid:150)Some variables in the dataset have a large number of missing data. Rather than running an EM algorithm to (cid:133)ll in the missing data and achieve a balanced panel as done by McCraken and Ng (2015) [49], who in turn follow Stock and Watson (2002b) [55], we exclude them. The (cid:133)ve excluded variables are: ACOGNO=(cid:147)New Orders for Consumer Goods(cid:148), ANDENOx=(cid:147)New Orders for Nondefense Capital Goods(cid:148), OILPRICE=(cid:147)Crude Oil, spliced WTI and Cushing(cid:148), TWEXMMTH=(cid:147)Trade Weighted U.S. Dollar Index: Major Currencies(cid:148)and UM- CSENT=(cid:147)Consumer Sentiment Index(cid:148). We do not apply any cleaning of outliers. Data Transformations and Forecast Targets (cid:150)We adopt the transformations suggested by Mc- CrackenandNg(2015)[49]andcodedinthesecondrowoftheoriginaldownloadeddataset. Weopt to present our results for a (cid:147)nominal(cid:148)forecast target, CPI in(cid:135)ation with mnemonics CPIAUCSL, and a (cid:147)real(cid:148)target, total industrial production with mnemonics INDPRO. We follow the literature and instead of forecasting the chosen target variables h months ahead we forecast the average realization of the variable in the h months ahead period. Hence the transformation of the target variable dictates the forecast target. For instance in the case of in(cid:135)ation, a variable marked as I(2) and transformed as y = (cid:1)2log(CPI ) t t we generate the target 1200 CIP CPI yh = ln t+h 1200ln t h+t h CIP (cid:0) CPI t t 1 (cid:18) (cid:19) (cid:18) (cid:0) (cid:19) Industrial production is a variable marked as I(1) and tranformed by y = (cid:1)log(IP ) t t and the resulting target will be 1200 IP yh = ln t+h h+t h IP t (cid:18) (cid:19) The Pseudo Out-of-Sample Forecasting Scheme (cid:150)We align the data as shown in the following Figurebyplacingthetargetonthesamelineoftheregressorsinanidealh-stepaheadOLSscheme. The transformed and aligned data are available on request. We conduct our forecasting exercise at horizons h = 1;3;6;12;24. Both these are relevant horizons in practice and they are enough to allowexplorationofpossiblevariationacrosshorizonswithineachforecastingmethod. Forpractical reasons, ascommonintheliterature, weadopth-stepaheadregressionratherthaniteratedinorder to avoid the simulation and feeding of exogenous regressors. As indicated in Figure 1, an advantage of PCR, a non-supervised method, is that the principal components can be re-computed as soon as a new line of obervations becomes available in the recursive scheme. The superscript of the PC component in the table highlights this point: for instance when estimating PCR to forecast 3-steps ahead, a 3-step ahead regression is run in t = 1984m01 regressing y3 on the PCs for 1984m01 t+3 but computed using data through t+1 = 1984m02. Then new data through 1982m02 is used to forecast y3 and prediction is formed t+3+1 t+3+1 y3 = (cid:11) +(cid:13) (L)y +(cid:12) (L)PC (t) t+3+1 3 3 t 3 and compared with the real b ized data (the comparison involvding the two yellow cells in Figure 1). 19
Figure 1: Data Alignement and Out-of-Sample Forecasting Scheme Row Date t yh=3(t+h) yh=1(t+h) y(t) y(t 1) X(t) PC(t) PLS(t) yh3hat yh1hat 11959m01 NA NA y(1959m01)NA X(1959m01) NA NA 21959m02 yh=3(1959m05)yh=1(1959m03)y(1959m02)y(1959m01)X(1959m02) NA NA 31959m03 yh=3(1959m06) NA NA … … … … … … … NA NA … … … … … … … NA NA 131960m01 yh(1960m04) yh=1(1960m02)y(1960m01)y(1959m12)X(1960m01) NA NA … … … … … … … NA NA … … … … … … … NA NA … … … … … … … NA NA … … … … … … … NA NA … … … … … … … NA NA 3011984m01 yh=3(1984m04)yh=1(1984m02)y(1984m01)y(1983m12)X(1984m01)PC1984m02(1984m01)PLS1984m01(1984m01) NA NA 3021984m02 yh=3(1984m05)yh=1(1984m03)y(1984m02)y(1984m01)X(1984m02) yhath=3(1984m05) yhath=1(1984m03) 3031984m03 yh=3(1984m06)yh=1(1984m04)y(1984m03)y(1984m02)X(1984m03) yhath=3(1984m06) yhath=1(1984m04) y(1984m04)y(1984m03)X(1984m04) … … … … … … … … … … … … … … … … … … … … yh=3(2015m05)yh=1(2015m03)y(2015m02)y(2015m01)X(2015m02) … … 6752015m03 yh=3(2015m06)yh=1(2015m04)y(2015m03) … … yhath=3(2015m06) yhath=1(2015m04) 6762015m04 NA yh=1(2015m05)y(2015m04) … … NA yhath=1(2015m05) 6772015m05 NA yh=1(2015m06)y(2015m05) … … NA yhath=1(2015m06) 6782015m06 NA NA y(2015m06) … … NA NA We report results for di⁄erent sub-samples however our out-of-sample forecasting exercise uses a recursive window rather then a moving one. 5.2 Estimation Details and Results on the Number of Components The practical implementation of the estimators summarized in an earlier Section necessitates the choice of several details. In this Section we present these choices by estimation procedure and some sensitivity analysis of the results. RIDGE (cid:150)In order to implement RIDGE we used the R package glmnet by Friedman et al. (2015) [33]. The shrinkage parameter (cid:21) needs to be chosen prior to the estimation. DeMol et al. (2008) [25] resorted to (cid:133)t a grid of values and report MSFE for all. We tried some of the values of their grid and we discuss their forecast performance below. However we also opted to fully exploit the full functionalities of glmnet and let the data suggest a value for lambda using nfold crossvalidation (where n = 8). As a result of the cross-validation we obtain sequences for (cid:21)s at each forecasting step that we plot in Figure 2. In the top plot are the values of (cid:21) that minimize the min cross-validation error. The bottom plot contains the values of (cid:21) corresponding to a 1 standard 1se deviation of the cross-validation sequence (the default). A striking pattern is revealed in the plot: it appears that forecasting at any horizon becomes increasingly di¢ cult after the great recession with a sudden surge both in the volatility and average level of (cid:21), a sign of RIDGE attempting to shrink more as a reaction in the increase di¢ culty in prediction. The discouraging results in terms of MSFE that we report in the next Section appear to be the (cid:135)ip side of this feature. 20
Figure 2: CPIAUCSL: (cid:21) and (cid:21) Selected by Cross-Validated RIDGE over the min 1se Recursive Forecasting Window. PCR (cid:150)We estimate many di⁄erent versions of PCR in order to cover a wide menu of choices of thetruncationmeta-parameterM. FirstofallwefollowStockandWatson(2002)[55]andestimate a series of PCRs (in this context known as di⁄usion indexes with acronym DIAR) with a constant M throughout the forecasting experiment. We looked at values for M = 1;2;3;4;5;6;8;10;20;30 generating models PCRn1 through PCRn30. When forecasting in(cid:135)ation we (cid:133)nd that 8 components explain about 45% of the variance of the panel matching the (cid:133)ndings in McCracken and Ng (2015) 21
[49]. We note that simulations from an approximate factor model in Barbarino and Bura (2015) [7] in which the true number of factors is 8 result in at least 75% of explained variance on average across simulations. We use the R-package pls by Mevik et al. (2013) [50] which also implements PCR where the number of components M is selected via cross validation.. The cross-validation in-sample MSFE has an interesting contour over the number of components with one or two local minima and a global minimum all of them depending on the forecast horizon. The global minimum on average is achieved only with a rather large number of components especially toward the end of the sample as shown in the (cid:133)rst panel in Figure 3. That is a striking feature considering that in simulations reported in our companion paper Barbarino and Bura (2015) [7] cross-validation tends to gravitate around a number of components close to the number of true factors. The results in our empirical investigations in this paper suggest that either the true number of factors generating FRED-MD is indeed very large or that some non-linearities (such as change in regime) or other features of the data disrupt somewhat the e⁄ectiveness of cross-validation after the early 2000s. Cross-validation appears to generate a quite volatile choice for the number of components over the experiment at all forecast horizons. The evolution of the (cid:133)rst local minimum is shown in the second panel of Figure 3 (modelPCRmin1), whichalthoughmaybelessvolatileitisstillquiteunstable. Anotherpossibility is implementing best subset selection (BSS) in the choice of the components: model PCRbss is allowed to pick any component and PCRbss30 only in the (cid:133)rst 30.To implement BSS, we use the leaps R-package by Thomas Lumley (2009) [47]. BSS does not impose a hierarchical ordering of the components (in which if component #3 is used also component #2 is used). Rather it uses a backward search running very many regressions from larger models to smaller ones eliminating regressors using the BIC criterion. The last two panels of Figure 3 highlight that forecasting at longerhorizonsseemsmoredi¢ cultandrequiresmorecomponents. Alsoforthesemodelsitappears that forecasting in the last part of the window requires more components on balance. 22
Figure 3: CPIAUCSL: Number of Components Selected by Cross-Validated and Best Subset PCR over the Recursive Forecasting Window. We also run PCR with selection of the components according to Bai and Ng (2002) [2] and pick their favorite PCp2 and the alternative ICp2 criteria. Also in this exercise we (cid:133)nd instabilities 23
worth of note. We can more or less reproduce the results in McCracken and Ng (2015) [49] in which it is shown that such criteria select about 8 to 10 components (as remarked above this number of components explains less than half of the variance in the panel). However especially PCp2 is very sensitive to the maximum number of allowed components that has to be entered in their computation (a meta-parameter that we denote with k ). Setting k = 30, a natural max max choice since 30 components explain about 90% of the variance of the dataset (model PCRpcp2, top plotintheFigurebelow), weobtainthatPCp2selectsmanymorecomponentsthanwithk = 15 max (model PCRpcp2b, 3rd panel below). For k = 50 PCp2 selects way more components, more max than cross validation and very frequently all components, about 120 on average. By contrast ICp2 appears to be more stable when k moves from 15 to 30. ICp2 becomes very unstable when k max max is above 70. The selection of the number of components for these criteria is shown in Figure 4. Notice that these criteria, being untargeted are not a⁄ected by the forecast horizon at hand. PLS (cid:150)We implement PLS using the pls package in [50]. We choose the truncation metaparameter u with cross-validation. We implement two models. In PLSRcv we include y and its t lags directly in the X matrix whereas in model PLSRcvd we do not include them and in a second step we run OLS of the target on y and its lags and the PLS components. In contrast with PCR, t at short horizons u is not very large. At longer horizons, h = 12 and h = 24 PLS needs a large number of components, similar to PCR. The great recession appears to require more components also at shorter horizons. 24
Figure 4: CPIAUCSL: Number of Components Selected by PCp2 and ICP2 Criteria over Recursive Forecasting Window (k = 30 in top 2 panels, k = 15 in bottom max max 2 panels). 25
Figure 5: CPIAUCSL: Number of Components Selected by Cross-Validation of PLS over Recursive Forecasting Window. Table2comparesPCRandPLSonthebasisofexplainedvariancewith10components. Despite the much smaller number of components the superior targeting nature of PLS relative to PCR is evident as 10 components explain about 70% of the variance CPIAUCSL and 60% of the variance 26
in INDPRO. Not only does PLS explains a larger fraction of the variance in the target but also it explains a fraction of the variance of the panel similar to the fraction explained by PCR. Table 2: Comparing PCR and PLS on the Basis of Explained Variances by 10 Comp. CPIAUCSL INDPRO %var(X) 52% 52% PCR10 %var(y ) 33% 40% t+h %var(X) 50% 46% PLS10 %var(y ) 70% 60% t+h Tables 3 and 4 wrap up the results on the average number of components and their volatility across our estimators. Table 3: Average of Num. Components Selected (Estimation Sample: 1960m01- 1982m01-2013m01) CPIAUCSL INDPRO horizon h=1 h=3 h=6 h=12 h=24 h=1 h=3 h=6 h=12 h=24 RIDGE 1.43 0.91 0.54 0.73 0.47 14.78 9.23 5.43 7.41 3.18 RIDGE1se 69.94 15.19 26.05 6.31 2.70 192.24 70.51 35.10 25.39 12.89 PCRcv 47.90 46.93 46.18 48.39 46.57 117.38 117.06 117.00 117.13 117.11 PCRmin1 11.38 11.57 11.56 11.35 11.21 34.30 35.75 36.21 35.79 36.37 PCRmin2 23.37 23.43 23.68 23.02 23.19 57.72 59.60 60.95 60.26 60.16 PCRpcp2 23.73 23.73 23.73 23.73 23.73 23.53 23.53 23.53 23.53 23.53 PCRpcp2b 10.37 10.37 10.37 10.37 10.37 10.37 10.37 10.37 10.37 10.37 PCRicp2 6.74 6.74 6.74 6.74 6.74 6.74 6.74 6.74 6.74 6.74 PCRicp2b 6.74 6.74 6.74 6.74 6.74 6.74 6.74 6.74 6.74 6.74 PCRbss 9.30 7.06 12.10 22.23 24.61 10.53 12.41 18.49 17.02 17.94 PCRbss30 3.15 2.98 3.53 5.97 7.67 3.88 4.12 4.40 7.28 8.73 PLSRcv 9.01 10.97 27.85 70.86 63.99 1.39 2.37 13.18 2.06 4.17 PLSRcvd 3.23 5.24 8.37 39.31 44.79 1.26 2.47 4.58 2.18 3.94 27
Table 4: Standard Deviation of Num. Components Selected (Estimation Sample: 1960m01-1982m01-2013m01) CPIAUCSL INDPRO horizon h=1 h=3 h=6 h=12 h=24 h=1 h=3 h=6 h=12 h=24 RIDGE 3.88 3.15 2.95 2.98 2.07 9.86 9.23 5.43 7.41 3.18 RIDGE1se 241.51 115.36 192.18 26.19 16.66 91.86 70.51 35.10 25.39 12.89 PCRcv 20.91 21.15 21.06 20.88 20.80 2.55 117.06 117.00 117.13 117.11 PCRmin1 6.50 6.67 6.36 5.83 6.16 25.76 35.75 36.21 35.79 36.37 PCRmin2 8.18 7.77 7.46 7.41 7.35 27.51 59.60 60.95 60.26 60.16 PCRpcp2 0.95 0.95 0.95 0.95 0.95 0.91 23.53 23.53 23.53 23.53 PCRpcp2b 0.49 0.49 0.49 0.49 0.49 0.48 10.37 10.37 10.37 10.37 PCRicp2 0.96 0.96 0.96 0.96 0.96 0.96 6.74 6.74 6.74 6.74 PCRicp2b 0.96 0.96 0.96 0.96 0.96 0.96 6.74 6.74 6.74 6.74 PCRbss 2.34 2.74 3.27 3.02 1.18 3.96 12.41 18.49 17.02 17.94 PCRbss30 1.18 1.71 1.79 1.26 1.77 1.14 4.12 4.40 7.28 8.73 PLSRcv 6.98 5.78 23.09 37.27 26.39 0.68 2.37 13.18 2.06 4.17 PLSRcvd 3.49 6.27 11.29 36.88 34.18 0.44 2.47 4.58 2.18 3.94 SIR (cid:150)As mentioned SIR requires a large sample to yield reliable estimates. In our companion paper Barbarino and Bura (2015) [7] we develop a Krylov subspaces version of SIR that can handle cases where p > T. However in line with the exploratory nature of this paper we wanted to test the performance of a standard SIR which necessitated pre-processing of the data over samples where p > T or p is of order close to T. When we pre-condition the data we use 20 or 30 PCs, a number su¢ cient to preserve between 75% and 90% of the total variance in the panel. The idea is that by retaining a su¢ cient number of components not too much information on the conditional predictive density of y x is lost and the application of SIR on the reduced data can t+h t j still identify and estimate a SDR subspace. We describe in detail the algorithm use to compute our pre-processedSIRintheAppendix. Althoughweareexperimentingwithnon-parametricregression techniques in order to model the forward regression at this point we report only results obtained using a linear forward regression that models the dynamics of y and includes SIR components. t This solution is suboptimal and likely negatively a⁄ects the forecasting accuracy of our estimator as it omits including non-linear terms which are implied by the higher than one dimension of the SIR predictors. Despite this fact our results are encouraging. Regarding the choice of the dimension of the SDR subspace, we tried to apply both the asymptotic weighted-chi square in Bura and Cook (2001b) [13] and the permutation test in Cook and Yin (2001) [24], however both proved to be very unstable and unreliable in our time-series settings. While we are working on the development of a test appropriate for our time-series environment, in this paper we estimate the dimension to be the number of SIR predictors resulting in the most accurate forecasts under several scenarios of constructing the SIR predictors and forward forecasting models. We verify that almostneveradimensionlargerthantwoisbene(cid:133)cialintheforecastingexercise. Noticethatinthe factor literature the information criteria used to select the number of components are somewhat cumbersome, involving some (cid:147)model-mining(cid:148)(for instance see the preceding discussion on k ). max Our two-step procedure can be viewed as an alternative way of selecting the PC components whereby optimal selection is achieved by SDR techniques achieving as we will see shortly extreme parsimony.12 We will also show that for large samples standard SIR applied to the raw variables 12This might have the (cid:135)avor of a optimal weighting scheme in the extraction of the factors as suggested by Boivin 28
has competitive performance using only one or two SIR predictors. 5.3 Results: Forecasting Performance We now turn to the analysis of the forecasting performance of the estimators that we have lined up in this study. We concentrate on the mean square forecast error (MSFE) as a measure of performance although broadly similar results are obtained using the mean absolute error criterion. The forecasting performance can depend on the range of the sample it is based on. This is particularly true as the (cid:147)great recession(cid:148)is covered in our data. To study such e⁄ect and also to be able to draw inference unencumbered by such e⁄ect, we consider several sample ranges. EstimatorsthatUseOneComponent (cid:150)AR(4),OLSandRIDGEuseonlyonelinearcombination to form their forecast. Also RIDGE does so. In Table 5 we report MSFE for these three methods at (cid:133)ve di⁄erent horizons. OLS, as expected, is greatly a⁄ected by the relatively small sample size. This said, in simulations conducted in a companion paper [7] we found that OLS are much more competitive when the data are generated by an exact factor model and only large deviation from such DGP or extreme paucity of observations disrupt the forecast e¢ ciency of OLS. We do show in that paper that indeed an exact factor model implies that OLS is the correct model to use. We were surprised by the sub-par performance of RIDGE as it performs well very few times. This is in contrast with results in DeMol et al. (2008) [25]. We did feed into our RIDGE estimator (models RIDGE141 through RIDGE3532) also the parameters suggested in DeMol et al. (2008) [25] with little success. Data revisions, sample and estimation procedures may explain the di⁄erence. Table 5: OLS, AR4 and RIDGE: MSFE Relative to AR4 (Estimation Sample: 1960m01- 1982m01-2013m01) CPIAUCSL INDPRO horizon h=1 h=3 h=6 h=12 h=24 h=1 h=3 h=6 h=12 h=24 OLS 2.11 6.07 8.92 7.77 1.07 1.73 1.33 4.61 32.47 7.33 AR4 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 RIDGE 0.92 2.05 5.57 7.66 1.51 0.97 0.87 1.11 6.04 6.99 RIDGE1se 1.00 1.68 5.33 6.66 1.47 0.94 0.95 0.99 1.70 2.55 RIDGE141 1.10 1.56 2.04 2.36 2.73 0.92 0.98 1.00 1.05 1.02 RIDGE288 1.11 1.59 2.09 2.43 2.84 0.96 1.08 1.08 1.05 1.01 RIDGE292 1.11 1.59 2.09 2.43 2.84 0.96 1.08 1.08 1.05 1.01 RIDGE528 1.12 1.60 2.11 2.47 2.89 1.01 1.17 1.15 1.06 1.01 RIDGE582 1.12 1.60 2.12 2.47 2.89 1.01 1.18 1.16 1.07 1.01 RIDGE949 1.12 1.61 2.13 2.49 2.92 1.05 1.24 1.19 1.07 1.01 RIDGE3532 1.12 1.62 2.15 2.52 2.96 1.11 1.34 1.27 1.09 1.01 Principal Components Regression (cid:150)We now turn to the performance of the di⁄usion index models in Table 6. PCR appears to be e⁄ective when forecasting one month ahead for both targets howeverforecastingin(cid:135)ationwithPCRhitsawallatlongerhorizons. Includingenoughcomponents appears to be key in general. Industrial production appears to be an easier to forecast and methods that select a relatively stable number of components are the most successful, such as (cid:133)xing 8 or 10 components or using PCp2 or ICp2. Also BSS restricted to 30 components seems to be working and Ng (2006) [10]. 29
consistently well across horizons. PCR with 10 PCs has consistently good performance for CPI but not for industrial production. Table 6: Principal Component Regression: MSFE Relative to AR4 (Estimation Sample: 1960m01-1982m01-2013m01) CPIAUCSL INDPRO horizon h=1 h=3 h=6 h=12 h=24 h=1 h=3 h=6 h=12 h=24 PCRcv 0.97 1.02 0.96 1.31 1.83 1.17 7.67 15.88 7.66 3.77 PCRmin1 0.95 1.07 1.11 1.18 1.37 1.00 1.06 1.04 1.00 0.90 PCRmin2 0.99 1.05 1.06 1.22 1.29 1.02 1.12 1.12 1.03 0.95 PCRpcp2 0.96 1.01 1.02 1.10 1.25 1.04 1.01 0.99 0.93 0.89 PCRicp2 0.95 1.06 1.11 1.19 1.38 0.92 0.89 0.94 0.96 1.06 PCRpcp2b 0.93 1.07 1.12 1.20 1.40 0.94 0.87 0.93 0.92 0.93 PCRicp2b 0.95 1.06 1.11 1.19 1.38 0.92 0.89 0.94 0.96 1.06 PCRbss 0.97 1.09 1.04 1.03 1.10 1.05 1.03 1.05 1.01 0.89 PCRbss30 0.97 1.06 1.06 1.16 1.30 0.96 0.93 0.95 0.91 0.90 PCRn1 1.02 1.05 1.08 1.13 1.34 0.96 0.96 0.98 0.99 1.01 PCRn2 1.03 1.07 1.08 1.11 1.22 0.95 0.94 1.00 1.01 1.08 PCRn3 1.03 1.07 1.09 1.14 1.27 0.94 0.93 0.98 0.99 1.07 PCRn4 0.98 1.07 1.09 1.16 1.33 0.93 0.93 0.98 0.99 1.08 PCRn5 0.97 1.07 1.10 1.18 1.39 0.93 0.94 0.99 1.00 1.07 PCRn6 0.95 1.05 1.10 1.19 1.40 0.93 0.93 1.00 1.01 1.08 PCRn8 0.95 1.06 1.11 1.19 1.38 0.92 0.85 0.92 0.94 1.05 PCRn10 0.93 1.07 1.12 1.21 1.41 0.94 0.87 0.93 0.93 0.94 Partial Least Squares Regression (cid:150)The general impression from Table 7 is that cross-validation is very e⁄ective when forecasting industrial production at all horizons whereas (cid:133)xing the number of components causes a deterioration of the MSFEs. The opposite seems to be true when forecasting in(cid:135)ation, in which case (cid:133)xing the number of components to 6 or 8 seems the most appropriate choice. On balance, PLS comes out of the horserace as one of the best performers especially for in(cid:135)ation. Table 7: Partial Least Squares Regression: MSFE Relative to AR4 (Estimation Sample: 1960m01-1982m01-2013m01) CPIAUCSL INDPRO horizon h=1 h=3 h=6 h=12 h=24 h=1 h=3 h=6 h=12 h=24 PLSRcv 1.02 1.09 1.06 1.01 0.87 0.93 0.87 0.95 0.89 0.86 PLSRcvd 0.96 1.00 1.02 1.09 1.09 0.96 0.90 0.97 0.95 0.87 PLSRn2 0.99 0.98 1.00 1.05 1.19 0.94 0.92 0.96 0.92 0.91 PLSRn4 0.90 0.94 0.94 1.00 1.14 1.09 1.08 1.02 0.93 0.85 PLSRn6 0.95 0.96 0.94 0.99 1.11 1.11 1.18 1.12 1.02 0.92 PLSRn8 0.97 0.98 0.95 0.98 1.07 1.08 1.11 1.08 1.00 0.91 PLSRn10 0.98 0.96 0.93 0.95 1.02 1.11 1.14 1.09 1.04 0.93 Sliced Inverse Regression (cid:150)Table 8 reports SIR MSFE(cid:146)s relative to AR(4). Pre-conditioned 30
SIR (the (cid:133)rst 4 lines of the table) on 30 or 20 PCs turns out to deliver some good results. The last two lines of the table report the results of using principal component regression. SIR is capable of summarizing in just one or two components the information encapsulated in 20 or 30 PCs. SIR improves the dismal performance of PCR in forecasting in(cid:135)ation in the medium term. We view the gaininparsimonyandmodelingasthemajoradvantageofusingSIRinthisinstance. Asmentioned earlier our results are certainly adversely a⁄ected by forward model misspeci(cid:133)cation given that it appears that two SIR components capture all relevant information on the conditional distribution of yh x . Existing methods in the SDR literature exploit regression graphic devices in this case t+hj t which are not easily ported in a pseudo forecasting experiment with estimation repeated hundreds of times. We are working to e¢ ciently implement non-parametric methods in order to solve this problem. Finally we also report estimates for SIR on the raw data, without pre-conditioning. The sample is already long-enough to deliver a performance that slightly beats the AR(4) at short horizons although the estimation has been carried out using an inverse regression of the regressors on y rather than yh in order to preserve as much information as possible in the estimation t t+h algorithm, an additional source of mispeci(cid:133)cation. Table 8: Sliced Inverse Regression: MSFE Relative to AR4 (Estimation Sample: 1960m01-1982m01-2013m01) CPIAUCSL INDPRO horizon h=1 h=3 h=6 h=12 h=24 h=1 h=3 h=6 h=12 h=24 PC30SIRdr1OLS 0.99 0.97 0.95 1.01 1.13 1.07 1.04 1.12 0.95 0.94 PC30SIRdr2OLS 1.00 0.97 0.95 1.01 1.13 1.03 1.00 1.05 0.93 0.91 PC20SIRdr1OLS 0.99 0.97 1.02 1.07 1.18 0.99 0.97 1.03 0.94 0.94 PC20SIRdr2OLS 0.97 0.99 1.04 1.07 1.20 0.98 0.93 0.97 0.91 0.90 SIRdr1OLS 0.98 1.00 1.01 1.01 1.03 0.97 1.01 1.02 1.02 1.04 SIRdr2OLS 0.97 0.99 1.01 1.00 1.05 0.96 1.00 1.01 1.03 1.04 PCRn20 0.96 1.07 1.11 1.20 1.41 0.99 0.97 0.97 0.92 0.88 PCRn30 0.99 1.03 1.04 1.12 1.26 1.05 1.05 1.02 0.94 0.88 A Sub-Sample Comparison of SIR and PCR (cid:150)In Tables 10-13, we report relative MSFEs with respect to AR(4) focusing attention to PCR, as the most widely used dimension reduction method in macro-economic forecasting, and SIR based forecasting models, whose regularized version also uses the PCs of the x variables. Because the SIR predictors are driven by the inverse regression of the predictors on the response, in a time series context, where the contemporaneous target variable and its lags can be used as predictors, di⁄erent choices of variables to consider as predictors and response lead to di⁄erent SIR models. The four di⁄erent SIR based models we use are de(cid:133)ned in Table 9 as follows. The second column describes how the SIR predictors are formed. For example, in SIRa, the SIR predictors are obtained from using all x-predictors and y and its 4 lags as the t X-predictor matrix and y as the response that is sliced in the SIR algorithm of Section 4.3. t+h When the PCs are used, then the regularized SIR algorithm in Appendix B is applied. The third column de(cid:133)nes the forward forecasting model. For SIRa, for example, y is regressed on a linear t+h model with inputs the corresponding SIR predictors from the second column. In Tables 10 and 12, only the regularized version of SIR is used as the starting sample is small relative to the number of predictors. 31
Table 9: SIR based Forecasting Models SIR Predictors (inverse regression) Forward Model SIRa X or PCs and y +4 lags on y y on SIR predictors t t+h t+h SIRb X or PCs and y +4 lags on y y on y +4 lags and SIR predictors t t+h t+h t SIRc X or PCs on y y on y +4 lags and SIR predictors t+h t+h t SIRd X or PCs on y y on y +4 lags and SIR predictors t t+h t For both in(cid:135)ation and industrial production, the general pattern across forecasting windows and horizons is that SIR, either standard or regularized, has similar performance to PCR. For the longest horizon of 24 months, SIR with has better performance. The only exception is for industrial production over the period 2003:01-2014:12 for models SIRa, SIRb and SIRc (see relative MSFEs in Table 13) where SIR predictors based on all 129 x variables are used. In contrast, over 2010:01-2014:12, the performance is on par with the other methods. This (cid:133)nding con(cid:133)rms that the sample size has a dramatic impact in SIR performance. Notably, SIRd, where the SIR predictors are built using only y , does not appear to be a⁄ected by the size of the sample. In e⁄ect, for these t econometricseries, SIRdexhibitsoverallthebestperformanceforbothPC-basedandstandardSIR across periods and horizons. PCR typically needs 10 components to achieve its best performance across horizons and time windows. In sum, SIR is shown to achieve what it is designed to do; that is, signi(cid:133)cantly reduce the dimension of the forecasting problem. Table 10: Relative MSFE with respect to AR(4) for predicting CPIAUCSL CPIAUCSL 1971:01-2014:12 1982:01-2014:12 horizon h=1 h=3 h=6 h=12 h=24 h=1 h=3 h=6 h=12 h=24 PCRn4 0.96 0.974 0.95 0.894 0.767 0.97 1.07 1.092 1.153 1.33 PCRn5 0.96 0.98 0.959 0.909 0.791 0.97 1.07 1.102 1.178 1.4 PCRn10 0.94 0.982 0.976 0.913 0.763 0.93 1.07 1.119 1.202 1.41 PCRn20 1 0.996 0.989 0.934 0.779 0.96 1.07 1.117 1.202 1.41 PC20SIRa 0.99 1.008 1.022 1.01 0.805 0.935 1.07 0.935 1.3 1.45 PC20SIRb 0.954 0.983 0.999 0.938 0.763 0.947 1.05 0.947 1.194 1.35 PC20SIRc 0.972 1.001 1.015 0.933 0.768 0.972 1 0.972 1.067 1.2 PC20SIRd 0.984 1.004 0.982 0.953 0.939 0.977 1.01 0.977 0.995 1.05 32
Table 11: Relative MSFE with respect to AR(4) for predicting CPIAUCSL CPIAUCSL 2003:01-2014:12 2010:01-2014:12 horizon h=1 h=3 h=6 h=12 h=24 h=1 h=3 h=6 h=12 h=24 PCRn4 0.967 1.08 1.144 1.229 1.372 0.962 1.091 1.121 1.136 1.124 PCRn5 0.961 1.083 1.161 1.267 1.469 0.963 1.078 1.105 1.188 1.204 PCRn10 0.901 1.079 1.183 1.311 1.516 1.001 1.134 1.185 1.293 1.3 PCRn20 0.918 1.065 1.174 1.311 1.493 1.065 1.134 1.161 1.264 1.221 PC20SIRa 0.921 1.063 1.217 1.414 1.537 1.1 1.136 1.21 1.508 1.406 PC20SIRb 0.935 1.05 1.185 1.291 1.408 1.073 1.102 1.157 1.24 1.192 PC20SIRc 0.963 0.998 1.079 1.068 1.162 1.117 1.023 1.053 1.057 1.314 PC20SIRd 0.968 1.005 0.993 0.99 1 1.045 1.04 0.986 0.973 1.208 SIRa 57.416 30.219 6.788 7.059 8.975 1.597 1.345 1.39 1.339 1.917 SIRb 25.035 18.929 4.659 5.897 8.183 1.209 1.122 1.192 1.238 1.78 SIRc 20.473 25.865 2.388 11.89 1.902 1.057 1.145 1.062 1.378 2.225 SIRd 0.948 0.981 0.991 0.989 1.012 0.982 0.997 1.023 0.999 1.012 Table 12: Relative MSFE with respect to AR(4) for predicting CPIAUCSL INDPRO 1971:01-2014:12 1982:01-2014:12 horizon h=1 h=3 h=6 h=12 h=24 h=1 h=3 h=6 h=12 h=24 PCRn4 0.904 0.8 0.761 0.717 0.75 0.943 0.947 0.996 0.989 1.078 PCRn5 0.897 0.8 0.769 0.718 0.74 0.937 0.946 0.999 0.996 1.069 PCRn10 0.914 0.83 0.807 0.713 0.65 0.932 0.884 0.94 0.927 0.946 PCRn20 0.954 0.9 0.868 0.741 0.64 0.987 0.986 0.983 0.922 0.882 PC20SIRa 0.946 0.86 0.867 0.742 0.64 0.977 0.93 0.977 0.921 0.91 PC20SIRb 0.94 0.88 0.872 0.741 0.64 0.996 0.95 0.996 0.917 0.901 PC20SIRc 0.914 0.89 0.819 0.693 0.62 0.975 0.95 0.975 0.912 0.907 PC20SIRd 1.008 1.02 1.002 0.973 1 1.035 1.036 1.035 1.002 0.976 Table 13: Relative MSFE with respect to AR(4) for predicting CPIAUCSL INDPRO 2003:01-2014:12 2010:01-2014:12 horizon h=1 h=3 h=6 h=12 h=24 h=1 h=3 h=6 h=12 h=24 PCRn4 0.972 1.015 1.024 0.942 1.032 1.058 1.449 1.675 2.092 11.594 PCRn5 0.987 1.009 1.022 0.925 0.994 1.043 1.34 1.38 0.875 4.311 PCRn10 0.994 0.946 0.965 0.849 0.814 1.086 1.629 1.937 2.405 4.521 PCRn20 1.038 0.986 0.959 0.783 0.657 1.037 1.619 1.962 2.78 4.701 PC20SIRa 1.008 0.912 0.965 0.77 0.685 0.986 1.423 1.594 1.863 6.566 PC20SIRb 1.02 0.952 0.976 0.768 0.673 1.006 1.517 1.626 1.868 6.835 PC20SIRc 1.019 0.961 0.954 0.774 0.679 1.062 1.597 1.653 1.985 7.596 PC20SIRd 1.027 1.043 1.012 0.973 0.922 1.045 1.183 1.085 1.203 3.425 SIRa 18.339 2.83 9.592 9.422 10.488 1.966 2.729 2.769 3.659 12.07 SIRb 15.967 2.668 9.295 8.723 10.446 1.815 2.652 2.648 3.6 11.782 SIRc 10.115 1.312 7.347 5.717 6.459 1.697 2.37 2.672 4.013 11.232 SIRd 0.93 1.008 1.033 1.044 1.043 0.956 1.125 1.149 1.29 1.378 33
Evolution of the MSFE (cid:150)In(cid:135)ation appears de(cid:133)nitely harder to forecast than industrial production, consistent with (cid:133)ndings in the forecasting literature. Are there speci(cid:133)c periods in which the forecast performance of our estimators deteriorates? Figure 6 portrays the evolution of the MSFE in forecasting in(cid:135)ation 12 months ahead for selected estimators (so each point represents the MSFE up to that point). There are de(cid:133)nitely periods in which forecasting in(cid:135)ation is harder, however it seems that these periods vary by estimator with SIR computed out of 20 PCs being the (cid:133)rst to react negatively to bad data entering the sample through the rolling window. Using asa a metric the height of the spikes it looks like PC20SIRdr performs at par if not better than other estimators except in the very last portion of the window. Figure 6: CPIAUCSL: Rolling MSFE for Selected Estimators In-Sample Fit (cid:150)As is evident from Figure 7 there is no obvious relationship between in-sample (cid:133)t and the out-of-sample forecasting performance commented above. For instance, OLS thanks to the very large number of variables, some subcomponents of the target variable itself, produces very high R-squared and bad forecasting results. 34
Figure 7: CPIAUCSL: R-Squared for Selected Estimators over Pseudo Forecasting Window 35
6 Conclusions 7 Summary and Conclusions In this paper we (1) introduced su¢ cient dimension reduction methodology to econometric forecasting and focused on linear moment-based SDR, (2) derived properties of the SIR SDR estimator forcovariancestationaryseries, (3)castOLS,PCR,RIDGEandPLSinacommonframework, and (4) studied the forecasting performance of these four methods, as well as SIR, using the FRED-MD data set put together by McCracken and Ng (2015) [49]. The empirical results indicate that PCR, PLS and SIR do not exhibit drastically di⁄erent performance. The competitive edge of SIR is its parsimony:itattainspracticallythesameforecastingaccuracyusingoneortwolinearcombinations of the predictors. In contrast, both PLS and PCR require many components, in many cases more than ten. OLS and RIDGE are not found to be competitive for these data and the time periods we considered in our forecasting exercise. There are several issues that impede the performance of SIR and which can be improved upon. Dimension two or higher in SIR indicates the presence of nonlinear relationships between the response and the SIR predictors. In such cases, plots of the response versus the SIR predictors would inform the construction of a more appropriate forward model. As the forecasting experiment was carried out in an automatic fashion, we could not visually assess the nature of nonlinearities and nonlinear functions of the SIR predictors were not included in the forecasting model. Gains in forecasting accuracy can potentially be realized by the inclusion of nonlinear SIR terms in the forward model. SDR in general, and SIR in particular, require a large sample size to yield reliable results. The sample size of the FRED-MD data set is not large enough for SIR to be optimally used. For some periodsintheforecastingexercise, SIRpredictorswereextremelyunstableasthesamplecovariance matrix of the raw predictors was close to ill-conditioned. We develop a Krylov subspaces version of SIR to address this issue in a separate paper. Nevertheless, both these issues amount to limitations that need to be addressed for SIR, or SDR in general, to be properly applied to such data and deserve future empirical and theoretical research. Appendix A: List and Description of Variables The following set of tables summarizes the variables in FRED-MD. Each table collects variables by statistical data release imparting an organization of the variables slightly rearranged relative to the tables in McCracken and Ng (2015) [49]. Grouping by statistical data release report is more useful both because the timing of the data release is di⁄erent (although the timing is not exploited in the present study) and because some variables are aggregates of more detailed information in any one statistical data release and share the information and possible biases of that statistical release. Our reordering allows a better bird(cid:146)s eye view on the sources of information. We brie(cid:135)y describeeachstatisticaldatareleasebelow. Inadditioneachtablereports,underthecolumnT, the transformation used13. The G column denotes the grouping chosen by McCracken and Ng (2015) [49] in turn not too dissimilar from groupings operated in other DFM studies. The FRED-MD column reports the variable mnemonics in the original FRED-MD datasets. The Description column permits to identify the series. The remaining two columns denote the Global Insight code and description; the GSI description allows to map the individual series with datasets in older papers. In some papers not all variables are used to compute principal components, a strategy 13The transformations closely follow McCracken and Ng (2015) [49] who in turn follow Stock and Watson. 36
followed by Stock and Watson (2005) [56], who add an additional column containing a dummy to denotewhetherthevariablewasusedinthecomputationofthePCs. Weincludeallvariableswhen computing PCs hence we do not need such additional column. Asterisked series are adjusted by McCracken and Ng (2015) (see [49] for details). Variables Directly Measuring Output (cid:150)The most reliable and used data containing measures of output at a monthly frequency come from the IP system within the statistical release G.17 producedattheFederalReserveBoardandcoveringindustrialproduction. TheIPsystemcontains information on about 200 sectors at NAICS 4-digits level and covers the manufacturing, mining and utilities sectors. The last variable is capacity utilization in manufacturing, one of the few observable measures of slack also from the G.17, computed as manufacturing IP ; manufacturing manufacturing capacity capacity is estimated by sta⁄at the FRB using the quarterly survey of capacity (in turn run by the BLS) and included in the G.17 publication. The G.17 publication contains information on about 94 subaggregates at NAICS 4 digit level whereas the IP system used to produce it is based on 200+ atoms. Apart from the top aggregate INDPRO, the next seven rows represent the splitting and regrouping of the 200+ atoms in so called (cid:147)market(cid:148)groups. The last market group is split in two subaggregates, durableandnon-durable materials. ManufacturingIPisasubaggregateofIP atthe samelevelasUtilities. FuelsIPisanoddseriestobeincludedinthisdatasetgivenitsidiosyncratic pattern and its higher level of detail. Notice that 25% of (cid:133)nal industrial production data (that is after all revisions have taken place), are estimated from employment data (in the second table of this Section), implying that this set of variables and the set in the second tables might be strongly linked or have a factor in common. Table 14: Output Variables from the IP System id T G FRED-MD Description GSI Description 6 5 1 INDPRO IP Index IP: total 7 5 1 IPFPNSS IP: Final Products and Nonindustrial Supplies IP: products 8 5 1 IPFINAL IP: Final Products (Market Group) IP: (cid:133)nal prod 9 5 1 IPCONGD IP: Consumer Goods IP: cons gds 10 5 1 IPDCONGD IP: Durable Consumer Goods IP: cons dble 11 5 1 IPNCONGD IP: Nondurable Consumer Goods IP: cons nondble 12 5 1 IPBUSEQ IP: Business Equipment IP: bus eqpt 13 5 1 IPMAT IP: Materials IP: matls 14 5 1 IPDMAT IP: Durable Materials IP: dble matls 15 5 1 IPNMAT IP: Nondurable Materials IP: nondble matls 16 5 1 IPMANSICS IP: Manufacturing (SIC) IP: mfg 17 5 1 IPB51222s IP: Residential Utilities IP: res util 18 5 1 IPFUELS IP: Fuels IP: fuels 20 2 1 CUMFNS Capacity Utilization: Manufacturing Cap util Variables Measuring Income and Consumption (cid:150)Personal Income, personal consumption expenditures and PCE de(cid:135)ators are released monthly by the BEA. Retail sales are released by the Census Bureau. 37
Table 15: Variables Helpful in Estimating Consumption id T G FRED-MD Description GSI Description 1 5 1 RPI Real Personal Income PI 2 5 1 W875RX1 Real personal income ex transfer receipts PI less transfers 3 5 4 DPCERA3M086SBEA Real personal consumption expenditures Real Consumption 4* 5 4 CMRMTSPLx Real Manu. and Trade Industries Sales MT sales 5* 5 4 RETAILx Retail and Food Services Sales Retail sales 123 6 7 PCEPI Personal Cons. Expend.: Chain Price Index PCE de(cid:135) 124 6 7 DDURRG3M086SBEA Personal Cons. Expend: Durable goods PCE de(cid:135): dlbes 125 6 7 DNDGRG3M086SBEA Personal Cons. Expend: Nondurable goods PCE de(cid:135): nondble 126 6 7 DSERRG3M086SBEA Personal Cons. Expend: Services PCE de(cid:135): service Variables Measuring Employment and Unemployment (cid:150)The second table contains information on variables measuring employment, data produced by the Bureau of Labor Statistcs (BLS). The (cid:133)rsttworowsrefertodatafromtheCurrentPopulationSurvey(CPS).Therestofthetablerefersto variables from the Current Employment Statistics (CES) a program run each month that surveys approximately 143,000 businesses and government agencies, representing approximately 588,000 individual worksites. The last 3 variables contain miscellaneous information on the labor market. CLAIMS=unemployment claims, is a variable originally released at weekly frequency and comes from the states unemployment insurance system. HWI=Help-Wanted Index for United States is assembled by the Conference Board and recently it has been corrected by Barnichon (2010) [8]. Obvious candidates missing in the datasets are labor market indicators part of the FED labor market dashboard, such as data from the JOLTS survey. 38
Table 16: Employment Variables from Household CPS and Payroll CES Surveys id T G FRED-MD Description GSI Description 23 5 2 CLF16OV Civilian Labor Force Emp CPS total 24 5 2 CE16OV Civilian Employment Emp CPS nonag 25 2 2 UNRATE Civilian Unemployment Rate U: all 26 2 2 UEMPMEAN Average Duration of Unemployment (Weeks) U: mean duration 27 5 2 UEMPLT5 Civilians Unemployed - Less Than 5 Weeks U < 5 wks 28 5 2 UEMP5TO14 Civilians Unemployed for 5-14 Weeks U 5-14 wks 29 5 2 UEMP15OV Civilians Unemployed - 15 Weeks Over U 15+ wks 30 5 2 UEMP15T26 Civilians Unemployed for 15-26 Weeks U 15-26 wks 31 5 2 UEMP27OV Civilians Unemployed for 27 Weeks and Over U 27+ wks 33 5 2 PAYEMS All Employees: Total nonfarm Emp: total 34 5 2 USGOOD All Employees: Goods-Producing Industries Emp: gds prod 35 5 2 CES1021000001 All Employees: Mining and Logging: Mining Emp: mining 36 5 2 USCONS All Employees: Construction Emp: const 37 5 2 MANEMP All Employees: Manufacturing Emp: mfg 38 5 2 DMANEMP All Employees: Durable goods Emp: dble gds 39 5 2 NDMANEMP All Employees: Nondurable goods Emp: nondbles 40 5 2 SRVPRD All Employees: Service-Providing Industries Emp: services 41 5 2 USTPU All Employees: Trade, Transportation Utilities Emp: TTU 42 5 2 USWTRADE All Employees: Wholesale Trade Emp: wholesale 43 5 2 USTRADE All Employees: Retail Trade Emp: retail 44 5 2 USFIRE All Employees: Financial Activities Emp: FIRE 45 5 2 USGOVT All Employees: Government Emp: Govt 46 1 2 CES0600000007 Avg Weekly Hours : Goods-Producing Avg hrs 47 2 2 AWOTMAN Avg Weekly Overtime Hours : Manufacturing Overtime: mfg 48 1 2 AWHMAN Avg Weekly Hours : Manufacturing Avg hrs: mfg 49 1 2 NAPMEI ISM Manufacturing: Employment Index NAPM empl 127 6 2 CES0600000008 Avg Hourly Earnings : Goods-Producing AHE: goods 128 6 2 CES2000000008 Avg Hourly Earnings : Construction AHE: const 129 6 2 CES3000000008 Avg Hourly Earnings : Manufacturing AHE: mfg 32* 5 2 CLAIMSx Initial Claims UI claims 21* 2 2 HWI Help-Wanted Index for United States Help wanted indx 22* 2 2 HWIURATIO Ratio of Help Wanted/No. Unemployed Help wanted/unemp Variables Measuring Construction Activity (cid:150)The third table collects the variables that have leading properties in signaling changes in activity in the construction sector. Permits variables come from the Census(cid:146)building permits monhtly survey of 9,000 selected permit-issuing places adjusted once a year with an annual census of an additional 11,000 permit places that are not in the monthly sample. Housing starts come from the Survey of Construction, a multi-stage strati(cid:133)ed randomsamplethatselectsapproximately900buildingpermit-issuingo¢ ces, andasampleofmore than 70 land areas not covered by building permits. Data from the national association of home buildiers such as existing home sales were not included oin the dataset. 39
Table 17: Leading Indicators of the Construction Sector id T G FRED-MD Description GSI Descr 50 4 3 HOUST Housing Starts: Total New Privately Owned Starts: nonfarm 51 4 3 HOUSTNE Housing Starts, Northeast Starts: NE 52 4 3 HOUSTMW Housing Starts, Midwest Starts: MW 53 4 3 HOUSTS Housing Starts, South Starts: South 54 4 3 HOUSTW Housing Starts, West Starts: West 55 4 3 PERMIT New Private Housing Permits (SAAR) BP: total 56 4 3 PERMITNE New Private Housing Permits, Northeast (SAAR) BP: NE 57 4 3 PERMITMW New Private Housing Permits, Midwest (SAAR) BP: MW 58 4 3 PERMITS New Private Housing Permits, South (SAAR) BP: South 59 4 3 PERMITW New Private Housing Permits, West (SAAR) BP: West VariablesMeasuringOrdersandInventories (cid:150)ThesevariablesarefromtheM3surveyrunbythe U.S. Census Bureau. The M3 is based upon data reported from manufacturing establishments with $500 million or more in annual shipments. Units may be divisions of diversi(cid:133)ed large companies, large homogenous companies, or single-unit manufacturers in 89 industry categories. The M3 provides statistics on manufacturers(cid:146)value of shipments, new orders (net of cancellations), end-ofmonth order backlog (un(cid:133)lled orders), end-of-month total inventory, materials and supplies, workin-process, and (cid:133)nished goods inventories (at current cost or market value). Data are collected and tabulated predominantly by 6-digit NAICS (North American Industry Classi(cid:133)cation System). The most watched series from this survey is ANDENO=(cid:147)New Orders for Nondefense Capital Goods(cid:148) since it excludes certain highly volatile goods (and not so informative on the business cycle) from new orders. Such series unfortunately has a short history and it is excluded in our estimation. Table 18: Variables from the M3 Survey id T G FRED-MD Description GSI Description 3 5 4 DPCERA3M086SBEA Real personal consumption expenditures Real Consumption 4* 5 4 CMRMTSPLx Real Manu. and Trade Industries Sales MT sales 5* 5 4 RETAILx Retail and Food Services Sales Retail sales 64 5 4 ACOGNO New Orders for Consumer Goods Orders: cons gds 65* 5 4 AMDMNOx New Orders for Durable Goods Orders: dble gds 66* 5 4 ANDENOx New Orders for Nondefense Capital Goods Orders: cap gds 67* 5 4 AMDMUOx Un(cid:133)lled Orders for Durable Goods Unf orders: dble 68* 5 4 BUSINVx Total Business Inventories MT invent 69* 2 4 ISRATIOx Total Business: Inventories to Sales Ratio MT invent/sales Variables Measuring the Money Stock and Reserves (cid:150)These data come mainly from the FRB H.6 statistical release. 40
Table 19: Variables Measuring the Money Stock and Bank Reserves id T G FRED-MD Description GSI Description 70 6 5 M1SL M1 Money Stock M1 71 6 5 M2SL M2 Money Stock M2 72 5 5 M2REAL Real M2 Money Stock M2 (reaal) 73 6 5 AMBSL St. Louis Adjusted Monetary Base MB 74 6 5 TOTRESNS Total Reserves of Depository Institutions Reserves tot 75 7 5 NONBORRES Reserves Of Depository Institutions, Nonborrowed Reserves nonbor Variables Measuring Credit (cid:150)These variables are mainly drawn from various FRB statistical releases such as G.19 and G.20. Table 20: Variables Measuring Credit id T G FRED-MD Description GSI Descr 76 6 5 BUSLOANS Commercial and Industrial Loans, All Commercial Banks CI loan plus 77 6 5 REALLN Real Estate Loans at All Commercial Banks DCI loans 78 6 5 NONREVSL Total Nonrevolving Credit Owned and Securitized Outstanding Cons credit 79* 2 5 CONSPI Nonrevolving consumer credit to Personal Income Inst credit/PI 131 6 5 MZMSL MZM Money Stock N.A. 132 6 5 DTCOLNVHFNM Consumer Motor Vehicle Loans Outstanding N.A. 133 6 5 DTCTHFNM Total Consumer Loans and Leases Outstanding N.A. 134 6 5 INVEST Securities in Bank Credit at All Commercial Banks N.A. Variables Measuring Interest Rates (cid:150)The following table contains variables measuring interest rates, yields and spreads. Most variables are from statistical releases by the FRB such as the H.15. 41
Table 21: Interest Rates, Yields and Spreads id T G FRED-MD Description GSI Descr 84 2 6 FEDFUNDS E⁄ective Federal Funds Rate Fed Funds 85* 2 6 CP3Mx 3-Month AA Financial Commercial Paper Rate Comm paper 86 2 6 TB3MS 3-Month Treasury Bill: 3 mo T-bill 87 2 6 TB6MS 6-Month Treasury Bill: 6 mo T-bill 88 2 6 GS1 1-Year Treasury Rate 1 yr T-bond 89 2 6 GS5 5-Year Treasury Rate 5 yr T-bond 90 2 6 GS10 10-Year Treasury Rate 10 yr T-bond 91 2 6 AAA Moody(cid:146)s Seasoned Aaa Corporate Bond Yield Aaa bond 92 2 6 BAA Moody(cid:146)s Seasoned Baa Corporate Bond Yield Baa bond 93* 1 6 COMPAPFFx 3-Month Commercial Paper Minus FEDFUNDS CP-FF spread 94 1 6 TB3SMFFM 3-Month Treasury C Minus FEDFUNDS 3 mo-FF spread 95 1 6 TB6SMFFM 6-Month Treasury C Minus FEDFUNDS 6 mo-FF spread 96 1 6 T1YFFM 1-Year Treasury C Minus FEDFUNDS 1 yr-FF spread 97 1 6 T5YFFM 5-Year Treasury C Minus FEDFUNDS 5 yr-FF spread 98 1 6 T10YFFM 10-Year Treasury C Minus FEDFUNDS 10 yr-FF spread 99 1 6 AAAFFM Moody(cid:146)s Aaa Corporate Bond Minus FEDFUNDS Aaa-FF spread 100 1 6 BAAFFM Moody(cid:146)s Baa Corporate Bond Minus FEDFUNDS Baa-FF spread 101 5 6 TWEXMMTH Trade Weighted U.S. Dollar Index: Major Currencies Ex rate: avg 102* 5 6 EXSZUSx Switzerland / U.S. Foreign Exchange Rate Ex rate: Switz 103* 5 6 EXJPUSx Japan / U.S. Foreign Exchange Rate Ex rate: Japan 104* 5 6 EXUSUKx U.S. / U.K. Foreign Exchange Rate Ex rate: UK 105* 5 6 EXCAUSx Canada / U.S. Foreign Exchange Rate EX rate: Canada Variables Measuring Prices (cid:150)The variables are from the BLS CPI and PPI statistical releases. For PPI more than 100,000 price quotations per month are organized into three sets of PPIs: (1) Final demand-Intermediate demand (FD-ID) indexes, (2) commodity indexes, and (3) indexes for the net output of industries and their products. The CPIs are based on prices of food, clothing, shelter, fuels, transportation fares, charges for doctors(cid:146) anddentists(cid:146)services,drugs,andothergoodsandservicesthatpeoplebuyforday-to-dayliving. Pricesarecollectedeachmonthin87urbanareasacrossthecountryfromabout6,000housingunits and approximately 24,000 retail establishments-department stores, supermarkets, hospitals, (cid:133)lling stations, and other types of stores and service establishments. 42
Table 22: Measures of Prices id T G FRED-MD Description GSI Descr 106 6 7 PPIFGS PPI: Finished Goods PPI: (cid:133)n gds 107 6 7 PPIFCG PPI: Finished Consumer Goods PPI: cons gds 108 6 7 PPIITM PPI: Intermediate Materials PPI: int matls 109 6 7 PPICRM PPI: Crude Materials PPI: crude matls 110* 6 7 OILPRICEx Crude Oil, spliced WTI and Cushing Spot market price 111 6 7 PPICMM PPI: Metals and metal products: PPI: nonferrous 113 6 7 CPIAUCSL CPI : All Items CPI-U: all 114 6 7 CPIAPPSL CPI : Apparel CPI-U: apparel 115 6 7 CPITRNSL CPI : Transportation CPI-U: transp 116 6 7 CPIMEDSL CPI : Medical Care CPI-U: medical 117 6 7 CUSR0000SAC CPI : Commodities CPI-U: comm. 118 6 7 CUUR0000SAD CPI : Durables CPI-U: dbles 119 6 7 CUSR0000SAS CPI : Services CPI-U: services 120 6 7 CPIULFSL CPI : All Items Less Food CPI-U: ex food 121 6 7 CUUR0000SA0L2 CPI : All items less shelter CPI-U: ex shelter 122 6 7 CUSR0000SA0L5 CPI : All items less medical care CPI-U: ex med Variables Measuring the Stock Market (cid:150)These data are elaborated by Standard & Poor. Table 23: Measures of the Stock Market from Standard and Poor id T G FRED-MD Description GSI Descr 80* 5 8 SP 500 SP(cid:146)s Common Stock Price Index: Composite SP 500 81* 5 8 SP: indust SP(cid:146)s Common Stock Price Index: Industrials SP: indust 82* 2 8 SP div yield SP(cid:146)s Composite Common Stock: Dividend Yield SP div yield 83* 5 8 SP PE ratio SP(cid:146)s Composite Common Stock: Price-Earnings Ratio SP PE ratio Di⁄usion Indexes from Manufacturing and Consumer Surveys (cid:150)The last table mostly collects the di⁄usion indexes from the Institute for Supply Management (ISM)14. These variables are released the (cid:133)rst day of month,following the reference month, hence they are quite timely and are used by several institutions in the produciton of their high frequency forecasts. However, since the literature on large panels of macro variables carries out monthly pseudo-forecast experiments and we follow that tradition, we do not exploit the full potential of these variables. Hence the only variable that likely has the most forecasting power is (cid:147)new orders(cid:148)a natural measure of future activity. Notice that these variables are di⁄usion indexes, that is they essentially capture the fraction of respondents that say that activity is up15. Given that they are di⁄usion indexes (fractions) they are stable and they are left in levels in the estimation. The dataset does not include data from other manufacturing surveys used in the construction of activity indexes and nowcasting such as the Philly Fed BOS survey or the Richmond Fed survey as well as information from the services 14Formerly known as the National Association of Purchasing Managers (NAPM). 15The ISM also reports other interesting di⁄usion indexes such as "new export orders", or "level of inventories", but these variables are available only starting from the 1990s, too short a time series to include them in pseudo-outof-sample forecasting experiments. The same is true for the recently introduced di⁄usion indexes from the Markit survey. 43
surveys: most likely the choice of the authors was dictated by the span of available data. The last variable in the table is the closely watched Consumer Sentiment Index from the University of Michigan used in the forecast of comsuption expenditures. Other informative sub-indexes of the Michigan survey were not included in the dataset. Table 24: Diffusion Indexes from the ISM Manufacturing Survey and the UM Consumer Survey id T G FRED-MD Description GSI Descr 19 1 1 NAPMPI ISM Manufacturing: Production Index NAPM prodn 29 1 2 NAPMEI ISM Manufacturing: Employment Index NAPM empl 60 1 4 NAPM ISM : PMI Composite Index PMI 61 1 4 NAPMNOI ISM : New Orders Index NAPM new ordrs 62 1 4 NAPMSDI ISM : Supplier Deliveries Index NAPM vendor del 63 1 4 NAPMII ISM : Inventories Index NAPM Invent 112 1 7 NAPMPRI ISM Manufacturing: Prices Index NAPM com price 130* 2 4 UMCSENTx Consumer Sentiment Index Consumer expect 8 Appendix B: Regularized SIR Algorithm Inrelevancetotheforecastingmodel(2.1), theresponseisy , t = 1;:::;T;:::, andthepredictors t+h consist of a group of p exogenous variables x = (x ;:::;x ) and the current response value y t t1 tp 0 t along with L of its lags, which is denoted by W t = (y t 1 ;:::;y t L )0. (cid:0) (cid:0) 1. Carry-out PCA on the sample predictor matrix X : T p T (cid:2) a. Compute the spectral decomposition of (cid:6) = VDV, where V = (v ;:::;v ) are the (cid:6) 0 1 p eigenvectors, and D = diag((cid:21)^ ;:::;(cid:21)^ ) is the diagonal matrix with the eigenvalues of (cid:6) 1 p arranged in decreasing order. b b b b b b b b b b b. Let M be the number of principal components that capture most of the variability in X, either by formal tests such as Bai and Ng (2002) [2] or by simply surveying the scree plot, i.e. the plot of the ordered eigenvalues versus component number. A scree plot displays the proportion of the total variation in a dataset that is explained by each of the components in a principle component analysis. Using the scree plot, the number of components is estimated to be the number corresponding to the "elbow" of the plot. c. Let F = v X;:::;F = v X be the retained principal factors of X. 1 10 M M0 2. Let X = (F ;:::;F ;Y ;Y ;:::;Y ) = (X~ ;:::;X~ ) be the (M + L + 1) 1 t t1b tM t tb1 t L 0 1 M+L+1 0 (cid:0) (cid:0) (cid:2) vector of adjusted predictors, and let q = M +L+1 where L denotes the lags of y . t e 3. Set X (cid:22) = (X (cid:22)~ ;:::;X (cid:22)~ )T, where X (cid:22)~ = T X~ =T, i = 1;:::;q. 1 q i t=1 it (cid:22) P 4. For je= 1;:::;J, let X j = yt SJ X t =n j , where n j is the number of Y t (cid:146)s in S j . 2 5. Compute P e e J n j (cid:22) (cid:22) (cid:22) (cid:22) M = (X X)(X X) j j 0 T (cid:0) (cid:0) j=1 X c e e e e 44
6. Compute the SVD of M = U(cid:3)UT, where (cid:3) = diag(^l ;:::;^l ), ^l > ^l > ::: > ^l are the 1 q 1 2 q eigenvalues of M and U = (u ;:::;u ) is the q q orthonormal matrix of its eigenvectors 1 q (cid:2) that correspond to ^l ;^lc;:::;^lbb. b b 1 2 q c b b b 7. Estimate the dimension d of the regression as d^using any dimension estimation method that applies. (cid:22) 8. TheSIRpredictorsareSIR = (cid:6) 1u X;:::;SIR = (cid:6) 1u X,where(cid:6) = T (X X)(X (cid:22) 1 (cid:0) 1 d^ (cid:0) d t=1 t (cid:0) t (cid:0) X)=T is the sample covariance matrix of the adjusted predictors X~. 0 P e b e b e e e e e Appendix C: Covariance-Stationary Time Series Properties A sequence of random variables x is covariance stationary or weakly stationary if and only if jt (cid:22) j R: E(x jt ) = (cid:22) j ; t > 0 9 2 8 and 8 t 0 (cid:21) 0; 9 (cid:13) jt 0 2 R: cov(x jt ;x j;t (cid:0) t 0 ) = E[(x jt (cid:0) (cid:22) j )(x j;t (cid:0) t 0(cid:0) (cid:22) j )] = (cid:13) j;t (cid:0) t 0 = (cid:13) j (t (cid:0) t 0 ) = (cid:13) j (h); 8 t > t 0 In other words, all the terms of the sequence have mean (cid:22), and the hth lag autocovariance, cov(x ;x ); depends only on t and not on t, so that x has time invariant (cid:133)rst and secjt j;t t 0 jt (cid:0) 0 ond moments. Thus, if x is a weakly stationary time series, then the vector x = (x ;x ;:::;x ) jt t 1t 2t pt and the time-shifted vector x = (x ;x ;:::;x ) have the same mean vectors and cot+h 1;t+h 2;t+h p;t+h variance matrices for every integer h and positive integer t. A strictly stationary sequence is one in which the joint distributions of these two vectors are the same. Weak stationarity does not imply strict stationarity but a strictly stationary time series with E(x2 ) < t is also weakly stationary. jt 1 8 A useful result is that any function of a weakly (strictly) stationary time series is also a weakly (strictly) stationary time series. A stationary time series x is ergodic if sample moments converge jt in probability to population moments. A multivariate time series x = (x ;x ;:::;x ) is covariance stationary and ergodic if all of its t 1t 2t pt component time series are stationary and ergodic. The mean of x is de(cid:133)ned as the (T 1) vector t (cid:2) E(x t ) =(cid:22) = (E(x 1t );E(x 2t );:::;E(x pt )) 0 = (cid:22) 1 ;(cid:22) 2 ;:::;(cid:22) p 0 and the variance/covariance matrix (cid:0) (cid:1) (cid:6)(0) = var(x ) = (x (cid:22))(x (cid:22)) = E x x (cid:22)(cid:22) = t t t 0 t 0t 0 (cid:0) (cid:0) (cid:0) var(x ) cov(x ;x ) cov(x ;x ) 1t(cid:0) 1t 2t (cid:1) (cid:0) 1t pt (cid:1) (cid:1)(cid:1)(cid:1) cov(x ;x ) var(x ) cov(x ;x ) 2t 1t 2t 2t pt = 2 . . . . . . (cid:1) . (cid:1) . (cid:1) . . . . 3 6 7 6cov(x pt ;x 1t ) var(x pt ) 7 6 (cid:1)(cid:1)(cid:1) (cid:1)(cid:1)(cid:1) 7 6 7 6 7 4 5 (cid:6)(h) = cov(x ;x ) = E (x (cid:22))(x (cid:22)) = E x x (cid:22)(cid:22) t+h t t+h t+h 0 t+h 0t 0 (cid:0) (cid:0) (cid:0) If x is a stationary time series wi (cid:0) th mean (cid:22) and autoc (cid:1) ovaria (cid:0) nce function (cid:1) (cid:13) (h), X(cid:22) = jt j j j T X =T converges in mean square to (cid:22) if (cid:13)(T) 0 as T (see Prop. 2.4.1, p. 58 t=1 jt j ! ! 1 in Brockwell and Davis (2002), prop. 10.5, p. 279 in Hamilton (1994)). The consistency of the P estimator X(cid:22) is established by applying the proposition to each of the component time series x , jt 45
j = 1;:::;p (Prop. 7.3.1, p. 234, Brockwell and Davis (2002)). A su¢ cient condition to ensure ergodicity (consistency) for second moments is (cid:13) (h) < (Prop. 7.3.1, p. 234, Brockwell 1h= j jj j 1 and Davis (2002)). (cid:0)1 P The parameters (cid:22), (cid:6)(0); and (cid:6)(h) are estimated from X ;X ;:::;X using the sample mo- 1 2 T ments: T 1 X(cid:22) = X T t=1 X T 1 (cid:6)^(0) = X t X(cid:22) X t X(cid:22) 0 T (cid:0) (cid:0) t=1 X(cid:0) (cid:1)(cid:0) (cid:1) (cid:6)^(h) = T 1 t T =(cid:0)1 h X t+h (cid:0) X(cid:22) X t (cid:0) X(cid:22) 0 if 0 (cid:20) h (cid:20) T (cid:0) 1 ((cid:0)^( P h) 0 (cid:0) (cid:1)(cid:0) (cid:1) if (cid:0) T +1 (cid:20) h < 0 The ergodic theorem obtains that if x is a strictly stationary and ergodic time series then as t T ! 1 X(cid:22) p (cid:22) (8.1) ! (cid:6)^(0) p (cid:6)(0) (8.2) ! (cid:6)^(h) p (cid:6)(h) (8.3) ! Under more restrictive assumptions on the process x it can also be shown that X(cid:22) is approxit T matelynormallydistributedforlargeT. Determinationofthecovariancematrixofthisdistribution is quite complicated. For example, the following is a CLT for a covariance stationary m-dependent vector process (Villegas (1976), Thm. 5.1). A stochastic vector process x ;x ;::: is m-dependent 1 2 if the two sets of random vectors x ;:::;x and x ;:::;x are independent whenever s r > m. 1 r s n (cid:0) Theorem 4 lf x ;x ;::: is a stationary m-dependent second-order vector process, then: 1 2 (i) the distribution of pT(X(cid:22) (cid:22)) converges to a (possibly degenerate) normal distribution with T (cid:0) zero mean vector and covariance matrix m V = (cid:6)(h) h= m X(cid:0) where (cid:6)(h) is the covariance matrix of x and x ; t t+h (ii) the covariance matrix of T (X +X +:::+X )=pT converges to V when T increases t=1 1 2 T inde(cid:133)nitely. P REFERENCES [1] Adragni, K.P. and Cook R.D. (2009). (cid:147)Su¢ cient dimension reduction and prediction in regression(cid:148). Phil. Trans. R. Soc. A, 367, 4385-4405. [2] Bai,J.andNg,S.(2002).(cid:147)DeterminingtheNumberofFactorsinApproximateFactorModels(cid:148). Econometrica. Vol. 70, No. 1 (Jan., 2002) , pp. 191-221. 46
[3] Bai, J. and Ng, S. (2007). (cid:147)Determining the Number of Primitive Shocks in Factor Models(cid:148), Journal of Business and Economic Statistics, 2007, 25:1, p.52-60. [4] Bai, J. and Ng, S. (2008a). (cid:147)Forecasting Economic Time Series Using Targeted Predictors(cid:148), Journal of Econometrics 146, 304-317. [5] Bai, J. and Ng, S. (2008b). (cid:147)Large Dimensional Factor Analysis(cid:148), Foundations and Trends in Econometrics, 2008, 3:2, 89-163. [6] Banbura M. and Modugno, M. (2014). (cid:147)Maximum Likelihood Estimation of Factor Models on Datasets with Arbitrary Patterns of Missing Data(cid:148). J. Appl. Econ. 29: 133(cid:150)160 (2014). [7] Barbarino, A.andBura, E.(2015).(cid:147)AUnifyingFrameworkforBigData(cid:148).ForthcomingWorking Paper 2015. [8] Barnichon R. (2010). (cid:147)Building a composite Help-Wanted Index(cid:148). Economics Letters, Dec 2010. [9] Bernard-Michel, C., Gardes, L. and Girad, S. (2009). Gaussian Regularized Sliced Inverse Regression, Statistics and Computing, 19(1), 85-98. [10] Boivin, J. and Ng, S. (2006). (cid:147)Are more data always better for factor analysis?(cid:148). Journal of Econometrics 132, p. 169-194. [11] Brockwell, P. J. and Davis, R.A. (2002). Introduction to Time Series and Forecasting, 2nd edition, Springer-Verlag, New York. [12] Bura, E. and Cook, R.D. (2001a), (cid:147)Estimating the structural dimension of regressions via parametric inverse regression,(cid:148)Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63, 393(cid:150)410. [13] Bura, E. and Cook, R.D. (2001b), (cid:147)Extending SIR: The Weighted Chi-Square Test,(cid:147) Journal of the American Statistical Association, 96, 996-1003. [14] Bura,E.andForzani,L.(2015),(cid:147)Su¢ cientreductionsinregressionswithellipticallycontoured inverse predictors,(cid:148)Journal of the American Statistical Association, 110, 420-434. [15] Bura, E. and Yang, J. (2011), (cid:147)Dimension Estimation in Su¢ cient Dimension Reduction: A Unifying Approach,(cid:148)Journal of Multivariate Analysis, 102, 130-142. [16] Chamberlain, G. and Rothchild, M. (1983). (cid:147)Arbitrage, Factor Structure, and Mean-Variance Analysis on Large Asset Markets(cid:148). Econometrica, 51(5), 1281-1304. [17] Cook R.D. (1998a) Regression Graphics: Ideas for studying regressions through graphics. New York: Wiley. [18] Cook, R. D. (2000). (cid:147)SAVE: A method for dimension reduction and graphics in regression(cid:148). Communications in Statistics: Theory Methods, 29, 2109-2121. (Invited paper for a special millennium issue on regression.) [19] Cook R.D. (2007). (cid:147)Fisher lecture: Dimension reduction in regression(cid:148). Statistical Science, 22, 1-26. 47
[20] Cook, R.D., and Forzani, L. (2008), (cid:147)Principal Fitted Components for Dimension Reduction in Regression(cid:148), Statistical Science, 23, 485-501. [21] Cook, R. D., Bing, L. and Francesca Chiaromonte. (cid:147)Dimension Reduction in Regression without Matrix Inversion(cid:148). Biometrika (2007), 94, 3, pp. 569(cid:150)584. [22] Chiaromonte, F. and Martinelli, J. (2002). Dimension reduction strategies for analyzing global gene expression data with a response. Mathematical Biosciences, 176, 123(cid:150)144. [23] Cook, R.D., and Weisberg, S. (1991), (cid:147)Discussion of Sliced inverse regression for dimension reduction(cid:148), Journal of the American Statistical Association, 86, 328-332. [24] Cook,R.D.andYin,X.(2001).Dimensionreductionandvisualizationindiscriminantanalysis (with discussion), Australian & New Zealand Journal of Statistics, 43(2), 147-199. [25] DeMol, C., Giannone, D.andReichlin, L.(2008).(cid:147)Forecastingusingalargenumberofpredictors: Is Bayesian shrinkage a valid alternative to principal components?(cid:148). Journal of Econometrics, 146(2), 318-328. ISSN 0304-4076, http://dx.doi.org/10.1016/j.jeconom.2008.08.011. [26] Diaconis, P. and Freedman, D. (1984). (cid:147)Asymptotics of graphical projection pursuit(cid:148). Ann. Statist. 12 793-815. [27] Doz, C., Giannone, D. and Reichlin, L. (2011). (cid:147)A two-step estimator for large approximate dynamic factor models based on Kalman (cid:133)ltering(cid:148). Journal of Econometrics, 164(1), 188-205, ISSN 0304-4076, http://dx.doi.org/10.1016/j.jeconom.2011.02.012. [28] Doz, C., Giannone, D. and Reichlin, L. (2012). (cid:147)A Quasi(cid:151)Maximum Likelihood Approach for Large, Approximate Dynamic Factor Models(cid:148). The Review of Economics and Statistics. 94(4), 1014-1024. [29] Eaton, M.L. (1983). Multivariate Statistics. A Vector Space Approach. New York: John Wiley & Sons, Inc. [30] Eaton, M. L. (1986). (cid:147)A characterization of spherical distributions(cid:148). Journal of Multivariate Analysis, 20, 272-276. [31] Engle, R. and Watson, M. (1981).(cid:147)A One-Factor Multivariate Time Series Model of Metropolitan Wage Rates(cid:148). Journal of the American Statistical Association, 76(376), 774-781. [32] Forni, M., Hallin, M., Lippi, M. and Reichlin, L. (2005). (cid:147)The Generalized Dynamic Factor Model: One-Sided Estimation and Forecasting(cid:148). Journal of the American Statistical Association. 100(471), 830-840. [33] Jerome Friedman, Trevor Hastie, Noah Simon, Rob Tibshirani (2015). (cid:147)Lasso and Elastic-Net Regularized Generalized Linear Models(cid:148). Available on CRAN as glmnet. [34] Geweke, J. (1977). (cid:147)The Dynamic Factor Analysis of Economic Time Series Models in Latent variables in socio-economic models(cid:148), ed. by D. J. Aigner and A. S. Goldberger. North Holland. [35] Groen, J. and Kapetanios, G. (2014). (cid:147)Revisiting Useful Approaches to Data-Rich Macroeconomic Forecasting(cid:148). Federal Reserve Bank of New York Sta⁄ Reports No.237, May 2008. Revised 2014. 48
[36] Hall, P. and Li, K. C. (1993). (cid:147)On almost linearity of low dimensional projections from high dimensional data.(cid:148)The Annals of Statistics, 21, 867(cid:150)889. [37] Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press, Princeton, New Jersey. [38] Helland, I. (1988). (cid:147)On the Structure of Partial Least Squares Regression(cid:148). Commun. Statist. Simula. 17(2), 581-607 (1988). [39] Hoerl, A. E. and Kennard, R. W. (1970). (cid:147)Ridge Regression: Applications to Nonorthogonal Problems(cid:148). Technometrics, 12(1), 69-82. [40] Hotelling,H.(1933).(cid:147)Analysisofacomplexofstatisticalvariablesintoprincipalcomponents(cid:148). J. Educ. Psychol., 24, 417(cid:150)441, 498(cid:150)520. [41] Ipsen, I. C. F. and Meyer, C. D. (1998). The Idea behind Krylov Methods. The American Mathematical Monthly, 105(10), 889-899. [42] Jurado, K., Ludvigson, S. and Serena Ng (2015). (cid:147)Measuring Uncertainty(cid:148). American Economic Review, 105(3): 1177-1216. [43] Bryan T. Kelly and Seth Pruitt (2014). (cid:147)The three-pass regression (cid:133)lter: A new approach to forecasting using many predictors(cid:148). 2014/5 Fama-Miller Working Paper. [44] Leeb, H. (2013).(cid:147)On the conditional distributions of low-dimensional projections from highdimensional data(cid:148). The Annals of Statistics, 41, 464-483. [45] Li, K. C. (1991). (cid:147)Sliced inverse regression for dimension reduction (with discussion)(cid:148). Journal of the American Statistical Association, 86, 316-342. [46] Li, L. and Li, H. (2004). (cid:147)Dimension reduction methods for microarrays with application to censored survival data(cid:148). Bioinformatics, 20(18), 3406(cid:150)3412. [47] Lumley, T. (2009), (cid:147)Regression Subset Selection(cid:148). Available on CRAN as leaps. [48] Magnus, J. R. (1988). Linear Structures. Oxford University Press, New York . [49] McCracken, M.W. and Ng, S. (2015). (cid:147)FRED-MD: A Monthly Database for Macroeconomic Research(cid:148). Working Paper 2015-012A, June 2015. [50] Bjłrn-Helge Mevik, Ron Wehrens and Kristian Hovde Liland (2013). (cid:147)Partial Least Squares andPrincipalComponentregression(cid:148).AvailableonCRANaspls.Seealso(cid:147)ThePLSPackage: Principal Components and Partial Least Squares Regression in R(cid:148)by the same authors in Journal of Statistical Software (2007), vol.18 Issue 2. [51] Thomas Sargent and Christopher Sims (1977). (cid:147)Business Cycle Modeling Without Pretending to Have too Much A Priori Economic Theory(cid:148)in New Methods in business cycle research: Proceedingsfromaconference, ed.byChristopherSims.FederalReserveBankofMinneapolis. [52] Steinberger, L. and Leeb, H. (2015). (cid:147)On conditional moments of high-dimensional random vectors given lower-dimensional projections(cid:148). WP 2015. [53] Stock, J. H. and Watson, M. (1998). (cid:147)Di⁄usion Indexes(cid:148). NBER Working Paper Series #6702, August 1998. 49
[54] Stock, J. H. and Watson, M. (2002). (cid:147)Forecasting Using Principal Components from a Large Number of Predictors(cid:148). Journal of the American Statistical Association, 2002. [55] Stock, J. H. and Watson, M. (2002). (cid:147)Macroeconomic Forecasting Using Di⁄usion Indexes(cid:148). Journal of Business and Economic Statistics, April 2002, Vol. 20 No. 2, 147-162. [56] Stock, J. H. and Watson, M. (2005). (cid:147)Implications of Dynamic Factor Models for VAR Analysis(cid:148). Working Paper, Revised June 2005 [57] Stock, J. H. and Watson, M. (2006). (cid:147)Macroeconomic Forecasting Using Many Predictors(cid:148). Handbook of Economic Forecasting, Graham Elliott, Clive Granger, Allan Timmerman (eds.), North Holland, 2006. [58] Stock, J. H. and Watson, M. (2008). (cid:147)Forecasting in Dynamic Factor Models Subject to Structural Instability(cid:148)in The Methodology and Practice of Econometrics, A Festschrift in Honour of Professor David F. Hendry, Jennifer Castle and Neil Shephard (eds), 2008, Oxford: Oxford University Press. [59] Stock, J. H. and Watson, M. (2010). (cid:147)Dynamic Factor Models(cid:148). in Oxford Handbook of Forecasting, Michael P. Clements and David F. Hendry (eds), 2011, Oxford: Oxford University Press. [60] Stock, J. H. and Watson, M. (2012). (cid:147)Generalized Shrinkage Methods for Forecasting Using Many Predictors(cid:148). Journal of Business and Economic Statistics, 30:4 (2012), 481-493. [61] Strang, G. (1978). Linear Algebra and its Applications, 3rd ed. Philadelphia, PA: Saunders. [62] Villegas, C.(1976).(cid:147)Onamultivariatecentrallimittheoremforstationarybilinearprocesses(cid:148). Stochastic Processes and their Applications, 4(2), 121-133. [63] Wold, S. (1978). (cid:147)Cross validatory estimation of the number of components in factor and principal component models(cid:148). Technometrics 20, 397-405. 50
Cite this document
Alessandro Barbarino and Efstathia Bura (2015). Forecasting with Sufficient Dimension Reductions (FEDS 2015-074). Board of Governors of the Federal Reserve System, Finance and Economics Discussion Series. https://whenthefedspeaks.com/doc/feds_2015-074
@techreport{wtfs_feds_2015_074,
author = {Alessandro Barbarino and Efstathia Bura},
title = {Forecasting with Sufficient Dimension Reductions},
type = {Finance and Economics Discussion Series},
number = {2015-074},
institution = {Board of Governors of the Federal Reserve System},
year = {2015},
url = {https://whenthefedspeaks.com/doc/feds_2015-074},
abstract = {Factor models have been successfully employed in summarizing large datasets with few underlying latent factors and in building time series forecasting models for economic variables. When the objective is to forecast a target variable y with a large set of predictors x, the construction of the summary of the xs should be driven by how informative on y it is. Most existing methods first reduce the predictors and then forecast y in independent phases of the modeling process. In this paper we present an alternative and potentially more attractive alternative: summarizing x as it relates to y, so that all the information in the conditional distribution of y|x is preserved. These y-targeted reductions of the predictors are obtained using Sufficient Dimension Reduction techniques. We show in simulations and real data analysis that forecasting models based on sufficient reductions have the potential of significantly improved performance.},
}