feds · November 4, 2021

Better the Devil You Know: Improved Forecasts from Imperfect Models

Abstract

Many important economic decisions are based on a parametric forecasting model that is known to be good but imperfect. We propose methods to improve out-of-sample forecasts from a misspecified model by estimating its parameters using a form of local M estimation (thereby nesting local OLS and local MLE), drawing on information from a state variable that is correlated with the misspecification of the model. We theoretically consider the forecast environments in which our approach is likely to offer improvements over standard methods, and we find significant forecast improvements from applying the proposed method across distinct empirical analyses including volatility forecasting, risk management, and yield curve forecasting.

Finance and Economics Discussion Series
Federal Reserve Board, Washington, D.C.
ISSN 1936-2854 (Print)
ISSN 2767-3898 (Online)

Better the Devil You Know: Improved Forecasts from Imperfect Models
Dong Hwan Oh and Andrew J. Patton
2021-071

Please cite this paper as: Oh, Dong Hwan, and Andrew J. Patton (2021). “Better the Devil You Know: Improved Forecasts from Imperfect Models,” Finance and Economics Discussion Series 2021-071. Washington: Board of Governors of the Federal Reserve System, https://doi.org/10.17016/FEDS.2021.071.

NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. References in publications to the Finance and Economics Discussion Series (other than acknowledgement) should be cleared with the author(s) to protect the tentative character of these papers.

Dong Hwan Oh, Federal Reserve Board
Andrew J. Patton, Duke University

First draft: August 2021. This draft: October 2021.

Keywords: model misspecification, local maximum likelihood, volatility forecasting, value-at-risk and expected shortfall forecasting, yield curve forecasting.

J.E.L. codes: C53, C51, C58, C14.

Acknowledgments: We thank Tim Bollerslev, Ana Galvao, Mike McCracken, Rogier Quaedvlieg, Allan Timmermann (discussant) and participants in the SoFiE Seminar Series. The views expressed in this paper are those of the authors and do not necessarily reflect those of the Federal Reserve Board. Email: donghwan.oh@frb.gov, andrew.patton@duke.edu.

1 Introduction

Many important economic decisions are based on a forecasting model that is known to be good but imperfect. Such a model may be retained for a variety of reasons: the model, and its flaws, may be well-studied and understood, unlike its possible replacement; there may be institutional impediments to adopting new models; the competitive environment may be such that it is not possible to switch to a new model in time for it to be of help. For example, central banks maintain a decision-making infrastructure around a given model or class of models, as do risk management departments at large financial institutions, and high-frequency trading algorithms have models physically built into the processing chips. In all of these cases, the model at the heart of these decisions is known to be good (else it would not have been embedded in the processes); however, it is almost certainly also imperfect.

We propose a method to improve the out-of-sample forecasts from a misspecified model by estimating the parameters in a way that emphasizes epochs that are similar to the one in which the forecast is being made. Our approach exploits information from a state variable that is correlated with the misspecification of the model. For example, consider the case that the true data generating process (DGP) is a complicated nonlinear autoregressive process, and the model is a simple AR(1). Through experience, the forecast user may know that when the target variable is far from its average level the degree of mean-reversion tends to be stronger than when it is around its average value. This information can be used to "tilt" the AR parameter from its usual OLS estimate when the target variable is indeed further from its mean. We provide a structured approach for incorporating this useful information into the parameter estimate without altering the baseline model.

Formally, our method can be interpreted as a form of nonparametric estimation of the parameters of the baseline model.
It is a folk theorem in economic forecasting that nonparametric methods perform poorly out-of-sample, as the increased estimation error overwhelms the improved fit of the model. We consider this canonical trade-off in a theoretical examination of our approach, and we identify two key aspects of the forecasting problem that influence the ability of our approach to improve upon standard methods. Firstly, if the baseline model is "too good," then there is little room for improvement and the usual estimation approach will dominate. Fortunately or unfortunately, even popular forecasting models are inevitably misspecified, leaving open the possibility for improvement. Secondly, if the forecast user's experience does not yield an informative state variable, then our estimator will converge to the usual estimator's probability limit, but accompanied by greater estimation error. Widely-used models inevitably accumulate a lot of practical experience about their properties and pitfalls, and so it is commonly the case that an informative state variable is available.

We apply the proposed method to four economic forecasting problems. In the first two applications we consider volatility forecasting, either using the seminal GARCH model of Bollerslev (1986), or the popular alternative for models using high-frequency data, the HAR model of Corsi (2009), estimated by QML. Our third application considers joint forecasts of Value-at-Risk and Expected Shortfall (VaR and ES), and so the target functional is a $(2 \times 1)$ vector, estimated using M-estimation. Finally, we consider yield curve forecasts using the popular Diebold and Li (2006) model, estimated by OLS, with maturities ranging from three months to ten years. These four applications illustrate the variety of environments (target functionals, dimensionality, estimation methods), and we show that our proposed method provides statistically significant improvements over standard methods.

The estimation method proposed here is closely related to the local MLE of Tibshirani and Hastie (1987), Fan et al. (1998), and Fan et al. (2009), but unlike those approaches we do not modify the baseline model in an attempt to recover the DGP; instead we "tilt" the parameters of the model so that they better fit the current environment, and produce better forecasts.[1] Our approach is a mid-point between the fully parametric ML estimator and the fully nonparametric approach of Fan et al. (2009): we keep the model fully parametric, but we use nonparametric methods to optimally weight the observations used in the estimation window. In this sense, our approach is also similar to the "relevance-weighted ML" of Hu (1997); however, we differ in that our weights arise from the chosen kernel and bandwidth, and we allow the bandwidth to go to zero, making this a nonparametric estimator.[2] Also related, but in a different context, Kristensen and Mele (2011) propose a method to obtain derivative prices by approximating the pricing error implied by a simple and well-known method (the Black-Scholes formula).

A well-known type of local estimation is rolling window estimation, which has been found to improve forecast performance in a variety of applications, particularly in the presence of structural breaks; see Pesaran et al. (2013), Inoue et al. (2017) and others. It is also similar to the use of exponential smoothing, see Brown (1956), Muth (1960), and Zumbach (2006), where more recent observations are given a higher weight in estimation than older observations. Both methods attempt to capture the fact that as the DGP evolves through time, the best-fitting approximating model will vary too. These methods correspond to using time as the state variable, and a one-sided rectangular or exponential kernel.[3] Related, Ang and Kristensen (2012) and Inoue et al. (2020) consider the estimation of factor models and GARCH models, respectively, with parameters that vary smoothly over time, though those papers focus on model estimation rather than prediction.

Dendramis et al. (2020) is perhaps the most closely-related paper to ours. That paper focuses on conditional mean forecasts made using ARMA models and estimated by OLS. The authors note that the gains they find are somewhat small and not always a significant improvement over their benchmark AR(1) model. This is in contrast with the variety of target variables, functionals, and estimation methods that we consider, and the robust and strongly significant gains in forecast performance that we find empirically. Further, we theoretically analyze the bias-variance trade-off present in a local estimation framework, and obtain predictions for when such a method is likely to work well in practice.

[1] More specifically, we follow Fan et al. (1998) in the kernel-weighting of the likelihoods, but we do not take an expansion of the functional of interest in the state variable. Instead, we retain the specification of that functional as given by the baseline model.
[2] Blasques et al. (2016) also consider a weighted ML method, for applications where the vector of dependent variables can be separated into those of particular interest and the rest, and in estimation the likelihood of the former is overweighted relative to the latter.
[3] Theoretically, the interpretation of the local estimator differs in these applications: with a stochastic state variable one may still assume stationarity, while when using time as the state variable one must instead consider heterogeneity in the DGP, usually in the form of smoothly evolving parameters. Empirically, either form of state variable is equally easy to handle, and we consider both in our empirical applications.

Our approach is also related to work on bringing outside information to bear on a forecasting problem. Manganelli (2009) considers the case that the forecaster has a "default decision" and provides a structured method for tilting a model-based forecast towards the default decision. Giacomini and Ragusa (2014) and Pettenuzzo et al. (2014) provide methods for adjusting model-based forecasts so that they satisfy constraints suggested by economic theory. The approach proposed in this paper requires less of the forecaster: no default decision and no economic theory, only a variable that is thought to be related to the degree of model misspecification.

Exploiting the expertise of the forecast user to identify a state variable to improve the forecasts obtained from a baseline model is also related to professional forecasters' use of both statistical models and expert judgment. Numerous studies, see Ang et al. (2007) and Faust and Wright (2009) for example, have found that professional forecasters regularly outperform standard model-based forecasts. Our tilting of the model parameters may be interpreted as a form of "structured" expert judgment, and the generally superior performance of our proposed method is consistent with this literature.

The remainder of the paper is structured as follows. In Section 2 we present our estimator and theoretically consider the bias-variance trade-off for local and non-local estimation methods in out-of-sample forecasting. In Section 3 we apply our estimator to four economic forecasting problems and Section 4 concludes. A supplemental appendix contains additional details and results.

2 Local estimation and out-of-sample forecasting

We consider a target variable $Y_{t+1}$, and target functional $g_t \in \mathcal{G}$. For example, $g_t$ could be the mean, variance, median, a quantile, etc. It may also, with some changes in notation and methods, be a predictive density, though we will focus on point forecasting. The target functional may also be a vector, e.g. if $Y_{t+1}$ is a vector and $g_t$ is its mean, or if $Y_{t+1}$ is a scalar and $g_t$ is the vector containing the Value-at-Risk and Expected Shortfall. The forecaster's information set is $\mathcal{F}_t$, and naturally $g_t$ is $\mathcal{F}_t$-measurable.
We focus on one-step-ahead forecasts, but all the results below can be extended to general $h$-step-ahead forecasts, for $h < \infty$. Let $L$ be a loss function (scoring rule) that elicits the desired target functional, i.e., that

$$g_t^{\dagger} = \arg\min_{g \in \mathcal{G}} E\left[ L(Y_{t+1}, g) \mid \mathcal{F}_t \right] \tag{1}$$

For example, if the target functional is the mean, then $L$ can be the squared forecast error.[4] The baseline model is a parametric model for the target functional, $g_t(\theta)$, and we assume the parameter of the model is obtained via M-estimation minimizing the same loss function:[5]

$$\hat{\theta}_T = \arg\min_{\theta \in \Theta} \frac{1}{T} \sum_{t=1}^{T} L(Y_t, g_{t-1}(\theta)) \tag{2}$$

where $\theta \in \Theta \subseteq \mathbb{R}^p$. We assume that the sample runs from $t = 0, 1, \ldots, T$, yielding $T$ observations for estimation. Under standard conditions the usual estimator converges at rate $\sqrt{T}$ to a well-defined probability limit, $\hat{\theta}^*$, and has a Normal asymptotic distribution:

$$\hat{\theta}^* \equiv \arg\min_{\theta \in \Theta} E\left[ L(Y_{t+1}, g_t(\theta)) \right] \tag{3}$$

$$\sqrt{T} \left( \hat{\theta}_T - \hat{\theta}^* \right) \xrightarrow{D} N(0, \Sigma) \tag{4}$$

[4] As discussed in Gneiting (2011) and Patton (2020), in many cases there are an infinite number of loss functions that elicit a given functional.
[5] Matching the estimation and evaluation loss functions is intuitive and can lead to improved forecasts, see Granger (1969) and Weiss (1996) for example; however, in some applications there may be gains from using an alternative loss function for estimation, see Hansen and Dumitrescu (2021).

2.1 Incorporating information from a state variable

Denote the forecaster's state variable as $S_t$, with support $\mathcal{S} \subset \mathbb{R}^d$. This variable must be, naturally, $\mathcal{F}_t$-measurable, and may or may not be one of the variables in the baseline model. We consider an estimator defined by:

$$\tilde{\theta}_{h,T}(s) = \arg\min_{\theta \in \Theta} \frac{1}{T} \sum_{t=1}^{T} L(Y_t, g_{t-1}(\theta)) \, K(s - S_{t-1}; h_T), \quad \text{for } s \in \mathrm{Int}(\mathcal{S}) \tag{5}$$

where $K$ is the kernel function, $h_T$ is a bandwidth parameter that shrinks with the sample size, and $\mathrm{Int}(\mathcal{S})$ is the interior of the set $\mathcal{S}$. Under a variety of regularity conditions, see e.g. Fan et al. (2009), the limit of this estimator is:

$$\tilde{\theta}^*(s) \equiv \arg\min_{\theta \in \Theta} E\left[ L(Y_{t+1}, g_t(\theta)) \mid S_t = s \right] \tag{6}$$
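To make the estimator in Equation (5) concrete, the following is a minimal Python sketch for the special case of squared-error loss and a linear baseline model, where the local M-estimator reduces to kernel-weighted least squares. The function names (`gaussian_kernel`, `local_ols`), the scalar state variable, the unnormalized Gaussian kernel, and the toy data are illustrative assumptions and are not taken from the paper.

```python
import math

def gaussian_kernel(x, h):
    """Unnormalized Gaussian kernel weight exp(-x^2 / (2 h^2)), for h > 0."""
    return math.exp(-x * x / (2.0 * h * h))

def local_ols(y, x, s, s0, h):
    """Kernel-weighted OLS of y on (1, x), localized at state value s0.

    Illustrative assumption: row t holds the outcome y[t] together with a
    predictor x[t] and a state value s[t] that were both known before y[t].
    Minimizes sum_t K(s0 - s[t]; h) * (y[t] - a - b * x[t])**2 over (a, b).
    """
    w = [gaussian_kernel(s0 - st, h) for st in s]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

# Toy data (hypothetical): slope is 2 when the state is -1, and -1 when it is +1.
x = [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0]
s = [-1.0] * 4 + [1.0] * 4
y = [2.0 * xi for xi in x[:4]] + [-1.0 * xi for xi in x[4:]]

a_local, b_local = local_ols(y, x, s, s0=-1.0, h=0.1)    # small bandwidth
a_global, b_global = local_ols(y, x, s, s0=-1.0, h=1e6)  # nearly flat weights
print(b_local, b_global)
```

In this toy setting a small bandwidth recovers the slope that prevails near the chosen state value, while a very large bandwidth makes the weights nearly flat and so reproduces the ordinary (non-local) OLS slope.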

With the bandwidth shrinking at an appropriate rate, which differs depending on assumptions about smoothness and temporal dependence, the rate of convergence for the estimator is $T^{1/2-\gamma}$ for some $\gamma \in (0, 1/2)$:

$$T^{1/2-\gamma} \left( \tilde{\theta}_{h,T}(s) - \tilde{\theta}^*(s) \right) = O_p(1) \quad \forall s \in \mathrm{Int}(\mathcal{S}) \tag{7}$$

For the purposes of our analysis below, we require only that the estimator is consistent (so $\gamma < 1/2$) but converges more slowly than the parametric rate ($\gamma > 0$). Naturally, in applied work one would like to find the local estimator with the fastest rate of convergence, and in our applications we use cross-validation to choose the bandwidth that minimizes average loss.[6]

[6] We focus on the case of a stochastic state variable here but the results below go through when conditioning instead on time, as the fundamental trade-off between a better local approximation and greater estimation error remains the same. The rate of convergence of the local estimator when using time as the state variable can again be shown to be $T^{1/2-\gamma}$ for some $\gamma \in (0, 1/2)$ under a variety of conditions, see Ang and Kristensen (2012) for example.

2.2 The special case of correct specification

Consider the special case that the baseline model is correctly specified and point identified for the target functional. This implies

$$\exists! \, \hat{\theta}^* \in \Theta \ \text{ s.t. } \ g_t^{\dagger} = g_t(\hat{\theta}^*) \ \text{ a.s. } \forall t \tag{8}$$

Now consider the population local estimator using today's value of the state variable

$$\tilde{\theta}^*(S_t) \equiv \arg\min_{\theta \in \Theta} E\left[ L(Y_{t+1}, g_t(\theta)) \mid S_t \right] \tag{9}$$

which implies

$$E\left[ L\left(Y_{t+1}, g_t(\tilde{\theta}^*(S_t))\right) \mid S_t \right] \le E\left[ L(Y_{t+1}, g_t(\theta)) \mid S_t \right] \quad \text{a.s. } \forall t, \ \forall \theta \in \Theta \tag{10}$$

However, since $g_t^{\dagger} = \arg\min_{g \in \mathcal{G}} E[L(Y_{t+1}, g) \mid \mathcal{F}_t]$, we also have

$$E\left[ L(Y_{t+1}, g_t^{\dagger}) \mid \mathcal{F}_t \right] = E\left[ L\left(Y_{t+1}, g_t(\hat{\theta}^*)\right) \mid \mathcal{F}_t \right] \le E\left[ L(Y_{t+1}, g_t(\theta)) \mid \mathcal{F}_t \right] \quad \text{a.s. } \forall t, \ \forall \theta \in \Theta \tag{11}$$

Since $S_t \in \mathcal{F}_t$, we can combine Equation (10) with the law of iterated expectations (LIE) to infer

$$E\left[ L\left(Y_{t+1}, g_t(\hat{\theta}^*)\right) \mid S_t \right] = E\left[ L\left(Y_{t+1}, g_t(\tilde{\theta}^*(S_t))\right) \mid S_t \right] \quad \text{a.s. } \forall t \tag{12}$$

and thus, by the point-identification assumption, that $\tilde{\theta}^*(S_t) = \hat{\theta}^*$. Noting that this must be true (a.s.) for all $t$, this implies that $\tilde{\theta}^*(s)$ is flat in $s$. That is, the local M estimator reduces to the usual M estimator when the baseline model is correctly specified.

2.3 Out-of-sample forecasting and a bias-variance trade-off

We now consider out-of-sample forecast accuracy using the local estimator and the usual, non-local, estimator. We obtain a form of bias-variance trade-off, which illuminates the conditions under which the local estimator is likely to outperform the usual estimator. By local estimation optimization, we have

$$E\left[ L\left(Y_{t+1}, g_t(\tilde{\theta}^*(S_t))\right) \mid S_t \right] \le E\left[ L(Y_{t+1}, g_t(\theta)) \mid S_t \right] \quad \text{a.s. } \forall t, \ \forall \theta \in \Theta \tag{13}$$

and by evaluating the right-hand side at the non-local estimator and invoking the LIE we obtain

$$E\left[ L\left(Y_{t+1}, g_t(\tilde{\theta}^*(S_t))\right) \right] \le E\left[ L\left(Y_{t+1}, g_t(\hat{\theta}^*)\right) \right] \tag{14}$$

This simply shows that the OOS average loss from the local estimator will weakly dominate that from the usual estimator in population.[7],[8] The gains accrue as the local estimator can vary with the realized value of the state variable, while the usual estimator is fixed. As shown in the previous section, when the model is correctly specified we have $\tilde{\theta}^*(s) = \hat{\theta}^* \ \forall s$ and so local and non-local estimators are identical and yield identical expected loss.
[7] Note that this is true even though OOS performance is computed using non-weighted losses, that is, the kernel function used in the local estimator does not appear.
[8] It is also possible to look at the difference in forecast performance conditional on the value of a state variable, for example by using the methods of Li et al. (2021), or as a function of time, as in Giacomini and Rossi (2010) and Richter and Smetanina (2020). In our empirical applications we consider both unconditional and conditional performance, but our focus in this section is on overall (unconditional) performance in the OOS period.

Next we consider the variance of the estimators, and the deleterious impact that estimation error has on expected loss. It is this aspect that often makes forecasts from nonparametric models worse than those from parametric models. We do so using a second-order Taylor series expansion of the time $T+1$ expected loss incurred using $\tilde{\theta}_{h,T}(S_T)$, centered on the limiting parameter $\tilde{\theta}^*(S_T)$. For ease of presentation we assume that $\dim(\theta) = 1$, which can easily be relaxed, and we assume that the loss function is differentiable.[9]

\begin{align}
& L\left(Y_{T+1}, g_T(\tilde{\theta}_{h,T}(S_T))\right) \tag{15} \\
&\approx L\left(Y_{T+1}, g_T(\tilde{\theta}^*(S_T))\right) + \frac{\partial L\left(Y_{T+1}, g_T(\tilde{\theta}^*(S_T))\right)}{\partial \theta} \left( \tilde{\theta}_{h,T}(S_T) - \tilde{\theta}^*(S_T) \right) \tag{16} \\
&\quad + \frac{1}{2} \frac{\partial^2 L\left(Y_{T+1}, g_T(\tilde{\theta}^*(S_T))\right)}{\partial \theta^2} \left( \tilde{\theta}_{h,T}(S_T) - \tilde{\theta}^*(S_T) \right)^2 \notag
\end{align}

Taking conditional expectations we then find

\begin{align}
E\left[ L\left(Y_{T+1}, g_T(\tilde{\theta}_{h,T}(S_T))\right) \mid \mathcal{F}_T \right] &\approx E\left[ L\left(Y_{T+1}, g_T(\tilde{\theta}^*(S_T))\right) \mid \mathcal{F}_T \right] \tag{17} \\
&\quad + \tilde{H}_T^*(S_T) \left( \tilde{\theta}_{h,T}(S_T) - \tilde{\theta}^*(S_T) \right)^2 \notag
\end{align}

$$\text{where } \tilde{H}_T^*(S_T) \equiv E\left[ \frac{1}{2} \frac{\partial^2 L\left(Y_{T+1}, g_T(\tilde{\theta}^*(S_T))\right)}{\partial \theta^2} \,\middle|\, \mathcal{F}_T \right] \tag{18}$$

The first-order term in the expansion drops out as $E\left[ \partial L\left(Y_{T+1}, g_T(\tilde{\theta}^*(S_T))\right) / \partial \theta \mid \mathcal{F}_T \right] = 0$ a.s. by the definition of $\tilde{\theta}^*(S_T)$. $\tilde{H}_T^*(S_T)$ is a Hessian-like term and is positive definite in standard estimation problems. Then, taking unconditional expectations we obtain

\begin{align}
E\left[ L\left(Y_{T+1}, g_T(\tilde{\theta}_{h,T}(S_T))\right) \right] &\approx E\left[ L\left(Y_{T+1}, g_T(\tilde{\theta}^*(S_T))\right) \right] \tag{19} \\
&\quad + E\left[ \tilde{H}_T^*(S_T) \left( \tilde{\theta}_{h,T}(S_T) - \tilde{\theta}^*(S_T) \right)^2 \right] \notag
\end{align}

The first term on the right-hand side is the OOS average loss for the local estimator evaluated at the population parameter, and the second term is positive and of the order $O_p\left(T^{-1+2\gamma}\right)$.

[9] If the target variable is continuously distributed, non-differentiable loss functions like the Fissler-Ziegel loss used in Value-at-Risk and Expected Shortfall estimation can be accommodated by approximating the expected loss.
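As a worked illustration of the expansion in Equations (15) to (18), consider squared-error loss with a one-parameter linear model $g_T(\theta) = \theta X_T$, where $X_T$ is an $\mathcal{F}_T$-measurable predictor. This particular model and the symbol $X_T$ are assumptions made here for illustration only; they are not a specification used in the paper. Because the loss is quadratic in $\theta$, the second-order expansion holds exactly in this case:

```latex
\begin{align*}
L\big(Y_{T+1}, g_T(\theta)\big) &= (Y_{T+1} - \theta X_T)^2, \\
\frac{\partial L}{\partial \theta} &= -2 X_T \, (Y_{T+1} - \theta X_T), \qquad
\frac{\partial^2 L}{\partial \theta^2} = 2 X_T^2, \\
\tilde{H}_T^{*}(S_T) &= E\left[ \tfrac{1}{2} \cdot 2 X_T^2 \,\middle|\, \mathcal{F}_T \right] = X_T^2 .
\end{align*}
```

Under these assumptions the estimation-error term in Equation (17) is $X_T^2 \left( \tilde{\theta}_{h,T}(S_T) - \tilde{\theta}^*(S_T) \right)^2$, which is non-negative and shrinks at the rate implied by Equation (7).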

Next we consider the usual, non-local, estimator using similar steps, and obtain:

$$E\left[ L\left(Y_{T+1}, g_T(\hat{\theta}_T)\right) \right] \approx E\left[ L\left(Y_{T+1}, g_T(\hat{\theta}^*)\right) \right] + E\left[ \hat{H}_T^* \left( \hat{\theta}_T - \hat{\theta}^* \right)^2 \right]$$

$$\text{where } \hat{H}_T^* \equiv E\left[ \frac{1}{2} \frac{\partial^2 L\left(Y_{T+1}, g_T(\hat{\theta}^*)\right)}{\partial \theta^2} \,\middle|\, \mathcal{F}_T \right]$$

The expected loss using the estimated parameter is again equal to the average loss based on the infeasible population parameter, and a positive term related to estimation error. In this case, the estimation error term is of order $O_p\left(T^{-1}\right)$.

Finally, consider the difference between the OOS losses using the above two approximations:

\begin{align}
& E\left[ L\left(Y_{T+1}, g_T(\tilde{\theta}_{h,T}(S_T))\right) - L\left(Y_{T+1}, g_T(\hat{\theta}_T)\right) \right] \tag{20} \\
&\approx E\left[ L\left(Y_{T+1}, g_T(\tilde{\theta}^*(S_T))\right) - L\left(Y_{T+1}, g_T(\hat{\theta}^*)\right) \right] + O_p\left(T^{-1+2\gamma}\right) \notag
\end{align}

The first term on the right-hand side is non-positive, as the local estimator has weakly smaller expected loss than the usual estimator when both are evaluated at population parameters. The second term is dominated by the magnitude of the estimation error in the local estimator, which is of the order $O_p\left(T^{-1+2\gamma}\right)$. Since this term is positive, it increases the expected loss using estimated parameters, and we observe the usual trade-off in forecasting: a more flexible model leads to improved fit, at a cost of increased estimation error. Whether one of these terms outweighs the other depends on features specific to each application, and we discuss these next.
2.4 Empirical predictions from the theoretical analysis

Firstly, consider the case that the baseline model is correctly specified. In that case Section 2.2 showed that $\tilde{\theta}^*(s) = \hat{\theta}^* \ \forall s$, and we have

$$E\Big[ \underbrace{L\left(Y_{T+1}, g_T(\tilde{\theta}^*(S_T))\right)}_{\text{local estimator loss}} - \underbrace{L\left(Y_{T+1}, g_T(\hat{\theta}^*)\right)}_{\text{non-local estimator loss}} \Big] = 0 \tag{21}$$

In this case, there is no improvement in the fit from using local estimation, and increased estimation error causes local estimation to have worse OOS performance. More generally, when the baseline model is "very good" the scope for an improvement in fit is reduced, and the possibility that any such improvements are more than offset by increased estimation error is increased.

Secondly, consider the case that the state variable contains no information about variation in the fit of the misspecified model. We quantify this by considering the population first-order conditions (FOCs) for the estimation methods. If the scores of the usual, non-local, estimator are mean independent of the state variable $S_t$, i.e.,

$$E\left[ \frac{\partial L\left(Y_{t+1}, g_t(\hat{\theta}^*)\right)}{\partial \theta} \,\middle|\, S_t \right] = E\left[ \frac{\partial L\left(Y_{t+1}, g_t(\hat{\theta}^*)\right)}{\partial \theta} \right] \tag{22}$$

then the local estimation's FOC is satisfied when $\tilde{\theta}^*(S_t) = \hat{\theta}^*$, since the RHS of the above equation equals zero by the FOC of the usual estimator. Thus a worthless state variable leads to $\tilde{\theta}^*(s)$ being flat in $s$. This is the same outcome as in the correctly-specified case, although from a different source, namely the use of a poor state variable.[10] Since $\tilde{\theta}^*(S_t) = \hat{\theta}^*$ in this case, there is obviously no improvement in the fit from using local estimation, and the estimation error term discussed in the previous section causes local estimation to have worse OOS performance. More generally, when the state variable is only weakly informative about model misspecification the gains from local estimation are lower, and the possibility that any such gains are more than offset by increased estimation error is increased.
[10] In the correctly-specified case, the scores are a MDS with respect to $\mathcal{F}_t$, and since $S_t \in \mathcal{F}_t$ the LIE implies $E\left[ \partial L\left(Y_{t+1}, g_t(\hat{\theta}^*)\right) / \partial \theta \mid S_t \right] = 0$ for any choice of $S_t$, implying that in this case there are no useful state variables.

2.5 A stylized example

To illustrate the above ideas, consider a nonlinear AR(1) process as the DGP and a standard AR(1) as the baseline model. Concretely, we use a stationary copula-based Markov process (see, e.g., Chen and Fan (2006) and Beare (2010)), with standard Normal marginal distributions and a Clayton copula linking adjacent realizations:

$$(Y_t, Y_{t-1}) = C_{\text{Clayton}}(\Phi, \Phi; \kappa) \tag{23}$$

where $\Phi$ is a standard Normal CDF, and $\kappa$ is the parameter of the Clayton copula. We set $\kappa = 5$, which implies first-order autocorrelation of about 0.85, and consider an estimation sample of $T = 1000$. The conditional mean of $Y_t$ given $Y_{t-1}$ is nonlinear in $Y_{t-1}$ for this process, and in the upper panel of Figure 1 we see that it is increasing and concave. The upper panel of Figure 1 also shows the fitted linear AR(1) prediction obtained by OLS.

If we use $Y_{t-1}$ as the state variable for local OLS estimation then the local estimator asymptotically recovers the true conditional expectation function, since the truth is a nonlinear AR(1). That is, in this example local estimation completely fixes the misspecification of the linear AR(1) model. This estimator is denoted "Local OLS 1," and the upper panel of Figure 1 confirms that this estimator closely tracks the true conditional expectation function.[11] We also consider a local estimator using the second lag of the dependent variable, which is correlated with the ideal state variable but imperfect. The resulting estimated conditional expectation function is approximately correct for $Y_{t-1} < 0$, where first-order dependence is particularly strong for this process, but is noticeably incorrect for $Y_{t-1} > 0$, where dependence is weaker and the state variable is worse.

The lower panel of Figure 1 presents the out-of-sample RMSE for the two local estimators across a range of bandwidth parameters. For the optimal choice of bandwidth ($h = 0.41$) the RMSE of the first local estimator is almost equal to the RMSE of the optimal forecast, which of course represents the lower bound on RMSE.
The RMSE of the second local estimator is greater than that of the first, consistent with this estimator using a worse state variable, and it is below the usual OLS estimator's RMSE for all but the smallest choices of bandwidth. (The optimal bandwidth is $h = 0.62$.) As the bandwidth grows the two local estimators generate RMSE that converges to that of the OLS estimator, as in that case the local estimators reduce to the OLS estimator.

[ INSERT FIGURE 1 ABOUT HERE ]

[11] For each of the "local OLS" estimated conditional expectation functions in the upper panel we use the bandwidths identified as optimal according to the lower panel of Figure 1.
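The stylized example can be approximated with a short simulation. The sketch below, in standard-library Python, draws a Clayton-copula Markov chain with standard Normal marginals using the usual inverse conditional-CDF method, then compares the global OLS slope with kernel-weighted slopes localized at a low and a high value of $Y_{t-1}$. The function names, the seed, the sample size, the evaluation points (-1.5 and 1.5), and the reuse of $h = 0.41$ are illustrative assumptions; this is not the authors' code and it does not reproduce Figure 1.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def simulate_clayton_ar(T, kappa, seed=0):
    """Simulate T+1 draws of a stationary Markov chain with N(0,1) marginals
    whose adjacent observations are linked by a Clayton copula."""
    rng = random.Random(seed)
    eps = 1e-12  # keep uniforms strictly inside (0, 1)
    u = min(max(rng.random(), eps), 1.0 - eps)
    ys = []
    for _ in range(T + 1):
        ys.append(STD_NORMAL.inv_cdf(u))
        w = min(max(rng.random(), eps), 1.0 - eps)
        # Inverse of the Clayton conditional CDF of U_t given U_{t-1} = u.
        v = (u ** (-kappa) * (w ** (-kappa / (1.0 + kappa)) - 1.0) + 1.0) ** (-1.0 / kappa)
        u = min(max(v, eps), 1.0 - eps)
    return ys

def weighted_ols(y, x, w):
    """Weighted least squares of y on (1, x); returns (intercept, slope)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def local_ols(y, x, s, s0, h):
    """Gaussian-kernel-weighted OLS localized at state value s0."""
    w = [math.exp(-((s0 - st) ** 2) / (2.0 * h * h)) for st in s]
    return weighted_ols(y, x, w)

ys = simulate_clayton_ar(T=5000, kappa=5.0, seed=12345)
y, x = ys[1:], ys[:-1]                           # regress Y_t on Y_{t-1}
_, b_ols = weighted_ols(y, x, [1.0] * len(y))    # usual (non-local) AR(1) slope
_, b_low = local_ols(y, x, x, s0=-1.5, h=0.41)   # state variable is Y_{t-1}
_, b_high = local_ols(y, x, x, s0=1.5, h=0.41)
print(round(b_ols, 2), round(b_low, 2), round(b_high, 2))
```

Because the Clayton copula has strong lower-tail dependence, the localized slope should be steeper when $Y_{t-1}$ is well below zero than when it is well above zero, which is the concavity that a single global AR(1) coefficient cannot capture.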

3 Empirical applications

We consider our new estimation method in four different empirical applications. Firstly, we consider the widely-used GARCH model of Bollerslev (1986). In this application the target variable (returns) and the target functional (conditional variance) are both scalars, and the model is estimated using quasi maximum likelihood (QML). In our second application we consider a popular high-frequency successor to the GARCH model, namely the HAR model of Corsi (2009). In this application the target functional is again a scalar, and estimation is again done via QML. Our third application considers forecasts of Value-at-Risk and Expected Shortfall (VaR and ES), and so the target functional is a $(2 \times 1)$ vector, and the model is estimated using M-estimation. Finally, we consider yield curve forecasts using the popular "dynamic Nelson-Siegel" model of Diebold and Li (2006). In this case the target variable is a $(12 \times 1)$ vector of yields for bonds with maturities ranging from three months to ten years and the target functional is the conditional mean of that vector, estimated using OLS. These four applications illustrate the variety of environments (target functionals, dimensionality, estimation methods), and we show that our proposed method provides statistically significant improvements over standard methods.

Across all four applications, for stochastic state variables we use a Gaussian kernel:

$$K_G(x; h) = \exp\left\{ -\frac{x^2}{2h^2} \right\}, \quad x \in \mathbb{R}, \ h > 0 \tag{24}$$

We consider values for the bandwidth, $h$, in the range $0.01\sigma_S$ to $3\sigma_S$, where $\sigma_S$ is the standard deviation of the state variable. A small value of $h$ makes the model parameters more "local," but also decreases their precision since the effective sample size is smaller, and as $h$ diverges the local estimator approaches the benchmark non-local method.
We also consider an infinite bandwidth by comparing the average loss from the best finite bandwidth with that from the non-local method.

When using time as a state variable we use a one-sided exponential kernel with bandwidth parameter $\lambda$ and window length $m$:

$$K_E(j; \lambda) = \frac{\lambda^j (1 - \lambda)}{1 - \lambda^m} \, \mathbf{1}\{ j < m \}, \quad j \in \{0, 1, 2, \ldots\} \tag{25}$$
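As a rough illustration of equation (25), the sketch below computes the exponential weights for a hypothetical window of $m = 500$ observations and two values of $\lambda$. The function name and the window length are assumptions made for the example.

```python
def exponential_kernel(j, lam, m):
    """One-sided exponential kernel of equation (25) for an observation j periods old."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    if j < 0 or j >= m:
        return 0.0  # indicator 1{j < m}: observations outside the window get zero weight
    return lam ** j * (1.0 - lam) / (1.0 - lam ** m)


m = 500  # hypothetical window length
for lam in (0.98, 0.9999):
    w = [exponential_kernel(j, lam, m) for j in range(m)]
    print(f"lambda = {lam}: sum = {sum(w):.6f}, newest/oldest weight = {w[0] / w[-1]:.3g}")
```

The normalizing factor $(1-\lambda)/(1-\lambda^m)$ makes the weights sum to one over the window. For $\lambda$ close to one the ratio of the newest to the oldest weight is close to one, so the weighting is nearly flat.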

We consider values for $\lambda$ ranging from 0.98 to 0.9999. Smaller values of $\lambda$ imply that older data are given less weight in estimation, making the model parameters more local (in time) but subject to greater estimation error. As $\lambda \to 1$ the weight function becomes flat and the local estimator approaches the benchmark non-local estimator. We consider the limiting case of $\lambda = 1$ by comparing the smallest average loss from a bandwidth less than 1 with the loss from the non-local method.

To select the optimal bandwidth parameter(s) for each state variable, we split the estimation sample into a "training sample" (the first half) for estimation of the model parameters with a variety of bandwidths, and a "validation sample" (the second half) to select the optimal bandwidth parameter(s).¹² We then use the selected bandwidth parameter when evaluating the model in the out-of-sample (OOS) period, eliminating look-ahead bias in both the model parameters and bandwidth parameters. Model parameters are re-estimated daily throughout the OOS period using a rolling window of data, while bandwidth parameters are kept fixed at their optimized value from the validation sample.

In all applications we consider four stochastic state variables, motivated by our applications to volatility or risk forecasting and yield curve forecasting. We consider two measures of volatility: 5-minute realized volatility (RV) on the S&P 500 index,¹³ and VIX, a measure of S&P 500 index volatility extracted from options prices. We also consider two measures derived from the yield curve: the Federal Funds Rate (FFR) and the difference between 10-year and 2-year government bond yields (denoted 10Y-2Y), representing measures of the "level" and "slope" of the yield curve. To mitigate skewness we use the natural logarithm of the two volatility measures.
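The training/validation split for choosing the bandwidth $h$ can be sketched as follows. This is only an illustration under stated assumptions: the "model" is a toy kernel-weighted mean rather than any of the paper's models, the data are simulated, the loss is squared error, and all function names are hypothetical. The coarse-then-fine grid follows the description in footnote 12.

```python
import math
import random
import statistics


def gaussian_kernel(x, h):
    return math.exp(-(x * x) / (2.0 * h * h))


def local_mean_forecast(y_train, s_train, s_now, h):
    """Kernel-weighted mean of y: a toy stand-in for a locally estimated model."""
    w = [gaussian_kernel(s - s_now, h) for s in s_train]
    total = sum(w)
    if total == 0.0:  # every weight underflowed; fall back to the non-local mean
        return sum(y_train) / len(y_train)
    return sum(wi * yi for wi, yi in zip(w, y_train)) / total


def validation_loss(y, s, h):
    """Fit on the first half (training), average squared error on the second half (validation)."""
    half = len(y) // 2
    errors = [(y[t] - local_mean_forecast(y[:half], s[:half], s[t], h)) ** 2
              for t in range(half, len(y))]
    return sum(errors) / len(errors)


def coarse_grid(sd):
    """Coarse grid 0.1*sd, 0.2*sd, ..., 3*sd."""
    return [0.1 * k * sd for k in range(1, 31)]


def select_bandwidth(y, s):
    """Coarse search, then a finer grid of width 0.01*sd within +/- 0.1*sd of the coarse optimum."""
    sd = statistics.stdev(s[: len(s) // 2])
    best = min(coarse_grid(sd), key=lambda h: validation_loss(y, s, h))
    fine = [best + 0.01 * k * sd for k in range(-10, 11)]
    fine = [h for h in fine if h > 0.005 * sd]  # keep bandwidths strictly positive
    h_star = min(fine, key=lambda h: validation_loss(y, s, h))
    return h_star, validation_loss(y, s, h_star)


# Simulated data (purely illustrative): y depends nonlinearly on the state s.
random.seed(0)
s = [random.uniform(-2.0, 2.0) for _ in range(200)]
y = [math.sin(2.0 * si) + 0.1 * random.gauss(0.0, 1.0) for si in s]
h_star, loss_star = select_bandwidth(y, s)
print(f"selected bandwidth: {h_star:.3f}, validation MSE: {loss_star:.4f}")
```

Because the bandwidth is chosen using only the estimation sample, the selected value can then be held fixed in the OOS period without introducing look-ahead bias.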
We also consider time as a state variable, and four bivariate state variables comprised of time and each of the four stochastic state variables, leading to a total of nine possible state variables. As the kernel for the bivariate state variables we use the product of the univariate kernels for the two variables.

In our main analyses, we compare the various estimation methods in each application using OOS average loss. Importantly, OOS losses are unweighted, and so the local estimator has no inherent advantage; any forecast performance improvements are attributable to a favorable bias-variance trade-off relative to the benchmark method, in the spirit of the analysis in Section 2. We use Giacomini-White (2006) (GW) tests to compare each method to the non-local method using the full sample for estimation, and we estimate the set of best methods using the model confidence set (MCS) of Hansen et al. (2011).¹⁴ Digging deeper into the comparison of the competing methods, in Section 3.5 we consider conditional analyses of forecast performance, investigating whether relative performance varies with the state variable.

¹² For the bandwidth $h$ we use a coarse grid of width 0.1 from $0.1\sigma_S$ to $3\sigma_S$ to find an approximate solution and then consider a finer grid of width 0.01 in an interval $\pm 0.1$ from the approximate solution. For the bandwidth $\lambda$ we consider a grid of width 0.0025 from 0.98 to 1, but we replace 1 with 0.999, 0.9995 and 0.9999.

¹³ These data are taken from the Oxford-Man Realised Library.

¹⁴ We use Newey-West standard errors with ten lags for the GW test, and we use the stationary bootstrap with an average block length of ten for the MCS.

3.1 GARCH forecasts

The GARCH model of Bollerslev (1986) is a very popular model for forecasting asset return volatility, and in a variety of applications, and against a variety of alternatives (see Hansen and Lunde (2005)), it has proven hard to beat.¹⁵ Assuming the conditional mean is zero, the GARCH model for the conditional volatility of asset return $Y_t$ is:

$$Y_t = \sigma_t \varepsilon_t, \qquad \sigma_t^2 = \omega + \beta \sigma_{t-1}^2 + \alpha Y_{t-1}^2 \tag{26}$$

The benchmark method estimates the model parameters using QML, which is equivalent to minimizing the in-sample average QLIKE loss function:

$$L(Y_t^2, \sigma_t^2) = \frac{Y_t^2}{\sigma_t^2} - \log \frac{Y_t^2}{\sigma_t^2} - 1 \tag{27}$$

For this analysis we use daily returns on the S&P 500 index over the period January 2000 to June 2021, a total of $T = 5349$ observations. We use the period 2000-2010 (2737 observations) as the estimation sample, which is then further split into two to select the bandwidth parameters, and the remainder (2612 observations) as the out-of-sample period.

¹⁵ Many papers have built on the original GARCH model; we do not attempt to conduct a horse race of volatility models here. Rather, we illustrate how our method improves upon the seminal GARCH model and, aside from one exception discussed at the end of this section, leave applying the method to extensions for future research.
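To illustrate how equations (26) and (27) combine with kernel weights, here is a minimal sketch of a kernel-weighted average QLIKE objective using a product of a Gaussian kernel in the state variable and exponential decay in time. The initialization of the variance recursion, the alignment of `states[t]` with observation `t`, the example data and all function names are assumptions for illustration. A local QML estimate would minimize this objective over $(\omega, \alpha, \beta)$ with a numerical optimizer, which is omitted here.

```python
import math


def garch_variances(returns, omega, alpha, beta):
    """Conditional variances from the recursion in equation (26).

    The recursion is started at the sample variance (an assumption; the
    initialization is not stated in this section of the paper).
    """
    sig2 = [sum(r * r for r in returns) / len(returns)]
    for t in range(1, len(returns)):
        sig2.append(omega + beta * sig2[-1] + alpha * returns[t - 1] ** 2)
    return sig2


def qlike(y2, sig2):
    """QLIKE loss of equation (27); requires y2 > 0 and sig2 > 0."""
    ratio = y2 / sig2
    return ratio - math.log(ratio) - 1.0


def local_qlike_objective(params, returns, states, s_now, h, lam):
    """Kernel-weighted average QLIKE with a (state, time) product kernel."""
    omega, alpha, beta = params
    sig2 = garch_variances(returns, omega, alpha, beta)
    n = len(returns)
    num = den = 0.0
    for t in range(1, n):
        w_state = math.exp(-((states[t] - s_now) ** 2) / (2.0 * h * h))  # Gaussian kernel
        w_time = lam ** (n - 1 - t)  # exponential decay in the age of the observation
        w = w_state * w_time
        num += w * qlike(returns[t] ** 2, sig2[t])
        den += w
    return num / den


# Hypothetical returns and state values, for illustration only (returns must be nonzero).
returns = [0.5, -1.2, 0.8, -0.3, 1.5, -0.7, 0.2, -1.0, 0.9, -0.4]
states = [0.1, 0.4, 0.3, -0.2, 0.8, 0.5, -0.1, 0.6, 0.2, 0.0]
params = (0.1, 0.1, 0.8)
print(local_qlike_objective(params, returns, states, s_now=0.0, h=0.2, lam=0.98))
```

With a very large $h$ and $\lambda = 1$ the weights are flat and the objective reduces to the unweighted average QLIKE, i.e. the non-local QML criterion.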

As described above, we consider a total of nine possible state variables for local estimation. For non-local estimation we consider estimation windows of length 250, 500, 1000 and the full estimation period (2737 observations). By considering both long and short estimation windows for the baseline model, we can see whether the proposed local method out-performs a well-known way of "localizing" estimation; namely, using a shorter estimation window.

Table 1 presents the out-of-sample performance of the GARCH(1,1) model estimated using a variety of methods. The rows of this table are ordered by average OOS QLIKE loss, reported in the third-last column. The local method with the best performance in the validation sample (the second half of the in-sample period) is marked in the first column with *. The last two columns report Giacomini-White t-statistics of each model relative to the benchmark non-local model, and an indicator (✓ or ×) for whether a given method is included in the 95% model confidence set.

We observe that the benchmark method, which uses non-local QML and the full estimation window, is ranked last in this set of estimation methods. Every local method aside from those using the two yield curve state variables (FFR and 10Y-2Y) has significantly lower OOS loss than the benchmark method, according to the GW test. The local method with the best performance in the validation sample uses time and RV as state variables, and it turns out to also have the lowest average loss in the OOS period. Comparing the benchmark non-local method with the local method selected using the validation sample we obtain a GW t-statistic of -10.32, strong evidence that the local method out-performs the non-local benchmark.¹⁶ When we consider this set of estimation methods as a whole, we find only one method is included in the model confidence set: local QML using VIX and time as the state variables.
This small MCS indicates a high degree of precision in identifying the best-performing method.

The theoretical analysis in Section 2.4 revealed that when a state variable that is only weakly related to the degree of misspecification in the model is considered, local estimation is likely to fare poorly compared with non-local estimation, as the deleterious effect of nonparametric estimation error will not be offset by improved fit. This appears to be the case in this application when using the Fed Funds Rate (FFR) as a state variable: when combined with time it performs better than the benchmark, but when considered on its own the OOS average loss is not significantly lower than the benchmark, and is actually higher than using non-local QML on a shorter estimation window (500 or 250 observations). The other yield curve-based state variable (10Y-2Y) fares similarly when combined with time, and when considered on its own there is no bandwidth between zero and three standard deviations that is better than the non-local method in the validation sample, and so its optimal bandwidth is set to infinity; that is, when using 10Y-2Y as a state variable the optimal bandwidth is such that this conditioning variable is ignored.

[ INSERT TABLE 1 ABOUT HERE ]

To better understand the source of the improvement in performance of the best local method, Figure 2 presents the local QML estimates of the GARCH parameters when RV ranges over its support, and compares them with the usual, constant, QML estimates of these parameters. To facilitate interpretation we look at three functions of these parameters: the model-implied average volatility ($\sqrt{\omega/(1-\alpha-\beta)}$), the reaction of volatility to news ($\alpha$), and the persistence of volatility ($\alpha+\beta$). We see that the local QML estimate of the level of volatility is increasing in RV, consistent with RV providing useful information about future volatility. In the second panel we see that the reaction to news from local QML is generally lower than from non-local QML, and it is highest when RV is around 40, indicating that it is at these times that the squared return is most informative about future volatility. We also observe a drop in the persistence of volatility when RV is high (above about 35). This is consistent with some successful extensions of the GARCH model, e.g., where volatility is modeled as having a fast- and a slow-moving component (see Engle and Lee (1999) and Christoffersen et al. (2008)) with sharp increases in volatility being attributable to the less persistent component, or, relatedly, where volatility is modeled as having a jump and a continuous component (Andersen et al. (2007)), with the jump component found to be less persistent.

[ INSERT FIGURE 2 ABOUT HERE ]

In the supplemental appendix we consider a "local" analysis of the GARCH-X model, using VIX² as the "X" variable. (We use VIX² rather than VIX so that all regressors in the model are measures of variance.) Table S1 shows that 12 methods significantly (at the 0.05 level) beat the improved benchmark GARCH-X QML method, which ranks 17th out of the 26 competing methods. We find five methods are included in the 95% MCS, and all of these methods are local, using VIX, RV, FFR and/or time as state variables. This confirms that the proposed local method improves on a benchmark that has itself been improved by including an additional variable in the model (thereby altering the model), and also illustrates how to apply our method to an extension of a baseline model.

¹⁶ The best non-local estimation in the OOS turns out to be one that uses a window of length 500, much shorter than the total data available, making this also a type of local method. That method also significantly beats the benchmark model, with a GW statistic of -4.758; however, this method is not included in the MCS.

3.2 HAR volatility forecasts

We next consider a widely-used high frequency-based volatility forecasting model, the heterogeneous autoregressive (HAR) model of Corsi (2009). This model specifies one-period-ahead volatility to be a function of the one-day, one-week, and one-month lags of volatility:

$$RV_t = \beta_0 + \beta_d RV_{t-1} + \beta_w \frac{1}{5}\sum_{j=1}^{5} RV_{t-j} + \beta_m \frac{1}{22}\sum_{j=1}^{22} RV_{t-j} + e_t \tag{28}$$

By exploiting the information in high frequency data, this model has been widely found to outperform the GARCH model based on daily data.

We use five-minute realized volatility on the S&P 500 index over the period January 2000 to June 2021, and, as in the GARCH analysis in the previous section, we use 2000-2010 as the estimation sample (which is then further split into two to select the bandwidth parameters) and the remaining observations as the out-of-sample period. We also consider the same set of state variables: time, RV, VIX, FFR, 10Y-2Y, as well as bivariate state variables using time and each of the four stochastic state variables.

Table 2 presents results on the out-of-sample forecast performance of the various estimation methods.
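For readers who want to see the regressors in equation (28) spelled out, a minimal sketch follows. It only builds the daily, weekly and monthly terms and evaluates a forecast for given coefficients; estimation (local or non-local) is not shown, and the function names, the example series and the coefficient values are hypothetical.

```python
def har_regressors(rv, t):
    """Regressors [1, RV_{t-1}, weekly average, monthly average] for target RV_t, as in equation (28)."""
    if t < 22:
        raise ValueError("need at least 22 lags of RV")
    daily = rv[t - 1]
    weekly = sum(rv[t - 5:t]) / 5.0      # average of RV_{t-1}, ..., RV_{t-5}
    monthly = sum(rv[t - 22:t]) / 22.0   # average of RV_{t-1}, ..., RV_{t-22}
    return [1.0, daily, weekly, monthly]


def har_forecast(rv, t, betas):
    """Forecast of RV_t given coefficients (beta_0, beta_d, beta_w, beta_m)."""
    return sum(b * x for b, x in zip(betas, har_regressors(rv, t)))


# Hypothetical RV series 1, 2, ..., 30 and illustrative coefficients.
rv = [float(i) for i in range(1, 31)]
print(har_regressors(rv, 22))
print(har_forecast(rv, 22, (0.0, 1 / 3, 1 / 3, 1 / 3)))
```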
The benchmark method ranks 9th out of the 13 estimators, and is significantly beaten, at the 0.05 level, by two local methods, based on VIX or time and VIX.¹⁷ The latter of these is the local method that is selected using the validation sample, and the GW statistic comparing this method to the benchmark is -2.66, strongly rejecting the benchmark in favor of the local estimator. The 95% model confidence set contains just one estimator, the local method using time and VIX as state variables. These results reveal that even the more challenging HAR model can be improved by recognizing that it, too, is misspecified, and by tilting the parameters of the model to reflect the current environment as captured by the state variable.¹⁸

[ INSERT TABLE 2 ABOUT HERE ]

To illustrate how local and non-local estimation lead to different forecasts, Figure 3 presents volatility forecasts over the last 18 months of the sample period obtained from the best local and non-local HAR models in Table 2. We see that for much of the period, volatility is low and the two methods yield very similar forecasts. The methods differ most markedly during the market turmoil in March 2020, where we observe that the local HAR produces forecasts that rose more quickly as market turbulence increased, and then fell more quickly in the subsequent weeks.

[ INSERT FIGURE 3 ABOUT HERE ]

¹⁷ In the validation sample we find that when time is combined with RV, FFR or 10Y-2Y, the effective optimal bandwidths for the latter variables are infinite, and so these local methods reduce to one that just uses time as a state variable. In the OOS period, this means that there are apparently four methods tied for third place, though of course the latter three are redundant given the first, and so it is perhaps more correct to say that the benchmark model ranks 6th out of the 10 unique competing methods.

¹⁸ Table S2 in the supplemental appendix presents results when the HAR-X model is taken as the baseline model. We find 21 methods significantly beat the benchmark, and the 95% MCS includes just two methods, both local versions of the HAR-X model using RV or time and RV as state variables.

3.3 VaR and ES forecasting

We now consider models for forecasting two key quantities in risk management: Value-at-Risk (VaR) and Expected Shortfall (ES). For a given probability level $\alpha$, usually set at 5%, these two measures are defined as the $\alpha$-quantile and the expected value conditional on being below the $\alpha$-quantile, both conditional on information set $\mathcal{F}_{t-1}$:

$$Y_t \mid \mathcal{F}_{t-1} \sim F_t \tag{29}$$

$$[\mathrm{VaR}_t, \mathrm{ES}_t] \equiv \left[ F_t^{-1}(\alpha), \; E[Y_t \mid Y_t \le \mathrm{VaR}_t, \mathcal{F}_{t-1}] \right] \tag{30}$$

While VaR is simply a quantile of the conditional distribution of the asset return under analysis, and thus estimation and forecasting of this measure can be done using the large literature on quantile forecasting (see Komunjer, 2013, for a review), models for ES are relatively lacking. This is perhaps in part due to the fact that this risk measure is not "elicitable" (Gneiting, 2011), meaning that without strong assumptions there is no loss function that allows for its direct estimation. This hurdle was overcome by Fissler and Ziegel (2016), who proposed a class of loss functions that allows for the joint estimation of VaR and ES. We will focus on a leading member of this class, the "FZ0" loss function considered in Nolde and Ziegel (2017) and Patton et al. (2019):

$$L(y, v, e; \alpha) = -\frac{1}{\alpha e} \mathbf{1}\{y \le v\}(v - y) + \frac{v}{e} + \log(-e) - 1 \tag{31}$$

With this loss function in hand, researchers can estimate models for VaR and ES directly (rather than indirectly via, for example, models for the entire predictive distribution) and competing forecasts of VaR and ES can be compared via their out-of-sample average FZ0 loss. Throughout, we consider a probability level, $\alpha$, of 5%.

We take as the baseline model the zero-mean GARCH model, see Equation (26). Using this model, forecasts for VaR and ES are obtained as:

$$[\mathrm{VaR}_t, \mathrm{ES}_t] = [a, b] \cdot \sigma_t \tag{32}$$

where $b < a < 0$ are the tail proportionality coefficients linking VaR and ES to volatility. If these parameters are estimated along with those of the GARCH model by minimizing the in-sample average FZ0 loss we obtain the "GARCH-FZ" model of Patton et al. (2019).
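The FZ0 loss in equation (31) and the scaling in equation (32) are simple to evaluate directly. The sketch below is illustrative only: the function names are hypothetical, and the values $a = -1.645$ and $b = -2.063$ are the 5% quantile and expected shortfall of a standard normal, used here as example inputs rather than estimates from the paper.

```python
import math


def fz0_loss(y, v, e, alpha=0.05):
    """FZ0 loss of equation (31) for a return y, VaR forecast v and ES forecast e (requires e < 0)."""
    if e >= 0:
        raise ValueError("the ES forecast e must be negative")
    hit = 1.0 if y <= v else 0.0  # indicator 1{y <= v}
    return -hit * (v - y) / (alpha * e) + v / e + math.log(-e) - 1.0


def var_es_from_vol(sigma, a, b):
    """Equation (32): [VaR, ES] = [a, b] * sigma, with b < a < 0."""
    if not b < a < 0:
        raise ValueError("need b < a < 0")
    return a * sigma, b * sigma


# Illustrative inputs: volatility of 2 and standard normal tail coefficients.
var_t, es_t = var_es_from_vol(2.0, a=-1.645, b=-2.063)
print(var_t, es_t)
print(fz0_loss(-5.0, var_t, es_t))  # a return beyond the VaR incurs the extra penalty term
print(fz0_loss(0.5, var_t, es_t))   # a return above the VaR does not
```

Averaging `fz0_loss` over the out-of-sample period gives the quantity used to rank competing VaR and ES forecasts.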
We found that "localizing" these coefficients works poorly for forecasting, perhaps unsurprisingly as it combines nonparametrics and tail estimation, two data-intensive tasks.¹⁹ Instead, we estimate $[a, b]$ using the standardized residuals based on the standard QML GARCH series, and only localize the GARCH model parameters. This leads to the GARCH-EDF model, Equation (26) and:

$$\left[ \hat{a}_t, \hat{b}_t \right] \equiv \left[ \hat{F}_{\varepsilon,t}^{-1}(\alpha), \; \frac{1}{\alpha t} \sum_{s=1}^{t} \varepsilon_s \mathbf{1}\left\{ \varepsilon_s \le \widehat{\mathrm{VaR}}_{\varepsilon} \right\} \right] \tag{33}$$

where $\varepsilon_t \equiv Y_t / \sigma_t$, $\hat{F}_{\varepsilon,t}^{-1}$ is the sample $\alpha$-quantile of $\varepsilon_t$, and the GARCH process parameters are estimated by minimizing the FZ0 loss function. For the non-local estimation we obtain parameters by minimizing the in-sample average FZ0 loss function using the full estimation window, or windows of length 250, 500 or 1000 observations. For local M estimation, we follow the same method as in the previous sections: we consider a total of nine possible state variables, with bandwidth parameters optimized using the second half of the estimation sample.

Table 3 presents results on the out-of-sample forecast performance of the various estimation methods. The local method selected using the validation sample, which uses time and VIX as state variables, turns out to also perform best in the OOS period, and it significantly beats the benchmark, which is ranked 9th, with a GW statistic of -3.23. Two other methods have significantly lower OOS loss than the benchmark, and both are local methods, using RV or VIX as state variables. The MCS contains four methods: the three local methods just discussed, as well as non-local estimation with a window of length 1000, though the latter of these does not significantly beat the benchmark method. Similar to the GARCH and HAR applications, we do not find that the yield curve-based state variables (FFR and 10Y-2Y) are helpful for forecasting these risk measures; in this application the optimal bandwidths for these state variables are found to be infinite.

[ INSERT TABLE 3 ABOUT HERE ]

¹⁹ Table S.3 in the supplemental appendix is analogous to Table 3, discussed below, using the GARCH-FZ as the benchmark model. There we see that some local methods significantly beat the non-local benchmark, but overall the performance is worse, and for this reason we focus on GARCH-EDF as the baseline model.

3.4 Yield curve forecasting

In our final empirical application we consider the popular "dynamic Nelson-Siegel" model for predicting the term structure of bond yields proposed by Diebold and Li (2006).
Denoting $y_t(\tau)$ as the yield on a bond with maturity $\tau$ at time $t$, this model starts from the Nelson and Siegel (1987) model for a term structure of yields:

$$y_t(\tau) = \beta_{1,t} + \beta_{2,t}\left(\frac{1 - \exp\{-\lambda_t \tau\}}{\lambda_t \tau}\right) + \beta_{3,t}\left(\frac{1 - \exp\{-\lambda_t \tau\}}{\lambda_t \tau} - \exp\{-\lambda_t \tau\}\right) + e_t \tag{34}$$

This specification has four free parameters: the betas affect the level, slope and curvature of the yield curve, while $\lambda_t$ determines (among other things) the maturity at which the curvature factor has a turning point. These parameters can be estimated jointly, period-by-period, using nonlinear least squares, or if $\lambda_t$ is fixed at some pre-determined value the remaining parameters can be obtained analytically using OLS. We follow Diebold and Li (2006) and set $\lambda_t = 0.0609 \; \forall t$ so that the curvature term peaks at 30 months and the model can be estimated by OLS.

Moving beyond describing yield curves to predicting them, Diebold and Li (2006) proposed modeling the observed sequences $\{\beta_{i,t}\}_{t=1}^{T}$, for $i = 1, 2, 3$, as AR(1) processes:

$$\beta_{i,t+1} = \phi_{0i} + \phi_{1i}\beta_{i,t} + e_{i,t+1} \tag{35}$$

That is, on each day in the estimation window the vector $[\beta_{1,t}, \beta_{2,t}, \beta_{3,t}]$ is obtained from the cross-section of yields, and then from the time series of these parameters the predicted value of the vector for the next period is obtained by estimating an AR(1) model via OLS. Inserting those forecasts into the Nelson-Siegel functional form then provides a forecast for the next-period yield curve, and, combined, equations (34) and (35) comprise the "dynamic Nelson-Siegel" (DNS) model.

We consider local versions of the DNS model, where the three AR(1) models are estimated via local OLS based on one of the nine state variables used in the previous analyses. Local OLS estimation of this model simplifies to weighted OLS (see, e.g., Cleveland and Devlin (1988) and Fan et al. (1998)), with the weights coming directly from the state variable and the kernel, and, as for OLS, the estimated local parameters are available in closed form. We use the same state variable (and same bandwidth value) for all three AR(1) models, although that could be relaxed.²⁰ We additionally consider the usual, non-local, DNS model, estimated on the full sample, as well as windows of length 250, 500 and 1000 observations. We use daily data over the period January 2000 to June 2021, and we consider bonds with maturities of three and six months, and one to ten years, a total of twelve maturities.²¹ We summarize the predictive performance of this model by summing the squared OOS forecast errors across maturities.

²⁰ We choose the bandwidth to minimize the sum of the MSEs across the three AR(1) models; however, it is possible to consider different state variables, with different bandwidths, for each of the three AR(1) models for the betas. We have not considered this extension.

Table 4 presents the results for two forecast horizons, one day and twenty days. The results in Panel A, for the one-day horizon, are humbling for the local methods: the two best methods are non-local OLS estimation using (relatively short) windows of 250 and 500 observations. This negative result connects to the theoretical analysis in Section 2.4, in that the best non-local method in this application has an $R^2$ of 0.964, leaving very little room for improvement by a competing method. The best local method from the validation sample uses time and RV as state variables, and it significantly beats the benchmark method (GW statistic of -12.87); however, it is significantly beaten by the non-local method with a window of 500 observations, with a t-statistic of 5.53.

In Panel B of Table 4 we present results for the 20-day horizon, and for this more challenging forecasting problem we see that local estimation leads to improved OOS performance. The benchmark method ranks 8th out of the 13 estimators, and it is significantly beaten by six alternative methods.²² Comparing the best non-local method with the local method selected using the validation sample, which is one that uses time and VIX as state variables, we obtain a GW t-statistic of -6.91, strong evidence that the local method out-performs the non-local benchmark. The 95% model confidence set contains, effectively, just one estimator, the local method using time as the state variable.

Combined, the results from the yield curve forecasting application highlight the upsides and the downsides of local estimation.
When the baseline model is very good, as it is for the one-day forecast horizon, there is little scope for an alternative estimation method to offer any gains. However, for more difficult forecasting problems, alternative estimation methods like the local methods proposed here offer the possibility of yielding improved forecasts.

[ INSERT TABLE 4 ABOUT HERE ]

²¹ We obtain one- to ten-year yields from https://www.federalreserve.gov/data/nominal-yield-curve.htm and data on three- and six-month yields, as well as the FFR and 10Y-2Y, from the St. Louis Fed "FRED" database.

²² In this application, the Federal Funds Rate turns out to be an uninformative state variable: the validation sample-optimal bandwidth is infinity, both when considered alone and when considered jointly with time. The other yield curve state variable, 10Y-2Y, also adds nothing when combined with time. As the local method using time alone performs best in the OOS period, this leads to an apparent tie for first place, though naturally the second and third models add nothing beyond the first.

3.5 Conditional comparisons of forecast performance

In all of the above analyses we focused on the average out-of-sample (OOS) performance of local and non-local methods for estimating a forecasting model. However, if the forecast user has an idea for a state variable that may be useful for tilting the estimated model parameters, this variable may also be useful for predicting which method is likely to outperform in the next period. We investigate this idea in three ways: via linear regression, nonparametric regression, and a test of uniform predictive performance. In each case we compare the local method with the best performance in the validation sample (these are marked with * in each of Tables 1 to 4) to the benchmark non-local method. The state variable used is the same as that in the local method: RV for the GARCH and yield curve ($h=1$) applications, and VIX for the HAR, VaR-ES and yield curve ($h=20$) applications.

Table 5 presents the results of a simple linear regression of OOS loss differences on a constant and the lagged state variable, as proposed in Giacomini and White (2006). We de-mean the state variable so that the intercept of this regression corresponds to the difference in average OOS loss, and the t-statistics associated with the intercept are exactly the GW statistics for the unconditional comparisons in Tables 1 to 4. The t-statistics on the slope coefficient reveal whether the state variable can (linearly) predict future differences in realized losses. For all five applications the state variable is either RV or VIX, and we see that the slope coefficient is positive in all of these cases, indicating that the local method does relatively worse when volatility is high. Only in the GARCH and VaR-ES applications, however, is the slope coefficient significantly different from zero.
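The regression behind this comparison can be sketched in a few lines. The example assumes the loss difference is defined as local minus non-local loss (consistent with negative GW statistics favoring the local method); the function name and the data are hypothetical. Standard errors, which in the paper's GW tests are Newey-West, are not computed here.

```python
def conditional_regression(loss_diff, state):
    """OLS of the loss difference d_t on a constant and the de-meaned lagged state S_{t-1}.

    Returns (intercept, slope). Because the regressor is de-meaned, the
    intercept equals the average loss difference over the regression sample.
    """
    d = loss_diff[1:]    # d_t for t = 1, ..., n-1
    s_lag = state[:-1]   # S_{t-1}, aligned with d_t
    n = len(d)
    s_bar = sum(s_lag) / n
    x = [s - s_bar for s in s_lag]
    d_bar = sum(d) / n
    slope = sum(xi * (di - d_bar) for xi, di in zip(x, d)) / sum(xi * xi for xi in x)
    intercept = d_bar - slope * (sum(x) / n)  # the mean of x is zero, so this is d_bar
    return intercept, slope


# Hypothetical data in which d_t = 0.5 * S_{t-1} - 1 exactly.
state = [1.0, 2.0, 3.0, 4.0, 5.0]
loss_diff = [9.0, -0.5, 0.0, 0.5, 1.0]  # the first entry is unused (no lagged state)
print(conditional_regression(loss_diff, state))
```

A positive slope in this setup means the loss difference grows with the lagged state variable, i.e. the local method does relatively worse when the state variable is high.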
[ INSERT TABLE 5 ABOUT HERE ]

To gain a more nuanced understanding of the relationship between OOS loss differences and the state variable, Figures 4 and 5 present a simple nonparametric kernel smooth of this relationship, along with pointwise 95% confidence intervals.²³ These plots allow us to see if the loss difference is particularly positive or negative in some part of the support of the state variable. In the upper panel of Figure 4 we see that local QML strongly outperforms non-local QML for GARCH models when volatility is relatively low. When annualized RV is above about 15% the difference in performance is approximately zero, and for RV above about 20% non-local QML shows some evidence of outperforming local QML.²⁴ Similar results hold for the VaR-ES comparison. In Figure 5, as well as the middle panel of Figure 4, we see that the predicted OOS loss difference is almost constant in the state variable. For the one-day horizon yield curve forecasts, in the upper panel of Figure 5, we observe some evidence that the outperformance of the local method is particularly strong when volatility is low (the loss difference is more negative), but for RV above about 8% the relationship is approximately flat.

Finally, we use the recently-proposed "conditional superior predictive ability" (CSPA) test of Li et al. (2021) to test whether the non-local method has weakly lower expected loss across the entire support of the state variable:

$$H_0 : E\left[ L\left(Y_{t+1}, g_t\left(\tilde{\theta}_{h,t}(S_t)\right)\right) - L\left(Y_{t+1}, g_t\left(\hat{\theta}_t\right)\right) \,\Big|\, S_t = s \right] \ge 0 \quad \forall s \in \mathrm{Int}(\mathcal{S}) \tag{36}$$

as well as the hypothesis where the inequality in equation (36) is reversed. In the GARCH application, we reject the first null (p-value less than 0.01) and conclude that non-local QML does not weakly dominate local QML uniformly, which is unsurprising given the estimated average loss presented in the upper panel of Figure 4. We fail to reject the reverse hypothesis (p-value of 0.99), meaning that local QML may indeed dominate non-local QML, and combined these results indicate that local QML is strongly preferred to non-local QML.
We find the same outcomes for the HAR and both yield curve ($h = 1$ and $h = 20$) applications: local estimation is strongly preferred to non-local estimation. In contrast, in the VaR-ES application we fail to reject either null at the 0.05 level, despite local estimation dominating non-local estimation unconditionally, and outperforming pointwise for low values of VIX as in Figure 4. This outcome may be due to a relative lack of power in this application, which is focused on the 5% tail of the distribution of returns.

[ INSERT FIGURES 4 AND 5 ABOUT HERE ]

²³ The estimate and confidence intervals are computed using Theorem 2.2 of Li and Racine (2007).

²⁴ It is possible to construct a "hybrid" forecast based on the local and non-local methods by switching between them according to which method is predicted to have lower loss in the subsequent period; see Giacomini and White (2006) and Timmermann and Zhu (2021) for example. We do not pursue this extension here.

4 Conclusion

This paper proposes an estimation method to improve the forecasts produced by a misspecified forecasting model, without altering the form of the underlying model. In many decision-making environments, the statistical model is "hardwired," at least in the short term, and replacing it with a new and improved model is not possible. This may be because changing the model requires regulatory approval, or approval from a high-level committee, or because the time taken to embed a new model in the decision-making process is long relative to the competitive environment. We overcome this hurdle by maintaining the functional form of the baseline model and improving its fit by upweighting past observations that look more similar to the forecast date, and downweighting observations that are more dissimilar, drawing on methods like local OLS estimation and local MLE (see Tibshirani and Hastie (1987), Cleveland and Devlin (1988) and Fan et al. (1998)), as well as older methods like exponential smoothing (see Brown (1956) and Muth (1960)).

We theoretically compare out-of-sample forecasts from the proposed estimation method with those from the baseline model and observe a familiar bias-variance trade-off. Interestingly, the bias-variance trade-off for the proposed method goes in the opposite direction to the usual one for out-of-sample forecasting: the proposed estimation method (generally) adds variance to the forecast, in the hope of reducing the bias from using the misspecified baseline model. Our theoretical analysis sheds light on the conditions that are likely to be favorable for the local estimation method proposed here.
Specifically, the baseline model cannot be "too good" and the forecaster's state variable summarizing the environment at the forecast date cannot be "too bad."

We apply the proposed method to four economic forecasting problems. The first two applications consider volatility forecasting, using daily data and the famous GARCH model of Bollerslev (1986) or high frequency data and the popular HAR model of Corsi (2009). The third application is to risk management, and focuses on joint forecasts of Value-at-Risk and Expected Shortfall. The fourth application is to yield curve forecasts, made using the "dynamic Nelson-Siegel" model proposed by Diebold and Li (2006). We find that our proposed method provides statistically significant improvements over the baseline methods in almost all cases.

References

[1] Andersen, T.G., T. Bollerslev and F.X. Diebold, 2007, Roughing it up: Disentangling continuous and jump components in measuring, modeling and forecasting asset return volatility, Review of Economics and Statistics, 89(4), 701-720.
[2] Ang, A., G. Bekaert and M. Wei, 2007, Do macro variables, asset markets, or surveys forecast inflation better? Journal of Monetary Economics, 54(4), 1163-1212.
[3] Ang, A. and D. Kristensen, 2012, Testing conditional factor models, Journal of Financial Economics, 106, 132-156.
[4] Basel Committee on Banking Supervision, 2010, Basel III: A Global Regulatory Framework for More Resilient Banks and Banking Systems, Bank for International Settlements. http://www.bis.org/publ/bcbs189.pdf.
[5] Beare, B.K., 2010, Copulas and temporal dependence, Econometrica, 78, 395-410.
[6] Blasques, F., S.J. Koopman, M. Mallee, and Z. Zhang, 2016, Weighted maximum likelihood for dynamic factor analysis and forecasting with mixed frequency data, Journal of Econometrics, 193, 405-417.
[7] Bollerslev, T., 1986, Generalized autoregressive conditional heteroskedasticity, Journal of Econometrics, 31, 307-327.
[8] Brown, R.G., 1956, Exponential smoothing for predicting demand. Arthur D. Little Inc., Cambridge, Massachusetts.
[9] Capistran, C. and A. Timmermann, 2009, Forecast combination with entry and exit of experts, Journal of Business & Economic Statistics, 27, 429-440.
[10] Chen, X., and Y. Fan, 2006, Estimation of copula-based semiparametric time series models, Journal of Econometrics, 130, 307-335.
[11] Christoffersen, P., K. Jacobs, C. Ornthanalai and Y. Wang, 2008, Option valuation with long-run and short-run volatility components, Journal of Financial Economics, 90, 272-297.
[12] Cleveland, W.S. and S.J. Devlin, 1988, Locally weighted regression: An approach to regression analysis by local fitting, Journal of the American Statistical Association, 83, 596-610.
[13] Corsi, F., 2009, A simple approximate long-memory model of realized volatility, Journal of Financial Econometrics, 7(2), 174-196.

[14] Dendramis,Y.,G.KapetaniosandM.Marcellino,2020,Asimilarity-basedapproachformacroeconomic forecasting, Journal of the Royal Statistical Society, Series A, 183(3), 801-827. [15] Diebold, F.X. and C. Li, 2006, Forecasting the term structure of government bond yields, Journal of Econometrics, 130, 337-364. [16] Diebold, F.X. and R. Mariano, 1995, Comparing predictive accuracy, Journal of Business and Economic Statistics, 13, 253-265. [17] Engle, R.F. and G.G.J. Lee, 1999, A permanent and transitory component model of stock return volatility, in Cointegration, Causality, and Forecasting: A Festschrift in Honour of Clive W. J. Granger. R.F. Engle and H. White, eds. Oxford University Press, pp. 475(cid:150)97. [18] Fan, J. Y. Wu and Y. Feng, 2009, Local quasi-likelihood with a parametric guide, Annals of Statistics, 37(6B), 4153-4183. [19] Fan, J., M. Farmen and I. Gijbels, 1998, Local maximum likelihood estimation and inference, Journal of the Royal Statistical Society, Series B, 60(3), 591-608. [20] Faust, J. and J.H. Wright, 2009, Comparing Greenbook and reduced form forecasts using a large realtime dataset, Journal of Business & Economic Statistics, 27(4), 468-479. [21] Fissler, T. and J.F. Ziegel, 2016, Higher order elicitability and Osband(cid:146)s principle, Annals of Statistics, 44(4), 1680-1707. [22] Giacomini, R. and H. White, 2006. Tests of conditional predictive ability, Econometrica, 74, 1545-1578. [23] Giacomini, R. and B. Rossi, 2010, Forecast comparisons in unstable environments, Journal of Applied Econometrics, 25(4), 595-620. [24] Giacomini, R. and G. Ragusa, 2014, Theory-coherent forecasting, Journal of Econometrics, 182, 145-155. [25] Gneiting, T., 2011, Making and evaluating point forecasts, Journal of the American Statistical Association, 106, 746(cid:150)762. [26] Granger, C.W.J., 1969, Prediction with a generalized cost of error function, OR, 20(2), 199- 207. [27] Hansen, P.R. and A. 
Lunde, 2005, A forecast comparison of volatility models: Does anything beat a GARCH (1,1)? Journal of Applied Econometrics, 20(7), 873-889. [28] Hansen, P.R., A. Lunde, and J.M. Nason, 2011, The model con(cid:133)dence set, Econometrica, 79(2), 453-497. [29] Hansen, P.R. and E.-I. Dumitrescu, 2021, How should parameter estimation be tailored to the objective? Journal of Econometrics, forthcoming. [30] Hu, F., 1997, The asymptotic properties of the maximum-relevance weighted likelihood estimators, Canadian Journal of Statistics, 25(1), 45-59. 28

[31] Hu, F. and J. V. Zidek, 2002, The weighted likelihood, Canadian Journal of Statistics, 30(3), 347-371. [32] Inoue, A., L. Jin and B. Rossi, 2017, Rolling window selection for out-of-sample forecasting with time-varying parameters, Journal of Econometrics, 196, 55-67. [33] Inoue,A.,L.Jin,D.Pelletier,2020,Local-linearestimationoftime-varying-parameterGARCH models and associated risk measures, Journal of Financial Econometrics, 19(1), 202-234. [34] Komunjer, I., 2013, Quantile prediction, in G. Elliott and A. Timmermann (eds), Handbook of Economic Forecasting, Volume 2, 961-994, Elsevier, Oxford. [35] Kristensen, D. and A. Mele, 2011, Adding and subtracting Black-Scholes: A new approach to approximating derivative prices in continuous-time models, Journal of Financial Economics, 102, 390-415. [36] Li,Q.andJ.S.Racine,2007,NonparametricEconometrics,PrincetonUniversityPress,Princeton. [37] Li, J., Z. Liao and R. Quaedvlieg, 2021, Conditional superior predictive ability, Review of Economic Studies, forthcoming. [38] Manganelli, S., 2009, Forecasting with judgment, Journal of Business & Economic Statistics, 27(4), 553-563. [39] Muth,J.F.,1960,Optimalpropertiesofexponentiallyweightedforecasts,Journal of the American Statistical Association, 55(290) 299-306. [40] Nelson,C.R.andA.F.Siegel,1987,Parsimoniousmodelingofyieldcurve,Journal of Business, 60, 473-489. [41] Nolde, N. and J.F. Ziegel, 2017, Elicitability and backtesting: Perspectives for banking regulation, Annals of Applied Statistics, 11(4), 1833-1874. [42] Patton, A.J., 2020, Comparing possibly misspeci(cid:133)ed forecasts, Journal of Business & Economic Statistics, 38(4), 796-809. [43] Patton, A.J., J.F. Ziegel and R. Chen, 2019, Dynamic semiparametric models for expected shortfall (and value-at-risk), Journal of Econometrics, 211(2), 388-413. [44] Pesaran,M.H.,A.Pick,andM.Pranovich,2013.Optimalforecastsinthepresenceofstructural breaks. Journal of Econometrics, 177, 134-152. [45] Pettenuzzo, D., A. 
Timmermann and R. Valkanov, 2014, Forecasting stock returns under economic constraints, Journal of Financial Economics, 114, 517-553. [46] Richter,S.andE.Smetanina,2020,Forecastevaluationandselectioninunstableenvironments, working paper, Chicago Booth. [47] Tibshirani, R. and T. Hastie, 1987, Local likelihood estimation, Journal of the American Statistical Association, 82(398), 559-567. 29

[48] Timmermann, A. and Y. Zhu, 2021, Monitoring forecasting performance, Journal of Econometrics, forthcoming. [49] Weiss, A.A., 1996, Estimating time series models using the relevant cost function, Journal of Applied Econometrics, 11(5), 539-560. [50] Zumbach, G., 2006, The RiskMetrics 2006 methodology, working paper, RiskMetrics Group, Geneva, Switzerland. 30

Table 1: Out-of-sample forecast performance for GARCH(1,1) models

Column groups: Method details (Rank to Window); Forecast performance (AvgLoss to MCS).

| Rank | StateVar | Bwidth | Window | AvgLoss | GW stat | MCS |
|---|---|---|---|---|---|---|
| 1* | time,RV | 0.9995,0.34 | full | 0.320 | -10.316 | ✓ |
| 2 | RV | 0.37 | full | 0.325 | -10.395 | × |
| 3 | time,VIX | 0.995,0.28 | full | 0.333 | -6.195 | × |
| 4 | VIX | 0.32 | full | 0.349 | -6.001 | × |
| 5 | time | 0.995 | full | 0.371 | -5.427 | × |
| 6 | - | - | 500 | 0.375 | -4.758 | × |
| 7 | - | - | 250 | 0.376 | -2.817 | × |
| 8 | time,10Y-2Y | 0.9975,0.25 | full | 0.380 | -3.449 | × |
| 9 | time,FFR | 0.9975,0.49 | full | 0.381 | -3.855 | × |
| 10 | - | - | 1000 | 0.382 | -4.494 | × |
| 11 | FFR | 1.81 | full | 0.400 | -1.592 | × |
| 12 | - | - | full | 0.402 | ★ | × |
| =12 | 10Y-2Y | ∞ | full | 0.402 | 0.000 | × |

Notes: This table presents measures of forecast performance over the out-of-sample period (January 2011 to June 2021) from GARCH(1,1) models estimated using either QML (non-local) or local QML. The rows are ordered by average OOS QLIKE loss, reported in the third-last column. The local method with the best performance in the validation sample (the second half of the estimation sample) is marked in the first column with *. The local estimators use the state variable(s) given in the second column and bandwidth parameter(s) from the third column, which are selected using the validation sample. The fourth column reports the window of data used in estimation, where “full” implies the entire in-sample period (2737 observations). The penultimate column reports Giacomini-White t-statistics of each model relative to the benchmark method (marked with ★), which is taken as the non-local method using the full estimation window, with negative t-statistics indicating lower average loss. The final column includes a check mark if a given method is included in the 95% model confidence set, and a cross otherwise.
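The notes to Table 1 rank methods by average QLIKE loss and compare each method with the benchmark through a t-statistic on the loss difference. As a rough illustration only, the sketch below assumes one common normalization of QLIKE, L = σ²/h − log(σ²/h) − 1, and a plain t-statistic on the mean loss difference with a Newey-West long-run variance. The function names, the lag choice and the simulated data are hypothetical, and the paper's Giacomini-White statistic may be computed differently.

```python
import numpy as np

def qlike(proxy, forecast):
    """QLIKE loss for a variance forecast; a perfect forecast scores 0 (assumed form)."""
    ratio = np.asarray(proxy, dtype=float) / np.asarray(forecast, dtype=float)
    return ratio - np.log(ratio) - 1.0

def loss_diff_tstat(loss_a, loss_b, lags=0):
    """t-statistic for mean(loss_a - loss_b); negative values favor method A.

    Uses a Newey-West (Bartlett kernel) long-run variance with `lags` lags;
    lags=0 gives the usual i.i.d. standard error.
    """
    d = np.asarray(loss_a, dtype=float) - np.asarray(loss_b, dtype=float)
    n = d.size
    u = d - d.mean()
    lrv = u @ u / n
    for k in range(1, lags + 1):
        weight = 1.0 - k / (lags + 1.0)
        lrv += 2.0 * weight * (u[k:] @ u[:-k]) / n
    return d.mean() / np.sqrt(lrv / n)

# Simulated illustration: the true variance is 1 and the proxy is a noisy,
# unbiased measurement of it. Method A forecasts 1, method B forecasts 2.
rng = np.random.default_rng(0)
proxy = rng.chisquare(df=20, size=2000) / 20.0
loss_a = qlike(proxy, np.full(2000, 1.0))
loss_b = qlike(proxy, np.full(2000, 2.0))
t_stat = loss_diff_tstat(loss_a, loss_b, lags=5)
print(loss_a.mean(), loss_b.mean(), t_stat)
```

In this setup the unbiased forecast has the lower average loss, so the t-statistic is negative, matching the sign convention in the table notes.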

Table 2: Out-of-sample forecast performance for HAR models

Column groups: Method details (Rank to Window); Forecast performance (AvgLoss to MCS).

| Rank | StateVar | Bwidth | Window | AvgLoss | GW stat | MCS |
|---|---|---|---|---|---|---|
| 1* | time,VIX | 0.999,0.62 | full | 0.246 | -2.655 | ✓ |
| 2 | VIX | 1.8 | full | 0.252 | -4.610 | × |
| 3 | time | 0.995 | full | 0.252 | -0.291 | × |
| =3 | time,RV | 0.995,∞ | full | 0.252 | -0.291 | × |
| =3 | time,FFR | 0.995,∞ | full | 0.252 | -0.291 | × |
| =3 | time,10Y-2Y | 0.995,∞ | full | 0.252 | -0.291 | × |
| 7 | 10Y-2Y | 1.91 | full | 0.253 | -1.318 | × |
| 8 | RV | 2.86 | full | 0.253 | -0.362 | × |
| 9 | - | - | full | 0.253 | ★ | × |
| 10 | - | - | 500 | 0.253 | 0.046 | × |
| 11 | FFR | 2.3 | full | 0.253 | 0.922 | × |
| 12 | - | - | 250 | 0.255 | 0.642 | × |
| 13 | - | - | 1000 | 0.300 | 1.056 | × |

Notes: This table presents measures of forecast performance over the out-of-sample period (January 2011 to June 2021) from HAR models estimated using either QML (non-local) or local QML. The rows are ordered by average OOS QLIKE loss, reported in the third-last column. The local method with the best performance in the validation sample (the second half of the estimation sample) is marked in the first column with *. The local estimators use the state variable(s) given in the second column and bandwidth parameter(s) from the third column, which are selected using the validation sample. The fourth column reports the window of data used in estimation, where “full” implies the entire in-sample period (2737 observations). The penultimate column reports Giacomini-White t-statistics of each model relative to the benchmark method (marked with ★), which is taken as the non-local method using the full estimation window, with negative t-statistics indicating lower average loss. The final column includes a check mark if a given method is included in the 95% model confidence set, and a cross otherwise.

Table 3: Out-of-sample forecast performance for VaR-ES models

Column groups: Method details (Rank to Window); Forecast performance (AvgLoss to MCS).

| Rank | StateVar | Bwidth | Window | AvgLoss | GW stat | MCS |
|---|---|---|---|---|---|---|
| 1* | time,VIX | 0.9995,1.25 | full | -3.869 | -3.227 | ✓ |
| 2 | RV | 1.4 | full | -3.868 | -4.423 | ✓ |
| 3 | VIX | 1.24 | full | -3.863 | -2.013 | ✓ |
| 4 | - | - | 1000 | -3.861 | -0.627 | ✓ |
| 5 | time | 0.9975 | full | -3.861 | -0.593 | × |
| =5 | time,RV | 0.9975,∞ | full | -3.861 | -0.593 | × |
| =5 | time,FFR | 0.9975,∞ | full | -3.861 | -0.593 | × |
| =5 | time,10Y-2Y | 0.9975,∞ | full | -3.861 | -0.593 | × |
| 9 | - | - | full | -3.855 | ★ | × |
| 10 | 10Y-2Y | ∞ | full | -3.855 | 0.000 | × |
| 11 | FFR | ∞ | full | -3.855 | 0.000 | × |
| 12 | - | - | 500 | -3.844 | 0.581 | × |
| 13 | - | - | 250 | -3.102 | 1.517 | × |

Notes: This table presents measures of forecast performance over the out-of-sample period (January 2011 to June 2021) from GARCH(1,1) models estimated using either M estimation or local M estimation and the FZ0 loss function in Equation (31). The rows are ordered by average OOS FZ0 loss, reported in the third-last column. For a given model, the local method with the best performance in the validation sample (the second half of the estimation sample) is marked in the first column with *. The local estimators use the state variable(s) given in the second column and bandwidth parameter(s) from the third column, which are selected using the validation sample. The fourth column reports the window of data used in estimation, where “full” implies the entire in-sample period (2737 observations). The penultimate column reports Giacomini-White t-statistics of each model relative to the benchmark method (marked with ★), which is taken as the non-local method using the full estimation window, with negative t-statistics indicating lower average loss. The final column includes a check mark if a given method is included in the 95% model confidence set, and a cross otherwise.
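Equation (31) lies outside this part of the paper, so the following is only a sketch, assuming it matches the FZ0 loss as defined in Patton, Ziegel and Chen (2019): L = −1{Y ≤ v}(v − Y)/(αe) + v/e + log(−e) − 1, for a VaR forecast v and an ES forecast e < 0 at tail probability α. The function and variable names and the simulated standard normal returns are hypothetical.

```python
import numpy as np

def fz0_loss(y, var_fc, es_fc, alpha=0.05):
    """FZ0 loss for a (VaR, ES) forecast pair; requires es_fc < 0, lower is better."""
    y, var_fc, es_fc = (np.asarray(a, dtype=float) for a in (y, var_fc, es_fc))
    hit = (y <= var_fc).astype(float)          # 1 when the return breaches the VaR
    return (-hit * (var_fc - y) / (alpha * es_fc)
            + var_fc / es_fc + np.log(-es_fc) - 1.0)

# Simulated illustration with N(0,1) returns, whose 5% VaR and ES are known.
rng = np.random.default_rng(1)
y = rng.standard_normal(200_000)
true_var, true_es = -1.6449, -2.0627           # 5% quantile and tail mean of N(0,1)
avg_true = fz0_loss(y, true_var, true_es).mean()
avg_bad = fz0_loss(y, 0.5 * true_var, 0.5 * true_es).mean()
print(avg_true, avg_bad)
```

Under this definition the expected loss at the true VaR and ES equals log(−ES), so strongly negative average losses such as those in Table 3 are what one would expect when returns are measured in decimals and the ES is a small negative number.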

Table 4: Out-of-sample forecast performance for yield curve models

Column groups: Method details (Rank to Window); Forecast performance (AvgLoss to MCS).

Panel A: One-day forecast horizon

| Rank | StateVar | Bwidth | Window | AvgLoss | GW stat | MCS |
|---|---|---|---|---|---|---|
| 1 | - | - | 500 | 0.157 | -9.499 | ✓ |
| 2 | - | - | 250 | 0.158 | -5.071 | ✓ |
| 3 | time,VIX | 0.999,1.91 | full | 0.158 | -11.618 | × |
| 4* | time,RV | 0.9995,1.46 | full | 0.158 | -12.868 | × |
| 5 | time,10Y-2Y | 0.999,1.21 | full | 0.158 | -14.128 | × |
| 6 | time,FFR | 0.999,1.01 | full | 0.158 | -10.490 | × |
| 7 | time | 0.999 | full | 0.158 | -14.523 | × |
| 8 | RV | 1.43 | full | 0.158 | -9.099 | × |
| 9 | - | - | 1000 | 0.158 | -1.605 | × |
| 10 | FFR | 0.9 | full | 0.158 | -2.721 | × |
| 11 | VIX | 1.84 | full | 0.158 | -4.440 | × |
| 12 | 10Y-2Y | 1.26 | full | 0.158 | -5.881 | × |
| 13 | - | - | full | 0.158 | ★ | × |

Panel B: Twenty-day forecast horizon

| Rank | StateVar | Bwidth | Window | AvgLoss | GW stat | MCS |
|---|---|---|---|---|---|---|
| 1 | time | 0.999 | full | 0.241 | -6.542 | ✓ |
| =1 | time,FFR | 0.999,∞ | full | 0.241 | -6.542 | ✓ |
| =1 | time,10Y-2Y | 0.999,∞ | full | 0.241 | -6.542 | ✓ |
| 4 | time,RV | 0.999,2.18 | full | 0.242 | -6.304 | × |
| 5* | time,VIX | 0.9995,1.7 | full | 0.244 | -6.911 | × |
| 6 | VIX | 1.4 | full | 0.248 | -2.399 | × |
| 7 | 10Y-2Y | 2.08 | full | 0.250 | -0.422 | × |
| 8 | - | - | full | 0.250 | ★ | × |
| =8 | FFR | ∞ | full | 0.250 | 0.000 | × |
| 10 | RV | 1.5 | full | 0.250 | 1.095 | × |
| 11 | - | - | 500 | 0.250 | 0.172 | × |
| 12 | - | - | 1000 | 0.253 | 1.364 | × |
| 13 | - | - | 250 | 0.262 | 2.567 | × |

Notes to Table 4: This table presents measures of one- and twenty-day-ahead forecast performance over the out-of-sample period (January 2011 to June 2021) from dynamic Nelson-Siegel models estimated using either OLS or local OLS. The rows in each panel are ordered by average OOS RMSE, multiplied by 100, reported in the third-last column. The local method with the best performance in the validation sample is marked in the first column with *. The local estimators use the state variable(s) given in the second column and bandwidth parameter(s) from the third column, which are selected using the validation sample. The fourth column reports the window of data used in estimation, where “full” implies the entire in-sample period (2737 observations). The penultimate column reports Giacomini-White t-statistics of each model relative to the benchmark method (marked with ★), which is taken as the non-local method using the full estimation window, with negative t-statistics indicating lower average loss. The final column includes a check mark if a given method is included in the 95% model confidence set, and a cross otherwise.
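The dynamic Nelson-Siegel model behind Table 4 summarizes each day's yield curve with three factors (level, slope and curvature). As a minimal sketch, assuming the standard Nelson-Siegel loadings with maturities in months and the decay value 0.0609 used by Diebold and Li (2006), the factors for one date can be recovered by cross-sectional OLS. The maturities, factor values and function names below are illustrative and not taken from the paper.

```python
import numpy as np

def ns_loadings(maturities, lam=0.0609):
    """Nelson-Siegel loadings (level, slope, curvature) for maturities in months."""
    tau = np.asarray(maturities, dtype=float)
    slope = (1.0 - np.exp(-lam * tau)) / (lam * tau)
    curvature = slope - np.exp(-lam * tau)
    return np.column_stack([np.ones_like(tau), slope, curvature])

def fit_ns_factors(yields, maturities, lam=0.0609):
    """Cross-sectional OLS of one date's yields on the loadings; returns 3 factors."""
    X = ns_loadings(maturities, lam)
    beta, *_ = np.linalg.lstsq(X, np.asarray(yields, dtype=float), rcond=None)
    return beta

# Hypothetical maturities (months) and a yield curve built from known factors.
maturities = np.array([3, 6, 12, 24, 36, 60, 84, 120])
true_beta = np.array([4.0, -2.0, 1.0])         # level, slope, curvature
yields = ns_loadings(maturities) @ true_beta
beta_hat = fit_ns_factors(yields, maturities)
print(beta_hat)
```

In the dynamic version the fitted factors are then forecast over time. This section does not say at which step the local weights enter, so that part is left out of the sketch.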

Table 5: Conditional comparisons of forecasting models

| | GARCH | HAR | VaR-ES | Yield curve (h=1) | Yield curve (h=20) |
|---|---|---|---|---|---|
| Intercept | -0.082 | -0.007 | -0.014 | -0.894 | -29.593 |
| (std. err.) | (0.008) | (0.003) | (0.004) | (0.069) | (4.282) |
| [t-stat] | [-10.316] | [-2.655] | [-3.227] | [-12.868] | [-6.911] |
| Slope | 0.091 | 0.029 | 0.035 | 0.009 | 0.0716 |
| (std. err.) | (0.009) | (0.025) | (0.017) | (0.144) | (0.719) |
| [t-stat] | [10.440] | [1.182] | [2.104] | [0.061] | [0.010] |

Notes to Table 5: This table presents the estimated parameters and standard errors from a linear regression of out-of-sample loss differences on a constant and the lagged state variable, across the five applications considered in this paper. The methods compared in each column are the local method with the best performance in the validation sample (marked with * in each of Tables 1 to 4) and the non-local method using the full estimation sample. The state variable used for the comparison is the same one that appears in the local method: RV for the GARCH and yield curve (h=1) applications, VIX for the HAR and VaR-ES applications, and 10Y-2Y for the yield curve (h=20) application.
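The regression described in the notes to Table 5 can be sketched as follows. The intercept t-statistics in Table 5 coincide with the GW statistics of the starred rows in Tables 1 to 4, which is what one would see if the state variable were demeaned, so the sketch demeans it; this is an inference rather than something the notes state. The plain OLS standard errors, the function name and the simulated data are also assumptions.

```python
import numpy as np

def conditional_comparison(loss_diff, state_lagged):
    """OLS of loss differences on a constant and the demeaned lagged state variable.

    Returns (coef, std_err, t_stat), each ordered as [intercept, slope].
    Standard errors are the plain homoskedastic OLS ones (an assumption).
    """
    d = np.asarray(loss_diff, dtype=float)
    z = np.asarray(state_lagged, dtype=float)
    X = np.column_stack([np.ones_like(z), z - z.mean()])
    coef, *_ = np.linalg.lstsq(X, d, rcond=None)
    resid = d - X @ coef
    sigma2 = resid @ resid / (d.size - X.shape[1])
    std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef, std_err, coef / std_err

# Simulated illustration: the local method helps on average (negative mean
# loss difference), but its advantage shrinks as the state variable rises.
rng = np.random.default_rng(2)
state = rng.gamma(shape=4.0, scale=4.0, size=2500)
loss_diff = -0.08 + 0.01 * (state - state.mean()) + rng.normal(scale=0.3, size=2500)
coef, std_err, t_stat = conditional_comparison(loss_diff, state)
print(coef, std_err, t_stat)
```

With a demeaned regressor the intercept equals the average loss difference, so a negative intercept says the local method has lower loss on average, and a positive slope says that the advantage narrows for high values of the state variable.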

[Figure 1 image. Upper panel: “Conditional mean of Y(t) given Y(t-1)”, plotting Y(t) against Y(t-1); series: Data, Cond mean, OLS, Local OLS 1, Local OLS 2. Lower panel: “RMSE of different estimators relative to OLS”, plotted against the bandwidth for local OLS; series: Cond mean, OLS, Local OLS 1, Local OLS 2.]

Figure 1: The upper panel presents the expected value of $Y_t$ given $Y_{t-1}$ according to the DGP in equation (23), and estimates of this using a linear AR(1) estimated by OLS and local OLS with two different state variables: $Y_{t-1}$ and $Y_{t-2}$. The lower panel presents the RMSE of the different estimators as a function of the local OLS bandwidth parameters.
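The idea behind the local OLS estimates in Figure 1 can be sketched as a kernel-weighted least squares fit of the same linear model. This is a minimal illustration under stated assumptions: a Gaussian kernel in the state variable, a hypothetical function name, and a stylized noise-free example rather than the DGP in equation (23). The paper's kernel and bandwidth conventions may differ.

```python
import numpy as np

def local_ols(x, y, state, state_now, bandwidth):
    """Gaussian-kernel-weighted OLS of y on (1, x), localized around state_now."""
    x, y, state = (np.asarray(a, dtype=float) for a in (x, y, state))
    w = np.exp(-0.5 * ((state - state_now) / bandwidth) ** 2)
    sw = np.sqrt(w)                             # WLS = OLS on sqrt(w)-scaled data
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef                                 # [intercept, slope]

# Stylized example: the true conditional mean is |x|, but the model is linear in x.
x = np.linspace(-3.0, 3.0, 121)
y = np.abs(x)
coef_wide = local_ols(x, y, state=x, state_now=2.0, bandwidth=1e6)   # ~ plain OLS
coef_local = local_ols(x, y, state=x, state_now=2.0, bandwidth=0.3)
forecast_wide = coef_wide[0] + coef_wide[1] * 2.0
forecast_local = coef_local[0] + coef_local[1] * 2.0
print(forecast_wide, forecast_local)
```

With a very large bandwidth all observations receive nearly equal weight and the fit collapses to ordinary OLS, which misses the kink. With a small bandwidth the same linear model is fitted mostly to observations near the forecast point, which removes most of the bias in this example at the cost of using fewer effective observations.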

[Figure 2 image. Three panels plotted against RV: “Level of volatility”, “Reaction of volatility to news”, and “Persistence of volatility”; series: Local QML, QML.]

Figure 2: This plot shows the local QML estimates of transformations of the GARCH(1,1) parameters $(\omega, \beta, \alpha)$ as a function of realized volatility (RV). Also shown are the (non-local) QML parameter estimates. The upper, middle and lower panels plot $\sqrt{\omega/(1-\alpha-\beta)}$, $\alpha$, and $(\alpha+\beta)$ respectively.
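For reference, the three transformations in Figure 2 follow from the standard GARCH(1,1) recursion of Bollerslev (1986). The notation below ($r_t$ for the return, $h_t$ for the conditional variance) is an assumption, since this section does not show the paper's own equation.

```latex
h_t = \omega + \beta h_{t-1} + \alpha r_{t-1}^2 ,
\qquad
\mathbb{E}[h_t] = \omega + (\alpha + \beta)\,\mathbb{E}[h_{t-1}]
\;\Longrightarrow\;
\mathbb{E}[h_t] = \frac{\omega}{1 - \alpha - \beta}
\quad \text{when } \alpha + \beta < 1 .
```

The square root of $\omega/(1-\alpha-\beta)$ is therefore the long-run level of volatility, $\alpha$ measures the reaction to the latest squared return, and $\alpha+\beta$ governs how slowly the conditional variance reverts to its long-run level.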

[Figure 3 image. Title: “SP500 index volatility, 1/2020 to 6/2021”; vertical axis: annualized vol (%); horizontal axis: January 2020 to July 2021; series: HAR, Local HAR, RV.]

Figure 3: This figure shows the predicted volatility from a HAR model estimated using local or non-local QML, along with realized volatility, over the last 18 months of the sample period.

[Figure 4 image. Title: “Comparing forecasts from local and non-local models”. Panels: GARCH models (horizontal axis: RV), HAR models (horizontal axis: VIX), VaR-ES models (horizontal axis: VIX); vertical axis: expected loss diff; series: Cond expected loss, Approx 95% C.I.]

Figure 4: This figure presents estimates of the expected out-of-sample loss differences from models estimated via local or non-local methods, conditional on realized volatility (top panel) or VIX (lower two panels). Positive loss differences indicate the non-local method is preferred.

[Figure 5 image. Title: “Comparing forecasts from local and non-local DNS models”. Panels: Horizon = 1 day (horizontal axis: RV), Horizon = 20 days (horizontal axis: VIX); vertical axis: expected loss difference; series: Cond expected loss, Approx 95% C.I.]

Figure 5: This figure presents estimates of the expected out-of-sample loss difference of a dynamic Nelson-Siegel (DNS) model estimated via local OLS or non-local OLS, conditional on realized volatility (top panel) or VIX (bottom panel). Positive loss differences indicate the non-local method is preferred.

Supplemental Appendix for

Better the Devil You Know: Improved Forecasts from Imperfect Models

by Dong Hwan Oh and Andrew J. Patton

12 October 2021

Table S1: Out-of-sample forecast performance for GARCH-X models

Column groups: Method details (Rank to Window); Forecast performance (AvgLoss to MCS).

| Rank | Model | StateVar | Bwidth | Window | AvgLoss | GW stat | MCS |
|---|---|---|---|---|---|---|---|
| 1 | GARCH-X | time,RV | 0.9999,0.4 | full | 0.293 | -9.329 | ✓ |
| 2 | GARCH-X | RV | 0.41 | full | 0.294 | -9.382 | ✓ |
| 3 | GARCH-X | time,FFR | 0.98,0.39 | full | 0.309 | -5.017 | ✓ |
| 4 | GARCH-X | time,VIX | 0.98,0.89 | full | 0.313 | -4.013 | ✓ |
| 5 | GARCH-X | time | 0.98 | full | 0.313 | -4.653 | ✓ |
| 6 | GARCH | time,RV | 0.9995,0.34 | full | 0.320 | -4.890 | × |
| 7 | GARCH-X | - | - | 250 | 0.324 | -4.999 | × |
| 8 | GARCH | RV | 0.37 | full | 0.325 | -4.239 | × |
| 9* | GARCH-X | time,10Y-2Y | 0.9825,0.18 | full | 0.329 | -1.828 | × |
| 10 | GARCH | time,VIX | 0.995,0.28 | full | 0.333 | -2.294 | × |
| 11 | GARCH-X | - | - | 500 | 0.334 | -5.330 | × |
| 12 | GARCH-X | - | - | 1000 | 0.335 | -7.398 | × |
| 13 | GARCH | VIX | 0.32 | full | 0.349 | -1.363 | × |
| 14 | GARCH-X | VIX | 2.63 | full | 0.351 | -8.426 | × |
| 15 | GARCH-X | 10Y-2Y | 0.26 | full | 0.358 | -0.191 | × |
| 16 | GARCH-X | FFR | 0.44 | full | 0.359 | -0.018 | × |
| 17 | GARCH-X | - | - | full | 0.359 | ★ | × |
| 18 | GARCH | time | 0.995 | full | 0.371 | 1.428 | × |
| 19 | GARCH | - | - | 500 | 0.375 | 2.049 | × |
| 20 | GARCH | - | - | 250 | 0.376 | 1.711 | × |
| 21 | GARCH | time,10Y-2Y | 0.9975,0.25 | full | 0.380 | 2.191 | × |
| 22 | GARCH | time,FFR | 0.9975,0.49 | full | 0.381 | 2.632 | × |
| 23 | GARCH | - | - | 1000 | 0.382 | 2.892 | × |
| 24 | GARCH | FFR | 1.81 | full | 0.400 | 4.623 | × |
| 25 | GARCH | 10Y-2Y | 2.6 | full | 0.400 | 4.724 | × |
| 26 | GARCH | - | - | full | 0.402 | 4.844 | × |

Notes: This table presents measures of forecast performance over the out-of-sample period (January 2011 to June 2021) from GARCH and GARCH-X models estimated using either QML (non-local) or local QML. All GARCH-X models use VIX^2 as the extra variable. The rows are ordered by average OOS QLIKE loss, reported in the third-last column. The local method with the best performance in the validation sample (the second half of the estimation sample) is marked in the first column with *. The local estimators use the state variable(s) given in the third column and bandwidth parameter(s) from the fourth column, which are selected using the validation sample. The fifth column reports the window of data used in estimation, where “full” implies the entire in-sample period (2737 observations). The penultimate column reports Giacomini-White t-statistics of each model relative to the benchmark method (marked with ★), which is taken as the non-local method using the full estimation window, with negative t-statistics indicating lower average loss. The final column includes a check mark if a given method is included in the 95% model confidence set, and a cross otherwise.

Table S2: Out-of-sample forecast performance for HAR-X models

Column groups: Method details (Rank to Window); Forecast performance (AvgLoss to MCS).

| Rank | Model | StateVar | Bwidth | Window | AvgLoss | GW stat | MCS |
|---|---|---|---|---|---|---|---|
| 1 | HAR-X | RV | 0.79 | full | 0.232 | -7.001 | ✓ |
| 2* | HAR-X | time,RV | 0.9975,0.8 | full | 0.232 | -6.843 | ✓ |
| 3 | HAR-X | VIX | 0.63 | full | 0.236 | -6.722 | × |
| 4 | HAR-X | time,10Y-2Y | 0.9875,0.8 | full | 0.241 | -6.085 | × |
| 5 | HAR-X | time,VIX | 0.9925,0.73 | full | 0.245 | -5.723 | × |
| 6 | HAR | time,VIX | 0.999,0.62 | full | 0.246 | -5.256 | × |
| 7 | HAR-X | - | - | 250 | 0.248 | -5.576 | × |
| 8 | HAR-X | time | 0.995 | full | 0.248 | -5.433 | × |
| =8 | HAR | time,RV | 0.995,∞ | full | 0.251 | -4.918 | × |
| =8 | HAR | time,10Y-2Y | 0.995,∞ | full | 0.251 | -4.918 | × |
| =8 | HAR | time,FFR | 0.995,∞ | full | 0.252 | -4.918 | × |
| 12 | HAR | VIX | 1.8 | full | 0.252 | -4.829 | × |
| 13 | HAR | time | 0.995 | full | 0.252 | -4.909 | × |
| 14 | HAR | 10Y-2Y | 1.91 | full | 0.253 | -4.789 | × |
| 15 | HAR | RV | 2.86 | full | 0.253 | -4.743 | × |
| 16 | HAR | - | - | full | 0.253 | -4.757 | × |
| 17 | HAR | - | - | 500 | 0.253 | -4.785 | × |
| 18 | HAR | FFR | 2.3 | full | 0.253 | -4.743 | × |
| 19 | HAR | - | - | 250 | 0.255 | -4.742 | × |
| 20 | HAR-X | time,FFR | 0.99,0.32 | full | 0.263 | -3.563 | × |
| 21 | HAR-X | - | - | 500 | 0.273 | -3.097 | × |
| 22 | HAR | - | - | 1000 | 0.300 | -0.533 | × |
| 23 | HAR-X | - | - | 1000 | 0.307 | -0.782 | × |
| 24 | HAR-X | - | - | full | 0.325 | ★ | × |
| 25 | HAR-X | 10Y-2Y | 1.96 | full | 0.351 | 2.564 | × |
| 26 | HAR-X | FFR | 1.62 | full | 0.372 | 3.734 | × |

Notes: This table presents measures of forecast performance over the out-of-sample period (January 2011 to June 2021) from HAR and HAR-X models estimated using either QML (non-local) or local QML. All HAR-X models use VIX^2 as the extra variable. The rows are ordered by average OOS QLIKE loss, reported in the third-last column. The local method with the best performance in the validation sample (the second half of the estimation sample) is marked in the first column with *. The local estimators use the state variable(s) given in the third column and bandwidth parameter(s) from the fourth column, which are selected using the validation sample. The fifth column reports the window of data used in estimation, where “full” implies the entire in-sample period (2737 observations). The penultimate column reports Giacomini-White t-statistics of each model relative to the benchmark method (marked with ★), which is taken as the non-local method using the full estimation window, with negative t-statistics indicating lower average loss. The final column includes a check mark if a given method is included in the 95% model confidence set, and a cross otherwise.

Table S3: Out-of-sample forecast performance for GARCH-FZ models

Column groups: Method details (Rank to Window); Forecast performance (AvgLoss to MCS).

| Rank | StateVar | Bwidth | Window | AvgLoss | GW stat | MCS |
|---|---|---|---|---|---|---|
| 1 | RV | 1.96 | full | -3.862 | -4.136 | ✓ |
| 2 | - | - | 1000 | -3.861 | -0.619 | ✓ |
| 3 | VIX | 1.67 | full | -3.860 | -2.148 | ✓ |
| 4 | 10Y-2Y | 2.72 | full | -3.856 | -1.249 | × |
| 5 | - | - | full | -3.855 | ★ | × |
| =5 | FFR | ∞ | full | -3.855 | 0.000 | × |
| 7 | time | 0.995 | full | -3.846 | 0.508 | × |
| 8 | - | - | 500 | -3.836 | 0.876 | × |
| 9 | time,RV | 0.99,2.02 | full | -3.830 | 1.071 | × |
| 10* | time,VIX | 0.9925,1.21 | full | -3.829 | 1.253 | × |
| 11 | time,FFR | 0.9925,2.24 | full | -3.828 | 1.221 | × |
| 12 | time,10Y-2Y | 0.9925,1.04 | full | -3.825 | 1.333 | × |
| 13 | - | - | 250 | -3.812 | 1.308 | × |

Notes: This table presents measures of forecast performance over the out-of-sample period (January 2011 to June 2021) from GARCH-FZ models estimated using either M estimation or local M estimation and the FZ0 loss function in Equation (31). The rows are ordered by average OOS FZ0 loss, reported in the third-last column. For a given model, the local method with the best performance in the validation sample (the second half of the estimation sample) is marked in the first column with *. The local estimators use the state variable(s) given in the second column and bandwidth parameter(s) from the third column, which are selected using the validation sample. The fourth column reports the window of data used in estimation, where “full” implies the entire in-sample period (2737 observations). The penultimate column reports Giacomini-White t-statistics of each model relative to the benchmark method (marked with ★), which is taken as the non-local method using the full estimation window, with negative t-statistics indicating lower average loss. The final column includes a check mark if a given method is included in the 95% model confidence set, and a cross otherwise.

Cite this document

APA

Dong Hwan Oh and Andrew J. Patton (2021). Better the Devil You Know: Improved Forecasts from Imperfect Models (FEDS 2021-071). Board of Governors of the Federal Reserve System, Finance and Economics Discussion Series. https://whenthefedspeaks.com/doc/feds_2021-071

BibTeX

@techreport{wtfs_feds_2021_071,
  author = {Dong Hwan Oh and Andrew J. Patton},
  title = {Better the Devil You Know: Improved Forecasts from Imperfect Models},
  type = {Finance and Economics Discussion Series},
  number = {2021-071},
  institution = {Board of Governors of the Federal Reserve System},
  year = {2021},
  url = {https://whenthefedspeaks.com/doc/feds_2021-071},
  abstract = {Many important economic decisions are based on a parametric forecasting model that is known to be good but imperfect. We propose methods to improve out-of-sample forecasts from a misspecified model by estimating its parameters using a form of local M estimation (thereby nesting local OLS and local MLE), drawing on information from a state variable that is correlated with the misspecification of the model. We theoretically consider the forecast environments in which our approach is likely to offer improvements over standard methods, and we find significant forecast improvements from applying the proposed method across distinct empirical analyses including volatility forecasting, risk management, and yield curve forecasting.},
}