feds · March 22, 2018

Spectral backtests of forecast distributions with application to risk management

Abstract

We study a class of backtests for forecast distributions in which the test statistic is a spectral transformation that weights exceedance events by a function of the modeled probability level. The choice of the kernel function makes explicit the user's priorities for model performance. The class of spectral backtests includes tests of unconditional coverage and tests of conditional coverage. We show how the class embeds a wide variety of backtests in the existing literature, and propose novel variants as well. In an empirical application, we backtest forecast distributions for the overnight P&L of ten bank trading portfolios. For some portfolios, test results depend materially on the choice of kernel.

Finance and Economics Discussion Series, Divisions of Research & Statistics and Monetary Affairs, Federal Reserve Board, Washington, D.C.

Spectral backtests of forecast distributions with application to risk management. Michael B. Gordy and Alexander J. McNeil. 2018-021

Please cite this paper as: Gordy, Michael B., and Alexander J. McNeil (2018). “Spectral backtests of forecast distributions with application to risk management,” Finance and Economics Discussion Series 2018-021. Washington: Board of Governors of the Federal Reserve System, https://doi.org/10.17016/FEDS.2018.021.

NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. References in publications to the Finance and Economics Discussion Series (other than acknowledgement) should be cleared with the author(s) to protect the tentative character of these papers.

Michael B. Gordy, Federal Reserve Board, Washington DC
Alexander J. McNeil, The York Management School, University of York

February 21, 2018

JEL Codes: C52; G21; G28; G32
Keywords: Backtesting; Volatility; Risk management

We thank Harrison Katz for excellent research assistance. We have benefitted from discussion with Mike Giles, Marie Kratz, Hsiao Yen Lok, David Lynch, David McArthur, Michael Milgram, and Johanna Ziegel. The opinions expressed here are our own, and do not reflect the views of the Board of Governors or its staff. Address correspondence to Alexander J. McNeil, The York Management School, University of York, Freboys Lane, York YO10 5GD, UK, +44 (0) 1904 325307, alexander.mcneil@york.ac.uk.

1 Introduction

In many forecasting exercises, fitting some range of quantiles of the forecast distribution may be prioritized in model design and calibration. In risk management applications, which will motivate this study, accuracy near the median of the distribution or in the “good tail” of high profits is generally much less important than accuracy in the “bad tail” of large losses. Even within the region of primary interest, preferences may be nonmonotonic in probabilities. For example, the modeller may care a great deal about assessing the magnitude of once-in-a-decade market disruptions, but care much less about quantiles in the extreme tail that are consequent to unsurvivable cataclysmic events.

In this paper, we study a class of backtests for forecast distributions in which the test statistic weights exceedance events by a function of the modeled probability level. The choice of the kernel function makes explicit the priorities for model performance. The backtest statistic and its asymptotic distribution are analytically tractable for a very large family of kernel functions.

Our approach unifies a wide variety of existing approaches to backtesting. In the area of risk management, the time-honored test statistic (dating back to Kupiec, 1995) is simply a count of “VaR exceedances,” i.e., indicator variables equal to one whenever the realized trading loss is in excess of the day-ahead value-at-risk (VaR) forecast. In our framework, this corresponds to a Dirac delta kernel function in which all weight is concentrated at exactly the target VaR level (e.g., at $\alpha = 0.99$). At the other extreme, the tests applied in Diebold et al. (1998) represent a special case in which weights are uniform across all probability levels. The likelihood-ratio test of Berkowitz (2001) represents an intermediate case of a kernel truncated to tail probabilities.
The class of spectral backtests encompasses discrete kernels, which selectively weight forecasts at a discrete set of probability levels, as well as continuous kernels, which apply positive weight throughout an interval of levels. Perhaps of greater importance in practice, the class allows for both tests of unconditional coverage and tests of conditional coverage.

The application of a weighting function in this paper bears some similarity to the approach of Amisano and Giacomini (2007) and Gneiting and Ranjan (2011) in the literature on comparisons of density forecasts. In both of those papers, weights are applied to a forecast scoring rule to obtain measures of forecast performance that accentuate the tails (or other regions) of the distribution. However, the measure for any one forecasting method has no absolute meaning and is designed to facilitate comparison with other methods using the general comparative testing approach proposed by Diebold and Mariano (1995). In contrast, our tests are absolute tests of forecast quality in the spirit of Diebold et al. (1998). While the comparative testing approach is clearly useful for the internal refinement of the forecasting method by the forecaster, the absolute testing approach in this paper facilitates the external evaluation of the forecaster’s results by another agent, such as a regulator.

Our investigation is motivated in part by a major expansion in the data available to regulators for the backtesting exercise. Prior to 2013, banks in the US reported to regulators VaR exceedances at the 99% level. The new Market Risk Rule mandates that banks report for each trading day the probability associated with the realized P&L in the prior day’s forecast distribution, which is equivalent to providing the regulator with VaR exceedances at every level $\alpha \in [0,1]$. The expanded reporting regime allows us to assess the tradeoff between power and specificity in backtesting. If a regulator is concerned narrowly with the validation of reported VaR at level $\alpha$, then a count of VaR exceedances is a sufficient statistic for a test for unconditional coverage. However, if the regulator is willing to assign positive weight to probability levels in a neighborhood of $\alpha$, we can construct more powerful backtests. Furthermore, our approach is consistent with a broader view of the risk manager’s mandate to forecast probabilities over a range of large losses. The formal guidance of US regulators to banks on internal model validation explicitly requires “checking the distribution of losses against other estimated percentiles” (Board of Governors of the Federal Reserve System, 2011, p. 15).

The reforms mandated by the Fundamental Review of the Trading Book (Basel Committee on Bank Supervision, 2013) introduce a distinct set of challenges. Due to begin parallel run in 2018, the FRTB replaces 99%-VaR with 97.5%-Expected Shortfall (ES) as the determinant of capital requirements. While there has been much debate over whether ES is amenable to direct backtesting (Gneiting, 2011; Acerbi and Szekely, 2014; Fissler and Ziegel, 2015; Fissler et al., 2016), our contribution addresses a different issue. We devise tests of the forecast distribution from which risk measures are estimated and not tests of the risk measure estimates.
When VaR is of primary interest it may be noted that some limiting special cases of our testing methodology are equivalent to VaR exceedance tests. When ES is of primary interest it may be argued that a satisfactory forecast of the tail of the loss distribution is of even greater importance, since the risk measure depends on the whole tail.

Two other aspects of FRTB are relevant to our contribution. First, although estimates of ES will be the cornerstone of the risk capital calculation, the model approval process will continue to be based on VaR estimates and VaR exceedances. Second, FRTB requires banks to go beyond the mandatory VaR backtesting regime to consider multiple levels or other features of the tail. Without being prescriptive, the Basel Committee explicitly mentions a number of possible directions for the extended model validation requirements including the use of probability integral transform values (Basel Committee on Bank Supervision, 2016, Appendix B), which also serve as the input in our class of backtests.

For convenience in exposition, we mostly assume henceforth that the backtest is conducted by a regulator who is interested primarily in assessing the bank’s 99%-VaR forecast, but our conclusions hinge little on the choice of risk measure, and furthermore apply as much to internal assessments of forecasting performance as

to external assessment by regulators.

In Section 2, we lay out the statistical setting for the risk manager’s forecasting problem and the data to be collected for backtesting. The transformation that underpins the class of spectral backtests is introduced in Section 3. Spectral backtests of unconditional coverage are described in Section 4. In Section 5, we develop tests of conditional coverage based on the martingale difference property. As an application to real data, in Section 6 we backtest ten bank models for overnight P&L distributions for trading portfolios.

2 Theory and practice of risk measurement

We assume that a bank models profit and loss (P&L) on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t \in \mathbb{N}_0}, \mathbb{P})$ where $\mathcal{F}_t$ represents the information available to the risk manager at time $t$, $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$ and $\mathbb{N}$ denotes the non-zero natural numbers. For any time $t \in \mathbb{N}$, $L_t$ is an $\mathcal{F}_t$-measurable random variable representing portfolio loss (i.e., negative P&L) in currency units. We denote the conditional loss distribution given information to time $t-1$ by $F_t(x) = \mathbb{P}(L_t \le x \mid \mathcal{F}_{t-1})$.

The loss distribution cannot be assumed to be time-invariant. The distribution of returns on the underlying risk factors (e.g., equity prices, exchange rates) is time-varying, most notably due to stochastic volatility. Furthermore, $F_t$ depends on the composition of the portfolio. Because the portfolio is rebalanced in each period, $F_t$ can evolve over time even when factor returns are iid.

For $t \in \mathbb{N}$ we can define the process $(U_t)$ by $U_t = F_t(L_t)$ using the probability integral transform (PIT). Under the assumption that the conditional loss distributions at each time point are continuous, the result of Rosenblatt (1952) implies that the process $(U_t)_{t \in \mathbb{N}}$ is a sequence of iid standard uniform variables.

The risk manager builds a model $\hat{F}_t$ of $F_t$ based on information up to time $t-1$.
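The uniformity of the PIT process under the true conditional distributions can be illustrated by simulation. The sketch below is a hypothetical example and not part of the paper's methodology: it assumes Gaussian conditional losses with a GARCH(1,1)-type variance recursion whose parameters are chosen purely for illustration, and the function name `simulate_true_pit` is introduced here. Although the losses themselves are neither independent nor identically distributed, the values $U_t = F_t(L_t)$ should look standard uniform.

```python
import random
from statistics import NormalDist

def simulate_true_pit(n, seed=0):
    """Simulate losses with time-varying volatility and return U_t = F_t(L_t).

    Assumption (illustrative only): L_t | F_{t-1} ~ N(0, sigma_t^2) with a
    GARCH(1,1)-type recursion for sigma_t^2.
    """
    rng = random.Random(seed)
    sigma2 = 1.0                       # conditional variance known at time t-1
    pits = []
    for _ in range(n):
        sigma = sigma2 ** 0.5
        loss = rng.gauss(0.0, sigma)   # realized loss L_t
        # PIT under the true conditional distribution F_t
        pits.append(NormalDist(0.0, sigma).cdf(loss))
        # update the conditional variance for the next period
        sigma2 = 0.05 + 0.10 * loss ** 2 + 0.85 * sigma2
    return pits

pits = simulate_true_pit(20000)
pit_mean = sum(pits) / len(pits)
tail_freq = sum(1 for u in pits if u > 0.99) / len(pits)
# For standard uniform variables the mean is 1/2 and P(U > 0.99) = 0.01.
print(f"mean PIT = {pit_mean:.3f}, share above 0.99 = {tail_freq:.4f}")
```

In this setup the sample mean should be close to 0.5 and the share of values above 0.99 close to 1%, up to sampling error.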
Reported PIT-values are the corresponding rvs $(P_t)$ obtained by setting $P_t = \hat{F}_t(L_t)$ for $t \in \mathbb{N}$. If the models $\hat{F}_t$ form a sequence of ideal probabilistic forecasts in the sense of Gneiting et al. (2007), i.e. coinciding with the conditional laws $F_t$ of $L_t$ for every $t$, then we expect the reported PIT-values to behave like an iid sample of standard uniform variates.1

1 In the statistical forecasting literature, tests based on the uniformity and independence of PIT values are also referred to as tests that a sequence of models is calibrated in probability (Gneiting et al., 2007; Gneiting and Ranjan, 2011).

Reported PIT-values contain information about VaR exceedances at any level $\alpha$. To see this note that

$$P_t \ge \alpha \iff L_t \ge \widehat{\mathrm{VaR}}_{\alpha,t} \tag{1}$$

where $\widehat{\mathrm{VaR}}_{\alpha,t} := \hat{F}_t^{\leftarrow}(\alpha)$ is an estimate of the $\alpha$-VaR constructed at time $t-1$ by calculating the generalized inverse of $\hat{F}_t$ at $\alpha$. Relationship (1) always holds for any model $\hat{F}_t$, whether continuous or discrete.2 Thus, we would expect well-designed tests that use reported PIT-values to be more powerful than VaR exceedance tests in detecting deficiencies in the models $\hat{F}_t$.

Our tests are agnostic with respect to the procedures and models used by the bank in forecasting. In practice, there is considerable heterogeneity in methodology. For nearly two decades, most large banks have relied primarily on some variant of historical simulation (HS), which is a nonparametric method based on re-sampling of historical risk-factor changes or returns. A sufficient condition for the “plain-vanilla” HS estimator $\hat{F}_t^{HS}$ to be a consistent estimator of $F_t$ for all $t$ is that the returns are iid; however, the approach does not account for serial dependence in returns such as time-varying volatility. For this reason, some banks adopt filtered historical simulation (FHS) as suggested by Hull and White (1998) and Barone-Adesi et al. (1998). In this approach, the historical risk-factor returns are normalized by their estimated volatilities, which are typically obtained by taking an exponentially-weighted moving-average of past returns. Banks that do not use HS or FHS typically adopt a parametric model for the joint distribution of risk-factor changes.3

In our empirical application, testing for delayed response to changes in volatility is of special interest. Assuming a roughly symmetric loss distribution centered at zero, the frequent switching between positive and negative values will tend to cause PIT values to be serially uncorrelated, even when volatility is misspecified in the model. However, extreme PIT-values (i.e., near 0 or 1) will tend to beget extreme PIT-values in high volatility periods, and middling PIT-values (i.e., near $1/2$) will tend to beget middling PIT-values in low volatility periods.
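The clustering effect described above can be illustrated with a small simulation. This is a hypothetical sketch, not the authors' procedure: the true losses are assumed to follow a GARCH(1,1)-type recursion with illustrative parameters, while the reported model is a static standard normal that ignores volatility dynamics. The helper names `lag1_autocorr` and `simulate_static_model_pit` are introduced here, and the distance of a PIT-value from $1/2$ is measured by $|2P_t - 1|$.

```python
import random
from statistics import NormalDist

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(x)
    m = sum(x) / n
    num = sum((x[t] - m) * (x[t - 1] - m) for t in range(1, n))
    den = sum((v - m) ** 2 for v in x)
    return num / den

def simulate_static_model_pit(n, seed=1):
    """Reported PIT-values when the model ignores time-varying volatility.

    Assumption (illustrative only): true losses are conditionally Gaussian
    with GARCH(1,1)-type variance; the reported model is a fixed N(0, 1).
    """
    rng = random.Random(seed)
    model = NormalDist(0.0, 1.0)       # misspecified: constant volatility
    sigma2, pits = 1.0, []
    for _ in range(n):
        loss = rng.gauss(0.0, sigma2 ** 0.5)
        pits.append(model.cdf(loss))
        sigma2 = 0.05 + 0.10 * loss ** 2 + 0.85 * sigma2
    return pits

pits = simulate_static_model_pit(20000)
raw_acf = lag1_autocorr(pits)
folded_acf = lag1_autocorr([abs(2.0 * p - 1.0) for p in pits])
print(f"lag-1 autocorrelation of P_t:        {raw_acf:.3f}")
print(f"lag-1 autocorrelation of |2P_t - 1|: {folded_acf:.3f}")
```

In this setup the raw PIT-values should show almost no serial correlation, whereas the folded values should be clearly positively autocorrelated.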
This pattern can be inferred by examining autocorrelation in the transformed values $|2P_t - 1|$. We will exploit this transformation in implementing tests of conditional coverage in Section 6.

2 We can replace the weak inequalities with strict inequalities if the models $\hat{F}_t$ are strictly increasing and continuous. Since it is somewhat more common to consider the event $\{L_t > \widehat{\mathrm{VaR}}_{\alpha,t}\}$ to be a VaR exceedance, we will define a VaR exceedance in terms of the reported PIT-value as the event $\{P_t > u\}$.

3 The classic RiskMetrics approach can be considered a progenitor of this class of models.

There are relatively few empirical studies of bank VaR forecasting. Berkowitz and O’Brien (2002) show that VaR estimates by US banks are conservative (i.e., there are fewer exceedances than expected) and that the forecasts underperform simple time-series models applied to daily P&L. In a sample of Canadian banks in 1999–2005, Pérignon et al. (2008) record only two 99%-VaR exceedances in 7354 observations. Pérignon and Smith (2010) report similar results for a larger international sample in 1996–2005. For the subsample of banks employing HS, they also show that reported VaR has little predictive power for subsequent volatility in P&L. Berkowitz et al. (2011) apply a suite of backtests to a proprietary sample of four business lines of a single bank in 2001–2004. While they find some evidence of excessive conservatism and/or clustering

of VaR exceedances in three of the four business lines, the exercise also demonstrates the limited power of backtests in sample sizes of two to three years. The importance of sample size is evident in the contrasting results of O’Brien and Szerszen (2017). In a sample of five large US banks from 2001–2014, tests of unconditional coverage reject VaR forecasts as excessively conservative for all banks in the pre-crisis and post-crisis periods, for which the samples spanned at least 1000 trading days per bank. In the crisis period, tests of unconditional coverage reject VaR forecasts as insufficiently conservative for all five banks, and independence is rejected for four of the banks. This pattern is consistent with a failure to model stochastic volatility.

3 Spectral transformations of PIT exceedances

The tests in this paper are based on transformations of indicator variables for PIT exceedances.4 The transformations take the form

$$W_t = \int_0^1 \mathbb{1}_{\{P_t > u\}} \, d\nu(u) \tag{2}$$

where $\nu$ is a finite measure defined on $[0,1]$ which is designed to apply weight to different levels in the interval $(0,1]$, typically in the region of the standard VaR level $\alpha = 0.99$. We refer to $\nu$ as the kernel measure for the transform. From (2), we can easily derive the closed-form expression

$$W_t = \nu([0, P_t)) \tag{3}$$

which shows that $W_t$ is increasing in $P_t$.

4 We draw on the integral transform literature in describing our backtest as “spectral.” Our approach is unconnected to the spectral density test of Durlauf (1991). The latter is a test of the martingale property that examines whether the spectrum (in the sense of the transformed autocovariance sequence) is flat.

3.1 Weighting schemes

For the weighting scheme in (2) we consider three possibilities:

Discrete weighting in which the kernel measure takes the form $\nu = \sum_{i=1}^m \gamma_i \delta_{\alpha_i}$ for $m \ge 1$. This places positive mass $\gamma_1, \ldots, \gamma_m$ at the ordered values $\alpha_1 < \cdots < \alpha_m$ leading to

$$W_t = \sum_{i=1}^m \gamma_i \mathbb{1}_{\{P_t > \alpha_i\}}. \tag{4}$$

Continuous weighting in which the measure has density $d\nu(u) = g(u)\,du$ on the interval $[\alpha_1, \alpha_2] \subset [0,1]$, where the function $g$ satisfies

Assumption 1. (i) $g(u) = 0$, $u \notin [\alpha_1, \alpha_2]$, (ii) $g$ is continuous and (iii) $g(u) > 0$, $u \in (\alpha_1, \alpha_2)$.

In this case we have

$$W_t = \int_{\alpha_1}^{\alpha_2} g(u) \mathbb{1}_{\{P_t > u\}} \, du. \tag{5}$$

We refer to $g$ as the kernel density. It plays the same role as the “kernel function” in the nonparametric statistics literature, but we use the term in the more general sense of the integral transform literature. When $g$ satisfies the additional requirement that $\int_{\alpha_1}^{\alpha_2} g(u)\,du = 1$, it is a normalized kernel density. In nonparametric statistics, the kernel is often defined to be normalized and symmetric, but we do not impose either requirement here.

As in the nonparametric statistics literature, the interval $[\alpha_1, \alpha_2]$ is referred to as the kernel window. Note that $g$ is strictly positive inside the kernel window, but may equal zero at the boundary points. This allows us to accommodate functions such as the Epanechnikov kernel that vanish at the boundaries. Writing $G$ for the integral of $g$, (3) can be expressed as

$$W_t = G(\alpha_1 \vee (P_t \wedge \alpha_2)) \tag{6}$$

Since $G$ is strictly increasing inside the kernel window, (6) implies that $W_t$ is a strictly increasing function of the truncated PIT-value $P_t^* = \alpha_1 \vee (P_t \wedge \alpha_2)$.

Continuous weighting can be viewed as a way of building tests that incorporate information from reported PIT-values in a neighborhood of a particular VaR level $\alpha$. Let $g^*$ be a normalized kernel density on $[0,1]$, and define a family of normalized kernel densities $g_{\alpha,\epsilon}$ on the intervals $[\alpha - \epsilon/2, \alpha + \epsilon/2]$ by

$$g_{\alpha,\epsilon}(u) = \frac{1}{\epsilon}\, g^*\!\left(\frac{u - \alpha + \epsilon/2}{\epsilon}\right). \tag{7}$$

Then we have that the measures $\nu_{\alpha,\epsilon}$ defined by $g_{\alpha,\epsilon}$ converge to the Dirac measure $\delta_\alpha$ as $\epsilon \to 0^+$, and $\lim_{\epsilon \to 0} W_t = \mathbb{1}_{\{P_t > \alpha\}}$ almost surely. Thus, classic tests based on the exceedance indicator $\mathbb{1}_{\{P_t > \alpha\}}$ can be seen as limiting cases of more general continuous tests as the width $\epsilon$ of the kernel window shrinks to zero.

Combined discrete and continuous weighting. It is of course possible to consider a measure that is given by the sum of a discrete weighting and a continuous weighting scheme. We consider one test of this kind in Section 4.3.
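The transformations (4), (6) and (7) can be sketched in a few lines. This is an illustrative sketch only: the function names are introduced here, and the continuous case is restricted to the normalized uniform kernel density $g(u) = 1/(\alpha_2 - \alpha_1)$, for which $G(u) = (u - \alpha_1)/(\alpha_2 - \alpha_1)$ on the kernel window.

```python
def discrete_transform(p, levels, weights):
    """W = sum_i gamma_i * 1{p > alpha_i}, as in (4)."""
    return sum(g for a, g in zip(levels, weights) if p > a)

def uniform_kernel_transform(p, a1, a2):
    """W = G(a1 v (p ^ a2)) as in (6), assuming a normalized uniform kernel."""
    p_star = max(a1, min(p, a2))       # truncated PIT-value
    return (p_star - a1) / (a2 - a1)

def narrow_kernel_transform(p, alpha, eps):
    """Uniform kernel on [alpha - eps/2, alpha + eps/2], i.e. (7) with g* uniform."""
    return uniform_kernel_transform(p, alpha - eps / 2.0, alpha + eps / 2.0)

levels = [0.985, 0.99, 0.995]
weights = [1.0 / 3.0] * 3

# A PIT-value of 0.992 exceeds the first two levels only.
w_discrete = discrete_transform(0.992, levels, weights)

# The midpoint of the window [0.985, 0.995] is mapped to one half.
w_uniform = uniform_kernel_transform(0.99, 0.985, 0.995)

# As the window narrows around alpha = 0.99, W approaches the indicator 1{p > alpha}.
w_wide = narrow_kernel_transform(0.9905, 0.99, 0.01)
w_narrow_above = narrow_kernel_transform(0.9905, 0.99, 1e-4)
w_narrow_below = narrow_kernel_transform(0.9895, 0.99, 1e-4)
```

Other kernel densities would only change the function $G$; the truncation step is the same.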
In this general case, the notion of the kernel window generalizes as the support of the kernel measure.

3.2 Univariate and multivariate transformations

We consider tests based on univariate and multivariate spectral transformations of the data. A univariate transformation applies a single kernel measure $\nu$ and yields spectrally

transformed PIT-values $W_1, \ldots, W_n$ according to (2). A multivariate transformation corresponds to a set of distinct kernel measures $\nu_1, \ldots, \nu_j$. The transformed PIT values are then vector-valued variables $\mathbf{W}_1, \ldots, \mathbf{W}_n$ where

$$\mathbf{W}_t = (W_{t,1}, \ldots, W_{t,j})', \qquad W_{t,i} = \int_0^1 \mathbb{1}_{\{P_t > u\}} \, d\nu_i(u), \quad i = 1, \ldots, j. \tag{8}$$

Spectrally transformed PIT values satisfy simple product rules that we will later exploit in calculating variances of the $(W_t)$ and covariance matrices of the $(\mathbf{W}_t)$. Consider two discrete kernel measures $\nu_1$ and $\nu_2$ which share the same support. Then the product $W_{t,1} W_{t,2}$ is a spectral transformation of $P_t$ on the same support, and the kernel weights are easily calculated as summarized in the following result.

Proposition 3.1. Fix a set of distinct levels $0 < \alpha_1 < \cdots < \alpha_m < 1$, and let $\gamma_i = (\gamma_{i,1}, \ldots, \gamma_{i,m})'$ be a set of positive weights. The set of spectrally transformed PIT values defined by $W_{t,i} = \sum_{\ell=1}^m \gamma_{i,\ell} \mathbb{1}_{\{P_t > \alpha_\ell\}}$ is closed under multiplication and $W_{t,1} W_{t,2} = \sum_{\ell=1}^m \gamma_\ell^* \mathbb{1}_{\{P_t > \alpha_\ell\}}$ where $\gamma_\ell^*$ are positive weights satisfying

$$\gamma_\ell^* = \gamma_{1,\ell} \sum_{\ell'=1}^{\ell} \gamma_{2,\ell'} + \gamma_{2,\ell} \sum_{\ell'=1}^{\ell} \gamma_{1,\ell'} - \gamma_{1,\ell}\gamma_{2,\ell}.$$

If $\sum_{\ell=1}^m \gamma_{1,\ell} = \sum_{\ell=1}^m \gamma_{2,\ell} = 1$, then $\sum_{\ell=1}^m \gamma_\ell^* = 1$.

An analogous product rule holds for the set of spectral transformations with continuous kernels on the same kernel window.

Proposition 3.2. Fix a kernel window $[\alpha_1, \alpha_2] \subset [0,1]$, and let $g_i$ be a kernel density on $[\alpha_1, \alpha_2]$ satisfying Assumption 1. The set of spectrally transformed PIT values defined by $W_{t,i} = \int_{\alpha_1}^{\alpha_2} g_i(u) \mathbb{1}_{\{P_t > u\}} \, du$ is closed under multiplication and $W_{t,1} W_{t,2} = \int_{\alpha_1}^{\alpha_2} g^*(u) \mathbb{1}_{\{P_t > u\}} \, du$ where

$$g^*(u) = g_1(u) G_2(u) + g_2(u) G_1(u).$$

If $g_1$ and $g_2$ are normalized kernel densities on $[\alpha_1, \alpha_2]$, then so is $g^*$.

Proofs for these propositions and other mathematical results are found in Appendix A.

3.3 Spectral backtests

We will refer to any backtest based on spectrally transformed PIT exceedances as a spectral backtest.
This encompasses a great variety of tests, but two general testing approaches will feature prominently in our presentation: Z-tests and likelihood ratio tests (LRTs).
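Since the Z-tests rely on second moments obtained from the product rules of Section 3.2, a quick numerical check of the discrete rule in Proposition 3.1 may be useful before proceeding. The sketch below is illustrative: the levels and weights are arbitrary choices and the function names are introduced here. It compares the product of two discretely weighted transforms with the single transform that uses the weights $\gamma_\ell^*$.

```python
def cumulative(weights):
    """Running sums Gamma_l = gamma_1 + ... + gamma_l."""
    out, total = [], 0.0
    for w in weights:
        total += w
        out.append(total)
    return out

def product_weights(g1, g2):
    """Weights gamma*_l of Proposition 3.1 for the product of two transforms."""
    c1, c2 = cumulative(g1), cumulative(g2)
    return [a * cb + b * ca - a * b for a, b, ca, cb in zip(g1, g2, c1, c2)]

def discrete_transform(p, levels, weights):
    """W = sum_l gamma_l * 1{p > alpha_l}."""
    return sum(g for a, g in zip(levels, weights) if p > a)

levels = [0.975, 0.99, 0.995]          # arbitrary illustrative levels
g1 = [0.5, 0.3, 0.2]                   # arbitrary normalized weights
g2 = [0.2, 0.2, 0.6]
g_star = product_weights(g1, g2)

# Largest discrepancy between W1 * W2 and the transform with weights gamma*.
max_gap = max(
    abs(discrete_transform(p, levels, g1) * discrete_transform(p, levels, g2)
        - discrete_transform(p, levels, g_star))
    for p in [0.5, 0.98, 0.992, 0.999]
)
print(g_star, sum(g_star), max_gap)
```

With normalized inputs the weights $\gamma_\ell^*$ also sum to one, as the proposition states.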

To formulate these tests we state the null hypothesis in this paper to be

$$H_0: \mathbf{W}_t \sim F_W^0 \ \text{ and } \ \mathbf{W}_t \perp\!\!\!\perp \mathcal{F}_{t-1}, \ \forall t, \tag{9}$$

where $F_W^0$ denotes the distribution function of $\mathbf{W}_t$ in (8) when $P_t$ is uniform; obviously this subsumes the univariate case where we will simply write $W_t$ for the spectrally-transformed variables. The null hypothesis (9) implies that $\mathbf{W}_1, \ldots, \mathbf{W}_n$ are iid random variables but also requires that $\mathbf{W}_t$ is independent of all information in the time $t-1$ information set $\mathcal{F}_{t-1}$, such as the values $P_{t-j}$ for $j > 0$. Observe that our null hypothesis is strictly weaker than a null hypothesis that the $(P_t)$ are iid Uniform. This is by intent. Since the regulator is free to choose $\nu$ in accordance with her priorities, she should not object to departures from uniformity and serial independence that arise outside her chosen kernel window.

Z-tests. In the univariate case these are based on the asymptotic normality of $\bar{W}_n = n^{-1} \sum_{t=1}^n W_t$ under the null hypothesis (9). Using Propositions 3.1 and 3.2, we calculate $\mu_W = E(W_t)$ and $\sigma_W^2 = \mathrm{var}(W_t)$ in the null model $F_W^0$. It then follows trivially from the central limit theorem (CLT) that, under the null hypothesis (9),

$$Z_n = \frac{\sqrt{n}\,(\bar{W}_n - \mu_W)}{\sigma_W} \xrightarrow[n \to \infty]{d} N(0,1). \tag{10}$$

In the multivariate case ($\dim \mathbf{W}_t = j$) we have

$$\sqrt{n}\,\big(\bar{\mathbf{W}}_n - \boldsymbol{\mu}_W\big) \xrightarrow[n \to \infty]{d} N_j(\mathbf{0}, \Sigma_W)$$

where $\bar{\mathbf{W}}_n = n^{-1} \sum_{t=1}^n \mathbf{W}_t$ and $\boldsymbol{\mu}_W$ and $\Sigma_W$ are the mean vector and covariance matrix of the null distribution $F_W^0$. Hence a test can be based on assuming for large enough $n$ that

$$T_n = n\,\big(\bar{\mathbf{W}}_n - \boldsymbol{\mu}_W\big)' \Sigma_W^{-1} \big(\bar{\mathbf{W}}_n - \boldsymbol{\mu}_W\big) \sim \chi_j^2, \tag{11}$$

where we refer to $T_n$ as a $j$-spectral Z-test statistic.

Likelihood ratio tests. These are based on parametric models $F_W(\cdot \mid \theta)$ that nest the model in the null hypothesis (9). In other words $F_W^0 = F_W(\cdot \mid \theta_0)$ for some value $\theta_0$. Writing $L_W(\theta \mid \mathbf{W})$ for the likelihood function, the test is based on the asymptotic distribution of the statistic

$$LR_{W,n} = \frac{L_W(\theta_0 \mid \mathbf{W})}{L_W(\hat{\theta} \mid \mathbf{W})} \tag{12}$$

where $\hat{\theta}$ denotes the maximum likelihood estimate.
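As an illustration of the univariate Z-test in (10), the sketch below uses a normalized uniform kernel on $[\alpha_1, \alpha_2]$. The closed-form moments are our own evaluation of the product rule in Proposition 3.2 for this particular kernel, namely $\mu_W = 1 - (\alpha_1 + \alpha_2)/2$ and $E(W_t^2) = (1 - \alpha_1) - 2(\alpha_2 - \alpha_1)/3$; they are not quoted from the paper. The function name and the simulated data are hypothetical.

```python
import random
from statistics import NormalDist

def uniform_kernel_z_test(pits, a1, a2):
    """Univariate spectral Z-test as in (10), assuming a uniform kernel on [a1, a2].

    Returns the Z statistic and a two-sided p-value from the N(0, 1) limit.
    """
    d = a2 - a1
    # W_t = G(a1 v (P_t ^ a2)) with G(u) = (u - a1) / d
    w = [(max(a1, min(p, a2)) - a1) / d for p in pits]
    mu = 1.0 - (a1 + a2) / 2.0              # E(W_t) under uniform PITs
    second = (1.0 - a1) - 2.0 * d / 3.0     # E(W_t^2) under uniform PITs
    sigma = (second - mu * mu) ** 0.5
    n = len(w)
    z = n ** 0.5 * (sum(w) / n - mu) / sigma
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

rng = random.Random(42)
# Well-calibrated forecaster: PIT-values are iid standard uniform.
good_pits = [rng.random() for _ in range(1000)]
# Forecaster who understates volatility: true sd 1.5, model sd 1.
bad_pits = [NormalDist().cdf(rng.gauss(0.0, 1.5)) for _ in range(1000)]

z_good, p_good = uniform_kernel_z_test(good_pits, 0.985, 0.995)
z_bad, p_bad = uniform_kernel_z_test(bad_pits, 0.985, 0.995)
print(f"calibrated model:     Z = {z_good:.2f}, p = {p_good:.3f}")
print(f"understated variance: Z = {z_bad:.2f}, p = {p_bad:.3g}")
```

A large positive statistic indicates more mass in the upper tail of the PIT-values than the model implies, which is what understated volatility should produce.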

An important difference between the two classes of test is that the Z-tests are sensitive to the choice of weighting scheme whereas the likelihood ratio tests are not. Consider the univariate case for simplicity. The only aspect of the kernel measure $\nu$ that determines the likelihood test statistic $LR_{W,n}$ is its support; the actual weighting scheme applied on the support plays no role. For example, in the case of continuous weighting, it is the kernel window $[\alpha_1, \alpha_2]$ that determines the test statistic and not the kernel density $g$. Apart from the choice of the support of the measure, the only discretion we have over the likelihood ratio test is the choice of nesting family $F_W(\cdot \mid \theta)$.

This is a consequence of the well-known invariance of the likelihood ratio test under strictly increasing transformations. To make this assertion clearer we will now give a version of the invariance result in the case of univariate continuous weighting, which will facilitate some of our later arguments.

Theorem 3.3. Let $F_P(p \mid \theta)$ be a parametric model for the reported PIT values $P_1, \ldots, P_n$ that nests the uniform model as a special case corresponding to $\theta = \theta_0$. Let $P_t^* = \alpha_1 \vee (P_t \wedge \alpha_2)$ denote the corresponding truncated PIT values and $W_t = T(P_t^*)$ the values that are obtained under any transformation $T$ which is strictly increasing and continuous on $[\alpha_1, \alpha_2]$ such as (6).

Let $L_P(\theta \mid \mathbf{P}^*)$ denote the likelihood for the truncated PIT values under $F_P(p \mid \theta)$ and let $L_W(\theta \mid \mathbf{W})$ denote the likelihood for the $(W_t)$ values under the distribution $F_W(w \mid \theta)$ implied by $F_P(p \mid \theta)$. Then the maximizing values of $L_P(\theta \mid \mathbf{P}^*)$ and $L_W(\theta \mid \mathbf{W})$ are the same and the corresponding likelihood ratio test statistics of the null hypothesis $H_0: \theta = \theta_0$ against the alternative $H_1: \theta \ne \theta_0$ coincide regardless of the choice of the transformation $T$.

4 Tests of unconditional coverage

It is common to divide backtesting methods into tests of unconditional calibration and tests of conditional calibration.
In the context of VaR backtesting, an unconditional test is a test that exceedances are Bernoulli events with the correct probability of occurrence while a conditional test is a test that exceedances have the correct conditional probability of occurrence, which is equivalent to requiring that they are also independent events. For spectrally transformed PIT-values, an unconditional test would test for the distribution $F_W^0$ implied by the uniformity of the PIT-values while a conditional test would explicitly test for both the correct distribution and the independence of $W_t$ and $\mathcal{F}_{t-1}$ for all $t$.

In this section we present a number of unconditional tests based on the Z-test and LR-test ideas discussed in Section 3. It is important to note that the convergence results on which these tests are based, although mostly stated under iid assumptions, do hold in situations where the independence assumption is relaxed. Consider the Z-test

convergence result in (10) and recall the martingale CLT of Billingsley (1961): if $(X_t)$ is a stationary and ergodic process adapted to a filtration $(\mathcal{F}_t)$ satisfying the martingale-difference property $E(X_t \mid \mathcal{F}_{t-1}) = 0$, then $\sqrt{n}\,\bar{X}_n \xrightarrow[n \to \infty]{d} N(0, \sigma_X^2)$ where $\sigma_X^2$ denotes the variance of $X_t$. Thus, the same convergence in (10) would be obtained if $(W_t - \mu_W)$ is a stationary and ergodic martingale difference sequence, which would entail that $(W_t)$ is an uncorrelated sequence. More generally, provided that $\lim_{n \to \infty} \mathrm{var}(\sqrt{n}\,\bar{W}_n) \approx \sigma_W^2$ the test statistic $Z_n$ in (10) will have no power to detect serial dependence. If, however, there is persistent positive serial correlation in $(W_t)$ leading to $\lim_{n \to \infty} \mathrm{var}(\sqrt{n}\,\bar{W}_n) > \sigma_W^2$ then the test statistic $Z_n$ will have some power to detect dependencies; however, more targeted tests of the independence property are available and are the subject of Section 5.

An early paper on backtesting in a risk-management setting is Kupiec (1995), who proposed a binomial likelihood ratio test for the number of VaR exceedances. Ziggel et al. (2014) offer a refinement of this count-based test. Campbell (2006) recommended testing exceedances at multiple levels, and introduced the Pearson chi-squared test in this context. Pérignon and Smith (2008) proposed a multilevel likelihood ratio test generalizing the binomial test of Kupiec (1995). A multinomial LRT also underlies the work of Colletaz et al. (2013) on the concept of a “risk map” to describe VaR exceedances at two different levels. Kratz et al. (2016) provide a comparison of unconditional multilevel tests (including Pearson and LRT) in a typical set-up for backtesting trading book models and advocate the use of Nass’s variant on the Pearson test for control of size and power.

Crnkovic and Drachman (1996) appear to have been first to advocate the use of PIT-values for backtesting risk management models.
They also allow for a weighting function that plays the role of our kernel density, but the distribution for the resulting test statistic must be simulated.5 The seminal paper of Diebold et al. (1998) described a number of tests for the uniformity and independence of PIT values. Berkowitz (2001) advocated a likelihood-ratio test based on fitting a truncated normal distribution to probit-transformed PIT-values for regulatory application.

5 The test of Crnkovic and Drachman (1996) is based on a weighted Kuiper distance between the distribution of PIT values and the uniform. They refer to their weighting scheme as a “worry” function, and propose that it should place higher weight on extreme PIT values.

Most closely related to our work, Du and Escanciano (2017) and Costanzino and Curran (2015) have proposed test statistics for spectral risk measures which can be viewed as special cases of our univariate spectral Z-test approach. Both papers consider a mathematical framework that permits a variety of kernels but focus on the case of a uniform kernel and interpret the tests in terms of backtesting expected shortfall. In contrast, we provide a general methodology that allows a bespoke choice of one or more kernels according to testing priorities, show how this embeds many existing tests and new tests and show how the framework may be easily generalized to the conditional

case.6 Other contributions using PIT-values include Kerkhof and Melenberg (2004), who derive VaR and expected shortfall backtesting statistics by applying a functional delta method to the empirical distribution function of PIT-values, and Zumbach (2006), who refers to PIT-values as probtiles.

In Section 4.1 we describe unconditional coverage tests based on discrete kernels. Continuous kernels are considered in Section 4.2. Mixed kernels emerge in Section 4.3 through the study of tests based on a truncated probitnormal distribution.

4.1 Discrete weighting

Discrete tests are based on the univariate transformation $W_t = \sum_{i=1}^m \gamma_i \mathbb{1}_{\{P_t > \alpha_i\}}$ as defined in (4) and the multivariate transformation $\mathbf{W}_t = (\mathbb{1}_{\{P_t > \alpha_1\}}, \ldots, \mathbb{1}_{\{P_t > \alpha_m\}})'$ in (8) for the same set of ordered levels $\alpha_1 < \cdots < \alpha_m$. Obviously, when $m = 1$ (and $\gamma_1 = 1$) both transformations yield $W_t = \mathbb{1}_{\{P_t > \alpha\}}$, so that we obtain iid Bernoulli($1 - \alpha$) variables under the null hypothesis (9). This is the basis for standard VaR exceedance testing based on the binomial distribution. The case $m > 1$ yields multinomial tests. We consider first the binomial case followed by the multinomial case, in each case treating the LRT followed by the Z-test.

A two-sided binomial LRT of the null $p = 1 - \alpha$ against the alternative $p \ne 1 - \alpha$ can be based on the asymptotic chi-squared distribution of the LR statistic under the null in (12); this is the approach taken in Kupiec (1995) and Christoffersen (1998). Note that the traffic-light system and model approval rules under Basel (see, e.g., Basel Committee on Bank Supervision, 2016, Appendix B) are actually based on a one-sided LRT of the null hypothesis against the simple alternative $p = p_1$ for $p_1 > 1 - \alpha$; this amounts to comparing the exception count $\sum_{t=1}^n W_t$ to a critical value defined by the binomial distribution.

The Z-test statistic (10) for $W_t = \mathbb{1}_{\{P_t > \alpha\}}$ coincides with the binomial score test statistic

$$Z_n = \frac{\sqrt{n}\,\big(\bar{W}_n - (1 - \alpha)\big)}{\sqrt{\alpha(1 - \alpha)}}. \tag{13}$$

Kratz et al.
(2016) give a comparison of different binomial tests and find that the binomial score test perfoms best for the probability levels and sample sizes that are of typical regulatory interest. When m > 1 the variables W = (cid:80)m γ 1 take the ordered values Γ < t i=1 i {Pt>αi} 0 Γ < ··· < Γ where Γ = 0 and Γ = (cid:80)k γ for k = 1,...,m. Under the null 1 m 0 k i=1 i 6Du and Escanciano (2017) also show how the asymptotic distribution of the test can be adapted to account for estimation error. We view this as less relevant in our setting since a regulator will tend totakethestrictlinethatbacktestsshouldpenalizeafailuretoestimatemodelsaccuratelyevenwhen the models used are essentially correct in form. 12

hypothesis (9) the distributions of W and W satisfy t t P(W = Γ ) = P(1′W = i) = α −α , i ∈ {0,1,...,m}, (14) t i t i+1 i where α = 0 and α = 1. In both cases this describes a multinomial distribution. 0 m+1 The multinomial generalization of the binomial LRT of Kupiec (1995) as proposed by Pérignon and Smith (2008) is nested in our framework. The test depends on the spectrallytransformedPITvaluesthroughtheobservedcellcountsO = (cid:80)n 1 i t=1 {Wt=Γi} (univariate transformation) or O = (cid:80)n 1 (multivariate transformation). i t=1 {1′Wt=i} Note in the former case that the cumulative weights Γ play no role in the resulting test i statistic, a consequence of the invariance property of the LRT noted in Section 3.3. The univariate and multivariate tranformations do however result in different Ztests which can be considered as alternative generalizations of the binomial score test. In the univariate case we can apply Proposition 3.1 to obtain m i (cid:88) (cid:88) W2 = γ∗1 where γ∗ = 2γ γ −γ2 = 2γ Γ γ2, t i {Pt>αi} i i j i i i i i=1 j=1 from which it is straightforward to calculate that the first two moments of W are given t by m m (cid:88) (cid:88) µ = γ (1−α ), σ2 = γ∗(1−α )−µ2 . W i i W i i W i=1 i=1 Hence we can construct a Z-test based on the statistic Z in (10) and vary the weights n γ to emphasise different levels α . i i In the multivariate case, if we construct an m-spectral Z-test as in (11), then we obtain the classical Pearson chi-squared statistic as proposed by Campbell (2006). Theorem 4.1. n(W −µ )′Σ−1(W −µ ) = (cid:88) m (O i −nθ i )2 n W W n W nθ i i=0 where O = (cid:80)n 1 and θ = α −α for i = 0,...,m. 
i t=1 {1′Wt=i} i i+1 i The Pearson test statistic S = (cid:80)m (O −nθ )2/(nθ ) is usually compared with m i=0 i i i a chi-squared distribution with m degrees of freedom; Theorem 4.1 in fact provides a proof of the asymptotic law of the Pearson test by showing that it can be written as an m-spectral Z-test.7 7Pearson’stestisknowntoperformpoorlywhencellcountsaresmall,whichistypicallythecasein our tail-focussed applications. Nass’s variant on the test (Nass, 1959), which is based on an improved approximation to the distribution of S gives improved results; see Cai and Krishnamoorthy (2006) m and Kratz et al. (2016) for more details of the approximation. 13
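As a minimal sketch of the discrete tests in this subsection, the following Python functions compute the binomial score statistic (13), the univariate multilevel Z-statistic from the moments $\mu_W$ and $\sigma_W^2$ given above, and the Pearson statistic $S_m$. The function names and the list-of-floats input format are our own illustrative assumptions, not part of the paper.

```python
import math

def binomial_score_z(pits, alpha):
    """Z_n of (13): W_t = 1{P_t > alpha} has null mean 1 - alpha and variance alpha(1 - alpha)."""
    n = len(pits)
    w_bar = sum(p > alpha for p in pits) / n
    return math.sqrt(n) * (w_bar - (1 - alpha)) / math.sqrt(alpha * (1 - alpha))

def multilevel_z(pits, alphas, gammas):
    """Univariate Z-test for W_t = sum_i gamma_i 1{P_t > alpha_i}; alphas must be increasing."""
    n = len(pits)
    cum = mu = second = 0.0
    for a, g in zip(alphas, gammas):
        cum += g                                    # Gamma_i = gamma_1 + ... + gamma_i
        mu += g * (1 - a)                           # contribution to mu_W
        second += (2 * g * cum - g * g) * (1 - a)   # gamma*_i (1 - alpha_i), builds E(W_t^2)
    w_bar = sum(g for p in pits for a, g in zip(alphas, gammas) if p > a) / n
    return math.sqrt(n) * (w_bar - mu) / math.sqrt(second - mu * mu)

def pearson_stat(pits, alphas):
    """S_m: cell i holds PIT values exceeding exactly i levels; theta_i = alpha_{i+1} - alpha_i."""
    n = len(pits)
    edges = [0.0] + list(alphas) + [1.0]
    counts = [0] * (len(alphas) + 1)
    for p in pits:
        counts[sum(p > a for a in alphas)] += 1
    stat = 0.0
    for i, observed in enumerate(counts):
        expected = n * (edges[i + 1] - edges[i])
        stat += (observed - expected) ** 2 / expected
    return stat
```

With a single level ($m = 1$, $\gamma_1 = 1$) the multilevel statistic reduces to the binomial score statistic, and the Pearson statistic equals its square, in line with Theorem 4.1. Small expected cell counts make the chi-squared approximation for `pearson_stat` unreliable, as footnote 7 cautions.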

4.2 Continuous weighting

In this section, $W_t$ takes the form of (5) for a kernel density $g$ satisfying Assumption 1; we also consider a bispectral test where $\mathbf{W}_t = (W_{t,1}, W_{t,2})'$ is constructed from two different kernel densities on the same kernel window.

In the univariate case, we apply the Z-test approach described in (10). It follows from the application of Proposition 3.2 in the case where $W_{t,1} = W_{t,2} = W_t$ that, under the null hypothesis (9),

$$E(W_t) = \int_{\alpha_1}^{\alpha_2} g(u)(1-u)\,du \quad\text{and}\quad E(W_t^2) = \int_{\alpha_1}^{\alpha_2} 2\,g(u)\,G(u)\,(1-u)\,du.$$

These moments are straightforward to calculate analytically for a wide variety of kernel densities, e.g., based on linear, quadratic, or exponential functions, or on beta-type densities of the form $(u-\alpha_1)^{a-1}(\alpha_2-u)^{b-1}$ for $a, b > 0$. Thus, our compact presentation of the continuous spectral Z-test subsumes a very large family of possible tests.

The bispectral generalization is a new test that extends the idea of the continuous spectral Z-test. For a bivariate spectral transformation $\mathbf{W}_t = (W_{t,1}, W_{t,2})'$ based on two distinct kernel densities $g_1$ and $g_2$ with the same kernel window it is straightforward to calculate $\boldsymbol{\mu}_W = E(\mathbf{W}_t)$ and $\Sigma_W = \mathrm{cov}(\mathbf{W}_t)$. The off-diagonal element of the matrix $\Sigma_W$ requires the calculation of $E(W_{t,1}W_{t,2})$, which can be achieved using Proposition 3.2. The test is based on assuming that, for large enough $n$, the statistic $T_n$ of (11) is distributed $\chi^2_2$ under $H_0$.

The intuition for the bispectral test is that by considering two different spectral transformations we can test for two different features of the distribution of reported PIT values in the tail. We could consider higher-dimensional generalizations, but the empirical results of Section 6 and the simulation results in our companion paper show that the bivariate test works well.
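The moment formulas above can be checked numerically. The sketch below is illustrative only: it assumes that the transformation (5) has the form $W_t = \int_{\alpha_1}^{\alpha_2} g(u) 1_{\{P_t > u\}}\,du$ and that $G(u) = \int_{\alpha_1}^{u} g(s)\,ds$, and it replaces the analytical moments with a midpoint rule. All function names are hypothetical.

```python
import math

def beta_kernel(a, b, a1, a2):
    """Unnormalised beta-type kernel (u - a1)^(a-1) (a2 - u)^(b-1) on [a1, a2]."""
    return lambda u: (u - a1) ** (a - 1) * (a2 - u) ** (b - 1)

def spectral_moments(g, a1, a2, steps=2000):
    """Midpoint-rule approximations of E(W_t) and E(W_t^2) under the null."""
    h = (a2 - a1) / steps
    big_g = 0.0                       # running kernel mass G(u) up to the left cell edge
    m1 = m2 = 0.0
    for j in range(steps):
        u = a1 + (j + 0.5) * h
        gu = g(u)
        g_mid = big_g + 0.5 * gu * h  # G evaluated at the cell midpoint
        m1 += gu * (1 - u) * h
        m2 += 2 * gu * g_mid * (1 - u) * h
        big_g += gu * h
    return m1, m2

def spectral_transform(p, g, a1, a2, steps=2000):
    """W_t = integral of g(u) 1{p > u} over [a1, a2], i.e. kernel mass below min(p, a2)."""
    upper = min(max(p, a1), a2)
    if upper <= a1:
        return 0.0
    h = (upper - a1) / steps
    return sum(g(a1 + (j + 0.5) * h) for j in range(steps)) * h

def spectral_z(pits, g, a1, a2):
    """Z-statistic of the form (10) for the kernel g on the window [a1, a2]."""
    m1, m2 = spectral_moments(g, a1, a2)
    n = len(pits)
    w_bar = sum(spectral_transform(p, g, a1, a2) for p in pits) / n
    return math.sqrt(n) * (w_bar - m1) / math.sqrt(m2 - m1 * m1)
```

For example, `beta_kernel(1, 1, 0.985, 0.995)` gives a uniform kernel and `beta_kernel(2, 1, 0.985, 0.995)` a linearly increasing one. The midpoint rule is adequate for $a, b \geq 1$; kernels with $a < 1$ or $b < 1$ are singular at the window endpoints and would be better served by the analytical moments.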
4.3 Tests based on truncated probitnormal distribution

The tests in this section nest the null hypothesis (9) in a model where the underlying reported PIT values $P_1,\ldots,P_n$ have a probitnormal distribution satisfying $\Phi^{-1}(P_t) \sim N(\mu,\sigma^2)$. Writing $\boldsymbol{\theta} = (\mu,\sigma)'$, the distribution function and density of $P_t$ are respectively

$$F_P(p \mid \boldsymbol{\theta}) = \Phi\!\left(\frac{\Phi^{-1}(p)-\mu}{\sigma}\right), \qquad f_P(p \mid \boldsymbol{\theta}) = \frac{\phi\!\left(\frac{\Phi^{-1}(p)-\mu}{\sigma}\right)}{\phi(\Phi^{-1}(p))\,\sigma}, \qquad p \in [0,1], \tag{15}$$

which gives a flexible family containing the uniform distribution, which corresponds to $\boldsymbol{\theta} = \boldsymbol{\theta}_0 = (0,1)'$. Other choices of nesting model are possible, for example a beta distribution.
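The two functions in (15) are easy to evaluate with Python's standard library. The sketch below uses `statistics.NormalDist` for $\Phi$, $\Phi^{-1}$ and $\phi$; the function names are our own, and the inputs are restricted to $0 < p < 1$ because `inv_cdf` raises an error at the endpoints.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: supplies Phi, Phi^{-1} and phi

def probitnormal_cdf(p, mu, sigma):
    """F_P(p | theta) in (15), for 0 < p < 1 and sigma > 0."""
    return _STD_NORMAL.cdf((_STD_NORMAL.inv_cdf(p) - mu) / sigma)

def probitnormal_pdf(p, mu, sigma):
    """f_P(p | theta) in (15), for 0 < p < 1 and sigma > 0."""
    x = _STD_NORMAL.inv_cdf(p)
    return _STD_NORMAL.pdf((x - mu) / sigma) / (_STD_NORMAL.pdf(x) * sigma)
```

At $\boldsymbol{\theta}_0 = (0,1)'$ these reduce to the uniform case: `probitnormal_cdf(p, 0.0, 1.0)` returns `p` up to rounding and `probitnormal_pdf(p, 0.0, 1.0)` returns 1.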

The test statistics are based on the PIT values truncated to the interval $[\alpha_1,\alpha_2]$, that is, the values $P_t^* = \alpha_1 \vee (P_t \wedge \alpha_2)$. The likelihood contribution $L(\boldsymbol{\theta} \mid P_t^*)$ of an observation $P_t^*$ in the truncated model can be written as

$$L(\boldsymbol{\theta} \mid P_t^*) = \begin{cases} F_P(\alpha_1 \mid \boldsymbol{\theta}), & P_t^* = \alpha_1, \\ f_P(P_t^* \mid \boldsymbol{\theta}), & \alpha_1 < P_t^* < \alpha_2, \\ \overline{F}_P(\alpha_2 \mid \boldsymbol{\theta}), & P_t^* = \alpha_2. \end{cases} \tag{16}$$

See (A.1) for the explicit likelihood of the sample $P_1^*,\ldots,P_n^*$.

We first consider an LRT that $\boldsymbol{\theta} = \boldsymbol{\theta}_0$ against the alternative that $\boldsymbol{\theta} \neq \boldsymbol{\theta}_0$. Recall that (6) shows that spectrally transformed PIT values $W_t$ are given by continuous, strictly increasing transformations of the $P_t^*$. Theorem 3.3 implies that the LR test of the null hypothesis that the truncated PIT values $P_t^*$ have a truncated uniform distribution, against the alternative that they do not, is equivalent to a whole family of LR tests for the spectrally transformed PIT values under continuous weighting. In the case where $\alpha_2 = 1$, this test is also equivalent to the test proposed by Berkowitz (2001); in the case where $\alpha_2 < 1$ we obtain a generalization of the Berkowitz test, a Berkowitz interval test.[8]

[8] Berkowitz (2001) models the data $\Phi^{-1}(P_t^*)$ with a normal $N(\mu,\sigma^2)$ distribution truncated to $[\Phi^{-1}(\alpha_1),\infty)$. This coincides with our approach because $\Phi^{-1}$ is a continuous and strictly increasing transformation and Theorem 3.3 again applies.

An alternative to the LRT is the classical score test, which has the advantage that no maximization of the likelihood is required. It will turn out that this test is also a bispectral Z-test. Denote the observed score vector for $P_t^*$ by

$$\mathbf{S}_t(\boldsymbol{\theta}) = \left(\frac{\partial}{\partial\mu}\ln L(\boldsymbol{\theta} \mid P_t^*),\; \frac{\partial}{\partial\sigma}\ln L(\boldsymbol{\theta} \mid P_t^*)\right)' \tag{17}$$

and let $\overline{\mathbf{S}}_n(\boldsymbol{\theta}_0) = \frac{1}{n}\sum_{t=1}^n \mathbf{S}_t(\boldsymbol{\theta}_0)$ be the mean of the observed score vectors under the null. The score test follows from the asymptotic distribution

$$\sqrt{n}\,\overline{\mathbf{S}}_n(\boldsymbol{\theta}_0) \xrightarrow[n\to\infty]{d} N_2\bigl(\mathbf{0}, I(\boldsymbol{\theta}_0)\bigr),$$

where $I(\boldsymbol{\theta})$ denotes the expected Fisher information matrix. Consequently, for large $n$ we have approximately that

$$n\,\overline{\mathbf{S}}_n(\boldsymbol{\theta}_0)'\, I(\boldsymbol{\theta}_0)^{-1}\, \overline{\mathbf{S}}_n(\boldsymbol{\theta}_0) \sim \chi^2_2.$$

An analytical expression for $I(\boldsymbol{\theta}_0)$ is provided in Appendix B.

The following result shows that this is a bispectral test with the structure (11) under a generalization that allows some additional point mass at the endpoints of the interval $[\alpha_1,\alpha_2]$.

Theorem 4.2. $\mathbf{S}_t(\boldsymbol{\theta}_0) = \mathbf{W}_t - \boldsymbol{\mu}_W$, almost surely, where $W_{t,i}$ can be expressed as

$$W_{t,i} = \gamma_{i,1} 1_{\{P_t > \alpha_1\}} + \gamma_{i,2} 1_{\{P_t > \alpha_2\}} + \int_{\alpha_1}^{\alpha_2} g_i(u)\, 1_{\{P_t > u\}}\,du$$

for $\gamma_{i,1}$, $\gamma_{i,2}$ and $g_i(u)$ with analytical solution.

5 Tests of conditional coverage

Whereas unconditional tests are focused on testing for the hypothesized distribution $F_W^0$ of the spectrally transformed PIT-values, conditional backtests are joint tests of the correct distribution and the independence of $W_t$ and $\mathcal{F}_{t-1}$ for all $t$, as asserted by the null hypothesis (9). We have noted in Section 4 that the Z-tests presented there may have some limited power to detect the presence of serial dependencies. The aim in this section is to propose conditional extensions of our spectral tests that explicitly address the independence of $W_t$ and $\mathcal{F}_{t-1}$. These tests should have more power to detect departures from the null hypothesis resulting from a failure to use all the information in $\mathcal{F}_{t-1}$ when building the predictive model $\widehat{F}_t$. In the context of risk management, where models often fail to address time-varying volatility in adequate fashion, there is a particular need for tests of this kind.

In his early paper on backtesting, Kupiec (1995) proposed a test for independence of VaR exceedances based on the fact that the spacings between them should be geometrically distributed. This latter property follows from the fact that a series of VaR exceedances should behave like a Bernoulli trials process, that is, iid Bernoulli events with independent geometric waiting times.[9]

[9] Christoffersen and Pelletier (2004) further developed the idea of testing the spacings between VaR exceedances using the fact that a discrete geometric distribution can be approximated by a continuous exponential distribution. See McNeil et al. (2015) for more details of the theory.

The tests that we develop below follow an alternative regression-based approach to testing conditional coverage. Christoffersen (1998) proposed an early test in this vein in which the iid Bernoulli hypothesis for VaR exceedances is tested against the alternative hypothesis that VaR exceedances show first-order Markov dependence; this has been generalized to a multilevel test by Leccadito et al. (2014). The Christoffersen test can be viewed as a likelihood-ratio test that the parameters in a simple linear regression model are zero. An especially influential regression-based test is the dynamic quantile (DQ) test of Engle and Manganelli (2004), in which exceedance indicators are regressed on lagged exceedance indicators and lagged estimates of VaR to assess the null hypothesis of independent exceedances occurring at the desired rate. Our martingale difference framework generalizes the DQ test and includes a variant on the Christoffersen (1998) test.

There are a number of other tests that are related to, but not directly subsumed by, the regression-based testing approach we develop below. Berkowitz et al. (2011) suggest adapting the DQ test to use a standard link function for modelling binary response data, resulting in a generalized linear regression model. Dumitrescu et al. (2012) build on this idea by considering the application to backtesting of the dynamic binary model of Kauppi and Saikkonen (2008). Hurlin and Tokpavi (2007) propose a multivariate portmanteau test based on the autocorrelations of VaR exceedances at different lags and different confidence levels. Leccadito et al. (2014) propose a generalization of the Pearson multilevel test to test for independence of numbers of level exceedances across time periods. Du and Escanciano (2017) develop a Box-Pierce-type test based on a backtest statistic for expected shortfall that takes PIT values as input. Berkowitz et al. (2011) provide a comprehensive overview of tests of conditional coverage and advocate the DQ and geometric spacing tests in particular.

In the following subsections, we consider testing for the independence of transformed reported PIT-values within a regression or conditional framework. We introduce the notation $(\widetilde{W}_t)$ for the sequence of transformed reported PIT-values $\widetilde{W}_t = W_t - \mu_W$ centered at their theoretical mean $\mu_W$ under the null hypothesis (9). Recall from Section 2 that the filtration $(\mathcal{F}_t)$ represents the information available to the risk manager and that $P_t$ is $\mathcal{F}_t$-measurable. We test that $(\widetilde{W}_t)$ has the martingale difference (MD) property with respect to $(\mathcal{F}_t)$:

$$E(\widetilde{W}_t \mid \mathcal{F}_{t-1}) = 0 \tag{18}$$

which is necessary for (9) to hold.

5.1 Conditional spectral Z-test

When the MD property (18) holds, we must have $E(h_{t-1}\widetilde{W}_t) = 0$ for any $\mathcal{F}_{t-1}$-measurable random variable $h_{t-1}$. We form the $(k+1)$-dimensional lagged vector

$$\mathbf{h}_{t-1} = (1, h(P_{t-1}), \ldots, h(P_{t-k}))'$$

for a function $h$, to which we refer as a conditioning variable transformation (CVT). To guarantee the existence of the second moment of $\mathbf{h}_{t-1}$, we assume that $(P_t)$ is covariance-stationary and that $h$ is bounded.[10] Particular examples that we will use in our empirical analysis are $h(p) = 1_{\{p > \alpha\}}$ for some $\alpha$ and $h(p) = |2p-1|^c$ for $c > 0$.

[10] The restriction on $h$ can be relaxed considerably, but in practice we find that bounded functions lead to more stable tests.

We base our test on the vector-valued process $\mathbf{Y}_t = \mathbf{h}_{t-1}\widetilde{W}_t$ for $t = k+1,\ldots,n$. Under the null hypothesis (9), $(\mathbf{Y}_t)$ is a MD sequence satisfying $E(\mathbf{Y}_t \mid \mathcal{F}_{t-1}) = \mathbf{0}$. We want to test that $\mathbf{Y}_{k+1},\ldots,\mathbf{Y}_n$ are close to the zero vector on average. The conditional predictive test of Giacomini and White (2006), which was developed for comparing forecasting methods, can be applied in this context. Let $\overline{\mathbf{Y}}_{n,k} = (n-k)^{-1}\sum_{t=k+1}^n \mathbf{Y}_t$ and let $\widehat{\Sigma}_Y$ denote a consistent estimator of $\Sigma_Y := \mathrm{cov}(\mathbf{Y}_t)$. Giacomini and White show that under very weak assumptions, for large enough $n$ and fixed $k$,

$$(n-k)\,\overline{\mathbf{Y}}_{n,k}'\,\widehat{\Sigma}_Y^{-1}\,\overline{\mathbf{Y}}_{n,k} \sim \chi^2_{k+1}. \tag{19}$$

Giacomini and White (2006) use the estimator $\widehat{\Sigma}_Y^{GW} = (n-k)^{-1}\sum_{t=k+1}^n \mathbf{Y}_t\mathbf{Y}_t'$ but we can use the fact that $E(\widetilde{W}_t^2 \mid \mathcal{F}_{t-1}) = \sigma_W^2$ for all $t$ under the null hypothesis (9) to form an alternative estimator. We compute that

$$\Sigma_Y = E\bigl(\mathrm{cov}(\mathbf{Y}_t \mid \mathcal{F}_{t-1})\bigr) = E\bigl(E(\mathbf{Y}_t\mathbf{Y}_t' \mid \mathcal{F}_{t-1})\bigr) = E\bigl(\mathbf{h}_{t-1}\mathbf{h}_{t-1}'\,E(\widetilde{W}_t^2 \mid \mathcal{F}_{t-1})\bigr) = \sigma_W^2 H \tag{20}$$

where $H = E(\mathbf{h}_{t-1}\mathbf{h}_{t-1}')$, which suggests the estimator $\widehat{\Sigma}_Y = \sigma_W^2\widehat{H}$ where[11]

$$\widehat{H} = (n-k)^{-1}\sum_{t=k+1}^n \mathbf{h}_{t-1}\mathbf{h}_{t-1}'. \tag{21}$$

[11] We have also experimented with the test obtained under the stronger hypothesis that the $P_t$ are uniform, which allows us to calculate $H = \mathrm{diag}(1, E(h(P_t)^2), \ldots, E(h(P_t)^2))$ analytically. The resulting test has poorer size and is somewhat in conflict with our general philosophy that we should focus tests for uniformity in the region where we require the risk model to perform.

The decomposition in (20) has the advantage that it generalizes our unconditional spectral Z-test, which may be thought of as the case $k = 0$. The case $k = 1$ may be viewed as a Z-test version of the first-order Markov chain test of Christoffersen (1998).

Moreover, as we now show, our conditional test contains as a special case the dynamic quantile (DQ) test statistic proposed by Engle and Manganelli (2004). Let $X$ be the $(n-k) \times (k+1)$ matrix whose rows are given by $\mathbf{h}_{t-1}'$ for $t = k+1,\ldots,n$. Let $\widetilde{\mathbf{W}} = (\widetilde{W}_{k+1},\ldots,\widetilde{W}_n)'$. It follows that

$$\widehat{\Sigma}_Y = \sigma_W^2 (n-k)^{-1}\sum_{t=k+1}^n \mathbf{h}_{t-1}\mathbf{h}_{t-1}' = \sigma_W^2 (n-k)^{-1} X'X$$

and $\overline{\mathbf{Y}}_{n,k} = (n-k)^{-1}X'\widetilde{\mathbf{W}}$, so that (19) may be rewritten as

$$\sigma_W^{-2}\,\widetilde{\mathbf{W}}'X(X'X)^{-1}X'\widetilde{\mathbf{W}} \sim \chi^2_{k+1}. \tag{22}$$

The DQ test statistic of Engle and Manganelli (2004) corresponds to the binomial score case, i.e., the case where $W_t = 1_{\{P_t > \alpha\}}$ and the CVT is $h(p) = 1_{\{p > \alpha\}}$.[12]

[12] Engle and Manganelli (2004) allow as well for lagged VaR values to be included as regressors, but change in portfolio composition implies that lagged VaR values are less informative than lagged PIT values.

For an alternative interpretation of our test, consider the time series regression model

$$\widetilde{W}_t = \beta_0 + \sum_{i=1}^k \beta_i h(P_{t-i}) + \epsilon_t, \quad t = k+1,\ldots,n \tag{23}$$

for which $X$ is the design matrix. Under the standard assumptions for time series regression and assuming homoscedastic errors with known variance $\sigma_W^2$, the least squares estimator of $\boldsymbol{\beta} = (\beta_0,\ldots,\beta_k)'$ is $(X'X)^{-1}X'\widetilde{\mathbf{W}}$ and this is asymptotically normal with covariance matrix $\sigma_W^2(X'X)^{-1}$. Thus expression (22) describes the natural chi-squared test that $\boldsymbol{\beta} = \mathbf{0}$.

5.2 Conditional bispectral Z-test

The conditional spectral Z-test generalizes to a conditional bispectral Z-test. We construct two sets of transformed reported PIT-values $(W_{t,1}, W_{t,2})$ for $t = 1,\ldots,n$, and form the vector $\mathbf{Y}_t$ of length $k_1 + k_2 + 2$ given by

$$\mathbf{Y}_t = \bigl(\mathbf{h}_{t-1,1}'\widetilde{W}_{t,1},\; \mathbf{h}_{t-1,2}'\widetilde{W}_{t,2}\bigr)', \tag{24}$$

where $\widetilde{W}_{t,i} = W_{t,i} - \mu_{W,i}$ and $\mathbf{h}_{t-1,i} = (1, h_i(P_{t-1}), \ldots, h_i(P_{t-k_i}))'$. Parallel to the previous section, let $\overline{\mathbf{Y}}_{n,k} = (n-k)^{-1}\sum_{t=k+1}^n \mathbf{Y}_t$ for $k = k_1 \vee k_2$, and let $\widehat{\Sigma}_Y$ denote a consistent estimator of $\Sigma_Y := \mathrm{cov}(\mathbf{Y}_t)$. By the theory of Giacomini and White (2006), for $n$ large and $(k_1, k_2)$ fixed,

$$(n-k)\,\overline{\mathbf{Y}}_{n,k}'\,\widehat{\Sigma}_Y^{-1}\,\overline{\mathbf{Y}}_{n,k} \sim \chi^2_{k_1+k_2+2}. \tag{25}$$

Working under the null hypothesis, we can generalize (20) to $\Sigma_Y = A_W \circ H$, where $\circ$ denotes element-by-element multiplication (Hadamard product). The matrices are

$$H = \begin{pmatrix} E(\mathbf{h}_{t-1,1}\mathbf{h}_{t-1,1}') & E(\mathbf{h}_{t-1,1}\mathbf{h}_{t-1,2}') \\ E(\mathbf{h}_{t-1,2}\mathbf{h}_{t-1,1}') & E(\mathbf{h}_{t-1,2}\mathbf{h}_{t-1,2}') \end{pmatrix}$$

and

$$A_W = \begin{pmatrix} \sigma_{W,1}^2 J_{k_1+1,k_1+1} & \sigma_{W,12} J_{k_1+1,k_2+1} \\ \sigma_{W,12} J_{k_2+1,k_1+1} & \sigma_{W,2}^2 J_{k_2+1,k_2+1} \end{pmatrix} \tag{26}$$

where $J_{m,n}$ denotes the $m \times n$ matrix of ones and $\sigma_{W,12} = E(\widetilde{W}_{t,1}\widetilde{W}_{t,2})$. Our tests use the estimator $\widehat{\Sigma}_Y = A_W \circ \widehat{H}$, where $\widehat{H}$ generalizes (21) as

$$\widehat{H} = (n - (k_1 \vee k_2))^{-1} \sum_{t=(k_1 \vee k_2)+1}^n (\mathbf{h}_{t-1,1}', \mathbf{h}_{t-1,2}')'\,(\mathbf{h}_{t-1,1}', \mathbf{h}_{t-1,2}'). \tag{27}$$
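The estimator $\widehat{\Sigma}_Y = A_W \circ \widehat{H}$ and the statistic (25) can be assembled directly from (24), (26) and (27). The sketch below is one possible arrangement in plain Python; the function names, argument order and zero-based indexing are assumptions made for illustration, and the null moments `var1`, `var2` and `cov12` (that is, $\sigma_{W,1}^2$, $\sigma_{W,2}^2$ and $\sigma_{W,12}$) are taken as given.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))) / m[r][r]
    return x

def conditional_bispectral_stat(w1, w2, pits, h1, h2, k1, k2, var1, var2, cov12):
    """Statistic in (25); compared with chi-squared on k1 + k2 + 2 df under the null."""
    n, k = len(pits), max(k1, k2)
    d1, d2 = k1 + 1, k2 + 1
    dim = d1 + d2
    y_sum = [0.0] * dim
    h_sum = [[0.0] * dim for _ in range(dim)]
    for t in range(k, n):
        lag1 = [1.0] + [h1(pits[t - i]) for i in range(1, d1)]   # h_{t-1,1}
        lag2 = [1.0] + [h2(pits[t - i]) for i in range(1, d2)]   # h_{t-1,2}
        hv = lag1 + lag2
        y = [v * w1[t] for v in lag1] + [v * w2[t] for v in lag2]  # Y_t of (24)
        for i in range(dim):
            y_sum[i] += y[i]
            for j in range(dim):
                h_sum[i][j] += hv[i] * hv[j]
    m = n - k
    y_bar = [v / m for v in y_sum]

    def a_w(i, j):
        # Block structure of A_W in (26): variances on the diagonal blocks, covariance elsewhere.
        if i < d1 and j < d1:
            return var1
        if i >= d1 and j >= d1:
            return var2
        return cov12

    sigma_hat = [[a_w(i, j) * h_sum[i][j] / m for j in range(dim)] for i in range(dim)]
    x = solve(sigma_hat, y_bar)
    return m * sum(a * b for a, b in zip(x, y_bar))
```

Setting `k1 = k2 = 0` recovers an unconditional bispectral statistic of the form $n\,\overline{\mathbf{W}}'\Sigma_W^{-1}\overline{\mathbf{W}}$ for the centred values, which gives a convenient sanity check.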

5.3 Conditional probitnormal score test

The theory of the conditional bispectral test carries over to the probitnormal case. Letting $\boldsymbol{\theta} = (\mu, \beta_1, \ldots, \beta_k, \sigma)'$, consider a regression extension of (15) in which

$$F_{P_t \mid P_{t-1},\ldots,P_{t-k}}(p \mid \boldsymbol{\theta}, p_{t-1},\ldots,p_{t-k}) = \Phi\!\left(\frac{\Phi^{-1}(p) - \mu - \sum_{i=1}^k \beta_i h(p_{t-i})}{\sigma}\right) \tag{28}$$

and write $f_{P_t \mid P_{t-1},\ldots,P_{t-k}}$ for the corresponding conditional density. This gives a dynamic model in which we can test for $\boldsymbol{\theta} = \boldsymbol{\theta}_0 = (0,\ldots,0,1)'$.

As in Section 4.3, we model truncated PIT values $P_t^* = \alpha_1 \vee (P_t \wedge \alpha_2)$, but here we condition on information about past PIT values. The likelihood contribution of an observation $P_t^*$ in the truncated model can be written as

$$L(\boldsymbol{\theta} \mid P_t^*, P_{t-1},\ldots,P_{t-k}) = \begin{cases} F_{P_t \mid P_{t-1},\ldots,P_{t-k}}(\alpha_1 \mid \boldsymbol{\theta}, P_{t-1},\ldots,P_{t-k}), & P_t^* = \alpha_1, \\ f_{P_t \mid P_{t-1},\ldots,P_{t-k}}(P_t^* \mid \boldsymbol{\theta}, P_{t-1},\ldots,P_{t-k}), & \alpha_1 < P_t^* < \alpha_2, \\ \overline{F}_{P_t \mid P_{t-1},\ldots,P_{t-k}}(\alpha_2 \mid \boldsymbol{\theta}, P_{t-1},\ldots,P_{t-k}), & P_t^* = \alpha_2. \end{cases} \tag{29}$$

The following result shows that the score test of the null hypothesis (9) in the regression model described by (29) takes precisely the form (24) for a conditional bispectral test.

Proposition 5.1. The score statistic $\widetilde{\mathbf{S}}_t(\boldsymbol{\theta})$ for the model described by (29) satisfies

$$\widetilde{\mathbf{S}}_t(\boldsymbol{\theta}_0) = \bigl(\mathbf{h}_{t-1,1}'\widetilde{W}_{t,1},\; \widetilde{W}_{t,2}\bigr)'$$

where $\mathbf{h}_{t-1,1} = (1, h(P_{t-1}), \ldots, h(P_{t-k}))'$, $\widetilde{W}_{t,i} = S_{t,i}(\boldsymbol{\theta}_0)$ and $S_{t,i}(\boldsymbol{\theta}_0)$ denotes a component of the score vector in (17).

6 Application to bank-reported PIT values

We apply our spectral backtests to a set of ten samples of PIT values reported by US banks to the Federal Reserve Board. Due to the generality of our framework, design of such an empirical exercise involves choices along several dimensions, most notably with respect to test type (Z-test vs LRT), kernel function, and kernel window. To guide these choices, we have conducted an extensive set of simulation analyses, which are available from the authors in a companion paper. For the tests of unconditional coverage, we summarize our key findings as follows.
First, power typically increases with the width of the kernel window, but counterexamples abound. Intuitively, a test is most powerful in rejecting a false model when the kernel function weights heavily on probability levels for which the inverse cdf of the risk manager's model diverges from the true model inverse cdf. If widening the window leads to increased weight in the neighborhood of a crossing between the two cdfs, power may diminish. As historical simulation in particular tends to understate the tails of the distribution, in practice we expect that the most powerful tests will weight heavily on extreme probability levels. However, this can come at the expense of the stability of the test, in the sense that the outcome can be determined by the presence or absence of one or two very large reported PIT-values. Furthermore, testing at extreme tail values of $\alpha$ runs counter to the primary regulatory motivation for the backtest, which is to verify the bank's 99% VaR.

Second, multinomial and truncated probitnormal LR tests are outperformed by the corresponding score tests. They are similar in power, but the LRT tends to be oversized. Overall, the Pearson and truncated probitnormal score tests are among the most powerful in our study, so in the exercises below we include these tests and exclude the corresponding LR tests.

Third, for the discrete tests, we find that 3-level tests perform as well as 5-level tests. Therefore, we focus on the 3-level case in the multinomial tests below.

Fourth, bispectral tests tend to be more powerful than (single-kernel) spectral tests. However, when the two kernels are too similar in shape, the gain in information from combining these kernels is insufficient to compensate for the increased degrees of freedom in the $\chi^2$ test.

6.1 Data

Our data consist of ten confidential backtesting samples provided by US banks to the Federal Reserve Board at the subportfolio level. Mandatory reporting to bank regulators pursuant to the Market Risk Rule took effect on January 1, 2013. For each significant subportfolio and each business day, the bank is required to report the overnight VaR at the 99% level, the realized clean P&L, and the associated PIT-value (see Federal Register, 2012, p. 53105). While the first two fields have been available to regulators for a long time (at least at an aggregate trading book level), access to PIT values is new.

Each of our ten samples represents returns on an equity or foreign exchange subportfolio. We have data on both subportfolios for four banks, and for two banks we have data on only one subportfolio each.
Banks have some discretion in defining subportfolios, but in general these are broader than what might be associated with a "trading desk." The equity subportfolio, for example, is likely to contain equity derivatives (vanilla and exotic) as well as cash positions. All of the samples lie within the three-year period from 2013–2015, inclusive.

Summary statistics for the unconditional distributions are found in Table 1. Six series span the entire period, and the shortest sample is about one year in length. As is often the case with new regulatory reporting requirements, data quality is not uniform. Two of the samples (coded Pf104 and Pf110) have a significant number of missing values (3.4% and 6.7%, respectively). Furthermore, close inspection reveals that most of the samples contain a small number of observations that are clearly or very likely to be spurious, e.g., a PIT value of 1 matched to a realized loss that was smaller than the forecast VaR. We developed a heuristic procedure to identify spurious values based on the distance between the reported PIT-value and an imputed value. The latter is constructed using a portfolio-specific model that fits PIT to the ratio of realized loss to VaR; see Appendix C for details. In test results reported below, we treat spurious values as missing to make the tests less sensitive to reporting error. Our conclusions are qualitatively robust to taking all non-missing observations as valid.

Remaining columns of the table provide a histogram of PIT values. For some portfolios, the histograms appear to be unconditionally close to uniform. For example, for Pf109, 87.9% of PIT values lie in [0.05,0.95) and remaining mass appears to be symmetrically distributed. For some other portfolios, tail PIT values are underrepresented (e.g., Pf104, Pf107) or overrepresented (e.g., Pf110) in the sample.

6.2 A menagerie of tests and kernel functions

We consider kernels of discrete, continuous, and mixed form. All the backtests described below fall within our spectral Z-test class. All reported p-values are based on two-sided tests, though one-sided versions of some tests are of course available.

Parameters $\alpha_1$ and $\alpha_2$ control the kernel window. For the continuous tests, $\alpha_1$ and $\alpha_2$ are the infimum and supremum of the kernel support. For the discrete case, we consider 3-level kernels at the set of points $(\alpha_1, \alpha^*, \alpha_2)$, where $\alpha^* = 0.99$ is the conventional VaR level. We define a narrow window for which $\alpha_1 = 0.985$ and $\alpha_2 = 0.995$, and a wide window for which $\alpha_1 = 0.95$ and $\alpha_2 = 0.995$. Observe that the narrow window is symmetric around $\alpha^*$, whereas the wide window is asymmetric.

For the continuous case, there are a wide variety of plausible candidates for the kernel density. Table 2 lists the kernel density functions on $[\alpha_1,\alpha_2]$ that we discuss below.
The uniform and hump-shaped Epanechnikov kernels are borrowed from the nonparametric statistics literature. The exponential kernel allows for weights that are either increasing ($\zeta > 0$) or decreasing ($\zeta < 0$) in $u$. All but the exponential kernel are special cases of the beta kernel. In view of the flexibility of the beta kernel class, in Appendix D we provide analytical solutions for the moments of the transformed PIT values for the general beta(a,b) case.

We next list the backtests to be implemented. For use in tables later, we assign each test a mnemonic.

Binomial score test: the two-sided binomial score test at level $\alpha^*$ (BIN).

3-level multinomial tests: we apply the Pearson test (Pearson3) and the Z-test with discrete uniform kernel (ZU3).

Continuous spectral tests: we apply tests based on the uniform kernel (ZU); the arcsin kernel (ZA); the Epanechnikov kernel (ZE); increasing (ZL+) and decreasing (ZL−) linear kernels; and increasing and decreasing exponential kernels (ZXζ) with parameter $\zeta$ of 2 and -2, respectively.

Continuous bispectral tests: we apply combinations of the increasing and decreasing linear kernels (ZLL), of exponential kernels with $\zeta = \pm 2$ (ZXX), and of the arcsin and Epanechnikov kernels (ZAE); we also apply the truncated probitnormal score test (PNS).

Table 1: Sample statistics. Trading dates for all portfolios fall between 2012-12-31 and 2015-12-31. Missing and spurious observations excluded from the reported frequencies.

      Trading    of which:            Frequencies
ID    days       Missing  Spurious    [0,.005)  [.005,.015)  [.015,.05)  [.05,.95)  [.95,.985)  [.985,.995)  [.995,1]
101   758        0        0           0.0119    0.0132       0.0290      0.8865     0.0396      0.0119       0.0079
102   751        0        8           0.0013    0.0081       0.0135      0.9341     0.0310      0.0081       0.0040
103   750        0        7           0.0121    0.0054       0.0081      0.9489     0.0175      0.0054       0.0027
104   646        22       8           0.0000    0.0000       0.0016      0.9951     0.0032      0.0000       0.0000
105   624        0        3           0.0145    0.0177       0.0290      0.8841     0.0290      0.0177       0.0081
106   252        0        0           0.0119    0.0278       0.0556      0.8333     0.0397      0.0238       0.0079
107   750        0        2           0.0000    0.0000       0.0013      0.9973     0.0013      0.0000       0.0000
108   758        0        3           0.0000    0.0053       0.0331      0.9179     0.0225      0.0119       0.0093
109   748        4        6           0.0095    0.0122       0.0420      0.8794     0.0352      0.0176       0.0041
110   646        43       7           0.0218    0.0252       0.0453      0.8389     0.0352      0.0201       0.0134

Table 2: Kernel density functions on $[\alpha_1,\alpha_2]$. $u^*$ denotes the rescaled value $u^* = (u-\alpha_1)/(\alpha_2-\alpha_1)$. Density functions are not scaled to integrate to 1. The exponential kernel is outside the class of beta kernels, so has no beta representation.

Kernel              Mnemonic   Density g(u)                          Beta representation
Uniform             ZU         $1$                                   1, 1
Arcsin              ZA         $1/\sqrt{u^*(1-u^*)}$                 1/2, 1/2
Epanechnikov        ZE         $1-(2u^*-1)^2$                        2, 2
Linear increasing   ZL+        $u^*$                                 2, 1
Linear decreasing   ZL−        $1-u^*$                               1, 2
Exponential         ZXζ        $\exp(\zeta u^*)$ for some $\zeta \in \mathbb{R}$   –

6.3 Tests of unconditional coverage

Table 3 presents p-values for the tests of unconditional coverage. When we adopt a narrow kernel window, we find that all of the tests reject at the 5% level the forecast model for portfolio Pf104 and at the 1% level for Pf107 and Pf110. In view of the histograms observed in Table 1, this is unsurprising. When an empirical distribution function (edf) lies above the uniform cdf within the kernel window (as observed for Pf104 and Pf107), large PIT values are underrepresented in the sample, which suggests that the forecast model overstates the upper quantiles of the loss distribution. When an edf lies below the uniform cdf (as observed for Pf110), large PIT values are overrepresented in the sample, which suggests that the forecast model understates the upper quantiles.

For four of the portfolios (Pf101, Pf102, Pf103 and Pf106), none of the tests reject. For the remaining three portfolios (Pf105, Pf108, Pf109), the test p-values vary considerably across the kernel functions.
This is to be expected and desirable, as the kernel functions prioritize different quantiles of the unconditional distribution. In the upper panel of Figure 1, we plot the edf for five of the portfolios (Pf101, Pf103, Pf104, Pf108 and Pf109) to illustrate the differences in test performance. We see that the edf for Pf101 is relatively close to the theoretical uniform cdf (dot-dash line) throughout the kernel window. The edf for Pf103 lies well above the theoretical cdf, but still is much closer to uniform than the edf for Pf104. This indicates that departures from uniformity must be fairly large to generate a test rejection in backtest 24

SNP EAZ LLZ LZ LZ EZ AZ UZ 3UZ 3nosraeP NIB wodniw DI − + 6535.0 7242.0 2385.0 0643.0 0164.0 6744.0 9823.0 4883.0 8513.0 2034.0 2406.0 worran 101 7405.0 8256.0 3856.0 8285.0 7524.0 2535.0 9654.0 2605.0 6912.0 7954.0 2406.0 ediw 0218.0 3457.0 3646.0 9417.0 4425.0 4256.0 2295.0 4026.0 0083.0 3264.0 0602.0 worran 201 1683.0 5343.0 2256.0 7073.0 2764.0 7554.0 3623.0 0993.0 4551.0 0793.0 0602.0 ediw 3792.0 6182.0 8402.0 5990.0 7781.0 7911.0 9631.0 3821.0 1511.0 4993.0 4201.0 worran 301 6900.0 6610.0 7330.0 7010.0 9230.0 5220.0 9900.0 1510.0 0500.0 5220.0 4201.0 ediw 2900.0 6610.0 5310.0 1400.0 0310.0 5700.0 2500.0 2600.0 6400.0 6420.0 6210.0 worran 401 0000.0 0000.0 0000.0 0000.0 0000.0 0000.0 0000.0 0000.0 0000.0 0000.0 6210.0 ediw 0680.0 1251.0 7011.0 1140.0 0970.0 2450.0 4250.0 1250.0 0950.0 7561.0 4621.0 worran 501 2221.0 8190.0 1880.0 3861.0 9050.0 5770.0 8041.0 3990.0 0572.0 8105.0 4621.0 ediw 1660.0 4441.0 1760.0 1360.0 8302.0 9621.0 0090.0 9501.0 6470.0 3051.0 4611.0 worran 601 0401.0 3110.0 2730.0 4510.0 5010.0 8700.0 7910.0 4110.0 2780.0 3382.0 4611.0 ediw 4300.0 9600.0 4500.0 6100.0 2600.0 2300.0 1200.0 6200.0 8100.0 8900.0 0600.0 worran 701 0000.0 0000.0 0000.0 0000.0 0000.0 0000.0 0000.0 0000.0 0000.0 0000.0 0600.0 ediw 1951.0 3321.0 8121.0 6570.0 3340.0 7740.0 1060.0 4550.0 0740.0 8490.0 3810.0 worran 801 5290.0 6948.0 8020.0 5254.0 9796.0 6637.0 1218.0 3377.0 3275.0 3120.0 3810.0 ediw 5171.0 8132.0 0820.0 3590.0 3293.0 6461.0 8022.0 6681.0 7633.0 5502.0 4233.0 worran 901 6236.0 2805.0 7204.0 4233.0 1602.0 5062.0 8692.0 4562.0 0514.0 6143.0 4233.0 ediw 2000.0 0000.0 0000.0 0000.0 0000.0 0000.0 0000.0 0000.0 1000.0 5100.0 2000.0 worran 011 2000.0 1300.0 2000.0 9200.0 1000.0 9000.0 7000.0 6000.0 1100.0 5200.0 2000.0 ediw .egarevoc lanoitidnocnu fo stseT :3 elbaT .]599.,59.[ si wodniw lenrek ediW dna ]599.,589.[ si wodniw lenrek worraN .noitcnuf lenrek dna ,wodniw lenrek ,oiloftrop yb seulav-p tset troper eW 25

samples of 2–3 years. WiththeexceptionofportfolioPf108,thecontinuousspectralandbispectralZ-tests tend to deliver lower p-values than the binomial score test. As seen in Figure 1, the edf for Pf108 is nearly flat in the lower half of the narrow window, and then rises sharply in the upper half. A step function at the center point α∗ = 0.99 is especially sensitive to this particular form of departure from uniformity, but its performance would not be robust to relatively small changes in a handful of observations. In the case of Pf109, the forecast model is rejected (at the 5% level) only by the bispectral ZLL test. Figure 1 reveals a crossing within the narrow kernel window betweentheedfandtheuniformcdf,whichimpliesthattheforecastmodelunderestimates quantilesatoneboundaryofthekernelwindowandoverestimatesquantilesattheother boundary. We refer to this as a slope deviation from the uniform cdf. The overall proximity of the edf to the uniform cdf presents a challenge for single-kernel spectral tests in general. In a bispectral test, by contract, when the two kernels differ markedly in how they weight the lower and upper ends of the kernel window, the test can effectively identify slope deviations. Backtests for portfolios Pf103 and Pf106 are most sensitive to the choice of kernel window. The associated forecast models are never rejected under the narrow window, but rejected by most of the tests for the wider window. (Of course, the binomial score test is invariant to the choice of kernel window.) For Pf105 and Pf109, however, the few rejections under the narrow window vanish under the wider window. For Pf108, we find that widening the window increases test sensitivity to the choice of kernel function. EDFs for these portfolios are depicted in the lower panel of Figure 1. 
For portfolios Pf103 and Pf106, the edf departs most markedly from uniformity on the expanded portion [.95,.985] of the wide window, whereas the edfs for Pf105 and Pf109 are relatively close to the uniform cdf within this region. Similar to what was observed for Pf109 within the narrow window, the ZLL test for Pf108 appears to be picking up the slope deviation associated with the single crossing between the edf and uniform cdf within the wide window.

For brevity, the tables omit results for the increasing and decreasing exponential kernels (ZX_{+2} and ZX_{−2}, respectively) and the bispectral test that combines them (ZXX). These exponential kernel functions coincide closely with the linear kernel functions, so we find for all portfolios that p-values are very similar when we substitute ZX_{+2} for ZL_+, ZX_{−2} for ZL_−, and ZXX for ZLL.

[Figure 1. Upper panel: narrow window, portfolios 101, 103, 104, 108, 109. Lower panel: wide window, portfolios 103, 105, 106, 108, 109. Horizontal axis: PIT; vertical axis: empirical CDF.]

Figure 1: Empirical distribution functions for select portfolios. EDFs for narrow window (upper panel) and wide window (lower panel). Note that the set of illustrated portfolios differs between the two panels.
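The comparison between the edf and the uniform cdf that Figure 1 displays can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function names `edf` and `edf_deviation` and the simulated PIT values are assumptions introduced here. It evaluates the edf on a grid inside a kernel window and reports the deviation from the uniform cdf; a change of sign of the deviation within the window corresponds to a crossing of the kind described for Pf109.

```python
import random

def edf(pit_values, u):
    """Empirical distribution function of the PIT values evaluated at u."""
    return sum(p <= u for p in pit_values) / len(pit_values)

def edf_deviation(pit_values, lower, upper, n_grid=11):
    """Return pairs (u, edf(u) - u) on an evenly spaced grid over [lower, upper].

    Under a correctly specified forecast model the PIT values are uniform,
    so the deviation should be close to zero at every grid point.
    """
    step = (upper - lower) / (n_grid - 1)
    grid = [lower + i * step for i in range(n_grid)]
    return [(u, edf(pit_values, u) - u) for u in grid]

# Hypothetical data: simulated uniform PIT values standing in for bank reports.
random.seed(0)
pits = [random.random() for _ in range(750)]  # roughly three years of daily data
for u, dev in edf_deviation(pits, 0.985, 0.995):
    print(f"u = {u:.3f}  edf(u) - u = {dev:+.4f}")
```

With real data, `pits` would be replaced by the reported PIT values of a single portfolio, and the window bounds by the narrow or wide kernel window.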

6.4 Tests of conditional coverage

Tests of conditional coverage involve all the design choices of the unconditional tests, and further require the choice of the number (k) of lagged PIT values and the conditioning variable transformation h(P). Define V(u) = |2u−1|; this V-shaped transformation of PIT values is well-suited to uncover dependence arising from stochastic volatility. We consider four candidates for the conditioning variable transformation (CVT):

EM: $h(P) = 1_{\{P > 0.99\}}$. This test regresses the spectrally transformed PIT-values on indicator variables for previous exceedances of the 99% VaR as in Engle and Manganelli (2004).

V.BIN: $h(P) = 1_{\{V(P) > 0.98\}}$. This two-tailed version of EM flags PIT values near zero or one. Note that this small change requires that the regulator observe PIT values, and not only the traditional exceedance indicators.

V.4: $h(P) = V(P)^4$. Raising V(P) to the fourth power places heavier weight on tail PIT values in the recent past.

V.½: $h(P) = \sqrt{V(P)}$. Relative to V.4, this transformation dampens sensitivity to tail PIT values.

Drawing guidance from simulation analyses in our companion paper, we fix k = 4 lags in the monospectral tests. In the context of daily backtesting, this corresponds to looking at dependencies over a time horizon of one trading week. To facilitate comparison to the monospectral tests, we fix $(k_1 = 4, k_2 = 0)$ for the bispectral tests. For parsimony, we consider only the narrow kernel window [0.985,0.995), and a subset of the kernel functions included in the previous section.

Missing or spurious values may be especially troublesome in a test of conditional coverage because a PIT value missing at time t introduces missing regressors at t+1,...,t+k. To avoid losing the subsequent k observations, we replace missing or spurious $P_{t-\ell}$ with an imputed value when computing the lagged vector $h_{t-1}$. (As in the tests of unconditional coverage, we do not impute missing $P_t$ to backfill the dependent variables $W_t$, but simply drop these observations.) Details of our imputation algorithm are provided in Appendix C.

Table 4 presents p-values for the tests of conditional coverage. For portfolios Pf108 and Pf110, forecast models are strongly rejected (at the 0.01% level) regardless of the choice of CVT or kernel function; for brevity we drop these portfolios from the table. For only a single portfolio (Pf109), the forecast model is never rejected. In the other seven cases, the choice of CVT and kernel function matters. We find:

• For portfolios Pf102, Pf103 and Pf105, the V.4 CVT generally leads to rejection at the 5% level, but tests using the EM CVT never reject. The V.BIN and V.½

ID   CVT    BIN     ZU      ZL_+    ZL_−    ZLL     PNS
101  EM     0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
101  V.BIN  0.1450  0.0158  0.0145  0.0209  0.0361  0.0601
101  V.4    0.0599  0.0183  0.0102  0.0305  0.0512  0.0529
101  V.½    0.4504  0.3084  0.3396  0.2721  0.3633  0.2928
102  EM     0.8987  0.9960  0.9926  0.9977  0.9838  0.9970
102  V.BIN  0.8785  0.9045  0.9726  0.7709  0.7721  0.8222
102  V.4    0.3313  0.0418  0.1087  0.0185  0.0261  0.0393
102  V.½    0.4683  0.1628  0.3167  0.0819  0.1042  0.1472
103  EM     0.7530  0.8042  0.8838  0.7445  0.7877  0.8754
103  V.BIN  0.0226  0.0124  0.0061  0.0275  0.0423  0.0149
103  V.4    0.0788  0.0256  0.0277  0.0305  0.0466  0.0157
103  V.½    0.3834  0.2512  0.3210  0.2233  0.2837  0.2326
104  EM     NA      NA      NA      NA      NA      NA
104  V.BIN  NA      NA      NA      NA      NA      NA
104  V.4    0.2889  0.1903  0.2935  0.1471  0.2005  0.1564
104  V.½    0.2889  0.1903  0.2935  0.1471  0.2005  0.1564
105  EM     0.6178  0.3689  0.4902  0.3095  0.4010  0.3265
105  V.BIN  0.4124  0.0637  0.2813  0.0079  0.0144  0.0133
105  V.4    0.2355  0.0078  0.0862  0.0006  0.0013  0.0002
105  V.½    0.3196  0.0214  0.0935  0.0049  0.0092  0.0009
106  EM     0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
106  V.BIN  0.0098  0.0001  0.0003  0.0000  0.0000  0.0000
106  V.4    0.0088  0.0019  0.0103  0.0005  0.0005  0.0002
106  V.½    0.0485  0.0418  0.1425  0.0155  0.0137  0.0073
107  EM     NA      NA      NA      NA      NA      NA
107  V.BIN  NA      NA      NA      NA      NA      NA
107  V.4    0.1850  0.1076  0.1889  0.0772  0.1090  0.0787
107  V.½    0.1851  0.1076  0.1889  0.0772  0.1090  0.0787
109  EM     0.8836  0.6208  0.9293  0.2894  0.1021  0.2545
109  V.BIN  0.4884  0.3959  0.6654  0.1910  0.0658  0.1797
109  V.4    0.8716  0.7150  0.9099  0.4371  0.1606  0.4444
109  V.½    0.3425  0.2560  0.3632  0.1638  0.0561  0.2181

Table 4: Tests of conditional coverage. We report test p-values by portfolio, conditioning variable transformation, and kernel function. The monospectral tests utilize k = 4 lags, and for the bispectral tests we set $(k_1 = 4, k_2 = 0)$. We fix a narrow kernel window of [.985,.995]. Forecast models for Pf108 and Pf110 (not tabulated) are rejected at the 0.01% level for all choices of CVT and kernel.
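The four conditioning variable transformations compared in Table 4 are simple functions of a PIT value, and a small Python sketch may make the definitions in Section 6.4 concrete. The names `v_shape`, `CVT` and `lagged_regressors` are hypothetical, and the ordering of the lag vector is an assumption made for illustration; the sketch does not handle missing or spurious PIT values.

```python
import math

def v_shape(u):
    """V-shaped transformation V(u) = |2u - 1| of a PIT value."""
    return abs(2.0 * u - 1.0)

# The four candidate conditioning variable transformations h(P).
CVT = {
    "EM":    lambda p: 1.0 if p > 0.99 else 0.0,           # 99% VaR exceedance indicator
    "V.BIN": lambda p: 1.0 if v_shape(p) > 0.98 else 0.0,  # two-tailed indicator
    "V.4":   lambda p: v_shape(p) ** 4,                    # emphasizes tail PIT values
    "V.1/2": lambda p: math.sqrt(v_shape(p)),              # dampens tail PIT values
}

def lagged_regressors(pits, t, k, h):
    """Return [h(P_{t-1}), ..., h(P_{t-k})] for observation t (requires t >= k)."""
    return [h(pits[t - lag]) for lag in range(1, k + 1)]

# Hypothetical PIT series; the last observation is regressed on four lags.
pits = [0.2, 0.995, 0.5, 0.003, 0.6]
print(lagged_regressors(pits, t=4, k=4, h=CVT["V.BIN"]))
```

Note that V.BIN fires for PIT values below 0.01 as well as above 0.99, which is why it requires PIT values rather than one-sided exceedance indicators.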

CVT are less robust in performance than V.4. This reflects the greater sensitivity of the V.4 transformation to local spikes in market volatility.

• Only in the case of Pf101 does the Engle-Manganelli CVT pick up serial dependence more effectively than the CVT based on V(P), though here too the V.BIN and V.4 CVT lead to rejection at the 5% level for uniform and linear kernel functions.

• In two cases (Pf104, Pf107), the test statistic is undefined for the EM CVT and its two-tailed counterpart (V.BIN). As there were no observed violations in either tail ($P_t < .01$ or $P_t > .99$), in both cases the matrix $\hat{H}$ of (21) is singular, so $\hat{\Sigma}_Y$ in the test statistic cannot be inverted. This demonstrates a practical limitation of a binary CVT, as short samples may often contain no tail values.

• Despite the adoption of a narrow kernel window in these tests, the spectral backtests often give improvements in power over the traditional binomial score test. In particular, for portfolios Pf102, Pf103 and Pf105, p-values for tests using the continuous kernel functions are often much lower than p-values for the corresponding tests using the BIN kernel.

7 Conclusion

The class of spectral backtests embeds many of the most widely used tests of unconditional coverage and tests of conditional coverage, including the binomial likelihood ratio test of Kupiec (1995), the interval likelihood ratio test of Berkowitz (2001), and the dynamic quantile test of Engle and Manganelli (2004). As we demonstrate with many examples, viewing these tests in terms of the associated kernel functions facilitates the construction of new tests. From the perspective of the practice of risk management, making explicit the choice of kernel function may help to discipline the backtesting process because the kernel function directly expresses the user’s priorities for model performance.

Our results illustrate the value to regulators of access to bank-reported PIT-values.
Until recently, regulators effectively observed only a sequence of VaR exceedance event indicators at a single level α, and therefore backtests were designed to take such data as input. In some jurisdictions, including the United States, PIT-values have been collected for some time. Besides opening the possibility of forming spectral test statistics, we have demonstrated that lagged PIT-values are especially effective as conditioning variables in regression-based tests of conditional coverage.

There is a growing literature on multivariate or multi-desk backtesting including Wied et al. (2016) and Berkowitz et al. (2011) (see §4.4 and the CavMult test in Table 7, specifically). The new standard for capital requirements for market risk (Basel Committee on Bank Supervision, 2016) calls for backtesting at individual desk level

and typical investment banks may have in excess of 50 desks. The spectral and bispectral tests that we propose in this paper admit multi-desk generalizations that allow the simultaneous evaluation of backtest results across multiple desks. We leave this as a topic for future work.

A Proofs

A.1 Proofs of Propositions 3.1 and 3.2

The logic of these two proofs is identical and we give the proof of Proposition 3.2 only.

$$
\begin{aligned}
W_{t,1} W_{t,2}
&= \left( \int_{\alpha_1}^{\alpha_2} g_1(u)\, 1_{\{P_t > u\}}\, du \right) \left( \int_{\alpha_1}^{\alpha_2} g_2(v)\, 1_{\{P_t > v\}}\, dv \right) \\
&= \int_{u=\alpha_1}^{\alpha_2} \int_{v=\alpha_1}^{\alpha_2} g_1(u)\, g_2(v)\, 1_{\{P_t > u\}}\, 1_{\{P_t > v\}}\, dv\, du \\
&= \int_{u=\alpha_1}^{\alpha_2} \int_{v=\alpha_1}^{\alpha_2} g_1(u)\, g_2(v)\, 1_{\{P_t > \max\{u,v\}\}}\, dv\, du \\
&= \int_{u=\alpha_1}^{\alpha_2} \int_{v=\alpha_1}^{u} g_1(u)\, g_2(v)\, 1_{\{P_t > u\}}\, dv\, du + \int_{u=\alpha_1}^{\alpha_2} \int_{v=u}^{\alpha_2} g_1(u)\, g_2(v)\, 1_{\{P_t > v\}}\, dv\, du \\
&= \int_{u=\alpha_1}^{\alpha_2} g_1(u) \left( \int_{v=\alpha_1}^{u} g_2(v)\, dv \right) 1_{\{P_t > u\}}\, du + \int_{v=\alpha_1}^{\alpha_2} g_2(v) \left( \int_{u=\alpha_1}^{v} g_1(u)\, du \right) 1_{\{P_t > v\}}\, dv \\
&= \int_{u=\alpha_1}^{\alpha_2} g_1(u)\, G_2(u)\, 1_{\{P_t > u\}}\, du + \int_{v=\alpha_1}^{\alpha_2} g_2(v)\, G_1(v)\, 1_{\{P_t > v\}}\, dv
\end{aligned}
$$

Note that $g^*(u) = g_1(u) G_2(u) + g_2(u) G_1(u)$ clearly satisfies Assumption 1. If $g_1$ and $g_2$ are normalized kernel densities on $[\alpha_1, \alpha_2]$ then it follows that

$$
\int_{\alpha_1}^{\alpha_2} g^*(u)\, du = \Big[ G_1(u)\, G_2(u) \Big]_{\alpha_1}^{\alpha_2} = 1.
$$

A.2 Proof of Theorem 3.3

The likelihood $L_P(\theta \mid P^*)$ takes the form

$$
L_P(\theta \mid P^*) = \prod_{t:\, P_t^* = \alpha_1} F_P(\alpha_1 \mid \theta) \prod_{t:\, \alpha_1 < P_t^* < \alpha_2} f_P(P_t^* \mid \theta) \prod_{t:\, P_t^* = \alpha_2} \bar{F}_P(\alpha_2 \mid \theta) \tag{A.1}
$$

where $\bar{F}(u)$ denotes the tail probability $1 - F(u)$. Since T is strictly increasing and continuous on $[\alpha_1, \alpha_2]$, the distribution $F_W(w \mid \theta)$ implied by $F_P(p \mid \theta)$ satisfies

$$
\begin{aligned}
P(W = T(\alpha_1) \mid \theta) &= F_P(\alpha_1 \mid \theta), \\
f_W(w \mid \theta) &= \frac{f_P(T^{-1}(w) \mid \theta)}{T'(T^{-1}(w))}, \quad w \in (T(\alpha_1), T(\alpha_2)), \\
P(W = T(\alpha_2) \mid \theta) &= \bar{F}_P(\alpha_2 \mid \theta).
\end{aligned}
$$

It follows that the likelihood $L_W(\theta \mid W)$ is given by

$$
\begin{aligned}
L_W(\theta \mid W)
&= \prod_{t:\, W_t = T(\alpha_1)} F_P(\alpha_1 \mid \theta) \prod_{t:\, T(\alpha_1) < W_t < T(\alpha_2)} \frac{f_P(T^{-1}(W_t) \mid \theta)}{T'(T^{-1}(W_t))} \prod_{t:\, W_t = T(\alpha_2)} \bar{F}_P(\alpha_2 \mid \theta) \\
&= \frac{L_P(\theta \mid P^*)}{\prod_{t:\, \alpha_1 < P_t^* < \alpha_2} T'(P_t^*)}.
\end{aligned}
$$

It is clear that the same value $\hat{\theta}$ must maximize both these likelihoods and that the likelihood ratio statistics must satisfy

$$
LR_{W,n} = \frac{L_W(\theta_0 \mid W)}{L_W(\hat{\theta} \mid W)} = \frac{L_P(\theta_0 \mid P^*)}{L_P(\hat{\theta} \mid P^*)} = LR_{P,n}.
$$

A.3 Sketch of proof of Theorem 4.1

The Pearson test is one of the best known tests in statistics. The result can be proved by adapting an approach that is used to derive the asymptotic distribution of the Pearson test statistic.

Let $X_t = (X_{t,0}, \ldots, X_{t,m})'$ be the (m+1)-dimensional random vector with $X_{t,i} = 1_{\{\mathbf{1}' W_t = i\}}$ for $i = 0, \ldots, m$. Under (9) $X_t$ has a multinomial distribution satisfying $E(X_{t,i}) = \theta_i$, $\mathrm{var}(X_{t,i}) = \theta_i(1 - \theta_i)$ and $\mathrm{cov}(X_{t,i}, X_{t,j}) = -\theta_i \theta_j$ for $i \neq j$. Suppose we define $Y_t$ to be the m-dimensional random vector obtained from $X_t$ by omitting the first component. Then $E(Y_t) = \theta = (\theta_1, \ldots, \theta_m)'$ and $\Sigma_Y$ is the m×m submatrix of $\mathrm{cov}(X_t)$ resulting from deletion of the first row and column. A standard approach to the asymptotics of the Pearson test is to show that

$$
S_m = \sum_{i=0}^{m} \frac{(O_i - n\theta_i)^2}{n\theta_i} = \sum_{i=0}^{m} \frac{\left( \sum_{t=1}^{n} X_{t,i} - n\theta_i \right)^2}{n\theta_i} = n (\bar{Y} - \theta)' \Sigma_Y^{-1} (\bar{Y} - \theta),
$$

where $\bar{Y} = n^{-1} \sum_{t=1}^{n} Y_t$. The central limit theorem is then applied to $\bar{Y}$ to argue that $S_m \sim \chi^2_m$ in the limit.

Let A be the m×m matrix with rows given by $(e_1 - e_2, e_2 - e_3, \ldots, e_m)$ where $e_i$ denotes the ith unit vector. The inverse of this matrix is the upper triangular matrix of ones. It may be verified that $Y_t = A W_t$, $\theta = A \mu_W$ and $\Sigma_W = A^{-1} \Sigma_Y (A')^{-1}$. We note that $\mu_W = (1 - \alpha_1, \ldots, 1 - \alpha_m)'$ and that $\Sigma_W$ is a matrix with diagonal entries $\mathrm{var}(W_{t,i}) = \alpha_i (1 - \alpha_i)$ and off-diagonal entries $\mathrm{cov}(W_{t,i}, W_{t,j}) = \min(\alpha_i, \alpha_j)(1 - \max(\alpha_i, \alpha_j))$ for $i, j \in \{1, \ldots, m\}$. It follows that

$$
S_m = n (\bar{Y} - \theta)' \Sigma_Y^{-1} (\bar{Y} - \theta) = n (\bar{W} - \mu_W)' A' \Sigma_Y^{-1} A (\bar{W} - \mu_W) = n (\bar{W} - \mu_W)' \Sigma_W^{-1} (\bar{W} - \mu_W).
$$
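The identity between the Pearson statistic and the quadratic form in $\bar{W}$ can be checked numerically. The Python sketch below is an illustration under stated assumptions rather than the authors' implementation: the helper names, the two levels and the ten PIT values are hypothetical, and $W_{t,i}$ is taken to be the exceedance indicator $1_{\{P_t > \alpha_i\}}$, which is consistent with the stated $\mu_W$ and $\Sigma_W$. It computes $S_m$ once from cell counts and once as $n(\bar{W} - \mu_W)' \Sigma_W^{-1} (\bar{W} - \mu_W)$.

```python
def solve(mat, rhs):
    """Solve mat @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(mat, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def pearson_statistic(pits, alphas):
    """Pearson statistic over the m+1 cells defined by the ordered levels."""
    n, m = len(pits), len(alphas)
    edges = [0.0] + list(alphas) + [1.0]
    theta = [edges[i + 1] - edges[i] for i in range(m + 1)]
    counts = [0] * (m + 1)
    for p in pits:
        counts[sum(p > a for a in alphas)] += 1  # cell index is 1'W_t
    return sum((o - n * th) ** 2 / (n * th) for o, th in zip(counts, theta))

def quadratic_form_statistic(pits, alphas):
    """n (Wbar - mu_W)' Sigma_W^{-1} (Wbar - mu_W) with W_{t,i} = 1{P_t > alpha_i}."""
    n = len(pits)
    wbar = [sum(p > a for p in pits) / n for a in alphas]
    dev = [wb - (1.0 - a) for wb, a in zip(wbar, alphas)]
    sigma = [[min(ai, aj) * (1.0 - max(ai, aj)) for aj in alphas] for ai in alphas]
    return n * sum(d * x for d, x in zip(dev, solve(sigma, dev)))

# Hypothetical PIT values and levels, chosen only to exercise all three cells.
pits = [0.10, 0.50, 0.96, 0.97, 0.992, 0.30, 0.80, 0.999, 0.60, 0.20]
alphas = [0.95, 0.99]
print(pearson_statistic(pits, alphas), quadratic_form_statistic(pits, alphas))
```

The two printed values should agree up to floating-point error, whatever the sample, because the identity is algebraic rather than asymptotic.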

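A.4 Proof of Theorem 4.2

Computing the score statistic and evaluating it at $\theta_0 = (0,1)'$ yields

$$
S_t(\theta_0) =
\begin{cases}
\psi_1(\alpha_1) & P_t^* = \alpha_1, \\
\psi_*(P_t^*) & \alpha_1 < P_t^* < \alpha_2, \\
\psi_2(\alpha_2) & P_t^* = \alpha_2,
\end{cases} \tag{A.2}
$$

where

$$
\psi_1(u) = \begin{pmatrix} -\phi(\Phi^{-1}(u))/u \\ -\phi(\Phi^{-1}(u))\,\Phi^{-1}(u)/u \end{pmatrix}, \quad
\psi_*(u) = \begin{pmatrix} \Phi^{-1}(u) \\ \Phi^{-1}(u)^2 - 1 \end{pmatrix}, \quad
\psi_2(u) = \begin{pmatrix} \phi(\Phi^{-1}(u))/(1-u) \\ \phi(\Phi^{-1}(u))\,\Phi^{-1}(u)/(1-u) \end{pmatrix}.
$$

The jumps at $\alpha_1$ and $\alpha_2$ are given by

$$
(\gamma_{1,1}, \gamma_{2,1})' = \psi_*(\alpha_1) - \psi_1(\alpha_1), \qquad (\gamma_{1,2}, \gamma_{2,2})' = \psi_2(\alpha_2) - \psi_*(\alpha_2).
$$

The weighting functions can be obtained by differentiating $\psi_*(u)$ with respect to u on $(\alpha_1, \alpha_2)$ and are thus

$$
g_1(u) = \frac{1}{\phi(\Phi^{-1}(u))}, \qquad g_2(u) = \frac{2\,\Phi^{-1}(u)}{\phi(\Phi^{-1}(u))}.
$$

Finally, since $\mu_W = W_t - S_t(\theta_0)$, we must have that $\mu_W = -\psi_1(\alpha_1)$.

A.5 Sketch of proof of Proposition 5.1

It may be verified that the partial derivatives $\frac{\partial}{\partial\mu} \ln L(\theta \mid P_t^*, P_{t-1}, \ldots, P_{t-k})$ and $\frac{\partial}{\partial\sigma} \ln L(\theta \mid P_t^*, P_{t-1}, \ldots, P_{t-k})$ take the same essential form as the partial derivatives of (16), from which it follows that $\tilde{S}_{t,1}(\theta_0)$ and $\tilde{S}_{t,2+k}(\theta_0)$ coincide with $S_{t,1}(\theta_0)$ and $S_{t,2}(\theta_0)$ respectively. Moreover,

$$
\frac{\partial}{\partial\beta_i} \ln L(\theta \mid P_t^*, P_{t-1}, \ldots, P_{t-k}) = h(P_{t-i})\, \frac{\partial}{\partial\mu} \ln L(\theta \mid P_t^*, P_{t-1}, \ldots, P_{t-k}),
$$

hence $\tilde{S}_{t,1+i}(\theta_0) = h(P_{t-i})\, S_{t,1}(\theta_0)$ for $i = 1, \ldots, k$.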
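The weighting functions in the proof of Theorem 4.2 are the derivatives of the two components of $\psi_*$, which can be confirmed with a quick numerical check. The Python sketch below uses only the standard library; the function names and the step size of the central difference are choices made here for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def psi_star(u):
    """Interior score psi_*(u) = (z, z^2 - 1) with z = Phi^{-1}(u)."""
    z = STD_NORMAL.inv_cdf(u)
    return (z, z * z - 1.0)

def g1(u):
    """Weighting function 1 / phi(Phi^{-1}(u))."""
    return 1.0 / STD_NORMAL.pdf(STD_NORMAL.inv_cdf(u))

def g2(u):
    """Weighting function 2 Phi^{-1}(u) / phi(Phi^{-1}(u))."""
    z = STD_NORMAL.inv_cdf(u)
    return 2.0 * z / STD_NORMAL.pdf(z)

def numerical_derivative(u, h=1e-6):
    """Central-difference derivative of both components of psi_* at u."""
    lo, hi = psi_star(u - h), psi_star(u + h)
    return tuple((b - a) / (2.0 * h) for a, b in zip(lo, hi))

u = 0.99
print(numerical_derivative(u), (g1(u), g2(u)))
```

Both kernels grow rapidly as u approaches one, which is the sense in which the associated score test concentrates weight in the far tail of the window.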

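$$
-\frac{\partial^2}{\partial\mu\,\partial\sigma} \ln L(\theta \mid P_t^*) =
\begin{cases}
\dfrac{\phi(\xi(\alpha_1 \mid \theta)) \left( \phi(\xi(\alpha_1 \mid \theta))\,\xi(\alpha_1 \mid \theta) - \Phi(\xi(\alpha_1 \mid \theta)) + \xi(\alpha_1 \mid \theta)^2\, \Phi(\xi(\alpha_1 \mid \theta)) \right)}{\sigma^2\, \Phi(\xi(\alpha_1 \mid \theta))^2} & P_t^* = \alpha_1, \\[2ex]
\dfrac{2\,\xi(P_t^* \mid \theta)}{\sigma^2} & \alpha_1 < P_t^* < \alpha_2, \\[2ex]
\dfrac{\phi(\xi(\alpha_2 \mid \theta)) \left( \phi(\xi(\alpha_2 \mid \theta))\,\xi(\alpha_2 \mid \theta) + \bar{\Phi}(\xi(\alpha_2 \mid \theta)) - \xi(\alpha_2 \mid \theta)^2\, \bar{\Phi}(\xi(\alpha_2 \mid \theta)) \right)}{\sigma^2\, \bar{\Phi}(\xi(\alpha_2 \mid \theta))^2} & P_t^* = \alpha_2,
\end{cases} \tag{B.7}
$$

where $\bar{\Phi} = 1 - \Phi$. By taking expectations using (B.1) and (B.2) and evaluating at $\theta_0 = (0,1)'$ we obtain the elements of $I(\theta_0)$:

$$
\begin{aligned}
I(\theta_0)_{1,1} ={}& \phi(\Phi^{-1}(\alpha_1))^2/\alpha_1 + \phi(\Phi^{-1}(\alpha_2))^2/(1-\alpha_2) \\
& + \phi(\Phi^{-1}(\alpha_1))\,\Phi^{-1}(\alpha_1) - \phi(\Phi^{-1}(\alpha_2))\,\Phi^{-1}(\alpha_2) + (\alpha_2 - \alpha_1),
\end{aligned} \tag{B.8}
$$

$$
\begin{aligned}
I(\theta_0)_{2,2} ={}& \phi(\Phi^{-1}(\alpha_1))^2\,\Phi^{-1}(\alpha_1)^2/\alpha_1 + \phi(\Phi^{-1}(\alpha_1))\,\Phi^{-1}(\alpha_1)^3 \\
& + \phi(\Phi^{-1}(\alpha_1))\,\Phi^{-1}(\alpha_1) + \phi(\Phi^{-1}(\alpha_2))^2\,\Phi^{-1}(\alpha_2)^2/(1-\alpha_2) \\
& - \phi(\Phi^{-1}(\alpha_2))\,\Phi^{-1}(\alpha_2)^3 - \phi(\Phi^{-1}(\alpha_2))\,\Phi^{-1}(\alpha_2) + 2(\alpha_2 - \alpha_1),
\end{aligned} \tag{B.9}
$$

$$
\begin{aligned}
I(\theta_0)_{1,2} ={}& \phi(\Phi^{-1}(\alpha_1))^2\,\Phi^{-1}(\alpha_1)/\alpha_1 + \phi(\Phi^{-1}(\alpha_1)) \left( 1 + \Phi^{-1}(\alpha_1)^2 \right) \\
& + \phi(\Phi^{-1}(\alpha_2))^2\,\Phi^{-1}(\alpha_2)/(1-\alpha_2) - \phi(\Phi^{-1}(\alpha_2)) \left( 1 + \Phi^{-1}(\alpha_2)^2 \right).
\end{aligned} \tag{B.10}
$$

C Identification of spurious PIT values

Consider a stylized Gaussian model in which loss is given by

$$
L_t = \sigma_{t-1} Z_t \tag{C.1}
$$

where $(Z_t)$ is an iid sequence of standard normal random variables and volatility $\sigma_{t-1}$ is $\mathcal{F}_{t-1}$-measurable. Time variation in $\sigma_t$ may arise from stochastic volatility or from changes over time in portfolio composition. Suppose that the risk-manager knows the true underlying distribution and the volatility. The risk-manager’s ideal value-at-risk forecast at α = 0.99 is then $\widehat{\mathrm{VaR}}_t = \Phi^{-1}(0.99)\,\sigma_{t-1}$ where Φ is the standard normal cdf. We do not observe $\sigma_{t-1}$, but from observing $L_t$ and $\widehat{\mathrm{VaR}}_t$, we can back out the realized value of $Z_t$ as

$$
Z_t = \Phi^{-1}(0.99) \times L_t / \widehat{\mathrm{VaR}}_t. \tag{C.2}
$$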

Furthermore, the PIT values can be expressed as

$$
P_t = \hat{F}_{t-1}(L_t) = \Phi(L_t/\sigma_{t-1}) = \Phi(Z_t). \tag{C.3}
$$

In general, we would not expect the $Z_t$ to be Gaussian, so (C.3) will not hold. However, so long as $(Z_t)$ is iid, there will still be a monotonic relationship between $Z_t$ (as defined by (C.2)) and $P_t$. We find that the predicted relationship holds qualitatively for all bank-reported portfolios, but with more noise in some portfolios than in others. This suggests that we can use violations of monotonicity to identify spurious PIT values, but the threshold for identification must vary across portfolios.

Let $H(z; \theta_i) : \mathbb{R} \to [0,1]$ be a family of fitting functions with parameter $\theta_i$ for portfolio i, and replace (C.3) by

$$
P_{i,t} = H(Z_{i,t}; \theta_i) + \epsilon_{i,t} \tag{C.4}
$$

where the $\epsilon_{i,t}$ are white-noise residuals. Since the H function should be increasing, it is convenient to take H to be a cdf, even though it does not have a statistical interpretation in our context. For convenience, we take H to be the normal cdf with unrestricted $(\mu_i, \sigma_i)$ as $\theta_i$.

For each portfolio i, we proceed as follows:

1. Fit $\theta_i$ by nonlinear least squares, and construct residuals $\epsilon_{it} = P_{it} - H(Z_{it}; \hat{\theta}_i)$.

2. The $(\epsilon_{it})$ are bounded in the open interval (−1,1), because $H(Z_{it})$ does not produce boundary values. We model $\epsilon_{it}$ as drawn from a rescaled beta distribution on (−1,1) with parameters $(a = \tau_i/2,\ b = \tau_i/2)$. This distribution has mean zero and variance $1/(\tau_i + 1)$, so we simply fit $\tau_i$ to the variance of the regression residuals.

3. Let $B(\epsilon; \hat{\tau}_i)$ be the fitted beta distribution. We flag an observation $P_{it}$ as spurious whenever $B(\epsilon_{it}; \hat{\tau}_i) < q/2$ or $B(\epsilon_{it}; \hat{\tau}_i) > 1 - q/2$, where q is a tolerance parameter.

4. We reestimate $\tau_i$ as in step 2 on a sample that excludes the spurious observations. Repeat step 3 with the updated $\hat{\tau}_i$. An observation is flagged as spurious if it is rejected in either round of estimation.
In our baseline procedure, we set the tolerance parameter to $q = 10^{-5}$, which is intended to flag only the most egregious inconsistencies between $P_{it}$ and the pair $(L_{it}, \widehat{\mathrm{VaR}}_{it})$. A typical case involves a PIT value very close to zero or one associated with a modest P&L such that $|L_{it}| < \widehat{\mathrm{VaR}}_{it}$. Setting q = 0 is equivalent to shutting down the identification of spurious values.

The procedure yields imputed PIT values as $\hat{P}_{it} = H(Z_{it}; \hat{\theta}_i)$. As noted in Section 6.4, we use the imputed values to fill in for spurious values in forming regressors in the tests of conditional coverage.
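Two ingredients of the procedure are simple enough to sketch directly: backing out $Z_t$ via (C.2) and the moment fit of $\tau_i$ in step 2. The Python fragment below is a hedged illustration, not the authors' algorithm. The function names are hypothetical, the use of the population variance is an assumption, and the example residual uses the stylized Gaussian case H = Φ in place of a fitted $H(\cdot; \hat{\theta}_i)$. The nonlinear least squares fit and the beta tail probabilities of step 3 are not covered.

```python
from statistics import NormalDist, pvariance

STD_NORMAL = NormalDist()

def implied_z(loss, var_forecast, level=0.99):
    """Equation (C.2): back out Z_t from the loss L_t and the VaR forecast."""
    return STD_NORMAL.inv_cdf(level) * loss / var_forecast

def fit_tau(residuals):
    """Step 2: choose tau so that 1 / (tau + 1) equals the residual variance.

    Assumption: the population variance is used; the text does not say
    whether a degrees-of-freedom correction is applied.
    """
    return 1.0 / pvariance(residuals) - 1.0

# Hypothetical observation: a reported PIT near one paired with a loss of
# only half the VaR forecast, the typical inconsistency described above.
z = implied_z(loss=0.5, var_forecast=1.0)
residual = 0.9999 - STD_NORMAL.cdf(z)  # uses H = Phi, i.e. theta = (0, 1)
print(f"Z = {z:.3f}, residual = {residual:.3f}")
```

A large residual of this kind would then be compared against the fitted beta distribution in step 3 to decide whether the observation is flagged.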

D Moments for the beta kernel

We provide a general solution to the moments and cross-moments of the transformed PIT values when the kernel densities take the form

$$
g(u) = \frac{(u - \alpha_1)^{a-1} (\alpha_2 - u)^{b-1}}{(\alpha_2 - \alpha_1)^{a+b-1} B(a,b)}
$$

for parameters $(a > 0, b > 0)$ and $\alpha_1 \leq u \leq \alpha_2$. The normalization guarantees that $G(\alpha_2) = 1$, and helps align the solution with standard beta distribution functions provided by statistical packages. In R notation, the kernel function is simply

$$
G(u) = \texttt{pbeta}\!\left( \frac{\max\{\alpha_1, \min\{u, \alpha_2\}\} - \alpha_1}{\alpha_2 - \alpha_1},\ a,\ b \right).
$$

Solving for moments and cross-moments of kernels $(g_1(P), g_2(P))$ for uniform P involves the following integral:

$$
\begin{aligned}
M(a_1, b_1, a_2, b_2) &= \int_{\alpha_1}^{\alpha_2} (1-u)\, g_1(u)\, G_2(u)\, du \\
&= \frac{B(a_1 + a_2,\ 1 + b_1)}{a_2\, B(a_1, b_1)\, B(a_2, b_2)}\ {}_3F_2(a_2,\ a_1 + a_2,\ 1 - b_2;\ 1 + a_2,\ 1 + a_1 + a_2 + b_1;\ 1) \\
&= \frac{B(a_1 + a_2,\ 1 + b_1 + b_2)}{a_2\, B(a_1, b_1)\, B(a_2, b_2)}\ {}_3F_2(1,\ a_1 + a_2,\ a_2 + b_2;\ 1 + a_2,\ 1 + a_1 + a_2 + b_1 + b_2;\ 1)
\end{aligned} \tag{D.1}
$$

where ${}_3F_2(c_1, c_2, c_3; d_1, d_2; 1)$ denotes a hypergeometric function of order (3,2) and argument unity. The final line follows from the Thomae transformation T7 in Milgram (2010, Appendix A). Due to the normalization of the kernels, M does not depend on the choice of kernel window.

When its parameters are all positive, as in the final expression for M, computing ${}_3F_2(c_1, c_2, c_3; d_1, d_2; 1)$ is straightforward via the standard hypergeometric series expansion. In practice, we are most often interested in integer-valued cases for which M has a simple closed-form solution.

For given kernel window and PIT value, let $W_{a,b}$ be the transformed PIT value under a beta kernel with parameters (a,b). A recurrence rule for the incomplete beta function (Abramowitz and Stegun, 1965, eq.
6.6.7) leads to a linear relationship among “neighboring” transformations:

$$
(a+b)\, W_{a,b} = a\, W_{a+1,b} + b\, W_{a,b+1} \tag{D.2}
$$

An immediate implication is that the uniform, linear increasing and linear decreasing transformations (parameter sets (1,1), (2,1) and (1,2), respectively) are linearly dependent. Any pair of these kernels would yield an equivalent bispectral test, and a trispectral test using all three kernels would be undefined due to a singular covariance matrix $\Sigma_W$. By iterating the recurrence relationship, we can derive linear relationships among sets of kernels with integer-valued parameter differences $a_i - a_\ell$ and $b_i - b_\ell$, which would lead to redundancies among the corresponding j-spectral tests.

References

Abramowitz, M., and I. A. Stegun, eds., 1965, Handbook of Mathematical Functions (Dover Publications, New York).
Acerbi, C., and B. Szekely, 2014, Back-testing expected shortfall, Risk 1–6.
Amisano, G., and R. Giacomini, 2007, Comparing density forecasts via weighted likelihood ratio tests, Journal of Business & Economic Statistics 25, 177–190.
Barone-Adesi, G., F. Bourgoin, and K. Giannopoulos, 1998, Don’t look back, Risk 11, 100–103.
Basel Committee on Bank Supervision, 2013, Fundamental review of the trading book: A revised market risk framework, Publication No. 265, Bank for International Settlements.
Basel Committee on Bank Supervision, 2016, Minimum capital requirements for market risk, Publication No. 352, Bank for International Settlements.
Berkowitz, J., 2001, Testing the accuracy of density forecasts, applications to risk management, Journal of Business & Economic Statistics 19, 465–474.
Berkowitz, J., P. Christoffersen, and D. Pelletier, 2011, Evaluating value-at-risk models with desk-level data, Management Science 57, 2213–2227.
Berkowitz, J., and J. O’Brien, 2002, How accurate are Value-at-Risk models at commercial banks?, The Journal of Finance 57, 1093–1112.
Billingsley, P., 1961, The Lindeberg–Lévy theorem for martingales, Proceedings of the American Mathematical Society 12, 788–792.
Board of Governors of the Federal Reserve System, 2011, Supervisory guidance on model risk management, SR Letter 11-7.
Cai, Y., and K. Krishnamoorthy, 2006, Exact size and power properties of five tests for multinomial proportions, Communications in Statistics - Simulation and Computation 35, 149–160.
Campbell, S.D., 2006, A review of backtesting and backtesting procedures, Journal of Risk 9, 1–17.
Christoffersen, P., 1998, Evaluating interval forecasts, International Economic Review 39.

Christoffersen, P. F., and D. Pelletier, 2004, Backtesting Value-at-Risk: A duration-based approach, Journal of Econometrics 2, 84–108.
Colletaz, Gilbert, Christophe Hurlin, and Christophe Pérignon, 2013, The risk map: A new tool for validating risk models, Journal of Banking and Finance 37, 3843–3854.
Costanzino, N., and M. Curran, 2015, Backtesting general spectral risk measures with application to expected shortfall, The Journal of Risk Model Validation 9, 21–31.
Crnkovic, C., and J. Drachman, 1996, Quality control, Risk 9, 139–143.
Diebold, F.X., T.A. Gunther, and A.S. Tay, 1998, Evaluating density forecasts with applications to financial risk management, International Economic Review 39, 863–883.
Diebold, F.X., and R.S. Mariano, 1995, Comparing predictive accuracy, Journal of Business & Economic Statistics 13, 253–265.
Du, Z., and J.C. Escanciano, 2017, Backtesting expected shortfall: accounting for tail risk, Management Science 63, 940–958.
Dumitrescu, E., C. Hurlin, and V. Pham, 2012, Backtesting Value-at-Risk: From dynamic quantile to dynamic binary tests, Finance 33, 79–112.
Durlauf, S., 1991, Spectral based testing of the martingale hypothesis, Journal of Econometrics 50, 355–376.
Engle, R.F., and S. Manganelli, 2004, CAViaR: conditional autoregressive value at risk by regression quantiles, Journal of Business & Economic Statistics 22, 367–381.
Federal Register, 2012, Risk-based capital guidelines: Market risk.
Fissler, T., and J. Ziegel, 2015, Higher order elicitability and Osband’s principle, Working paper.
Fissler, T., J.F. Ziegel, and T. Gneiting, 2016, Expected shortfall is jointly elicitable with value-at-risk: implications for backtesting, Risk 58–61.
Giacomini, R., and H. White, 2006, Tests of conditional predictive ability, Econometrica 74, 1545–1578.
Gneiting, T., 2011, Making and evaluating point forecasts, Journal of the American Statistical Association 106, 746–762.
Gneiting, T., F. Balabdaoui, and A.E. Raftery, 2007, Probabilistic forecasts, calibration and sharpness, Journal of the Royal Statistical Society, Series B 69, 243–268.
Gneiting, T., and R. Ranjan, 2011, Comparing density forecasts using threshold- and quantile-weighted scoring rules, Journal of Business & Economic Statistics 29, 411–422.
Hull, J. C., and A. White, 1998, Incorporating volatility updating into the historical simulation method for Value-at-Risk, Journal of Risk 1, 5–19.

Hurlin, C., and S. Topkavi, 2007, Backtesting value-at-risk accuracy: a simple new test, Journal of Risk 9, 19–37.
Kauppi, H., and P. Saikkonen, 2008, Predicting U.S. recessions with dynamic binary response models, The Review of Economics and Statistics 90, 777–791.
Kerkhof, J., and B. Melenberg, 2004, Backtesting for risk-based regulatory capital, Journal of Banking and Finance 28, 1845–1865.
Kratz, M., Y.H. Lok, and A.J. McNeil, 2016, Multinomial VaR backtests: A simple implicit approach to backtesting expected shortfall, to appear in the Journal of Banking and Finance.
Kupiec, P. H., 1995, Techniques for verifying the accuracy of risk measurement models, Journal of Derivatives 3, 73–84.
Leccadito, Arturo, Simona Boffelli, and Giovanni Urga, 2014, Evaluating the accuracy of Value-at-Risk forecasts: New multilevel tests, International Journal of Forecasting 30, 206–216.
McNeil, A. J., R. Frey, and P. Embrechts, 2015, Quantitative Risk Management: Concepts, Techniques and Tools, second edition (Princeton University Press, Princeton).
Milgram, Michael S., 2010, On hypergeometric 3F2(1) - a review, Working Paper 1011.4546, arXiv.
Nass, C.A.G., 1959, A χ2-test for small expectations in contingency tables, with special reference to accidents and absenteeism, Biometrika 46, 365–385.
O’Brien, J., and P.J. Szerszen, 2017, An evaluation of bank measures for market risk before, during and after the financial crisis, Journal of Banking and Finance 80, 215–234.
Pérignon, C., Z.Y. Deng, and Z.J. Wang, 2008, Diversification and Value-at-Risk, Journal of Banking and Finance 32, 783–794.
Pérignon, C., and D. R. Smith, 2010, The level and quality of Value-at-Risk disclosure by commercial banks, Journal of Banking and Finance 34, 362–377.
Pérignon, C., and D.R. Smith, 2008, A new approach to comparing VaR estimation methods, Journal of Derivatives 16, 54–66.
Rosenblatt, M., 1952, Remarks on a multivariate transformation, Annals of Mathematical Statistics 23, 470–472.
Wied, D., G.N.F. Weiß, and D. Ziggel, 2016, Evaluating Value-at-Risk forecasts: a new set of multivariate backtests, Journal of Banking and Finance 72, 121–132.
Ziggel, D., T. Berens, G.N.F. Weiss, and D. Wied, 2014, A new set of improved Value-at-Risk backtests, Journal of Banking and Finance 48, 29–41.
Zumbach, G., 2006, Backtesting risk methodologies from one day to one year, Journal of Risk 9, 55–91.

Cite this document

APA

Michael B. Gordy and Alexander J. McNeil (2018). Spectral backtests of forecast distributions with application to risk management (FEDS 2018-021). Board of Governors of the Federal Reserve System, Finance and Economics Discussion Series. https://whenthefedspeaks.com/doc/feds_2018-021

BibTeX

@techreport{wtfs_feds_2018_021,
  author = {Michael B. Gordy and Alexander J. McNeil},
  title = {Spectral backtests of forecast distributions with application to risk management},
  type = {Finance and Economics Discussion Series},
  number = {2018-021},
  institution = {Board of Governors of the Federal Reserve System},
  year = {2018},
  url = {https://whenthefedspeaks.com/doc/feds_2018-021},
  abstract = {We study a class of backtests for forecast distributions in which the test statistic is a spectral transformation that weights exceedance events by a function of the modeled probability level. The choice of the kernel function makes explicit the user's priorities for model performance. The class of spectral backtests includes tests of unconditional coverage and tests of conditional coverage. We show how the class embeds a wide variety of backtests in the existing literature, and propose novel variants as well. In an empirical application, we backtest forecast distributions for the overnight P&L of ten bank trading portfolios. For some portfolios, test results depend materially on the choice of kernel.},
}