Spectral backtests unbounded and folded
Abstract
In the spectral backtesting framework of Gordy and McNeil (2020) a probability measure on the unit interval is used to weight the quantiles of greatest interest in the validation of forecast models using probability-integral transform (PIT) data. We extend this framework to allow general Lebesgue-Stieltjes kernel measures with unbounded distribution functions, which brings powerful new tests based on truncated location-scale families into the spectral class. Moreover, by considering uniform distribution preserving transformations of PIT values the test framework is generalized to allow tests that are focused on both tails of the forecast distribution.
Finance and Economics Discussion Series Federal Reserve Board, Washington, D.C. ISSN 1936-2854 (Print) ISSN 2767-3898 (Online) Spectral backtests unbounded and folded Michael B. Gordy and Alexander J. McNeil 2024-060 Please cite this paper as: Gordy, Michael B., and Alexander J. McNeil (2024). “Spectral backtests unbounded and folded,” Finance and Economics Discussion Series 2024-060. Washington: Board of Governors of the Federal Reserve System, https://doi.org/10.17016/FEDS.2024.060. NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. References in publications to the Finance and Economics Discussion Series (other than acknowledgement) should be cleared with the author(s) to protect the tentative character of these papers.
Michael B. Gordy, Federal Reserve Board, Washington DC
Alexander J. McNeil, School for Business and Society, University of York
15 July 2024
JEL Codes: C52; G21; G32
Keywords: Backtesting; Volatility; Risk management
The opinions expressed here are our own, and do not reflect the views of the Board of Governors or its staff. Address correspondence to Alexander J. McNeil, University of York, alexander.mcneil@york.ac.uk.
1 Introduction
Gordy and McNeil (2020) study a class of backtests for forecast distributions in which the test statistic depends on a spectral transformation of a quantile exceedance indicator function. The spectral transformation weights quantile exceedance events using a kernel measure which is chosen by the validator to reflect the validator’s priorities for model performance. The present paper extends the original treatment in two directions. First, whereas Gordy and McNeil (2020) restrict the kernel measure to the class of probability measures, in this paper we allow the kernel measure to be unbounded, subject to an integrability condition. We show that unbounded kernels deliver tests materially more powerful than tests based on the bounded kernels studied by Gordy and McNeil (2020). Second, we introduce a pre-processing of the data by a folding transformation that leaves the size of the backtest unaltered but increases its power against misspecifications of forecast volatility that are extremely common in practice. Our extensions to the spectral backtesting framework are germane to any validation exercise in which performance throughout one or both tails of the forecast distribution is of special interest.
Our investigation is motivated by recent developments in the capital regulation of the trading operations of large banks. Under the current Basel III rules (Basel Committee on Bank Supervision, 2019), minimum capital requirements for a bank’s trading book are determined by the bank’s self-reported daily Expected Shortfall (ES) at the α₁ = 97.5% confidence level. The adoption of ES departs from earlier Basel regimes tied to Value-at-Risk (VaR) at the α∗ = 99% confidence level. Left unchanged in Basel III is the role of the regulator in validation of the bank’s model through backtesting.
For this purpose, banks in the United States report to regulators for each trading day the probability associated with the realized profit-and-loss (P&L) in the prior day’s forecast distribution, i.e., the probability integral transform (PIT) associated with realized P&L. Observing the PIT values is equivalent to observing VaR exceedances at every level α ∈ [0,1]. Besides Gordy and McNeil (2020), bank-reported PIT data have been studied by Lynch et al. (2023)
and Iercosan et al. (2023).
Under a VaR-based regime, the regulator would have particular interest in testing model performance over some range of confidence levels in the neighborhood of α∗. Accordingly, Gordy and McNeil (2020) illustrate their methods with kernels placing mass in a window [α₁,α₂] with 0 < α₁ < α∗ < α₂ < 1, e.g., [0.985,0.995]. In such a setting, only bounded measures produce finite test statistics, and since our test statistic is invariant to the measure of the window, without loss of generality we can restrict attention to probability measures. Because ES is an integral of VaR above a threshold level, it is natural under the new regime to consider continuous kernels that weight on every α above some threshold, e.g., α₁ = 0.975 in the Basel context. In this setting, even some unbounded measures can be guaranteed to yield valid test statistics. Further, because bank models tend to break down under extreme market events and unbounded measures weight most heavily on such tail events, we expect unbounded measures to deliver more powerful tests. We confirm this intuition in simulation exercises and show as well that this power does not come at the expense of size distortions.
The topic of backtesting expected shortfall has led to a lively debate about whether or not ES is amenable to backtesting (Gneiting, 2011; Acerbi and Székely, 2014; Fissler et al., 2016; Acerbi and Székely, 2023). A growing literature, including Patton et al. (2019) and Barendse et al. (2023), employs elicitability theory to develop joint backtests of VaR and ES. For regulatory use, this methodology would generally require that banks submit time series of both VaR and ES estimates, although a recent paper of Bayer and Dimitriadis (2022) suggests a workaround to obtain a test of ES estimates only, at the possible expense of some model misspecification.
Issues related to backtesting estimates of risk measures such as ES are sidestepped in our framework because we test the forecast distributions from which risk measures are estimated, rather than the estimates themselves. It may be noted that a number of recent papers propose PIT-based approaches to backtesting expected shortfall and, in particular, exploit the cumulative violation process of Du and Escanciano (2017), which can be viewed as a
particular choice of spectral transformation. These include Du et al. (2023), who propose an improved conditional ES backtest, Hoga and Demetrescu (2023), who propose a real-time monitoring procedure for ES forecasts and Hué et al. (2024) who use orthogonal polynomials to jointly test moment conditions for the cumulative violation process and the process of durations between VaR exceedances. Even if the regulator is interested exclusively in the upper tail of the PIT distribution, it is often the case that models that are misspecified in the upper tail may be similarly misspecified in the lower tail. For example, in a risk management setting, a failure to capture stochastic volatility in the distribution of financial returns leads to underestimation of extreme gains as well as extreme losses. Berkowitz et al. (2011) and O’Brien and Szerszen (2017) provide evidence of neglected stochastic volatility in the banking context by showing that simple GARCH models fitted to bank P&L often outperform bank internal models. Expressed in terms of the observed PIT values, such a misspecification would produce too few middling PIT and too many low and high PIT. Thus, even if the regulator is concerned only with large losses, a kernel that assigns no weight to the lower tail of the PIT distribution fails to capture data that may be relevant to detecting misspecification in the upper tail. We show how the regulator can pre-process the PIT values, by an operation we describe as folding, so that tail values from left and right in the original PIT distribution are mapped to the upper tail of the pre-processed distribution without altering the distribution of the test statistic under the null hypothesis. A simple example of a suitable pre-processor would apply the v-shaped mapping T(u) = |1−2u| to the PIT values. 
Under this mapping, the event that a pre-processed PIT value is in the upper tail, T(PIT) ∈ [v,1], is equivalent to the event that the PIT lies in a union of intervals in both tails, {PIT ∈ [0,(1−v)/2]∪[(1+v)/2,1]}. It is straightforward to see that if the PIT are in fact uniformly distributed (as under the null hypothesis of the backtest) then the transformed PIT are uniformly distributed as well. The linear symmetric mapping T(u) = |1−2u| is only a single example of a very large class of uniform distribution preserving
(u.d.p.) transformations. A common finding in the empirical literature is that the distribution of market returns is asymmetric such that the tail of large losses is heavier than the tail of large gains, a phenomenon that led to the development of asymmetric GARCH-type models incorporating both leverage effects and skewed innovation distributions, including AGARCH (Engle, 1990), EGARCH (Nelson, 1991) and GJR-GARCH (Glosten et al., 1993). We show that asymmetric members of the u.d.p. class can be chosen to highlight model skewness as well as kurtosis.
In Section 2, we extend the backtesting framework of Gordy and McNeil (2020) to allow for unbounded kernels and u.d.p. folding transformations. A key result demonstrates that folding is not redundant, i.e., pre-processing delivers backtests that cannot otherwise be obtained. In Section 3 we introduce two novel families of unbounded kernels. Monte Carlo simulations demonstrate that these kernels deliver backtests that are well-sized and highly sensitive to unmodelled kurtosis. A parsimonious but flexible family of v-shaped pre-processors is introduced in Section 4. Monte Carlo simulations show how pre-processing further highlights unmodelled kurtosis. Pre-processors can be effective as well in the presence of unmodelled skewness. However, in the absence of material excess kurtosis, a poorly chosen pre-processor can mask rather than enhance the signature of model misspecification. Section 5 offers guidance on implementation in practical settings.
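The equivalence between the upper-tail event for the folded PIT and the two-tail event for the original PIT, stated in the introduction for the mapping T(u) = |1−2u|, can be checked directly. The Python sketch below is illustrative only: the function names are hypothetical, and exact rational arithmetic is used so that the interval endpoints are not blurred by rounding. It confirms on a grid that T(u) ⩾ v exactly when u lies in [0,(1−v)/2] ∪ [(1+v)/2,1], and that the two intervals have total length 1−v, which is the probability a uniform variable assigns to [v,1].

```python
from fractions import Fraction


def fold_v(u: Fraction) -> Fraction:
    """Symmetric v-shaped pre-processor T(u) = |1 - 2u| on [0, 1]."""
    return abs(1 - 2 * u)


def in_two_tails(u: Fraction, v: Fraction) -> bool:
    """True when u lies in [0, (1-v)/2] or in [(1+v)/2, 1]."""
    return u <= (1 - v) / 2 or u >= (1 + v) / 2


def events_agree(v: Fraction, grid_size: int = 2000) -> bool:
    """Compare the two event definitions on an evenly spaced grid of PIT values."""
    grid = (Fraction(i, grid_size) for i in range(grid_size + 1))
    return all((fold_v(u) >= v) == in_two_tails(u, v) for u in grid)


def two_tail_length(v: Fraction) -> Fraction:
    """Total length of the two tail intervals, which should equal 1 - v."""
    return (1 - v) / 2 + (1 - (1 + v) / 2)
```

For example, `events_agree(Fraction(19, 20))` compares the two events at v = 0.95, the level at which the folded test pools the lowest and highest 2.5% of PIT values.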
2 Extended spectral backtesting
2.1 Backtesting set-up
We assume that a forecaster models portfolio losses $(L_t)$ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t \in \mathbb{N}_0}, \mathbb{P})$ where $\mathcal{F}_t$ represents the information available to the forecaster at time $t$, $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$ and $\mathbb{N}$ denotes the non-zero natural numbers.1 For any time $t \in \mathbb{N}$, the loss $L_t$ is an $\mathcal{F}_t$-measurable random variable with conditional distribution function (df) given by $F_t(x) = \mathbb{P}(L_t \leqslant x \mid \mathcal{F}_{t-1})$. In most applications this distribution is not time-invariant, due to serial dependencies in $(L_t)$ and changes in the composition of the portfolio over time.
1 $L_t$ is the negative value of P&L, so large losses are associated with the right tail of the distribution.
At time $t$ the forecaster builds a model $\hat{F}_t$ of $F_t$ based on the information $\mathcal{F}_{t-1}$. PIT values are the random variables $(P_t)$ obtained by setting $P_t = \hat{F}_t(L_t)$. If the models $\hat{F}_t$ form a sequence of ideal probabilistic forecasts in the sense of Gneiting et al. (2007), i.e., coinciding with the conditional laws $F_t$ of $L_t$ for every $t$, then the result of Rosenblatt (1952) implies that the process $(P_t)$ is a sequence of iid standard uniform variables. PIT-values contain information about exceedances of quantile estimates at any level $u$: if $\widehat{\mathrm{VaR}}_{u,t} = \hat{F}_t^{\leftarrow}(u)$ denotes the estimate of the $u$-quantile of $F_t$ calculated using the generalized inverse of $\hat{F}_t$ at probability level $u$, then $P_t \geqslant u \iff L_t \geqslant \widehat{\mathrm{VaR}}_{u,t}$.
We adopt the position of an external model validator, such as a regulator, who uses the PIT-values $(P_t)$ to take a decision on the quality of the forecasting methodology. For the purposes of this paper, we assume that the validator has access only to these PIT values although this restriction could be relaxed considerably. What is essential is that the validator does not observe the entire distribution $\hat{F}_t$, which reflects the reality of most regulatory regimes. Further, for brevity, we consider only tests of unconditional coverage. Unbounded measures and folding pre-processors carry over without complication to the tests of conditional coverage described in Gordy and McNeil (2020).
2.2 Spectral backtests
The model validator employs a spectral transformation of the PIT values of the form
$$ W_t = \int_I 1_{\{T(P_t) \geqslant u\}} \, d\nu(u) \tag{1} $$
where (i) $\nu$ is a Lebesgue-Stieltjes measure referred to as the kernel measure and (ii) $T : I \to I$ is a uniform distribution preserving (u.d.p.) transformation; if $U \sim U(0,1)$ is a standard uniform random variable and $T$ a u.d.p. transformation, then $T(U) \sim U(0,1)$. Throughout the paper, $I$ denotes the unit interval $[0,1]$. In Gordy and McNeil (2020) the measure $\nu$ was restricted to be a probability measure and the transformation $T$ was simply the identity transformation $T(v) = v$. This set-up was appropriate for a focus on the right tail of the forecast distribution. By looking at PIT exceedances of levels $u$ and using the probability measure $\nu$ to select and weight levels of interest $u$ at the upper end of the unit interval, test statistics were derived that were sensitive to forecast model specification at a range of quantiles in the right tail.
With any Lebesgue-Stieltjes measure $\nu$ on domain $I$, there is an associated increasing right-continuous function $G_\nu$, referred to as a distribution function (df), such that $\nu([0,u]) = G_\nu(u)$. It is easily seen that (1) is equivalent to the closed-form expression
$$ W_t = \nu([0, T(P_t)]) = G_\nu(T(P_t)) \tag{2} $$
which shows that $W_t$ is increasing in $T(P_t)$. Note that we employ df in a generalized sense, since $G_\nu$ is a probability df only if $\lim_{u \to 1} G_\nu(u) = 1$. To streamline the presentation, we will henceforth impose the following mild regularity condition on $\nu$.
Assumption 1. $G_\nu$ has at most a finite set of discontinuities and is otherwise absolutely continuous.
The univariate transformation extends naturally to the multivariate case in which a set of distinct kernel measures $\nu_1, \ldots, \nu_m$ is applied to PIT-values to obtain the vector-valued variables $\boldsymbol{W}_1, \ldots, \boldsymbol{W}_n$ where
$$ \boldsymbol{W}_t = (W_{t,1}, \ldots, W_{t,m})', \quad W_{t,j} = \nu_j([0, T(P_t)]) = G_j(T(P_t)), \quad j = 1, \ldots, m. \tag{3} $$
We refer to any backtest based on $\boldsymbol{W}_1, \ldots, \boldsymbol{W}_n$ as a spectral backtest. The null hypothesis addressed by an unconditional spectral backtest is
$$ H_0 : \boldsymbol{W}_t \sim F_W^0 \tag{4} $$
where $F_W^0$ denotes the df of $\boldsymbol{W}_t$ when $P_t$ is uniform. In Gordy and McNeil (2020) two types of tests were considered: spectral Z-tests based on central limit theorem arguments and spectral likelihood-ratio tests (LR-tests). The results showed a number of advantages of the former over the latter, including better control of size for similar or superior power, ease of implementation and speed of execution. In this paper we focus on Z-tests and provide the necessary extension of the theory to the Lebesgue-Stieltjes case.2
When $\dim \boldsymbol{W}_t = m$ a spectral Z-test is based on the fact that under the multivariate CLT
$$ \sqrt{n}\,\big(\overline{\boldsymbol{W}}_n - \boldsymbol{\mu}_W\big) \xrightarrow[n \to \infty]{d} N_m(\boldsymbol{0}, \Sigma_W) $$
where $\overline{\boldsymbol{W}}_n = n^{-1} \sum_{t=1}^{n} \boldsymbol{W}_t$ and $\boldsymbol{\mu}_W$ and $\Sigma_W$ are the mean vector and covariance matrix of the null distribution $F_W^0$. Hence a test can be based on assuming for large enough $n$ that
$$ T_n = n\,\big(\overline{\boldsymbol{W}}_n - \boldsymbol{\mu}_W\big)' \,\Sigma_W^{-1}\, \big(\overline{\boldsymbol{W}}_n - \boldsymbol{\mu}_W\big) \sim \chi^2_m, \tag{5} $$
where we refer to $T_n$ as an m-spectral Z-test statistic. When $m = 1$ the chi-squared test is equivalent to a two-sided test based on
$$ Z_n = \frac{\sqrt{n}\,(\overline{W}_n - \mu_W)}{\sigma_W} \xrightarrow[n \to \infty]{d} N(0,1) \tag{6} $$
where $\mu_W = E(W_t)$ and $\sigma_W^2 = \mathrm{var}(W_t)$ are the moments in the null model $F_W^0$ for $W_t$.
By definition, the u.d.p. transformation $T(P)$ does not alter the moments of $P$ under the null hypothesis. Thus, as in Gordy and McNeil (2020), the first moment $\mu_W$ of the transformed PIT-values $W_t$ is easily obtained as
$$ \mu_W = \int_I (1-u)\, d\nu(u). \tag{7} $$
The variance $\sigma_W^2$ of $W_t$ and the cross-moments in the covariance matrix $\Sigma_W$ of $\boldsymbol{W}_t$ are obtained using a simple product rule for spectrally transformed PIT values.
2 The theory of spectral LR-tests presented in Gordy and McNeil (2020) carries through in the more general case without the need for any significant modification.
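As a concrete illustration of (6) and (7), the Python sketch below computes the univariate Z-statistic for a deliberately simple kernel chosen here for tractability rather than taken from the paper's recommendations: Lebesgue measure restricted to a window $[\alpha_1, 1]$ with $T$ the identity, so that $G_\nu(u) = \max(u - \alpha_1, 0)$. Under that assumption (7) gives $\mu_W = (1-\alpha_1)^2/2$, and the second moment is $E(W_t^2) = \int_{\alpha_1}^{1} (u-\alpha_1)^2\,du = (1-\alpha_1)^3/3$. All function names are hypothetical.

```python
import math


def uniform_kernel_moments(alpha1: float) -> tuple:
    """Null mean and standard deviation of W = max(P - alpha1, 0) for P ~ U(0,1)."""
    width = 1.0 - alpha1
    mean = width ** 2 / 2.0            # equation (7) with dnu(u) = du on [alpha1, 1]
    second_moment = width ** 3 / 3.0   # integral of (u - alpha1)^2 over [alpha1, 1]
    return mean, math.sqrt(second_moment - mean ** 2)


def spectral_z_statistic(pit_values, alpha1: float) -> float:
    """Z_n from equation (6) for the uniform kernel on [alpha1, 1]."""
    n = len(pit_values)
    w_bar = sum(max(p - alpha1, 0.0) for p in pit_values) / n
    mean, sd = uniform_kernel_moments(alpha1)
    return math.sqrt(n) * (w_bar - mean) / sd


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value under the N(0,1) limit in equation (6)."""
    return math.erfc(abs(z) / math.sqrt(2.0))
```

A sample of PIT values placed at the midpoints of an even grid reproduces the null mean exactly, so `spectral_z_statistic` returns a value numerically indistinguishable from zero for such a sample; real PIT data would of course be supplied in its place.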
Theorem 2.1. The set of spectrally transformed PIT values defined by $W_{t,j} = G_j(T(P_t))$ is closed under multiplication. The product $W_t^* = W_{t,1} W_{t,2}$ is given by $W_t^* = G^*(T(P_t))$ where $\nu^*$ is a Lebesgue-Stieltjes measure and the associated function $G^*$ satisfies
$$ G^*(u) = \int_{[0,u]} \Big( G_2(s) - \tfrac{1}{2}\,\nu_2(\{s\}) \Big)\, d\nu_1(s) + \int_{[0,u]} \Big( G_1(s) - \tfrac{1}{2}\,\nu_1(\{s\}) \Big)\, d\nu_2(s). $$
It follows that $\sigma_W^2 = \mu_{W^*} - \mu_W^2$, where $\mu_{W^*}$ is found by applying (7) under the measure $\nu^*$ obtained when $\nu_1 = \nu_2 = \nu$. This yields
$$ \mu_{W^*} = \int_I (1-u)\,\big(2 G_\nu(u) - \nu(\{u\})\big)\, d\nu(u). \tag{8} $$
The central limit theorem underpinning the Z-test requires finite second moments. For the univariate case, the following proposition provides a sufficient condition on the tail behavior of $G_\nu$.
Proposition 2.2. If $G_\nu(u) = O((1-u)^{-0.5+\epsilon})$ as $u \to 1$ for some $\epsilon > 0$, then $\sigma_W^2$ is finite.
In the multivariate setting, the asymptotic distribution in (5) holds if the condition in Proposition 2.2 is satisfied for each $\nu_j$, $j = 1, \ldots, m$.
2.3 Uniform distribution preserving transformations
We are interested in u.d.p. transformations $T$ that can extend our testing framework to uncover deficiencies in forecast models that are not revealed by the identity transformation (general theory for u.d.p. transformations can be found in Porubský et al., 1988, among others). Since the choice of kernel measure in our framework is quite flexible, one might ask whether the insertion of any given u.d.p. transformation $T$ in (1) delivers a new Z-test that could not be obtained by changing the kernel measure. To this end we introduce the concept of redundancy in the test framework.
Let $\mu$ and $\sigma$ be the moments associated with kernel $\nu$. We say that a u.d.p. transformation $T$ is redundant for kernel $\nu$ if there exists another kernel $\tilde{\nu}$ with moments $\tilde{\mu}$ and $\tilde{\sigma}$ that always delivers the same magnitude $|Z_n|$ for the test statistic in (6). That is, let $\{P_1, \ldots, P_n\}$ be an arbitrary sample of PIT values, and let $\overline{W}_n = (1/n) \sum_{i=1}^{n} G_\nu(T(P_i))$ and $\widetilde{W}_n = (1/n) \sum_{i=1}^{n} G_{\tilde{\nu}}(P_i)$. Then $T$ is redundant if
$$ \frac{|\widetilde{W}_n - \tilde{\mu}|}{\tilde{\sigma}} = \frac{|\overline{W}_n - \mu|}{\sigma} $$
almost surely. As a simple example, consider the u.d.p. transformation $T(v) = 1 - v$.
Lemma 2.3. The u.d.p. transformation $T(v) = 1 - v$ is redundant for any bounded measure $\nu$ and not redundant for any unbounded measure $\nu$.
Further examples of non-redundant transformations are obtained by considering u.d.p. transformations that are folding. By this we mean transformations $T$ for which almost all values $u \in I$ are associated with multiple values in the preimage of $\{u\}$ under $T$. To make this precise we introduce some additional notation and give a definition. For a generic function $f : D \to Y$ and for a set $D_1 \subseteq D$ we write $f[D_1]$ to mean the image of $D_1$ under $f$; similarly, for a set $Y_1 \subseteq Y$, $f^{-1}[Y_1]$ is the preimage of $Y_1$ under $f$.
Definition 2.4. For a u.d.p. transformation $T : I \to I$ let $I_T \subseteq I$ be the set defined by $I_T = \{u \mid \mathrm{card}(T^{-1}[\{u\}]) \geqslant 2\}$. $T$ is folding if $I_T$ has Lebesgue measure one.
For example, the u.d.p. transformation $T(v) = |1-2v|$ has $I_T = I \setminus \{0\}$, since every $u > 0$ has the two preimages $(1 \pm u)/2$ while $u = 0$ has the single preimage $v = 0.5$, and is clearly folding. The folding class includes v-shaped, m-shaped, w-shaped and more general saw-shaped functions. Our general result for the folding class is
Proposition 2.5. Let $\nu$ be a measure for which $G_\nu$ is strictly increasing on a sub-interval of $I$. If $T$ is a folding u.d.p. transformation then it is not redundant for $\nu$.
The intuition is that $G_{\tilde{\nu}}(P)$ must be weakly increasing in $P$, which implies that lower and upper tail observations contribute in opposite signs (thereby offsetting one another) in the sample test statistic. By contrast, when we pre-process the PIT values with a folding u.d.p. transformation, $G_\nu(T(P))$ cannot be monotonic in $P$. PIT values from non-contiguous regions of the unit interval will map to the same value of $G_\nu(T(P))$. Which PIT observations contribute in the same sign and which offset each other depends on the shape of the pre-processor.
3 Two families of Lebesgue-Stieltjes kernels
We consider some possibilities for novel kernels which are not necessarily probability measures. For notational simplicity we present the theory for the case where $T$ is the identity transformation.
For the remainder of the paper, the tests we consider are based on dfs $G_\nu$ with densities $g_\nu$ satisfying $g_\nu(u) > 0$ for $\alpha_1 < u < \alpha_2$ and $g_\nu(u) = 0$ for $u < \alpha_1$ and $u > \alpha_2$. In certain cases we allow for mass at the boundaries, i.e., $\nu(\{\alpha_i\}) \geqslant 0$. We refer to the interval $[\alpha_1, \alpha_2]$ as the kernel window.
Remark 3.1. For unbounded measures, we have $G_\nu(u) \to \infty$ as $u \to \alpha_2$. In such cases, only $\alpha_2 = 1$ is admissible. Were we to choose $\alpha_2 < 1$, we would have $\Pr(\alpha_2 < P_t \leqslant 1) = (1-\alpha_2) > 0$ under the null hypothesis, so the first moment $\mu_W$ would be infinite.
3.1 Simple kernels of power form
Gordy and McNeil (2020) observe that the beta-type density $(u-\alpha_1)^{a-1}(\alpha_2-u)^{b-1}$ provides a flexible yet parsimonious and tractable form for the density of $G_\nu$. Since that paper restricted $\nu$ to the set of probability measures, it was necessary to restrict $a > 0$, $b > 0$ and to regularize the kernel by the beta function $B(a,b)$. Here we relax the restriction on $b$ and discard the regularization. For $u$ in the unit interval, let $B(u;a,b)$ denote the (unregularized) incomplete beta function
$$ B(u;a,b) = \int_0^u x^{a-1}(1-x)^{b-1}\,dx. \tag{9} $$
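For the special case $a = 1$ the integral in (9) has an elementary closed form, $B(u;1,b) = (1-(1-u)^b)/b$ for $b \neq 0$ and $B(u;1,0) = -\ln(1-u)$, which makes the growth as $u \to 1$ for $b \leqslant 0$ easy to see. The Python sketch below evaluates only this special case and cross-checks it against a midpoint-rule approximation of (9); the function names are hypothetical, and this is not a general-purpose algorithm for arbitrary $a$.

```python
import math


def incomplete_beta_a1(u: float, b: float) -> float:
    """Unregularized incomplete beta B(u; 1, b) for 0 <= u < 1 and any real b."""
    if b == 0.0:
        return -math.log1p(-u)             # integral of 1/(1-x) from 0 to u
    return (1.0 - (1.0 - u) ** b) / b      # integral of (1-x)^(b-1) from 0 to u


def incomplete_beta_midpoint(u: float, a: float, b: float, steps: int = 20_000) -> float:
    """Midpoint-rule approximation of equation (9), a cross-check valid for u < 1."""
    h = u / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x ** (a - 1.0) * (1.0 - x) ** (b - 1.0)
    return total * h
```

Evaluating `incomplete_beta_a1(u, 0.0)` for `u` approaching 1 shows the slow logarithmic divergence, whereas negative `b` gives the faster power-law divergence.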
We define the beta kernel $\nu$ via the df
$$ G_\nu(u) = B\left( \frac{(\alpha_1 \vee u \wedge \alpha_2) - \alpha_1}{\alpha_2 - \alpha_1};\, a, b \right). $$
This kernel is purely continuous, i.e., $\nu(\{\alpha_1\}) = \nu(\{\alpha_2\}) = 0$. When $b > 0$, $B(u;a,b)$ is bounded from above by $B(a,b)$ so $W_t = G_\nu(P_t)$ certainly has finite moments. However, when $b \leqslant 0$, $B(u;a,b)$ is unbounded as $u \to 1$ and, by Remark 3.1, we have to set $\alpha_2 = 1$. Moreover, the existence of moments has to be checked with the help of Proposition 2.2 and the following result.
Proposition 3.2. As $u \to 1$,
$$ B(u;a,b) = \begin{cases} O((1-u)^{b}) & b < 0, \\ O(-\ln(1-u)) & b = 0. \end{cases} $$
In combination with Proposition 2.2, it follows that $W_t = G_\nu(P_t)$ has finite first and second moments if and only if $b > -1/2$. In the case $b = 0$ we note that
$$ -\ln(1-u) = O((1-u)^{k}), \quad u \to 1, \tag{10} $$
for any $k < 0$, a fact that is used a number of times in the following sections.
The $b = 0$ case is particularly important for practical application. For small $|b|$, standard algorithms for $B(u;a,b)$ may be numerically unstable for $u$ near 1. However, for $b = 0$, González-Santander (2021, Theorem 1) provides a finite series expansion in elementary functions.
We perform Monte Carlo analyses to explore how the size and power of spectral backtests with beta-type kernels depend on the beta parameters $(a,b)$. We consider four different choices for the df $F$ of the true model of $L_t$: the standard normal, and the scaled $t_{10}$, scaled $t_5$ and scaled $t_3$. The Student t distributions are scaled to have variance one so differences stem from different tail shapes rather than different variances. We take the forecaster’s model $\hat{F}$ to be the standard normal, i.e., we transform the sampled $L_t$ to PIT-values as $P_t = \Phi(L_t)$. Therefore, when the samples of $L_t$ are drawn from the standard normal, the PIT-values are uniformly distributed and are used to evaluate the size of the tests. The PIT samples arising from the Student t distributions show the kind of departures from uniformity that are observed when the forecaster’s model is too thin-tailed.
We fix a kernel window of $[\alpha_1, 1]$ for $\alpha_1 = 0.975$. Our sample size is fixed to $n = 500$ corresponding to two-year samples of trading day returns. Our tables report the percentage of rejections of the null hypothesis at the 5% significance level based on $2^{16} = 65{,}536$ replications. All reported p-values are based on two-sided tests.

Parameters       (1,1)   (2,1)   (1,1/4)  (1,1/8)  (1,0)   (2,0)   (5,0)
Normal           4.7%    4.6%    4.6%     4.5%     4.4%    4.3%    4.9%
Scaled $t_{10}$  13.7%   19.4%   24.1%    28.6%    34.2%   40.8%   45.1%
Scaled $t_5$     21.2%   34.0%   45.7%    55.0%    64.6%   72.2%   76.4%
Scaled $t_3$     13.1%   28.7%   46.5%    61.3%    75.0%   82.2%   86.5%

Table 1: Size and power of tests based on beta monokernels. Kernel window is [0.975,1]. 2^16 trials with 500 observations per trial.

Results for univariate beta kernels are reported in Table 1. For all sets of beta parameters, tests are well-sized. For each alternative true model $F$, we find that power increases as $b$ declines and $a$ increases. The magnitude of the effect is extremely large. For example, against the scaled $t_3$ alternative, the rejection rate increases from 13.1% for the uniform (beta(1,1)) kernel to 75.0% for the beta(1,0) to 86.5% for the beta(5,0). This pattern is confirmed across a finer grid of beta parameters for the case of the Student $t_5$ in Figure 1.
To understand this pattern, observe that for any two PIT values $\alpha_1 < p_1 < p_2 < 1$, the ratio of the beta kernels $g_\nu(p_2)/g_\nu(p_1)$ decreases in $b$ and increases in $a$. The higher this ratio, the greater the weight in the test on PIT in the neighborhood of $p_2$ relative to PIT in the neighborhood of $p_1$.
As shown in Figure 2, within the kernel window of [0.975,1], the distributions of PIT under the scaled Student t alternatives differ most from the distribution under the null (green solid line) as we move deeper into the tail. Thus, we generally expect tests that weight more heavily on the right-hand tail to deliver higher power.

Figure 1: Power of tests based on beta monokernels against scaled $t_5$ alternative. Kernel window is [0.975,1]. 2^16 trials with 500 observations per trial. (Plot: rejection rate on a log scale against $a$, with one curve for each $b$ in {0, 0.5, 1, 2, 4, 8}.)

Results for bivariate beta kernels are reported in Table 2. Gordy and McNeil (2020) demonstrated that bikernel tests are generally more powerful than monokernel tests when the component kernels of the bivariate test emphasize opposite ends of the kernel support. Put another way, the lower the correlation between $W_{t,1}$ and $W_{t,2}$, the greater the additional information gain in introducing the second kernel. Gordy and McNeil (2020) illustrate the point by way of a bikernel, given mnemonic ZPP, with component parameter pairs (25,1) and (1,25), which they showed outperformed bivariate kernels of lesser curvature. Consistent with their results, we find that ZPP offers higher power than the bivariate kernel with two linear densities (parameter pairs (2,1) and (1,2)) against all three Student t alternatives. However, by allowing for unbounded beta kernels (i.e., $b = 0$), we can obtain even better performance without resorting to extreme values of parameter $a$. For example, the bivariate kernel with component parameters (2,0) and (1,3) outperforms ZPP on all three Student t alternatives.
Figure 2: Distribution functions for reported PIT-values. Dfs for the reported PIT-values when the forecaster assumes standard normal losses ($\hat{F} = \Phi$) but the true loss model $F$ is standard normal, scaled $t_{10}$, scaled $t_5$, or scaled $t_3$. (Plot: $\Pr(P_t \leqslant u)$ against $u$ on [0.95,1], with one curve per true model.)

Parameters:      (2,1)   (25,1)   (2,0)   (5/2,0)   (9/2,0)
                 (1,2)   (1,25)   (1,3)   (1/2,3)   (1/2,6)
Normal           4.8%    5.5%     5.4%    5.4%      5.5%
Scaled $t_{10}$  22.5%   38.0%    41.2%   41.3%     42.5%
Scaled $t_5$     47.3%   69.7%    74.3%   74.5%     75.3%
Scaled $t_3$     64.1%   84.7%    88.2%   88.5%     89.0%

Table 2: Size and power of tests based on beta bikernels. Kernel window is [0.975, 1]. 2^16 trials with 500 observations per trial. Gordy and McNeil (2020) assign mnemonic ZPP to the parameter pair (25,1), (1,25).
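The simulation design of this section can be sketched compactly for the beta(1,0) monokernel on the window $[\alpha_1, 1]$ with $\alpha_1 = 0.975$. For this kernel $G_\nu(u) = -\ln\big((1-u)/(1-\alpha_1)\big)$ on the window, and (7) and (8) give $\mu_W = 1-\alpha_1$ and $\sigma_W^2 = 2(1-\alpha_1) - (1-\alpha_1)^2$ under the null. The Python code below is an illustrative reimplementation using only the standard library and should not be read as the authors' code: the function names, the seed, the small default replication count and the floor on the survival probability (added to avoid taking the logarithm of zero when a simulated loss is so extreme that $1 - \Phi(L_t)$ underflows) are all assumptions.

```python
import math
import random
from typing import Optional

ALPHA1 = 0.975                                   # lower end of the kernel window
WIDTH = 1.0 - ALPHA1
MU_W = WIDTH                                     # null mean of W for beta(1,0)
SIGMA_W = math.sqrt(2.0 * WIDTH - WIDTH ** 2)    # null standard deviation of W


def survival_normal(x: float) -> float:
    """1 - Phi(x), computed with erfc so that far-tail values keep precision."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def kernel_transform(loss: float) -> float:
    """W = G_nu(Phi(loss)) for the beta(1,0) kernel on [ALPHA1, 1]."""
    sf = max(survival_normal(loss), 1e-300)      # floor is an assumption, avoids log(0)
    return math.log(WIDTH / sf) if sf < WIDTH else 0.0


def draw_loss(rng: random.Random, df: Optional[float]) -> float:
    """Standard normal if df is None, else Student t scaled to unit variance."""
    z = rng.gauss(0.0, 1.0)
    if df is None:
        return z
    chi2 = rng.gammavariate(df / 2.0, 2.0)       # chi-squared with df degrees of freedom
    return z / math.sqrt(chi2 / df) * math.sqrt((df - 2.0) / df)


def rejection_rate(df: Optional[float], n: int = 500, reps: int = 400, seed: int = 1) -> float:
    """Share of replications in which the two-sided 5% Z-test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        w_bar = sum(kernel_transform(draw_loss(rng, df)) for _ in range(n)) / n
        z_stat = math.sqrt(n) * (w_bar - MU_W) / SIGMA_W
        if abs(z_stat) > 1.959964:
            rejections += 1
    return rejections / reps
```

Calling `rejection_rate(None)` estimates size and `rejection_rate(3.0)` estimates power against the scaled $t_3$ alternative; with only a few hundred replications the estimates are much noisier than the tabulated figures based on 2^16 trials.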
3.2 Kernels derived from truncated location-scale families
In this section we look at bispectral tests that arise as score tests in truncated location-scale families (TLSFs). A test of this kind based on the normal distribution was proposed in a PhD thesis by Lok (2017) and can be viewed as a Z-test analog of the test of Berkowitz (2001), which is a likelihood-ratio test of the uniformity of PIT values in an interval against a family of non-uniform alternatives. We show below that a wide class of TLSFs yield viable bispectral tests.
Let $R(\mu,\sigma)$ denote a family of continuous probability distributions with location parameter $\mu$ and scale parameter $\sigma$ (which need not be equal to the mean and standard deviation) and let $R$ denote the df of the $R(0,1)$ distribution and $\rho$ its density. Examples include the cases where $R$ is standard normal (Lok, 2017), standard logistic or Gumbel distribution. If we assume that $R^{-1}(P_t) \sim R(\mu,\sigma)$ and write $\theta = (\mu,\sigma)'$, the df and density of $P_t$ are respectively3
$$ F_P(p \mid \theta) = R\left( \frac{R^{-1}(p) - \mu}{\sigma} \right), \quad f_P(p \mid \theta) = \frac{\rho\left( \frac{R^{-1}(p) - \mu}{\sigma} \right)}{\rho(R^{-1}(p))\,\sigma}, \quad p \in [0,1], \tag{11} $$
and the uniform distribution corresponds to $\theta = \theta_0 = (0,1)'$.
3 The assumption $R^{-1}(P_t) \sim R(\mu,\sigma)$ imposes no restriction on the underlying distribution $F_t$ for the loss $X_t$ or for the modeler’s belief $\hat{F}_t$. An auxiliary variable $\tilde{X}_t = R^{-1}(P_t)$ can be conceived as a quasi-loss in the sense that $X_t$ and $\tilde{X}_t$ will be comonotonic but otherwise unconnected in their marginal distributions.
The truncated location-scale family corresponding to the window $[\alpha_1, \alpha_2] \subseteq [0,1]$ is the mixed probability distribution of the truncated random variable $P_t^* = \alpha_1 \vee P_t \wedge \alpha_2$. This is described by the density $f_P(p \mid \theta)$ on $[\alpha_1, \alpha_2)$, an atom of size $F_P(\alpha_1 \mid \theta)$ at $\alpha_1$ if $\alpha_1 > 0$, and an atom of size $1 - F_P(\alpha_2 \mid \theta)$ at $\alpha_2$ if $\alpha_2 < 1$; if $\alpha_2 = 1$ then $P_t^* = \alpha_1 \vee P_t$ and the density $f_P(p \mid \theta)$ applies to the closed interval $[\alpha_1, 1]$.
If $L_{P^*}(\theta \mid P_t)$ denotes the likelihood contribution of the PIT observation $P_t$ in this truncated model, then the score vector is given by
$$ S_t(\theta) = \left( \frac{\partial}{\partial\mu} \ln L_{P^*}(\theta \mid P_t),\; \frac{\partial}{\partial\sigma} \ln L_{P^*}(\theta \mid P_t) \right)'. \tag{12} $$
Let $\overline{S}_n(\theta_0) = \frac{1}{n} \sum_{t=1}^{n} S_t(\theta_0)$ be the mean of the observed score vectors under the null. Standard likelihood theory implies that $\sqrt{n}\,\overline{S}_n(\theta_0) \xrightarrow[n \to \infty]{d} N_2\big(\boldsymbol{0}, \Upsilon(\theta_0)\big)$ under the null, where $\Upsilon(\theta)$ denotes the covariance matrix of $S_t(\theta)$, i.e., the Fisher information matrix. For large $n$ the score test statistic satisfies
$$ n\,\overline{S}_n(\theta_0)'\, \Upsilon(\theta_0)^{-1}\, \overline{S}_n(\theta_0) \sim \chi^2_2. \tag{13} $$
Computation of the score vector and the information matrix is detailed in Appendix D.
We now give conditions under which the score test (13) can be viewed as a bispectral Z-test with kernel measures $\nu_1$ and $\nu_2$ given by sums of discrete and continuous parts. The following assumption will be imposed in our key result.
Assumption 2. The df $R$ underlying the TLSF score test is absolutely continuous with log-concave density $\rho$ and support $\mathbb{R}$.
Theorem 3.3. Let $0 < \alpha_1 < \alpha_2 \leqslant 1$ and assume that Assumption 2 holds. Let $\lambda_\rho$ denote the function $\lambda_\rho(x) = -\rho'(x)/\rho(x)$. Then the equation
$$ x \left( \frac{\rho(x)}{R(x)} + \lambda_\rho(x) \right) - 1 = 0 \tag{14} $$
has a unique root $\underline{x} > 0$ and, provided that $\alpha_1 \geqslant \underline{\alpha} \equiv R(\underline{x})$, the score vector $S_t(\theta_0)$ satisfies $S_t(\theta_0) = \boldsymbol{W}_t - \boldsymbol{\mu}_W$ where $W_{t,i} = G_i(P_t)$,
$$ G_i(u) = \gamma_{i,1} 1_{\{u \geqslant \alpha_1\}} + \gamma_{i,2} 1_{\{u \geqslant \alpha_2,\, \alpha_2 < 1\}} + \int_0^u g_i(x)\,dx, \tag{15} $$
the densities $g_i$ are given by the derivatives
\[
g_1(u) = \frac{d}{du}\Big(\lambda_\rho(R^{-1}(u))\Big)\mathbf{1}_{\{\alpha_1 \le u \le \alpha_2\}}, \qquad
g_2(u) = \frac{d}{du}\Big(R^{-1}(u)\,\lambda_\rho(R^{-1}(u))\Big)\mathbf{1}_{\{\alpha_1 \le u \le \alpha_2\}}, \tag{16}
\]
the constants $\gamma_{i,j}$ are given by the non-negative values
\[
\gamma_{1,1} = \frac{\rho(R^{-1}(\alpha_1))}{\alpha_1} + \lambda_\rho\big(R^{-1}(\alpha_1)\big), \qquad \gamma_{2,1} = R^{-1}(\alpha_1)\,\gamma_{1,1} - 1,
\]
\[
\gamma_{1,2} = \frac{\rho(R^{-1}(\alpha_2))}{1-\alpha_2} - \lambda_\rho\big(R^{-1}(\alpha_2)\big), \qquad \gamma_{2,2} = R^{-1}(\alpha_2)\,\gamma_{1,2} + 1,
\]
and $\mu_W = \alpha_1^{-1}\rho(R^{-1}(\alpha_1))\,\big(1,\ R^{-1}(\alpha_1)\big)'$.

Remark 3.4. If the density $\rho$ is not log-concave, we would typically have upper as well as lower bounds on the window over which the score test gives a well-defined bispectral test. For elaboration and illustration, see Appendix B.

We now consider the two cases $\alpha_2 < 1$ and $\alpha_2 = 1$ separately. The former case is straightforward since the $W_{t,i}$ variables are bounded, guaranteeing that the elements of $\Upsilon(\theta_0)$ are finite. In this case it would be possible to normalize the measures $\nu_i$ to be probability measures by dividing by $G_i(\alpha_2)$, although there is no practical advantage in doing so.

If $\alpha_2 = 1$, it follows from Theorem 3.3 and formula (15) that for $u \in [\alpha_1,1]$ the dfs $G_i$ have the forms
\[
\begin{aligned}
G_1(u) &= \rho(R^{-1}(\alpha_1))/\alpha_1 + \lambda_\rho(R^{-1}(u)),\\
G_2(u) &= R^{-1}(\alpha_1)\,\rho(R^{-1}(\alpha_1))/\alpha_1 - 1 + R^{-1}(u)\,\lambda_\rho(R^{-1}(u)).
\end{aligned} \tag{17}
\]
Since $\lambda_\rho$ is an increasing function and $R^{-1}(u) \to \infty$ as $u \to 1$, we can infer that $G_1(u) \le G_2(u)$ and that $G_2(u) \to \infty$ as $u \to 1$. In this case $G_2$ is unbounded and we cannot normalize the measure $\nu_2$ to be a probability measure. We need to verify that the condition of Proposition 2.2 is satisfied for $G_2$ to be sure that the elements of $\Upsilon(\theta_0)$ are finite.
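The lower bound on $\alpha_1$ in Theorem 3.3 is easy to compute numerically. The following is a minimal sketch, assuming SciPy is available, that solves equation (14) by bracketing for three log-concave families. The dictionary `FAMILIES`, the function name `window_threshold` and the bracketing interval are our own illustrative choices; the expressions for $\lambda_\rho$ are written out by hand from $\lambda_\rho(x) = -\rho'(x)/\rho(x)$.

```python
import math
from scipy.optimize import brentq
from scipy.stats import gumbel_r, logistic, norm

# Each family supplies its standard df and density (via SciPy) together with
# lambda_rho(x) = -rho'(x) / rho(x), written out by hand.
FAMILIES = {
    "normal": (norm, lambda x: x),
    "logistic": (logistic, lambda x: logistic.cdf(x) - logistic.cdf(-x)),
    "gumbel": (gumbel_r, lambda x: 1.0 - math.exp(-x)),
}


def window_threshold(family):
    """Return (root of eq. (14), lower bound R(root) on alpha_1)."""
    dist, lam = FAMILIES[family]

    def lhs(x):
        # Left-hand side of eq. (14): x * (rho(x)/R(x) + lambda_rho(x)) - 1
        return x * (dist.pdf(x) / dist.cdf(x) + lam(x)) - 1.0

    # Assumed bracket: lhs is negative near 0 and positive at 10 for these families.
    root = brentq(lhs, 1e-8, 10.0)
    return root, float(dist.cdf(root))


for name in FAMILIES:
    x_low, alpha_low = window_threshold(name)
    print(f"{name:8s} root = {x_low:.3f}, alpha_1 must be at least {alpha_low:.3f}")
```

For these three families the thresholds come out at roughly 0.80, 0.78 and 0.69 respectively, so a kernel window such as [0.975, 1] satisfies the constraint comfortably. Other log-concave families can be added by supplying a SciPy distribution and the matching $\lambda_\rho$.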
Remark 3.5. While $G_2$ is always unbounded, $G_1$ can be bounded if $\lim_{x\to\infty}\lambda_\rho(x)$ is finite. For example, this occurs when $\rho$ is the logistic density or the Gumbel density. These cases are analysed in Examples 3.8 and 3.9 below; in both cases $\lim_{x\to\infty}\lambda_\rho(x) = 1$.

Remark 3.6. If a TLSF distribution $R$ is skewed, then there exists a complementary TLSF distribution of opposite skew with df $R_c(x) = 1 - R(-x)$. If $\rho$ is log-concave, then so is $\rho_c$. To simplify implementation, we can exploit the relationships $\rho_c(x) = \rho(-x)$, $\lambda_{\rho_c}(x) = -\lambda_\rho(-x)$, and $R_c^{-1}(p) = -R^{-1}(1-p)$.

The main limitation on the application of bispectral tests based on TLSF score tests is the requirement that $\alpha_1 \ge \underline{\alpha}$ where $\underline{\alpha} = R(\underline{x})$ and $\underline{x} > 0$ solves (14). The fact that $\underline{\alpha} > R(0)$ shows that a portion of the interval $[0,1]$ must be eliminated from consideration. We illustrate the TLSF score test and the constraint on $\alpha_1$ with several examples of log-concave densities. Expressions for $\Upsilon(\theta_0)$ and other computational details are found in the Online Supplement (Appendix 3).

Example 3.7 (Test based on normal distribution). When $R(\mu,\sigma) = N(\mu,\sigma^2)$ and $R = \Phi$, $P_t$ is said to follow a probitnormal distribution on the unit interval. This distribution also appears as the nesting model in the well-known LR-test of Berkowitz (2001). The normal distribution has a log-concave density with $\lambda_\rho(x) = x$ and (14) has unique root $\underline{x} \approx 0.84$. The conditions for a bispectral test are satisfied if $\alpha_1 \ge \underline{\alpha} = \Phi(\underline{x}) \approx 0.80$, which is unlikely to bind in application to the range of tail probability levels of practical interest. From (17) we have that, when $\alpha_2 = 1$,
\[
G_1(u) \sim \Phi^{-1}(u) \quad\text{and}\quad G_2(u) \sim \Phi^{-1}(u)^2, \quad\text{as } u \to 1.
\]
Because $G_2(u) \sim -2\ln(1-u)$ as $u \to 1$ (Abramowitz and Stegun, eds, 1965, eq. 26.2.22), it follows from (10) and Proposition 2.2 that second moments are finite.

Example 3.8 (Test based on logistic distribution). The standard logistic distribution has $R(x)$ given by the logistic function $S(x) = 1/(1 + \exp(-x))$.
The density is log-concave
with $\lambda_\rho(x) = S(x) - S(-x)$ and (14) has unique root $\underline{x} \approx 1.28$. A bispectral test may be constructed for $\alpha_1 \ge \underline{\alpha} \approx 0.78$. Interestingly, the first kernel density function satisfies $g_1(u) = 2$, implying constant weighting in the kernel window, while, for $\alpha_2 = 1$, (17) implies that $G_2(u) \sim -\ln(1-u)$ as $u \to 1$ and we argue as in the normal case that second moments are finite.

Example 3.9 (Two tests based on Gumbel distribution). Unlike the normal and logistic distributions, the Gumbel distribution is skewed, so weights left and right tails asymmetrically. Both the standard Gumbel (positive skew) and complementary Gumbel (negative skew) have log-concave density. For the standard Gumbel we have $\lambda_\rho(x) = 1 - \exp(-x)$ and (14) has unique root $\underline{x} = 1$. A bispectral test may be constructed for $\alpha_1 \ge \underline{\alpha} = R(1) = e^{-1/e} \approx 0.69$. The first kernel density function is the decreasing function $g_1(u) = u^{-1}\mathbf{1}_{\{\alpha_1 \le u \le \alpha_2\}}$ while it may be easily deduced from (17) that the df of the second kernel satisfies $G_2(u) \sim -\ln(-\ln u)$ as $u \to 1$ when $\alpha_2 = 1$. Since $-\ln u \sim 1 - u$ as $u \to 1$ we once again have that $G_2(u) \sim -\ln(1-u)$ as $u \to 1$, guaranteeing finiteness of second moments.

For the complementary Gumbel a bispectral test can be constructed for $\alpha_1 \ge \underline{\alpha}_c \approx 0.87$. The first kernel density function is the increasing function $g_1(u) = (1-u)^{-1}\mathbf{1}_{\{\alpha_1 \le u \le \alpha_2\}}$ while it may be deduced from (17) that $G_2(u) \sim \ln\left(\frac{1}{1-u}\right)\ln\left(\ln\left(\frac{1}{1-u}\right)\right)$.

Example 3.10 (Family of tests based on the logistic-beta distribution). The test based on the logistic distribution of Example 3.8 is a special case in a larger family of tests based on the logistic-beta distribution for which the df is a composition of the beta df $I(z;a,b)$ and the logistic function, i.e., $R(x) = I(S(x);a,b)$ where $a > 0$ and $b > 0$. The standard logistic is the case where $a = b = 1$. The density is log-concave. Cumulants are known in closed form, from which it is easily verified that the skew of the distribution has the same sign as $a - b$.
By the reflection symmetry property of the beta df, the complementary distribution of the logistic-beta (a,b)
is itself a logistic-beta distribution with parameters $(b,a)$. By varying parameters $(a,b)$ one obtains a rich family of distributions which vary materially in their shapes and higher moments. Nonetheless, all members of the family share the same tail behavior for the second kernel:

Proposition 3.11. For all parameters $a > 0$, $b > 0$ of the logistic-beta family, $G_2(u) \sim -\ln(1-u)$ as $u \to 1$.

Figure 3 summarizes the tail behaviors of the examples described above. We do not include the logistic-beta because it is virtually indistinguishable from the logistic and Gumbel cases.

[Figure 3: Tail behavior of the $G_2$ kernel. Horizontal axis: $-\ln(1-\mathrm{PIT})$; vertical axis: $G_2(\mathrm{PIT})$; series: Comp Gumbel, Gumbel, Logistic, Probitnormal. The Gumbel and logistic lines are visually indistinguishable. The asymptote lines are $y = -1 + x$ and $y = -6 + 2x$.]

We assess size and power of TLSF tests following the methodology of Section 3.1. We include two logistic-beta distributions, one left-skewed and one right-skewed, in addition to the logistic. As a benchmark, we also include the Berkowitz (2001) LR-test. Results are reported in Table 3. Comparing the first column to the last, we see that the probitnormal TLSF test is
notably more powerful than its LR analog, the Berkowitz test. The complementary Gumbel is the most powerful of the TLSF tests. It is comparable in size and power to the bivariate beta kernel tests with parameters ((2,0),(1,3)) as reported in Table 2. The most striking feature of the table is that the Gumbel, logistic, and both logistic-beta TLSF tests deliver virtually identical size and power against all three alternatives. This is a consequence of the identical tail behavior of their respective $G_2(p)$ kernels as $p \to 1$. The probitnormal, which outperforms the Gumbel and logistic somewhat, has a steeper asymptote in Figure 3, while the complementary Gumbel is steepest of all and convex.

F             Probitnormal   Comp. Gumbel   Gumbel   Logistic   Logistic-Beta (3/2, 1/2)   Logistic-Beta (1/3, 2/3)   Berkowitz
Normal        5.1%           5.5%           5.1%     5.1%       5.0%                       5.1%                       5.1%
Scaled t_10   38.9%          41.1%          36.8%    36.8%      36.8%                      36.8%                      28.9%
Scaled t_5    72.9%          74.5%          71.1%    71.2%      71.2%                      71.1%                      65.5%
Scaled t_3    88.0%          88.8%          87.0%    87.0%      87.0%                      87.0%                      86.5%

Table 3: Size and power of tests based on TLSF bikernels. Kernel window is [0.975, 1]. 2^16 trials with 500 observations per trial. "Berkowitz" denotes the Berkowitz (2001) LR-test.

4 A family of v-shaped u.d.p. transformations

While the spectral tests we have developed will work with any u.d.p. transformation, we will confine our practical examples to transformations $T(v)$ that are v-transforms. V-transforms constitute a flexible parametric class of u.d.p. transformations that are well-suited to modeling volatile financial time series. Such transforms map values near a central fulcrum point to near zero and values near the boundaries to near one, so are useful in situations where we wish to emphasize PIT values in either tail. The symmetric linear v-transform $T(v) = |1-2v|$ is the simplest case and an obvious choice when we have no reason to suspect asymmetry in the true model. Following McNeil (2021), we define
Definition 4.1. A v-transform is a mapping $T : I \to I$ with the following properties:

1. $T(0) = T(1) = 1$ and there exists a fulcrum point $0 < \delta < 1$ such that $T(\delta) = 0$;
2. $T$ is continuous on $I$, strictly decreasing on $[0,\delta]$ and strictly increasing on $[\delta,1]$;
3. For every point $u \in I$ there is a point $l(u) \in [0,\delta]$ satisfying $T(l(u)) = u$ and $T(l(u)+u) = u$.

It is straightforward to verify that such a transformation preserves uniformity: if $u \in I$ and $P_t \sim U(0,1)$ then $\Pr(T(P_t) \le u) = \Pr(l(u) \le P_t \le l(u)+u) = u$ so that $T(P_t) \sim U(0,1)$. V-transforms may be characterized as mappings $T : I \to I$ taking the form
\[
T(v) = \begin{cases}
(1-v) - (1-\delta)\,\Psi\left(\dfrac{v}{\delta}\right) & v \le \delta,\\[2ex]
v - \delta\,\Psi^{-1}\left(\dfrac{1-v}{1-\delta}\right) & v > \delta,
\end{cases} \tag{18}
\]
where $0 < \delta < 1$ and $\Psi$ is a continuous and strictly increasing distribution function on $I$ (McNeil, 2021, Theorem 1).

Figure 4 shows a number of examples of linear and nonlinear v-transforms constructed using (18) and the function $\Psi(v) = v^\kappa$, which we refer to as a generator of a v-transform. The symmetric linear v-transform corresponds to $\delta = 1/2$ and $\kappa = 1$. For small $\epsilon$ ($\epsilon < \min\{\delta, 1-\delta\}$) in the linear case, it is straightforward to show that $T(\epsilon) - T(1-\epsilon)$ is increasing in $\delta$. When $\kappa > 1$, the right arm of the v-transform is convex and the left arm concave. For small $\epsilon$, we can show that $T(\epsilon) - T(1-\epsilon)$ is increasing in $\kappa$. That is, moving the fulcrum to the right or increasing convexity in $\Psi(v)$ increases the emphasis on PIT observations in the left tail relative to those in the right tail.

We will sometimes refer to v-transforms satisfying Definition 4.1 as proper v-transforms and describe the identity transformation $T(v) = v$ and the u.d.p. transformation $T(v) = 1-v$ as degenerate v-transforms; the latter are degenerate because they satisfy $\lim_{\delta\to 0} T(v;\delta) = v$ for $v \in (0,1]$ and $\lim_{\delta\to 1} T(v;\delta) = 1-v$ for $v \in [0,1)$, for any family of proper v-transforms $T(\cdot\,;\delta)$ indexed by $\delta$.
It is straightforward to see that all proper v-transforms are folding in the sense of Definition 2.4 since, in all cases, $I_T = I \setminus \{\delta\}$.
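The characterization (18) translates directly into code. The sketch below, assuming NumPy and the power generator $\Psi(v) = v^\kappa$ (so that $\Psi^{-1}(y) = y^{1/\kappa}$), evaluates the v-transform and checks uniformity preservation on a deterministic grid. The function names `v_transform` and `folded_df` and the grid size are illustrative choices of ours.

```python
import numpy as np


def v_transform(v, delta=0.5, kappa=1.0):
    """v-transform of eq. (18) with power generator Psi(v) = v**kappa."""
    if not 0.0 < delta < 1.0:
        raise ValueError("fulcrum delta must lie strictly between 0 and 1")
    v = np.asarray(v, dtype=float)
    # Left arm (v <= delta): (1 - v) - (1 - delta) * Psi(v / delta)
    left = (1.0 - v) - (1.0 - delta) * (v / delta) ** kappa
    # Right arm (v > delta): v - delta * Psi^{-1}((1 - v) / (1 - delta)),
    # where Psi^{-1}(y) = y**(1 / kappa) for the power generator.
    right = v - delta * ((1.0 - v) / (1.0 - delta)) ** (1.0 / kappa)
    return np.where(v <= delta, left, right)


def folded_df(u, delta=0.5, kappa=1.0, n_grid=200_000):
    """Approximate Pr(T(P) <= u) for P ~ U(0,1) using a midpoint grid."""
    p = (np.arange(n_grid) + 0.5) / n_grid
    return float(np.mean(v_transform(p, delta, kappa) <= u))


# With delta = 1/2 and kappa = 1 the transform reduces to |1 - 2v|.
grid = np.linspace(0.0, 1.0, 11)
print(np.allclose(v_transform(grid), np.abs(1.0 - 2.0 * grid)))
# For an asymmetric, nonlinear case the folded values remain uniform.
print(folded_df(0.3, delta=1 / 3, kappa=2.0))
```

Because the pre-processor is a pure function of the PIT values, it can be applied to a sample before any spectral test without changing the test code itself.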
[Figure 4: Examples of v-transforms with generator $\Psi(v) = v^\kappa$. Both panels plot $T(v)$ against $v$ on $[0,1]$. Linear v-transforms are shown in the left panel, for which we fix $\kappa = 1$ and vary the fulcrum $\delta \in \{1/2, 1/3, 2/3\}$. Nonlinear v-transforms are shown in the right panel, for which we fix $\delta = 1/2$ and vary the exponent $\kappa \in \{1/2, 2\}$.]

Fix a kernel $\nu_1$ on the unit interval, and let $\nu_\eta$ be the same kernel scaled to the window of width $\eta$ at the upper end of the unit interval, i.e., the window $[1-\eta,1]$. Similar to the scaling of the beta kernel in Section 3.1, we can write
\[
G_\eta(u) = G_1\left(\frac{\big((1-\eta) \vee u\big) - (1-\eta)}{\eta}\right).
\]
If the PIT data $\{P_t\}$ are pre-processed by the v-transform $T(v) = |1-2v|$, then transformed values $\{W_t\}$ satisfy $W_t > 0$ if and only if $P_t \in [0,\eta/2] \cup [1-\eta/2,1]$. In particular, PIT observations in $[1-\eta,1-\eta/2]$ no longer receive weight. Thus, a possibly unintended consequence of the pre-transformation is that it concentrates weight on more extreme portions of the tails. To remedy this situation, we can double the kernel width, i.e., we replace $G_\eta(u)$ by $G_{2\eta}(u)$ so that $W_t > 0$ if and only if $P_t \in [0,\eta] \cup [1-\eta,1]$.

We illustrate with a simple exercise. We fix the kernel to the beta bikernel with parameter pairs (1,0) and (1,2). As in Section 3, we consider four different choices for the df $F$ of
the true model of $L_t$: the standard normal, and the scaled $t_{10}$, scaled $t_5$ and scaled $t_3$, and we assume that the forecaster's model is $\widehat F = \Phi$. In Table 4, the column labeled "identity" reports rejection rates in the absence of pre-processing (equivalently, by application of the degenerate v-transform $T(v) = v$) whereas the final column reports rejection rates when pre-processing by the proper v-transform $T(v) = |1-2v|$. The narrow window has kernel width $\eta = 0.025$ whereas the wide window has kernel width $2\eta = 0.05$.

F              identity   |1-2u|
Kernel window: Narrow
Normal         5.3%       5.5%
Scaled t_10    40.8%      60.9%
Scaled t_5     74.1%      92.1%
Scaled t_3     88.1%      97.9%
Kernel window: Wide
Normal         5.0%       5.1%
Scaled t_10    38.6%      58.8%
Scaled t_5     75.4%      92.2%
Scaled t_3     93.9%      98.7%

Table 4: Size and power of v-transformed tests under excess kurtosis. Beta bikernel with parameters ((1,0),(1,2)). 2^16 trials with 500 observations per trial.

Comparing the upper left quadrant to the lower left and the upper right quadrant to the lower right, we see that power can increase or decrease when the kernel width is doubled. As discussed in Section 3.1, tests are most powerful when the kernel window coincides with the range over which the true distribution differs most strongly from the forecaster's model. Doubling the kernel width can strengthen or dilute that coincidence.

Comparing the upper left quadrant to the lower right, we see that the combination of pre-processing with adjustment of the kernel window substantially increases the power of the backtest. Because in the comparison both sets of tests weight identically on the upper 0.025 of the distribution of PIT values (under the null), the improved performance is coming from the addition of the lower 0.025 of the distribution of PIT values in the tests of the lower right quadrant.
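The effect of folding on the effective kernel window can be checked in a few lines. The sketch below, assuming NumPy, finds the set of PIT values whose folded image $|1-2P|$ lands in the window $[1-\eta, 1]$; the helper name `weighted_region` and the grid resolution are our own choices, and the kernel itself is deliberately left out since only the support matters here.

```python
import numpy as np


def weighted_region(eta, n_grid=1_000_000):
    """Locate PIT values that fall in the kernel window after folding.

    Returns the upper edge of the lower-tail region, the lower edge of the
    upper-tail region, and the total probability mass that receives weight.
    """
    p = (np.arange(n_grid) + 0.5) / n_grid      # midpoint grid on (0, 1)
    folded = np.abs(1.0 - 2.0 * p)              # symmetric linear v-transform
    hit = folded >= 1.0 - eta                   # inside the window [1 - eta, 1]
    lower_edge = p[hit & (p < 0.5)].max()
    upper_edge = p[hit & (p >= 0.5)].min()
    return float(lower_edge), float(upper_edge), float(hit.mean())


for eta in (0.025, 0.05):
    lo, hi, mass = weighted_region(eta)
    print(f"eta = {eta}: weight on [0, {lo:.4f}] and [{hi:.4f}, 1], total mass {mass:.4f}")
```

With the narrow width the weighted region is approximately $[0, 0.0125] \cup [0.9875, 1]$; doubling the width restores the upper tail $[0.975, 1]$ and adds the matching lower tail $[0, 0.025]$.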
When the true model exhibits skewness as well as excess kurtosis, in some situations one can obtain a more powerful backtest by pre-processing with an asymmetric v-transform. However, when the model is close to mesokurtic, the pre-processor must be well-chosen. To illustrate, consider the family of Fernández and Steel (1998) (hereafter "FS") skew-t distributions. We use a standardized version of this family with density
\[
f_{FS}(x;\gamma,\zeta) = \frac{2\gamma}{\sigma(1+\gamma^2)}
\begin{cases}
f_t\left(\dfrac{\gamma(x-\mu)}{\sigma};\zeta\right) & x \le \mu,\\[2ex]
f_t\left(\dfrac{x-\mu}{\gamma\sigma};\zeta\right) & x > \mu,
\end{cases} \tag{19}
\]
where $f_t(\cdot\,;\zeta)$ denotes the density of the standard t-distribution with $\zeta > 2$ degrees of freedom, $\gamma$ is a skewness parameter, and location $\mu = \mu(\zeta,\gamma)$ and scale $\sigma = \sigma(\zeta,\gamma)$ are chosen so that the FS distribution has mean 0 and variance 1; see the Online Supplement (Appendix 1) on this standardization. In the case of $\zeta = \infty$, we have the FS-Normal($\gamma$) distribution. In the case of $\gamma = 1$, we have the scaled $t_\zeta$ distribution as used earlier in Table 4 and in Section 3.

Figure 5 illustrates the PIT signature for unmodelled skewness. In the case of the FS-Normal distribution (left panel), for $\gamma = 3/2$ (blue line) the PIT density features a local maximum in $(0,0.5)$ and a local minimum in $(0.5,1)$. A symmetric pattern arises when we invert $\gamma$ ($\gamma = 2/3$, red line). Numerically, we can show that skewness of the PIT distribution takes the same sign as $\gamma - 1$ (so right skew for $\gamma = 3/2$, left skew for $\gamma = 2/3$) and increases in magnitude with $|\gamma - 1|$. Kurtosis of the PIT distribution is invariant to $\gamma$ (kurtosis of 9/5, equal to that of the uniform distribution). When we increase kurtosis in the true model ($\zeta = 5$, right panel), the departure from uniformity in the PIT density is magnified but the visual pattern is qualitatively similar. Numerically, we find that skewness is decreasing in magnitude as $\zeta \to \infty$ and as $\gamma \to 1$. Kurtosis is decreasing in $\zeta$ and increasing in $|\gamma - 1|$.
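The PIT density under misspecification can be computed directly from (19). The sketch below treats the FS-Normal case ($\zeta = \infty$) with forecaster model $\Phi$, assuming NumPy and SciPy. Since the closed-form standardization from the Online Supplement is not reproduced here, $\mu$ and $\sigma$ are obtained by numerical integration, which is our own substitute; all function names are illustrative.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm


def fs_raw_pdf(y, gamma):
    """FS skew-normal density before standardization (location 0, scale 1)."""
    y = np.asarray(y, dtype=float)
    c = 2.0 * gamma / (1.0 + gamma**2)
    return c * np.where(y <= 0.0, norm.pdf(gamma * y), norm.pdf(y / gamma))


def raw_moment(k, gamma):
    """k-th raw moment of the unstandardized density, integrating each side of 0."""
    f = lambda y: y**k * float(fs_raw_pdf(y, gamma))
    return quad(f, -np.inf, 0.0)[0] + quad(f, 0.0, np.inf)[0]


def fs_standardization(gamma):
    """Numerically determine (mu, sigma) in eq. (19) giving mean 0 and variance 1."""
    m1, m2 = raw_moment(1, gamma), raw_moment(2, gamma)
    sigma = 1.0 / np.sqrt(m2 - m1**2)
    return -sigma * m1, sigma


def fs_normal_pdf(x, gamma):
    """Standardized FS-Normal(gamma) density, i.e. eq. (19) with zeta = infinity."""
    mu, sigma = fs_standardization(gamma)
    return fs_raw_pdf((np.asarray(x, dtype=float) - mu) / sigma, gamma) / sigma


def pit_density(p, gamma):
    """Density of P = Phi(X) when X ~ FS-Normal(gamma) but the forecaster uses Phi."""
    x = norm.ppf(p)
    return fs_normal_pdf(x, gamma) / norm.pdf(x)


p = np.array([0.05, 0.25, 0.5, 0.75, 0.95])
print(pit_density(p, 1.0))   # gamma = 1: no misspecification
print(pit_density(p, 1.5))   # right-skewed true model
```

When $\gamma = 1$ the true and forecast models coincide and the PIT density is identically one; for $\gamma \ne 1$ the function traces out the departure from uniformity. The same ratio-of-densities construction would apply to finite $\zeta$ by swapping in a t density, subject to the same numerical standardization.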
The exercise of Table 5 is similar to that of Table 4, except that we vary the skewness of the true model rather than its kurtosis. The rejection rates in the top row, labeled "Normal"
(equivalent to FS-Normal(1)), capture the size of the test. Parameter $\gamma$ increases as we move down the rows of the table, which increases right skew. Comparing the first two columns shows that pre-processing with $T(v) = |1-2v|$ decreases power, regardless of whether we double the width of the kernel window. Under the FS-Normal($\gamma$) model, left-tail PIT values are underrepresented and right-tail PIT values overrepresented. Folding the PIT distribution averages over these effects, thereby obscuring the departure from uniformity.

[Figure 5: Density for PIT under model misspecification. Left panel: $\zeta = \infty$; right panel: $\zeta = 5$. Horizontal axis: PIT; vertical axis: density; series: $\gamma = 2/3$ and $\gamma = 3/2$. The true model is FS-t with $\zeta$ degrees of freedom and skewness parameter $\gamma$. The dotted line is the uniform density.]

The remaining columns of Table 5 vary parameters $(\delta,\kappa)$ of the pre-processor. For the right-skewed true models, we observe that an asymmetric pre-processor can materially increase the power of the backtest relative to a symmetric pre-processor, but all pre-processors reduce power relative to the case of no pre-processor for the reasons just discussed. Rejection rates are higher for $\delta = 1/3$ than for $\delta = 2/3$ and higher for $\kappa = 1/2$ than for $\kappa = 2$. The intuition is most straightforward for the linear case: as we reduce $\delta$, the pre-processor $T(v;\delta)$ converges towards the identity pre-processor (equivalent to $\delta = 0$).

Finally, in Table 6, we report rejection rates for a true model with considerable excess kurtosis as well as skewness. The results inherit some features from each of the previous
two tables. As in Table 4, the symmetric linear pre-processor materially increases the power of the backtest when $\gamma$ is close to 1. As skewness increases, the considerations at work in Table 5 begin to dominate. As in Table 5, we can further increase the power in some cases by nudging the fulcrum of the pre-processor to the left. For larger values of $\gamma$, a nonlinear pre-processor ($\kappa = 1/2$) is even more effective.

F                   identity   |1-2u|   (1/3, 1)   (2/3, 1)   (1/2, 1/2)   (1/2, 2)
Kernel window: Narrow
Normal              5.3%       5.5%     5.4%       5.4%       5.3%         5.2%
FS-Normal(51/50)    7.0%       5.6%     5.9%       5.1%       6.9%         4.2%
FS-Normal(26/25)    8.8%       5.7%     6.5%       4.8%       8.7%         3.4%
FS-Normal(6/5)      31.9%      10.2%    15.3%      6.4%       30.5%        3.7%
FS-Normal(4/3)      52.2%      17.9%    27.1%      11.5%      50.3%        11.9%
Kernel window: Wide
Normal              5.0%       5.1%     5.1%       5.2%       4.9%         5.1%
FS-Normal(51/50)    6.4%       5.3%     5.6%       4.9%       6.4%         4.3%
FS-Normal(26/25)    8.2%       5.4%     6.0%       4.8%       7.8%         3.6%
FS-Normal(6/5)      30.9%      9.0%     13.7%      5.8%       28.7%        5.4%
FS-Normal(4/3)      51.1%      16.4%    25.1%      11.2%      48.5%        16.3%

Table 5: Size and power of v-transformed tests under skewness with minimal excess kurtosis. Column headers of form $(\delta,\kappa)$ refer to the v-transform with fulcrum $\delta$ and generator $\Psi(u) = u^\kappa$. The identity pre-processor is equivalent to $(\delta = 0, \kappa = 1)$ and $|1-2u|$ equivalent to $(\delta = 1/2, \kappa = 1)$. All tests utilize the beta bikernel with parameters ((1,0),(1,2)). 2^16 trials with 500 observations per trial.

In Section 3.1 and in Gordy and McNeil (2020), we highlight the intuition that a kernel is most powerful against a given alternative when it places greater mass on regions of the unit interval over which the forecaster's PIT distribution departs most from uniformity. The intuition extends to the choice of pre-processor. In Figure 6, we plot the df of the linear $T(P;\delta)$ for $P \sim$ FS-t(5, 26/25) and various values of $\delta$. Consistent with the findings in Table 6, we see that the df of the PITs crosses the uniform df within the wide kernel window of [0.95, 1].
The dfs of the pre-processed PIT values are pointwise more distant from the uniform df except at the left edge of the window. Departure from uniformity is slightly decreasing across the three values of δ, which accords with the slight differences in power of
the corresponding tests.⁴

F                identity   |1-2u|   (1/3, 1)   (2/3, 1)   (1/2, 1/2)   (1/2, 2)
Kernel window: Narrow
Normal           5.3%       5.5%     5.4%       5.4%       5.3%         5.2%
Scaled t_5       74.1%      92.1%    91.7%      91.7%      81.8%        81.8%
FS-t(5,51/50)    77.2%      92.2%    92.0%      91.7%      83.7%        80.0%
FS-t(5,26/25)    79.7%      92.2%    92.2%      91.6%      85.3%        78.1%
FS-t(5,6/5)      92.1%      93.1%    94.1%      91.4%      93.5%        63.9%
FS-t(5,4/3)      96.1%      94.3%    95.5%      92.4%      96.5%        58.0%
Kernel window: Wide
Normal           5.0%       5.1%     5.1%       5.2%       4.9%         5.1%
Scaled t_5       75.4%      92.2%    91.8%      91.9%      81.4%        81.3%
FS-t(5,51/50)    78.2%      92.3%    92.0%      91.8%      83.2%        79.7%
FS-t(5,26/25)    80.6%      92.3%    92.3%      91.7%      84.9%        77.8%
FS-t(5,6/5)      92.1%      93.4%    94.3%      91.9%      93.5%        64.7%
FS-t(5,4/3)      96.1%      94.9%    95.9%      93.4%      96.5%        61.6%

Table 6: Size and power of v-transformed tests under skewness and excess kurtosis. Column headers of form $(\delta,\kappa)$ refer to the v-transform with fulcrum $\delta$ and generator $\Psi(u) = u^\kappa$. The identity pre-processor is equivalent to $(\delta = 0, \kappa = 1)$ and $|1-2u|$ equivalent to $(\delta = 1/2, \kappa = 1)$. The Scaled $t_5$ alternative is equivalent to FS-t(5,1). All tests utilize the beta bikernel with parameters ((1,0),(1,2)). 2^16 trials with 500 observations per trial.

⁴ The line for $\delta = 1/2$, not shown in the plot, lies between those of $\delta = 1/3$ and $\delta = 2/3$, and is difficult to distinguish from the former.
[Figure 6: Tails of distribution functions for v-transformed PIT-values. Horizontal axis: $u \in [0.95, 1]$; vertical axis: $\Pr(T(P_t) \le u)$; series: identity, $\delta = 1/3$, $\delta = 2/3$; panel label $\zeta = 5$, $\gamma = 26/25$. Dfs for v-transformed PIT-values when the forecaster assumes standard normal losses ($\widehat F = \Phi$) but the true loss model $F$ is FS-t(5, 26/25) and the validator applies a linear pre-processor.]
5 Conclusion

We conclude with some practical guidance on the choice of kernel and pre-processor. As in Gordy and McNeil (2020), we emphasize that such choices express implicitly the preferences of the validator (i.e., the locus of departure from PIT uniformity that the validator would deem most worrisome), so it is not our place to be prescriptive. To the extent that the validator may be uncertain or agnostic over her preferences, other considerations could come into play. In particular, if the validator has some prior familiarity with the forecaster's methodology, the validator might wish to craft the test to highlight suspected flaws. Power and size are natural considerations as well, and the validator might also prefer that the test statistic have good numerical properties, e.g., simple to program, quick to calculate, and robust to small changes in parameters and data values.

The unbounded beta and TLSF families are similar in the power that can be achieved against alternatives featuring excess kurtosis, and substantially more powerful than tests based on bounded kernels. Our evidence indicates that all cases of these families yield well-sized tests. Where they differ is in their computational and aesthetic properties. Tests based on the beta kernel are easy to program and fast but depend in general on the availability of routines for the hypergeometric functions. Under special cases, in particular when parameters take integer and half-integer values, the hypergeometric functions can be sidestepped, resulting in simple expressions. With regard to the unbounded beta kernel, the case of $b = 0$ is easily programmed and numerically stable, but for non-zero values of $b$ near zero, numerical instability can arise.

The TLSF kernels arise naturally as score tests, so can be seen as moment-based analogues to well-known LR tests such as that of Berkowitz (2001).
Relative to the beta family, tests based on the TLSF families are more intricate to program and somewhat slower to execute, though still much faster than their LR-test analogs and generally more powerful on samples of typical length. These tests appear to be numerically stable. The TLSF family of tests are valid on the upper tail of the PIT distribution but not over the entire unit interval. In some
applications, the restriction on the choice of kernel window could be a limitation.

When the validator suspects unmodeled kurtosis in the forecast model, as might arise when a model adapts inadequately to changes in market volatility, pre-processing the PIT data with a folding transformation is highly effective in highlighting an excess of tail observations. The simple symmetric linear v-transform $T(v) = |1-2v|$ performs well. When the validator additionally suspects unmodeled skewness, an asymmetric v-transform may offer somewhat greater power than the symmetric linear v-transform but at some cost to robustness.

All of the pre-processors considered in our analysis can be implemented in a few lines of trivial code. Furthermore, because the test statistic is a sample mean of a simple composition of the pre-processor and the kernel df, the algorithm can be implemented in modular fashion. That is, starting with a sample of PIT values $\{P_i\}$, we first apply the pre-processor as $\tilde P_i = T(P_i)$, and then feed the sample $\{\tilde P_i\}$ through the spectral backtest. There is no alteration to the code for the spectral backtest.

A Proofs

A.1 Proof of Theorem 2.1

For notational simplicity assume $T$ is the identity transformation. Since $G_1$ and $G_2$ are increasing, right-continuous functions, it follows that the function $G^*(u) = G_1(u)G_2(u)$ must also be increasing and right-continuous and thus it can be used to define a Lebesgue-Stieltjes measure $\nu^*$ by setting $\nu^*(\{0\}) = G^*(0) = 0$ and $\nu^*((a,b]) = G^*(b) - G^*(a)$ for any $0 \le a < b \le 1$. It follows that $W_t^* = G^*(P_t) = \nu^*([0,P_t])$.

The formula for $G^*$ is obtained by applying the integration-by-parts formula for the Lebesgue-Stieltjes integral (Hewitt, 1960, Theorem A); see also Revuz and Yor (2004, Ch. 0).
A.2 Proof of Proposition 2.2

Since $G_\nu(u) = O\big((1-u)^{-1/2+\epsilon}\big)$ as $u \to 1$ for some small $\epsilon$, there exist a value $u_0$ and a positive constant $C$ such that $G_\nu(u) \le C(1-u)^{-1/2+\epsilon}$ for $u \ge u_0$. Let $\bar u$ be the larger of $u_0$ and the last point at which $G_\nu$ is not differentiable (there are only finitely many such points by Assumption 1). We can decompose (8) as
\[
E(W_t^2) = \int_{[0,\bar u]} (1-u)\big(2G_\nu(u) - \nu(\{u\})\big)\,d\nu(u) + \int_{(\bar u,1]} (1-u)\big(2G_\nu(u) - \nu(\{u\})\big)\,d\nu(u). \tag{A.1}
\]
The integrand in the first term is bounded above by $2G_\nu(\bar u)$ and so the integral is finite. We only need to prove the finiteness of the second term, which can be written as
\[
\int_{\bar u}^1 (1-u)\,2G_\nu(u)\,g_\nu(u)\,du = \int_{\bar u}^1 (1-u)\frac{d}{du}\big(G_\nu(u)^2\big)\,du = \big[G_\nu(u)^2(1-u)\big]_{\bar u}^1 + \int_{\bar u}^1 G_\nu(u)^2\,du
\]
using integration by parts.

Since $0 \le G_\nu(u)^2(1-u) \le C^2(1-u)^{2\epsilon}$ for $u \ge \bar u$ and $(1-u)^{2\epsilon} \to 0$ as $u \to 1$, it follows that $\big[G_\nu(u)^2(1-u)\big]_{\bar u}^1 = -G_\nu(\bar u)^2(1-\bar u)$. Moreover, the second term is finite because
\[
\int_{\bar u}^1 G_\nu(u)^2\,du \le C^2\int_{\bar u}^1 (1-u)^{-1+2\epsilon}\,du = \frac{C^2}{2\epsilon}(1-\bar u)^{2\epsilon}.
\]

A.3 Proof of Lemma 2.3

Assume that $\nu$ is a bounded measure with moments $\mu$ and $\sigma$. Without loss of generality $\nu$ may be taken to be a probability measure with df $G_\nu$ satisfying $G_\nu(1) = 1$. In that case let $\tilde\nu$ be the probability measure defined by the df $G_{\tilde\nu}(u) = 1 - G_\nu(1-u)$. Observe that if $U$ has df $G_\nu$ then $1-U$ has df $G_{\tilde\nu}$ and hence moment formulas for linear functions of random variables give $\tilde\mu = 1-\mu$ and $\tilde\sigma = \sigma$. We obtain the identity
\[
\frac{|\widetilde W_n - \tilde\mu|}{\tilde\sigma} = \frac{|(1-\overline W_n) - (1-\mu)|}{\sigma} = \frac{|\overline W_n - \mu|}{\sigma},
\]
showing redundancy of $T$.

Now assume the measure $\nu$ is unbounded with finite moments $\mu$ and $\sigma$ and consider PIT samples consisting of the single point $\{v\}$. If the transform $T(v) = 1-v$ is redundant we must have
\[
\frac{|G_{\tilde\nu}(v) - \tilde\mu|}{\tilde\sigma} = \frac{|G_\nu(1-v) - \mu|}{\sigma}
\]
for all $v \in [0,1]$, a measure $\tilde\nu$ and finite values $\tilde\mu$ and $\tilde\sigma$. But the rhs tends to infinity as $v \to 0$ while the lhs tends to the finite limit $\tilde\mu/\tilde\sigma$, which yields a contradiction.

A.4 Proof of Proposition 2.5

Suppose that the folding u.d.p. transformation $T$ is redundant. For any value $u \in I_T$, let $\underline{v}_u$ and $\bar v_u$ be the smallest and largest elements of $T^{-1}[\{u\}]$. Consideration of the PIT samples $\{\underline{v}_u\}$ and $\{\bar v_u\}$ implies that the following identities must hold:
\[
\frac{|G_{\tilde\nu}(\underline{v}_u) - \tilde\mu|}{\tilde\sigma} = \frac{|G_\nu(u) - \mu|}{\sigma} = \frac{|G_{\tilde\nu}(\bar v_u) - \tilde\mu|}{\tilde\sigma}. \tag{A.2}
\]
Monotonicity of $G_{\tilde\nu}$ implies that $G_{\tilde\nu}(\underline{v}_u) \le G_{\tilde\nu}(\bar v_u)$. Now consider two cases.

Case (a): For some $u$ with $G_\nu(u) \ne \mu$, $G_{\tilde\nu}(\underline{v}_u) < G_{\tilde\nu}(\bar v_u)$. Equation (A.2) can hold only if $\tilde\mu - G_{\tilde\nu}(\underline{v}_u) = G_{\tilde\nu}(\bar v_u) - \tilde\mu$, implying $\tilde\mu = \big(G_{\tilde\nu}(\underline{v}_u) + G_{\tilde\nu}(\bar v_u)\big)/2$. Now consider the PIT sample of length two, $\{\underline{v}_u, \bar v_u\}$. We can easily verify that $\widetilde W_2 = \tilde\mu$ while $\overline W_2 = G_\nu(u) \ne \mu$. Thus, the $Z_n$ statistic must be zero for $\tilde\nu$ but non-zero in magnitude for the u.d.p.-transformed $\nu$, which contradicts the supposition that $T$ is redundant for $\nu$.

Case (b): For all $u$ with $G_\nu(u) \ne \mu$, $G_{\tilde\nu}(\underline{v}_u) = G_{\tilde\nu}(\bar v_u)$. Observe first that (A.2) implies that $G_{\tilde\nu}(\underline{v}_u) = G_{\tilde\nu}(\bar v_u) = \tilde\mu$ whenever $G_\nu(u) = \mu$, so in case (b) we must have that $G_{\tilde\nu}(\underline{v}_u) = G_{\tilde\nu}(\bar v_u)$ for all $u \in I_T$. Because $G_{\tilde\nu}(v)$ is nondecreasing for all $v$, this implies that each value of $u \in I_T$ maps to an interval $[\underline{v}_u, \bar v_u] \subseteq I$ over which $G_{\tilde\nu}(v)$ is constant. These intervals do not
overlap, so there can be only countably many. However, because $G_\nu(u)$ is strictly increasing over some interval within $I$, we need uncountably many values in $G_{\tilde\nu}[I]$ to satisfy (A.2) at each $u \in I_T$, thus leading to a contradiction.

A.5 Proof of Proposition 3.2

We maintain the assumption that $a > 0$. Wolfram Research (2023, 06.19.06.0049.01) provides these expansions as $u \to 1$:
\[
B(u,a,b) \propto
\begin{cases}
-\ln(1-u) - \psi(a) - \gamma & b = 0,\\[1ex]
\dfrac{(-1)^{b-1}\,\Gamma(a)\ln(1-u)}{(-b)!\,\Gamma(a+b)} - \dfrac{(1-u)^b}{b} & -b \in \mathbb{N},\\[2ex]
B(a,b) - \dfrac{(1-u)^b}{b} & \text{otherwise},
\end{cases} \tag{A.3}
\]
where $\psi$ is the digamma function and $\gamma$ is the Euler-Mascheroni constant. In the middle case, note that $(1-u)^b$ dominates $\ln(1-u)$ as $u \to 1$ and further that in the edge case of $-(a+b) \in \mathbb{N}$, $\Gamma(a+b)$ is infinite so the first term simply drops out. In the final case, note that $B(a,b)$ is finite even when $b < 0$. The proposition follows directly.

A.6 Proof of Theorem 3.3

The likelihood function is given by
\[
L_{P^*}(\theta \mid \boldsymbol{P}) = \prod_{t:\,P_t < \alpha_1} F_P(\alpha_1 \mid \theta) \prod_{t:\,\alpha_1 \le P_t < \alpha_2} f_P(P_t \mid \theta) \prod_{t:\,P_t \ge \alpha_2} \bar F_P(\alpha_2 \mid \theta) \tag{A.4}
\]
where $\bar F_P(u)$ denotes the tail probability $1 - F_P(u)$. The likelihood contributions $L_{P^*}(\theta \mid P_t)$ are given by the individual terms in (A.4) according to whether $P_t < \alpha_1$, $\alpha_1 \le P_t < \alpha_2$ or $P_t \ge \alpha_2$. With the help of the calculations in Appendix D, and using the functions $C_1(x) = \rho(x)/R(x)$ and $\bar C_1(x) = -\rho(x)/\bar R(x)$ defined there, we can compute the score vector
and evaluate it at $\theta_0 = (0,1)'$ to obtain
\[
S_t(\theta_0) =
\begin{cases}
\psi_1(R^{-1}(\alpha_1)) & P_t < \alpha_1,\\
\psi_*(R^{-1}(P_t)) & \alpha_1 \le P_t < \alpha_2,\\
\psi_2(R^{-1}(\alpha_2)) & P_t \ge \alpha_2,
\end{cases} \tag{A.5}
\]
where
\[
\psi_1(x) = \begin{pmatrix} -C_1(x)\\ -xC_1(x) \end{pmatrix}, \qquad
\psi_*(x) = \begin{pmatrix} \lambda_\rho(x)\\ x\lambda_\rho(x) - 1 \end{pmatrix} \qquad\text{and}\qquad
\psi_2(x) = \begin{pmatrix} -\bar C_1(x)\\ -x\bar C_1(x) \end{pmatrix}.
\]
The third case in (A.5), described by the function $\psi_2$, only comes into play when $\alpha_2 < 1$.

If $\rho$ is log-concave this is equivalent to saying that $\lambda_\rho(x) = -\rho'(x)/\rho(x)$ is an increasing function. We first prove that the equation (14) has a unique root $\underline{x}$ and that $\underline{x} > 0$. Let $\Lambda(x) = x\big(C_1(x) + \lambda_\rho(x)\big)$ and note that $\Lambda$ is a continuous function satisfying $\lim_{x\to 0}\Lambda(x) = 0$ and $\lim_{x\to\infty}\Lambda(x) = \infty$. Hence there exists at least one $x$ satisfying $\Lambda(x) = 1$ and $x > 0$. The derivative $\Lambda'(x)$ can be split into two parts yielding
\[
\begin{aligned}
\Lambda'(x) &= \frac{d}{dx}\big(xC_1(x)\big) + \frac{d}{dx}\big(x\lambda_\rho(x)\big)\\
&= C_1(x) - x\lambda_\rho(x)C_1(x) - xC_1(x)^2 + \frac{d}{dx}\big(x\lambda_\rho(x)\big)\\
&= C_1(x)\big(1 - \Lambda(x)\big) + \frac{d}{dx}\big(x\lambda_\rho(x)\big).
\end{aligned}
\]
At any $x$ satisfying $\Lambda(x) = 1$ the first term must be zero and the second term must be strictly positive, since it is the derivative of a strictly increasing function. Since the gradient of $\Lambda$ is positive at any root of the equation (14) we conclude that the latter has a unique root $\underline{x}$ and that $\underline{x} > 0$.

We now turn to the representation of the score test as a bispectral test. Since $\lambda_\rho$ is an increasing function it follows that both components of $\psi_*(x)$ are also increasing functions
and thus non-negative weighting functions $g_i$ can be obtained by differentiating $\psi_*(R^{-1}(u))$ with respect to $u$ on $[\alpha_1,\alpha_2]$.

The discontinuities at $\alpha_1$ and $\alpha_2$ are given by
\[
(\gamma_{1,1},\gamma_{2,1})' = \psi_*(R^{-1}(\alpha_1)) - \psi_1(R^{-1}(\alpha_1)), \qquad
(\gamma_{1,2},\gamma_{2,2})' = \psi_2(R^{-1}(\alpha_2)) - \psi_*(R^{-1}(\alpha_2)),
\]
where the $\gamma_{i,2}$ constants need only be considered when $\alpha_2 < 1$. Non-negativity of the $\gamma_{i,j}$ requires that the inequalities
\[
\lambda_\rho(x) + C_1(x) \ge 0 \tag{A.6}
\]
\[
x\big(\lambda_\rho(x) + C_1(x)\big) - 1 \ge 0 \tag{A.7}
\]
hold for $x = R^{-1}(\alpha_1)$ and the inequalities
\[
-\big(\lambda_\rho(x) + \bar C_1(x)\big) \ge 0 \tag{A.8}
\]
\[
-x\big(\lambda_\rho(x) + \bar C_1(x)\big) + 1 \ge 0 \tag{A.9}
\]
hold for $x = R^{-1}(\alpha_2)$ if $\alpha_2 < 1$.

Since $\rho'(x)/\rho(x)$ is a decreasing function we can infer that
\[
\frac{\rho'(x)}{\rho(x)}R(x) = \frac{\rho'(x)}{\rho(x)}\int_{-\infty}^x \rho(t)\,dt \le \int_{-\infty}^x \frac{\rho'(t)}{\rho(t)}\rho(t)\,dt = \rho(x),
\]
implying that (A.6) holds for all $x \in \mathbb{R}$, and
\[
\frac{\rho'(x)}{\rho(x)}\bar R(x) = \frac{\rho'(x)}{\rho(x)}\int_x^\infty \rho(t)\,dt \ge \int_x^\infty \frac{\rho'(t)}{\rho(t)}\rho(t)\,dt = -\rho(x),
\]
implying that (A.8) holds for all $x \in \mathbb{R}$. Since we have assumed that $R^{-1}(\alpha_1) \ge R^{-1}(\underline{\alpha}) = \underline{x}$, it follows that $\Lambda(R^{-1}(\alpha_1)) \ge 1$ and hence that (A.7) holds for $x = R^{-1}(\alpha_1)$. Moreover, since (A.8) holds at $x = R^{-1}(\alpha_2)$ and $R^{-1}(\alpha_2) > R^{-1}(\alpha_1) \ge R^{-1}(\underline{\alpha}) > 0$, then (A.9) clearly
also holds for $x = R^{-1}(\alpha_2)$.

Finally, to determine $\mu_W = W_t - S_t(\theta_0)$, we note that, if $P_t < \alpha_1$, then (15) implies that $W_{t,i} = 0$ for $i = 1,2$ while (A.5) implies that $S_t(\theta_0) = \psi_1(R^{-1}(\alpha_1))$. It follows that we must have $\mu_W = -\psi_1(R^{-1}(\alpha_1))$.

A.7 Proof of Proposition 3.11

Recall that the asymptotic behavior of $G_2(p)$ as $p \to 1$ depends on that of $R^{-1}(p)\lambda_\rho(R^{-1}(p))$. The df is a composition of the beta df $I(z; a, b)$ and the logistic function, i.e., $R(x) = I(S(x); a, b)$. The inverse df is therefore $R^{-1}(p) = \operatorname{logit}(I^{-1}(p; a, b))$ where $\operatorname{logit}(u) = \ln(u/(1-u))$. The well-known symmetry for the beta distribution implies a symmetry of the same form for the inverse df:

$$
1 - I^{-1}(p; a, b) = I^{-1}(1-p; b, a).
$$

Consequently, we can write

$$
R^{-1}(p) = \ln\bigl(I^{-1}(p; a, b)\bigr) - \ln\bigl(I^{-1}(1-p; b, a)\bigr). \tag{A.10}
$$

From $\lambda_\rho(x) = bS(x) - aS(-x)$, it follows immediately that $\lim_{p \to 1} \lambda_\rho(R^{-1}(p)) = b$. Combining with (A.10), we can infer

$$
R^{-1}(p)\lambda_\rho(R^{-1}(p)) \sim -b \ln\bigl(I^{-1}(1-p; b, a)\bigr) \quad \text{as } p \to 1.
$$

From (C.3), we have

$$
\lim_{u \to 0} a B(a,b)\, u^{-a} I(u; a, b) = 1
$$

from which we may infer that

$$
\ln\bigl(I^{-1}(1-p; b, a)\bigr) \sim \frac{1}{b}\bigl(\ln(1-p) + \ln(b B(a,b))\bigr) \quad \text{as } p \to 1.
$$

The constant additive term is negligible asymptotically so

$$
R^{-1}(p)\lambda_\rho(R^{-1}(p)) \sim -\ln(1-p), \quad p \to 1,
$$
regardless of $(a,b)$.

B TLSF tests when the density is non-log-concave

The issues that arise for a non-log-concave density can be illustrated by considering the Student t distribution with $\nu$ degrees of freedom and working through the steps of the proof in Section A.6. In this case $\lambda_\rho(x) = (\nu+1)x/(\nu + x^2)$ is only an increasing function between two turning points at $x = \pm\sqrt{\nu}$ and so there are immediate constraints on the interval in which the densities (16) are positive; in particular, we cannot construct a bispectral test with $\alpha_2 = 1$. Moreover, we have to check all of the conditions (A.6) to (A.9) individually to find an interval $[\alpha_1, \alpha_2]$ on which the test may be applied. For example, when $\nu = 4$, we need to set $\alpha_1$ no smaller than approximately 0.773 and $\alpha_2$ no larger than approximately 0.887 to obtain a proper bispectral test. The width of the interval increases for larger degrees of freedom but the extra constraints relative to log-concave densities render the t distribution less suitable for constructing bispectral tests.

C Moments for the beta kernel

We seek solutions to the moments and cross-moments of the transformed PIT values when the kernel densities take the form

$$
g_\nu(u) = (\alpha_2 - \alpha_1)^{1-a-b} (u - \alpha_1)^{a-1} (\alpha_2 - u)^{b-1} \mathbf{1}_{\{\alpha_1 \leq u \leq \alpha_2\}}
$$

for parameters $(a > 0, b > -1/2)$ and $0 \leq \alpha_1 < \alpha_2 \leq 1$.

Coding is facilitated by computing moments in terms of the moments of standardized beta kernels with $\alpha_1 = 0$, $\alpha_2 = 1$. Let $\tilde{\nu}$ denote a beta$(a,b)$ kernel with kernel density $g_{\tilde{\nu}}(u) = u^{a-1}(1-u)^{b-1}$ on $[0,1]$ and let $W = G_\nu(U)$ and $\tilde{W} = G_{\tilde{\nu}}(U)$ for $U \sim \text{Uniform}(0,1)$.
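The reduction to standardized kernels rests on the change of variable $v = (u - \alpha_1)/(\alpha_2 - \alpha_1)$, under which $G_\nu(u) = G_{\tilde{\nu}}(v)$ for $u \in [\alpha_1, \alpha_2]$. The following Python sketch checks this identity numerically. It is only an illustration: the midpoint rule, the function names and the parameter values are our own assumptions, chosen with $a, b \geq 1$ so that the integrand stays bounded.

```python
def midpoint_integral(f, lo, hi, n=20000):
    """Approximate the integral of f over [lo, hi] with the midpoint rule."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


def g_general(u, a, b, alpha1, alpha2):
    """Beta kernel density on [alpha1, alpha2] in the form given in Appendix C."""
    if not (alpha1 <= u <= alpha2):
        return 0.0
    width = alpha2 - alpha1
    return width ** (1 - a - b) * (u - alpha1) ** (a - 1) * (alpha2 - u) ** (b - 1)


def g_standard(v, a, b):
    """Standardized beta(a, b) kernel density on [0, 1]."""
    return v ** (a - 1) * (1 - v) ** (b - 1)


def G_general(u, a, b, alpha1, alpha2):
    """Kernel function G_nu(u): integral of g_general from alpha1 to u."""
    return midpoint_integral(lambda s: g_general(s, a, b, alpha1, alpha2), alpha1, u)


def G_standard(v, a, b):
    """Standardized kernel function: integral of g_standard from 0 to v."""
    return midpoint_integral(lambda s: g_standard(s, a, b), 0.0, v)


# Hypothetical illustrative values (not taken from the paper).
a, b, alpha1, alpha2 = 2.0, 3.0, 0.95, 0.995
u = 0.98
v = (u - alpha1) / (alpha2 - alpha1)  # rescaled argument in [0, 1]

lhs = G_general(u, a, b, alpha1, alpha2)
rhs = G_standard(v, a, b)
print(lhs, rhs)
```

In practice one would evaluate the standardized kernel function through an incomplete beta routine rather than by quadrature; the midpoint rule is used here only to keep the sketch free of external dependencies.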
The uncentered moments of $W$ can be obtained as:

$$
E(W^k) = \begin{cases}
(\alpha_2 - \alpha_1) E(\tilde{W}^k) + (1 - \alpha_2) G_{\tilde{\nu}}(1)^k & \text{if } \alpha_2 < 1, \\
(1 - \alpha_1) E(\tilde{W}^k) & \text{if } \alpha_2 = 1.
\end{cases} \tag{C.1}
$$

Proposition 3.2 guarantees that $\lim_{y \to 1} (1-y) G_{\tilde{\nu}}(y)^k = 0$ for $k = 1,2$, so the expression in the $\alpha_2 = 1$ case is simply the limit of the expression in the $\alpha_2 < 1$ case.

The first moment of $\tilde{W}$ is

$$
E(\tilde{W}) = \int_0^1 (1-u)\, g_{\tilde{\nu}}(u)\,du = B(a, 1+b). \tag{C.2}
$$

Since $b > -1/2$, this expression presents no difficulties for our application.

By Wolfram Research (2023, 06.19.26.0005.01, 06.19.26.0006.01), the kernel function $G_{\tilde{\nu}}(u) = B(u; a, b)$ can be expressed in terms of the Gauss hypergeometric function, ${}_2F_1$, in two equivalent forms:

$$
B(u; a, b) = \frac{u^a}{a}\, {}_2F_1(a, 1-b; a+1; u) \tag{C.3}
$$

$$
\phantom{B(u; a, b)} = \frac{u^a (1-u)^b}{a}\, {}_2F_1(1, a+b; a+1; u). \tag{C.4}
$$

Say we have two beta variables with (possibly) different parameters $(a_i, b_i)$ for $i = 1,2$. Let $\tilde{G}_i$ denote the transform function for kernel $i$. To get cross-moments and second moments, we need the integral

$$
M(a_1, b_1, a_2, b_2) = \int_0^1 (1-u)\, \tilde{g}_1(u)\, \tilde{G}_2(u)\,du \tag{C.5}
$$

$$
= \frac{B(a_1 + a_2, 1 + b_1)}{a_2}\, {}_3F_2(a_2, a_1 + a_2, 1 - b_2; 1 + a_2, 1 + a_1 + a_2 + b_1; 1) \tag{C.6}
$$

$$
= \frac{B(a_1 + a_2, 1 + b_1 + b_2)}{a_2}\, {}_3F_2(1, a_1 + a_2, a_2 + b_2; 1 + a_2, 1 + a_1 + a_2 + b_1 + b_2; 1). \tag{C.7}
$$

The two forms come from application of Gradshteyn and Ryzhik (2007, 7.512.5) to (C.3)
and (C.4), respectively. When calculated from its series expansion, equation (C.6) will be numerically stable whenever $b_2 \leq 0$ whereas (C.7) will be numerically stable whenever $b_2 > 0$.

In our Online Supplement (Appendix 2), we list numerous special cases for which ${}_3F_2(1)$ has a known closed-form solution.

D The score function and information matrix

We impose Assumption 2 in this section and recall that the likelihood function is given by

$$
L_{P^*}(\theta \mid P) = \prod_{t: P_t < \alpha_1} F_P(\alpha_1 \mid \theta) \prod_{t: \alpha_1 \leq P_t < \alpha_2} f_P(P_t \mid \theta) \prod_{t: P_t \geq \alpha_2} \bar{F}_P(\alpha_2 \mid \theta) \tag{D.1}
$$

where $\bar{F}_P(u)$ denotes the tail probability $1 - F_P(u)$.

We begin with the case of lower truncation, i.e., $P_t < \alpha_1$. For notational convenience, define $C_0(x) = \ln R(x)$. First and second derivatives follow as

$$
C_1(x) = \frac{\rho(x)}{R(x)}, \qquad C_2(x) = -\lambda_\rho(x) C_1(x) - C_1(x)^2.
$$

Let $\zeta_\theta(p) = (R^{-1}(p) - \mu)/\sigma$. The first derivatives of the log-likelihood of the TLSF distribution are

$$
\frac{\partial}{\partial\mu} \ln L_{P^*}(\theta \mid P_t < \alpha_1) = -C_1(\zeta_\theta(\alpha_1))/\sigma \tag{D.2}
$$

$$
\frac{\partial}{\partial\sigma} \ln L_{P^*}(\theta \mid P_t < \alpha_1) = -\zeta_\theta(\alpha_1) C_1(\zeta_\theta(\alpha_1))/\sigma \tag{D.3}
$$
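and for the second derivatives we have

$$
\frac{\partial^2}{\partial\mu^2} \ln L_{P^*}(\theta \mid P_t < \alpha_1) = (1/\sigma^2)\, C_2(\zeta_\theta(\alpha_1)) \tag{D.4}
$$

$$
\frac{\partial^2}{\partial\mu\,\partial\sigma} \ln L_{P^*}(\theta \mid P_t < \alpha_1) = (1/\sigma^2) \bigl( C_1(\zeta_\theta(\alpha_1)) + \zeta_\theta(\alpha_1) C_2(\zeta_\theta(\alpha_1)) \bigr) \tag{D.5}
$$

$$
\frac{\partial^2}{\partial\sigma^2} \ln L_{P^*}(\theta \mid P_t < \alpha_1) = (1/\sigma^2) \bigl( 2\zeta_\theta(\alpha_1) C_1(\zeta_\theta(\alpha_1)) + \zeta_\theta(\alpha_1)^2 C_2(\zeta_\theta(\alpha_1)) \bigr). \tag{D.6}
$$

The case of upper truncation is similar. Define $\bar{C}_0(x) = \ln \bar{R}(x)$. First and second derivatives follow as

$$
\bar{C}_1(x) = -\frac{\rho(x)}{\bar{R}(x)}, \qquad \bar{C}_2(x) = -\lambda_\rho(x) \bar{C}_1(x) - \bar{C}_1(x)^2.
$$

The first and second derivatives of the log-likelihood of the TLSF distribution for this case take the same form as in (D.2)–(D.6) except with $C_1$ and $C_2$ replaced by $\bar{C}_1$ and $\bar{C}_2$ and with $\alpha_1$ replaced by $\alpha_2$.

For the intermediate continuous case of $P_t = p \in [\alpha_1, \alpha_2)$, we have

$$
\frac{\partial}{\partial\mu} \ln L_{P^*}(\theta \mid P_t = p) = (1/\sigma)\, \lambda_\rho(\zeta_\theta(p)) \tag{D.7}
$$

$$
\frac{\partial}{\partial\sigma} \ln L_{P^*}(\theta \mid P_t = p) = (1/\sigma) \bigl( \zeta_\theta(p) \lambda_\rho(\zeta_\theta(p)) - 1 \bigr) \tag{D.8}
$$

and for the second derivatives we have

$$
\frac{\partial^2}{\partial\mu^2} \ln L_{P^*}(\theta \mid P_t = p) = (-1/\sigma^2)\, \lambda_\rho'(\zeta_\theta(p)) \tag{D.9}
$$

$$
\frac{\partial^2}{\partial\mu\,\partial\sigma} \ln L_{P^*}(\theta \mid P_t = p) = (-1/\sigma^2) \bigl( \lambda_\rho(\zeta_\theta(p)) + \zeta_\theta(p) \lambda_\rho'(\zeta_\theta(p)) \bigr) \tag{D.10}
$$

$$
\frac{\partial^2}{\partial\sigma^2} \ln L_{P^*}(\theta \mid P_t = p) = (-1/\sigma^2) \bigl( 2\zeta_\theta(p) \lambda_\rho(\zeta_\theta(p)) + \zeta_\theta(p)^2 \lambda_\rho'(\zeta_\theta(p)) - 1 \bigr). \tag{D.11}
$$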
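The intermediate-case derivatives (D.7)–(D.11) can be sanity-checked by finite differences. The sketch below is illustrative only and rests on two assumptions of ours: $\rho$ is taken to be the standard normal density, so that $\lambda_\rho(x) = x$ and $\lambda_\rho'(x) = 1$, and the log-likelihood contribution is written as $\ln\rho(\zeta_\theta(p)) - \ln\sigma$, dropping terms that do not depend on $\theta$. The function names and evaluation point are hypothetical.

```python
import math


def log_rho(z):
    """Log of the standard normal density (our assumed choice of rho)."""
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)


def lam(z):
    """lambda_rho(z) = -rho'(z)/rho(z), which equals z for the standard normal."""
    return z


def lam_prime(z):
    """Derivative of lambda_rho; constant for the standard normal."""
    return 1.0


def loglik(mu, sigma, x):
    """Intermediate-case log-likelihood at x = R^{-1}(p), up to theta-free terms."""
    return log_rho((x - mu) / sigma) - math.log(sigma)


def analytic(mu, sigma, x):
    """Score and second derivatives from (D.7)-(D.11)."""
    z = (x - mu) / sigma
    score = (lam(z) / sigma,                                          # (D.7)
             (z * lam(z) - 1) / sigma)                                # (D.8)
    hess = (-lam_prime(z) / sigma**2,                                 # (D.9)
            -(lam(z) + z * lam_prime(z)) / sigma**2,                  # (D.10)
            -(2 * z * lam(z) + z * z * lam_prime(z) - 1) / sigma**2)  # (D.11)
    return score, hess


def numeric(mu, sigma, x, h=1e-4):
    """Central finite-difference approximations to the same derivatives."""
    f = lambda m, s: loglik(m, s, x)
    d_mu = (f(mu + h, sigma) - f(mu - h, sigma)) / (2 * h)
    d_sig = (f(mu, sigma + h) - f(mu, sigma - h)) / (2 * h)
    d_mumu = (f(mu + h, sigma) - 2 * f(mu, sigma) + f(mu - h, sigma)) / h**2
    d_sigsig = (f(mu, sigma + h) - 2 * f(mu, sigma) + f(mu, sigma - h)) / h**2
    d_musig = (f(mu + h, sigma + h) - f(mu + h, sigma - h)
               - f(mu - h, sigma + h) + f(mu - h, sigma - h)) / (4 * h**2)
    return (d_mu, d_sig), (d_mumu, d_musig, d_sigsig)


# Hypothetical evaluation point: mu = 0.3, sigma = 1.2, x = R^{-1}(p) = 1.5.
score_a, hess_a = analytic(0.3, 1.2, 1.5)
score_n, hess_n = numeric(0.3, 1.2, 1.5)
max_err = max(abs(p - q) for p, q in zip(score_a + hess_a, score_n + hess_n))
print(max_err)
```

Swapping in another log-concave family only requires replacing `log_rho`, `lam` and `lam_prime`; the truncation cases (D.2)–(D.6) could be checked the same way with $\ln R(\zeta_\theta(\alpha_1))$ in place of the density term.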
Recall that the expected Fisher information matrix is defined as

$$
\Upsilon(\theta)_{ij} = -E\left( \frac{\partial^2}{\partial\theta_i\,\partial\theta_j} \ln L_{P^*}(\theta \mid P_t) \right),
$$

implying that we need to integrate across the three cases. For the lower and upper truncation cases, we simply weight the respective expressions by $\alpha_1$ and $1 - \alpha_2$ and evaluate at $\theta = \theta_0$. For the intermediate case, we require integrals of the following forms:

$$
A_{\rho,k}(\alpha) = \int_0^\alpha R^{-1}(p)^k \lambda_\rho\bigl(R^{-1}(p)\bigr)\,dp = \int_{-\infty}^{R^{-1}(\alpha)} x^k \lambda_\rho(x) \rho(x)\,dx, \quad k = 0,1 \tag{D.12}
$$

$$
B_{\rho,k}(\alpha) = \int_0^\alpha R^{-1}(p)^k \lambda_\rho'\bigl(R^{-1}(p)\bigr)\,dp = \int_{-\infty}^{R^{-1}(\alpha)} x^k \lambda_\rho'(x) \rho(x)\,dx, \quad k = 0,1,2. \tag{D.13}
$$

Integrals of form $A_{\rho,k}$ have a general solution for any density $\rho$:

$$
A_{\rho,0}(\alpha) = -\rho\bigl(R^{-1}(\alpha)\bigr) \tag{D.14}
$$

$$
A_{\rho,1}(\alpha) = \alpha - R^{-1}(\alpha)\, \rho\bigl(R^{-1}(\alpha)\bigr) \tag{D.15}
$$

Integrals of form $B_{\rho,k}(\alpha)$ depend on the chosen family of TLSF. When working with complementary pairs of skewed distributions, we can show that

$$
A^c_{\rho,k}(\alpha) = (-1)^{k+1} \bigl( A_{\rho,k}(1) - A_{\rho,k}(1-\alpha) \bigr) \tag{D.16}
$$

$$
B^c_{\rho,k}(\alpha) = (-1)^k \bigl( B_{\rho,k}(1) - B_{\rho,k}(1-\alpha) \bigr). \tag{D.17}
$$
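The closed forms (D.14) and (D.15) follow from $\lambda_\rho(x)\rho(x) = -\rho'(x)$ and, for $k = 1$, one integration by parts. As a rough numerical check, the Python sketch below compares them with a midpoint-rule approximation of the $x$-space integral in (D.12). The choice of the standard normal for $\rho$, the truncation of the lower limit at $-12$, and all function names are our own assumptions for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumed choice of rho, for illustration only


def rho(x):
    """Standard normal density."""
    return STD_NORMAL.pdf(x)


def lam(x):
    """lambda_rho(x) = -rho'(x)/rho(x), which equals x for the standard normal."""
    return x


def A_numeric(alpha, k, lower=-12.0, n=50000):
    """Midpoint-rule approximation to the x-space integral in (D.12)."""
    upper = STD_NORMAL.inv_cdf(alpha)  # R^{-1}(alpha)
    h = (upper - lower) / n
    total = 0.0
    for i in range(n):
        x = lower + (i + 0.5) * h
        total += x**k * lam(x) * rho(x)
    return h * total


def A_closed(alpha, k):
    """Closed forms (D.14) for k = 0 and (D.15) for k = 1."""
    x = STD_NORMAL.inv_cdf(alpha)
    if k == 0:
        return -rho(x)
    if k == 1:
        return alpha - x * rho(x)
    raise ValueError("k must be 0 or 1")


alpha = 0.975  # hypothetical evaluation level
errors = [abs(A_numeric(alpha, k) - A_closed(alpha, k)) for k in (0, 1)]
print(errors)
```

The $B_{\rho,k}$ integrals could be approximated with the same quadrature by replacing `lam` with its derivative, which is how a family-specific closed form might be cross-checked.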
We can now express the elements of the information matrix as:

$$
\Upsilon(\theta_0)_{1,1} = B_{\rho,0}(\alpha_2) - B_{\rho,0}(\alpha_1) - \alpha_1 C_2\bigl(R^{-1}(\alpha_1)\bigr) - (1-\alpha_2) \bar{C}_2\bigl(R^{-1}(\alpha_2)\bigr), \tag{D.18}
$$

$$
\begin{aligned}
\Upsilon(\theta_0)_{1,2} = {} & A_{\rho,0}(\alpha_2) - A_{\rho,0}(\alpha_1) + B_{\rho,1}(\alpha_2) - B_{\rho,1}(\alpha_1) \\
& - \alpha_1 \bigl( C_1\bigl(R^{-1}(\alpha_1)\bigr) + R^{-1}(\alpha_1) C_2\bigl(R^{-1}(\alpha_1)\bigr) \bigr) \\
& - (1-\alpha_2) \bigl( \bar{C}_1\bigl(R^{-1}(\alpha_2)\bigr) + R^{-1}(\alpha_2) \bar{C}_2\bigl(R^{-1}(\alpha_2)\bigr) \bigr),
\end{aligned} \tag{D.19}
$$

$$
\begin{aligned}
\Upsilon(\theta_0)_{2,2} = {} & 2\bigl(A_{\rho,1}(\alpha_2) - A_{\rho,1}(\alpha_1)\bigr) + B_{\rho,2}(\alpha_2) - B_{\rho,2}(\alpha_1) - (\alpha_2 - \alpha_1) \\
& - \alpha_1 \bigl( 2R^{-1}(\alpha_1) C_1\bigl(R^{-1}(\alpha_1)\bigr) + R^{-1}(\alpha_1)^2 C_2\bigl(R^{-1}(\alpha_1)\bigr) \bigr) \\
& - (1-\alpha_2) \bigl( 2R^{-1}(\alpha_2) \bar{C}_1\bigl(R^{-1}(\alpha_2)\bigr) + R^{-1}(\alpha_2)^2 \bar{C}_2\bigl(R^{-1}(\alpha_2)\bigr) \bigr).
\end{aligned} \tag{D.20}
$$

References

Abramowitz, M. and I. A. Stegun, eds, Handbook of Mathematical Functions, New York: Dover Publications, 1965.

Acerbi, C. and B. Székely, “Back-testing Expected Shortfall,” Risk, December 2014, 26 (12).

Acerbi, C. and B. Székely, “Backtestability and the ridge backtest,” Frontiers of Mathematical Finance, 2023, 2 (4), 497–521.

Barendse, S., E. Kole, and D. van Dijk, “Backtesting Value-at-Risk and Expected Shortfall in the Presence of Estimation Error,” Journal of Financial Econometrics, 2023, 21 (2), 528–568.

Basel Committee on Banking Supervision, “Minimum capital requirements for market risk,” Publication No. 457, Bank for International Settlements, January (rev. February) 2019.
Bayer, S. and T. Dimitriadis, “Regression-Based Expected Shortfall Backtesting,” Journal of Financial Econometrics, September 2022, 20 (3), 437–471.

Berkowitz, J., “Testing the accuracy of density forecasts, applications to risk management,” Journal of Business & Economic Statistics, 2001, 19 (4), 465–474.

Berkowitz, J., P. Christoffersen, and D. Pelletier, “Evaluating value-at-risk models with desk-level data,” Management Science, 2011, 57 (12), 2213–2227.

Du, Z. and J.C. Escanciano, “Backtesting expected shortfall: accounting for tail risk,” Management Science, 2017, 63 (4), 940–958.

Du, Z., P. Pei, X. Wang, and T. Yang, “Powerful backtests for historical simulation expected shortfall models,” Journal of Business and Economic Statistics, 2023, 42 (3), 864–874.

Engle, R.F., “Stock volatility and the crash of ’87: Discussion,” The Review of Financial Studies, 1990, 3 (1), 103–106.

Fernández, C. and M.F.J. Steel, “On Bayesian modeling of fat tails and skewness,” Journal of the American Statistical Association, 1998, 93 (441), 359–371.

Fissler, T., J.F. Ziegel, and T. Gneiting, “Expected shortfall is jointly elicitable with value-at-risk: implications for backtesting,” Risk, January 2016, 28 (1), 58–61.

Glosten, L. R., R. Jagannathan, and D. E. Runkle, “On the relation between the expected value and the volatility of the nominal excess return on stocks,” The Journal of Finance, 1993, 48 (5), 1779–1801.

Gneiting, T., “Making and Evaluating Point Forecasts,” Journal of the American Statistical Association, 2011, 106 (494), 746–762.

Gneiting, T., F. Balabdaoui, and A.E. Raftery, “Probabilistic forecasts, calibration and sharpness,” Journal of the Royal Statistical Society, Series B, 2007, 69 (2), 243–268.
González-Santander, Juan Luis, “A Note on Some Reduction Formulas for the Incomplete Beta Function and the Lerch Transcendent,” Mathematics, 2021, 9 (13), 1486.

Gordy, Michael B. and Alexander J. McNeil, “Spectral backtests of forecast distributions with application to risk management,” Journal of Banking and Finance, 2020, 116, 105817.

Gradshteyn, I.S. and I.M. Ryzhik, Table of Integrals, Series, and Products, seventh ed., New York: Academic Press, 2007.

Hewitt, E., “Integration by Parts for Stieltjes Integrals,” The American Mathematical Monthly, 1960, 67 (5), 419–423.

Hoga, Y. and M. Demetrescu, “Monitoring value-at-risk and expected shortfall forecasts,” Management Science, 2023, 69 (5), 2954–2971.

Hué, S., C. Hurlin, and Y. Lu, “Backtesting expected shortfall: accounting for both duration and severity with bivariate orthogonal polynomials,” 2024. Available at https://dx.doi.org/10.2139/ssrn.4816132.

Iercosan, Diana, Alysa Shcherbakova, David McArthur, and Rebecca Alper, “Beyond Exceedance-Based Backtesting of Value-at-Risk Models: Methods for Backtesting the Entire Forecasting Distribution Using Probability Integral Transform,” in David Lynch, Iftekhar Hasan, and Akhtar Siddique, eds., Validation of Risk Management Models for Financial Institutions: Theory and Practice, Cambridge University Press, 2023, chapter 4, pp. 57–83.

Lok, Hsiao Yen, “Validating market risk models using realized PIT values.” PhD dissertation, Heriot-Watt University, Edinburgh, UK 2017.

Lynch, David, Valerio Potì, Akhtar Siddique, and Francesco Campobasso, “Evaluation of Value-at-Risk Models: An Empirical Likelihood Approach,” in David Lynch, Iftekhar Hasan, and Akhtar Siddique, eds., Validation of Risk Management Models for Financial Institutions: Theory and Practice, Cambridge University Press, 2023, chapter 5, pp. 84–103.

McNeil, A.J., “Modelling volatility with v-transforms and copulas,” Risks, 2021, 9 (1), 14.

Nelson, D. B., “Conditional Heteroskedasticity in Asset Returns: A New Approach,” Econometrica, 1991, 59, 347–370.

O’Brien, J. and P.J. Szerszen, “An evaluation of bank measures for market risk before, during and after the financial crisis,” Journal of Banking and Finance, July 2017, 80, 215–234.

Patton, A.J., J.F. Ziegel, and R. Chen, “Dynamic semiparametric models for expected shortfall (and Value-at-Risk),” Journal of Econometrics, 2019, 211 (2), 388–413.

Revuz, D. and M. Yor, Continuous martingales and Brownian motion, Springer-Verlag, Berlin, 2004.

Rosenblatt, M., “Remarks on a multivariate transformation,” Annals of Mathematical Statistics, 1952, 23, 470–472.

Š. Porubský, T. Šalát, and O. Strauch, “Transformations that preserve uniform distribution,” Acta Arithmetica, 1988, XLIX, 459–479.

Wolfram Research, “Mathematical Functions Site,” https://functions.wolfram.com/, as of 2023-07-23, 2023.