feds · December 31, 2004

Density Selection and Combination Under Model Ambiguity: An Application to Stock Returns

Abstract

This paper proposes a method for predicting the probability density of a variable of interest in the presence of model ambiguity. In the first step, each candidate parametric model is estimated minimizing the Kullback-Leibler 'distance' (KLD) from a reference nonparametric density estimate. Given that the KLD represents a measure of uncertainty about the true structure, in the second step, its information content is used to rank and combine the estimated models. The paper shows that the KLD between the nonparametric and the parametric density estimates is asymptotically normally distributed. This result leads to determining the weights in the model combination, using the distribution function of a Normal centered on the average performance of all plausible models. Consequently, the final weight is determined by the ability of a given model to perform better than the average. As such, this combination technique does not require the true structure to belong to the set of competing models and is computationally simple. I apply the proposed method to estimate the density function of daily stock returns under different phases of the business cycle. The results indicate that the double Gamma distribution is superior to the Gaussian distribution in modeling stock returns, and that the combination outperforms each individual candidate model both in- and out-of-sample.

Finance and Economics Discussion Series
Divisions of Research & Statistics and Monetary Affairs
Federal Reserve Board, Washington, D.C.

Stefania D’Amico
2005-09

NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. References in publications to the Finance and Economics Discussion Series (other than acknowledgement) should be cleared with the author(s) to protect the tentative character of these papers.

Stefania D’Amico ∗
First Draft: May 2003. This Draft: January 2005.

Keywords: Density forecast comparison, Kernel density estimation, Entropy, Model Combination.

∗ I would like to thank Phoebus Dhrymes for his guidance and many interesting discussions. I also wish to thank Jean Boivin, Xiaohong Chen, Mitali Das, Rajeev Dehejia, Stefano Eusepi, Mira Farka, Marc Henry, Alexei Onatski, Athanasios Orphanides and Jonathan Wright for valuable comments. I am solely responsible for the remaining errors, and the opinions expressed in this paper do not necessarily reflect those of the Federal Reserve Board or the Federal Reserve System. Address: Division of Monetary Affairs, Board of Governors of the Federal Reserve System. E-mail: stefania.d’amico@frb.gov

1 Introduction

“Prediction may be regarded as a special type of decision making under uncertainty: the acts available to the predictor are the possible predictions, and the possible outcomes are success (for a correct prediction) and failure (for a wrong one). In a more general model, one may also rank predictions on a continuous scale, measuring the proximity of the prediction to the eventuality that actually transpires, allow set-valued predictions, probabilistic predictions, and so forth.”1

1 Gilboa, I. and D. Schmeidler, “A Theory of Case-Based Decisions,” 2001, pp. 59-60.

This paper proposes a method to quantify the plausibility of alternative probabilistic models and to combine them in a unique weighted predictive distribution, where the weights are a function of the uncertainty about the correct model. The following three basic observations motivate this analysis. First, even though econometric models are implemented in order to deal with uncertainty and guide decisions, very often they are developed without any reference to the “uncertainty about the model.” Second, even when model uncertainty is acknowledged and a set of finely parameterized models is considered, a typical implicit assumption is that this set contains the true model. Third, although the approximating nature of a simple model is recognized, the information contained in the approximation error is rarely exploited.

In contrast, in this study, I investigate the problem of density prediction allowing for model ambiguity. Instead of specifying a unique statistical structure and treating it as the true model,
I consider a finite set of competing models not necessarily including the correct model. Thus, since we do not know the true model and we approximate it by choosing among a set of candidate models, at most we can aspire to estimate its best approximation. This implies the presence of an approximation error whose information content can be exploited to combine models.

I develop a method of prediction that ranks different probabilistic models according to the sum of their similarities to past observations. The similarity is measured by the opposite of the distance, that is, the Kullback-Leibler Information (KI), between the candidate model and the reference model, which is approximated by a nonparametric density. The final weights used to combine models are a function of these distances, which embody the uncertainty about the correct structure. This modeling approach permits one to study and exploit model misspecification, which is defined as the discrepancy between the candidate and the actual model and is measured by the KI. Since the KI is given by the sum of the estimation and approximation errors, and since the weights are a function of the KI, the models’ weights allow us to account for both errors and to extract information from a nonparametric estimate.

To implement this methodology, the paper shows that the Kullback-Leibler Information between the nonparametric fit and the parametric candidate model is asymptotically normally distributed with mean given by the model’s approximation error.2 This result leads to determining the weights in the model combination using the cumulative distribution function of a Normal centered on the average performance of all plausible models.

2 The literature on nonparametric testing provides the technical machinery to derive the asymptotic distribution of the KI. See for example Hall (1984, 1987), Robinson (1991), Fan (1994), Zheng (1996, 2000), and Hong and White (2000).

As such, the final weight is determined by the ability
of a given model to provide a realization of misspecification that is lower than the average.

An important advantage of this method is that it increases the model’s flexibility without compromising its parsimony. Because tightly parameterized models often give better out-of-sample performance, parsimony is a desirable characteristic. As a result, the set of competing models consists of simple parametric alternatives, even when an infinite-dimensional approximation is available.3 This increases the likelihood that the true model does not belong to the set of candidates and that more than one model can perform fairly well, such that it can be hard to distinguish among them. Under these circumstances, the model combination could provide a better hedge against the lack of knowledge of the correct structure and outperform, both in-sample and out-of-sample, each of the competing models. This is because the model combination, providing an explicit representation of uncertainty across models, gathers information from ‘all’ plausible ones. That is, model combination can be viewed as a device to increase the flexibility of the estimation procedure. Furthermore, if the weights in the model combination are not estimated as free parameters but are determined by the ignorance about the true structure, this extra flexibility does not imply the estimation of a higher number of parameters. This translates into a lower risk of overparameterization and a potentially more robust out-of-sample performance.

3 For example, the kernel density estimator (Silverman (1986)) or a countable mixture of Normals (Ferguson (1983)) can approximate arbitrarily closely any well-behaved density function. We can view these models as infinite-dimensional parameter alternatives.

I apply the proposed method to determine the predictive density of daily stock returns under different phases of the business cycle. This empirical application is motivated both by the difficulty in estimating the probability law of asset returns, which usually are modelled with a misspecified
density function, and by the large availability of data for financial series, which facilitates the use of nonparametric techniques. I find that the model combination outperforms, in-sample and out-of-sample, each candidate model including the single best minimizer. The results also indicate that in the small out-of-sample exercise, the model combination performs slightly better than the nonparametric density and than the mixture of models where the weights are estimated as free parameters. Furthermore, in the larger out-of-sample exercise its performance is only marginally worse than that of the last two models, which can be regarded as more complex alternatives.

This way of implementing probabilistic prediction is important for improving econometric modeling and decision making. In fact, my method, like others in the literature, can be considered as a preliminary step to account explicitly for model ambiguity in econometrics. One of the first studies that uses information criteria to identify the most adequate regression model among a set of alternatives is due to Sawa (1978). A subsequent work by Sin and White (1996) uses information criteria for selecting misspecified parametric models. Nevertheless, none of these studies makes use of a preliminary nonparametric estimation to distinguish among alternative models. Furthermore, and more importantly, none of these papers focuses on model combination.

In the context of model combination, there are two main strands of literature related to this work. The first includes Bayesian Model Averaging (BMA) and its application to stock returns predictability and to the investment opportunity set; see for example Avramov (2002) and Cremers (2002). Unlike the Bayesian approach, in this study it is not necessary to assume that the true structure belongs to the set of candidate models. Further, this selection and combination procedure
is based on the idea that although the available database is not sufficient to choose a unique well-defined model, it still provides relevant knowledge that can be used to differentiate among competing models. For this reason a pilot nonparametric density, summarizing all information contained in the data, is used to guide the estimation. Finally, this methodology, being based only on an objective measure of the proximity between multiple candidate models and actual data, aims to overcome the necessity to have a specific prior over the set of models and about parameters belonging to each of the models under consideration. It refers only to the analogy between past samples (actually encountered cases) and models at hand. This requires a limited amount of hypothetical reasoning since it relies directly on data that are available to any observer without ambiguity. The cognitive plausibility of my methodology is founded on case-based decision theory (CBDT). In particular, the behavioral axioms of Inductive Inference developed by Gilboa and Schmeidler (2001) provide support for my prediction method4.

4 As shown in Gilboa-Schmeidler (2001), this is also the same principle at the base of Maximum Likelihood Estimation.

The second vein, though characterized by a completely different approach, comprises the studies about forecast evaluation and combination: Diebold and Lopez (1996), Hendry and Clements (2001) and Giacomini (2003) among others. Finally, there is a third strand partially related to this work. It consists of the vast literature on dynamic portfolio choice under model misspecification, where investors try to learn from historical data; see for example Uppal and Wang (2002) and Knox (2003).

The paper is organized as follows: Section II illustrates the model combination technique; Section III analyzes the asymptotic distribution of the uncertainty measure; Section IV contains
the empirical application to stock returns; and Section V concludes. Analytical proofs and technical issues are discussed in the Appendix.

2 Description of the selection and combination method

2.1 Model selection

I consider a prediction problem for which a finite set of parametric candidate models is given: $\mathcal{M} \equiv \{ f_j(x,\theta),\ j = 1,\dots,J \}_{\theta \in \Theta}$. The goal of the predictor is to rank these models and to combine them in a similarity-weighted probability distribution. Given the set $\mathcal{M}$, we define the set of elements that have to be ranked as $\Theta = \{ \theta_{f_j} : f_j(x,\theta) \in \mathcal{M} \}$, with $\Theta \subset \mathbb{R}^d$.

The information set $\Omega$ is a finite set of $Q$ samples of $N_q$ independent realizations of the random variable $X$. Given the set $\Omega$, its information content is processed by estimating a nonparametric density $\hat f_{nq}(x)$ for each sample $q = 1,\dots,Q$. Subsequently, from the set $\Omega$, I derive the set of past cases $C = \{ \hat f_{nq}(x) : x \in \Omega \}$, which is the final information that the predictor possesses to judge the different models. The problem is then to describe how to process and recall this information to assess the similarity of past observations to the set of candidate models.

Let us define the weight as a map $w : \Theta \times C \to \mathbb{R}$; it assigns a numerical value $w_{qj}$ to each pair of past case $\hat f_{nq}(x)$ and parameter $\theta_{f_j}$, representing the support that this case lends to the model $f_j(x,\theta)$ in $\mathcal{M}$.

The sum of weights $w_{qj}$ represents the tool through which the predictor judges the similarity of a particular model to the estimated distributions with which his knowledge is equipped. More precisely, these weights represent the degree of support that past distributions lend to the specific
model at hand. However, they also embody the misspecification contained in each model, which, being just an approximation of reality, still preserves a distance from the actual data. It seems reasonable that the model with the lowest distance from the nonparametric densities is also the model with the highest similarity to past observations. As such, it has to be the model characterized by the highest sum of weights.

For these reasons, it seems natural to determine $w_{qj}$ by the opposite of the distance between the nonparametric density $\hat f_{nq}(x)$ and the model $f_j(x,\theta)$:

$$ w_{qj} = -KI\left( \hat f_{nq}(x), f_j(x,\theta) \right), \tag{1} $$

where $KI\left( \hat f_{nq}(x), f_j(x,\theta) \right)$ is the Kullback-Leibler distance, whose empirical version in this study is defined as follows:

$$ \widehat{KI}_{qj} = \sum_{i=1}^{N_q} \hat f_{nq}(x_i) \log\left( \frac{\hat f_{nq}(x_i)}{f_j(x_i,\theta)} \right), \tag{2} $$

where $i$ is the index for all observations contained in a sample $q$.

If the values of the optimal parameters were known, the prediction rule, which ranks the plausibility of each model through the sum of its weights (over the past cases), would lead us to choose as predictive density $f_1$ rather than $f_2$ if and only if:

$$ \sum_{q \in C} w_{q1} > \sum_{q \in C} w_{q2}, \tag{3} $$

or equivalently:

$$ \sum_{q \in C} KI\left( \hat f_{nq}(x), f_1(x,\theta) \right) < \sum_{q \in C} KI\left( \hat f_{nq}(x), f_2(x,\theta) \right). \tag{4} $$
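
As a rough illustration of the ranking rule in (1)-(4), the following Python sketch computes the empirical KI of equation (2) for each sample and orders two candidate densities by their aggregate KI. Everything specific in it is an assumption made for illustration and is not taken from the paper: the Gaussian kernel with a fixed bandwidth `h`, the simulated standard Normal samples, and the two hypothetical candidates `normal_a` and `normal_b`, whose parameters are treated as known.

```python
import math
import random


def kde_at_sample_points(sample, h):
    """Gaussian-kernel density estimate evaluated at each observation of the sample."""
    n = len(sample)
    scale = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    return [
        scale * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
        for x in sample
    ]


def normal_pdf(mu, sigma):
    """Return the Normal(mu, sigma^2) density as a function of x."""
    def pdf(x):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return pdf


def empirical_ki(sample, f_hat_values, model_pdf):
    """Equation (2): sum_i f_hat(x_i) * log(f_hat(x_i) / f_j(x_i, theta))."""
    return sum(
        fn * math.log(fn / model_pdf(x))
        for x, fn in zip(sample, f_hat_values)
    )


def rank_models(samples, models, h):
    """Aggregate KI over the Q samples; the smallest total ranks first (rule (4))."""
    totals = {name: 0.0 for name in models}
    for sample in samples:
        f_hat_values = kde_at_sample_points(sample, h)  # one past case per sample
        for name, pdf in models.items():
            totals[name] += empirical_ki(sample, f_hat_values, pdf)
    return sorted(totals.items(), key=lambda item: item[1])


# Hypothetical data: Q = 3 samples of 200 draws from a standard Normal.
rng = random.Random(0)
samples = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(3)]

# Hypothetical candidates with parameters treated as known.
models = {"normal_a": normal_pdf(0.0, 1.0), "normal_b": normal_pdf(3.0, 1.0)}

ranking = rank_models(samples, models, h=0.4)
for name, total_ki in ranking:
    print(f"{name}: aggregate KI = {total_ki:.3f}")
```

Because each weight is the negative of the corresponding KI, the candidate listed first (smallest aggregate KI) is also the one with the largest sum of weights in (3).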

The sum of the weights relative to model $f_1$ can be interpreted, as in Gilboa and Schmeidler (2001), as the “aggregate similarity or plausibility” of model $f_1$. However, as the values of the optimal parameters are unknown, it is necessary to estimate them as described in D’Amico (2003a), that is:

$$ \max_{\theta_{f_j}} \sum_{q \in C} w_{qj} = \min_{\theta_{f_j}} \sum_{q \in C} KI\left( \hat f_{nq}(x), f_j(x,\theta) \right). \tag{5} $$

It follows then that the rank of the competing models is obtained as follows:

$$ f_1 \succ f_2 \ \text{ iff } \ \min_{\theta_{f_1} \in \Theta} \sum_{q \in C} KI\left( \hat f_{nq}(x), f_1(x,\theta) \right) < \min_{\theta_{f_2} \in \Theta} \sum_{q \in C} KI\left( \hat f_{nq}(x), f_2(x,\theta) \right), \tag{6} $$

which in turn implies that the best model can be represented by the following prediction rule:

$$ \inf_{\{ j:\, 1,\dots,J \}} \left[ \min_{\theta_{f_j} \in \Theta} \sum_{q \in C} KI\left( \hat f_{nq}(x), f_j(x,\theta) \right) \right]. \tag{7} $$

2.2 Model Combination

Selecting a single model as described in the previous section, even if it implicitly recognizes the presence of misspecification, does not account explicitly for model ambiguity. More importantly, it does not consider that the true structure may not belong to the initial set of candidate models; as such, using only the best minimizer is not necessarily the ultimate solution. This implies that, in order to incorporate the information contained in the KI, the combination of all plausible models in a similarity-weighted predictive distribution is needed, where the weights are a function of $KI\left( \hat f_n(x), f_j(x,\hat\theta) \right)$.

The intuition is the following: $\widehat{KI}_j$ can be interpreted as a measure of uncertainty or ignorance about the true structure. When computed at the optimal value of the parameter $\hat\theta_{f_j}$, it can be
considered as a measure of the goodness of the model, since it represents the margin of error of this model in a particular sample. If it is different from zero for each candidate distribution and/or there are many models that exhibit a similar loss, then the econometrician fearing misspecification will explicitly account for it by combining the models in the predictive distribution $M(\hat\theta_{f_j}) = \sum_j p_j(\widehat{KI})\, f_j(x,\hat\theta)$. The similarity-weight $p_j(\widehat{KI})$ can be loosely interpreted as the probability of model $f_j$ being correct. In contrast, if the predictor selected a single distribution $f_j$, he would overestimate the precision of this model, since he would implicitly assign to the model a probability ($p_j(\widehat{KI})$) of being correct equal to one.

In order to better appreciate the importance of the information contained in the model’s misspecification and subsequently in $M(\hat\theta_{f_j})$, it is necessary to give a brief description of the spaces in which we operate when the statistical structural assumptions are not necessarily true. Define $G$ to be the space of functions to which the true unknown model $g(x)$ belongs: by assumption $g(x)$ minimizes the KI over $G$. $F_{\Theta_{f_j}} \subseteq G$ represents the finite-dimensional space to which the parametric candidate models belong; we can call it the approximation space, and it is also the space where the estimation is carried out. The best approximation $f_j(x,\theta^*)$ in $F_{\Theta_{f_j}}$ to the function $g(x)$ is the p.d.f. that minimizes the KI over $F_{\Theta_{f_j}}$, while $f_j(x,\hat\theta) \in F_{\Theta_{f_j}}$ minimizes the sample version of the KI. The distance between $f_j(x,\hat\theta)$ and $f_j(x,\theta^*)$ represents the estimation error, which vanishes as $n \to \infty$. Instead, the approximation error5, given by the distance between $f_j(x,\theta^*)$ and $g(x)$, can be reduced only if the dimension of $F_{\Theta_{f_j}}$ grows with the sample size.

5 See Chen X. and J.Z. Huang (2002).

Model combination can therefore be considered as a method to increase the dimension of the parameter space accounting
for the approximation error.

Only if $F_{\Theta_{f_j}} \equiv G$, then $g(x) = f_j(x,\theta_0) = f_j(x,\theta^*)$ and $\hat\theta$ is a consistent estimator of the true parameter $\theta_0$. Typically, because of the advantages6 offered by parsimonious models, $F_{\Theta_{f_j}}$ is a small subset of $G$ and hence model misspecification can be a serious problem, also affecting the asymptotic results. Furthermore, in finite samples the $\widehat{KI}_j$ embodies information about both the estimation and approximation errors relative to $f_j$, and as such it cannot be ignored.

6 Closed form solution, ease of interpretation and low computational costs.

Once it is decided to use the combination of p.d.f.s $M(\hat\theta_{f_j})$ as predictive density, the main task consists in determining the probability $p_j(\widehat{KI})$. For this purpose, I show that (see the next section and the Appendix for more details) $\widehat{KI}_j$ minus a correction term ($m_n \sim \mathrm{dist}(f_j(\theta^*), g)$), mainly due to the approximation error, is asymptotically distributed Normal $N(0,\sigma^2)$, where a consistent estimate of $\sigma^2$ is determined only by the nonparametric density. Then, the probability of being the correct model can be determined by the probability of obtaining a misspecification $\widehat{KI}_j$ worse than the one actually obtained ($\widehat{ki}$). That is:

$$ p_j(\widehat{KI}) = 1 - P\left( \widehat{KI}_j \le \widehat{ki} \right). \tag{8} $$

Since it is well known that $KI(g, f_j(\theta)) \ge 0$, where equality holds if and only if $g = f_j$, then $p_j(\widehat{KI}) = 1$ if and only if $\widehat{ki}_j = 0$. This follows trivially from the fact that $P(\widehat{KI}_j \le 0) = 0$. Consequently, $p_j(\widehat{KI})$ will be less than one for any positive realization of $\widehat{KI}_j$. Accordingly, if the $\widehat{ki}$ is very small, then the probability ($P(\widehat{KI}_j \le \widehat{ki})$) of obtaining a realization of the misspecification even smaller than such a low value will be very small; it then follows that the probability $p_j(\widehat{KI})$
of having a good model will be very high.

It is clear that to determine the weight it is sufficient to compute the cumulative distribution function of a Normal with mean $m_n$ and variance $\sigma^2$ for the realized value $\widehat{ki}$. Nevertheless, in the implementation of this methodology, it is necessary to pay attention to the mean $m_n$ that, being affected by the approximation error, varies with the candidate model. In the next section and in the Appendix, the device to fix this problem and the measurement of $m_n$ are described in more detail.

3 Asymptotic results

3.1 Assumptions

Before proceeding with the theorems, let me first state all the assumptions:

A1: $\{X_i\}$ are i.i.d. with compact support $S$; their marginal density $g$ exists, is bounded away from zero, and is twice differentiable. Its first-order derivative is also bounded, and moreover $|g''(x_1) - g''(x_2)| \le C |x_1 - x_2|$ for any $x_1, x_2 \in S$ and for some $C \in (0,\infty)$.

A2: The kernel $K$ is a bounded symmetric probability density function around zero, s.t.: (i) $\int K(u)\,du = 1$; (ii) $\int u^2 K(u)\,du < \infty$; (iii) $h = h_n \to 0$ as $n \to \infty$; (iv) $n h_n \to \infty$ as $n \to \infty$.

A3: Given the set $\mathcal{M}$, it is possible to select a kernel $K$ that satisfies A2 and such that the tail-effect terms involved in the use of the KI are negligible.

A4: $\Theta$ is a compact and convex subset of $\mathbb{R}^d$; the family of distributions $F(\theta)$ has densities $f(\theta,x)$ which are measurable in $x$ for every $\theta \in \Theta$ and continuous in $\theta$ for every $x \in \Omega$; $E_g[\log g(x) - \log f(\theta,x)]$ exists and has a unique minimum at an interior point $\theta^*$ of $\Theta$; $\log f(\theta,x)$ is bounded by
a function $b(x)$ for all $\theta \in \Theta$, where $b(x)$ is integrable w.r.t. the true distribution $G$.

A5: The first and second derivatives of $\log f(\theta,x)$ w.r.t. $\theta$ and $\left| \frac{\partial \log f(\theta,x)}{\partial\theta} \times \frac{\partial \log f(\theta,x)}{\partial\theta} \right|$ are also dominated by $b(x)$; $B(\theta^*) = E\left[ \left( \frac{\partial \log f(\theta^*,x)}{\partial\theta} \times \frac{\partial \log f(\theta^*,x)}{\partial\theta} \right) g^2(x) \right]$ is nonsingular and $A(\theta^*) = E\left[ \frac{\partial^2 \log f(\theta^*,x)}{\partial\theta\,\partial\theta}\, g(x) \right]$ has a constant rank in some open neighborhood of $\theta^*$.

Assumption A1 requires that the $X_i$ are continuously distributed and imposes regularity conditions on the unknown density $g$. A2 represents the standard assumptions on the kernel function and the smoothing parameter used in the nonparametric literature. Assumption A3 is a practical assumption that we need in order to simplify the proofs and ignore the tail effects due to the use of the Kullback-Leibler distance. As indicated by Hall (1987), it is important that $K$ is chosen such that its tails are sufficiently thick with respect to the tails of the underlying function $f_j(\theta,x)$. Since we know the candidate parametric models, it is always possible to choose an adequate kernel. Furthermore, Hall suggested a practical alternative which is given by the kernel $K(u) = 0.1438 \exp\left[ -\tfrac{1}{2} \{ \log(1+|u|) \}^2 \right]$, whose tails decrease more slowly than the tails of the Gaussian kernel and which allows in most cases to neglect the tail-effect terms. Finally, the last two assumptions A4 and A5 are standard to ensure the consistency and asymptotic normality of QMLE (White (1982)).

3.2 Asymptotic distribution of KI: heuristic approach

In order to obtain the weights in the model combination, as indicated by formula (8), we need to derive the asymptotic distribution of $\widehat{KI}_j$, the random variable that measures the ignorance about the true structure.

The purpose of this section is to provide a sketch of the proof (developed in the Appendix), in
order to give the main intuition and to convey two main pieces of information. First, the effect of estimating the true model $g$ by $f_j(\hat\theta,x)$ on the limiting distribution of $\widehat{KI}_j$. Second, how and which of the different components of the $\widehat{KI}_j$ affect the mean and variance of the asymptotic distribution.

To simplify the notation I drop the index $j$ and I rewrite $f_j(\hat\theta,x) = f_{\hat\theta}$, $\hat f_n(x) = \hat f_n$ and $g(x) = g$; then $\widehat{KI}$ is given by the following formula:

$$ \widehat{KI} = KI(\hat f_n, f_{\hat\theta}) = \int_x (\ln \hat f_n - \ln f_{\hat\theta})\, \hat f_n\, dx = \int_x (\ln \hat f_n - \ln g)\, d\hat F_n - \int_x (\ln f_{\hat\theta} - \ln g)\, d\hat F_n = \widehat{KI}_1 - \widehat{KI}_2, \tag{9} $$

where the definition of $\widehat{KI}_1$ and $\widehat{KI}_2$ is clear from the previous expression.

1) $\widehat{KI}_1$ can be approximated in the following way7:

$$ \widehat{KI}_1 \simeq \int_x \left( \frac{\hat f_n - g}{g} \right) d\hat F_n - \frac{1}{2} \int_x \left( \frac{\hat f_n - g}{g} \right)^2 d\hat F_n = \widehat{KI}_{11} - \frac{1}{2} \widehat{KI}_{12}, \tag{10} $$

where $\widehat{KI}_{11}$ is a stochastic element that will affect the asymptotic distribution of $\widehat{KI}$, while $\widehat{KI}_{12}$ is roughly8 the sum of squared bias and variance of $\hat f_n$. It is $O((nh)^{-1} + h^4)$ and it will contribute to the asymptotic mean of $\widehat{KI}$.

7 This can be easily seen by rewriting $\hat f_n / g$ in the following way: $\frac{\hat f_n - g + g}{g} = 1 + \frac{\hat f_n - g}{g} = 1 + \gamma$; then $\ln(1+\gamma) \simeq \gamma - \frac{1}{2}\gamma^2$.

8 In order to see this, it is sufficient to rewrite $\widehat{KI}_{12}$ as $\int \left( \frac{\hat f_n - E\hat f_n}{g} + \frac{E\hat f_n - g}{g} \right)^2 d\hat F_n$.

2) $\widehat{KI}_2$ has a different nature: it represents the part of the KI that is affected by the parameter estimation. $\widehat{KI}_2$ can be rewritten in the following way:

$$ \widehat{KI}_2 = \int_x (\ln f_{\hat\theta} - \ln f_{\theta^*})\, d\hat F_n + \int_x (\ln f_{\theta^*} - \ln g(x))\, d\hat F_n = \widehat{KI}_{21} + \widehat{KI}_{22}, \tag{11} $$

where $f_{\theta^*} = f_j(x/s, \theta^*)$.

Although in this case the first term $\widehat{KI}_{21}$ is stochastic, it will not affect the asymptotic distribution of $\widehat{KI}$. In fact, since it is $O_p\left(\frac{1}{n}\right)$, when rescaled by the appropriate convergence rate $d_n = n h^{1/2}$ it converges to zero:

$$ d_n \widehat{KI}_{21} \xrightarrow{p} 0. \tag{12} $$

The second term $\widehat{KI}_{22}$ has the following behavior:

$$ \widehat{KI}_{22} \xrightarrow{p} E_g[\ln f_{\theta^*} - \ln g(x)] = \left( -KI(g, f_{\theta^*}) \right) \le 0, \tag{13} $$

as such, its presence is due to the approximation error. It is important to note that $\widehat{KI}_{22}$ varies with the underlying candidate model and cannot be observed. This implies that a term of the asymptotic mean of $\widehat{KI}$ will depend on the specific model $M_j$; then, in order to determine and estimate a limiting distribution that is the same for all candidate models, the following assumption is needed:

A6: $$ \widehat{KI}_{22} \sim \alpha h^{1/2}\, \widehat{KI}_{12}. \tag{14} $$

A6 requires that the mean of the approximation error is proportional to a quantity ($\widehat{KI}_{12}$) whose estimation depends only on $\hat f_n$; consequently, it will not be influenced by any specific model $f_j(\theta,x)$. Further, when $h \propto n^{-\beta}$ with $\beta > \frac{1}{5}$, $\widehat{KI}_{12} \sim C(nh)^{-1}$, and we obtain that:

$$ d_n \widehat{KI}_{22} \xrightarrow{p} \alpha C, \tag{15} $$

where $C$ is a known positive constant. This assumption can be interpreted as a local misspecification, where the resulting local convergence rate is chosen such that it cancels out with the rate at which the misspecification would converge to infinity.
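
The rate calculation behind (15) can be written out explicitly. The following is only a sketch that combines A6 with the stated order $O((nh)^{-1} + h^4)$ of the term $\widehat{KI}_{12}$; the intermediate steps are an illustrative reading rather than a derivation given in the text.

```latex
% Why \widehat{KI}_{12} \sim C (nh)^{-1}: with h \propto n^{-\beta},
% the h^4 term is dominated by (nh)^{-1} exactly when n h^5 \to 0.
\frac{h^{4}}{(nh)^{-1}} = n h^{5} \propto n^{\,1-5\beta} \longrightarrow 0
\quad \text{iff} \quad \beta > \tfrac{1}{5}.

% Substituting A6 and d_n = n h^{1/2}:
d_n \widehat{KI}_{22}
  \sim n h^{1/2} \cdot \alpha h^{1/2} \, \widehat{KI}_{12}
  \sim \alpha \, n h \cdot C (nh)^{-1}
  = \alpha C .
```

The factor $h^{1/2}$ in A6 is what makes the product with $d_n = n h^{1/2}$ scale like $nh$, which exactly offsets the $(nh)^{-1}$ order of $\widehat{KI}_{12}$.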

Thus collecting all terms together:

$$ \widehat{KI} \simeq \widehat{KI}_{11} - \frac{1}{2} \widehat{KI}_{12} - \left( \widehat{KI}_{21} + \widehat{KI}_{22} \right), \tag{16} $$

we have the next theorem:

THEOREM 1: Given assumptions A1-A6, and given that $n h^5 \to 0$ as $n \to \infty$, then

$$ n h^{1/2} \left( \widehat{KI} + \frac{1}{2} \widehat{KI}_{12} + \widehat{KI}_{22} \right) \xrightarrow{d} N(0,\sigma^2) $$

where $\sigma^2 = 2 \left\{ \int K^2(u)\,du - \int \left[ \int K(u) K(u+v)\,du \right]^2 dv \right\}$.

Proof: See the Appendix.

To better understand the implication of A6 for the determination of the combination weights $p_j(\widehat{KI})$, it is helpful to rewrite the previous result as follows:

$$ n h^{1/2} \left( \widehat{KI} + \frac{1}{2} \widehat{KI}_{12} \right) \overset{A}{\sim} N(m,\sigma^2) \tag{17} $$

where $m = \alpha C \simeq KI(g, f_{\theta^*})$, from (13) and (15). This implies that to estimate the mean of the distribution it is necessary to pin down $\alpha$, whose estimation is based on the ‘plausibility’ of the candidate models. Assumption A6 elicits the following definition of a plausible model:

Def: $f_j(\theta,x)$ is plausible, and thus will be included in the set $\mathcal{M}$, if the expected value of its approximation error is equal to $\alpha C$.

In other words, according to A6, all the competing models are on average expected to have the same distance from the true model $g$. Subsequently, as suggested by the definition of $m$, $\alpha$ could be estimated by a suitably normalized average of all models’ misspecification:
$$ \hat\alpha = \frac{1}{J} \sum_j \widehat{KI}_j / C \simeq KI(g, f_{\theta^*}) / C, \tag{18} $$

where $E(\widehat{KI}_j)$ can be considered an approximation of the average specification error $KI(g, f_{\theta^*})$, which cannot be observed.

Therefore, to obtain $p_j(\widehat{KI})$ we have to employ the c.d.f. of a Normal with mean $E(\widehat{KI}_j)$ and variance $\sigma^2$. This entails that, if a model performs better than the average performance of all plausible models, that is $0 < \widehat{ki}_j < \hat m_n$, then it receives a large weight in the model combination. On the other hand, if the model performs poorly relative to all other models, that is $\widehat{ki}_j > \hat m_n$, then its probability of being correct ($p_j(\widehat{KI})$) will be low.

4 Application to stock returns

A common assumption of many models in finance, such as the capital asset pricing model (CAPM), the arbitrage pricing theory (APT) and the Black and Scholes option pricing theory, is that of normally distributed returns. The problem is that very often this assumption is not supported by empirical evidence. Financial asset returns possess distributions characterized by a sharp peak around zero, by tails heavier than those of the normal distribution and by a certain degree of asymmetry. As early as 1963, Mandelbrot (1963) strongly rejected normality as a distributional model for asset returns, and a subsequent work by Fama (1965) further corroborated such evidence. These studies gave rise to a new probabilistic foundation for financial assets that was based on the Stable Paretian Distribution, which generalizes the Gaussian distribution and allows for heavy tails and
skewness. However, this kind of distribution had little success in practice, since such distributions are characterized by infinite variance, which is inappropriate for real data, and very often there is no closed-form expression for the density.

Given the importance of the subject, more recently many economists and statisticians have focused their attention on tests and models to describe the distribution of asset returns9. First, as reported by Campbell-Lo-Mackinlay (1997)10, the skewness for daily US stock returns tends to be negative for stock indexes and positive for individual stocks. Second, the excess kurtosis for daily US stock returns is large and positive for both indexes and individual stocks. Both characteristics are further documented in Ullah-Pagan11 (1999) using nonparametric estimation of monthly stock returns’ density from 1834 to 1925. Their analysis clearly shows that the density departs significantly from a normal, because of its asymmetry, the fat tails and the sharp peak around zero. Third, Diebold-Gunther and Tay (1998), in their application to density forecasting of daily S&P 500 returns, indicate that the Normal forecasts are severely deficient. Finally, Knight-Satchell and Tran (1995) show that scale Gamma distributions are a very good model for the UK FT100 index.

9 See for example the Handbook of Heavy Tailed Distributions in Finance (2003) for a complete analysis of studies about modeling the distribution of several financial assets.

10 The Econometrics of Financial Markets, 1997, pp. 16 and 17.

11 Nonparametric Econometrics, 1999, pp. 71-74.

4.1 A Set of simple models

I now apply the described prediction method to determine the predictive density of stock returns, which can subsequently be used to determine the optimal share to invest in the risky asset. Given the previous facts, let me assume that the set of candidate models for the risky asset’s returns consists
of three distributions: a Normal ($N(\mu,\sigma^2)$), a Fisher-Tippett¹² ($F(\alpha,\beta)$) and a mixture of general Gammas ($G(\varsigma,\lambda)$). The first model derives from the 'convenient' version of the random walk hypothesis. Typically, due to the hypothesis of asset market efficiency, stock prices are assumed to follow a random walk, that is:

$$p_t = \mu + p_{t-1} + \varepsilon_t, \qquad \varepsilon_t \ \text{IID}, \qquad \text{where } p_t = \log(P_t).$$

Further, since the most widespread assumption for the innovations $\varepsilon_t$ is normality, stock returns are normally distributed with mean $\mu$ and variance $\sigma^2$. The second model is suggested by the empirical evidence reported in the previous paragraph, which advocates the use of an extreme value distribution with more probability mass in the tail areas, and the third model is a direct consequence of the study by Knight-Satchell and Tran (1995). Let $X_t$ be the log of asset return for day $t$; it will be modelled using the following densities:

$$1)\quad f(X_t;\mu,\sigma) \equiv \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(X_t-\mu)^2}{2\sigma^2}\right),$$

$$2)\quad f(X_t;\alpha,\beta) \equiv \frac{1}{\beta}\exp\left(\frac{X_t-\alpha}{\beta}\right)\exp\left(-\exp\left(\frac{X_t-\alpha}{\beta}\right)\right).$$

The third model requires some more detail: since the Gamma distribution is defined only for $0 \le X_t \le \infty$, the distribution for $X_t$ will be a mixture of two Gammas. Following the authors, let us define the variable:

$$Z_t = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1-p \end{cases}$$

¹² It is also known as the double exponential distribution, and a particular case of it is the Gumbel distribution.

where $p$ is the proportion of returns that are less than a specified benchmark $\gamma$. It then follows that $X_t$ is defined as

$$X_t = \gamma + X_{1t}(1 - Z_t) - X_{2t}Z_t,$$

where the $X_{jt}$ are independent random variables with density $f_j(\cdot)$, $j = 1,2$. Hence if $Z_t = 1$, $X_t \le \gamma$ and we sample from the $X_2$ distribution; if $Z_t = 0$, $X_t > \gamma$ and we sample from the $X_1$ distribution. $f_1(\cdot)$ and $f_2(\cdot)$ are defined as follows:

$$3)\quad f_1(X_{1t};\varsigma,\lambda) \equiv \frac{\lambda^{\varsigma}}{\Gamma(\varsigma)}(X_{1t}-\gamma)^{\varsigma-1}\exp(-\lambda(X_{1t}-\gamma)),$$

$$f_2(X_{2t};\varsigma,\lambda) \equiv \frac{\lambda^{\varsigma}}{\Gamma(\varsigma)}(\gamma-X_{2t})^{\varsigma-1}\exp(-\lambda(\gamma-X_{2t})).$$

4.2 The Data

To implement the empirical application I use daily closing price observations on the US S&P500 index over the period from December 1, 1969 to October 31, 2001, for a total of 7242 observations. The source of the data is DRI. The stock return $X_t$ is computed as $\log(1+R_t)$, where $R_t = \frac{P_t - P_{t-1}}{P_{t-1}}$. Descriptive statistics for the entire sample are provided in the following table.

| S&P500 index | |
|---|---|
| Min. value | -0.08642 |
| Max. value | 0.087089 |
| Mean | 0.000319 |
| Std. deviation | 0.01005 |
| Kurtosis | 4.9333 |
| Skewness | -0.10974 |

Table I

Furthermore, Ang and Bekaert (2001, 2002) and Guidolin and Timmermann (2002) have stressed the importance of distinguishing between 'bear' and 'bull' regimes in modeling stock returns and

indicate that these persistent regimes have important economic implications for investors' portfolio decisions. Based on these observations, I have chosen to divide the data into two groups. The first contains all samples relative to contraction (C) and the second includes all samples relative to expansion (E). These two phases of the business cycle typically coincide with 'bear' and 'bull' regimes of the stock market. This implies that the optimal model for asset returns is conditional on the specific regime, which for simplicity I assume to be known at the time of the empirical analysis¹³.

Under the assumption that in each regime all subsamples are drawn from a fixed distribution, it is possible to create for each state a unique sample that includes all contractions and all expansions respectively. Merging together all the recessions I obtain a sample of 1321 observations, while combining all expansions I obtain a sample of 5921 observations. The descriptive statistics for these two subsamples are reported in the following table.

| S&P500 index | Expansion | Contraction |
|---|---|---|
| Min. value | -0.08642 | -0.05047 |
| Max. value | 0.087089 | 0.05574 |
| Mean | 0.00044 | -0.00039 |
| Std. deviation | 0.009165 | 0.0132 |
| Kurtosis | 7.1555 | 1.05685 |
| Skewness | -0.30326 | 0.26712 |

Table II

It is evident from Tables I and II that these data are not consistent with the common assumption that the true model for $X_t$ is the Gaussian distribution. These values confirm previous studies where daily stock returns have been found to exhibit high excess Kurtosis and negative Skewness for index

¹³ The contractions and expansions are those provided by NBER's Business Cycle Dating Committee for the US Economy, available at the website www.nber.org/cycles.

returns. Further, it is very striking how these values differ across regimes. First, as found in other studies, contractions and, in general, bear regimes are characterized by high volatility and a negative mean for stock returns, which turns out to be a problem in determining the optimal share to invest in the risky asset. Second, while during expansions stock returns show a positive excess Kurtosis (even bigger than that displayed in Table I for all data) and a negative Skewness (about three times bigger than that for the entire sample), during contractions the excess Kurtosis is negative (that is, the Kurtosis is lower than three) and the Skewness is positive. According to these simple descriptive statistics, it is reasonable to expect different optimal models for stock returns across these two regimes.

4.3 Empirical Results

For each of these samples I estimate the univariate density of stock returns by Nadaraya-Watson kernel density estimators. For the Kernel function I employ the second-order Gaussian Kernel, and the bandwidths are selected via least-squares cross-validation (Silverman, 1986, p. 48). I then use the Kullback-Leibler entropy to measure the distance between the estimated nonparametric density and each of the models belonging to the set $\mathcal{M}$. Minimizing this distance I obtain the parameter estimates for each candidate distribution and a value for $\widehat{KI}_j$, which allows me to rank all competing models and to determine the weight of each of them in the final model combination. The estimated parameters for each distribution are reported below.

| $N(\mu,\sigma^2)$ | Entire sample | Expansion | Contraction |
|---|---|---|---|
| $\mu$ | 0.0004* | 0.0005* | -0.0008* |
| $\sigma$ | 0.0082* | 0.0075* | 0.0123* |
| $\widehat{KI}$ | 0.1897 | 0.1587 | 0.0513 |

*All estimates are significant at 1% level

| $F(\alpha,\beta)$ | Entire sample | Expansion | Contraction |
|---|---|---|---|
| $\alpha$ | -0.00179* | -0.0014* | -0.00403* |
| $\beta$ | 0.008509* | 0.00773* | 0.01213* |
| $\widehat{KI}$ | 0.9836 | 0.9209 | 0.3362 |

*All estimates are significant at 1% level

| $G(\varsigma,\lambda)$ | Entire sample | Expansion | Contraction |
|---|---|---|---|
| $\varsigma$ | 1.1104* | 1.1212* | 1.1237* |
| $\lambda$ | 146.3839* | 160.6803* | 97.4237* |
| $\hat{\gamma}$ | 0.00031 | 0.00044 | -0.00039 |
| $\hat{p}$ | 0.47878 | 0.465631 | 0.53637 |
| $1-\hat{p}$ | 0.52122 | 0.5343 | 0.46363 |
| $\widehat{KI}$ | 0.0468 | 0.0666 | 0.0776 |

*All estimates are significant at 1% level

Table III

Examining the tables we see that all the estimates are intuitively reasonable and significantly different from zero. Comparing the three models over the entire sample, we notice that the model characterized by the double Gamma outperforms the other two. Its $\widehat{KI}$ assumes the lowest value (0.0468), which is four times smaller than that of the Normal and twenty times smaller than that of the Fisher-Tippett. In the case of expansion too, the double Gamma is clearly better than the other two models; its $\widehat{KI}$ equals 0.0666, which is less than half the value for the Normal (0.1587). In contrast, for the sample including all contractions the Gaussian distribution performs slightly better than the double Gamma. The value of its $\widehat{KI}$ is equal to 0.0513, which is smaller than the respective value for the double Gamma (0.0776). Finally, both values are several times smaller (by a factor of roughly four to seven) than the $\widehat{KI}$ for the Fisher-Tippett distribution (0.3362). These results contradict the common assumption that the best unique model for stock returns is the Gaussian distribution, and confirm that the optimal model changes across regimes. Further, since more than one model performs fairly well, and because each of them

has properties that capture particular characteristics of the return distribution, it seems reasonable to combine them.

It is important to stress some characteristics of the double Gamma, since it is overall the model that provides the best performance in terms of aggregate similarity to the data. First of all, it is worth mentioning that in all three samples the values of $\hat{p}$ suggest that the sample proportion of negative returns is not very different from that of positive returns. Second, the estimates of $\varsigma$ in all three samples are greater than unity, which entails that returns are well described by a bimodal density. All these features of the estimated model confirm the results that Knight-Satchell and Tran (1995) found in the case of UK stock returns.

The final step to compute the similarity-weighted predictive distribution $M(\hat{\theta}_{f_j})$ consists in evaluating, for each of the models under consideration, the 'probability' $p_j(\widehat{KI})$ of being correct. It can be helpful to first provide the realizations of $\widehat{KI}_j$ for all models in each of the samples.

| | All data | Expansion | Contraction |
|---|---|---|---|
| G | 0.0468 | 0.0666 | 0.0776 |
| N | 0.1897 | 0.1587 | 0.0513 |
| F | 0.9836 | 0.9209 | 0.3362 |

Table IV: Realized loss for each model

The following table exhibits the value of $p(\widehat{KI}_j)$ for the three models under consideration.

| | All data | Expansion | Contraction |
|---|---|---|---|
| G | 0.8121 | 0.7811 | 0.5689 |
| N | 0.7033 | 0.7086 | 0.604 |
| F | 0.0779 | 0.0924 | 0.331 |

Table V: Optimal weight for each model

As can be noticed, these values represent 'probabilities' before normalization since they do not

sum up to unity. The results contained in Table V seem to confirm that this methodology for determining the "probability of being the correct model" works in the right direction. In fact, in each of the samples the p.d.f. with the lowest realization of the $\widehat{KI}$ receives the highest $p_j(\widehat{KI})$, and hence it will receive the largest weight in the model combination. Further, the very poor performance of the Fisher-Tippett distribution with respect to the other two candidate models suggests that it would be sensible to discard this model in order to conform the application to assumption A6. Thus, in the next section I present the results obtained combining only the Normal and the double Gamma distributions.

4.4 In and Out-of-sample performance of model combination

Let us first consider the in-sample performance of the model combination. The results are summarized in the following table, where the values of $\widehat{KI}$ for each single model are also reported.

| | All data | Expansion | Contraction |
|---|---|---|---|
| $w^{ki}_g G + w^{ki}_n N$ | 0.0256 | 0.0179 | 0.0137 |
| G | 0.0468 | 0.0666 | 0.0776 |
| N | 0.1897 | 0.1587 | 0.0513 |

Table VI: In-sample Results
Note: $w^{ki}_j$ indicates the weight for model $j$ obtained as a function of $\widehat{KI}_j$.

Using the entire dataset from December 1, 1969 to October 31, 2001, after normalizing the $p(\widehat{KI}_j)$, the double Gamma $G(1.1104, 146.38)$ receives a weight of 0.5359 and the Normal $N(0.0004, (0.0082)^2)$ receives a weight of 0.4641. The Kullback-Leibler distance between the nonparametric density estimate and the model combination equals 0.0256, a loss roughly half that of the best minimizer. If I consider the sample including all expansions, to the Gamma $G(1.1212, 160.68)$ is

assigned a weight equal to 0.5243 and to the Normal $N(0.0005, (0.0075)^2)$ a weight of 0.4757. This model combination delivers a distance from the nonparametric density equal to 0.0179, which is less than a third of that achieved by the best model. Finally, considering only contraction data, the Gamma $G(1.1237, 97.42)$ receives a weight of 0.4937, while the Normal $N(-0.0008, (0.0123)^2)$ attains a weight equal to 0.5063. In this case as well, the model combination outperforms the best model by achieving a $\widehat{KI}$ equal to 0.0137, which is about one fourth of the distance achieved by the best model.

Now, to verify the performance of the nonparametric $\widehat{KI}$ and of the model combination out-of-sample, the previous results are analyzed in the context of a different dataset, using the series of stock returns observed from November 1, 2001 to September 30, 2003, for a total of 479 observations. This sample represents the most recent case of expansion, or more precisely recovery, according to the latest determination of the Business Cycle Committee of the NBER. The summary statistics are displayed below.

| S&P500 index | |
|---|---|
| Min. value | -0.01842 |
| Max. value | 0.024204 |
| Mean | -0.0000556 |
| Std. deviation | 0.00619 |
| Kurtosis | 0.932 |
| Skewness | 0.2804 |

Table VII

Using these data, but the parameter estimates and the weights obtained from the expansion sample for the period December 1, 1969 to October 31, 2001, I evaluate the KI distance between the nonparametric density estimated in the new sample ($\hat{f}_{nOUT}$) and the parametric models estimated in the previous sample. I obtain the following results: the KI between the model combination

$(0.5243\,G_{IN} + 0.4757\,N_{IN})$ and $\hat{f}_{nOUT}$ is equal to 0.7639, between the Gamma distribution and $\hat{f}_{nOUT}$ is equal to 0.7749, and between the Normal and $\hat{f}_{nOUT}$ is 0.9235. That is, the model combination slightly outperforms both models, including the Gamma, which in the case of expansion was the best minimizer.

| Models | Expansion 2001-03 |
|---|---|
| $w^{ki}_g G + w^{ki}_n N$ | 0.7639 |
| G | 0.7749 |
| N | 0.9235 |
| $w_g G + w_n N$ | 0.8194 |
| $\hat{f}_{nIN}$ | 0.7927 |

Table VIII: Out-of-sample Results
Note: $w^{ki}_j$ and $w_j$ indicate the weight for model $j$ obtained as a function of $\widehat{KI}$ and as a free parameter respectively.

Another important comparison to carry out is the following. If the mixture of the Normal and Gamma distributions is estimated in-sample, with the weights estimated as free parameters, how does this mixture perform with respect to the model combination, where the weights are a function of model misspecification? The mixture that minimizes the distance from the nonparametric density estimated in-sample relative to expansion is $0.4863\,N(0.0006, 0.0067^2) + 0.5137\,G(1.0518, 127.996)$, and it delivers a $\widehat{KI}$ equal to 0.0037, which is the smallest value obtained so far. However, the out-of-sample fit of this mixture is worse than the fit obtained by the model combination, since its distance from the nonparametric density estimated out-of-sample equals 0.8194. Hence, while increasing the number of parameters leads to a better in-sample fit, it gives worse

out-of-sample results. By contrast, when the weights are not unrestricted parameters but are a function of model misspecification, the out-of-sample fit seems to be more robust. This result about the less than excellent out-of-sample performance of models that involve the estimation of a large number of unrestricted parameters is not uncommon (see for example Stock and Watson (1999), J.H. Wright (2003) and Cogley, Morozov and Sargent (2003)), even though there is no definitive explanation for it.

To stress this last point further, I also check the out-of-sample performance of the nonparametric density estimated in-sample. The reason for this check should be clear if we think of the nonparametric density as an infinite-dimensional parametric alternative. As such, in-sample it represents the benchmark model, but what about its characterization of the data out-of-sample? The answer is in line with the observation that highly parametrized models do not necessarily perform well out-of-sample. In fact, as shown in Table VIII, the KI between the nonparametric fit obtained in-sample and the nonparametric fit out-of-sample is equal to 0.7927, which is somewhat worse than the model combination.

Are all these results further corroborated using a larger out-of-sample dataset (i.e. 2506 observations rather than 479)? To verify the stability of the results I have redone the estimation using as in-sample data the stock returns during all the expansions from February 1961 to June 1990, and as out-of-sample data the stock returns from March 1991 to March 2001, which represents the longest expansion period available.
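Each in-sample and out-of-sample comparison above reduces to evaluating the KI between a kernel density estimate and a parametric candidate. The sketch below shows one minimal way to compute the sample-average form of that distance in Python. It is an illustration under stated assumptions, not the paper's code: the bandwidth uses Silverman's rule of thumb rather than the least-squares cross-validation used in the paper, the data are simulated rather than S&P500 returns, and all function names (`gaussian_kde`, `silverman_bandwidth`, `ki_hat`, `normal_logpdf`) are hypothetical.

```python
import math
import random


def gaussian_kde(x, data, h):
    """Gaussian-kernel density estimate at point x with bandwidth h."""
    n = len(data)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)


def silverman_bandwidth(data):
    """Rule-of-thumb bandwidth (assumption: stands in for cross-validation)."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return 1.06 * sd * n ** (-0.2)


def normal_logpdf(x, mu, sigma):
    """Log density of the Normal candidate model."""
    return -math.log(sigma * math.sqrt(2.0 * math.pi)) - 0.5 * ((x - mu) / sigma) ** 2


def ki_hat(data, logpdf, h):
    """Sample average of ln f_n(x_i) - ln f_theta(x_i) over the observations."""
    total = 0.0
    for x in data:
        total += math.log(gaussian_kde(x, data, h)) - logpdf(x)
    return total / len(data)


# Simulated returns (illustrative only; not the S&P500 series used in the paper).
rng = random.Random(0)
data = [rng.gauss(0.0004, 0.009) for _ in range(300)]

m = sum(data) / len(data)
s = math.sqrt(sum((x - m) ** 2 for x in data) / (len(data) - 1))
h = silverman_bandwidth(data)

# KI of a Normal centered at the sample mean versus a deliberately shifted one.
ki_fitted = ki_hat(data, lambda x: normal_logpdf(x, m, s), h)
ki_shifted = ki_hat(data, lambda x: normal_logpdf(x, m + 3.0 * s, s), h)
```

In this sketch the misplaced Normal always yields the larger distance: because both candidates share the same scale and the first is centered at the sample mean, `ki_shifted - ki_fitted` equals 4.5 up to rounding. Ranking candidate models by this quantity is the step that precedes the weighting.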

| Models | Expansion 1961-90 |
|---|---|
| $w^{ki}_g G + w^{ki}_n N$ | 0.0983 |
| G | 0.1995 |
| N | 0.5249 |
| $w_g G + w_n N$ | 0.0359 |

Table IX: In-sample Results

In this case, the new double Gamma $G(1.1139, 436.15)$ achieves a $\widehat{KI}$ equal to 0.1995, receiving a weight of 0.6555, while the Normal $N(0.0002, (0.0028)^2)$ obtains a $\widehat{KI}$ that equals 0.5249, receiving a weight of 0.3445. The Kullback-Leibler distance between the nonparametric density estimate and the model combination equals 0.0983, once more a loss about half the size of that of the best minimizer. On the other hand, the best mixture in-sample is given by $0.4355\,N(0.003, 0.0024^2) + 0.5645\,G(1.05, 349.68)$, and it delivers a KI equal to 0.0359, which is about one third of the distance achieved by the model combination.

| Models | Expansion 1991-01 |
|---|---|
| $w^{ki}_g G + w^{ki}_n N$ | 0.7786 |
| G | 0.8838 |
| N | 0.9714 |
| $w_g G + w_n N$ | 0.7562 |
| $\hat{f}_{nIN}$ | 0.7637 |

Table X: Out-of-sample Results

The out-of-sample results, on the other hand, confirm the previous findings only partially. It still holds true that the model combination outperforms the best in-sample minimizer: its KI is equal to 0.7786 while the Gamma's KI equals 0.8838. However, the mixture delivers a distance from the nonparametric fit that equals 0.7562, which, in contrast to the previous out-of-sample results, is marginally better than the KI achieved by the model combination. Further, even the nonparametric

fit attains a KI smaller than that of the model combination: 0.7637 versus 0.7786. These results are not surprising if we think about the large number of observations in this out-of-sample exercise. Nevertheless, it is striking that a parsimonious model like the model combination does not perform much worse than these richer models. Based on both out-of-sample exercises, it is possible to conclude that the use of model combination, where the weights are a function of the uncertainty about the true model, can provide a useful forecasting tool.

5 Conclusions

This paper proposes a method to estimate the probability density of a random variable of interest in the presence of model ambiguity. The first step consists in estimating and ranking the candidate parametric models by minimizing the Kullback-Leibler information between the nonparametric fit and the parametric fit. In the second step, the information content of the KI is used to determine the weights in the model combination, even when the true structure does not necessarily belong to the set of candidate models.

This approach has the following features. First, it provides an explicit representation of model uncertainty exploiting the models' misspecification. Second, it overcomes the need for a specific prior over the set of models and over the parameters belonging to each of the models under consideration. Finally, it is computationally extremely easy.

To implement the model combination, using the technical machinery provided by previous studies on nonparametric entropy-based testing, I derive the asymptotic distribution of the Kullback-Leibler information between the nonparametric density and the candidate parametric model. Since

the approximation error affects the asymptotic mean of the KI's distribution, the latter varies with the underlying parametric model. Then, to determine the same distribution for all candidate models, employing an assumption technically equivalent to a Pitman alternative, I center the resulting Normal on the average performance of all plausible models. Consequently, the weights in the model combination are determined by the probability of obtaining a performance worse than that actually achieved, relative to that attained on average by the other competing models.

The empirical application to daily stock returns indicates that, during the phases of expansion, the best model is the double Gamma distribution, while during the phases of recession it is the Gaussian distribution. Moreover, the combination of the Normal and the double Gamma, according to the weights obtained with the described methodology, outperforms in- and out-of-sample all candidate models including the best single model. This result may be due to the fact that none of the candidate models is the true structure; as such, the model combination, being a higher dimensional parametric alternative, is able to approximate the data more closely. However, this explanation is not complete. The mixture of models where the weights are estimated as free parameters, even though it is characterized by the same number of parameters, does not perform like the model combination. Most likely, the information contained in model misspecification, when embodied in the weights of the model combination, can improve the robustness of the results to future mistakes.

This suggests that in decision contexts characterized by high uncertainty, such that it can be hard to form specific priors, to conceive an exhaustive set of all possible models and/or to use

the true complex structure, the proposed approach can provide a better hedge against the lack of knowledge of the correct model. Additionally, this methodology can also be used to form priors in a training sample, before applying more sophisticated Bayesian averaging techniques.

This approach can be further extended to conditional distributions to address more challenging and complex prediction problems. I leave this problem to future research.

6 Appendix

6.1 Proof

Theorem 1: $\widehat{KI}$ can be rewritten in the following way:

$$\widehat{KI} = \int_x \left(\ln \hat{f}_n(x) - \ln f_{\hat{\theta}}(x)\right)dF_n(x) = \int_x \left(\ln \hat{f}_n(x) - \ln g(x)\right)dF_n(x) - \int \left(\ln f_{\hat{\theta}}(x) - \ln g(x)\right)dF_n(x) = \widehat{KI}_1 - \widehat{KI}_2. \qquad (19)$$

Similarly to Fan (1994), this representation is very helpful to examine the effect of estimating $f_{\theta^*}$ by $f_{\hat{\theta}}$ on the limiting distribution of $\widehat{KI}$. From now on the index $j$ for the single model will be omitted.

I start by examining the limiting distribution of $\widehat{KI}_1 = \frac{1}{n}\sum_i \ln\left(\frac{\hat{f}_n(x_i)}{g(x_i)}\right)$, which by the Law of Large Numbers (LLN) can be considered a good approximation of $E(\ln \hat{f}_n(x) - \ln g(x)) = KI_1$. This first part of the proof draws heavily upon Hall (1984) and Hong and White (2000).

Using the inequality $\left|\ln(1+u) - u + \frac{1}{2}u^2\right| \le |u|^3$ for $|u| < 1$ and defining $u = \frac{\hat{f}_n(x) - g(x)}{g(x)} = \frac{\hat{f}_n(x)}{g(x)} - 1$, we obtain the following result:

$$\left|\frac{1}{n}\sum_i \ln\left(\frac{\hat{f}_n(x_i)}{g(x_i)}\right) - \frac{1}{n}\sum_i\left(\frac{\hat{f}_n(x_i) - g(x_i)}{g(x_i)}\right) + \frac{1}{2n}\sum_i\left(\frac{\hat{f}_n(x_i) - g(x_i)}{g(x_i)}\right)^2\right| \le \sum_i |u|^3. \qquad (20)$$

We can drop the absolute value because of Markov's inequality; see the proof of Lemma 3.1 in Hong-White

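(2000). Let us define

$$\hat{V}_{1n} = \frac{1}{n}\sum_i\left(\frac{\hat{f}_n(x_i) - g(x_i)}{g(x_i)}\right)$$

and

$$\hat{V}_{2n} = \frac{1}{n}\sum_i\left(\frac{\hat{f}_n(x_i) - g(x_i)}{g(x_i)}\right)^2.$$

By Lemma 3.1 of Hong-White (2000), under assumptions A1 and A2, $nh^4/\ln n \to \infty$, $h \to 0$. Then:

$$\widehat{KI}_1 = \hat{V}_{1n} - \frac{1}{2}\hat{V}_{2n} + O_p\left(n^{-3/2}h^{-3}\ln n + h^6\right). \qquad (21)$$

Now we have to analyze the terms $\hat{V}_{1n}$ and $\hat{V}_{2n}$. Let us define $f(x_i) = h^{-1}\int K\left(\frac{x - x_i}{h}\right)g(x)\,dx$ and

$$a_n(x_i,x_j) = \frac{h^{-1}K\left(\frac{x_i - x_j}{h}\right) - h^{-1}\int K\left(\frac{x - x_i}{h}\right)g(x)\,dx}{g(x_i)},$$

$$b_n(x_i) = \frac{h^{-1}\int K\left(\frac{x - x_i}{h}\right)g(x)\,dx - g(x_i)}{g(x_i)}.$$

Then

$$\hat{V}_{1n} = \frac{1}{n}\sum_i\left[\frac{\hat{f}_n(x_i) - f(x_i)}{g(x_i)} + \frac{f(x_i) - g(x_i)}{g(x_i)}\right] = \frac{1}{n(n-1)}\sum_i\sum_{j \neq i}a_n(x_i,x_j) + \frac{1}{n}\sum_i b_n(x_i) = \hat{V}_{11n} + \hat{B}_n, \qquad (22)$$

where $\hat{V}_{11n}$ is a second order U-statistic and it will affect the asymptotic distribution of $\widehat{KI}_1$. Similarly to Hall (1984), let us rewrite $\hat{V}_{11n}$ in the following way:

$$\hat{V}_{11n} = \frac{1}{n(n-1)}\sum_i\sum_{j \neq i}H_{1n}(x_i,x_j),$$

$$H_{1n}(x_i,x_j) = \frac{1}{2h}\left[\frac{K\left(\frac{x_j - x_i}{h}\right) - \int K\left(\frac{x - x_i}{h}\right)g(x)\,dx}{g(x_i)} + \frac{K\left(\frac{x_i - x_j}{h}\right) - \int K\left(\frac{x - x_j}{h}\right)g(x)\,dx}{g(x_j)}\right] \equiv J_n(x_i,x_j) + J_n(x_j,x_i). \qquad (23)$$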

Since $E(H_{1n}(x_i,x_j) \mid x_i) = 0$, using Theorem 1 in Hall (1984) we can show that

$$\hat{V}_{11n} = \left[\frac{1}{n(n-1)}\sum_i\sum_{j \neq i}H_{1n}(x_i,x_j)\right] \Big/ \left\{\frac{2}{n^2}E\left[H_{1n}^2(x_i,x_j)\right]\right\}^{1/2} \to_d N(0,1). \qquad (24)$$

$$E\left[J_n^2(x_i,x_j)\right] = \frac{1}{4h^2}\int\int\frac{\left(K\left(\frac{x_j - x_i}{h}\right) - \int K\left(\frac{x - x_i}{h}\right)g(x)\,dx\right)^2}{g^2(x_i)}\,g(x_i)g(x_j)\,dx_i\,dx_j;$$

applying a change of variable from $(x_i,x_j)$ to $(x_i,u)$, where $u = \frac{x_j - x_i}{h}$, we get the following expression:

$$= \frac{1}{4h}\int\int\frac{K^2(u) + \left[h\int K(u)g(x_i + hu)\,du\right]^2 - 2K(u)\left[h\int K(u)g(x_i + hu)\,du\right]}{g^2(x_i)}\,g(x_i)g(x_i + hu)\,dx_i\,du$$

$$= \frac{1}{4h}\int K^2(u)\,du + o\left(\frac{1}{h}\right) = O\left(\frac{1}{h}\right). \qquad (25)$$

Similarly we can show that

$$E\left[J_n(x_i,x_j)J_n(x_j,x_i)\right] = \frac{1}{4h}\int K^2(u)\,du + o\left(\frac{1}{h}\right) = O\left(\frac{1}{h}\right). \qquad (26)$$

Then it follows that

$$E\left[H_{1n}^2(x_i,x_j)\right] = E\left[2J_n^2(x_i,x_j) + 2J_n(x_i,x_j)J_n(x_j,x_i)\right] = \frac{1}{h}\int K^2(u)\,du + o\left(\frac{1}{h}\right) = O\left(\frac{1}{h}\right), \qquad (27)$$

and

$$\sigma_{1n}^2 = \frac{2}{n^2h}\int K^2(u)\,du + o\left(\frac{1}{h}\right). \qquad (28)$$

The second term in (22) is the expected value of a Bias term, that is

$$\hat{B}_n = \frac{1}{n}\sum_i b_n(x_i) \simeq \frac{h^2}{2}\mu_2\int g^{(2)}(x)\,dx + o(h^2), \qquad (29)$$

where $g^{(2)}(x)$ is the second derivative of the p.d.f. and $\mu_2 = \int u^2k(u)\,du$. Hence $\hat{B}_n = O_p\left(n^{-1/2}h^2\right)$. Thus, what we obtain is

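$$\hat{V}_{1n} = \hat{V}_{11n} + \hat{B}_n \sim \sigma_{1n}N(0,1) + \frac{h^2}{2}\mu_2\int g^{(2)}(x)\,dx + o(h^2). \qquad (30)$$

$$\hat{V}_{2n} = \frac{1}{n}\sum_i\left[\frac{\hat{f}_n(x_i) - f(x_i)}{g(x_i)} + \frac{f(x_i) - g(x_i)}{g(x_i)}\right]^2 =$$

$$\frac{1}{n}\sum_i\left[\frac{\hat{f}_n(x_i) - f(x_i)}{g(x_i)}\right]^2 + \frac{1}{n}\sum_i\left[\frac{f(x_i) - g(x_i)}{g(x_i)}\right]^2 + \frac{2}{n}\sum_i\left(\frac{\hat{f}_n(x_i) - f(x_i)}{g(x_i)}\right)\left(\frac{f(x_i) - g(x_i)}{g(x_i)}\right) \qquad (31)$$

$$= \hat{V}_{21n} + \hat{V}_{22n} + 2\hat{V}_{23n}. \qquad (32)$$

$$\hat{V}_{21n} = \frac{1}{n}\sum_i\left[\frac{1}{n-1}\sum_{j \neq i}a_n(x_i,x_j)\right]^2 = \frac{1}{n(n-1)^2}\sum_i\sum_{j \neq i}a_n^2(x_i,x_j) + \frac{2}{n(n-1)}\sum_i\sum_{j \neq i}\sum_{z \neq j}a_n(x_i,x_j)a_n(x_i,x_z). \qquad (33)$$

The first term is a variance term and it will affect the mean of the asymptotic distribution. As $n \to \infty$, by Lemma 2 of Hall (1984) the first term of $\hat{V}_{21n}$ is given by:

$$\frac{1}{n(n-1)^2}\sum_i\sum_{j \neq i}a_n^2(x_i,x_j) = \sigma_n^2 + O_p\left(n^{-3/2}h^{-1}\right), \qquad (34)$$

where $\sigma_n^2 = \frac{1}{2}n\,\sigma_{1n}^2$.

The second term equals a twice centered degenerate U-statistic $\hat{U}_n$, which is of the same order of magnitude as $\hat{V}_{11n}$ and also affects the asymptotic distribution of $\widehat{KI}_1$:

$$2\hat{U}_n = \frac{2}{n(n-1)}\sum_i\sum_{j \neq i}\int a_n(x_j,x)a_n(x_i,x)g(x)\,dx = \frac{2}{n(n-1)}\sum_i\sum_{j \neq i}H_{2n}(x_i,x_j), \qquad (35)$$

$$H_{2n}(x_i,x_j) = \frac{1}{h^2}\int\left[\frac{K\left(\frac{x_j - x_i}{h}\right) - \int K\left(\frac{x_j - x_i}{h}\right)g(x_j)\,dx_j}{g(x_i)}\right]\left[\frac{K\left(\frac{x_z - x_i}{h}\right) - \int K\left(\frac{x_z - x_i}{h}\right)g(x_z)\,dx_z}{g(x_i)}\right]g(x_i)\,dx_i.$$

$$E\left[H_{2n}^2(x_i,x_j)\right] = \frac{1}{h^4}E\left[\int\left(\frac{K\left(\frac{x_j - x_i}{h}\right) - \int K\left(\frac{x_j - x_i}{h}\right)g(x_j)\,dx_j}{g(x_i)}\right)\left(\frac{K\left(\frac{x_z - x_i}{h}\right) - \int K\left(\frac{x_z - x_i}{h}\right)g(x_z)\,dx_z}{g(x_i)}\right)g(x_i)\,dx_i\right]^2$$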

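$$= \frac{1}{h^4}\int\int\left[\int\left(\frac{K\left(\frac{x_j - x_i}{h}\right) - \int K\left(\frac{x_j - x_i}{h}\right)g(x_j)\,dx_j}{g(x_i)}\right)\left(\frac{K\left(\frac{x_z - x_i}{h}\right) - \int K\left(\frac{x_z - x_i}{h}\right)g(x_z)\,dx_z}{g(x_i)}\right)g(x_i)\,dx_i\right]^2 g(x_j)g(x_z)\,dx_j\,dx_z$$

$$= \frac{1}{h^4}\int\int\left[\int\frac{K\left(\frac{x_j - x_i}{h}\right)K\left(\frac{x_z - x_i}{h}\right)}{g^2(x_i)}g(x_i)\,dx_i\right]^2 g(x_j)g(x_z)\,dx_j\,dx_z + o\left(\frac{1}{h}\right)$$

$$= \frac{1}{h^4}\int\int\left[h\int\frac{K(u)K(u+v)}{g(x_j + hu)}\,du\right]^2 g(x_j)g(x_j + hu - hz)\,dx_j\,h\,dv + o\left(\frac{1}{h}\right) = \frac{1}{h}\int\frac{1}{g^2(x_j)}\left[\int K(u)K(u+v)\,du\right]^2 g^2(x_j)\,dx_j\,dv$$

$$= h^{-1}\int\left[\int K(u)K(u+v)\,du\right]^2 dv + o\left(\frac{1}{h}\right). \qquad (36)$$

By Lemma 3 in Hall (1984), $\hat{U}_n$ is then asymptotically Normally distributed $N(0,\sigma_{2n}^2)$, where

$$\sigma_{2n}^2 \simeq 2n^{-2}h^{-1}\int\left[\int K(u)K(u+v)\,du\right]^2 dv. \qquad (37)$$

Finally we have that

$$\hat{V}_{21n} \sim \sigma_n^2 + O_p\left(n^{-3/2}h^{-1}\right) + \sqrt{2}\,\sigma_{2n}N(0,1). \qquad (38)$$

$\hat{V}_{22n} = \frac{1}{n}\sum_i\left[\frac{f(x_i) - g(x_i)}{g(x_i)}\right]^2 = \frac{1}{n}\sum_i b_n^2(x_i)$, which is a purely deterministic Bias-squared term, and it will affect the mean of the asymptotic distribution. That is,

$$\frac{1}{n}\sum_i b_n^2 = \frac{h^4}{4}\mu_2^2\int\frac{\left(g^{(2)}(x)\right)^2}{g(x)}\,dx + o(h^4). \qquad (39)$$

Finally we can analyze $\hat{V}_{23n}$:

$$2\hat{V}_{23n} = \frac{2}{n}\sum_i\left(\frac{\hat{f}_n(x_i) - f(x_i)}{g(x_i)}\right)\left(\frac{f(x_i) - g(x_i)}{g(x_i)}\right) = \frac{2}{n(n-1)}\sum_i H_{3n}(x_i,x_j), \qquad (40)$$

and, similarly to Hall (1984), define

$$H_{3n}(x_i,x_j) = \sum_j a_n(x_i,x_j)b_n(x_i) = \frac{1}{h}\int\left[\frac{K\left(\frac{x - x_i}{h}\right) - \int K\left(\frac{x_j - x_i}{h}\right)g(x_j)\,dx_j}{g(x_i)}\right]\left(\frac{f(x_i) - g(x_i)}{g(x_i)}\right)dx_i. \qquad (41)$$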

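Under assumptions A1 and A2 and given that $E H_{3n} = 0$, by Lemma 1 in Hall (1984) we have that $2\hat{V}_{23n}$ is asymptotically normally distributed with zero mean and variance given by:

$$\sigma_{3n}^2 \simeq 2n^{-1}h^4\mu_2^2\left[\int\frac{\left(g^{(2)}(x_i)\right)^2}{g(x_i)}\,dx_i - \left(\int g^{(2)}(x_i)\,dx_i\right)^2\right], \qquad (42)$$

which can be easily seen if we consider that $\frac{f(x_i) - g(x_i)}{g(x_i)} = h^2\mu_2\frac{g^{(2)}(x_i)}{g(x_i)}$ and that

$$E H_{3n}^2 = h^4\mu_2^2\left[\int\frac{\left(g^{(2)}(x_i)\right)^2}{g(x_i)}\,dx_i - \left(\int g^{(2)}(x_i)\,dx_i\right)^2\right].$$

This term, too, will affect the asymptotic distribution of $\widehat{KI}_1$.

To summarize all previous steps, we can rewrite the expansion of $\widehat{KI}_1$ in the following way:

$$\widehat{KI}_1 = \hat{V}_{11n} + \hat{B}_n - \frac{1}{2}\left(\hat{V}_{21n} + \hat{V}_{22n} + 2\hat{V}_{23n}\right) \sim \qquad (43)$$

$$N(0,\sigma_{1n}^2) + \frac{h^2}{2}\mu_2\int g^{(2)}(x)\,dx + o(h^2) - \frac{1}{2}\left(\sigma_n^2 + O_p\left(n^{-3/2}h^{-1}\right) + 2N(0,\sigma_{2n}^2) + \frac{h^4}{4}\mu_2^2\int\frac{\left(g^{(2)}(x)\right)^2}{g(x)}\,dx + o(h^4) + 2N(0,\sigma_{3n}^2)\right).$$

Once more, following Hall (1984), from the definition of $\hat{V}_{21n}$ and the fact that $nh \to \infty$, we have that the difference between $\frac{1}{n(n-1)}\sum_i\sum_{j \neq i}a_n^2(x_i,x_j)$ and $\sigma_n^2$ is negligible w.r.t. $2\hat{U}_n$; hence the previous expression can be rewritten as follows:

$$\widehat{KI}_1 \sim (nh^{1/2})^{-1}\sqrt{2}\,\sigma_1N_1 - (nh^{1/2})^{-1}\sqrt{2}\,\sigma_2N_2 - n^{-1/2}h^2\sqrt{2}\,\sigma_3N_3 + \hat{B}_n - \frac{1}{2}c_n, \qquad (44)$$

where $N_1$, $N_2$ and $N_3$ are asymptotically normal $N(0,1)$; and

$$\sigma_1 = \int K^2(u)\,du, \quad \sigma_2 = \int\left[\int K(u)K(u+v)\,du\right]^2 dv \quad \text{and} \quad \sigma_3 = \mu_2^2\left[\int\frac{\left(g^{(2)}(x_i)\right)^2}{g(x_i)}\,dx_i - \left(\int g^{(2)}(x_i)\,dx_i\right)^2\right],$$

$$\text{and} \quad c_n = (nh)^{-1}\int K^2(u)\,du + \frac{h^4}{4}\mu_2^2\int\frac{\left(g^{(2)}(x)\right)^2}{g(x)}\,dx + o\left(n^{-1}h^{-1} + h^4\right). \qquad (45)$$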

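It is important to notice that $\hat{B}_n$, which is $O_p(n^{-1/2}h^2)$, will asymptotically cancel out with $n^{-1/2}h^2\sqrt{2}\,\sigma_3N_3$, since they are of the same order of magnitude. Thus, we have the following result: as $n \to \infty$, $h \to 0$, $nh \to \infty$ and $nh^5 \to 0$,

$$nh^{1/2}\left(\widehat{KI}_1 + \frac{1}{2}c_n\right) \to_d \sqrt{2}\,\sigma_1N_1 - \sqrt{2}\,\sigma_2N_2.$$

Since $aN(0,1) + bN(0,1)$ can be proved to be asymptotically normal $N(0, a^2 + b^2)$, we have that $nh^{1/2}\left(\widehat{KI}_1 + \frac{1}{2}c_n\right) \to \sqrt{2}(\sigma_1 - \sigma_2)N(0,1)$.

Let us now examine the term

$$\widehat{KI}_2 = \int\left(\ln f_{\hat{\theta}}(x) - \ln g(x)\right)dF_n(x) = \int\left(\ln f_{\hat{\theta}}(x_i) - \log f_{\theta^*}(x_i) + \log f_{\theta^*}(x_i) - \ln g(x_i)\right)dF_n(x_i).$$

We start by examining the limiting distribution of

$$\widehat{KI}_2 = \frac{1}{n}\sum_{i=1}\left(\log f_{\hat{\theta}}(x_i) - \log f_{\theta^*}(x_i)\right)\hat{f}_n(x_i) + \frac{1}{n}\sum_{i=1}\left(\log f_{\theta^*}(x_i) - \log g(x_i)\right)\hat{f}_n(x_i) = \widehat{KI}_{21} + \widehat{KI}_{22}, \qquad (46)$$

which, similarly to $\widehat{KI}_1$, by the LLN can be considered a good approximation of $E(\ln f_{\hat{\theta}}(x) - \ln g(x))$. This part of the proof is based mainly on Zheng (1996). Employing the same expansion used for $\widehat{KI}_1$, where now $u = \frac{f_{\hat{\theta}}(x_i) - f_{\theta^*}(x_i)}{f_{\theta^*}(x_i)}$:

$$\frac{1}{n}\sum_{i=1}\log\left(\frac{f_{\hat{\theta}}(x_i)}{f_{\theta^*}(x_i)}\right) \simeq \frac{1}{n}\sum_{i=1}\frac{f_{\hat{\theta}}(x_i) - f_{\theta^*}(x_i)}{f_{\theta^*}(x_i)} - \frac{1}{2n}\sum_{i=1}\left(\frac{f_{\hat{\theta}}(x_i) - f_{\theta^*}(x_i)}{f_{\theta^*}(x_i)}\right)^2,$$

we can rewrite $\widehat{KI}_{21}$ in the following way:

$$\widehat{KI}_{21}(f_{\hat{\theta}},f_{\theta^*}) \simeq \frac{1}{n}\sum_{i=1}\left(\frac{f_{\hat{\theta}}(x_i) - f_{\theta^*}(x_i)}{f_{\theta^*}(x_i)}\right)\hat{f}_n(x_i) - \frac{1}{2n}\sum_{i=1}\left(\frac{f_{\hat{\theta}}(x_i) - f_{\theta^*}(x_i)}{f_{\theta^*}(x_i)}\right)^2\hat{f}_n(x_i) = I_{n1} - \frac{1}{2}I_{n2}. \qquad (47)$$

Applying the mean value theorem to $f_{\hat{\theta}}(x_i)$ we obtain:

$$f_{\hat{\theta}}(x_i) - f_{\theta^*}(x_i) \cong \frac{\partial f_{\theta^*}(x_i)}{\partial\theta'}(\hat{\theta} - \theta^*) + \frac{1}{2}(\hat{\theta} - \theta^*)'\frac{\partial^2 f_{\bar{\theta}}(x_i)}{\partial\theta\,\partial\theta'}(\hat{\theta} - \theta^*),$$

where $\bar{\theta}$ lies between $\hat{\theta}$ and $\theta^*$.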

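Thus,

$$I_{n1} = \frac{1}{n}\sum_{i=1}\frac{\hat{f}_n(x_i)}{f_{\theta^*}(x_i)}\left(f_{\hat{\theta}}(x_i) - f_{\theta^*}(x_i)\right) \simeq \qquad (48)$$

$$\frac{1}{n}\sum_i\frac{\hat{f}_n(x_i)}{f_{\theta^*}(x_i)}\frac{\partial f_{\theta^*}(x_i)}{\partial\theta'}(\hat{\theta} - \theta^*) + \frac{1}{2n}\sum_i(\hat{\theta} - \theta^*)'\frac{\hat{f}_n(x_i)}{f_{\theta^*}(x_i)}\frac{\partial^2 f_{\bar{\theta}}(x_i)}{\partial\theta\,\partial\theta'}(\hat{\theta} - \theta^*) =$$

$$\frac{1}{n(n-1)}\sum_i\sum_j\frac{1}{h}K\left(\frac{x_j - x_i}{h}\right)\frac{\partial f_{\theta^*}(x_i)/\partial\theta}{f_{\theta^*}(x_i)}(\hat{\theta} - \theta^*) + (\hat{\theta} - \theta^*)'\frac{1}{2n(n-1)}\sum_i\sum_j\frac{1}{h}K\left(\frac{x_j - x_i}{h}\right)\frac{\partial^2 f_{\bar{\theta}}(x_i)/\partial\theta\,\partial\theta'}{f_{\theta^*}(x_i)}(\hat{\theta} - \theta^*) =$$

$$S_{1n}(\hat{\theta} - \theta^*) + (\hat{\theta} - \theta^*)'S_{2n}(\hat{\theta} - \theta^*). \qquad (49)$$

It can be noticed that the U-statistic form of $S_{1n}$ is the same as that of $U_n$ defined in Theorem 2 of D'Amico (2003a)¹⁴. It follows that $S_{1n} = O_p\left(\frac{1}{\sqrt{n}}\right)$.

$$E(S_{2n}) = \frac{1}{2n(n-1)}\sum_i\sum_j E\left[\frac{1}{h}K\left(\frac{x_j - x_i}{h}\right)\frac{\partial^2 f_{\bar{\theta}}(x_i)/\partial\theta\,\partial\theta'}{f_{\theta^*}(x_i)}\right], \qquad (50)$$

$$E\left[\frac{1}{h}K\left(\frac{x_j - x_i}{h}\right)\frac{\partial^2 f_{\bar{\theta}}(x_i)/\partial\theta\,\partial\theta'}{f_{\theta^*}(x_i)}\right] = \int\int\frac{1}{h}K\left(\frac{x_j - x_i}{h}\right)\frac{\partial^2 f_{\bar{\theta}}(x_i)/\partial\theta\,\partial\theta'}{f_{\theta^*}(x_i)}g(x_i)g(x_j)\,dx_i\,dx_j =$$

$$\int\int K(u)\frac{\partial^2 f_{\bar{\theta}}(x_i)/\partial\theta\,\partial\theta'}{f_{\theta^*}(x_i)}g(x_i)g(x_i + hu)\,dx_i\,du. \qquad (51)$$

Similarly to Dimitriev-Tarasenko (1973), applying the Cauchy-Schwartz inequality we obtain that

$$\limsup_{n \to \infty}E(S_{2n}) \le \int\frac{\partial^2 f_{\bar{\theta}}(x_i)/\partial\theta\,\partial\theta'}{f_{\theta^*}(x_i)}g^2(x)\,dx; \qquad (52)$$

then

$$E(\|S_{2n}\|) \le \int\int K(u)\left\|\frac{\partial^2 f_{\bar{\theta}}(x_i)/\partial\theta\,\partial\theta'}{f_{\theta^*}(x_i)}\right\|g(x_i)g(x_i + hu)\,dx_i\,du = O(1).$$

Thus, we have that $S_{2n} = O_p(1)$. Taking into account that $\sqrt{n}(\hat{\theta} - \theta^*) = O_p(1)$, which in turn implies that $(\hat{\theta} - \theta^*) = O_p\left(\frac{1}{\sqrt{n}}\right)$, it follows that $I_{n1} = S_{1n}(\hat{\theta} - \theta^*) + (\hat{\theta} - \theta^*)'S_{2n}(\hat{\theta} - \theta^*)$ is equal to

¹⁴ The appendix of this paper is available upon request.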

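$$I_{n1} = O_p\left(\frac{1}{\sqrt{n}}\right) * O_p\left(\frac{1}{\sqrt{n}}\right) + O_p\left(\frac{1}{\sqrt{n}}\right) * O_p(1) * O_p\left(\frac{1}{\sqrt{n}}\right) = O_p\left(\frac{1}{n}\right). \qquad (53)$$

Now we have to consider $I_{n2}$:

$$I_{n2} = \frac{1}{n}\sum_{i=1}\left(\frac{f_{\hat{\theta}}(x_i) - f_{\theta^*}(x_i)}{f_{\theta^*}(x_i)}\right)^2\hat{f}_n(x_i) \simeq (\hat{\theta} - \theta^*)'\frac{1}{n(n-1)}\sum_i\sum_j\frac{1}{h}K\left(\frac{x_j - x_i}{h}\right)\frac{\partial\ln f_{\theta}(x_i)}{\partial\theta}\frac{\partial\ln f_{\theta}(x_i)}{\partial\theta'}(\hat{\theta} - \theta^*) \qquad (54)$$

$$= (\hat{\theta} - \theta^*)'S_{3n}(\hat{\theta} - \theta^*). \qquad (55)$$

Similarly to $S_{2n}$, it can be shown that $S_{3n}$ is $O_p(1)$. It follows that

$$I_{n2} = O_p\left(\frac{1}{\sqrt{n}}\right) * O_p(1) * O_p\left(\frac{1}{\sqrt{n}}\right) = O_p\left(\frac{1}{n}\right). \qquad (56)$$

Finally, we get that:

$$\widehat{KI}_{21}(f_{\hat{\theta}},f_{\theta^*}) \simeq I_{n1} - \frac{1}{2}I_{n2} = O_p\left(\frac{1}{n}\right) - \frac{1}{2}O_p\left(\frac{1}{n}\right) = O_p\left(\frac{1}{n}\right),$$

then it follows that

$$(nh^{1/2})\widehat{KI}_{21}(f_{\hat{\theta}},f_{\theta^*}) = (nh^{1/2})\,O_p\left(\frac{1}{n}\right) = O_p(h^{1/2}) \to_p 0. \qquad (57)$$

Now, the same expansion used for $\widehat{KI}_{21}$ can be applied to $\widehat{KI}_{22}(f_{\theta^*},g)$:

$$\widehat{KI}_{22}(f_{\theta^*},g) \cong \frac{1}{n}\sum_{i=1}^n\left(\frac{f_{\theta^*}(x_i) - g(x_i)}{g(x_i)}\right)\hat{f}_n(x_i) - \frac{1}{2n}\sum_{i=1}^n\left(\frac{f_{\theta^*}(x_i) - g(x_i)}{g(x_i)}\right)^2\hat{f}_n(x_i) = J_{n1} - \frac{1}{2}J_{n2}, \qquad (58)$$

$$E(J_{n1}(f_{\theta^*},g)) = E\left(\int\left(\frac{f_{\theta^*}(x_i) - g(x_i)}{g(x_i)}\right)\hat{f}_n(x_i)g(x_i)\,dx_i\right) = \int\int K(u)\left(f_{\theta^*}(x) - g(x)\right)g(x + hu)\,dx\,du. \qquad (59)$$

Applying the same steps used for $S_{2n}$ we can show that

$$\limsup_{n \to \infty}E(J_{n1}(f_{\theta^*},g)) \le \int\left(f_{\theta^*}(x) - g(x)\right)g(x)\,dx = E\left(f_{\theta^*}(x) - g(x)\right)$$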

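Moreover,

\[
E\left(\|J_{n1}\|\right)\le\int\!\!\int K(u)\left\|f_{\theta^*}(x)-g(x)\right\|g(x+hu)\,dx\,du=O(1).
\]

It follows that $J_{n1}(f_{\theta^*},g)=O_p(1)$. Repeating the same steps once more for $J_{n2}(f_{\theta^*},g)$ we obtain:

\[
E\!\left(\frac{1}{n}\sum_{i=1}^{n}\left(\frac{f_{\theta^*}(x_i)-g(x_i)}{g(x_i)}\right)^{2}\hat{f}_n(x_i)\right)
=E\!\left(\int\left(\frac{f_{\theta^*}(x_i)-g(x_i)}{g(x_i)}\right)^{2}\hat{f}_n(x_i)\,g(x_i)\,dx_i\right)
\]

\[
=E\!\left(\int\frac{\left(f_{\theta^*}(x_i)-g(x_i)\right)^{2}}{g(x_i)}\,\hat{f}_n(x_i)\,dx_i\right)
=\int\!\!\int K(u)\,\frac{\left(f_{\theta^*}(x)-g(x)\right)^{2}}{g(x)}\,g(x+hu)\,dx\,du,
\]

\[
\limsup_{n\to\infty}E\left(J_{n2}(f_{\theta^*},g)\right)\le\int\left(f_{\theta^*}(x)-g(x)\right)^{2}dx. \tag{60}
\]

Then also $J_{n2}(f_{\theta^*},g)=O_p(1)$. This implies that $\widehat{KI}_{22}(f_{\theta^*},g)=J_{n1}-\frac{1}{2}J_{n2}=O_p(1)$.

Then it is clear that, given assumptions A1-A5, if $h\to 0$ and $nh\to\infty$, then

\[
\widehat{KI}_{22}(f_{\theta^*},g)\xrightarrow{p}E\left(f_{\theta^*}(x)-g(x)\right)-\frac{1}{2}\int\left(f_{\theta^*}(x)-g(x)\right)^{2}dx, \tag{61}
\]

which implies that $nh^{1/2}\,\widehat{KI}_{22}\xrightarrow{p}\infty$; hence we need to rescale it by $d_n=n^{-1}h^{-1/2}$, where $d_n\to 0$ as $n\to\infty$. This is embodied in assumption A6, which implies:

\[
\widehat{KI}_{22}\simeq\alpha h^{1/2}c_n. \tag{62}
\]

Finally, we can put all terms together:

\[
\widehat{KI}=\int_x\left(\ln\hat{f}_n(x)-\ln f_{\hat{\theta}}(x)\right)\hat{f}_n(x)\,dx\cong\widehat{KI}_1-\widehat{KI}_2\sim
\]

\[
\left[(nh^{1/2})^{-1}\sqrt{2}\,\sigma_1N_1-(nh^{1/2})^{-1}\sqrt{2}\,\sigma_2N_2-\frac{1}{2}c_n\right]-\left[\widehat{KI}_{21}(f_{\hat{\theta}},f_{\theta^*})+\widehat{KI}_{22}(f_{\theta^*},g)\right]; \tag{63}
\]

since we showed that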

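\[
(nh^{1/2})\,\widehat{KI}_{21}(f_{\hat{\theta}},f_{\theta^*})\xrightarrow{p}0, \tag{64}
\]

the entire expression for $(nh^{1/2})\,\widehat{KI}$ can be approximated in the following way:

\[
(nh^{1/2})\left[(nh^{1/2})^{-1}\sqrt{2}\,\sigma_1N_1-(nh^{1/2})^{-1}\sqrt{2}\,\sigma_2N_2-\frac{1}{2}c_n-\left(J_{n1}-\frac{1}{2}J_{n2}\right)\right]. \tag{65}
\]

Thus, if $h\propto n^{-\beta}$ with $\beta>\frac{1}{5}$ and $c_n\simeq C(nh)^{-1}$,

\[
(nh^{1/2})\left(\widehat{KI}+\frac{1}{2}c_n\right)\sim\sqrt{2}\,\sigma_1N_1-\sqrt{2}\,\sigma_2N_2+\alpha C; \tag{66}
\]

then,

\[
(nh^{1/2})\left(\widehat{KI}+\frac{1}{2}c_n\right)\xrightarrow{d}N\!\left(\alpha C,\;2\left(\sigma_1^{2}-\sigma_2^{2}\right)\right). \tag{67}
\]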
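To make the statistic in (63) concrete, the sketch below evaluates $\widehat{KI}=\int(\ln\hat{f}_n-\ln f_{\hat{\theta}})\hat{f}_n\,dx$ on a grid. It is an illustration under stated assumptions, not the paper's implementation: the data are simulated standard normal draws, the kernel is Gaussian, the bandwidth is $h=\hat{\sigma}n^{-1/4}$ (so that $\beta=1/4>1/5$), and the Gaussian candidate is fitted by sample moments as a stand-in for the KLD-minimizing estimator of the first step. All function names are hypothetical. The centering term $c_n$ and the variances $\sigma_1^2,\sigma_2^2$ are not defined in this section, so the sketch stops at the raw and the $nh^{1/2}$-scaled statistic.

```python
import math
import random

def kde(x, sample, h):
    """Gaussian-kernel density estimate at x with bandwidth h."""
    norm = len(sample) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xj) / h) ** 2) for xj in sample) / norm

def log_normal_pdf(x, mu, sigma):
    """Log density of the Gaussian candidate model."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def kld_estimate(sample, h, mu, sigma, grid_points=400):
    """Trapezoid approximation of the integral of (ln f_n - ln f_theta) f_n."""
    lo = min(sample) - 5.0 * h
    hi = max(sample) + 5.0 * h
    dx = (hi - lo) / grid_points
    total = 0.0
    for k in range(grid_points + 1):
        x = lo + k * dx
        fn = kde(x, sample, h)
        if fn <= 0.0:  # skip points where the log is undefined
            continue
        weight = 0.5 if k in (0, grid_points) else 1.0
        total += weight * (math.log(fn) - log_normal_pdf(x, mu, sigma)) * fn
    return total * dx

# Simulated data (assumption): 300 standard normal draws with a fixed seed.
rng = random.Random(0)
n = 300
sample = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Moment fit of the Gaussian candidate (stand-in for the KLD-minimizing fit).
mu_hat = sum(sample) / n
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in sample) / n)
h = sigma_hat * n ** (-0.25)  # bandwidth with beta = 1/4 (assumption)

kl_fit = kld_estimate(sample, h, mu_hat, sigma_hat)
kl_bad = kld_estimate(sample, h, mu_hat, 3.0 * sigma_hat)  # deliberately poor model
scaled_fit = n * math.sqrt(h) * kl_fit

print("KLD, fitted Gaussian:      ", kl_fit)
print("KLD, inflated-scale model: ", kl_bad)
print("n * h^(1/2) * KLD (fitted):", scaled_fit)
```

The candidate with the inflated scale yields a visibly larger divergence than the fitted one, which is the ordering the ranking step exploits.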

References

[1] Ahmad, I.A. and P.E. Lin, A Nonparametric Estimation of the Entropy for Absolutely Continuous Distributions, IEEE Transactions on Information Theory, 22, pp. 372-375, 1976.

[2] Aït-Sahalia Y., Nonparametric Pricing of Interest Rate Derivative Securities, Econometrica, Vol. 64, No. 3 (May 1996), pp. 527-560.

[3] Ang A. and G. Bekaert, International Asset Allocation with Regime Shifts, Review of Financial Studies 15, 4, pp. 1137-87, 2002.

[4] Ang A. and G. Bekaert, How Do Regimes Affect Asset Allocation, Columbia Business School, 2002.

[5] Avramov D., Stock Return Predictability and Model Uncertainty, The Wharton School Working Paper, May 2000.

[6] Avramov D., Stock Return Predictability and Model Uncertainty, Journal of Financial Economics 64, pp. 423-458, 2002.

[7] Bedford T. and R. Cooke, Probabilistic Risk Analysis, Cambridge University Press, 2001.

[8] Chen X. and Huang J.Z., Semiparametric and Nonparametric Estimation via the Method of Sieves, Manuscript, New York University, November 2002.

[9] Cogley T., S. Morozov and T.J. Sargent, Bayesian Fan Charts for U.K. Inflation: Forecasting and Sources of Uncertainty in an Evolving Monetary System, Working Paper No. 2003/44, Center for Financial Studies, 2003.

[10] Cremers K.J.M., Stock Return Predictability: A Bayesian Model Selection Perspective, The Review of Financial Studies, Vol. 15, No. 4, pp. 1223-1249, Fall 2002.

[11] D'Amico S., Quasi Maximum Likelihood Estimation via a Pilot Nonparametric Estimate, mimeo, 2003.

[12] Dhrymes P.J., Topics in Advanced Econometrics, Volume I and II, Springer-Verlag, 1993.

[13] Dhrymes P.J., Identification and Kullback Information in the GLSEM, Journal of Econometrics 83, pp. 163-184, 1998.

[14] Diebold F.X., T.A. Gunther and A.S. Tay, Evaluating Density Forecasts, PIER Working Paper 97-018.

[15] Diebold F.X. and J.A. Lopez, Forecast Evaluation and Combination, in G.S. Maddala and C.R. Rao, Handbook of Statistics, Volume 14, pp. 241-68, Amsterdam: North-Holland.

[16] Dmitriev Yu.G. and F.P. Tarasenko, On the Estimation of Functionals of the Probability Density and its Derivatives, Theory of Probability and Its Application 18, pp. 628-33, 1973.

[17] Dmitriev Yu.G. and F.P. Tarasenko, On a Class of Nonparametric Estimates of Nonlinear Functionals, Theory of Probability and Its Application 19, pp. 390-94, 1974.

[18] Dudewicz E.J. and E.C. van der Meulen, The Empiric Entropy, A New Approach to Nonparametric Entropy Estimation, in Puri M., Vilaplana J.P. and Wertz W., New Perspectives in Theoretical and Applied Statistics, 1987.

[19] Ebrahimi N., Maasoumi E. and Soofi E.S., Ordering Univariate Distributions by Entropy and Variance, Journal of Econometrics 90, pp. 317-336, 1999.

[20] Fan Y., Testing the Goodness of Fit of a Parametric Density Function by Kernel Method, Econometric Theory 10, pp. 316-356, 1994.

[21] Giacomini R., Comparing Density Forecasts via Weighted Likelihood Ratio Tests: Asymptotic and Bootstrap Methods, Working Paper, University of California San Diego, June 2002.

[22] Giacomini R. and H. White, Test of Conditional Predictive Ability, Working Paper, University of California San Diego, April 2003.

[23] Gilboa I. and D. Schmeidler, Cognitive Foundations of Inductive Inference and Probability: An Axiomatic Approach, mimeo, March 2000.

[24] Gilboa I. and D. Schmeidler, Inductive Inference: An Axiomatic Approach, Econometrica, Vol. 71, No. 1, pp. 1-26, January 2003.

[25] Gilboa I. and D. Schmeidler, A Theory of Case-Based Decisions, Cambridge University Press, 2001.

[26] Hall P., Central Limit Theorem for Integrated Square Error of Multivariate Nonparametric Density Estimators, Journal of Multivariate Analysis 14, pp. 1-16, 1984.

[27] Hall P., On Kullback-Leibler Loss and Density Estimation, The Annals of Statistics, Volume 15, Issue 4, pp. 1491-1519, December 1987.

[28] Hansen L.P. and T.J. Sargent, Acknowledging Misspecification in Macroeconomic Theory, Review of Economic Dynamics 4, pp. 519-535, 2001.

[29] Härdle W., Applied Nonparametric Regression, Econometric Society Monographs, 1990.

[30] Hasminskii R.Z. and I.A. Ibragimov, On the Nonparametric Estimation of Functionals, in Mandl P. and M. Huskova, Proceedings of the Second Prague Symposium on Asymptotic Statistics, August 1978.

[31] Hendry D.F. and M.P. Clements, Pooling of Forecasts, Econometrics Journal, Volume 5, pp. 1-26, 2002.

[32] Henry M., Estimating Ambiguity, Manuscript, Columbia University, 2001.

[33] Hong Y. and H. White, Asymptotic Distribution Theory for Nonparametric Entropy Measures of Serial Dependence, manuscript, July 2000.

[34] Keuzenkamp H.A., Probability, Econometrics and Truth, Cambridge University Press, 2000.

[35] Knight J.L., Satchell S.E. and K.C. Tran, Statistical Modelling of Asymmetric Risk in Asset Returns, Applied Mathematical Finance 2, pp. 155-172, 1995.

[36] Knox T.A., Analytical Methods for Learning How to Invest when Returns are Uncertain, Manuscript, University of Chicago Graduate School of Business, August 2003.

[37] Maasoumi E. and Racine J., Entropy and Predictability of Stock Market Returns, Journal of Econometrics 107, pp. 291-312, 2002.

[38] Pagan A. and A. Ullah, Nonparametric Econometrics, Cambridge University Press, 1999.

[39] Robinson P.M., Consistent Nonparametric Entropy-Based Testing, Review of Economic Studies 58, pp. 437-53, 1991.

[40] Sawa T., Information Criteria for Discriminating Among Alternative Regression Models, Econometrica, Vol. 46, November 1978.

[41] Skouras S., Decisionmetrics: A Decision-Based Approach to Econometric Modelling, Manuscript, Santa Fe Institute, November 2001.

[42] Sims A., Uncertainty Across Models, The American Economic Review, Volume 78, Issue 2, pp. 163-67, May 1988.

[43] Sin C.Y. and H. White, Information Criteria for Selecting Possibly Misspecified Parametric Models, Journal of Econometrics 71, pp. 207-225, 1996.

[44] Stock H.J. and M.W. Watson, Forecasting Inflation, Journal of Monetary Economics 44, pp. 293-335, 1999.

[45] Ullah A., Entropy, Divergence and Distance Measures with Econometric Applications, Journal of Statistical Planning and Inference 49, pp. 137-162, 1996.

[46] Uppal R. and T. Wang, Model Misspecification and Under-Diversification, mimeo, January 2002.

[47] Vapnik V.N., The Nature of Statistical Learning Theory, Springer-Verlag, 2000.

[48] White H., Estimation, Inference and Specification Analysis, Cambridge University Press, 1994.

[49] Wright J.H., Forecasting U.S. Inflation by Bayesian Model Averaging, International Finance Discussion Papers, Number 780, Board of Governors of the Federal Reserve System, September 2003.

[50] Zheng J.X., A Consistent Test of Functional Form Via Nonparametric Estimation Techniques, Journal of Econometrics 75, pp. 263-289, 1996.

[51] Zheng J.X., A Consistent Test of Conditional Parametric Distributions, Econometric Theory 16, pp. 667-691, 2000.

Cite this document
APA
Stefania D'Amico (2004). Density Selection and Combination Under Model Ambiguity: An Application to Stock Returns (FEDS 2005-09). Board of Governors of the Federal Reserve System, Finance and Economics Discussion Series. https://whenthefedspeaks.com/doc/feds_2005-09
BibTeX
@techreport{wtfs_feds_2005_09,
  author = {Stefania D'Amico},
  title = {Density Selection and Combination Under Model Ambiguity: An Application to Stock Returns},
  type = {Finance and Economics Discussion Series},
  number = {2005-09},
  institution = {Board of Governors of the Federal Reserve System},
  year = {2004},
  url = {https://whenthefedspeaks.com/doc/feds_2005-09},
  abstract = {This paper proposes a method for predicting the probability density of a variable of interest in the presence of model ambiguity. In the first step, each candidate parametric model is estimated minimizing the Kullback-Leibler 'distance' (KLD) from a reference nonparametric density estimate. Given that the KLD represents a measure of uncertainty about the true structure, in the second step, its information content is used to rank and combine the estimated models. The paper shows that the KLD between the nonparametric and the parametric density estimates is asymptotically normally distributed. This result leads to determining the weights in the model combination, using the distribution function of a Normal centered on the average performance of all plausible models. Consequently, the final weight is determined by the ability of a given model to perform better than the average. As such, this combination technique does not require the true structure to belong to the set of competing models and is computationally simple. I apply the proposed method to estimate the density function of daily stock returns under different phases of the business cycle. The results indicate that the double Gamma distribution is superior to the Gaussian distribution in modeling stock returns, and that the combination outperforms each individual candidate model both in- and out-of-sample.},
}