feds · June 26, 2023

A Comprehensive Empirical Evaluation of Biases in Expectation Formation

Abstract

We revisit predictability of forecast errors in macroeconomic survey data, which is often taken as evidence of behavioral biases at odds with rational expectations. We argue that to reject rational expectations, one must be able to predict forecast errors out of sample. However, the regressions used in the literature often perform poorly out of sample. The models seem unstable and could not have helped to improve forecasts with access only to available information. We do find some notable exceptions to this finding, in particular mean bias in interest rate forecasts, that survive our out-of-sample tests. Our findings help narrow down the set of biases that merit closer attention of researchers in behavioral macroeconomics.

Finance and Economics Discussion Series Federal Reserve Board, Washington, D.C. ISSN 1936-2854 (Print) ISSN 2767-3898 (Online) A Comprehensive Empirical Evaluation of Biases in Expectation Formation Kenneth Eva and Fabian Winkler 2023-042 Please cite this paper as: Eva, Kenneth, and Fabian Winkler (2023). “A Comprehensive Empirical Evaluation of Biases in Expectation Formation,” Finance and Economics Discussion Series 2023-042. Washington: Board of Governors of the Federal Reserve System, https://doi.org/10.17016/FEDS.2023.042. NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. References in publications to the Finance and Economics Discussion Series (other than acknowledgement) should be cleared with the author(s) to protect the tentative character of these papers.

Kenneth Eva and Fabian Winkler∗

June 14, 2023

JEL: C53, D84, E37

Keywords: Behavioral Bias, Forecasting, Out-of-sample prediction, Rational expectations, Survey Data

∗ Federal Reserve Board, 20th St and Constitution Ave NW, Washington DC 20551, email: kenneth.j.eva@frb.gov and fabian.winkler@frb.gov. We thank Andrew Chen, Sai Ma, Andrew Patton, and participants at the 2023 AEA Annual Meeting and the 2022 Computing in Finance and Economics conference for helpful comments. The views expressed in this paper are solely the responsibility of the authors and should not be interpreted as reflecting the views of the Board of Governors of the Federal Reserve System or any other person associated with the Federal Reserve System.

1 Introduction

Ever since the theory of rational expectations entered economics several decades ago, economists have debated the empirical question of whether people’s expectations are in fact rational. For the purpose of this paper, we say expectations are rational when “outcomes do not differ systematically (i.e., regularly or predictably) from what people expected them to be” (Sargent, 2007). The empirical question then becomes: Are forecast errors predictable using available information, and if so, how?

In this paper, we want to contribute to this debate by drawing a parallel to the question of return predictability in finance. It is easy to show that (aggregate) stock market returns are predictable by regressing them on a readily observed variable like the price/dividend ratio (Fama and French, 1988). This observation led to the development of theoretical models and investment recommendations consistent with the observed patterns of predictability. But the performance of the regressions had been evaluated almost exclusively in sample (IS). In an influential study, Welch and Goyal (2007) argued that the in-sample performance is not a useful definition of predictability, because predictions are made using future data that is not available to an investor predicting returns in real time. They instead evaluated the predictive performance out of sample (OOS), and found that the predictability previously documented in the literature all but disappeared.

Similarly, it is relatively easy to show that forecast errors are predictable by regressing them on a readily observed variable like forecast revisions. The predictable portion of the forecast error is then called a bias. Many such biases have been documented using different data sets on expectations. A host of labels such as “overreaction”, “underreaction”, “extrapolation”, or “stickiness” have been proposed in an attempt to interpret them.
A sizable literature has been devoted to the development of theoretical models of expectation formation to match the empirical findings, examine the propagation of macroeconomic shocks through expectations, and even to make recommendations for the conduct of policy.¹ But again, predictive performance is almost exclusively evaluated in sample.

¹ Some examples in this literature are Angeletos et al. (2018); Winkler (2020); Pfäuti and Seyrich (2022); Bhandari et al. (2022).

In our view, the in-sample approach for detecting biases in expectations suffers from the same problem as that for detecting predictability of returns. The in-sample performance is not a useful yardstick to reject the null of rational expectations, because the regression makes use of future data that could not possibly have been available to the agent who formed the expectation. Armed with the benefit of hindsight that an in-sample test provides, it is easy to say that someone’s expectations were biased. A higher bar is to demonstrate that more accurate predictions are possible in real time. This is what we set out to do in this paper by evaluating the predictive performance out of sample.

We comprehensively re-examine the empirical evidence on the predictability of forecast errors in survey data documented in the literature as of 2022. We find that many so-called biases are unstable or spurious. By and large, the models have poor OOS performance. At best, they predict forecast errors of some variables in some time periods without a clear pattern. Additionally, many regressions are not significant even in sample. Our evidence suggests that many empirical models of behavioral expectation formation would not have helped forecasters improve their forecasts. In fact, trying to use the models to correct for expectational biases would have led to larger forecast errors.

Importantly, we also find exceptions to this finding. Some biases are remarkably stable out of sample. We find robust evidence of a mean bias in professional forecasts of bond yields across the maturity curve. These patterns could be consistent with rigid priors about the low-frequency behavior of interest rates (Farmer et al., 2021). Also, there is strong evidence for what we call forecast combination bias: the fact that individual forecasts exhibit excess dispersion around the cross-sectional mean. One candidate explanation for this pattern is strategic interaction between forecasters (Gemmi and Valchev, 2021).
Our paper is simple, and we hope that the simplicity of our approach strengthens the credibility of our evidence. We do not view our contribution as a judgment in favor of, or against, rational expectations. Rather, we want to carefully examine which of the previously documented biases are actually useful to improve forecasts in practice, much like Welch and Goyal (2007) asked which asset pricing models could actually predict returns in practice. Our results can be seen as providing a selection criterion that helps narrow down the set of biases that merit closer investigation in behavioral macroeconomics. Researchers interested in theoretical models of expectation formation may want to focus on replicating those biases that survive our out-of-sample tests.

Why is it that some of the biases that were found to be statistically significant in previous studies do not work in our OOS tests? One candidate explanation is that our tests suffer from low power. It is true that, if a researcher is confident in a well-specified, stable underlying model, an OOS test will always have lower power than an IS test. But this confidence is rarely warranted in practice. Even a solid finding that one’s expectations were biased in the past is of limited use if it does not lead to a prescription of how to improve one’s expectations in the future.

From the evidence in this paper, we conclude that a more likely explanation for weak OOS performance is that the models are not stable over time. Some of the biases documented in the literature may be time- or state-dependent.² As a result, OOS prediction is difficult because one not only has to estimate the bias, but also how it will change in the future. This argument has also been made in the finance literature: Structural changes in return prediction models likely explain the disconnect between IS and OOS predictability (Lettau and Van Nieuwerburgh, 2007). However, we also document some biases that do seem structurally stable: Mean bias in professional forecasts of interest rates, as well as excess dispersion of forecasts around the cross-sectional mean. We think these biases merit closer attention of researchers in the field.

The only recent study we are aware of that takes a rigorous OOS perspective when assessing expectational biases is Bianchi et al. (2022). They show that a sophisticated machine-learning algorithm can improve GDP and inflation forecasts of professional forecasters out of sample.
They also document that the predictive regressions of Coibion and Gorodnichenko (2015) perform poorly out of sample for GDP and inflation. The scope of our paper is different. Rather than constructing a new measure of bias, we comprehensively revisit the existing evidence on expectational biases through the OOS lens. Our paper is closer in spirit to an older literature which asks whether forecasts from time-series models improve on survey forecasts out of sample (e.g. Pearce, 1987; Bonham and Dacy, 1991) and usually concludes that they do not. We modify this exercise and ask whether the models of bias proposed in the literature are able to improve survey forecasts.

² In support of the idea of state-dependent biases, Angeletos et al. (2021) construct impulse responses of forecast errors to identified macroeconomic shocks and find that forecast errors exhibit different patterns of predictability depending on the type of the shock and the time since it occurred.

Our paper relates to a number of studies that voice skepticism about biases in expectations based on the existing empirical evidence. Andolfatto et al. (2008) argue that in macroeconomic models with infrequent regime shifts, rational agents that have to learn about the new regimes will make forecast errors that seem predictable in small samples. Hajdini and Kurmann (2022) show that this can even be the case when the regimes are observed by agents. More subtly, Farmer et al. (2021) argue that in-sample predictability in small samples can still be consistent with rational Bayesian updating if agents are unsure about the low-frequency behavior of the time series being forecast and have relatively strong priors on it, because it takes a very long time for the effect of the bias induced by the prior to fade away. Our paper takes an entirely empirical approach and shows that, from an OOS perspective, the evidence on predictability is often weak to begin with. There is also a strand of the literature that argues that survey forecasts are optimal, but just not in a mean squared error sense. This could be the case because forecasters have asymmetric or otherwise non-quadratic loss functions (Elliott et al., 2008) or have to deal with Knightian uncertainty (Bhandari et al., 2022). Our paper offers a complementary argument for the optimality of survey forecasts from an OOS perspective, retaining the standard mean squared error criterion.

The remainder of this paper is structured as follows. Section 2 describes the data, and Section 3 lays out our empirical procedure.
Section 4 contains our findings for surveys of professional forecasters while Section 5 contains findings for household surveys. Section 6 discusses the robustness of our findings. Section 7 concludes.

2 Data

We use data from the Survey of Professional Forecasters (SPF), BlueChip Financial Forecasts (BC), the Michigan Surveys of Consumers (Michigan) and the Survey of Consumer Expectations (SCE). The surveys differ in their sample length and coverage of forecast variables. We only evaluate numerical point forecasts.

The SPF is the longest-running quarterly survey of macroeconomic forecasts in the United States, starting in 1968. Since 1990, the survey has been run by the Philadelphia Fed. In the middle of each quarter, participants are asked to forecast a wide range of variables for the current quarter and each of the following quarters, up to four quarters out. From this survey, we take the following forecast variables: the GDP deflator, nominal GDP, industrial production; real GDP, consumption, non-residential investment, residential investment, federal government expenditures, as well as state and local government expenditures; housing starts, the unemployment rate, and CPI headline inflation. For all variables except the last two, the forecasts in the data are in levels but we transform them into forecasts of the percent change between the forecast horizon and the quarter preceding the survey date. For CPI inflation, the forecasts are for annualized quarterly inflation rates but we transform them into forecasts of the percent change of the CPI index between the forecast horizon and the quarter preceding the survey. For the unemployment rate, we directly evaluate the level forecasts. To guarantee a reasonable evaluation period for our OOS tests, we omit other variables in the SPF, as they have less than 20 years of data. Our main focus is on a forecast horizon of three quarters, but we also evaluate forecasts from zero quarters ahead (current-quarter nowcasts) to two quarters ahead.

The BlueChip survey is a monthly survey of forecasts mainly of interest rates. Our BlueChip sample starts in 1988. Because the BlueChip forecast horizons are quarterly, we restrict our sample to the months in the middle of every quarter to ensure constant forecast horizons.
From BlueChip, we take forecasts of the federal funds rate; Treasury yields at three months, one year, two years, ten years, and 30 years maturity; and Aaa and Baa corporate bond yields. We also construct implicit forecasts of the one year-three month and ten year-two year term spread, as well as the Baa-Aaa corporate bond spread.

While the respondents to these two surveys are professional forecasters, the Michigan survey and the SCE are monthly household surveys. The Michigan survey starts in 1978 while the SCE starts in 2013. We only use 12-month ahead inflation forecasts, as those are the only quantitative forecasts of traditional macroeconomic data available.

To construct forecast errors, we also need realized data. For some, like interest rates which are market-quoted, this is straightforward. But for others, like GDP, the realized values are subject to considerable revisions. We use vintage data from the real-time data set for macroeconomists provided by the Philadelphia Fed, and use the first available releases of the data.

3 Empirical procedure

3.1 Consensus forecasts

Many tests for rational expectations in the literature use consensus forecasts, i.e. the cross-sectional average of individual forecasts. This average is then treated as the expectation of a hypothetical aggregate forecaster. In our empirical procedure, we consider the null hypothesis that consensus forecast errors are unpredictable. The regression models we evaluate take the form:

$$y_{t+h} - \bar{y}_{t+h|t} = \beta' x_t + u_{t+h} \tag{1}$$

where $y_{t+h}$ is the realization of a variable at time $t+h$, $\bar{y}_{t+h|t}$ is the consensus forecast of $y_{t+h}$ made at time $t$, and $x_t$ are a set of $K$ potential predictors, the values of which are known at time $t$. When the consensus forecast is a rational expectation, the forecast error has zero mean and is unpredictable by $x_t$; that is, $\beta = 0$. If instead, model (1) captures a behavioral bias, then $\beta \neq 0$.

We use the behavioral model to construct a series of bias-corrected forecasts:

$$y^*_{t+h|t} = \bar{y}_{t+h|t} + \hat{\beta}_t' x_t.$$

When we fit the model IS, then $\hat{\beta}_t$ is constant over time and simply equals the OLS coefficients of (1) estimated over the whole sample. When we fit the model OOS, then $\hat{\beta}_t$ are the OLS coefficients estimated using data available up to time $t$, using either recursive or rolling windows. In the surveys we consider, the end-of-period values of the forecast variables are not known at time $t$, so that $\hat{\beta}_t$ is estimated using observations through $y_{t-1}$.

The prediction errors for the rational model and the behavioral model, respectively, are:

$$e^R_{t+h} = y_{t+h} - \bar{y}_{t+h|t}$$

$$e^B_{t+h} = y_{t+h} - \bar{y}_{t+h|t} - \hat{\beta}_t' x_t.$$

Under the null of rationality, the rational model should predict better than the behavioral model, as the latter is just injecting noise into the prediction. Following the literature, we evaluate the accuracy of forecasts using the sum of squared errors (SSE). We divide the sample into a training period and an evaluation period, the latter starting at time $t_0$, and compute:

$$SSE^m_t = \sum_{s=t_0}^{t} (e^m_s)^2, \quad m = R, B. \tag{2}$$

Our main statistic of interest is the difference of the SSE of the rational model and the behavioral model:

$$\Delta SSE_t = \frac{SSE^R_t - SSE^B_t}{SSE^R_T}. \tag{3}$$

If the difference is positive, then the behavioral model predicted better in our sample. If it is negative, then the rational model predicted better. We divide this difference by $SSE^R_T$, the sum of the squared forecast errors over the entire evaluation period, to allow for an easier interpretation of the magnitudes. $\Delta SSE_t$ thus represents the difference in predictive performance of the rational and behavioral model up to time $t$, expressed as a fraction of the total sum of squared original forecast errors in the data. A value of, say, $\Delta SSE_T = 0.1$ means that the behavioral model's cumulative squared forecast errors are 10 percent smaller over the sample than those of the rational model; in other words, correcting for bias using the behavioral model reduces squared forecast errors by 10 percent.³

Although we could use statistical tests for equal forecast accuracy of nested models (e.g. Clark and West, 2007), we obtain critical values for $\Delta SSE_t$ (both IS and OOS) directly using a bootstrap, as we use relatively small samples, potentially serially correlated independent variables, and overlapping observations whenever $h > 1$.

Our bootstrap follows Welch and Goyal (2007), adapted to the particular structure of overlapping forecasts, and imposes the null of no predictability of forecast errors. The data-generating process for our bootstrap is:

$$u_t = \sum_{s=0}^{h} \theta_s \epsilon_{t-s} \tag{4}$$

$$x_{k,t} = \sum_{s=0}^{p} \phi_s x_{k,t-1} + \sum_{s=0}^{q} \psi_s \eta_{k,t-s}, \quad k = 1, \ldots, K \tag{5}$$

We model forecast errors as an MA(h) process and regressors as ARMA(p,q) processes, and estimate the parameters by maximum likelihood using the full sample of observations. The choice of $p$ and $q$ depends on the particular model. The joint residuals $(\epsilon_t, \eta_{1t}, \ldots, \eta_{Kt})$ are stored for sampling. Joint sampling preserves the correlation structure between the variables. We then generate 10,000 bootstrapped time series by drawing with replacement from the residuals. The initial observation $x_{-1}$ is selected by picking one date from the actual data at random. For each draw, we compute $\Delta SSE_t$ and use the resulting distribution to compute critical values. We use one-sided critical values because we are only interested in whether the behavioral model predicts better than the null.
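The resampling step of the bootstrap can be sketched as follows. This is only an illustration under simplifying assumptions that are not taken from the paper: a single regressor modeled as white noise ($p = q = 0$), MA coefficients `theta` and residuals `eps`, `eta` that have already been estimated elsewhere, and a particular timing convention in which the regressor innovation at date $t$ is paired with a forecast error built only from shocks dated after $t$, so that the null of no predictability holds by construction. The function name `bootstrap_critical_value` and the callback `stat_fn` are hypothetical.

```python
import numpy as np

def bootstrap_critical_value(eps, eta, theta, stat_fn, n_boot=1000, level=0.05, seed=0):
    """One-sided bootstrap critical value for a statistic under the null.

    eps, eta : estimated residuals of the forecast-error and regressor models
               (same length T, aligned by date so they can be drawn jointly)
    theta    : MA coefficients theta_0..theta_h of the forecast-error process
    stat_fn  : callable (u, x) -> float, e.g. a Delta-SSE statistic
    """
    rng = np.random.default_rng(seed)
    eps, eta, theta = (np.asarray(a, dtype=float) for a in (eps, eta, theta))
    T, h = len(eps), len(theta) - 1
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Draw dates with replacement; using the same index for eps and eta
        # keeps their contemporaneous correlation (joint sampling).
        idx = rng.integers(0, T, size=T + h + 1)
        e = eps[idx]
        x = eta[idx[:T]]  # white-noise regressor dated n = 0..T-1
        # Forecast error made at date n: MA(h) in shocks dated n+1..n+h+1,
        # so it is independent of x[n] (assumed timing convention).
        u = np.convolve(e[1:], theta, mode="valid")
        stats[b] = stat_fn(u, x)
    return np.quantile(stats, 1.0 - level)
```

A statistic computed on the actual data would then be compared with the returned quantile; values above it reject the null at the chosen level. The maximum-likelihood estimation of the MA and ARMA parameters is deliberately left out of this sketch.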
³ Note that, for IS regressions, $\Delta SSE_T$ is not the same as $R^2$ because we only sum the squared errors over the evaluation period starting in $t_0$, and therefore can also be negative. In Welch and Goyal (2007), it is named “IS for OOS $\bar{R}^2$”.
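As a minimal sketch of equations (2) and (3), the cumulative statistic can be computed along the following lines. The function name `delta_sse_path`, its arguments, and the array layout are assumptions made for illustration. The information lag `n = t - h` is one reading of the statement that $\hat{\beta}_t$ is estimated using observations through $y_{t-1}$: a forecast made at date $s$ has a realized error by then only if $s + h \le t - 1$.

```python
import numpy as np

def delta_sse_path(err, X, t0, h, out_of_sample=True):
    """Cumulative Delta-SSE_t over the evaluation period t0..T-1.

    err[t] : forecast error y_{t+h} - ybar_{t+h|t}, indexed by forecast date t
    X[t]   : predictors known at date t, shape (T,) or (T, K)
    t0     : first date of the evaluation period
    h      : forecast horizon in periods
    """
    err = np.asarray(err, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(err), -1)
    T = len(err)
    if out_of_sample:
        if t0 <= h:
            raise ValueError("t0 must exceed h so that training data exist")
        fitted = np.zeros(T)
        for t in range(t0, T):
            n = t - h  # rows 0..n-1 have outcomes realized by date t-1
            beta, *_ = np.linalg.lstsq(X[:n], err[:n], rcond=None)  # recursive window
            fitted[t] = X[t] @ beta
    else:
        # In-sample: one coefficient vector estimated on the full sample.
        beta, *_ = np.linalg.lstsq(X, err, rcond=None)
        fitted = X @ beta
    e_r = err[t0:]                  # rational model: survey forecast taken as is
    e_b = err[t0:] - fitted[t0:]    # behavioral model: bias-corrected forecast
    return np.cumsum(e_r**2 - e_b**2) / np.sum(e_r**2)
```

The last element of the returned array corresponds to $\Delta SSE_T$. Calling the function with `out_of_sample=False` and `t0 > 0` gives the in-sample statistic summed over the evaluation period only, which is why it can turn negative.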

To see that it is appropriate to model forecast errors as MA(h), consider that, under the null,

$$u_t = y_{t+h} - E[y_{t+h} \mid \mathcal{F}_t] = \sum_{s=0}^{h} \left( E[y_{t+h} \mid \mathcal{F}_{t+s+1}] - E[y_{t+h} \mid \mathcal{F}_{t+s}] \right) \tag{6}$$

where $\mathcal{F}_t$ is the information set at time $t$; in particular, $y_t$ and $x_t$ are part of this information set. Each of the forecast revisions in the sum in (6) is uncorrelated with the others, because rational forecast revisions are martingale differences. Also, $u_t$ is uncorrelated with its own lags at lag length greater than $h$. As long as $u_t$ is also covariance-stationary, it is therefore an MA(h) process. Moreover, $x_t$ is uncorrelated with all forecast revisions that occur after time $t$.

For our OOS tests, we choose an initial estimation window $t_0$ of 40 periods, after which we begin the OOS forecasts, and restrict ourselves to forecasts for which at least 80 periods of data are available. Any choice of the window length is necessarily ad hoc, but our results are robust to this choice, as can be seen in our graphical analysis.

We mainly report results using a recursive window regression to estimate $\hat{\beta}_t$ for our OOS forecasts. It could be argued that this choice makes it harder to produce forecast improvements if the true model coefficients are time-varying, as has been suggested for example by Coibion and Gorodnichenko (2015). However, time variation also makes it harder to predict forecast errors in real time even if properly accounted for, because past data now contain less information about future biases. In Section 6.1, we find that using rolling window regressions does little to improve the predictive performance of the behavioral models.

3.2 Individual forecasts

Some of the literature conducts rationality tests directly on individual forecasts. Models of biased expectations at the individual level take the form:

$$y_{t+h} - y_{t+h|it} = \beta' x_{it} + u_{it+h} \tag{7}$$

where $y_{t+h|it}$ is the forecast of $y_{t+h}$ made at time $t$ by individual $i$, and $x_{it}$ are one or more potential predictors of forecast errors, the values of which are known to individual $i$ at time $t$.

Again, our null hypothesis is that the individual expectations are rational so that $\beta = 0$, while a behavioral model posits $\beta \neq 0$. As before, $\hat{\beta}_t$ are the OLS coefficients estimated using data available up to time $t$, and $e^R_{it}$ and $e^B_{it}$ are the prediction errors for the rational model and the behavioral model, respectively. The sum of squared errors (SSE) for the rational and behavioral model, and their difference, are now defined as

$$SSE^m_t = \sum_{s=t_0}^{t} \sum_{i \in I_s} (e^m_{is})^2, \quad m = R, B \tag{8}$$

$$\Delta SSE_t = \frac{SSE^R_t - SSE^B_t}{SSE^R_T} \tag{9}$$

where $I_s$ is the subset of individuals for which forecasts are available at time $s$.

For the bootstrap, we model the individual forecast errors and regressors analogously to the consensus level:

$$u_{it+h} = \sum_{s=0}^{h-1} \theta_s \epsilon_{i,t+h-s} \tag{10}$$

$$x_{ik,t+h} = \sum_{s=0}^{p} \phi_s x_{ik,t-1} + \sum_{s=0}^{q} \psi_s \eta_{ki,t-s}, \quad k = 1, \ldots, K \tag{11}$$

where the choice of $p$ and $q$ depends on the particular model. We estimate the parameters using maximum likelihood, now pooling parameter estimates across forecasters, and store the estimated residuals for sampling. When we sample the residuals, we preserve cross-sectional correlations and missing values in the following way. We first sample $T+1$ time indices randomly with replacement from $\{0, \ldots, T\}$. For each time period in the bootstrapped sample, we sample with replacement from the individual residuals only within the corresponding sampled time index. Where possible, we jointly sample residuals from $u_{it}$ and $x_{it}$, preserving correlation at the individual level. The panel that we obtain is balanced, while the original data have many missing values. In the last step, we therefore replace simulated data with missing values wherever the original data have missing values. We repeat this simulation 10,000 times.

This bootstrap implies that forecast errors are not only unpredictable at the individual level, but also at the consensus level. The literature has pointed out that even if individual forecasters make rational predictions, average forecasts may still be biased.⁴ Our null hypothesis is somewhat stronger than individual rationality and thus easier to reject. If we fail to reject our null for the behavioral models because of weak OOS performance, then the same will hold true for a weaker null in which consensus forecast errors can exhibit some predictability.

⁴ This is the case, for example, in the noisy information model of Coibion and Gorodnichenko (2015). In that model, the average forecast revision that appears on the right-hand side of the bias regression is never observed by individuals.

4 Results for Professional Forecasters

4.1 Consensus forecast revisions

We first present detailed results of our tests for the popular model of Coibion and Gorodnichenko (2015), which posits that consensus forecast errors are predictable using forecast revisions:

$$y_{t+h} - \bar{y}_{t+h|t} = \beta' \left( \bar{y}_{t+h|t} - \bar{y}_{t+h|t-1} \right) + u_{t+h} \tag{12}$$

The in-sample coefficient in these regressions is typically positive, which can be interpreted as underreaction: When forecasters revise their expectations upwards, they still make a positive forecast error and thus should have revised more.⁵ For this model, we set $p = q = 0$ in (5) as rational forecast revisions are uncorrelated. Coibion and Gorodnichenko (2015) focus mainly on inflation expectations, but also apply the model to a wide range of other variables.

⁵ The empirical model estimated by Coibion and Gorodnichenko additionally has a constant term. We omit the constant here to give the models the best chance to fit the data OOS because including a constant makes it considerably harder to reject the null, i.e. reduces the power of our OOS tests. Since forecast errors and forecast revisions have zero mean under the null, setting the constant to zero is a reasonable economic prior. Later, we will test for mean bias in forecast errors separately.

We will first present the results of this model graphically, by plotting the series $\Delta SSE_t$ over time. The $\Delta SSE_t$ statistic represents the difference of the squared original forecast errors and the squared model prediction errors (either IS or OOS) cumulated up to time $t$. Thus, whenever a line increases, the behavioral model predicted better; whenever it decreases, the rational model predicted better. The endpoint of each line represents the improvement in the mean squared forecast error. For example, a value of 0.1 at the end of our sample would imply that a forecaster who used the behavioral model to correct their predictions would have made forecasts with a ten percent lower mean squared error.

Figure 1: Prediction of consensus forecast errors with revisions. Panels: inflation rate for the GDP deflator; real GDP growth rate; CPI inflation; federal funds rate. Each panel plots $\Delta SSE_t$ over time for the in-sample and out-of-sample regressions.

Note: Dashed and solid lines represent cumulative squared errors $\Delta SSE_t$ for the in-sample regression and the out-of-sample regression, respectively. An increase in a line indicates better performance of the behavioral model; a decrease in a line indicates better performance of the rational model. Dotted vertical lines mark the end of the training period and the beginning of the evaluation period. Shaded areas represent NBER recessions.

Figure 1 shows the evolution of $\Delta SSE_t$ over time for four variables that represent typical outcomes of our tests.
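To make the sign convention of the revision regression in equation (12) concrete, the following minimal sketch builds the forecast error and the forecast revision from aligned arrays and estimates the slope without a constant. The function name `cg_regression` and the array layout are assumptions for illustration; the stylized numbers in the usage example are invented and are not survey data.

```python
import numpy as np

def cg_regression(y, f_now, f_prev):
    """No-constant OLS of forecast errors on forecast revisions.

    y[t]      : realization y_{t+h} targeted by the forecast made at date t
    f_now[t]  : consensus forecast of y_{t+h} made at date t
    f_prev[t] : consensus forecast of the same target made at date t-1
    """
    y, f_now, f_prev = (np.asarray(a, dtype=float) for a in (y, f_now, f_prev))
    error = y - f_now            # left-hand side of the regression
    revision = f_now - f_prev    # right-hand side of the regression
    beta = (revision @ error) / (revision @ revision)
    return beta, error, revision

# Stylized underreaction: forecasts move only halfway toward the news.
news = np.array([1.0, -2.0, 0.5, 3.0])
f_prev = np.full(4, 2.0)
beta, error, revision = cg_regression(f_prev + news, f_prev + 0.5 * news, f_prev)
```

In this stylized case the revision and the remaining error are both half of the news, so the estimated slope is positive: an upward revision is followed by a positive forecast error, which is the underreaction pattern described above.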

The top left panel shows results for forecasts of inflation as measured by the GDP deflator, going back to 1968. The IS fit, represented by the dashed line, is surprisingly weak. The behavioral model fits the data better over the entire sample by construction of the OLS estimator. But this fit is largely achieved during a short period around 1975 and to a smaller extent after the Covid-19 recession of 2020. Most of the time, the IS line is flat, indicating that the behavioral model did not fit the data any better than the rational model. The OOS fit, represented by the solid line, is poor. The solid line is below zero for almost the entire evaluation period (to the right of the vertical dotted line). This means that, if a forecaster had used the Coibion and Gorodnichenko (2015) model to improve forecasts in real time, they would have made larger forecast errors than if they had treated the original forecasts as optimal predictions.

The top right panel shows the same results for real GDP growth. The IS fit, represented by the dashed line, is again surprisingly weak. Out of sample, the behavioral model beats the rational model in the period after the financial crisis of 2008, when the solid line moves up markedly. After that, however, the OOS line stayed flat, indicating that the behavioral model had little predictive advantage. And after the Covid-19 recession in 2020, the behavioral model fared terribly. Importantly, this bad performance is not a mechanical result of the large forecast errors realized in 2020. The OOS line would have stayed flat if the behavioral model had made forecast errors that were as large as those of the unadjusted forecasts. But instead, the behavioral model modified the survey forecasts in the wrong direction, resulting in even larger forecast errors. When the pandemic shock hit the economy, economic activity forecasts were revised down and the behavioral model adjusted the forecasts down further, in line with the underreaction of expectations.
But subsequently, economic activity rebounded more quickly than implied by the original forecasts, which would be more consistent with overreaction. This is a good example of an unstable predictive relationship.

The model does somewhat better on expectations of CPI inflation, shown in the bottom-left panel. CPI inflation expectations are the main focus of Coibion and Gorodnichenko, and here their model would have led to a 1 percent reduction in the mean squared forecast error at the end of the sample. This improvement is sufficient to reject the null using our bootstrapped critical values. However, the figure also shows that the gains over the rational model arise largely in 2021 and 2022, when inflation forecasts indeed underreacted to a sharp rise in inflation. Between 2007 and 2020, the bias correction of the model would have made the forecasts worse. To us, this also looks like an unstable model.

Remarkably, the Coibion and Gorodnichenko model does very well on interest rate forecasts. The lower right panel of Figure 1 shows that federal funds rate forecasts could have been substantially improved using this model. What’s more, the performance is indicative of a stable prediction model: The OOS line trends up almost through the entire sample. At the end of our sample, a forecaster relying on the behavioral model to correct their forecasts would have reduced the mean squared error of their predictions by an impressive 19 percent.

Our results do not depend materially on the split between the training and the evaluation period, and this can be seen directly from the plots. The IS and OOS lines represent cumulative sums of squared errors, and if we start the evaluation period at a later date, then we can simply start summing the squared errors from that date.

In Table 1, we document the IS and OOS performance of the Coibion and Gorodnichenko model, measured by the $\Delta SSE_T$ statistic in (2)–(3), for all variables in the SPF and BlueChip surveys with three-quarter ahead forecast horizons for which at least 20 years of data are available. Stars indicate whether the null of no predictability is rejected using our bootstrapped critical values for the IS and OOS versions of $\Delta SSE_T$.

For almost all of the macroeconomic variables in the top half of the table, even the IS predictive performance is not high enough to reject the null of rationality using our bootstrap test. Note that the statistic $\Delta SSE_T$ is at times negative IS. If we summed over the full sample in (2), the statistic would equal $R^2$ and would be positive by construction of the OLS estimator, but we only sum squared errors over the evaluation period, that is, excluding the first 40 quarters of the sample.
The fact that the IS performance is often not significant during this evaluation period already indicates an unstable predictive relationship. TheOOSperformanceforthemacroeconomicvariablesisgenerallynegative,inply- 15

ingthataforecasterwhohadreliedontheCoibion-Gorodnichenkomethodtoremove biasfromtheirforecastswouldhavebeenleftworseoffthanwithnoadjustmentatall. There are some notable exceptions to this rule, however. The behavioral model beats therationalmodelOOSforCPIinflation,industrialproductionandhousingstarts. The improvements are significant and reduce the mean squared forecast error by several percentagepoints. Wherethemodeldoesreallywellisoninterestrateforecasts,shown inthebottomhalfofthetable. Aforecastcorrectionusingthebehavioralmodelwould have led to a reduction in the mean squared forecast errors by up to 21 percent, which arestaggeringnumbersbyforecastingstandards. 4.2 Other models based on consensus forecasts Moving beyond the prominent model of Coibion and Gorodnichenko (2015), we now turntodiscussotherwidelyknownmodelsofbiasinconsensusexpectations. First, we examine a simple model of mean bias, where forecasts are systematically too high or too low. To test for mean bias, consensus forecast errors are regressed only onaconstant: y −y¯ = β +u (13) t+h t+h|t t+h A positive coefficient implies that forecasters always underpredict a variable by a constantamount. Becauseofitssimplicity,weexpectthisbiastobetheeasiesttodetect. Next,wetestforforecastautocorrelation,wherebyforecasterrorsarepredictedwith theirownlag: (cid:0) (cid:1) y −y¯ = β y −y¯ +u (14) t+h t+h|t t−1 t−1|t−h−1 t+h A positive coefficient on the lagged forecast error implies that overpredictions tend to be followed by more overpredictions, akin to a momentum effect in asset returns, implying that forecasts are slow to react to incoming information. Note that, in the data, forecastersattimetonlyknowtherealizationsofthedata(andthustheirownforecast errors)uptoperiodt−1. Forthebootstrap,wesetp = 0,q = h+1. Wealsolookatthewell-knownregressionsofMincerandZarnowitz(1969),inwhich 16

Table 1: Prediction of consensus professional forecast errors with revisions.

| Variable | $\Delta SSE_T$ IS | $\Delta SSE_T$ OOS |
|---|---|---|
| Inflation (deflator) | 0.034** | -0.023 |
| Inflation (CPI) | 0.027** | 0.015** |
| Real GDP | -0.005 | -0.119 |
| Industrial Production | 0.094*** | 0.035*** |
| Nominal GDP | 0.021** | -0.101 |
| Unemployment rate | 0.003 | -0.251 |
| Consumption | 0.010 | -0.031 |
| Non-residential inv. | 0.019** | -0.061 |
| Residential inv. | 0.032*** | -0.026 |
| Federal govt. | -0.002 | -0.012 |
| Non-federal govt. | 0.002 | -0.027 |
| Housing starts | 0.132*** | 0.047*** |
| Federal funds rate | 0.219*** | 0.203*** |
| 3-month yield | 0.190*** | 0.181*** |
| 6-month yield | 0.234*** | 0.211*** |
| 1-year yield | 0.220*** | 0.196*** |
| 2-year yield | 0.143*** | 0.112*** |
| 10-year yield | 0.025* | -0.001 |
| Aaa yield | 0.069*** | 0.067*** |
| Baa yield | 0.061** | 0.052** |
| 1y-3m spread | -0.002 | -0.009 |
| 10y-2y spread | 0.083*** | 0.056*** |
| Aaa-Baa spread | -0.004 | -0.011 |

Note: Each row shows cumulative squared errors $\Delta SSE_T$ in sample and out of sample. ***, ** and * represent rejection of the null hypothesis of no predictability of forecast errors at the 1, 5, and 10 percent level using bootstrapped critical values. Yield and spread variables are taken from Blue Chip, other variables are taken from the SPF.

realized outcomes are regressed on forecasts. An equivalent formulation is to regress forecast errors on forecasts and a constant:

$$y_{t+h} - \bar{y}_{t+h|t} = \beta_0 + \beta_1 \bar{y}_{t+h|t} + u_{t+h} \tag{15}$$

Again, the null hypothesis implies $\beta = 0$. For this model, we set $p = 2$ and $q = 0$ in the bootstrap (5). A negative coefficient on the forecast $\bar{y}_{t+h|t}$ implies that forecasters are too optimistic whenever their forecasts are high, thus capturing a form of extrapolation bias.

Finally, we look at the Nordhaus (1987) test of forecast efficiency. Instead of putting forecast errors on the left-hand side, Nordhaus examined the predictability of forecast revisions. Rational expectations imply that forecast revisions are unpredictable, in addition to forecast errors, because of the law of iterated expectations.⁶ The Nordhaus regression model has the form:

$$\bar{y}_{t|t-h} - \bar{y}_{t|t-h-1} = \beta \left( \bar{y}_{t|t-h-1} - \bar{y}_{t|t-h-2} \right) + u_{t+h} \tag{16}$$

This model can be interpreted as a test of the stickiness of forecasts: A positive coefficient on past revisions implies underreaction of forecasts, as the forecasts will be predictably revised in the same direction as the previous revision. For the bootstrap, here we model both the regressor and the regressand as white noise ($p = q = 0$).

The results of our OOS tests for these models are documented in Table 2. For the macroeconomic variables, a similar picture emerges regardless of the model: The IS performance is typically weak and insufficient to reject the null, and the OOS performance is typically negative: A real-time bias correction would have made the forecasts worse. In general, none of the models are able to consistently beat the null of no predictability. There are some exceptions throughout the table. For example, forecast errors of deflator-based inflation and housing starts are predictable OOS using the autocorrelation model.
To us, these isolated “wins” are likely to arise by chance, as a byproduct of the large number of tests in this paper. The picture looks a bit better for the Nordhaus model, which only predicts forecast revisions instead of actual forecast errors, and sometimes does well at that. This better performance relative to the other models may reflect that revisions contain less noise than realizations and are therefore more easily predictable. Nevertheless, a simple, unifying empirical relationship summarizing bias in macroeconomic expectations of professional forecasters remains elusive.

The picture is different for interest rate forecasts, shown in the bottom half of the table. Here, all models are able to improve forecast efficiency OOS for short-term yields. The mean bias model in particular is able to reduce the mean squared forecast error across all interest rates in the table, by as much as 29 percent. Our interpretation of this strong deviation from rationality is that mean bias is driving the performance of the other models, too. The level of interest rates has declined steadily over the past few decades, and while forecasters continually revised their projections as well, they consistently underestimated the secular fall in interest rates (Rungcharoenkitkul and Winkler, 2022). As a result, forecast revisions and forecast errors are on average negative, resulting in predictability in the models of Coibion and Gorodnichenko (2015) and Nordhaus (1987); forecast errors are positively correlated; and the Mincer and Zarnowitz (1969) model also performs well as it includes a constant that picks up the mean bias.

In sum, none of the models that feature prominently in the literature can consistently improve professional forecasts of macroeconomic variables. However, interest rate forecast errors are robustly predictable with many of these models, perhaps related to the persistent underprediction of the secular decline in interest rates.

⁶ Strictly speaking, unpredictability of forecast revisions also requires that the information sets are nested over time, so that no information is “forgotten” as the forecasts are revised.
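To make the out-of-sample logic concrete, the following is a minimal sketch of a recursive real-time test of the mean bias model (13). It is not the authors' code. It assumes that $\Delta SSE_T$ is the reduction in the sum of squared forecast errors relative to the rational benchmark (which predicts a zero error), expressed as a fraction of the benchmark's sum of squares. It also assumes that the errors for the $h$ most recent target dates are not yet observable when a forecast is made. The function name, the defaults, and the example series are hypothetical, and the bootstrap used for inference is not shown.

```python
def delta_sse_mean_bias(errors, train=40, h=3):
    """Relative OOS gain from a real-time mean-bias correction.

    errors[t] is the forecast error (realization minus forecast) for target
    period t. Returns (SSE_rational - SSE_behavioral) / SSE_rational over the
    evaluation period, so positive values favor the mean bias model.
    """
    if train <= h or len(errors) <= train:
        raise ValueError("need train > h and at least one evaluation period")
    sse_rational = 0.0    # rational benchmark: predicted error is zero
    sse_behavioral = 0.0  # behavioral model: predicted error is the estimated bias
    for t in range(train, len(errors)):
        known = errors[: t - h]             # errors observable at forecast time
        beta_hat = sum(known) / len(known)  # recursive (expanding-window) estimate
        sse_rational += errors[t] ** 2
        sse_behavioral += (errors[t] - beta_hat) ** 2
    return (sse_rational - sse_behavioral) / sse_rational


# Hypothetical series: no bias during training, then persistent underprediction.
late_bias = [0.0] * 40 + [1.0] * 20
print(delta_sse_mean_bias(late_bias))

# A constant bias is learned during training and removed entirely out of sample.
print(delta_sse_mean_bias([1.0] * 60))  # prints 1.0
```

In the first series the bias only appears after the training period, so the real-time estimate lags behind and just part of the squared error is removed. This is the sense in which an unstable bias is harder to exploit out of sample than in sample.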
4.3 Tests based on individual expectations

One can argue that consensus forecasts, which average out the idiosyncrasies of individuals, represent a “best case”: If it can be shown that average forecasts are biased, then the individual forecasts must be biased as well. This argument is generally valid as long as the predictor variable is a part of the information set of all individuals. But using consensus forecasts is only an indirect way of testing for biases, because they do

Table 2: Other models of consensus professional forecast errors.

| $\Delta SSE_T$ | (1) Mean bias IS | (1) Mean bias OOS | (2) Autocorrelation IS | (2) Autocorrelation OOS | (3) Mincer-Zarnowitz IS | (3) Mincer-Zarnowitz OOS | (4) Nordhaus IS | (4) Nordhaus OOS |
|---|---|---|---|---|---|---|---|---|
| Inflation (deflator) | -0.025 | -0.416 | 0.128*** | 0.122*** | -0.061* | -0.476 | 0.094*** | 0.066*** |
| Inflation (CPI) | -0.008 | -0.073 | -0.001 | -0.040 | 0.013* | -0.043 | 0.105*** | 0.075*** |
| Real GDP | 0.006 | -0.040 | 0.029* | 0.005 | -0.004* | -0.153 | 0.004 | -0.074 |
| Industrial Production | 0.055* | 0.007 | 0.006 | -0.008 | 0.082* | 0.008 | 0.055*** | -0.007 |
| Nominal GDP | 0.025 | -0.029 | 0.018 | -0.028 | 0.022* | -0.052 | 0.013 | -0.061 |
| Unemployment rate | -0.001 | -0.029 | 0.005 | -0.032 | 0.048 | -0.041 | 0.000 | -0.126 |
| Consumption | 0.012 | -0.011 | 0.052* | -0.172 | 0.021* | 0.010 | 0.027** | -0.082 |
| Non-residential inv. | -0.004 | -0.065 | 0.000 | -0.035 | -0.001 | -0.112 | 0.075*** | 0.011** |
| Residential inv. | 0.002 | -0.063 | 0.075** | -0.007 | -0.041 | -0.306 | 0.128*** | 0.098*** |
| Federal govt. | 0.001 | -0.062 | 0.044* | -0.035 | 0.118* | 0.027 | -0.004 | -0.014 |
| Non-federal govt. | 0.025 | -0.105 | 0.056* | 0.026 | 0.082* | -0.103 | 0.107*** | 0.084*** |
| Housing starts | 0.008 | -0.043 | 0.141*** | 0.117*** | 0.027* | -0.100 | 0.233*** | 0.205*** |
| Federal funds rate | 0.121** | 0.061** | 0.071* | 0.047** | 0.134* | -0.007 | 0.234*** | 0.222*** |
| 3-month yield | 0.183*** | 0.129*** | 0.113** | 0.089*** | 0.221** | 0.089** | 0.230*** | 0.217*** |
| 6-month yield | 0.211*** | 0.157*** | 0.186*** | 0.129*** | 0.255** | 0.115** | 0.259*** | 0.245*** |
| 1-year yield | 0.198** | 0.135** | 0.152* | 0.037 | 0.208* | 0.043 | 0.240*** | 0.227*** |
| 2-year yield | 0.212*** | 0.154*** | 0.128** | -0.022 | 0.229** | 0.046** | 0.187*** | 0.168*** |
| 10-year yield | 0.323*** | 0.295*** | 0.040 | -0.046 | 0.337*** | 0.054** | 0.068*** | 0.042*** |
| Aaa yield | 0.268*** | 0.225*** | 0.044 | 0.016 | 0.260** | -0.125 | 0.101*** | 0.093*** |
| Baa yield | 0.439*** | 0.402*** | 0.172** | 0.108** | 0.442*** | 0.238*** | 0.135*** | 0.097*** |
| 1y-3m spread | -0.019 | -0.118 | 0.030 | 0.011 | 0.093 | -0.148 | 0.036** | 0.017** |
| 10y-2y spread | 0.002 | -0.061 | 0.008 | -0.022 | 0.078 | -0.057 | 0.118*** | 0.090*** |
| Aaa-Baa spread | -0.036 | -0.139 | 0.075 | 0.071 | 0.153 | 0.036 | -0.040 | -0.129 |

Note: Each row shows cumulative squared errors $\Delta SSE_T$ in sample and out of sample for a number of predictive models of forecast errors. ***, ** and * represent rejection of the null hypothesis of no predictability of forecast errors at the 1, 5, and 10 percent level using bootstrapped critical values. Yield and spread variables are taken from Blue Chip, other variables are taken from the SPF.

not represent the forecasts of any one individual. The literature has also documented biases in expectations at the individual level, to which we now turn.

One of the most prominent recent studies examining the rationality of individual forecasts is Bordalo et al. (2020) (BGMS). They run a regression of the form:

$$y_{t+h} - y_{t+h|it} = \beta \left( y_{t+h|it} - y_{t+h|it-1} \right) + u_{it} \tag{17}$$

They document that the prediction of forecast errors with forecast revisions also works at the individual level, but often with a negative coefficient on the revision. This negative coefficient is interpreted as overreaction of forecasts: When forecasters raise their forecasts, they tend to overpredict.⁷

⁷ Like at the consensus level, we exclude a constant from the regression. We also omit fixed effects and pool the regression coefficient. Including individual-specific parameters would make OOS prediction very hard, due to the small number of observations in the individual samples (fewer than 10 on average in the SPF).

We further test the autocorrelation, Mincer and Zarnowitz (1969), and Nordhaus (1987) models at the individual level:

$$y_{t+h} - y_{t+h|it} = \beta \left( y_{t-1} - y_{t-1|it-h-1} \right) + u_{it+h} \tag{18}$$

$$y_{t+h} - y_{t+h|it} = \beta_0 + \beta_1 y_{t+h|it} + u_{it+h} \tag{19}$$

$$y_{t|it-h} - y_{t|it-h-1} = \beta \left( y_{t|it-h-1} - y_{t|it-h-2} \right) + u_{it+h}. \tag{20}$$

A variation of the Mincer and Zarnowitz (1969) model has recently been advanced by Kohlhas and Walther (2021), who regress forecast errors on the realized values of a variable. We only include the lagged value of the realization, as the current-period value is not part of the information set when the forecasts are made:

$$y_{t+h} - y_{t+h|it} = \beta_0 + \beta_1 y_{t-1} + u_{it+h} \tag{21}$$

For this model, we set $p = 2$ and $q = 0$ in the bootstrap (11).

Finally, we test a model based on forecast combination. It is well documented that combining forecasts from different people and models almost always improves forecasting performance in practice (see Timmermann, 2006, for a review). Based on this idea, we construct a model of biased expectations as follows:

$$y_{t+h} - y_{t+h|it} = \beta_0 + \beta_1 \left( \bar{y}_{t-1|t-h-1} - y_{t-1|it-h-1} \right) + u_{it+h}. \tag{22}$$

The variable on the right-hand side is the lagged difference between the consensus forecast and the individual forecast. The timing is important: We do not relate individual forecasts to the consensus forecast in the same period, since this object is not known to the agents at the time they complete the survey. However, past consensus forecasts as well as the agents’ own predictions are part of their information set, making this a valid test of rationality. For the bootstrap (11), here we set $p = 1$, $q = 0$.

We subject these models to OOS tests at the level of individual forecasts. If our failure to reject the null of no predictability at the consensus level were a problem of small sample size and estimation noise, we should expect predictive performance to improve at the individual level, where the cross-sectional dimension of the data greatly expands the number of observations.

In Figure 2, we show charts plotting $\Delta SSE_t$ of the first and last of these models, using CPI inflation, real GDP growth, and federal funds rate forecasts.

The left panels show the performance of the BGMS model (17). For CPI inflation (top left panel), the model’s performance is mixed. It outperforms the rational benchmark between 2008 and 2020, but these gains are erased at the end of the sample. Overall, we cannot reject the null of no predictability. For real GDP growth (middle left panel), the picture is the reverse: Tepid performance during most of the sample, then a big improvement at the end which lifts the OOS performance to levels for which we can reject the null. The estimated coefficient on revisions is negative throughout the sample, which allows the model to fit the overreaction of (aggregate) expectations after the Covid-19 shock in 2020.
For federal funds rate forecasts (bottom left panel), the behavioral model outperforms the null of rationality by eight percent, which is less than the Coibion and Gorodnichenko (2015) model at the consensus level but still quite strong.

The right panels of Figure 2 show the performance of the forecast combination bias

Figure 2: Prediction of individual professional forecast errors. (a) Revisions. (b) Forecast combination. [Each column contains three panels, for CPI inflation, the real GDP growth rate, and the federal funds rate, plotting in-sample and out-of-sample $\Delta SSE_t$ over time.]

Note: Dashed and solid lines represent cumulative squared errors $\Delta SSE_t$ for the in-sample regression and the out-of-sample regression, respectively. Dotted vertical lines mark the end of the training period and the beginning of the evaluation period. An increase in a line indicates better performance of the behavioral model; a decrease in a line indicates better performance of the rational model. Shaded areas represent NBER recessions.

model (22). We judge the performance of this model to be remarkable. It consistently performs well across all variables in our data set. For CPI inflation (top right panel), the reduction in the mean squared forecast error is about three percent. Moreover, the null of no predictability is rejected consistently over time: The black line in the chart steadily trends upward, never moving down appreciably. This is a beautiful example of a stable predictive relationship. Indeed, the estimated coefficients $\hat{\beta}_t$ in the OOS prediction also remain stable over the sample. A similar picture emerges for real GDP forecasts (middle right panel): Here, too, the forecast combination bias model outperforms the null consistently, and the performance is roughly double that of the BGMS model. For interest rate forecasts, this model achieves a reduction in the mean squared forecast error of close to five percent. What is remarkable is that the gains in predictive performance are accumulated steadily and robustly over time.

Results for all variables are shown in Table 3. Starting in Column (1), the BGMS model predicting forecast errors with revisions manages to achieve significant performance gains for a number of variables. This model seems to outperform the null of rationality in some areas, but not in others. Overall, we think that it does not represent a bias that is universal in professional forecasts.
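The pooled regression behind the forecast combination model (22) can be illustrated with a small sketch. This is a hypothetical toy example, not the authors' implementation: the helper names, the two-forecaster panel, and the one-period lag (which ignores the horizon $h$ in the timing of (22)) are all assumptions made for illustration.

```python
def ols_with_intercept(x, y):
    """Return (beta0, beta1) from an OLS regression of y on a constant and x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = s_xy / s_xx
    return y_bar - beta1 * x_bar, beta1


def combination_regressor(lagged_forecasts):
    """Lagged consensus (cross-sectional mean) minus each lagged individual forecast."""
    consensus = sum(lagged_forecasts) / len(lagged_forecasts)
    return [consensus - f for f in lagged_forecasts]


# Hypothetical panel: two forecasters whose forecasts sit a fixed distance
# above and below the realized value in every period.
realized = [2.0, 3.0, 1.0, 4.0, 2.5]
deviation = {"A": 0.5, "B": -0.5}

x, y = [], []
for t in range(1, len(realized)):
    lagged = [realized[t - 1] + deviation[i] for i in ("A", "B")]
    for i, gap in zip(("A", "B"), combination_regressor(lagged)):
        x.append(gap)                                         # regressor in (22)
        y.append(realized[t] - (realized[t] + deviation[i]))  # individual forecast error

beta0, beta1 = ols_with_intercept(x, y)
print(beta0, beta1)  # prints 0.0 1.0
```

In this stylized panel the deviations from the consensus are perfectly persistent, so the pooled slope is exactly one: a forecaster who was above the consensus last period overpredicts by the same amount this period. With real survey data the slope would be estimated recursively and evaluated out of sample.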

Table 3: Prediction of individual professional forecast errors.

| $\Delta SSE_{T,OOS}$ | (1) BGMS | (2) Autocorrelation | (3) Mincer-Zarnowitz | (4) Nordhaus | (5) Kohlhas-Walther | (6) Forecast combination |
|---|---|---|---|---|---|---|
| Inflation (deflator) | 0.007*** | 0.114*** | -0.513 | 0.007*** | -0.323 | 0.141*** |
| Inflation (CPI) | -0.003 | -0.036 | 0.043** | -0.009 | -0.036 | 0.034*** |
| Real GDP | 0.022*** | -0.029 | 0.042** | 0.004*** | 0.013 | 0.060*** |
| Industrial Production | -0.010 | -0.016 | 0.036** | -0.010 | 0.002 | 0.049*** |
| Nominal GDP | 0.011*** | -0.060 | -0.071 | 0.002** | -0.061 | 0.061*** |
| Unemployment rate | -0.114 | -0.048 | 0.010 | -0.065 | 0.024 | 0.013*** |
| Consumption | 0.058*** | -0.166 | -0.003 | 0.003** | -0.090 | 0.036*** |
| Non-residential inv. | -0.018 | -0.035 | -0.055 | -0.007 | -0.102 | 0.064*** |
| Residential inv. | -0.019 | 0.046*** | -0.037 | -0.021 | -0.092 | 0.108*** |
| Federal govt. | 0.082*** | 0.063*** | -0.009 | -0.014 | -0.027 | 0.148*** |
| Non-federal govt. | 0.123*** | 0.039*** | 0.126*** | -0.007 | -0.048 | 0.206*** |
| Housing starts | 0.004 | 0.208*** | -0.099 | -0.009 | -0.054 | 0.108*** |
| Federal funds rate | 0.080*** | 0.075*** | -0.070 | 0.083*** | 0.035** | 0.035*** |
| 3-month yield | 0.071*** | 0.132*** | 0.025** | 0.063*** | 0.123*** | 0.043*** |
| 6-month yield | 0.108*** | 0.145*** | 0.091*** | 0.089*** | 0.132*** | 0.031*** |
| 1-year yield | 0.085*** | 0.070*** | -0.007 | 0.071*** | 0.039** | 0.041*** |
| 2-year yield | 0.040*** | 0.032*** | -0.015 | 0.032*** | 0.04** | 0.044*** |
| 10-year yield | -0.003 | -0.016 | 0.008 | -0.012 | 0.099*** | 0.069*** |
| Aaa yield | 0.000 | 0.005 | -0.181 | -0.003 | -0.160 | 0.068*** |
| Baa yield | 0.001 | 0.197*** | 0.012 | -0.043 | 0.248*** | 0.130*** |
| 1y-3m spread | 0.084*** | -0.025 | 0.232*** | 0.108*** | -0.233 | 0.087*** |
| 10y-2y spread | -0.002 | -0.004 | 0.029** | -0.006 | 0.002 | 0.053*** |
| Aaa-Baa spread | 0.008 | -0.084 | 0.513*** | -0.069 | 0.066 | 0.067*** |

Note: Each row shows cumulative squared errors $\Delta SSE_T$ out of sample for a number of predictive models of forecast errors. ***, ** and * represent rejection of the null hypothesis of no predictability of forecast errors at the 1, 5, and 10 percent level using bootstrapped critical values.

Columns (2) through (5) show the performance of the autoregressive model, the Mincer-Zarnowitz model, the Nordhaus model, and the Kohlhas-Walther model. Among these models, the autoregressive model in Column (2) performs the best across the macroeconomic variables, so that there is some evidence that professional forecast errors are persistent at the individual level, though not universally so. For interest rates, it is the Nordhaus model in Column (4) that performs best, and with a similar magnitude as the BGMS model. To us, these findings are consistent with inefficient, slowly mean-reverting deviations of individual forecasts from the average. If individual revisions reflect such deviations, then positive revisions negatively predict forecast errors and future revisions, and forecast errors are autocorrelated. Such inefficient forecast dispersion could arise because forecasters respond to strategic diversification incentives by deviating from optimal forecasts (Gemmi and Valchev, 2021).

The forecast combination bias model in Column (6) precisely represents such inefficient deviations, as it predicts that forecasters who are optimistic relative to the average last period will be too optimistic. This model shows significant predictive gains across all variables in our data set. We judge this to be a remarkable achievement. It is worth noting that this result is stronger than the well-known fact that the consensus forecast is more efficient than individual forecasts, because the behavioral model is based on the lagged value, rather than the current value, of the consensus forecast. Combined with the earlier observation that the performance gains accrue steadily over time, we conclude that inefficient dispersion of individual forecasts is the most robust and stable departure from rational expectations in surveys of professional forecasters.

5 Results for Households

So far, we have focused on expectations of professional forecasters.
These individuals are usually well-educated specialists employed by financial institutions who expend a great amount of time and resources forming their expectations. Our failure to reject the null of rationality with our OOS tests may in fact indicate that the expectations of these individuals are quite close to rational. However, there also exist surveys of less sophisticated forecasters, particularly of households. We expect that households form less accurate expectations, and that it should be easier to reject the null of rational expectations.

The two main surveys of American households that elicit macroeconomic expectations are the Michigan survey and the Survey of Consumer Expectations (SCE). There are several differences in methodology between these two surveys, but most importantly, the SCE starts in 2013 while the Michigan survey goes back to 1978. We restrict ourselves to 12-month ahead inflation expectations ($h = 12$), as other expectations either have limited coverage or only have categorical response variables. We define realized inflation as headline CPI inflation.

We first aggregate the individual forecasts using both the average and the median, since there are meaningful differences between the two for households. At the aggregated level, we test the models (12)–(15). For the Coibion-Gorodnichenko model (12), we use the month-over-month difference in consecutive 12-month ahead inflation expectations as a proxy for forecast revisions due to data limitations. At the disaggregated individual level, we test a panel version of the mean bias model (13), as well as the Mincer-Zarnowitz model (19) and the forecast combination model (22). For the forecast combination test, we use the difference of the current forecast and last period’s consensus forecast to proxy for past disagreement, again due to data limitations. The bootstrap parameters are set in the same way as for the professional forecaster data, except for the forecast errors themselves. Fitting an MA(13) process, which would be natural under the null, is infeasible due to data limitations. Instead, we fit an MA(3) process.

Table 4 summarizes the results of our tests for households.

Column (1) shows the results from the simple mean bias model. In sample, it is relatively easy to detect a (positive) mean bias in household inflation expectations.⁸ But out of sample, this mean bias is difficult to exploit because it is hard to estimate its magnitude in real time.
Table 4: Prediction of household inflation forecast errors.

| $\Delta SSE_T$ | (1) Mean bias IS | (1) Mean bias OOS | (2) Revisions IS | (2) Revisions OOS | (3) Autocorrelation IS | (3) Autocorrelation OOS | (4) Mincer-Zarnowitz IS | (4) Mincer-Zarnowitz OOS | (5) Forecast combination IS | (5) Forecast combination OOS |
|---|---|---|---|---|---|---|---|---|---|---|
| Michigan avg. | 0.239*** | 0.028** | 0.003* | 0.002 | 0.086** | 0.063*** | 0.181*** | 0.008** | – | – |
| Michigan median | 0.012 | -0.226 | 0.001 | -0.005 | -0.001 | -0.043 | -0.034 | -0.252 | – | – |
| Michigan ind. | 0.024*** | -0.014 | – | – | – | – | 0.875*** | 0.617*** | 0.896*** | 0.889*** |
| SCE avg. | 0.250** | -0.104 | 0.000 | -0.002 | 0.178* | 0.032 | 0.355** | -1.491 | – | – |
| SCE median | -0.115 | -0.562 | 0.203*** | 0.050** | -0.013 | -0.061 | 0.184 | -2.139 | – | – |
| SCE ind. | -0.034 | -0.113 | – | – | – | – | 0.742*** | 0.632*** | 0.819*** | 0.799*** |

Note: Each row shows cumulative squared errors $\Delta SSE_T$ in sample and out of sample for a number of predictive models of forecast errors. ***, ** and * represent rejection of the null hypothesis of no predictability of forecast errors at the 1, 5, and 10 percent level using bootstrapped critical values. For the bootstrap on SCE data, we restrict the lag length of the MA process of $u_{it}$ to 3.

⁸ The bias is more pronounced for the average compared to the median, as the distribution of individual inflation forecasts is skewed to the upside.

The mean bias model still significantly improves Michigan average forecast errors OOS, but fails badly for the shorter sample in the SCE. The discrepancy between the two surveys can be attributed to their different sample windows. To see this, we show the evolution of OOS performance of the mean bias model in the Michigan survey over time in Figure 3. The left panel shows the predictive performance for the average expectation. Starting around 1990, the mean bias model made steady gains as average inflation expectations stayed stubbornly above actual inflation for two decades. However, after 2020, inflation soared much faster than inflation expectations, defying the predictions of a positive mean bias. Over the length of the Michigan sample, the overall predictive performance of the mean bias model is good enough so that we assign statistical significance using our bootstrapped critical values. But in the more limited sample of the SCE (not shown in the figure), the period of high inflation starting in 2021 takes a much greater share of the sample, which explains why the mean bias model underperforms the null of rational expectations in Table 4.
Also, the right panel of Figure 3 illustrates that median inflation expectations tracked actual inflation much more closely than average expectations, resulting in a mean bias that averages near zero in sample and dismal OOS performance of the mean bias model applied to median household expectations.

Figure 3: Prediction of Michigan inflation expectations: mean bias model. (a) Average expectations. (b) Median expectations. [Both panels plot in-sample and out-of-sample $\Delta SSE_t$ for CPI inflation, 1979 to 2019.]

Note: Dashed and solid lines represent cumulative squared errors $\Delta SSE_t$ for the in-sample regression and the out-of-sample regression, respectively. Dotted vertical lines mark the end of the training period and the beginning of the evaluation period. An increase in a line indicates better performance of the behavioral model; a decrease in a line indicates better performance of the rational model. Shaded areas represent NBER recessions.

Column (2) of Table 4 shows that using the Coibion and Gorodnichenko (2015) model of regressing forecast errors on revisions does not lead to any significant improvements in Michigan forecast errors, either in sample or out of sample. This model is quite unstable and the estimated coefficient is sometimes positive, sometimes negative. The model does work well for SCE medians, but the entire forecasting performance in that specification is generated in 2022, the last year of the sample. During that year, median forecast errors and median revisions were positive, consistent with underreaction of expectations. By contrast, the model is not able to improve on SCE average forecasts because the average expectation (consistently higher than the median) was remarkably close to actual inflation in 2022.

Column (3) shows the performance of the autoregressive model. Like the mean bias model, this model only works well for Michigan average expectations, but not for SCE averages, or medians in either survey. This pattern is an expression of the same positive mean bias in household average inflation expectations in the last two decades discussed above, because mean bias implies autocorrelation of forecast errors.⁹

⁹ Note that we omit a constant in the autoregressive model, so that mean bias implies a positive coefficient on lagged forecast errors in that model.

Column (4) shows the Mincer-Zarnowitz model. At the mean or median level, this model fares similarly to the mean bias model due to the inclusion of a constant term in

that model (in fact, the coefficient $\beta_1$ on forecasts in (15) is stable around one). What is most noticeable, however, is the stunning performance of this model at the individual level: More than 60 percent of the mean squared forecast error is predictable OOS using this model in both surveys. The model consistently predicts $\beta_1 \approx -1$ in Equation (19), which means that the individual forecasts are treated entirely as noise: the corrected forecast $y_{t+h|it} + \beta_0 + \beta_1 y_{t+h|it}$ collapses to the constant $\beta_0$. This behavior is a consequence of the fact that the dispersion in individual household inflation expectations dwarfs the variation in mean forecasts, and so it is best to disregard the individual variation in forecasts.

Excess dispersion of forecasts also explains why the forecast combination model in Column (5) performs well. More than 80 percent of the individual mean squared forecast error can be predicted using this model. The coefficient on the lagged difference of an individual’s forecast and the consensus is close to one and stable over time. The fact that the lagged difference predicts individual forecast errors so well points to the persistence in household inflation forecasts.

Summing up, household inflation forecasts do seem much more biased than those of professional forecasters. Their deviation from optimal forecasts occurs mostly at the individual level, where inflation forecasts display an excessive degree of dispersion. At the consensus level, there is some evidence of mean bias, although this bias is not stable over time. Instead, households’ mean bias in inflation expectations appears to be time-varying.

6 Robustness

6.1 Rolling window regression

A central theme that emerges from our analysis is that the behavioral models often seem unstable over time. Indeed, in our OOS regressions, many of the estimated model coefficients display sizable variation over time. Biases can be time- or state-dependent. If this is the case, real-time prediction of forecast errors is inherently more difficult, as

one now has to not only estimate past bias, but also predict how the bias will change in the future. This may explain the disconnect between IS and OOS predictability that we have often observed in our analysis so far. This explanation has also been put forward in the finance literature (Lettau and Van Nieuwerburgh, 2007). Traditional tests that have a constant parameter hypothesis, such as the ones we have used so far, are then misspecified.

One way to take time variation into account is to run our regressions with a rolling window instead of a recursive window. In keeping with the simplicity of our empirical approach, we choose rolling windows over a more sophisticated time-varying parameter regression. We keep the initial training sample period of 40 quarters, but now also use this as the size of a rolling window used to estimate the real-time OOS coefficients for prediction. Generally, we find that using rolling window regression does not help to predict forecast errors: The increased estimation noise from smaller window sizes outweighs any gain from capturing time variation in the true model coefficients. Because of the increase in estimation noise, our bootstrapped critical values for rejecting the null also decrease. The bootstrapped significance levels remain fairly similar to those of our baseline estimation. The online appendix contains detailed results of rolling window regressions.

6.2 Other forecast horizons

Our baseline estimation uses a three-quarter-ahead horizon, which is used by Coibion and Gorodnichenko (2015) and many other studies in the literature. But we also examine the robustness of our results to the choice of the forecast horizon. As documented in the online appendix, the results do not depend materially on this choice. There is some more OOS predictability at the “nowcast” horizon, i.e. of forecast errors of the current-quarter realizations ($h = 0$).
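The difference between the recursive and rolling estimation windows of Section 6.1 can be sketched as follows. This is an illustrative toy example rather than the authors' procedure: the function name, the no-intercept specification, the window length of 10, and the artificial series with a clean structural break are assumptions, and the information lag implied by the forecast horizon is ignored for brevity.

```python
def delta_sse_slope(x, y, train=40, window=None):
    """OOS gain of the no-intercept model y_t = beta * x_t + u_t over predicting zero.

    window=None uses all observations before t (recursive estimation);
    window=k uses only the k most recent observations (rolling estimation).
    """
    sse_rational = 0.0
    sse_behavioral = 0.0
    for t in range(train, len(y)):
        start = 0 if window is None else max(0, t - window)
        xs, ys = x[start:t], y[start:t]
        s_xx = sum(xi * xi for xi in xs)
        beta_hat = sum(xi * yi for xi, yi in zip(xs, ys)) / s_xx
        sse_rational += y[t] ** 2
        sse_behavioral += (y[t] - beta_hat * x[t]) ** 2
    return (sse_rational - sse_behavioral) / sse_rational


# Hypothetical data with a clean structural break: the slope flips from +1 to -1.
x = [1.0] * 80
y = [1.0] * 40 + [-1.0] * 40

recursive = delta_sse_slope(x, y, train=40)
rolling = delta_sse_slope(x, y, train=40, window=10)
print(recursive, rolling)
```

In this noise-free example the rolling window adapts to the break within ten periods and beats the benchmark, while the recursive window keeps averaging over the old regime and does worse than no correction. The example isolates only the benefit of a short window. With noisy data a shorter window also raises estimation error, which is the effect the text reports as dominating in practice.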

6.3 Adding an intercept

We have omitted an intercept term from the models whenever this can be motivated with an economic prior. For example, in the model of Coibion and Gorodnichenko (2015), forecast errors and forecast revisions both have zero unconditional mean, and yet the former is predictable by the latter, so an intercept is in principle unnecessary for the empirical regression. While having an intercept in an IS regression is common and amounts to nothing more than demeaning the data, in an OOS test this decision can have important consequences. An intercept is another parameter that needs to be estimated in real time, increasing estimation noise. Moreover, in small samples, distinguishing between the contributions of a highly autocorrelated variable and a constant can be challenging.

When we include an intercept in our regressions, the predictive performance of the models typically deteriorates except when there is a strong mean bias in the data to start with. As an illustration, consider the use of the Bordalo et al. (2020) model to predict individual inflation forecast errors using revisions at the individual level. Figure 4 contrasts the OOS performance of that model without an intercept (left panel) and with an intercept (right panel), for forecast errors of inflation based on the GDP deflator. Without an intercept, the model is able to outperform the null of rational expectations by a modest margin. But with an intercept, the performance is dismal. The reason is that the intercept is a regressor that has no cross-sectional variation and operates purely on the time-series dimension of the data, trying to fit mean bias in the average forecast. Because, as previously documented in Table 2, a mean bias model fits professional inflation forecasts poorly, this model inherits that bad performance. The online appendix documents that similar patterns hold for all models covered in this paper: Adding an intercept improves performance when the corresponding mean bias model performs well, while it worsens performance when the corresponding mean bias model performs poorly.
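One half of this pattern can be reproduced in a stylized setting. The sketch below is a hypothetical illustration, not the paper's code: it assumes a recursive OLS estimate, ignores the information lag from the forecast horizon, and uses an artificial series in which forecast errors have a pure, constant mean bias while the regressor is uninformative. The function name and data are invented for the example.

```python
def oos_gain(x, y, train, intercept):
    """OOS gain over predicting zero for y_t = [beta0 +] beta1 * x_t + u_t,
    re-estimated recursively by OLS on all observations before t."""
    sse_rational = 0.0
    sse_behavioral = 0.0
    for t in range(train, len(y)):
        xs, ys = x[:t], y[:t]
        if intercept:
            x_bar, y_bar = sum(xs) / t, sum(ys) / t
            s_xx = sum((xi - x_bar) ** 2 for xi in xs)
            s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
            beta1 = s_xy / s_xx
            beta0 = y_bar - beta1 * x_bar
        else:
            beta1 = sum(xi * yi for xi, yi in zip(xs, ys)) / sum(xi * xi for xi in xs)
            beta0 = 0.0
        sse_rational += y[t] ** 2
        sse_behavioral += (y[t] - beta0 - beta1 * x[t]) ** 2
    return (sse_rational - sse_behavioral) / sse_rational


# Hypothetical case with a pure mean bias: errors always equal 1, and the
# regressor (e.g. a forecast revision) alternates in sign and is uninformative.
x = [1.0 if t % 2 == 0 else -1.0 for t in range(60)]
y = [1.0] * 60

print(oos_gain(x, y, train=20, intercept=False))  # slightly negative
print(oos_gain(x, y, train=20, intercept=True))   # prints 1.0
```

Here the intercept captures the whole bias, so the model with an intercept removes all of the squared error, while the model without one cannot improve on the benchmark. The opposite case, in which there is no mean bias and the intercept only adds estimation noise, depends on sampling variation and is not shown.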

Figure 4: Effect of an intercept term on prediction (using revisions, individual level). (a) Without intercept. (b) With intercept. [Both panels plot in-sample and out-of-sample $\Delta SSE_t$ for the inflation rate for the GDP deflator, 1969 to 2017.]

Note: Dashed and solid lines show cumulative squared errors for the in-sample regression $\Delta SSE^{IS}_t$ and the out-of-sample regression $\Delta SSE^{OOS}_t$, respectively, expressed as a fraction of the total sum of squared forecast errors over the evaluation period. Dotted vertical lines mark the end of the training period and the beginning of the evaluation period. An increase in a line indicates better performance of the behavioral model; a decrease in a line indicates better performance of the rational model. Shaded areas represent NBER recessions.

6.4 Data transformations

In our tests, we have transformed macroeconomic data following the conventions of the literature. In particular, we convert level forecasts of macroeconomic aggregates in the SPF into growth rate forecasts. Our results are robust to a range of data transformations, including taking log differences instead of working with growth rates in percentage points; working with quarter-over-quarter growth rates instead of annualized growth rates; and working with growth rates between the quarter of the forecast horizon and the previous quarter instead of growth rates between the quarter of the forecast horizon and the quarter before the survey date.¹⁰

¹⁰ Detailed results of these additional tests are available upon request.

7 Conclusion

This paper has shown that many models of biases in expectations documented in the literature are not robust to out-of-sample tests. These models seem unstable and would not have helped a forecaster to improve their predictions in real time. This general finding holds for professional forecasters and households.

However, there are some notable exceptions to this finding. First, interest rate forecasts display a stable mean bias that can be used to greatly improve forecasts out of sample. Second, there is some evidence for mean bias and autocorrelation in household inflation expectations. Third, individual expectations of professional forecasters display excess dispersion from the consensus for every variable and forecast horizon available. This excess dispersion is even more striking in inflation expectations in household surveys.

We hope that our findings will be useful to researchers in behavioral macroeconomics, where facts about deviations from rational expectations abound and models are validated by their success in matching moments corresponding to these facts. Our out-of-sample tests provide a simple and natural way to focus on those facts that are most robust in the data and the associated research questions. These are: Why did forecasters systematically overpredict interest rates for decades? Why have household inflation expectations been so high for so long? And why do people disagree so much about the future, seemingly ignorant of the benefits of forecast combination?

References

Afrouzi, Hassan and Laura Veldkamp, “Biased Inflation Forecasts,” 2019 Meeting Papers 894, Society for Economic Dynamics 2019.

Andolfatto, David, Scott Hendry, and Kevin Moran, “Are inflation expectations rational?,” Journal of Monetary Economics, 2008, 55 (2), 406–422.

Angeletos, George-Marios, Fabrice Collard, and Harris Dellas, “Quantifying Confidence,” Econometrica, 2018, 86 (5), 1689–1726.

Angeletos, George-Marios, Zhen Huo, and Karthik A. Sastry, “Imperfect Macroeconomic Expectations: Evidence and Theory,” NBER Macroeconomics Annual, 2021, 35, 1–86.

Bhandari, Anmol, Jaroslav Borovička, and Paul Ho, “Survey Data and Subjective Beliefs in Business Cycle Models,” Working paper 2022.

Bianchi, Francesco, Sydney C. Ludvigson, and Sai Ma, “Belief Distortions and Macroeconomic Fluctuations,” American Economic Review, July 2022, 112 (7), 2269–2315.

Bonham, Carl S. and Douglas C. Dacy, “In Search of a ‘Strictly Rational’ Forecast,” Review of Economics and Statistics, 1991, 73 (2), 245–253.

Bordalo, Pedro, Nicola Gennaioli, Yueran Ma, and Andrei Shleifer, “Overreaction in Macroeconomic Expectations,” American Economic Review, September 2020, 110 (9), 2748–82.

Bürgi, Constantin and Julio Ortiz, “Overreaction Through Expectation Smoothing,” Working paper 2022.

Clark, Todd E. and Kenneth D. West, “Approximately normal tests for equal predictive accuracy in nested models,” Journal of Econometrics, 2007, 138 (1), 291–311.

Coibion, Olivier and Yuriy Gorodnichenko, “Information Rigidity and the Expectations Formation Process: A Simple Framework and New Facts,” American Economic Review, August 2015, 105 (8), 2644–78.

Dovern, Jonas, Ulrich Fritsche, Prakash Loungani, and Natalia Tamirisa, “Information rigidities: Comparing average and individual forecasts for a large international panel,” International Journal of Forecasting, 2015, 31 (1), 144–154.

Elliott, Graham, Ivana Komunjer, and Allan Timmermann, “Biases in Macroeconomic Forecasts: Irrationality or Asymmetric Loss?,” Journal of the European Economic Association, 2008, 6 (1), 122–157.

Fama, Eugene F. and Kenneth R. French, “Dividend yields and expected stock returns,” Journal of Financial Economics, 1988, 22 (1), 3–25.

Farmer, Leland, Emi Nakamura, and Jon Steinsson, “Learning About the Long Run,” Working Paper 29495, National Bureau of Economic Research November 2021.

Gemmi, Luca and Rosen Valchev, “Biased Surveys,” Working paper, Boston College 2021.

Goyal, Amit, Ivo Welch, and Athanasse Zafirov, “A Comprehensive Look at the Empirical Performance of Equity Premium Prediction II,” Working paper 2021.

Hajdini, Ina and Andre Kurmann, “Predictable Forecast Errors in Full-Information Rational Expectations Models with Regime Shifts,” Working paper 2022.

Kohlhas, Alexandre N. and Ansgar Walther, “Asymmetric Attention,” American Economic Review, September 2021, 111 (9), 2879–2925.

Lettau, Martin and Stijn Van Nieuwerburgh, “Reconciling the Return Predictability Evidence,” Review of Financial Studies, December 2007, 21 (4), 1607–1652.

Malmendier, Ulrike and Stefan Nagel, “Learning from Inflation Experiences,” Quarterly Journal of Economics, October 2015, 131 (1), 53–87.

McElroy, Tucker and Simon Sheng, “Augmented Information Rigidity Test,” Working paper 2022.

Mincer, Jacob A. and Victor Zarnowitz, “The evaluation of economic forecasts,” in “Economic forecasts and expectations: Analysis of forecasting behavior and performance,” NBER, 1969, pp. 3–46.

Nagel, Stefan and Zhengyang Xu, “Dynamics of Subjective Risk Premia,” Working Paper 29803, NBER February 2022.

Nordhaus, William D., “Forecasting Efficiency: Concepts and Applications,” The Review of Economics and Statistics, 1987, 69 (4), 667–674.

Pearce, Douglas K., “Short-term Inflation Expectations: Evidence from a Monthly Survey: Note,” Journal of Money, Credit and Banking, 1987, 19 (3), 388–395.

Pfäuti, Oliver and Fabian Seyrich, “A Behavioral Heterogeneous Agent New Keynesian Model,” Working paper 2022.

Rungcharoenkitkul, Phurichai and Fabian Winkler, “The Natural Rate of Interest Through a Hall of Mirrors,” FEDS working paper 2022-010, Board of Governors of the Federal Reserve System 2022.

Sargent, Thomas J., “Rational Expectations,” in David R. Henderson, ed., The Concise Encyclopedia of Economics, Liberty Fund, 2007.

Timmermann, Allan, “Forecast Combinations,” in G. Elliott, C. Granger, and A. Timmermann, eds., Handbook of Economic Forecasting, Vol. 1, Elsevier, 2006, chapter 4, pp. 135–196.

Welch, Ivo and Amit Goyal, “A Comprehensive Look at The Empirical Performance of Equity Premium Prediction,” Review of Financial Studies, March 2007, 21 (4), 1455–1508.

Winkler, Fabian, “The role of learning for asset prices and business cycles,” Journal of Monetary Economics, 2020, 114, 42–58.

A Additional results

In this appendix, we report results of additional OOS tests using rolling window regressions, alternative forecast horizons, an added intercept, and a different way of transforming the raw data.

Table A1 contains results from the consensus tests in Sections 4.1 and 4.2 but using rolling window regressions with a window length of 40 periods, i.e. ten years. The critical values for the bootstrap have been recomputed for the rolling window version of the $\Delta SSE_T$ statistic. Across all tests, the predictive OOS performance of the rolling regressions is worse than that of our benchmark recursive window regressions: the added noise from the shorter window length outweighs the benefits of accounting for time variation in the parameters. Table A2 contains the results for the individual models reported in Section 4.3 using rolling window regressions. Here, too, the predictive OOS performance of the rolling regressions is mostly worse than that of our benchmark recursive window regressions, although it is a little better for the Kohlhas-Walther model. Table A3 contains the results for household data reported in Section 5 using rolling window regressions. Here, the rolling window performance is almost identical to the recursive window performance.

The next two tables show results for different forecast horizons for a subset of the variables in professional forecasts. Table A4 contains results at the consensus level while Table A5 contains results at the individual level. The results are broadly similar across forecast horizons. At times, the nowcast (h = 0) performs better than longer horizons and also crosses the significance thresholds of our bootstrap. Thus, nowcasts seem somewhat more predictable than longer-horizon forecasts. This may be due to higher power, as the forecast errors have lower variance at the short horizon.

Next, this appendix documents the OOS performance of some of the models when an intercept is added. Adding an intercept makes the OOS estimation noisier and lowers the power of the OOS test, but also allows the model to fit the data better if the true model contains an intercept.
Table A6 shows results for professional forecasters at the consensus level and Table A7 shows results at the individual level. For the interest rate forecasts, adding an intercept to the models generally improves the OOS performance. This is consistent with the good OOS fit of the mean bias model reported in the paper. For the other variables, the performance generally deteriorates, especially for the Coibion and Gorodnichenko (2015), Bordalo et al. (2020) and forecast combination models.
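The comparison between recursive and rolling window regressions, with or without an intercept, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function name `delta_sse_oos`, its arguments, the single-regressor setup, and the simulated data are all assumptions made for the example. It also assumes that every forecast error dated before period t is already observed when the prediction for t is made, which ignores the publication lags that matter at longer horizons. The statistic is the reduction in squared error relative to a rational benchmark that always predicts a zero forecast error, scaled by the benchmark's sum of squared errors.

```python
import numpy as np


def delta_sse_oos(errors, predictor, train_end, window=None, intercept=False):
    """Scaled out-of-sample gain from predicting forecast errors.

    errors[t]    : realized forecast error in period t
    predictor[t] : regressor assumed to be observable before errors[t]
    train_end    : first period of the evaluation sample
    window       : None for a recursive (expanding) window, or an integer
                   number of most recent observations for a rolling window
    intercept    : whether the predictive regression includes a constant
    """
    errors = np.asarray(errors, dtype=float)
    predictor = np.asarray(predictor, dtype=float)
    gain = 0.0          # running sum of e_t^2 - (e_t - e_hat_t)^2
    sse_rational = 0.0  # benchmark: predicted forecast error is always zero

    for t in range(train_end, len(errors)):
        start = 0 if window is None else max(0, t - window)
        x, y = predictor[start:t], errors[start:t]
        X = np.column_stack([np.ones_like(x), x]) if intercept else x[:, None]
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS on past data only
        x_t = np.array([1.0, predictor[t]]) if intercept else np.array([predictor[t]])
        e_hat = float(x_t @ beta)
        gain += errors[t] ** 2 - (errors[t] - e_hat) ** 2
        sse_rational += errors[t] ** 2

    return gain / sse_rational


# Simulated data (purely illustrative): errors load weakly on the predictor.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
e = 0.3 * x + rng.normal(size=200)

recursive = delta_sse_oos(e, x, train_end=80)
rolling = delta_sse_oos(e, x, train_end=80, window=40)
with_const = delta_sse_oos(e, x, train_end=80, intercept=True)
print(f"recursive: {recursive:.3f}  rolling(40): {rolling:.3f}  with intercept: {with_const:.3f}")
```

A positive value means the predictive regression would have lowered squared forecast errors in real time; a negative value means the zero-error benchmark did better. The value can never exceed one, since the squared error of the predictive model is never negative.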

Table A1: Prediction of consensus professional forecast errors: rolling window regressions.

                        (1)               (2)               (3)               (4)               (5)
$\Delta SSE_{T,OOS}$    Revisions         Mean bias         Autocorrelation   Mincer-Zarnowitz  Nordhaus
Inflation (deflator)    -0.052            -0.254            0.039**           -0.557            0.051***
Inflation (CPI)         -0.028            -0.101            -0.089            -0.177            0.046***
Real GDP                -0.083            -0.067            -0.146            -0.190            0.083***
Industrial Production   -0.052            -0.021            -0.066            -0.076            -0.056
Nominal GDP             -0.108            -0.073            -0.484            -0.043**          0.028***
Unemployment rate       -0.046            -0.066            -0.156            -0.187            -0.150
Consumption             -0.019            -0.069            -0.075            0.014**           0.069***
Non-residential inv.    -0.015            -0.058            -0.034            -0.069            0.062***
Residential inv.        0.042***          -0.168            0.081**           -0.590            0.167***
Federal govt.           -0.063            -0.200            0.012             -0.019            -0.037
Non-federal govt.       -0.179            -0.047            -0.056            -0.090            0.048***
Housing starts          0.088***          -0.231            0.135***          -0.524            0.223***
Federal funds rate      0.193***          0.041**           -0.003            -0.080            0.200***
3-month yield           0.173***          0.109***          0.058**           -0.025**          0.197***
6-month yield           0.211***          0.153***          0.107***          -0.056            0.230***
1-year yield            0.193***          0.128**           0.041             -0.167            0.211***
2-year yield            0.143***          0.151***          0.010             -0.230            0.171***
10-year yield           0.034***          0.266***          -0.017            0.010**           0.058***
Aaa yield               0.024**           0.190***          -0.182            -0.136            0.082***
Baa yield               0.031**           0.385***          0.077**           0.185**           0.092***
1y-3m spread            -0.021            -0.124            -0.035            -0.094            -0.007
10y-2y spread           0.068***          -0.063            0.002             -0.135            0.072***
Aaa-Baa spread          -0.026            -0.094            0.059             -0.037            -0.129

Note: Each row shows cumulative squared errors $\Delta SSE_T$ in sample and out of sample. ***, ** and * represent rejection of the null hypothesis of no predictability of forecast errors at the 1, 5, and 10 percent level using bootstrapped critical values. Yield and spread variables are taken from Blue Chip, other variables are taken from the SPF.

Table A2: Prediction of individual professional forecast errors: rolling window regressions.

                        (1)               (2)               (3)               (4)               (5)               (6)
$\Delta SSE_{T,OOS}$    BGMS              Autocorrelation   Mincer-Zarnowitz  Nordhaus          Kohlhas-Walther   Forecast combination
Inflation (deflator)    0.001**           0.102***          -0.280            0.002***          -0.133            0.122***
Inflation (CPI)         -0.013            -0.054            -0.077            -0.023            -0.104            0.035***
Real GDP                0.034***          -0.166            -0.081            0.021***          0.054***          0.057***
Industrial Production   -0.024            -0.077            -0.044            -0.005            0.010**           0.049***
Nominal GDP             0.006***          -0.340            -0.142            0.001**           -0.012            0.057***
Unemployment rate       -0.023            -0.187            -0.135            0.007***          -0.050            0.016***
Consumption             0.077***          -0.112            -0.095            0.007***          0.021**           0.036***
Non-residential inv.    0.008**           -0.034            -0.042            -0.002            0.031**           0.059***
Residential inv.        -0.011            0.084***          -0.072            -0.015            0.009             0.099***
Federal govt.           0.081***          0.066***          -0.276            -0.042            -0.005            0.126***
Non-federal govt.       0.117***          0.025***          0.213***          0.005***          -0.058            0.169***
Housing starts          0.008***          0.212***          -0.560            -0.012            -0.220            0.100***
Federal funds rate      0.111***          0.040***          -0.065            0.077***          0.184***          0.033***
3-month yield           0.087***          0.097***          0.004             0.057***          0.252***          0.041***
6-month yield           0.105***          0.117***          -0.064            0.073***          0.267***          0.030***
1-year yield            0.084***          0.068***          -0.173            0.062***          0.204***          0.039***
2-year yield            0.060***          0.053***          -0.225            0.035***          0.226***          0.042***
10-year yield           0.007**           0.017**           -0.004            -0.013            0.257***          0.064***
Aaa yield               -0.004            -0.070            -0.145            -0.011            0.102***          0.063***
Baa yield               -0.008            0.142***          -0.052            -0.050            0.285***          0.115***
1y-3m spread            0.077***          -0.016            0.327***          0.100***          0.010             0.079***
10y-2y spread           -0.006            0.004             -0.062            -0.014            -0.034            0.053***
Aaa-Baa spread          0.007             -0.096            0.293***          -0.071            -0.006            0.066***

Note: Each row shows cumulative squared errors $\Delta SSE_T$ in sample and out of sample for a number of predictive models of forecast errors. ***, ** and * represent rejection of the null hypothesis of no predictability of forecast errors at the 1, 5, and 10 percent level using bootstrapped critical values.

Table A3: Prediction of household forecast errors for inflation: rolling regressions.

                        (1)               (2)               (3)               (4)               (5)
$\Delta SSE_{T,OOS}$    Mean bias         Revisions         Autocorrelation   Mincer-Zarnowitz  Forecast combination
Michigan mean           0.120***          -0.008            -0.165***         -0.139            –
Michigan median         -0.089            -0.018            -0.694            -0.358            –
Michigan ind.           0.007***          –                 –                 0.822***          0.889***
SCE mean                0.178             -0.013            -0.212            -0.364            –
SCE median              -0.280            0.048**           -0.140            -0.357            –
SCE ind.                -0.053            –                 –                 0.676***          0.803***

Note: Each row shows cumulative squared errors $\Delta SSE_T$ in sample and out of sample for a number of predictive models of forecast errors. ***, ** and * represent rejection of the null hypothesis of no predictability of forecast errors at the 1, 5, and 10 percent level using bootstrapped critical values. For the bootstrap on SCE data, we restrict the lag length of the MA process of $u_t$ to 3. Forecast revisions are proxied by lagged forecasts.

Table A4: Prediction of consensus professional forecast errors: alternative horizons.

                              (1)               (2)               (3)               (4)               (5)
                        h     Revisions         Mean bias         Autocorrelation   Mincer-Zarnowitz  Nordhaus
Inflation (deflator)    0     0.008**           -0.035            0.040***          -0.039            -0.027
                        1     -0.078            -0.140            0.078***          -0.128            0.047***
                        2     -0.001            -0.270            0.156***          -0.282            0.066***
Real GDP                0     -0.225            -0.017            -0.028            -0.054            -0.079
                        1     -0.195            -0.019            -0.003            -0.094            -0.105
                        2     -0.160            -0.033            -0.013            -0.137            -0.074
Industrial Production   0     -0.074            -0.027            0.077***          0.064***          -0.106
                        1     -0.135            -0.016            -0.039            -0.031            -0.068
                        2     -0.040            -0.006            -0.035            -0.017            -0.007
Unemployment rate       0     -1.622            0.041***          0.110***          0.088***          -0.169
                        1     -0.369            -0.003            -0.098            0.002             -0.180
                        2     -0.367            -0.014            -0.057            -0.021            -0.126
Housing starts          0     0.090***          -0.016            0.051***          0.013**           0.071***
                        1     0.018**           -0.028            -0.005            -0.039            0.147***
                        2     0.066***          -0.037            0.059***          -0.066            0.205***
3-month yield           0     0.029**           0.439***          0.387***          0.508***          0.144***
                        1     0.126***          0.160***          0.127***          0.180***          0.186***
                        2     0.160***          0.121***          0.151***          0.112***          0.217***
10-year yield           0     0.087***          0.082***          0.004             0.037***          0.017**
                        1     0.012             0.132***          -0.027            0.030**           0.027**
                        2     -0.032            0.208***          -0.039            0.061**           0.042***
10y-2y spread           0     0.046***          -0.013            0.082***          -0.024            0.030**
                        1     0.028**           -0.023            0.032**           -0.035            0.055***
                        2     0.071***          -0.046            0.024             -0.050            0.09***

Note: Each row shows cumulative squared errors for the in-sample regression $\Delta SSE_T^{IS}$ and out-of-sample regression $\Delta SSE_T^{OOS}$, respectively. All series are scaled by $SSE_T^R$, so that values correspond to the fraction of the mean squared forecast error predicted by the behavioral model. ***, ** and * represent rejection of the null hypothesis of no predictability of forecast errors at the 1, 5, and 10 percent level using bootstrapped critical values. Yield and spread variables are taken from Blue Chip, other variables are taken from the SPF.
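The scaling described in the note can be written out explicitly. The expression below is a sketch inferred from the table and figure notes, not a formula quoted from the paper. The symbols $e_s$ (realized forecast error), $\hat e_s^{OOS}$ (forecast error predicted out of sample by the behavioral model) and $T_0$ (last period of the training sample) are notation introduced here, and the rational benchmark is assumed to predict a zero forecast error.

```latex
\Delta SSE_T^{OOS}
  = \frac{\sum_{s=T_0+1}^{T}\left[\, e_s^2 - \left(e_s - \hat e_s^{\,OOS}\right)^2 \right]}{SSE_T^R},
\qquad
SSE_T^R = \sum_{s=T_0+1}^{T} e_s^2 .
```

Under this reading, an entry of 0.10 would mean that the behavioral model lowers the sum of squared forecast errors by 10 percent relative to the rational benchmark over the evaluation period, and a negative entry would mean that it does worse than the benchmark.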

Table A5: Prediction of individual professional forecast errors: alternative horizons.

                              (1)               (2)               (3)               (4)               (5)               (6)
                        h     BGMS              Autocorrelation   Mincer-Zarnowitz  Nordhaus          Kohlhas-Walther   Forecast combination
Inflation (deflator)    0     0.129***          0.080***          0.014**           0.006***          -0.015            0.041***
                        1     0.038***          0.133***          -0.116            0.010***          -0.038            0.076***
                        2     0.008***          0.161***          -0.279            0.007***          -0.091            0.117***
Real GDP                0     -0.069            -0.006            -0.063            -0.002            -0.016            0.018***
                        1     0.038***          -0.024            0.062***          0.005***          0.006**           0.024***
                        2     0.023***          -0.032            0.034**           0.004***          0.017**           0.048***
Industrial Production   0     0.013***          0.046***          0.015**           -0.015            -0.027            0.009***
                        1     -0.022            -0.020            0.010             -0.008            -0.012            0.020***
                        2     -0.012            -0.011            0.024**           -0.010            -0.007            0.044***
Unemployment rate       0     -0.413            0.058***          0.074***          -0.081            0.022***          0.010***
                        1     -0.217            -0.111            0.022***          -0.083            0.000             -0.007
                        2     -0.202            -0.084            0.012             -0.065            -0.010            0.003***
Housing starts          0     -0.001            0.087***          -0.015            -0.007            0.007**           0.038***
                        1     -0.010            0.036***          -0.034            -0.006            -0.014            0.066***
                        2     -0.002            0.148***          -0.057            -0.009            -0.018            0.095***
3-month yield           0     -0.008            0.217***          0.315***          0.053***          0.287***          0.025***
                        1     0.045***          0.123***          0.162***          0.065***          0.181***          0.012***
                        2     0.067***          0.153***          0.079***          0.059***          0.170***          0.027***
10-year yield           0     -0.001            0.020***          0.008***          -0.004            0.012***          0.005***
                        1     -0.006            -0.019            0.028***          -0.008            0.062***          0.024***
                        2     -0.010            -0.004            0.051***          -0.012            0.095***          0.042***
10y-2y spread           0     0.014***          0.020***          0.010***          -0.008            0.001             0.009***
                        1     -0.011            0.040***          0.014***          -0.008            0.002             0.020***
                        2     -0.004            0.037***          0.013***          -0.006            -0.005            0.037***

Note: Each row shows cumulative squared errors for the in-sample regression $\Delta SSE_T^{IS}$ and out-of-sample regression $\Delta SSE_T^{OOS}$, respectively. All series are scaled by $SSE_T^R$, so that values correspond to the fraction of the mean squared forecast error predicted by the behavioral model. ***, ** and * represent rejection of the null hypothesis of no predictability of forecast errors at the 1, 5, and 10 percent level using bootstrapped critical values. Yield and spread variables are taken from Blue Chip, other variables are taken from the SPF.

Table A6: Prediction of consensus professional forecast errors: adding an intercept.

                        (1)               (2)               (3)
$\Delta SSE_{T,OOS}$    Revisions         Autocorrelation   Nordhaus
Inflation (deflator)    -0.143            -0.084            0.001
Inflation (CPI)         -0.053            -0.106            0.005
Real GDP                -0.145            -0.021            -0.029
Industrial Production   0.032**           0.013             0.032***
Nominal GDP             -0.157            -0.070            -0.036
Unemployment rate       -0.274            -0.064            -0.124
Consumption             -0.049            -0.152            -0.076
Non-residential inv.    -0.125            -0.136            0.014**
Residential inv.        -0.088            -0.097            0.111***
Federal govt.           -0.075            -0.091            0.007
Non-federal govt.       -0.134            -0.076            0.089***
Housing starts          0.021**           0.075***          0.206***
Federal funds rate      0.211***          0.108**           0.213***
3-month yield           0.238***          0.174***          0.212***
6-month yield           0.266***          0.198***          0.251***
1-year yield            0.243***          0.149***          0.238***
2-year yield            0.192***          0.139**           0.191***
10-year yield           0.283***          0.27***           0.120***
Aaa yield               0.226***          0.205***          0.127***
Baa yield               0.382***          0.411***          0.162***
1y-3m spread            -0.144            -0.332            0.030**
10y-2y spread           0.017             -0.070            0.079***
Aaa-Baa spread          -0.199            -0.105            -0.180

Note: Each row shows cumulative squared errors $\Delta SSE_T$ in sample and out of sample. ***, ** and * represent rejection of the null hypothesis of no predictability of forecast errors at the 1, 5, and 10 percent level using bootstrapped critical values. Yield and spread variables are taken from Blue Chip, other variables are taken from the SPF.

Table A7: Prediction of individual professional forecast errors: adding an intercept.

                        (1)               (2)               (3)               (4)
$\Delta SSE_{T,OOS}$    BGMS              Autocorrelation   Nordhaus          Forecast combination
Inflation (deflator)    -0.263            -0.074            -0.062            -0.157
Inflation (CPI)         -0.085            -0.088            -0.044            -0.038
Real GDP                0.013             -0.035            0.024***          0.042**
Industrial Production   0.013             -0.001            0.014***          0.06***
Nominal GDP             -0.020            -0.084            0.007**           0.037**
Unemployment rate       -0.136            -0.075            -0.066            -0.012
Consumption             0.027             -0.189            0.004             0.009
Non-residential inv.    -0.091            -0.155            -0.011            -0.008
Residential inv.        -0.084            -0.038            0.003             0.033
Federal govt.           0.029**           0.019**           -0.014            0.082***
Non-federal govt.       0.091***          0.011             -0.004            0.139***
Housing starts          -0.029            0.18***           -0.004            0.065**
Federal funds rate      0.111***          0.111***          0.062***          0.117***
3-month yield           0.179***          0.188***          0.055***          0.209***
6-month yield           0.197***          0.175***          0.104***          0.182***
1-year yield            0.163***          0.140***          0.093***          0.165***
2-year yield            0.162***          0.139***          0.022***          0.205***
10-year yield           0.225***          0.218***          0.012***          0.293***
Aaa yield               0.104***          0.099***          0.036***          0.167***
Baa yield               0.291***          0.308***          0.039***          0.396***
1y-3m spread            -0.003            -0.162            0.103***          0.002
10y-2y spread           -0.051            -0.051            -0.028            -0.001
Aaa-Baa spread          -0.113            -0.531            -0.159            -0.034

Note: Each row shows cumulative squared errors $\Delta SSE_T$ in sample and out of sample for a number of predictive models of forecast errors. ***, ** and * represent rejection of the null hypothesis of no predictability of forecast errors at the 1, 5, and 10 percent level using bootstrapped critical values.

Cite this document

APA

Kenneth Eva and Fabian Winkler (2023). A Comprehensive Empirical Evaluation of Biases in Expectation Formation (FEDS 2023-042). Board of Governors of the Federal Reserve System, Finance and Economics Discussion Series. https://whenthefedspeaks.com/doc/feds_2023-042

BibTeX

@techreport{wtfs_feds_2023_042,
  author = {Kenneth Eva and Fabian Winkler},
  title = {A Comprehensive Empirical Evaluation of Biases in Expectation Formation},
  type = {Finance and Economics Discussion Series},
  number = {2023-042},
  institution = {Board of Governors of the Federal Reserve System},
  year = {2023},
  url = {https://whenthefedspeaks.com/doc/feds_2023-042},
  abstract = {We revisit predictability of forecast errors in macroeconomic survey data, which is often taken as evidence of behavioral biases at odds with rational expectations. We argue that to reject rational expectations, one must be able to predict forecast errors out of sample. However, the regressions used in the literature often perform poorly out of sample. The models seem unstable and could not have helped to improve forecasts with access only to available information. We do find some notable exceptions to this finding, in particular mean bias in interest rate forecasts, that survive our out-of-sample tests. Our findings help narrow down the set of biases that merit closer attention of researchers in behavioral macroeconomics.},
}