Total Recall? Evaluating the Macroeconomic Knowledge of Large Language Models
Abstract
We evaluate the ability of large language models (LLMs) to estimate historical macroeconomic variables and data release dates. We find that LLMs have precise knowledge of some recent statistics, but performance degrades as we go farther back in history. We highlight two particularly important kinds of recall errors: mixing together first print data with subsequent revisions (i.e., smoothing across vintages) and mixing data for past and future reference periods (i.e., smoothing within vintages). We also find that LLMs can often recall individual data release dates accurately, but aggregating across series shows that on any given day the LLM is likely to believe it has data in hand which has not been released. Our results indicate that while LLMs have impressively accurate recall, their errors point to some limitations when used for historical analysis or to mimic real time forecasters.
Finance and Economics Discussion Series Federal Reserve Board, Washington, D.C. ISSN 1936-2854 (Print) ISSN 2767-3898 (Online) Total Recall? Evaluating the Macroeconomic Knowledge of Large Language Models Leland D. Crane, Akhil Karra, Paul E. Soto 2025-044 Please cite this paper as: Crane, D. Leland, Akhil Karra, Paul E. Soto (2025). “Total Recall? Evaluating the Macroeconomic Knowledge of Large Language Models,” Finance and Economics Discussion Series 2025-044. Washington: Board of Governors of the Federal Reserve System, https://doi.org/10.17016/FEDS.2025.044. NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. References in publications to the Finance and Economics Discussion Series (other than acknowledgement) should be cleared with the author(s) to protect the tentative character of these papers.
Leland D. Crane†  Akhil Karra‡  Paul E. Soto†
June 24, 2025
* We thank Gary Cornwall, Anne Hansen, participants in Board brown bags, and participants at the 2025 SGE conference for useful comments. We thank Betsy Vrankovich for her technical expertise. Opinions expressed herein are those of the authors alone and do not necessarily reflect the views of the Federal Reserve System or the Board of Governors.
† Board of Governors of the Federal Reserve System
‡ Carnegie Mellon University
1 Introduction

The rise of large language models (LLMs) has generated interest in how they can be used for economic analysis and forecasting (e.g., Korinek 2023). The utility of LLMs depends on their understanding of economics-related facts and their ability to follow instructions precisely. We evaluate LLMs on several dimensions related to these capabilities. First, how well do LLMs estimate important macroeconomic variables from the past? Second, to what extent are LLMs' estimates contaminated with future information? And third, how well do LLMs recall data release dates? LLMs which have accurate knowledge of economic history (including data release dates) will likely be more useful when generating hypotheses and doing analysis. Separately, if LLMs can provide realistic quasi-real-time estimates, simulating forecasters from the past, then we can better understand how the LLM's forecasting process relates to human forecasts. On the other hand, LLM estimates which are inaccurate or contaminated with look-ahead bias may be of more limited use.

We find that for some variables LLMs have remarkable recall.1 The LLM we focus on, Claude Sonnet 3.5, can recall the quarterly values of the unemployment rate and CPI with fairly high accuracy back to WWII. However, it fares much more poorly on more volatile real activity series like real GDP growth and industrial production (IP) growth. The LLM appears to miss many of the high-frequency swings in these series, though it does capture business cycle variation well.

1 We use the term recall when the LLM is estimating a historical quantity which was (presumably) in its training data. This is distinct from "retrieval" in the context of retrieval augmented generation, where the LLM is backed by a search engine and reference documents. Our focus is on the LLM in isolation, and which historical facts it is able to estimate accurately.

Focusing on GDP, we develop evidence that the LLM estimate is a mixture of the first print value for the reference period and subsequent revised values for that reference period. This smoothing across data vintages appears regardless of whether we ask the LLM to provide the first print or the fully revised number. LLMs are trained on an enormous amount of data and, unless every part of the corpus is clearly date stamped and that information is embedded in the model weights by the training process, it won't always be clear when the text was written or which vintage of GDP it is referring to. The mixing of first print and fully revised data is problematic, because it means (1) the model has a less than accurate retrospective understanding of the economic situation, and (2) the model will have difficulty simulating a real-time forecaster.

A related but distinct question is whether LLM estimates for a given reference period are influenced by future and past reference periods, keeping the vintage constant. In other words, are LLM estimates of data published for date t affected by published data values from t+1? We develop a test for whether the LLM's estimate for a particular date is influenced by future shocks to the series, controlling for expectations. We find suggestive evidence that LLMs do indeed use future reference period values when constructing an estimate, even when instructed to ignore future information. Any such smoothing is again a challenge for historical analysis and using LLMs to mimic real-time forecasters.

Finally, we document the LLM's knowledge of economic data release dates. We find that LLMs often have an accurate idea of when historical data releases occurred. However, they sometimes miss the true release date by a few days. The results are also sensitive to the details of the prompt; we find that varying the prompt to reduce the number of estimated release dates that are late leads to an increase in estimated release dates that are too early. Our prompt engineering doesn't lead to a strategy that increases accuracy to a very high level; rather we end up trading off different types of errors. The conclusion is that the LLM doesn't have a very strong conception of the individual data release dates. We find that, aggregating across major economic indicators, on a typical day there is a good chance the LLM falsely believes at least some major data releases have occurred.
Interestingly, these errors are exactly the kind we would expect a human to make: sometimes too early, sometimes too late, and attempts to reduce one kind of error increase the other.

Our results paint a mixed picture of current LLM capabilities. LLM recall of historical data values and release dates is often very impressive. That said, there are also significant shortcomings in LLM recall, and the errors are often correlated with information from after the reference date. At a high level these errors are very human in that they can be interpreted as a good-faith effort to follow instructions while being hampered by a fuzzy recollection of the past. These patterns suggest that look-ahead bias may be an important challenge when using LLMs.

2 Literature Review

A number of recent papers have used LLMs for economic forecasting and analysis. Kim et al. (2024) find that an LLM can predict firm earnings when prompted with anonymized accounting data. Cook et al. (2023) use LLMs to analyze earnings calls. Pham and Cunningham (2024) present out-of-sample (i.e. post-knowledge cutoff) forecasts for inflation and Academy Awards. Schoenegger et al. (2024) show that GPT4 can help human forecasters on a variety of financial and political forecasting tasks, all of which occurred after the knowledge cutoff. Similarly, Phan et al. (2024) compare LLM forecasts with crowd-sourced forecasts. Jha et al. (2024) feed earnings call transcripts to GPT3.5 and show that it can help forecast capital investment and abnormal returns. As part of their robustness exercises they restrict the sample to the post-knowledge cutoff period, and separately try to anonymize the transcripts. Glasserman and Lin (2023) examine GPT3.5's ability to forecast stock returns from news headlines; they anonymize company names to avoid an in-sample "distraction" effect. Faria-e-Castro and Leibovici (2023) evaluate inflation forecasts from an LLM, both before and after the knowledge cutoff. Zarifhonarvar (2024) studies how different prompts and access to different information affect GPT4's inflation expectations. Separately, a strand of the literature has used LLMs as stand-ins for humans in surveys or strategic games (Manning et al., 2024; Kazinnik, 2024; Tranchero et al., 2024). Hansen et al. (2024) contribute to both literatures, simulating Survey of Professional Forecasters (SPF) respondents and evaluating the properties of the LLM-derived forecasts.
Finally, a number of papers use LLMs as classifiers for things like news headlines, and then use the classifications to build indicators like sentiment indexes (Shapiro et al., 2022; Bybee, 2023; Cajner et al., 2024; van Binsbergen et al., 2024).

Many of these papers acknowledge look-ahead bias (the potential for an LLM that is supposed to mimic an agent acting at time t to use information from t+1 or later) and attempt to address it with anonymization, post-knowledge-cutoff comparisons, and prompting techniques. Somewhat less has been done to directly measure the extent of look-ahead bias.2 Sarkar and Vafa (2024) is one exception: they show look-ahead bias arises in two contexts where GPT4 is asked to act as a real time forecaster. First, when assessing pre-pandemic earnings calls for risk factors, the LLM sometimes mentions pandemics and Covid. Second, the LLM is often able to "forecast" the winner of close elections. Lopez-Lira et al. (2025) evaluate recall and look-ahead bias for financial macroeconomic variables; interestingly, their estimates of recall accuracy are higher than ours, suggesting some model- or prompt-specific effects. We complement these papers by developing more formal tests of data leakage in the macroeconomic setting and exploring the LLM's understanding of data release dates, a critical factor for real-time forecasting. Ludwig et al. (2025) also discuss look-ahead bias in the context of congressional legislation and financial news. To address these concerns Sarkar (2024) and He et al. (2025) develop sequences of LLMs trained only on data up to a known point in time, but of course these models are much smaller than the commercially available ones and do not have the full set of capabilities available with frontier models. Look-ahead bias is also a focus of our paper; we add to the literature by quantifying several practically important types of look-ahead bias, e.g. the contamination of an LLM's memories of first-print data with later revisions and uncertainty about the timing of data releases. We also develop a test for whether LLMs' estimates are contaminated by future data values.

Assessing look-ahead bias is hard.
LLMs have attracted attention from forecasters precisely because there is reason to think they might prove useful for prediction. This means that high accuracy at forecasting cannot be counted as strong evidence of look-ahead bias; if LLMs are capable forecasters, we should expect them to beat some other forecasts. In this paper we take an indirect approach, focusing on the LLM's recall of historical data values and release dates. It appears easier to show that errors in recall are influenced by future information than it is to prove that a forecast is "too accurate". Note that Hansen et al. (2024) prompt the LLM with recent values of macroeconomic indicators to ground it and help improve performance; this strategy may also help mitigate look-ahead bias. Our work complements theirs by documenting the capabilities and limitations of the raw LLM without additional information passed into the prompt.

2 See Croushore (2011) for a detailed discussion of the related topics of data revisions and forecast instability in traditional forecasting.

Our assessment goes beyond the topic of look-ahead bias, as we test whether the LLM can accurately recall economic statistics in general. An analyst using an LLM to explore economic hypotheses would want the model to have a clear, precise understanding of economic history. Documenting the extent of recall and the limitations on LLMs' knowledge will assist researchers considering how to use these tools.

3 Models and Data

For most of the paper we focus on four macroeconomic time series: GDP, inflation, industrial production, and unemployment. Similarly to Hansen et al. (2024), we restrict our attention to quarterly values so that we can compare to the SPF. The details of the series are as follows:
• Gross Domestic Product (GDP): The seasonally adjusted annualized one quarter growth rate of real GDP
• Inflation: The four quarter change in the seasonally adjusted Consumer Price Index (CPI)
• Industrial Production (IP): The seasonally adjusted annualized one quarter growth rate of IP
• Unemployment: The one quarter average of the seasonally adjusted level of the unemployment rate

We use both the fully-revised (current vintage) numbers, as well as the first-print values.

3.1 Models

We use Anthropic's Claude Sonnet 3.5 large language model as provisioned through AWS Bedrock.3 Sonnet 3.5 is widely considered to be comparable to OpenAI's contemporaneous offerings (though it does not have the reasoning capabilities of o1 and later models), and it performs very well on benchmarks. Note that this model does not have internet search or tool use enabled; it cannot access any updated information aside from what is included in the prompt. We do not use OpenAI's models because we do not have an easy way to access them.

3 The model ID is anthropic.claude-3-5-sonnet-20240620-v1:0. This is the original Sonnet 3.5, not the newer version of Sonnet 3.5 released in October 2024.

3.2 Methodology

Our main queries instruct the LLM to think step-by-step, write out its reasoning, and only write the final answer at the end. This is intended to improve performance, as LLMs can benefit from reasoning step-by-step before committing to an answer (Wei et al., 2022). The system prompt can be found in Figure 18, and an example user prompt is shown in Figure 19.

The responses to the queries are verbose. We use a secondary "summarizer" LLM and prompt to extract the estimate from the responses. The summarizer is instructed to read the original response and return an answer approximately of the form "Answer:{estimate}", where {estimate} is the desired estimate. We then parse the summarizer's answers with a regular expression (regex) to extract the numeric point estimate.

It is worth noting that the development of the prompts is an iterative process. Our initial attempts yielded many ranges (not point estimates) and many failures to answer. To address this we added instructions to always produce an answer and to avoid giving ranges. As another example, our parser would sometimes fail to locate the answer. We found this was because the summarizer was not consistent about capitalizing "Answer", which we fixed by changing the regex.

3.3 Nondeterminism in Answers

In typical use LLM responses are stochastic. The LLM generates a response one token at a time and the token generated is a function of the text (either in the prompt or the incomplete response) up to that point in time.4 The LLM generates tokens by sampling from the model's probability distribution of next tokens, so more probable completions are chosen more often.

4 Tokens are words or word parts, for example "the" may be a single token but "generates" might be tokenized as generat, es

Several parameters govern the sampling process. In older, smaller LLMs (like GPT-2) the most important is the temperature. In simple LLMs a temperature of zero corresponds to an essentially deterministic response. However, frontier models include other factors (like mixture of experts) that introduce other sources of randomness.

We run each query several times and average estimates in order to attenuate the randomness in LLM responses. We also calculate the standard error of this mean estimate and use it to plot confidence intervals. The averaged responses are close to deterministic, and the confidence intervals show us where there is still significant randomness.

3.4 Choosing the Temperature

We need to evaluate how much the temperature parameter matters in our context and what value to set it to. Figure 1 shows two GDP estimates: one with the temperature set to one (the default), and one with the temperature set to zero.5 The two series are extremely similar. Their correlations with actual first-print GDP are also similar, though the temp.=1 series has a marginally higher correlation. Based on this, and the fact that the temperature is set to one by default, we use temp.=1 as the main specification in most of what follows.

5 For the temp.=0 version we also set the "top k" parameter equal to one; in a simple LLM this would ensure that the LLM chooses only the most probable next token conditional on the set of available tokens and their probabilities. Like setting temp.=0, this would make the response deterministic in a simpler LLM.

Figure 1: Temperature and Recall of GDP
[Line chart, percent, 1940q1 to 2030q1. Legend: Temp.=0, corr. w/actual: .7855; Temp.=1, corr. w/actual: .7907.]
Note: LLM estimates of GDP under different temperature parameters. Correlations are with actual, final print GDP. Covid period not plotted to keep scale readable.
Source: Authors' calculations, BEA

3.4.1 Digression: Nondeterminism at Temperature=0

Interestingly, the (within-quarter) standard deviations of the different temperature series are also very similar. In particular, for the temp.=1 series the average within-quarter standard deviation of the estimates is 0.786, while the average standard deviation for the temp.=0 series is 0.7616. While the temp.=0 series appears to have marginally less variability, the size of the effect is very small.

A lack of complete determinism with temp.=0 is understood to be a feature of the larger LLMs.6 But the near-identical results we see above raise questions as to whether the temperature parameter has any material impact at all, or whether our code base is setting it correctly. Table 1 shows that we can in fact document some effect of temperature. For this exercise we look at the raw text response of the LLM, before parsing and summarization. We fix a character length N (say, 50 characters) and compare the first N characters of two random responses. The comparison is done within quarters, so the prompts for the two responses are identical. We check whether the first N characters of the response are identical, and record an indicator variable that equals 1 for a match and 0 for a difference. Thus each pair of responses generates a single indicator variable, and we repeat the process many times.

Table 1 shows the results. When looking at the first 50 characters, with temperature set to zero 42 percent of response pairs are identical; this drops to 22 percent with temperature set to one. This amounts to a significant change in the variability of the responses, though there is obviously a great deal of variation in the zero temperature responses.

It appears that setting the temperature to zero for Sonnet 3.5 on Bedrock does indeed make the response string more deterministic as measured by increasing response similarity across identical queries. However, setting temperature to zero does not remove randomness by any means and makes very little difference for the substance of the response: the GDP estimate. Our results generally mirror those of Ouyang et al. (2025), who show significant non-determinism in OpenAI's GPT-3.5 and GPT-4 models even with temperature set to 0. We would caution users against assuming that temp.=0 ensures deterministic or even mostly deterministic results. Even with temp.=0, averaging across several queries still seems necessary to ensure that results are reproducible.
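The extraction and averaging steps described in Sections 3.2 and 3.3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the regex pattern, the function names `parse_estimate` and `mean_with_ci`, the sample replies, and the normal-approximation critical value of 1.96 are all hypothetical choices made for the example.

```python
import math
import re
import statistics

# Hypothetical pattern: case-insensitive, so "answer: 2.1" and "Answer: 2.1"
# both parse (the capitalization issue mentioned in Section 3.2).
ANSWER_RE = re.compile(r"answer\s*:\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)


def parse_estimate(summary):
    """Return the numeric point estimate in a summarizer reply, or None."""
    match = ANSWER_RE.search(summary)
    return float(match.group(1)) if match else None


def mean_with_ci(estimates, z=1.96):
    """Mean, standard error of the mean, and an approximate 95% interval.

    z=1.96 is an assumed normal critical value; with only a handful of
    repetitions a t critical value would give a wider interval.
    """
    n = len(estimates)
    mean = statistics.fmean(estimates)
    se = statistics.stdev(estimates) / math.sqrt(n)
    return mean, se, (mean - z * se, mean + z * se)


# Made-up summarizer replies for one quarter; the last one fails to parse.
replies = ["Answer: 2.0", "answer: 3.0", "ANSWER: 2.5", "no estimate given"]
values = [v for v in map(parse_estimate, replies) if v is not None]
mean, se, ci = mean_with_ci(values)
print(f"mean={mean:.2f} se={se:.3f} ci=({ci[0]:.2f}, {ci[1]:.2f})")
```

Replies that fail to parse are dropped here rather than retried; either choice is plausible, and the paper does not say which was used.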
6 The documentation for Claude mentions that "Note that even with temperature of 0.0, the results will not be fully deterministic." See also Ouyang et al. (2025).
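The prefix-matching exercise behind Table 1 can be sketched as below. This is an illustration under assumptions: the names `prefix_match` and `match_rate` are hypothetical, the demo responses are made up, and the sketch enumerates all within-quarter pairs (optionally subsampling them), whereas the text only says that pairs of random responses are compared many times.

```python
import itertools
import random


def prefix_match(a, b, n):
    """1 if the first n characters of two responses are identical, else 0."""
    return int(a[:n] == b[:n])


def match_rate(responses_by_quarter, n, pairs_per_quarter=None, seed=0):
    """Share of within-quarter response pairs whose first n characters match.

    Pairs are only formed within a quarter, so both responses in a pair come
    from an identical prompt. If pairs_per_quarter is given, that many pairs
    are sampled at random per quarter instead of using every pair.
    """
    rng = random.Random(seed)
    indicators = []
    for responses in responses_by_quarter.values():
        pairs = list(itertools.combinations(responses, 2))
        if pairs_per_quarter is not None and len(pairs) > pairs_per_quarter:
            pairs = rng.sample(pairs, pairs_per_quarter)
        indicators.extend(prefix_match(a, b, n) for a, b in pairs)
    return sum(indicators) / len(indicators)


# Made-up raw responses keyed by reference quarter.
demo = {
    "2019q1": [
        "Let me think step by step. GDP",
        "Let me think step by step. Real",
        "To estimate this",
    ],
    "2019q2": ["Let me think", "Let me think"],
}
print(match_rate(demo, 10), match_rate(demo, 30))
```

Responses shorter than N are compared in full, which is one of several reasonable conventions.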
Sequence Length   Temperature   Obs.   Mean   St. Dev.
50 chars.         0             3150   0.42   0.49
50 chars.         1             3150   0.22   0.42
100 chars.        0             3150   0.37   0.48
100 chars.        1             3150   0.14   0.34
200 chars.        0             3150   0.27   0.44
200 chars.        1             3150   0.03   0.18
Table 1: Fraction of responses identical at various sequence lengths

4 Testing LLM Recall

In this section we test how well LLMs recall important macroeconomic statistics. The prompt (shown in Figure 19) asks the LLM to use all information available to it (i.e., the LLM is not instructed to behave as a real time forecaster). We ask the LLM for estimates through 2027, which it provides even though its knowledge cutoff is in 2024. Examining the LLM responses for these post-cutoff quarters shows that it decides to provide a forecast. The most recent actual data available as of this writing is for 2025Q1.

Figure 2 shows the results for CPI inflation and the unemployment rate. In each panel the blue line is the true, fully-revised series. The red line is the average estimate returned by the LLM, and the pink band is the 95 percent confidence interval based on the variability of the 10 iterations of each query. It is evident that the LLM generally recalls something very close to truth for both series. The only major visible gaps appear for pre-1990 CPI inflation, where the LLM seems to be biased up when inflation is low. In addition, the confidence bands are tight, indicating little variability in the LLM responses.

Figure 3 shows the same exercise for real GDP growth and industrial production growth. Here, the story is quite different. The LLM consistently misses the high-frequency swings in these series, though it does track many business cycle movements. Note that the year 2020 is not plotted since the pandemic real activity swings would dwarf the rest of the variation.

It is easier to see the dynamics in Figures 4 and 5, which focus on the 1990-2019 period. During this period, CPI inflation and the unemployment rate are recalled precisely. On the other hand, for GDP and IP the LLM misses many of the quarterly swings. The LLM tracks GDP growth throughout business cycles well and appears to become more accurate towards the end of the sample. LLM performance on IP growth is not as good; it picks up almost none of the quarterly variation and is consistently biased downward pre-2000. In addition, the confidence intervals show considerable variation in the LLM estimates.

Figure 2: LLM Recall of CPI and Unemployment
[Two panels (CPI, Unemployment), percent, 1940q1 to 2030q1. Series: fully revised data, Sonnet 3.5 estimate, confidence interval.]
Note: LLM estimates of quarterly variables. 95% confidence intervals based on 10 repetitions of the same query. Data go through 2025Q1, LLM estimates through 2027Q1.
Source: BLS, authors' calculations

Figure 3: LLM Recall of GDP and IP
[Two panels (GDP, IP), percent, 1940q1 to 2030q1. Series: fully revised data, Sonnet 3.5 estimate, confidence interval.]
Note: LLM estimates of quarterly variables. 95% confidence intervals based on 10 repetitions of the same query. Covid period not plotted to keep scale readable. Data go through 2025Q1, LLM estimates through 2027Q1.
Source: BEA, Federal Reserve Board, authors' calculations

Figure 4: Pre-Pandemic Recent History: CPI and Unemployment
[Two panels (CPI, Unemployment), percent, 1990q1 to 2020q1. Series: fully revised data, Sonnet 3.5 estimate, confidence interval.]
Note: LLM estimates of quarterly variables. 95% confidence intervals based on 10 repetitions of the same query.
Source: BLS, authors' calculations

Figure 5: Pre-Pandemic Recent History: GDP and IP
[Two panels (GDP, IP), percent, 1990q1 to 2020q1. Series: fully revised data, Sonnet 3.5 estimate, confidence interval.]
Note: LLM estimates of quarterly variables. 95% confidence intervals based on 10 repetitions of the same query.
Source: BEA, Federal Reserve Board, authors' calculations

Figures 6 and 7 focus on the post-2021 period. The dashed vertical line is the knowledge cutoff for Sonnet 3.5, the date of the last training data for the model.7 Note that Sonnet continues to provide economic estimates well after its knowledge cutoff. These estimates follow a fairly smooth trend jumping off of the knowledge cutoff and of course do not anticipate the low 2025Q1 GDP reading or the strong 2025Q1 IP reading. It appears that accuracy falls off somewhat as the knowledge cutoff approaches; in particular, for each variable post-2023 accuracy seems noticeably worse than accuracy before that year. This is shown more clearly in Figure 8, which plots rolling six-quarter trailing root mean squared errors for each variable, normalized to unity at the 2024Q2 knowledge cutoff date. The error in each LLM estimate climbs more or less steadily for the year leading up to the knowledge cutoff, suggesting that the LLM has less precise information about the period just before the cutoff. Though the sample sizes are small, the magnitude of the change in RMSEs is notable: for most variables the errors roughly double in size leading up to the cutoff. It is possible that the training data become more sparse in the months just before the knowledge cutoff, as there has been less time to collect data. In addition, while statistical press releases and news articles will always mention indicators as soon as they are available, books and academic papers discussing the economic situation will only appear months or years after the fact, constricting the amount of relevant training data.

Figure 6: Post-2021 CPI and Unemployment
[Two panels (CPI, Unemployment), percent, 2021q3 to 2027q3. Series: fully revised data, Sonnet 3.5 estimate, confidence interval.]
Note: LLM estimates of quarterly variables. 95% confidence intervals based on 10 repetitions of the same query. Vertical line shows Sonnet 3.5's knowledge cutoff (April 2024). Data go through 2025Q1, LLM estimates through 2027Q1.
Source: BLS, authors' calculations

Figure 7: Post-2021 GDP and IP
[Two panels (GDP, IP), percent, 2021q1 to 2027q1. Series: fully revised data, Sonnet 3.5 estimate, confidence interval.]
Note: LLM estimates of quarterly variables. 95% confidence intervals based on 10 repetitions of the same query. Vertical line shows Sonnet 3.5's knowledge cutoff (April 2024). Data go through 2025Q1, LLM estimates through 2027Q1.
Source: BEA, Federal Reserve Board, authors' calculations

Table 2 collects statistics for estimation error by decade.
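As a rough guide to how the statistics in Table 2 and Figure 8 could be computed, the sketch below builds per-decade bias and RMSE and a trailing six-quarter RMSE normalized at a chosen base quarter. It rests on assumptions: the error is taken to be actual minus estimate (inferred from the way the text describes the sign of the bias, not stated explicitly), all function names are hypothetical, and the error values are made up.

```python
import math


def rmse(errors):
    """Root mean squared error of a list of errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def by_decade(years, errors):
    """Map each decade to (average error, RMSE) of its observations."""
    groups = {}
    for year, err in zip(years, errors):
        groups.setdefault(year // 10 * 10, []).append(err)
    return {d: (sum(v) / len(v), rmse(v)) for d, v in sorted(groups.items())}


def rolling_rmse(errors, window=6):
    """Trailing-window RMSE; None until a full window is available."""
    out = []
    for i in range(len(errors)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(rmse(errors[i + 1 - window : i + 1]))
    return out


def normalize(series, base_index):
    """Rescale a series so that the value at base_index equals one."""
    base = series[base_index]
    return [None if v is None else v / base for v in series]


# Made-up errors (assumed convention: actual minus LLM estimate).
errors = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 2.0, -2.0]
rolled = rolling_rmse(errors, window=6)
relative = normalize(rolled, base_index=len(rolled) - 1)
print(relative)
print(by_decade([1989, 1990, 1991], [1.0, 2.0, -2.0]))
```

In the toy series the errors double in the last two quarters, so the normalized rolling RMSE rises toward one at the base quarter, the same qualitative shape the text describes for Figure 8.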
We include both the average estimation error (the bias) and the root mean squared error for each variable. The bias in CPI and unemployment is generally small, though the LLM estimate for CPI is often 0.5-1 percentage points too high prior to the 1990s. The LLM estimate for real GDP growth has been about 0.3 percentage points too high since the 1980s. The estimates for IP show large biases, shifting from being consistently too low before 2000 to somewhat too high thereafter.

7 This is April 2024. It is possible for the model to obtain some post-cutoff information, either through inadvertent mixing of more recent data into the training set or the implicit biases coming from the second stage "post-training" with humans who know of events after the cutoff.

          Average Error                     RMSE
Decade    GDP     CPI     Unemp.  IP        GDP    CPI    Unemp.  IP
1940s     -2.66   -0.05   -0.28    0.65     4.37   1.41   0.67    14.66
1950s     -0.53   -0.66   -0.08    3.95     3.68   1.46   0.36    13.45
1960s     -0.45   -0.28    0.00    3.47     3.12   0.38   0.11     7.41
1970s      0.01   -0.98   -0.12    2.02     3.54   1.19   0.27     7.59
1980s     -0.32   -0.59   -0.03    0.75     2.36   0.91   0.12     5.25
1990s     -0.38   -0.01    0.01    2.33     1.12   0.23   0.08     3.74
2000s     -0.37   -0.07   -0.04   -0.51     1.28   0.30   0.10     2.77
2010s     -0.24   -0.06    0.03   -0.82     1.07   0.15   0.07     2.60
Table 2: Estimation Errors by Decade

Figure 8: Root Mean Squared Errors
[Line chart of rolling RMSE relative to 2024q2 for CPI, Unemployment, GDP, and IP, 2022q1 to 2024q1.]
Note: Rolling 6-quarter RMSEs of the LLM estimates, normalized to unity in 2024Q2.
Source: BEA, BLS, Federal Reserve Board, authors' calculations

Turning to the RMSEs, we see that estimation errors are markedly higher in the early periods than in the late periods. It is tempting to attribute this to a relative lack of training data in the pre-internet era, but we need to be cautious. An alternative (but not entirely distinct) interpretation is that the LLM's estimation process is stable but underlying economic volatility was also higher in the pre-Great Moderation period, so the errors could simply reflect the fact that the series have more "noise". For example, if the LLM's estimate is approximately an N quarter moving average we would expect larger errors in more volatile periods.

4.1 Real Time Data

Economic time series often revise several times after their initial release, reflecting additional data, seasonal adjustment, and methodology changes. Fully revised data are the best retrospective estimates of what happened historically. However, data revisions rarely make much imprint in the popular press and are usually only of interest to analysts. The initial data releases garner much more interest, so it is possible LLMs will have more accurate beliefs about the initial release. In this section we focus on real GDP growth and evaluate the relationship between the initial release, fully revised data, and LLM estimates of both. We use the Philadelphia Fed's Real-Time Data Set for historical initial release values.

We modify the prompt slightly (shown in Figure 20) to explicitly ask for the first print value while continuing to instruct the LLM that it can use all of its information set. As before, we run the prompt 10 times for each quarter and average the results.
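To make the first print versus fully revised distinction concrete, the sketch below pulls both values out of a small real-time table. Everything here is an assumption made for illustration: the dictionary layout (reference quarter mapped to the values published in successive vintages, with None before the first release), the function names, and the numbers, which are made up and are not taken from the Philadelphia Fed data set.

```python
def first_print(vintage_values):
    """Earliest non-missing value for one reference quarter, in vintage order."""
    for vintage in sorted(vintage_values):
        value = vintage_values[vintage]
        if value is not None:
            return value
    return None


def latest_print(vintage_values):
    """Most recent non-missing value, standing in for the fully revised figure."""
    for vintage in sorted(vintage_values, reverse=True):
        value = vintage_values[vintage]
        if value is not None:
            return value
    return None


# Hypothetical real-time table: {reference quarter: {vintage: growth rate}}.
# Quarters and vintages are (year, quarter) tuples so that they sort correctly.
real_time = {
    (2018, 4): {(2019, 1): 2.6, (2019, 2): 2.2, (2019, 3): 1.1},
    (2019, 1): {(2019, 1): None, (2019, 2): 3.2, (2019, 3): 3.1},
}

first = {q: first_print(v) for q, v in real_time.items()}
latest = {q: latest_print(v) for q, v in real_time.items()}
revision = {q: round(latest[q] - first[q], 2) for q in real_time}
print(first, latest, revision)
```

The gap between the two series, `revision` in the sketch, is what allows first print and fully revised values to be distinguished in the comparisons that follow.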
While the prompt refers to the first print of GDP, reading the LLM's reasoning makes clear that it is at least partially aware that what was published prior to 1991 was Gross National Product (GNP), and there have been other revisions since.

For reference, Figure 9 shows both published fully revised GDP (i.e. the same series in the earlier figures) and the first print value. While the series are extremely highly correlated, the first print does diverge noticeably at times. Figure 10 shows the same comparison for the LLM estimates: first print vs. fully revised. Turning to the estimation errors, Table 3 shows the average errors and RMSEs for first print and fully revised GDP. To be clear, columns 1 and 3 compare published fully revised GDP to the LLM estimate of fully revised GDP, and columns 2 and 4 compare published first print GDP to the LLM estimate of first print GDP.

Figure 9: Published Data: Comparison of fully revised and first print real GDP growth
[Line chart, percent, 1960q1 to 2030q1. Series: Fully Revised, First Print.]
Note: First print and fully revised GDP growth
Source: BEA, Philadelphia Fed, authors' calculations

Figure 10: LLM Estimates: Comparison of fully revised and first print real GDP growth
[Line chart, percent, 1960q1 to 2030q1. Series: LLM Estimate: Fully Revised, LLM Estimate: First Print.]
Note: LLM estimates of first print and fully revised GDP growth
Source: Authors' calculations

          Average Error                   RMSE
Decade    Fully Revised   First Print     Fully Revised   First Print
1940s     -2.66           -               4.37            -
1950s     -0.53           -               3.68            -
1960s     -0.45           -0.86           3.12            1.62
1970s      0.01           -0.65           3.54            3.11
1980s     -0.32           -0.99           2.36            2.55
1990s     -0.38           -0.73           1.12            1.36
2000s     -0.37           -0.05           1.28            1.09
2010s     -0.24           -0.32           1.07            0.55
Table 3: Summary of Estimates: GDP

The average errors do not show a clear pattern. For the RMSEs, however, first print GDP seems to be estimated more accurately for most decades, markedly so in the 2010s. It is possible that the availability of online news and analysis since 2000, which might focus on first prints, has tipped the balance of training data towards the first print.

One question of interest is whether LLM estimates for fully revised data and first print data are blending together information from actual first prints with later revisions. In other words, is the LLM estimate mixing the first print and fully revised values even though we specify that the estimate should be fully revised? Table 4 shows regressions of the LLM estimate of fully revised GDP on the published first print and fully revised values. The sample period is 1980-2019. Starting from a specification with only fully revised GDP (column 2), adding first print GDP (column 3) raises the R2 of the regression about 3.5 percentage points. Column 3 shows that both versions of published GDP are highly statistically significant predictors of the LLM estimate and the coefficients are similar in magnitude. Put differently, the gap between fully revised GDP and the LLM estimate is correlated with first print published GDP. Column 4 forces the regression to predict the LLM estimate using a convex combination of the published numbers: we remove the constant and constrain the coefficients to add up to unity. This weighted average predictor puts similar weight on the two series (though somewhat more on the fully revised series), once again making the point that the LLM estimate seems to be a mixture of the two.

                              (1)        (2)        (3)        (4)
Published first print GDP     0.830***              0.336***   0.408***
                              (0.072)               (0.075)    (0.066)
Published fully revised GDP              0.753***   0.520***   0.592***
                                         (0.054)    (0.071)    (0.066)
Constant                      1.060***   0.995***   0.824***
                              (0.239)    (0.192)    (0.194)
RMSE                          1.538      1.358      1.280      1.399
Adjusted R2                   0.628      0.708      0.742      .
Table 4: Dependent variable: LLM estimate of fully revised GDP

                              (1)        (2)        (3)        (4)
Published first print GDP     0.696***              0.446***   0.656***
                              (0.065)               (0.084)    (0.068)
Published fully revised GDP              0.573***   0.263***   0.344***
                                         (0.052)    (0.072)    (0.068)
Constant                      1.242***   1.348***   1.122***
                              (0.212)    (0.184)    (0.196)
RMSE                          1.298      1.373      1.227      1.465
Adjusted R2                   0.625      0.578      0.665      .
Table 5: Dependent variable: LLM estimate of first print GDP

Table 5 repeats the exercise, but uses the LLM estimate of first print GDP as the dependent variable. The pattern is largely the same: both versions of published GDP help explain the LLM estimate, and the "wrong" published series, fully revised GDP, raises the R2 of a regression with the "right" published series as a predictor. Both sets of results suggest
the LLM is estimating historical values by—in part—smoothing across data vintages; mixing together various versions of the data that are in its training data. This is not especially surprising. An LLM with imperfect recall would naturally look to both fully revised and first print information when forming an estimate (just as a human might.) Further, LLMs are trained on enormous quantities of sometimes messy data. Even if the LLM was able to interpret and “understand” each segment of text, not all text would include clear date stamps that would signal whether the discussion of GDP was from the days after the first printorsometimelater. Themixtureoffirst-printdatawithlaterrevisionssuggeststhatanLLMinstructedtoact asarealtimeforecastermayknow“toomuch”. Whereasanactualrealtimeforecasterwill only have access to the first-print values of the most recent GDP estimates, the LLM will (inadvertently)beworkingwithaGDPestimatethatincorporatesfuturerevisions,perhaps leadingtoforecaststhatdependonthisdataleakage. Symmetrically,theseresultsshowthat anLLMaskedtoactasretrospectiveanalystwillnotonlyhaveerrorsinhistoricalrecall,but thoseerrorsarepartiallyattributabletorecallingfirstprintratherthanfullyrevisedvalues. 4.2 TestforSmoothingWithinVintages The smoothing across vintages highlighted in the previous section raises issues for exercises using LLMs as real-time forecasters. A distinct form of contamination can come from smoothing data within a fixed vintage. A striking feature of Figure 5 is how much less volatile the LLM estimate is as compared to the published real activity series. This pattern is potentially consistent with the LLM returning estimates that are smoothed across time within a vintage. Abstracting from data revisions, the LLM may estimate each variable by an approximate moving average. 
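To make the distinction between backward-looking and two-sided smoothing concrete, the sketch below applies both kinds of moving average to a simulated series. Everything in it is an illustrative assumption: the series is i.i.d. noise rather than actual GDP growth, the window lengths are arbitrary, and the function names are hypothetical. It is not the procedure behind any estimate reported here.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 400
# Hypothetical "published" series: i.i.d. growth rates (an assumption, not real data).
y = rng.normal(loc=2.0, scale=3.0, size=T)

def one_sided_ma(x, k=2):
    """Average of x[t-k], ..., x[t]: uses only current and past values."""
    out = np.full(x.shape, np.nan)
    for t in range(k, len(x)):
        out[t] = x[t - k : t + 1].mean()
    return out

def two_sided_ma(x, k=1):
    """Average of x[t-k], ..., x[t+k]: also uses future values."""
    out = np.full(x.shape, np.nan)
    for t in range(k, len(x) - k):
        out[t] = x[t - k : t + k + 1].mean()
    return out

def corr_with_next(est, x):
    """Correlation between the period-t estimate and the period t+1 value."""
    ok = ~np.isnan(est[:-1])
    return np.corrcoef(est[:-1][ok], x[1:][ok])[0, 1]

backward = one_sided_ma(y)
centered = two_sided_ma(y)

print("std of published series:", y.std())
print("std of one-sided MA:    ", np.nanstd(backward))
print("std of two-sided MA:    ", np.nanstd(centered))
print("corr(one-sided estimate at t, y at t+1):", corr_with_next(backward, y))
print("corr(two-sided estimate at t, y at t+1):", corr_with_next(centered, y))
```

Both averages are less volatile than the underlying series, so low volatility alone cannot tell them apart. Only the two-sided average is correlated with next period's value, which is the property that matters for real-time exercises.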
If the moving average is two-sided, this behavior would be problematic for real-time forecasting exercises, since the LLM’s beliefs about “current” conditions would incorporate information from the future.

To evaluate this possibility, we use a slightly different prompt, shown in Figure 21. This prompt again asks for the first print value, but specifies that the LLM should not use information after the reference date. In particular, we explicitly instruct the LLM not to use future values of the variable (or any other variable) in making an estimate. If the LLM is able to follow these instructions the estimates should be independent of future shocks to the series.

Let $y_t$ be the first print value of a variable for reference period $t$, and let $\hat{y}_t$ be an estimate based on (possibly incomplete) information available for reference periods $r \le t$. The estimation error is

$$\varepsilon_t = y_t - \hat{y}_t. \tag{1}$$

Even if $\hat{y}_t$ is based on incomplete information, $\varepsilon_t$ ought to be orthogonal to true shocks which occur in periods $t+1$ and later. We can approximate such shocks by using the SPF expectations for $y_{t+1}$ as of period $t$, $\mathrm{SPF}^{t}_{t+1}$. The quantity

$$\omega_{t+1} = y_{t+1} - \mathrm{SPF}^{t}_{t+1} \tag{2}$$

will be the unforecastable period $t+1$ shock to $y$, to the extent that the SPF forecast is efficient.[8] Then our test of whether the LLM is smoothing using future data is simply a test of whether $\varepsilon_t$ is independent of $\omega_{t+1}$.

[8] Coibion and Gorodnichenko (2012) and others show that SPF median expectations are not necessarily rational and efficient forecasts. Nonetheless, we believe the SPF is a good approximation for these purposes.

Table 6 shows the results for a simple test of this condition, regressing $\varepsilon_t$ on $\omega_{t+1}$. There is no statistically significant relationship, so this specification offers no evidence of the LLM smoothing its estimates. Table 7 shows another specification, which controls for period $t$ GDP and its lags (all first prints). These variables may help explain $\varepsilon_t$, particularly if the LLM is smoothing using lagged values. But if the LLM is not smoothing, the same orthogonality condition between $\varepsilon_t$ and $\omega_{t+1}$ should hold. We additionally control for $\mathrm{SPF}^{t}_{t+1}$, the SPF median expectation for $t+1$ GDP growth as of quarter $t$. This is the expected component of $t+1$ GDP growth, while $\omega_{t+1}$ is the unexpected component. We see in columns 2 and 3 that there is a statistically significant relationship between the shock to $t+1$ GDP growth and the LLM’s estimate of period $t$ GDP growth. In addition, the coefficients on $\omega_{t+1}$ and $\mathrm{SPF}^{t}_{t+1}$ are, as expected, negative: holding $GDP_t$ constant, stronger future GDP growth (whether expected or unexpected) leads to a stronger LLM estimate and makes $\varepsilon_t$ more negative. The relation between $\omega_{t+1}$ and $\varepsilon_t$ appears to be economically significant too. The bottom line of the table shows the RMSE of the regressions when $\omega_{t+1}$ is dropped; this leads to a 21 percent and 10 percent increase in the RMSEs in columns 2 and 3 respectively.

We take this as preliminary evidence of smoothing, though it is not decisive. If LLMs were predominantly smoothing the true data to form estimates we would presumably see a strong association between $\varepsilon_t$ and $\omega_{t+1}$ even in the absence of controls. In addition, it is important to emphasize that we rely on the SPF estimates to capture all period $t$ information relevant for forecasting $GDP_{t+1}$. While it is known that this is not literally true (Coibion and Gorodnichenko (2012) and others document deviations from efficiency and rationality), we are comfortable with it as a baseline. To understand why, it is helpful to contrast our approach with one that focuses only on forecasts. Imagine that one evaluated LLM one-quarter-ahead real-time forecasts and found they had smaller errors than SPF forecasts. This would not be strong evidence of look-ahead bias, since it is possible that the LLM is able to synthesize relevant information (while following the information constraints) better than the SPF. Put differently, it is understood that SPF forecasts are not fully efficient, so better performance by an alternative, which in some ways has far more data than any SPF participant, is not clear evidence of data leakage. In contrast, our approach is to show that the error in the LLM’s recall of $GDP_t$ is correlated with the SPF forecast error for $GDP_{t+1}$.
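The logic of the orthogonality test can be illustrated on simulated data. The sketch below is a minimal example under strong assumptions that are ours, not the paper's: the data generating process, the parameter values, the efficient "SPF" forecast, and the `leak_weight` parameter (the weight a hypothetical recall rule puts on next period's value) are all invented for illustration. The regression is a simplified version of the controlled specification, with $GDP_t$ and $\mathrm{SPF}^{t}_{t+1}$ as the only controls.

```python
import numpy as np

def ols(y, X):
    """OLS coefficients and conventional (homoskedastic) standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

def simulate(leak_weight, T=2000, seed=0):
    """Simulate eps_t, omega_{t+1}, SPF^t_{t+1}, and y_t (all hypothetical)."""
    rng = np.random.default_rng(seed)
    mu, rho = 2.5, 0.3
    news = rng.normal(0.0, 1.0, T)   # information about t+1 known at t
    shock = rng.normal(0.0, 2.0, T)  # unforecastable component
    y = np.empty(T)
    y[0] = mu
    for t in range(T - 1):
        y[t + 1] = mu + rho * (y[t] - mu) + news[t] + shock[t + 1]
    spf = mu + rho * (y[:-1] - mu) + news[:-1]  # efficient forecast of y_{t+1}
    omega = y[1:] - spf                          # equation (2)
    noise = rng.normal(0.0, 1.0, T - 1)          # imperfect recall
    # Recall rule: a blend of the true y_t and next period's y_{t+1}.
    y_hat = (1.0 - leak_weight) * y[:-1] + leak_weight * y[1:] + noise
    eps = y[:-1] - y_hat                         # equation (1)
    return eps, omega, spf, y[:-1]

def omega_coefficient(leak_weight):
    """Coefficient and standard error on omega_{t+1} in the controlled regression."""
    eps, omega, spf, y_t = simulate(leak_weight)
    X = np.column_stack([np.ones_like(omega), omega, spf, y_t])
    beta, se = ols(eps, X)
    return beta[1], se[1]

for w in (0.0, 0.25):
    coef, se = omega_coefficient(w)
    print(f"leak weight {w:.2f}: coefficient on omega = {coef:+.3f} (se {se:.3f})")
```

In this setup a recall rule with no look-ahead leaves the coefficient on $\omega_{t+1}$ near zero, while a rule that blends in $y_{t+1}$ produces a negative coefficient of roughly the size of the leak weight, matching the sign pattern discussed above.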
If the SPF is reasonably close to efficient then we’ve shown that the LLM is using the unanticipated shock to $GDP_{t+1}$ to estimate $GDP_t$, a clear case of look-ahead bias. On the other hand, if the SPF is not efficient and the LLM has a better forecasting methodology, then the LLM may be able to anticipate $\omega_{t+1}$ while respecting the constraint not to use information from beyond $t$. But the regression shows $\omega_{t+1}$ is predictably related to the LLM’s recall errors, and a good forecast would eliminate errors that are correlated with the information set. What is implausible (but admittedly not impossible) is that the LLM has insight into forecasting beyond what the SPF is capable of yet still makes predictable recall errors which could be solved by making use of that information. This mismatch is most easily explained by look-ahead bias.

| | 1960-2024 (1) | 1960-1989 (2) | 1990-2024, ex. 2020 (3) |
|---|---|---|---|
| $\omega_{t+1}$ | −0.035 (0.033) | −0.109 (0.107) | −0.030 (0.054) |
| Constant | −0.580∗∗∗ (0.133) | −0.956∗∗∗ (0.312) | −0.365∗∗∗ (0.094) |
| Adjusted R2 | 0.000 | 0.003 | −0.005 |
| RMSE | 1.951 | 2.812 | 1.069 |

Table 6: Tests for Smoothing. Dependent Variable: $\varepsilon_t$

| | 1960-2024 (1) | 1960-1989 (2) | 1990-2024, ex. 2020 (3) |
|---|---|---|---|
| $\omega_{t+1}$ | −0.015 (0.053) | −0.268∗∗∗ (0.043) | −0.164∗∗∗ (0.039) |
| $\mathrm{SPF}^{t}_{t+1}$ | 0.092 (0.167) | −0.297∗∗∗ (0.073) | −0.581∗∗∗ (0.086) |
| $GDP_t$ | 0.213∗∗ (0.094) | 0.768∗∗∗ (0.052) | 0.515∗∗∗ (0.044) |
| $GDP_{t-1}$ | 0.043 (0.061) | −0.041 (0.054) | −0.146∗∗∗ (0.038) |
| $GDP_{t-2}$ | −0.018 (0.038) | −0.072∗ (0.037) | −0.009 (0.017) |
| Constant | −1.395∗∗∗ (0.317) | −1.761∗∗∗ (0.209) | 0.204 (0.214) |
| Adjusted R2 | 0.254 | 0.816 | 0.648 |
| RMSE | 1.692 | 1.209 | 0.635 |
| alt RMSE w/o $\omega_{t+1}$ | 1.682 | 1.471 | 0.699 |

Table 7: Tests for Smoothing. Dependent Variable: $\varepsilon_t$

5 Forecasting with LLMs

In this section we examine the forecasting performance of LLMs. We follow a methodology similar to Faria-e-Castro and Leibovici (2023), Lopez-Lira et al. (2025), and Hansen et al. (2024): ask the LLM to pretend to be a forecaster at date $t$, and make a forecast using only information available as of that date. In particular, we ask for 1-quarter-ahead forecasts and ask the LLM to use information available as of the 15th day of the second month of the quarter. Thus, the forecasts for 2024Q2 are made with the information in hand as of February 15, 2024. This is meant to make the results comparable to the SPF, which is fielded in the second month of each quarter.

We compare the forecasts to the realizations and calculate root mean square errors.[9] Figure 11 shows the results. For comparison, we also show the RMSEs for the full information LLM estimate (i.e. the prompt which instructs the LLM to use all available information) and the SPF median. The LLM forecast RMSEs are always higher than the full information values, suggesting that the LLM is attempting to follow the prompt and not use future information in its estimate. Interestingly, the RMSEs are generally similar: asking the LLM to ignore all knowledge of the reference quarter and subsequent history only produces a modest reduction in accuracy. This is perhaps puzzling: we might expect that having the LLM ignore all information from date $t$ onward, including the realization of the variable, would significantly reduce accuracy. Note also that the LLM forecasts are comparable to the accuracy of the SPF and often somewhat better; if the RMSEs are valid then LLMs could be an invaluable tool for forecasting. However, the evidence of look-ahead bias in the previous sections suggests that we should not go that far: the RMSEs may be a function of the LLM drawing on data that postdate $t$ but are still in the training set.

[9] We use the published values provided by the SPF. For CPI, unemployment, and industrial production these values are those available at the middle of the following quarter, and thus may be second print values. For GDP the value used is the first print.

[Figure 11: Root Mean Square Errors. Bar chart of RMSEs for CPI, GDP, IP, and Unemp.; series: LLM: Full info, LLM: Forecast, SPF Median Forecast. Note: Root mean square errors for 1995Q1-2019Q4. Source: Survey of Professional Forecasters, authors’ calculations.]

6 Recall of Release Dates

In this section we focus on the date that data were released, rather than the value of the data release. Data release dates are another useful way to assess the LLM’s historical knowledge. In real time, a forecaster’s information set is governed by the release dates of the relevant series. If an LLM can accurately recall the release dates of important releases it may be able to simulate a real time forecaster. If, on the other hand, the LLM has incorrect beliefs about data release dates, any attempt to simulate a real time forecaster will be problematic.

Macroeconomic data release dates are a good way to evaluate look-ahead bias for several reasons. First, the data releases are important for forecasters and widely watched. Second, they are regular and can be pinpointed to a particular day, unlike other news stories which might circulate informally before breaking in major publications. Third, the release dates are not completely regular: the exact day of release depends on holidays and other factors.
This means that to answer correctly the LLM has to know more than a simple rule: it has to recall the actual date of the release.

Our approach is as follows. For each day $t$, we ask the LLM to pretend to be an analyst living at 5pm on day $t$. Taking CPI as an example, we ask the LLM to give us the reference period of the most recent CPI release available at that time. We repeat this ten times for each data release and each day. For this section, we focus only on monthly indicators (i.e., not GDP). To build a more complete picture we expand the set of series beyond CPI, IP and unemployment. We draw additional indicators mostly from the set of Principal Federal Economic Indicators (PEI); the PEI series are designated by the Office of Management and Budget and subject to rules about the release of data. In general these series are widely watched by forecasters, widely reported on, and many move markets. We do the exercise above for each indicator and for a 60 week period beginning on January 1, 2014. Example prompts are found in Section A.1. We use ALFRED[10] to get release dates for each series.

[10] https://alfred.stlouisfed.org/

Figures 12 and 13 present the daily fraction of queries that suffer from look-ahead bias: the LLM states that data has been released which in fact will only be released in the future. The green line plots this fraction, and the red dots mark actual release dates. Several things stand out. First, significant look-ahead bias is fairly rare; most series only have a handful of days where more than half of the responses indicate look-ahead bias. Second, as we might expect, look-ahead bias tends to occur in the days just before the actual data release. In other words, the LLM clearly knows approximately when the data release is supposed to happen, but sometimes misses by a couple of days. This is consistent with the LLM having only fuzzy recall of the exact dates. Third, there is significant variation across series. For example, the unemployment rate suffers very little look-ahead bias, while CPI has more.

[Figure 12: LLM Recall of release dates. Panels: CPI, Unemployment Rate, Industrial Production, Construction Spending, Retail Sales, Producer Price Index, Housing Starts, Housing Sales. x-axis: Date (01jan2014 to 01apr2015); y-axis: Fraction of responses; series: Fraction w/lookahead bias, Release dates. Note: Prompt asks LLM to state the reference date of the most recent data release as of a given day. Green line shows the fraction of responses that cited a reference date that has not been released yet. Red dots are the true data release dates. Source: ALFRED, authors’ calculations.]

[Figure 13: LLM Recall of release dates. Panels: Wholesale Trade, International Trade, Durable Goods Orders, JOLTS, Personal Income. x-axis: Date (01jan2014 to 01apr2015); y-axis: Fraction of responses; series: Fraction w/lookahead bias, Release dates. Note: Prompt asks LLM to state the reference date of the most recent data release as of a given day. Green line shows the fraction of responses that cited a reference date that has not been released yet. Red dots are the true data release dates. Source: ALFRED, authors’ calculations.]

The mistakes the LLM makes appear generally sensible. For example, in early January 2015 the LLM confidently says that the December 2014 unemployment rate (i.e. the employment situation report) had been released. It turns out that the December 2014 employment situation was released unusually late, on January 9th 2015. The BLS typically releases the employment report on the third Friday after the conclusion of the reference week (the week containing the 12th), which is generally in the first seven days of the release month. So in January 2015 the LLM appears to have expected the data release on the 2nd, which would have been more standard.

To summarize the exercises more compactly, we develop a metric to measure look-ahead bias across series. For each series and each day, we flag the day as problematic if more than half the queries suffer from look-ahead bias (i.e., the LLM cites a future data release as current). This is a fairly conservative criterion, as we might ask an LLM to never cite future data instead of lowering the bar to only half the time. Then, we count the number of days in the sample that had any problematic series among the 13 we consider. Again, this is fairly conservative as we have restricted ourselves to prominent, well-reported series. It turns out that 20.2 percent of days have at least one problematic series. This high number is the product of each series having a reasonably low proportion of problematic days (less than 7 percent for CPI, and lower for all others), but those days are different for each series.

A 20.2 percent error rate should give us pause. While the results for any single series are impressive, an LLM pretending to be a real time forecaster would make frequent mistakes. From the perspective of historical analysis, an LLM may not reliably recall the details of real time data flow during historical episodes, limiting the reliability of historical analysis.

To further examine LLM performance we renormalize the data, averaging performance in a window around data releases and plotting performance measures for 15 days on either side of data releases. Figures 14 and 15 plot the results. The x-axis counts days before and after a data release. The red line shows the fraction of LLM responses for a particular (relative) date that is correct. The green line shows the fraction suffering from look-ahead bias: the response cites data that have not been released yet. The blue line shows the fraction making the opposite error: the LLM cites an old data release when a newer one is available. Note that the lines all add up to one.
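The bookkeeping behind these measures can be sketched in a few lines. The code below is a hedged illustration rather than the code used for the figures: the function names and data layout are hypothetical, the reference-month labels follow the `2014M12` format requested in the prompts of Section A.1, and the small release calendar is illustrative (only the January 9, 2015 date is taken from the text).

```python
from collections import Counter
from datetime import date

def month_key(ref):
    """Turn a label such as '2014M12' into a sortable (year, month) tuple."""
    year, month = ref.split("M")
    return int(year), int(month)

def latest_released(release_calendar, as_of):
    """Newest reference month whose release date is on or before `as_of`."""
    available = [ref for ref, d in release_calendar.items() if d <= as_of]
    return max(available, key=month_key) if available else None

def classify(response, truth):
    """Label one LLM response relative to the true latest reference month."""
    if month_key(response) == month_key(truth):
        return "correct"
    if month_key(response) > month_key(truth):
        return "lookahead"
    return "stale"

def daily_shares(responses, truth):
    """Fractions of correct, look-ahead, and stale responses for one day."""
    counts = Counter(classify(r, truth) for r in responses)
    return {k: counts[k] / len(responses) for k in ("correct", "lookahead", "stale")}

def is_problematic(responses, truth):
    """Flag a series-day if more than half the responses show look-ahead bias."""
    return daily_shares(responses, truth)["lookahead"] > 0.5

def share_of_days_with_any_problem(flags_by_series):
    """Fraction of days on which at least one series is flagged.

    `flags_by_series` maps each series to a list of booleans, one per day,
    with all lists covering the same days in the same order.
    """
    n_days = len(next(iter(flags_by_series.values())))
    bad_days = sum(any(day) for day in zip(*flags_by_series.values()))
    return bad_days / n_days

# Illustrative calendar for the employment report (assumed layout).
calendar = {"2014M11": date(2014, 12, 5), "2014M12": date(2015, 1, 9)}
truth = latest_released(calendar, date(2015, 1, 5))
responses = ["2014M12"] * 7 + ["2014M11"] * 3  # ten hypothetical LLM answers
shares = daily_shares(responses, truth)
flags = {
    "series_a": [True, False, False, False],
    "series_b": [False, False, True, False],
}
print(truth, shares, is_problematic(responses, truth))
print(share_of_days_with_any_problem(flags))
```

The toy `flags` example shows how the aggregate rate can exceed any single series' rate: each hypothetical series is flagged on one day in four, but because the flagged days differ, half of all days have at least one problematic series.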
Many of the patterns from the other charts are apparent here: look-ahead bias peaks in the days just before a data release, and there is often very different behavior across series. Unemployment, in particular, is recalled very accurately. This may be because unemployment (and the Employment Situation report) is very widely reported. In addition, the Employment Situation is almost always released on the third Friday after the end of the week of the 12th, which in turn is generally the first Friday after the reference month. This regularity likely assists with accurate recall.

Interestingly, “look-behind bias” (citing stale data) is fairly common. Series such as construction spending, the PPI, and housing sales all have big spikes in citing stale data in the days after a data release. This highlights the multiple risks from using LLMs to understand real-time phenomena: while they may engage in data peeking, they also may fail to properly update their information sets. Along some dimensions, these errors might roughly offset, leading to decent forecasting performance that is a mix of look-ahead bias contamination and stale data.

From Figures 14 and 15 it is apparent that “look-behind bias” is more common than look-ahead bias. An examination of the LLM responses shows that the LLM sometimes states it is being “conservative”, in the sense of only saying a data release has occurred when it is very sure that is the case. Note that while the prompt did not ask the LLM to be conservative in this sense, it is apparently imposing an asymmetric penalty on itself.

To explore this further we change the prompt to explicitly tell the LLM not to be conservative: if it is 51% sure that a new data release is available, that should be the answer; if it is only 49% certain then it should not be. The full prompt is in Figure 25. We run the new prompt for CPI and construction spending; see Figures 16 and 17. For construction spending, the new prompt is a big improvement in accuracy: while the fraction of responses suffering look-ahead bias increases, accuracy is higher and look-behind bias is lower. The fractions of look-ahead and look-behind bias are roughly equal, as we would expect with a symmetric penalty.
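One way to read the “conservative” behavior is as a simple expected-loss rule with unequal costs on the two kinds of error. The sketch below is our own stylized interpretation, not something the LLM reports or the paper estimates: `p_released` is a hypothetical subjective probability that the newer release is out, and `cost_lookahead` and `cost_stale` are assumed penalties for citing unreleased data and for citing stale data.

```python
def report_new_release(p_released, cost_lookahead=1.0, cost_stale=1.0):
    """Return True if citing the newer release has the lower expected loss.

    Citing the newer release is wrong with probability 1 - p_released
    (a look-ahead error); citing the older one is wrong with probability
    p_released (a stale error).
    """
    expected_loss_if_new = (1.0 - p_released) * cost_lookahead
    expected_loss_if_old = p_released * cost_stale
    return expected_loss_if_new < expected_loss_if_old

def threshold(cost_lookahead=1.0, cost_stale=1.0):
    """Probability above which citing the newer release is optimal."""
    return cost_lookahead / (cost_lookahead + cost_stale)

# Symmetric penalty, as in the alternative prompt: the cutoff is one half.
print(threshold(), report_new_release(0.51), report_new_release(0.49))
# Hypothetical asymmetric penalty: look-ahead errors cost three times as much.
print(threshold(3.0, 1.0), report_new_release(0.6, cost_lookahead=3.0))
```

Under equal costs the cutoff is 0.5, which reproduces the 51 percent versus 49 percent instruction. A larger cost on look-ahead errors raises the cutoff and shifts mistakes toward stale answers, the pattern seen under the original prompt.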
[Figure 14: LLM Recall of release dates: Relative to release date. Panels: CPI, Unemployment Rate, Industrial Production, Construction Spending, Retail Sales, Producer Price Index, Housing Starts, Housing Sales. x-axis: Days relative to release date; y-axis: Fraction of responses; series: LLM is correct, LLM response is stale, LLM lookahead bias. Note: Prompt asks LLM to state the reference date of the most recent data release as of a given day. Results are normalized so that the true data release is on day 0 and then averaged. Source: ALFRED, authors’ calculations.]

[Figure 15: LLM Recall of release dates: Relative to release date. Panels: Wholesale Trade, International Trade, Durable Goods Orders, JOLTS, Personal Income. x-axis: Days relative to release date; y-axis: Fraction of responses; series: LLM is correct, LLM response is stale, LLM lookahead bias. Note: Prompt asks LLM to state the reference date of the most recent data release as of a given day. Results are normalized so that the true data release is on day 0 and then averaged. Source: ALFRED, authors’ calculations.]

[Figure 16: Effect of alternative prompt on CPI. Panels: CPI, CPI: Symmetric penalty prompt. x-axis: Days relative to release date; y-axis: Fraction of responses; series: LLM is correct, LLM response is stale, LLM lookahead bias. Note: Prompt asks LLM to state the reference date of the most recent data release as of a given day. Results are normalized so that the true data release is on day 0 and then averaged. Source: ALFRED, authors’ calculations.]

[Figure 17: Effect of alternative prompt on construction spending. Panels: Construction Spending, Const. Spending: Symmetric penalty prompt. x-axis: Days relative to release date; y-axis: Fraction of responses; series: LLM is correct, LLM response is stale, LLM lookahead bias. Note: Prompt asks LLM to state the reference date of the most recent data release as of a given day. Results are normalized so that the true data release is on day 0 and then averaged. Source: ALFRED, authors’ calculations.]
The story is different for CPI. Originally, CPI had roughly equal look-ahead and look-behind biases. Under the new prompt, look-ahead bias worsens, look-behind bias is almost nonexistent, and accuracy is lower. Thus, it appears that the new prompt does not necessarily improve accuracy; instead, it trades off more look-ahead bias for less look-behind bias and the outcome depends on the initial balance.

7 Conclusion

LLMs are becoming important tools for economic analysis. Our results paint a complicated picture of current capabilities and shortcomings. We find that current LLMs have excellent retrospective knowledge of some macro variables (like CPI and the unemployment rate), but much noisier knowledge of GDP and IP growth. LLM recall of data release dates is also impressively accurate, but still suffers from noise; and the noise accumulates as more series are considered.

Our results point to problems when LLMs are used as real-time forecasters and evaluated before the knowledge cutoff. Fuzzy knowledge of data release dates, LLM estimates that smooth future data values into current estimates, and the mixture of first-print values with later revisions all suggest look-ahead bias contaminates LLM forecasts. This raises the question of why, if LLMs have significant look-ahead bias, LLM forecasts are only modestly better than SPF forecasts. Our evidence suggests that LLMs may be both too good and too bad: the fuzziness and imperfect recall that lead to look-ahead bias also limit forecast accuracy, since the LLM has limited recall of both the true target value and the historical variables it might use as predictors. These offsetting errors may leave LLMs with in-sample forecasts that are good, but not implausibly good. Some of these issues may be attenuated by more sophisticated prompting strategies, such as providing more information to ground the LLM. We leave exploration of these margins for future work.
References

Bybee, J. Leland, “The Ghost in the Machine: Generating Beliefs with Large Language Models,” 2023.

Cajner, Tomaz, Leland D. Crane, Christopher J. Kurz, Norman J. Morin, Paul E. Soto, and Betsy Vrankovich, “Manufacturing Sentiment: Forecasting Industrial Production with Text Analysis,” Finance and Economics Discussion Series 2024-026, Board of Governors of the Federal Reserve System (U.S.) May 2024.

Coibion, Olivier and Yuriy Gorodnichenko, “What Can Survey Forecasts Tell Us about Information Rigidities?,” Journal of Political Economy, 2012, 120 (1), 116–159.

Cook, Thomas R., Sophia Kazinnik, Anne Lundgaard Hansen, and Peter McAdam, “Evaluating Local Language Models: An Application to Bank Earnings Calls,” Research Working Paper RWP 23-12, Federal Reserve Bank of Kansas City November 2023.

Croushore, Dean, “Frontiers of real-time data analysis,” Journal of Economic Literature, 2011, 49 (1), 72–100.

Faria-e-Castro, Miguel and Fernando Leibovici, “Artificial Intelligence and Inflation Forecasts,” Working Papers 2023-015, Federal Reserve Bank of St. Louis July 2023.

Federal Reserve Bank of St. Louis, “ALFRED, Archival Federal Reserve Economic Data,” https://alfred.stlouisfed.org/.

Glasserman, Paul and Caden Lin, “Assessing Look-Ahead Bias in Stock Return Predictions Generated By GPT Sentiment Analysis,” 2023.

Hansen, Anne Lundgaard, John J. Horton, Sophia Kazinnik, Daniela Puzzello, and Ali Zarifhonarvar, “Simulating the Survey of Professional Forecasters,” November 15 2024. SSRN Working Paper.

He, Songrun, Linying Lv, Asaf Manela, and Jimmy Wu, “Chronologically Consistent Large Language Models,” Working paper 2025.

Jha, Manish, Jialin Qian, Michael Weber, and Baozhong Yang, “ChatGPT and Corporate Policies,” Working Paper 32161, National Bureau of Economic Research February 2024.

Kazinnik, Sophia, “Bank Run, Interrupted: Modeling Deposit Withdrawals with Generative AI,” 2024.

Kim, Alex G., Maximilian Muhn, and Valeri V. Nikolaev, “Financial Statement Analysis with Large Language Models,” 2024.

Korinek, Anton, “Generative AI for Economic Research: Use Cases and Implications for Economists,” Journal of Economic Literature, January 2023, 61 (4), 1281–1317.

Lopez-Lira, Alejandro, Yuehua Tang, and Mingyin Zhu, “The Memorization Problem: Can We Trust LLMs’ Economic Forecasts?,” arXiv preprint arXiv:2504.14765, 2025.

Ludwig, Jens, Sendhil Mullainathan, and Ashesh Rambachan, “Large Language Models: An Applied Econometric Framework,” 2025.

Manning, Benjamin S., Kehang Zhu, and John J. Horton, “Automated Social Science: Language Models as Scientist and Subjects,” 2024.

Ouyang, Shuyin, Jie M Zhang, Mark Harman, and Meng Wang, “An empirical study of the non-determinism of chatgpt in code generation,” ACM Transactions on Software Engineering and Methodology, 2025, 34 (2), 1–28.

Pham, Van and Scott Cunningham, “Can Base ChatGPT be Used for Forecasting without Additional Optimization?,” 2024.

Phan, Long, Adam Khoja, Mantas Mazeika, and Dan Hendrycks, “LLMs Are Superhuman Forecasters,” 2024.

Research Department, Federal Reserve Bank of Philadelphia, “Survey of Professional Forecasters,” https://www.phil.frb.org/research-and-data/real-time-center/survey-of-professional-forecasters/.

Sarkar, Suproteem and Keyon Vafa, “Lookahead Bias in Pretrained Language Models,” 2024.

Sarkar, Suproteem, “StoriesLM: A Family of Language Models With Time-Indexed Training Data,” Mar 2024. Available at SSRN: https://ssrn.com/abstract=4881024.

Schoenegger, Philipp, Peter S. Park, Ezra Karger, and Philip E. Tetlock, “AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy,” 2024.

Shapiro, Adam Hale, Moritz Sudhof, and Daniel J. Wilson, “Measuring news sentiment,” Journal of Econometrics, 2022, 228 (2), 221–243.

Tranchero, Matteo, Cecil-Francis Brenninkmeijer, Arul Murugan, and Abhishek Nagaraj, “Theorizing with Large Language Models,” 2024.

van Binsbergen, Jules H, Svetlana Bryzgalova, Mayukh Mukhopadhyay, and Varun Sharma, “(Almost) 200 Years of News-Based Economic Sentiment,” Working Paper 32026, National Bureau of Economic Research January 2024.

Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, 2022, 35, 24824–24837.

Zarifhonarvar, Ali, “Evidence on Inflation Expectations Formation Using Large Language Models,” 2024.
A Prompts You are a resourceful and knowledgeable economic analyst, with deep knowledge of macroeconomic data and forecasting. Think step-by-step, writing out your reasoning, and only write your final answer at the end of your response. Figure18: MainSystemPrompt Based on all knowledge available to you, tell me the fully-revised value of {var} for {reference_quarter}. This should be the value after all subsequent revisions, not necessarily as initially released. Do not try to forecast revisions that occur beyond your knowledge cutoff. If you are unsure, make an estimate based on what you know. Please give me a numeric point estimate, not a range. Use all of your powers of analysis and use all of the information you have available to you. Figure19: PromptforFully-RevisedData Based on all knowledge available to you, tell me the first print value of {var} for {reference_quarter}. This should be the value as initially released without any subsequent revisions. If you are unsure, make an estimate based on what you know. Please give me a numeric point estimate, not a range. Use all of your powers of analysis and use all of the information you have available to you. Figure20: PromptforFirstPrintData Tell me the first print value of {var} for {reference_quarter}. This should be the value as initially released without any subsequent revisions. If you are unsure, make an estimate based on what you know, but do not base your estimate on any data for reference periods after {reference_quarter}. In particular, don’t use values of {var} from after {reference_quarter} in constructing your estimates. Please give me a numeric point estimate, not a range. Use all of your powers of analysis and use all of the information you have available to you, subject to the constraints above. Figure21: PromptforFirstPrintData—RealTimeInformationSet 43
You are a specialist in extracting information from the output of other Large Language Models. You are succinct in your responses and response with exactly what is asked of you.

Figure 22: Summarization System Prompt

I have the following output from a large language model:

{llm_output}

This piece of text is an economic forecast. I want you to summarize and extract the prediction for the economic variable mentioned in the following format for me, please:

"answer: {...}"

Please replace the placeholders denoted with {...} with the answer over the requested quarter only. PLEASE ONLY put ONE NUMBER in that location. Refrain from reporting a range of values; please try to report a single value. Please only return this format with the right value and NO additional text. Thank you!

Figure 23: Summarization Prompt
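The summarization prompt asks for a reply of the form "answer: {...}" containing a single number. A minimal sketch of how that reply might be converted into a float is given below. The function name `extract_answer` and the tolerance for leftover braces, percent signs, and thousands separators are assumptions for illustration, not a description of the authors' pipeline.

```python
import re
from typing import Optional

# Matches "answer:" followed by an optional "{" and a signed number that
# may contain thousands separators and a decimal part.
_ANSWER_RE = re.compile(
    r"answer:\s*\{?\s*([-+]?\d[\d,]*(?:\.\d+)?)", flags=re.IGNORECASE
)


def extract_answer(text: str) -> Optional[float]:
    """Return the number following 'answer:', or None if none is found."""
    match = _ANSWER_RE.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


print(extract_answer("answer: 2.3"))
print(extract_answer("answer: {-0.5}"))
print(extract_answer("I could not find a value."))
```

Returning None rather than raising lets the caller decide whether to retry the summarization step or drop the observation.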
A.1 Prompts for Data Release Dates

Assume that it is 5pm, close of business on {month} {day_of_month}, {year}. Tell me the most recent month for which BLS has released any CPI estimate, i.e. the reference month for the most recent release on or before {month} {day_of_month}, {year}. Briefly explain your reasoning and only give your answer at the end. Give an exact month, no ranges. Make an estimate if you have to. When you give your answer, give it in year, "M", month format, i.e. 2035M2 for the February 2035, or 2001M11 for the November 2001. Make sure the final answer is given exactly in that format.

Figure 24: Prompt for CPI Release Date

Assume that it is 5pm, close of business on {month} {day_of_month}, {year}. Tell me the most recent month for which BLS has released any CPI estimate, i.e. the reference month for the most recent release on or before {month} {day_of_month}, {year}. Briefly explain your reasoning and only give your answer at the end. Give an exact month, no ranges. Make an estimate if you have to. It is equally bad to make mistakes in either direction: if you think there is a 51 percent chance the more recent release has occured, that should be your answer. If you think there is only a 49 percent chance the more recent release has occured, it should not be your answer. Do not be "conservative", we only care about raw accuracy. When you give your answer, give it in year, "M", month format, i.e. 2035M2 for the February 2035, or 2001M11 for the November 2001. Make sure the final answer is given exactly in that format.

Figure 25: Prompt for CPI Release Date, Risk Neutral Version

You are a resourceful and knowledgeable economic analyst, with deep knowledge of macroeconomic data and forecasting. Think step-by-step, writing out your reasoning, and only write your final answer at the end of your response.

Figure 26: Data Release Dates, System Prompt

You are a specialist in extracting information from the output of other Large Language Models. You are succinct in your responses and response with exactly what is asked of you.

Figure 27: Data Release Dates Summarizer, System Prompt
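The release date prompts ask for the reference month of the most recent release on or before a given date. One way the corresponding ground-truth answer could be computed from a release calendar is sketched below. The calendar entries are made-up placeholders, not the BLS schedule, and the function name `latest_reference_month` is hypothetical.

```python
from bisect import bisect_right
from datetime import date
from typing import List, Optional, Tuple

# Illustrative placeholder calendar of (release_date, reference_month).
# These dates are invented for the example and are NOT actual BLS dates.
CALENDAR: List[Tuple[date, str]] = [
    (date(2035, 1, 15), "2034M12"),
    (date(2035, 2, 14), "2035M1"),
    (date(2035, 3, 14), "2035M2"),
]


def latest_reference_month(
    as_of: date, calendar: List[Tuple[date, str]]
) -> Optional[str]:
    """Reference month of the latest release on or before `as_of`."""
    ordered = sorted(calendar)
    release_dates = [released for released, _ in ordered]
    # bisect_right counts releases dated on or before `as_of`, so a
    # release on the same day is treated as already available.
    idx = bisect_right(release_dates, as_of)
    return ordered[idx - 1][1] if idx > 0 else None


print(latest_reference_month(date(2035, 2, 14), CALENDAR))
print(latest_reference_month(date(2035, 2, 13), CALENDAR))
```

Counting a same-day release as available matches the "5pm, close of business" framing of the prompt, under the assumption that releases occur earlier in the day.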
I have the following output from a large language model:

{llm_output}

This piece of text contains a monthly date as an answer to a question. I want you to extract the date, which should be given as a year followed by a "M" followed by a month number with no spaces. Examples would be 2021M1, or 2035M11. Convert the date to that format if needed. Reply only with following format, please:

"answer: {...}"

Please replace the placeholders denoted with {...} with the data. Refrain from reporting a range of values; please try to report a single value. Make sure you extract the correct date; other dates might be discussed in the passage but make sure to extract the one given as the answer. Please only return this format with the right value and NO additional text. Thank you!

Figure 28: Data Release Dates Summarizer Prompt
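The summarizer is asked to return dates as a year, the letter "M", and a month number with no spaces, such as 2021M1 or 2035M11. A minimal sketch of reading and writing that format follows. The helper names `parse_month_code` and `to_month_code` are hypothetical, and rejecting month numbers outside 1 to 12 is an assumption about how malformed replies might be handled.

```python
import re
from typing import Optional, Tuple

# Four-digit year, "M", then a month number from 1 to 12.
_MONTH_CODE_RE = re.compile(r"\b(\d{4})M(1[0-2]|0?[1-9])\b")


def parse_month_code(text: str) -> Optional[Tuple[int, int]]:
    """Return (year, month) from the first valid code in `text`, else None."""
    match = _MONTH_CODE_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def to_month_code(year: int, month: int) -> str:
    """Format a year and month in the 'year, M, month' style of the prompts."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return f"{year}M{month}"


print(parse_month_code("answer: 2035M2"))
print(to_month_code(2001, 11))
```

Parsing into a (year, month) pair makes it straightforward to compare the model's answer with the true reference month and to measure how many months ahead or behind it is.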
Cite this document
Leland D. Crane, Akhil Karra, and Paul E. Soto (2025). Total Recall? Evaluating the Macroeconomic Knowledge of Large Language Models (FEDS 2025-044). Board of Governors of the Federal Reserve System, Finance and Economics Discussion Series. https://whenthefedspeaks.com/doc/feds_2025-044
@techreport{wtfs_feds_2025_044,
author = {Leland D. Crane and Akhil Karra and Paul E. Soto},
title = {Total Recall? Evaluating the Macroeconomic Knowledge of Large Language Models},
type = {Finance and Economics Discussion Series},
number = {2025-044},
institution = {Board of Governors of the Federal Reserve System},
year = {2025},
url = {https://whenthefedspeaks.com/doc/feds_2025-044},
abstract = {We evaluate the ability of large language models (LLMs) to estimate historical macroeconomic variables and data release dates. We find that LLMs have precise knowledge of some recent statistics, but performance degrades as we go farther back in history. We highlight two particularly important kinds of recall errors: mixing together first print data with subsequent revisions (i.e., smoothing across vintages) and mixing data for past and future reference periods (i.e., smoothing within vintages). We also find that LLMs can often recall individual data release dates accurately, but aggregating across series shows that on any given day the LLM is likely to believe it has data in hand which has not been released. Our results indicate that while LLMs have impressively accurate recall, their errors point to some limitations when used for historical analysis or to mimic real time forecasters.},
}