Validating Large Language Model Annotations
Abstract
This paper proposes a validation framework for LLM-generated measurements when reliable benchmarks are unavailable. Validity is established by testing whether an LLM can reconstruct passages from annotated labels while maintaining semantic consistency with the original text. The framework avoids circular reasoning by establishing testable prerequisite properties that must be met for a validation to be considered successful. An application to news article data demonstrates that the framework serves as a practical alternative to human benchmarking, offering advantages in objectivity, scalability, and cost-effectiveness while identifying cases where LLMs capture economic meaning that human evaluators miss.
Finance and Economics Discussion Series
Federal Reserve Board, Washington, D.C.
ISSN 1936-2854 (Print) ISSN 2767-3898 (Online)
Anne Lundgaard Hansen
2026-020
Please cite this paper as: Hansen, Anne Lundgaard (2026). “Validating Large Language Model Annotations,” Finance and Economics Discussion Series 2026-020. Washington: Board of Governors of the Federal Reserve System, https://doi.org/10.17016/FEDS.2026.020.
NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. References in publications to the Finance and Economics Discussion Series (other than acknowledgement) should be cleared with the author(s) to protect the tentative character of these papers.
Anne Lundgaard Hansen*
Federal Reserve Bank of Richmond
Board of Governors of the Federal Reserve System
First draft: September 25, 2025. This draft: March 11, 2026.
Keywords: Large Language Models, Validation Framework, Text Annotation, Sentiment Analysis.
JEL Codes: C18, C45, C80.
* The author is with the Quantitative Supervision and Research group at the Federal Reserve Bank of Richmond and affiliated with the Financial Stability Cyber and AI Sentinel Lab at the Board of Governors of the Federal Reserve System. E-mail address: Anne.Hansen@rich.frb.org. The views expressed in this paper are solely those of the author and do not reflect the opinions of the Federal Reserve Bank of Richmond or the Board of Governors of the Federal Reserve System. The author thanks Jeff Allen, Steve Ge, Sophia Kazinnik, Seung J. Lee, Huiyu Li, Viviana Luccioli, David MacArthur, Nitish Sinha, Christina Wang, and seminar participants at the Federal Reserve Board for valuable discussion and feedback.
1 Introduction
Large language models (LLMs) are proving to be powerful tools for addressing questions within economics and finance. These models are increasingly used to quantify textual data, e.g., by assigning sentiment scores or classifying topics, and the resulting measurements are subsequently plugged into downstream estimation problems.¹ One major concern with such approaches is the black-box nature of LLMs. With billions of parameters (which are unknown to users if the model is proprietary) and extraordinarily large training data sets, it is difficult (if not impossible) to tease out the inner workings of these models. What is more, LLMs are designed to satisfy end users and therefore tend to provide answers, even in cases where instructions are unclear and carry multiple interpretations. LLMs are also known to occasionally hallucinate, and it is poorly understood under which circumstances hallucinations occur and how they can be avoided. It follows that researchers should question the validity of LLM annotations.
¹ See, e.g., Bertsch et al. (2025); Fan et al. (2024); Jha et al. (2024); Liu and Shi (2025); Shah et al. (2024); Kirtac and Germano (2024).
To build trust in LLM-generated measurements, researchers typically benchmark results against human-generated data, at least for a subset of their sample.² This approach is also recommended by Ludwig et al. (2025), who further suggest using human-generated measurements to debias LLMs. However, the validity of human-generated annotations is also questionable. Human measurements are likely influenced by subjectivity and inconsistent treatment of data, e.g., due to learning as the task progresses, changing external environments under which the task is completed, and inattention or straight-up burnout.³ Against such considerations, LLMs and AI in general have been cited for their ability to generate objective and consistent measurements (Sharma et al., 2025; Mirzakhmedova et al., 2024; Du et al., 2025; Törnberg, 2023). In addition to these concerns, human validation is time-consuming and costly, especially if relying on genuine domain experts. Manual labeling tasks are therefore often crowdsourced, with the risk that crowd workers use LLMs for the task.⁴
² See, e.g., Bauer et al. (2024); Chen et al. (2022); Cook et al. (2025); Hansen and Kazinnik (2024); Shapiro et al. (2022).
³ These concerns are acknowledged in the literature and are often dealt with by averaging the responses of multiple human evaluators in an attempt to cancel out errors (Hansen and Kazinnik, 2024; Malo et al., 2014; Cook et al., 2025). This method, however, only works if errors are symmetrically distributed around zero and the number of human evaluators is large.
This paper addresses the question: How can researchers validate LLM measurements in the absence of reliable external benchmark data, such as human-generated measurements? I propose minimal requirements that a researcher should confirm before deploying LLMs for measuring inputs to a downstream estimation problem. Then, I present methods to test these requirements.
In my framework, the combination of an LLM and a prompt designed to extract some measure from a text is considered valid as an entity if the measure repeatedly helps reconstruct the original text. This idea is similar to the practice of checking the goodness-of-fit of an estimated model in traditional econometrics. In the context of LLMs, the data generating process is the combination of an LLM and a prompt, and the coefficient is the measure one wishes to extract from the text. Just as the interpretation of a coefficient relies on the data generating process in traditional econometrics, the validation framework requires the combination of the LLM and a prompt to validate the label.
To rule out concerns of circular dependency, I specify two additional requirements. First, the annotation backtranslation property requires that the LLM system can translate between label and text without introducing errors. Second, the separation property requires that two texts generated based on different labels can be clearly separated. Together, these conditions ensure in most cases that an erroneous measure does not pass the validation framework.
This work builds on the econometric framework for LLMs proposed by Ludwig et al. (2025). Their framework assumes that there exists a measurement that can be observed without using LLMs, although it may be costly to do so, e.g., using human-generated labels.
In this context, they argue that LLMs should only be used for estimation problems establishing economic relationships when (i) the training data of LLMs do not overlap with the estimation data and (ii) LLM errors are explicitly accounted for by comparing human and LLM-generated results. In contrast to Ludwig et al. (2025), I focus on problems where the researcher does not have access to an external benchmark, e.g., due to human biases and errors. This setting is relevant in many applications, as human measurements may not be reliable.
⁴ Veselovsky et al. (2023) estimate that 33-46% of crowd workers use LLMs for completing tasks.
Benchmarking and evaluating the capabilities of LLMs has garnered considerable attention among researchers and practitioners. This trend is exemplified by the proliferation of leaderboards that rank LLMs according to various standardized benchmark tests targeting diverse competencies, including coding proficiency, knowledge retention, human-like reasoning, and scientific reasoning. However, a model’s ranking on such leaderboards provides limited insight into its reliability for specific tasks (Ludwig et al., 2025). Furthermore, these ranking systems have been criticized for their susceptibility to gaming and distortion (Singh et al., 2025). Patwardhan et al. (2025) suggest assessing the trustworthiness of LLMs by evaluating the consistency in answers across multiple model runs. They also consider a cross-validation technique, where the response generated by one LLM is compared with the responses of other LLMs. However, Ludwig et al. (2025) and Reiss (2023) argue that the errors of LLMs are unpredictable and do not necessarily center around true values, implying that the average of a large number of LLM predictions does not necessarily constitute a correct one. Wang et al. (2023) suggest generating multiple sets of reasoning from the same model and picking the most consistent one. They show that this method improves results in reasoning tasks, which are inherently different from annotation tasks.⁵ While the validation framework I propose also uses the LLM to validate itself, similar to these contributions, the criterion for validity is rooted in the original passage rather than purely focusing on LLM output.
⁵ Indeed, the ability to generate consistent and logical arguments may not coincide with the ability to produce meaningful annotations.
I proceed as follows: Section 2 defines requirements for validity of LLM annotations. Section 3 proposes a test that assesses whether a choice of LLM and prompting strategy satisfies these requirements. Section 4 presents illustrative examples of how the test can be used to validate the ability of LLMs to classify sentiment, clarity, temporal focus, and mentions of specific topics. This section also applies the validation framework across different sizes and generations of LLMs, showing that larger and more recent models pass the prerequisite tests needed for the framework more frequently. In Section 5, I present a full-scale application showcasing how the tests can replace human benchmarking in LLM annotation workflows. Limitations are discussed in Section 6, and conclusions follow in Section 7.
2 Requirements for Validity
Consider the problem of using an LLM to produce an annotation $\beta \in S$ from a passage of text $y$. For example, $y$ could be news article headlines or earnings call transcripts, and $\beta$ could be sentiment labeled from the set $S = \{\text{positive}, \text{neutral}, \text{negative}\}$ or other measures of linguistic characteristics. The passage $y$ can be a subset of a full document, e.g., sentences or passages from earnings call documents.⁶ The measure is assumed to be discrete, but can be either numeric or categorical.
⁶ For many LLMs, chunking of full-document texts is necessitated by restrictions on the length of the model’s context window.
The application of LLMs involves the choice of a model and a prompting strategy. Both elements impact the results and are difficult to disentangle: a prompt may work in one way with one LLM and another way with another LLM. I shall therefore refer to these choices as one entity, which I call a function.
Importantly, as in Ludwig et al. (2025), I do assume that there exists a “true” measurement. But, whereas Ludwig et al. rely on human annotations to define truth, I assume that the true measurement cannot necessarily be recovered outside of LLMs, e.g., due to human bias and inattention. Let $\beta^*$ denote this true annotation, and let $f^*$ be the combination of an LLM and a prompt that generated the observed passage of text $y$ based on $\beta^* \in S$. The true data generating process can thus be described by:
$$y = f^*(\beta^*). \tag{1}$$
Other than the explicit dependency on the annotation $\beta^*$, there are no requirements on the prompt component of $f^*$. It may incorporate various enhancements such as retrieval-augmented generation (RAG) with external data, detailed instructions, few-shot examples, and chain-of-thought reasoning, or it may simply be a straightforward request to generate a text with characteristic $\beta^*$.
The true data generating process given by $f^*$, and thus $\beta^*$, is unknown. Instead, the researcher uses a choice of LLM and prompting strategy to annotate the text $y$. Let $f^{-1}$ denote the chosen method, and let $\hat{\beta} \in S$ be the resulting annotation:
$$\hat{\beta} = f^{-1}(y). \tag{2}$$
The question is: how can $\hat{\beta}$ be validated as a measurement of $\beta^*$, when only $y$ is observed?
If the problem were numeric, an econometrician could estimate $\hat{\beta}$ and check the goodness-of-fit of the fitted values $f(\hat{\beta})$. For example, in a linear regression model, OLS estimation yields $\hat{\beta} = f^{-1}(y) = (X'X)^{-1}X'y$ and the fitted values are given by $\hat{y} = f(\hat{\beta}) = X\hat{\beta}$. To formulate this idea in the context of textual data with categorical labels, suppose that the annotator function $f^{-1}$ is accompanied by a text generator function $f$ that generates a passage of text from a label $\beta$. Analogous to the linear regression example, I consider $\hat{\beta}$ a valid annotation for $\beta^*$ if $\hat{y} = f(\hat{\beta})$ provides a satisfactory goodness-of-fit to $y$. In the context of textual data, I propose to measure goodness-of-fit by semantic similarity.
Continuing the linear regression analogy, measuring goodness-of-fit is not sufficient to build confidence in an estimated quantity. An estimator should at least be consistent, ensuring that estimates converge in probability to $\beta^*$ in the limit. Similarly, I impose requirements on the functions $f$ and $f^{-1}$. Specifically, I impose two requirements on the generator function $f$: the annotation backtranslation⁷ and the separation properties. The annotation backtranslation property ensures that $f$ and $f^{-1}$ are mutually consistent, while the separation property ensures that the function $f$ generates texts with different defining characteristics for different labels. These properties are detailed below.
⁷ Annotation backtranslation is a variation of the backtranslation property from machine learning, where accuracy of a translated text is assessed by re-translating it back to its original language (Sennrich et al., 2016). Li et al. (2024) also adopt the idea of backtranslation to generate instruction prompts used to simulate training data for fine-tuning language models; they denote their method instruction backtranslation.
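The regression analogy used above can be made concrete in a few lines. The following is a minimal Python sketch on synthetic data, not part of the paper: the sample size, coefficients, and noise level are assumptions chosen purely for illustration. It treats OLS estimation as the "annotation" step $f^{-1}$ and the fitted values as the "generation" step $f$, and then checks how well the reconstruction fits the observed data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = X beta* + noise
n_obs = 200
X = rng.normal(size=(n_obs, 2))
beta_true = np.array([1.0, -2.0])          # plays the role of beta*
y = X @ beta_true + 0.1 * rng.normal(size=n_obs)

# "Annotation" step: beta_hat = f^{-1}(y) = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# "Generation" step: y_hat = f(beta_hat) = X beta_hat
y_hat = X @ beta_hat

# Goodness-of-fit of the reconstruction, here measured by R^2
r_squared = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print("beta_hat:", beta_hat)
print("R^2:", r_squared)
```

In the textual setting, the estimate is replaced by an annotated label, the fitted values by a simulated passage, and $R^2$ by a measure of semantic similarity.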
Annotation Backtranslation Property: The functions $f$ and $f^{-1}$ satisfy the annotation backtranslation property if for any annotation $\beta \in S$,
$$f^{-1}(f(\beta)) = \beta. \tag{3}$$
□
Separation Property: The function $f$ satisfies the separation property if for all annotations $\gamma \neq \beta$ with $\gamma \in S$, $f(\gamma)$ is not semantically similar to $f(\beta)$. □
The annotation backtranslation property addresses the concern that validation fails for a correct $\hat{\beta}$ due to an erroneous simulation function $f$, i.e., $\hat{\beta} = \beta^*$ but $f(\beta^*)$ is not semantically similar to $y$. In this case, $\hat{\beta}$ would be incorrectly rejected as a valid annotation because of the failure of $f$. Choosing $f$ and $f^{-1}$ such that (3) is satisfied rules out such cases because if $\hat{\beta} = \beta^*$ but $f(\hat{\beta}) = \check{y}$, where $\check{y}$ is not semantically similar to $y$, then
$$f^{-1}(\check{y}) \neq \beta^*, \tag{4}$$
which contradicts (3). Similarly, the property also captures cases where a wrong label is validated due to an erroneous simulation function.
Another concern is that $\hat{\beta}$ is incorrect, i.e., $\hat{\beta} \neq \beta^*$, but $f(\hat{\beta})$ generates a text that is semantically similar to $y$. In this case, both $f$ and $f^{-1}$ fail, but in such a way that their combination appears valid. This is problematic because the researcher would accept an incorrect annotation $\hat{\beta}$. The separation property rules out this type of error. Specifically, the separation property ensures that if $\hat{\beta} \neq \beta^*$, then $f(\hat{\beta})$ is not semantically similar to $y$.
Given this setup, I propose the following definition for an annotation to be a valid measure of $\beta^*$:
Valid Annotation: The annotation $\hat{\beta}$ defined in (2) is a valid measure of $\beta^*$ if, for a function $f$ that accompanies the annotation function $f^{-1}$ such that the annotation backtranslation and separation properties are satisfied, $\hat{y} = f(\hat{\beta})$ is semantically similar to $y$. □
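To see how the two properties and the validity definition interact, it can help to work through a deliberately simplified two-label setting. The sketch below is a toy under strong assumptions and is not part of the paper's method: a text is represented only by its sentiment symbol, semantic similarity is reduced to equality of symbols, the true process $f^*$ is the identity, and the names `check`, `identity`, `flipped`, `always_neg`, and `always_pos` are hypothetical.

```python
LABELS = ["+", "-"]


def check(f_inv, f):
    """Evaluate a toy annotator f_inv (text -> label) and generator f (label -> text).

    Toy assumption: a text is just its sentiment symbol, and two texts are
    "semantically similar" exactly when their symbols are equal.
    """
    # Equation (3): annotating a generated text returns the generating label.
    backtranslation = all(f_inv[f[b]] == b for b in LABELS)
    # Separation: different labels never generate similar texts.
    separation = all(f[b] != f[g] for b in LABELS for g in LABELS if b != g)
    # Validation: the text regenerated from the annotated label matches y.
    reconstructs = {y: f[f_inv[y]] == y for y in LABELS}
    # With f* the identity, the true label of text y is y itself.
    labels_correct = all(f_inv[y] == y for y in LABELS)
    return backtranslation, separation, reconstructs, labels_correct


identity = {"+": "+", "-": "-"}
flipped = {"+": "-", "-": "+"}
always_neg = {"+": "-", "-": "-"}   # annotator biased towards negative labels
always_pos = {"+": "+", "-": "+"}   # generator biased towards positive texts

cases = {
    "correct annotator, correct generator": (identity, identity),
    "flipped annotator, correct generator": (flipped, identity),
    "negative-biased annotator, positive-biased generator": (always_neg, always_pos),
    "flipped annotator, flipped generator": (flipped, flipped),
}

for name, (f_inv, f) in cases.items():
    print(name, check(f_inv, f))
```

In this toy setting, only the last pair passes every check while assigning wrong labels, and it does so only because the errors of the annotator and the generator offset each other exactly.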
This definition establishes truth based on the semantic similarity between text generated with this label and the original text. This standard is intuitively appealing beyond the analogy to linear regression presented above: If the generated text is not similar to the original one, then the annotated label is not capturing the text’s essence and should not be considered valid. Note that this definition also invalidates annotation problems that are ill-defined, e.g., scoring aspects of a text that have little relevance for the text’s character.
The researcher has considerable flexibility in specifying the function $f$, including the choice of prompt, model, and hyper-parameters such as temperature. The only requirement is that the function satisfies the prerequisite properties jointly with the annotation function $f^{-1}$. The researcher may therefore introduce elements from the original text into the prompt, e.g., its characteristics such as style and length, to increase semantic similarity with the original text. This is permissible as long as the inclusion does not prevent the function from simulating semantically distinct texts from different labels (the separation property).
2.1 Addressing Concerns of Circular Dependency
The concept of using LLMs to validate themselves raises important concerns of circular dependency. The annotation backtranslation and separation properties will jointly detect most issues. However, there remains one type of error that could pass the validation framework due to circular dependency. This type of error is highly specialized, requiring multiple aspects to always fail in a specific way, and is therefore unlikely to occur. Nevertheless, examining this scenario in detail is valuable for understanding the boundaries and potential vulnerabilities of the framework.
Consider the simple problem of annotating the sentiment of a text as either positive or negative. If an LLM annotation is valid, the translation between true label $\beta^*$, original text $y$, annotated label $\hat{\beta}$, and simulated text $\hat{y}$ clearly separates positive from negative as follows:
$\beta^* \xrightarrow{f^*} y \xrightarrow{f^{-1}} \hat{\beta} \xrightarrow{f} \hat{y}$
+ → + → + → +
− → − → − → −
An annotation function $f^{-1}$ that fails the validity requirement either annotates a positive text as negative, a negative text as positive, or both:
$\beta^* \xrightarrow{f^*} y \xrightarrow{f^{-1}} \hat{\beta} \xrightarrow{f} \hat{y}$
+ → + → − → −
− → − → + → +
This type of error is easily detected by the validation framework, because $y$ will not be semantically similar to $\hat{y}$.⁸ But what happens if not only $f^{-1}$ is invalid but $f$ also generates text with a different sentiment than intended by $\hat{\beta}$? For example, in a system where the annotation function is biased towards negative sentiment but the simulation function is biased in the other direction, $y$ and $\hat{y}$ may be semantically similar despite $\hat{\beta} \neq \beta^*$:
$\beta^* \xrightarrow{f^*} y \xrightarrow{f^{-1}} \hat{\beta} \xrightarrow{f} \hat{y}$
+ → + → − → +
− → − → − → +
In this case, the LLM system yields a positive text $\hat{y}$ based on a positive origin $y$ but an erroneous negative label $\hat{\beta}$. However, the validation framework captures this case because the prerequisite properties fail. Specifically, the separation property fails because $f$ generates a positive text regardless of the value of $\hat{\beta}$.
⁸ It is also likely that the annotation backtranslation property will fail.
The validation framework thus detects cases where the annotation and simulation functions are biased in one direction. The only remaining case of concern occurs if there are counteracting biases that occur in both directions:
$\beta^* \xrightarrow{f^*} y \xrightarrow{f^{-1}} \hat{\beta} \xrightarrow{f} \hat{y}$
+ → + → − → +
− → − → + → −
This type of error will not be detected by the validation framework proposed in this paper. However, I deem these cases to be highly unlikely as they require errors to occur systematically and in a way where biases in $f^{-1}$ and $f$ perfectly counteract each other.
3 Testing Validity
This section presents methods for testing the requirements for validity outlined above. An annotation $\hat{\beta}$ as defined by (2) should be rejected as a valid measurement for the passage $y$ if a passage simulated from $\hat{\beta}$, $f(\hat{\beta})$, is sufficiently different from $y$. Semantic similarity can be measured using cosine similarity between vector representations of $y$ and $f(\hat{\beta})$. As $f(\hat{\beta})$ is non-deterministic, I propose assessing validity based on the average cosine similarity between $y$ and a large number of outcomes for $f(\hat{\beta})$. In other words, $\hat{\beta}$ is not valid if the statistic,
$$\tau = \frac{1}{N} \sum_{i=1}^{N} \mathrm{cossim}\left(y, f(\hat{\beta})_i\right), \tag{5}$$
where $f$ is chosen such that $f$ and $f^{-1}$ satisfy the annotation backtranslation and separation properties, is sufficiently small. To determine a threshold for this condition, I propose to simulate the test statistic under the null hypothesis that $\hat{\beta}$ is a valid annotation, akin to bootstrap testing from the traditional econometric toolbox. The bootstrap can be performed using the following steps:
1. Generate a text given $\hat{\beta}$: $\tilde{y} = f(\hat{\beta})$.
2. Simulate the test statistic given $\tilde{y}$: $\tilde{\tau}_b = \frac{1}{N} \sum_{i=1}^{N} \mathrm{cossim}\left(\tilde{y}, f(\hat{\beta})_i\right)$.
3. Repeat steps 1-2 a large number of times ($B$) to obtain a distribution of test statistics under the null hypothesis: $\{\tilde{\tau}_b\}_{b=1}^{B}$.
4. Reject at significance level $\alpha$ if the $\alpha$’th percentile of $\{\tilde{\tau}_b\}_{b=1}^{B}$ exceeds $\tau$.
The distribution of $\tilde{\tau}_b$ shows the range of values one would expect to observe if $\hat{\beta}$ is the true measurement for $y$. If the statistic based on the observed passage $y$ falls in the far left tail of this distribution (as defined by the significance level), the annotation $\hat{\beta}$ should not be considered a valid measurement.
The requirement for validity hinges on the annotation backtranslation and separation properties. These assumptions are testable within similar frameworks.
Testing the Annotation Backtranslation Property: For a choice of $f$ and $f^{-1}$, and a given annotation $\beta$, e.g., $\beta = \hat{\beta}$, the annotation backtranslation property can be assessed using the following steps:
1. Fix an annotation $\beta$.
2. Generate a passage given $\beta$: $\tilde{y} = f(\beta)$.⁹
3. Generate an annotation for $\tilde{y}$: $\tilde{\beta} = f^{-1}(\tilde{y})$.
4. Define a binary variable that takes the value one if $\tilde{\beta} = \beta$ and zero otherwise: $I_i = \mathbb{1}_{\tilde{\beta} = \beta}$.
5. Repeat steps 2-4 a large number of times ($N$). The average $\frac{1}{N} \sum_{i=1}^{N} I_i$ defines the accuracy and should be close to one.
⁹ If $\beta$ is chosen such that $\beta = \hat{\beta}$, this step can be performed by reusing the passages generated during the test of validity, or vice versa.
Testing the Separation Property: For a choice of $f$ and $f^{-1}$, and a given annotation $\beta$, e.g., $\beta = \hat{\beta}$, testing the separation property is a test of the null hypothesis that for any label $\gamma \neq \beta$, $f(\gamma)$ is not semantically similar to $f(\beta)$. Consider the test statistic,
$$\xi = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \mathrm{cossim}\left(f(\beta)_i, f(\gamma)_j\right). \tag{6}$$
The null hypothesis is rejected if $\xi$ is sufficiently large. The following bootstrapping algorithm can be used to compute a rejection threshold:
1. Generate $N$ passages given an annotation $\gamma \neq \beta$: $\tilde{y}_i = f(\gamma)$ for $i = 1, 2, \ldots, N$.
2. Simulate the test statistic given $\tilde{y}$: $\bar{\xi}_b = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \mathrm{cossim}\left(f(\beta)_i, \tilde{y}_j\right)$.
3. Repeat steps 1-2 a large number of times ($B$) to obtain a distribution of test statistics under the null hypothesis: $\{\bar{\xi}_b\}_{b=1}^{B}$.
4. Reject at significance level $\alpha$ if $\xi$ exceeds the $(1-\alpha)$’th percentile of $\{\bar{\xi}_b\}_{b=1}^{B}$.
3.1 Interpretation
If an LLM and associated prompting strategy pass both the validity test and the prerequisite tests, an annotation $\hat{\beta}$ generated by this model-prompt combination can be regarded as valid according to the definition put forth in this paper. Passing the validation test validates not only the LLM annotation, but also the function (prompt and model) used to generate that label and the simulation function that translates the label into text.
In contrast, if the validation test rejects at conventional significance levels (regardless of the outcomes of the prerequisite tests), $\hat{\beta}$ should not be applied in further analyses. Technically, $\hat{\beta}$ could still be correct. For example, randomly choosing between the sentiment labels negative, neutral, and positive will on average yield a correct result one third of the time. However, it will often be wrong and the results are therefore not reliable. It is important to note that the validation test rejects the annotation and simulation functions jointly. The source of rejection could be the model, the specific prompts, how the text is parsed (e.g., passages for which multiple labels apply will likely not be validated by the framework), or the labeling task, including the set of possible labels $S$. Adjusting any of these settings, such as deploying an alternative prompting strategy or using a different LLM, may change the test outcomes.
What happens if the validation test passes, but the prerequisite tests fail? The prerequisite properties are necessary conditions for the validation framework to work. Failing these tests therefore invalidates the framework, and the annotation cannot be concluded to be either valid or invalid.
4 Illustrative Examples
I illustrate the proposed tests on five different text passages, all related to financial and economic applications.
A full-scale application to a larger number of passages is presented in the following section. To test the methods on a wide variety of textual data, I consider passages taken from a bank’s 10-K filing, an earnings call transcript, the title and subtitle of a news article, a speech given by a governor of the Federal Reserve System, and a comment from a Reddit conversation thread. These passages vary substantially in linguistic style and length. Together, they offer a varied panel that illustrates the effectiveness of the tests. The passages are defined below:
10-K Filing: “An adverse change in market conditions in particular segments of the economy, such as a sudden and severe downturn in oil and gas prices or an increase in commodity prices, severe declines in commercial real estate values, or sustained changes in consumer behavior that affect specific economic sectors, could have a material adverse effect on our clients whose operations or financial condition are directly or indirectly dependent on the health or stability of those market segments or economic sectors, as well as clients that are engaged in related businesses.” (JPMorgan Chase & Co., Form 10-K, December 31, 2024; slightly altered to remove firm name).
Earnings Call: “This client value proposition combined with disciplined pricing helped drive a 9% year over year increase in net interest income. Another strong highlight this quarter was expense management, which enabled us to deliver more than 600 basis points of operating leverage. Continued innovation and the deployment of advanced technology and tools helped us to hold expense growth to just 1% year over year while revenue grew significantly. As a result, our efficiency ratio improved falling below 50% for the quarter. We continue to invest in high-tech, which drove higher digital engagement and we continue to invest in high touch.” (Bank of America, Transcript of 2025 Q3 earnings call, October 15, 2025).
News Article: “Crypto Investors Celebrated for Most of 2025. Then Came the Hangover. Bitcoin finished the year in the red as investors grappled with the AI trade and macroeconomic risk.” (Wall Street Journal, January 1, 2026).
Speech: “Living and teaching in Michigan during the Great Recession, I saw firsthand how the financial system’s fragility contributed directly to job losses. One example is how the default of Lehman Brothers contributed, via a chain of events, to declines in employment in Michigan. Lehman’s failure in September 2008 led a money market fund to "break the buck"—the fall in the value of its assets meant it could no longer redeem shares for the $1 that investors expected to receive—prompting a run on the funds. In turn, the funds pulled back from riskier assets, including asset-backed commercial paper. But the major auto finance companies depended on that commercial paper to finance loans to consumers; hence, they came under stress. With less credit available, auto sales plummeted, and Michigan was hit very hard. Many people—including some of my family members, my students’ and colleagues’ family members, friends, and neighbors—lost their jobs and experienced significant hardship” (Lisa D. Cook, “A Policymaker’s View of Financial Stability” delivered at Georgetown University’s McDonough School of Business Psaros Center for Financial Markets and Policy, Washington, D.C., November 20, 2025).
Reddit Comment: “Interest rates aren’t really high, these are normal. You got spoiled previously. Mid 5s is where rates we will be if new buyers get lucky.” (Reddit comment by @memorabiliafan, April 20, 2025).¹⁰
For these passages, I apply the test to various common tasks. Specifically, Sections 4.1-4.5 test the validity of measuring simple sentiment, granular sentiment, clarity, the temporal focus, and the discussion of certain topics. The main set of experiments is presented using the Claude 4.5 Sonnet model (referred to as the LLM), a frontier LLM at the time the experiments were conducted, released September 29, 2025.¹¹ Section 4.7 is devoted to discussing validation of LLMs across generations. The Amazon Titan Text Embedding v2 model is used to generate vector embeddings that are nuanced and context-aware for accurately computing cosine similarity. All applications are facilitated through the AWS Bedrock API.
¹⁰ The Reddit conversation is available at https://www.reddit.com/r/Mortgages/comments/1k3tn81/comment/mo5t9d9/, retrieved January 3, 2026.
¹¹ The model is implemented with a temperature of 1.0 and unconstrained nucleus sampling (corresponding to a top-p parameter of 1.0), ensuring maximum variation in the simulations. No system prompt is specified, deferring to the model’s default behavior.
4.1 Experiment I: Measuring Simple Sentiment
The first set of experiments considers the problem of extracting sentiment scores from each of the passages. Starting simple, I first test the validity of the LLM annotating each passage as positive, neutral, or negative. The validation test and its prerequisites require the specification of an annotation prompt and a simulation prompt. The former requests the model to classify a text provided as an input using one of the three labels. Abstracting from instructions on how to return output in a JSON object, the annotation prompt is given as follows:
    TASK: Read the following passage and classify the sentiment using one of the following labels: [Positive, Neutral, Negative].
    PASSAGE: {text}
Specification of the simulation prompt is more subtle. On the one hand, instructing the model to simulate a passage similar to the original passage may increase the test statistic. On the other hand, providing too detailed instructions may inflate the distribution of cosine similarity under the null hypothesis, leading to a rejection of validity. It may also cause the separation property to fail, invalidating the validation test. For all passages, I found that describing in broad terms the content, length, and linguistic style of the original passage in the prompt balances this trade-off. The simulation prompt, given a sentiment label, is thus defined for the 10-K Filing passage as follows (abstracting from instructions on the returned JSON object):¹²
    TASK: Write {n} arbitrary passages from a bank’s annual 10-K filing with sentiment characterized as {label}. The passages should describe how market conditions may affect the bank’s clients. Write the passages in passive voice. The length of each passage should be around 90 words.
Results are shown in Table 1. The table shows results for both prerequisite tests and the validation test. Panel (a) shows test outcomes for the 10-K filing. The model annotates the passage with negative sentiment. The annotation backtranslation property is therefore tested given the negative label. Specifically, the simulation prompt provided above is used with the LLM to simulate one hundred passages with a negative label. The annotation prompt is then used with the LLM to score the sentiment of all simulated passages.
The table shows that all simulated texts are scored with a negative label, resulting in an accuracy of 100%. The backtranslation property is therefore satisfied. The separation property is tested for all but the estimated sentiment of the original passage, i.e., it is tested for the positive and neutral labels. The test statistics are respectively 0.58 and 0.60, which are lower than their associated critical values of 0.60 and 0.62. The test of this property is therefore passed as well. Given these test outcomes, the validation test is meaningful. The validation test passes as the test statistic of 0.623 exceeds the critical value of 0.621. It follows that the LLM annotation of the 10-K filing passage can be considered valid.
¹² Simulation prompts for the other passages follow a similar structure. All prompts are available in the online appendix published on my website, https://sites.google.com/view/anneh/.
The validation test also passes for the remaining passages, reported in panels (b)-(e). These conclusions therefore support the hypothesis that LLMs are able to correctly classify sentiment on a simple scale.
Notably, the performance of LLMs sometimes depends on the definition of passages. The tests provided in this paper can also help guide how texts should be optimally parsed for obtaining the highest accuracy. For example, in addition to the Reddit comment listed above, I also tested the validity of the LLM scoring the sentiment of a conversation between multiple users. Specifically, I considered the following extended version of the Reddit passage:
“[Firm_Care_7439:] Will we ever get COVID type rates ever again? [FastSunlul:] No because I’m finally old enough and with money therefore it won’t happen. [Khandious:] Yeah, Rates will be 2.75% on Monday - June 12, 2028 @ 9:53AM EST. [memorabiliafan:] Interest rates aren’t really high, these are normal. You got spoiled previously. Mid 5s is where rates we will be if new buyers get lucky. [Big-Business1921:] Absolutely! If my calculations are correct, I anticipate it happening around 2092.”¹³
The LLM rates this passage as negative. However, the validation test fails even when the simulation and annotation functions pass the annotation backtranslation and separation tests. Indeed, even from a manual reading of this passage, the sentiment is unclear, as sarcastic ping-pong between users makes it difficult to objectively rate the passage. For the following experiments, I proceed only with the simple Reddit comment presented above.
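The workflow behind these results (simulate passages from a label, re-annotate them, and compare a similarity statistic with a bootstrap threshold) can be sketched in code. The following is a minimal Python sketch under loud assumptions, not the paper's implementation: no LLM or embedding model is called, `generate_embedding` and `annotate` are hypothetical stand-ins for the simulation prompt plus embedding step and for the annotation prompt, and the synthetic centroid vectors, noise level, and dimension are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
LABELS = ["positive", "neutral", "negative"]
DIM = 16

# Assumption: each label has a synthetic "centroid" embedding (a scaled unit vector).
CENTROIDS = {lab: 4.0 * np.eye(DIM)[k] for k, lab in enumerate(LABELS)}


def cossim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def generate_embedding(label):
    """Hypothetical stand-in for embedding one passage simulated from `label`."""
    return CENTROIDS[label] + 0.5 * rng.normal(size=DIM)


def annotate(embedding):
    """Hypothetical stand-in for the annotator: label with the most similar centroid."""
    return max(LABELS, key=lambda lab: cossim(embedding, CENTROIDS[lab]))


def tau(y_emb, label, n):
    """Average cosine similarity between y and n passages simulated from `label`."""
    return float(np.mean([cossim(y_emb, generate_embedding(label)) for _ in range(n)]))


def validation_test(y_emb, label, n=100, b=200, alpha=0.05):
    """Return (statistic, critical value, passed) for the validation test."""
    stat = tau(y_emb, label, n)
    # Bootstrap under the null: a passage simulated from `label` plays the role of y.
    null = [tau(generate_embedding(label), label, n) for _ in range(b)]
    critical = float(np.percentile(null, 100 * alpha))
    # Validity is rejected when the statistic falls below the alpha-th percentile.
    return stat, critical, bool(stat >= critical)


def backtranslation_accuracy(label, n=100):
    """Share of simulated passages annotated with the label they were generated from."""
    return float(np.mean([annotate(generate_embedding(label)) == label for _ in range(n)]))


y = CENTROIDS["negative"]               # toy "observed passage" that is clearly negative
label_hat = annotate(y)
print(label_hat, backtranslation_accuracy(label_hat))
print(validation_test(y, label_hat))    # test of the annotated label
print(validation_test(y, "positive"))   # test of a deliberately wrong label
```

In this synthetic setup the deliberately wrong label lands far in the left tail of the bootstrap distribution, while the annotated label does not. With real LLM calls, the outcome depends on the model, the prompts, and the embedding model, as discussed in Section 3.1.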
4.2 Experiment II: Measuring Detailed Sentiment

Applications in finance and economics often involve scoring texts on a detailed scale, e.g., ranging from one to five. In the second experiment, I continue to focus on sentiment, but expand the scale on which sentiment is measured to the following five labels: positive, mostly positive, neutral, mostly negative, and negative. The simulation and annotation prompts from the first experiment are maintained, with only slight adjustments to accommodate the new scale.

¹³ The Reddit conversation is available at https://www.reddit.com/r/Mortgages/comments/1k3tn81/comment/mo5t9d9/, retrieved January 3, 2026.

Table 2 reports the results. The LLM continues to rate the 10-K filing passage as negative, and all tests for this case pass. The results, both in terms of assigned label and test outcomes, are also identical to those from the simple sentiment experiment for the earnings call and Federal Reserve governor speech. In contrast, the passages from the news article and the Reddit comment are now classified as mostly negative, and these classifications cannot be considered valid because the annotation backtranslation property fails, with accuracy of just 5-7%. The test results thus indicate that sentiment annotations on a more granular scale are more often invalid. This result is consistent with human benchmark data, which typically involve more disagreement when the granularity of labels increases. For example, there is more disagreement about whether a passage is considered negative or mostly negative than about whether a passage is considered negative or neutral.

4.3 Experiment III: Measuring Clarity

Next, consider the problem of measuring the textual clarity of the passages using the labels clear and vague. While sentiment is a well-defined concept, clarity is more ambiguous and can be interpreted in various ways. For the annotation backtranslation property to be satisfied, it is therefore important to define the concept of clarity in the annotation and simulation prompts.¹⁴ Specifically, I define the annotation prompt as follows:

    TASK: Read the following passage and classify the clarity using one of the following labels: [Clear, Vague].
    A passage should be classified as 'Clear' if it is objective, has one clear interpretation, and explicitly states information rather than implying it.
    A passage should be classified as 'Vague' if it is subjective, is subject to multiple possible interpretations, or uses hedging words such as 'sort of', 'perhaps', and 'kind of'.
    PASSAGE: {text}

The simulation prompt is given analogously. For example, for the 10-K filing passage:¹⁵

    TASK: Write {n} arbitrary passages from a bank's annual 10-K filing. The passages should describe how the bank's business model and technological developments impacted income and performance for the quarter. {clarity_instructions}. The length of each passage should be around 90 words.

where clarity_instructions depends on the label as follows:

    if label == 'clear':
        clarity_instructions = "The passages should be written in clear language, i.e., they should be objective, have one clear interpretation, and explicitly state information rather than implying it."
    elif label == 'vague':
        clarity_instructions = "The passages should be written in vague language, i.e., they should be subjective, potentially carrying multiple interpretations, and/or use hedging words such as 'sort of', 'perhaps', and 'kind of'."

The test results for all passages are shown in Table 3. The passages represent both clear and vague texts as classified by the LLM. For all passages, the chosen simulation and annotation functions satisfy the annotation backtranslation and separation properties. The validation test, however, fails for three out of five passages: the 10-K filing passage (vague), the Federal Reserve governor speech extract (clear), and the Reddit comment (vague). Clarity, even when properly defined, is thus more difficult to classify than sentiment. The validation test passes for one passage classified as clear (the passage from the earnings call transcript) and one classified as vague (the news article passage). The LLM annotation of clarity is thus often invalid, and there is no pattern in the distribution of test outcomes across labels. These results underscore the importance of testing validity before using such annotations in downstream applications.

¹⁴ For example, the annotation backtranslation test fails for the 10-K filing passage with an accuracy of 65% when clarity is not defined.

¹⁵ Simulation prompts for the other passages follow a similar structure. All prompts are available in the online appendix.
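As an illustration of how the label-dependent simulation prompt for the clarity experiment could be assembled programmatically, the sketch below wraps the template and instructions quoted above in a small helper. The helper name `build_simulation_prompt`, the dictionary layout, and the choice of `n=100` are assumptions made for this example; only the prompt wording comes from the text.

```python
# Label-specific instructions, copied from the prompt definitions in the text.
CLARITY_INSTRUCTIONS = {
    "clear": (
        "The passages should be written in clear language, i.e., they should be "
        "objective, have one clear interpretation, and explicitly state "
        "information rather than implying it."
    ),
    "vague": (
        "The passages should be written in vague language, i.e., they should be "
        "subjective, potentially carrying multiple interpretations, and/or use "
        "hedging words such as 'sort of', 'perhaps', and 'kind of'."
    ),
}

# Simulation prompt template for the 10-K filing passage, as quoted in the text.
SIMULATION_TEMPLATE = (
    "TASK: Write {n} arbitrary passages from a bank's annual 10-K filing. "
    "The passages should describe how the bank's business model and "
    "technological developments impacted income and performance for the "
    "quarter. {clarity_instructions}. The length of each passage should be "
    "around 90 words."
)


def build_simulation_prompt(label: str, n: int) -> str:
    """Fill the simulation template for a given clarity label (hypothetical helper)."""
    key = label.strip().lower()
    if key not in CLARITY_INSTRUCTIONS:
        raise ValueError(f"unknown clarity label: {label!r}")
    # The template already places a period after the placeholder, so the
    # instruction's own trailing period is dropped to avoid a doubled period.
    instructions = CLARITY_INSTRUCTIONS[key].rstrip(".")
    return SIMULATION_TEMPLATE.format(n=n, clarity_instructions=instructions)


prompt = build_simulation_prompt("Vague", n=100)
print(prompt)
```

Rejecting unknown labels up front keeps a mistyped label from silently producing a prompt without any clarity definition, which the text identifies as a source of failed backtranslation tests.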
4.4 Experiment IV: Measuring Temporal Focus

LLMs are also often used to assess the temporal orientation of texts, e.g., to classify whether a text is forward-looking, focusing on the present, or backward-looking. Table 4 assesses the validity of LLM annotations of temporal focus in the five passages. The passages represent a mix of forward-looking (10-K filing and Reddit comment) and backward-looking texts (earnings call transcript, news article, and Federal Reserve governor speech), but none of them is classified as focusing on the present. The prerequisite properties are satisfied for all passages, and the validation tests pass for all but the Reddit comment. These results indicate that temporal focus is straightforward for the annotation function to classify, similar to simple sentiment.

A manual reading of the Reddit comment shows that the passage contains elements that would fit all three labels: “these are normal” is a statement about the present, “you got spoiled previously” is backward-looking, and “mid 5s is where rates will be” is a prediction for the future. It is therefore comforting that the validation test fails for this passage.

4.5 Experiment V: Measuring Topics

Finally, I consider the ability of the LLM to identify certain topics within the passages. For each passage, I identify a topic that is discussed within the text and one that is not discussed in the text. Specifically, topics that are discussed are chosen as “economic risks” for the 10-K filing passage, “technological developments” for the earnings call transcript passage, “crypto investing” for the news article passage, “financial crises” for the speech passage, and “interest rates” for the Reddit comment. The topics not discussed are chosen such that they are plausible topics that could very well have been present in these passages. Specifically, I use “investments” for the 10-K filing passage, “geopolitical risks” for the earnings call transcript passage, “interest rates” for the news article passage, “new technologies” for the speech passage, and “stock market” for the Reddit comment.
I then prompt the LLM to identify whether each of these topics is discussed in the passage using true/false labels. This exercise is different from the previous experiments in the sense that there is a correct and a wrong answer. Namely, the labels should be true for the true topics (true positive identification) and false for the false topics (true negative identification).
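The design of this exercise can be written down as a small lookup of expected labels. The sketch below is illustrative only: the topic pairs are taken from the text, while the function names and the example annotator output are hypothetical and do not reproduce the paper's reported results.

```python
# For each passage: (topic that is discussed, plausible topic that is not).
TOPICS = {
    "10-K filing": ("economic risks", "investments"),
    "earnings call transcript": ("technological developments", "geopolitical risks"),
    "news article": ("crypto investing", "interest rates"),
    "Federal Reserve governor speech": ("financial crises", "new technologies"),
    "Reddit comment": ("interest rates", "stock market"),
}


def expected_labels():
    """Map each (passage, topic) pair to the label a correct annotator returns."""
    expected = {}
    for passage, (discussed, not_discussed) in TOPICS.items():
        expected[(passage, discussed)] = True
        expected[(passage, not_discussed)] = False
    return expected


def identification_rates(annotations):
    """Return (true positive rate, true negative rate) for true/false annotations."""
    expected = expected_labels()
    true_pos = [annotations[k] is True for k, v in expected.items() if v]
    true_neg = [annotations[k] is False for k, v in expected.items() if not v]
    return sum(true_pos) / len(true_pos), sum(true_neg) / len(true_neg)


# Hypothetical annotator output: everything correct except one false topic.
annotations = dict(expected_labels())
annotations[("10-K filing", "investments")] = True
tp_rate, tn_rate = identification_rates(annotations)
print(tp_rate, tn_rate)
```

Because the correct answers are known here, these rates can be read alongside the validation test outcomes to see whether the test rejects the annotations that are actually wrong.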
Table 5 reports results from the true positive identification exercise. The results show that the LLM correctly identifies the topics for all passages, and the validation test passes for all but the earnings call transcript. Turning to the true negative identification reported in Table 6, most of the passages are correctly annotated with the false label and the validation test passes. Two exceptions are observed. For the Reddit comment, the label is false as expected, but the validation test does not pass. This result likely occurs because the chosen false topic (the stock market) is indirectly related to the actual topic discussed (interest rates). It is therefore, to some extent, a matter of interpretation whether the passage discusses the stock market. The second exception is the passage from the 10-K filing, for which the topic “investments” is incorrectly identified. Interestingly, the validation test fails with a statistic that is much lower than the critical value (0.55 vs 0.59). The test thus correctly identifies the wrong label.

4.6 Discussion

Table 7 provides an overview of the results for all experiments discussed so far. In the table, a green check mark indicates that the validation test passes in a setting where the annotation backtranslation and separation properties are satisfied, while the cross marks represent cases where validity is rejected. The cross mark is yellow if the prerequisite tests fail and red if the prerequisites are satisfied but the validation test fails. Overall, the test outcomes suggest that sentiment annotation is generally valid as long as the scale is simple and not too granular. Classifying the temporal focus and identifying topics is also often promising. However, scoring clarity even on a simple binary scale often fails. The results also suggest that test outcomes can vary significantly across applications: annotations of passages from the earnings call transcript and the news article often pass the validation tests, whereas annotations of the Reddit comment are often rejected.
Due to this case-dependency, researchers should validate their applications before interpreting LLM annotations or using them in downstream estimation problems.

Comparing the LLM annotations with manual readings of the passages suggests that the test is conservative. There are cases of LLM labels that are sensible, but for which the test fails. For example, the LLM identifies the topic “technological developments” in the passage from the earnings call transcript, but the validation test of this annotation fails. Reading the passage, it is clear that technological developments are a theme in the passage. However, in all cases where the validation test passes, the LLM annotation seems appropriate for the passage.

Finally, the experiments highlight the importance of explicitly defining concepts and context in the simulation and annotation prompts. Specifically for the simulation prompt, the validation test is more likely to pass if the prompt includes details on the content and linguistic characteristics of the original passage. Providing such details is acceptable as long as the prerequisite properties are satisfied.

4.7 Validations Across Model Sizes and Generations

All experiments presented thus far are based on Claude 4.5 Sonnet, which is a very large, state-of-the-art language model. This section repeats the sentiment experiments (on the simple and detailed scales) with different models varying in size, complexity, and release date. In addition to the baseline model, I consider the Claude 3 Haiku model and two Llama models of different sizes (70B and 8B parameters).¹⁶ These models are considered large, medium, and small language models, respectively. They also represent a different model generation than the baseline model, with release dates in March 2024 (Claude 3 Haiku), January 2024 (Llama 70B), and April 2024 (Llama 8B).

Table 8 shows the test results from validating annotations of sentiment for each model, experiment, and passage.¹⁷ In panel (a), sentiment is scored on the simple scale (positive, neutral, negative). Labels are identical across all models, but the annotations of the smaller models fail validation more often, predominantly due to failing prerequisite tests. As such, the largest and most recent LLM (Claude 4.5 Sonnet) is more reliable than previously released and smaller models.

In panel (b) of Table 8, results are shown across models for sentiment scoring on the detailed scale (positive, mostly positive, neutral, mostly negative, and negative). On this scale, there is not full agreement on labels across models.
For example, the 10-K filing passage is rated negative by the Claude 4.5 Sonnet and Llama 70B models, mostly negative by Claude 3 Haiku, and neutral by the Llama 8B model. Interestingly, the annotations of the outlier models fail the validation either due to failed prerequisites (Llama 8B) or a failed validation test (Claude 3 Haiku). Validation of the annotation of the Reddit comment fails across all models, and the models disagree on whether the passage is mostly negative (Claude 4.5 Sonnet) or neutral (all other models). These results emphasize the potential issues of interpreting LLM annotations of sentiment scored on a granular scale.

¹⁶ All models are implemented with a temperature of 1.0.

¹⁷ Detailed test results are available upon request.

5 Full-Scale Application

While the experiments presented in the previous section are useful for illustrating the approach, relevant applications involve the task of annotating a large set of passages rather than a few examples. This section illustrates how the validation test can be implemented in such settings.

5.1 Data

I apply the test to a well-known benchmark data set in language processing, namely the Financial Phrasebank data from Malo et al. (2014). This data set contains around 5000 sentences from financial newspaper articles written in English, for which the sentiment has been manually annotated on a simple positive-neutral-negative scale by the average of 5-8 human evaluators (mostly master’s students with majors in finance, accounting, and economics). Choosing this data allows me to compare the validation testing framework with the traditional method of human benchmarking.

5.2 Performance Evaluation Metrics

To evaluate the accuracy of LLM annotations using the validation testing framework, I compute the fraction of sentences that pass the validation test along with the tests of prerequisite properties. This measure of accuracy can then be directly compared with the human benchmark accuracy, computed as the fraction of sentences for which the LLM- and human-annotated labels coincide.
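The two accuracy measures defined above can be computed side by side from per-sentence results. The sketch below assumes each sentence is stored as a dictionary with hypothetical field names (`backtranslation_ok`, `separation_ok`, `validation_ok`, `llm_label`, `human_label`); the four toy records are invented purely to show the arithmetic.

```python
def validation_accuracy(results):
    """Fraction of sentences passing the validation test and both prerequisites."""
    if not results:
        raise ValueError("results must not be empty")
    passed = [
        r["backtranslation_ok"] and r["separation_ok"] and r["validation_ok"]
        for r in results
    ]
    return sum(passed) / len(passed)


def human_benchmark_accuracy(results):
    """Fraction of sentences where LLM and human labels coincide."""
    if not results:
        raise ValueError("results must not be empty")
    matches = [r["llm_label"] == r["human_label"] for r in results]
    return sum(matches) / len(matches)


# Invented toy data, not taken from the paper.
toy_results = [
    {"llm_label": "positive", "human_label": "positive",
     "backtranslation_ok": True, "separation_ok": True, "validation_ok": True},
    {"llm_label": "positive", "human_label": "neutral",
     "backtranslation_ok": True, "separation_ok": True, "validation_ok": True},
    {"llm_label": "negative", "human_label": "negative",
     "backtranslation_ok": True, "separation_ok": True, "validation_ok": False},
    {"llm_label": "neutral", "human_label": "neutral",
     "backtranslation_ok": False, "separation_ok": True, "validation_ok": True},
]
print(validation_accuracy(toy_results))       # share passing all three tests
print(human_benchmark_accuracy(toy_results))  # share matching human labels
```

Both functions score the same set of LLM annotations, so any gap between the two numbers reflects the evaluation method rather than the labels themselves.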
It is important to note that there is no ground truth to the question of which method of evaluating accuracy is more correct. Deviations between conclusions obtained by the validation test and by comparing LLM annotations to human labels can therefore be evidence of the failure of either method, or potentially of both methods. Which conclusion to trust lies with the researcher’s definition of truth: Is truth defined by the average of human labels, or by the consistency between LLM simulation and annotation?

5.3 Annotation and Simulation Functions

Similar to the experiments presented in Section 4, I use the Claude 4.5 Sonnet model¹⁸ in both the annotation and simulation functions. The annotation and simulation prompts are described below.

¹⁸ The model is implemented with a temperature of 1.0.

Since the study is focused only on financial and economic domains, the human annotators were asked to consider the sentences from the view point of an investor only; i.e. whether the news may have positive, negative or neutral influence on the stock price. As a result, sentences which have a sentiment that is not relevant from an economic or financial perspective are considered neutral. For LLM validation, these details are provided in the simulation and annotation prompts.

The annotation prompt is as follows, excluding instructions on returning output in a JSON object:

    TASK: Read the following sentence and classify the sentiment using one of the labels: [Positive, Neutral, Negative].
    When classifying the sentence, determine sentiment from the view point of an investor only; i.e. whether the news may have positive, negative or neutral influence on the stock price. Sentences which have a sentiment that is not relevant from an economic or financial perspective are considered neutral.
    SENTENCE: {text}

The experiments showed the importance of including characteristics of the original texts in the simulation prompt. Since the application involves running the validation test across two samples each consisting of 100 sentences, the instructions have to be automated. I achieve this by including the original sentence as an input to the simulation prompt (example) as follows:

    TASK: Write {n} arbitrary sentences from financial/economic news paper articles with sentiment characterized as {label}, from an investor's point of view. {additional_instructions} The length of each sentence should be around {length} tokens, but make sure the sentence is complete. The sentences should be similar in terms of style and topic (not necessarily in terms of sentiment) to the following sentence: {example}.

The parameter length is the word count of example. The input additional_instructions takes a value only if the label is neutral, to provide additional guidance on the definition of this label:

    if label == 'neutral':
        additional_instructions = "A neutral sentence is neutral in the sense that it not expected to have any impact on stock prices. A sentence not related to finance or economics is therefore considered neutral."
    else:
        additional_instructions = None

The last sentence of the prompt instructs the model to generate texts that are similar in style and topic to example. This inclusion increases semantic similarity with the original text, and thus the likelihood of passing the validation test. Critically, the separation property test prevents the generated text from being excessively dependent on the original content by ensuring that texts generated from different labels are semantically distinct.

5.4 Results

Table 9 shows accuracy measures computed using the validation test as well as using the human benchmark. Note that these accuracy measures involve the same set of LLM annotations; they only differ by the way in which accuracy is evaluated. The validation testing framework suggests that the LLM annotations are accurate in 68% (low-agreement sentences) to 82% of cases. This range is narrower than the accuracy suggested by human benchmarking, which ranges between 65% and 92%.
The fact that accuracy under the validation testing framework is lower for the low-agreement data suggests that these sentences are more difficult cases. However, according to the validation testing framework, the difference between the two data sets is not as stark as suggested by human benchmarking.

The table also shows accuracy across the LLM-annotated labels. For both accuracy evaluation frameworks, the sentences labeled as positive by the LLM are the least accurate. For the validity test, the sentences labeled as negative and neutral are associated with similar accuracy. However, according to the human benchmarking method, accuracy is highest (and equal to 100% regardless of the level of human agreement in the data sets) for the neutral sentences. This result likely reflects the tendency of humans to classify uncertain cases as neutral, which implies that there are no cases where the LLM classifies a sentence as neutral and human annotators do not.

Table 10 shows in detail how validation test accuracy distributes across all combinations of LLM and human labels. The table reports the number of sentences in each combination along with the associated accuracy in parentheses. While the LLM- and human-generated labels are identical for a majority of the sample, there are 7 and 29 sentences in the full-agreement and low-agreement data sets, respectively, in which the LLM disagrees with the human annotation. These sentences are all labeled with neutral sentiment by humans, and a majority is labeled with positive sentiment by the LLM. These labels would be considered incorrect if relying on human benchmarking. However, the validity test passes for around half of the positive ∼ neutral (LLM label ∼ human label) sentences and for all of the negative ∼ neutral sentences.

To understand such cases better, Table 11 shows selected sentences where the LLM and human labels disagree, but the validation test and its prerequisites pass. These examples emphasize the lack of a ground truth. Even though all or a majority of the human annotators labeled these sentences as neutral, most if not all of these sentences may impact the future stock prices of the firms involved, implying that they may not be neutral. The validation testing framework validates the LLM annotations that capture such nuances.
6 Limitations

Despite the promising results demonstrated by the proposed validation framework, several limitations warrant discussion.

A fundamental limitation of the approach is that validating an annotation function requires the specification of an associated simulation function. When insufficient details are provided in the simulation prompt, the test may falsely reject validity, which may explain the conservative nature of the test as observed in the experimental results.

The simulation prompt is central to how validity is defined within the framework: without a well-formulated function that specifies the data generating process in terms of a label, it becomes conceptually challenging to assess whether that label is correct. If AI is held to the standard of human intelligence, this requirement seems unfairly strict. For humans, the recognition ability used for annotation and the production ability used to generate texts are distinct skills, with recognition typically being much easier.¹⁹ For example, literary critics can identify excellent prose without being novelists, and most people can identify and enjoy comedy without being comedians. However, LLMs use the same neural network, parameters, and learned representations for both classification and generation. Unlike in humans, there are no separate cognitive systems for recognition and production. If the model has learned representations that distinguish negative from positive sentiment, those representations should be available during generation. It is therefore reasonable to require that an LLM can simulate text based on a label to be considered a valid annotator. Drawing an analogy to traditional econometrics, validating an annotation without a simulation function would correspond to an attempt to evaluate the properties of an ordinary least squares estimator of a coefficient β without assuming a model specification that relates β to the data.

It is, however, important to emphasize that due to the reliance on a simulation function, the framework cannot be used to validate all types of annotations because LLMs are constrained in certain areas in terms of what they will generate.
For example, an LLM might correctly identify hate speech, as suggested by Huang et al. (2023) and Zhu et al. (2023), but refuse to generate it. In addition, a model heavily fine-tuned for classification might be worse at controlled generation, creating a gap between valid simulation and annotation that may lead to false rejection of validity.

¹⁹ Although Richard Feynman famously remarked, “what I cannot create, I do not understand.”

Another limitation is the reliance on quantifying semantic similarity, implemented here through cosine similarity between vector representations. Consequently, the quality of the validation test is inherently bound by the underlying word embedding model. Specifically, the employed embedding model should be context-aware to correctly represent terms that have different meanings in different contexts. Using the same embedding model for all steps of the validation framework mitigates the risk of drawing wrong conclusions from wrong vector representations, as an erroneous embedding model is unlikely to pass both the validation test and the test of the separation property. For example, consider two passages of text, y_1 and y_2, given as follows:

    y_1: “Their aggressive depreciation strategy reduced taxable income substantially.”
    y_2: “The stock price declined after disappointing sales.”

The first passage has a positive sentiment (a successful business strategy), while the second passage is clearly negative. However, a simple embedding model, e.g., based on a word2vec algorithm using everyday English language, may not capture these contextual nuances. Instead, it might assign excessive weight to individual words that appear negative when taken out of context (such as "aggressive," "depreciation," "reduced," "taxable"). Such a model may therefore estimate a falsely high semantic similarity between y_1 and y_2, and thus pass the validation test. At the same time, the model would repeat similar mistakes in the test of the separation property, assigning falsely high similarity between distinct passages, which would fail this prerequisite.

Finally, the framework incurs substantial computational costs, as it requires multiple LLM calls to generate and annotate multiple passages of text. While this concern will likely diminish as models become more efficient and less expensive to run, it represents a current practical limitation.
However, it is worth noting that compared to the human benchmarking approach, the computational cost remains negligible, offering a significant advantage in terms of objective validation and scalability.
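The cosine similarity underlying the semantic similarity measure discussed in this section can be sketched as follows. The helper names and the three-dimensional toy vectors are hypothetical, and averaging similarities across simulated texts is an assumption made here for illustration; it is not a statement of the paper's exact test statistic. In practice the vectors would come from a context-aware embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_similarity(original, simulated):
    """Average similarity between one original text and several simulated texts."""
    if not simulated:
        raise ValueError("need at least one simulated embedding")
    return sum(cosine_similarity(original, s) for s in simulated) / len(simulated)


# Toy vectors standing in for embeddings of an original and simulated passages.
original = [1.0, 0.0, 0.0]
simulated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(mean_similarity(original, simulated))
```

Because the same function is used for both the validation and separation comparisons, a systematically inflated similarity would show up in both tests, which is the mitigation argument made above.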
7 Conclusion

This paper proposes a framework to assess the validity of LLM-generated measurements when reliable benchmarks are unavailable. The framework establishes validity based on whether an LLM can simulate texts from annotated labels that are semantically similar to the original passages, requiring that the annotation and simulation functions satisfy two key properties: annotation backtranslation and separation.

Through systematic experiments on diverse financial and economic texts, I demonstrate that the framework effectively distinguishes between reliable and unreliable LLM annotations. Simple sentiment classification, temporal focus identification, and topic detection generally pass validation, while more nuanced tasks like granular sentiment scoring and clarity assessment often fail. These results align with intuition about task difficulty and provide empirical guidance for practitioners deploying LLMs in research applications.

The application to the Financial Phrasebank data set shows that the validation framework can serve as a practical alternative to human benchmarking. Importantly, the framework identifies cases where LLM annotations capture economic meaning that human evaluators may miss, particularly for neutral-labeled sentences that contain information relevant to stock prices.

While the approach requires careful specification of simulation prompts and incurs computational costs, it offers significant advantages in objectivity, scalability, and cost-effectiveness compared to human validation. As LLMs become increasingly central to empirical research in economics and finance, rigorous validation methods like the one proposed here are essential for ensuring the credibility and reproducibility of research findings. The framework provides researchers with a systematic tool to assess whether their specific combination of model and prompting strategy produces reliable measurements for downstream analysis.
References

M. Bauer, D. Huber, E. Offner, M. Renkel, and O. Wilms. Corporate Green Pledges. SSRN Working Paper, 2024.

C. Bertsch, I. Hull, R. L. Lumsdaine, and X. Zhang. Central bank mandates and monetary policy stances: Through the lens of Federal Reserve speeches. Journal of Econometrics, 249:105948, 2025.

Y. Chen, B. T. Kelly, and D. Xiu. Expected Returns and Large Language Models. SSRN Working Paper, 2022.

T. R. Cook, A. L. Hansen, S. Kazinnik, and P. McAdam. Under Pressure: Strategic Signaling in Bank Earnings Calls. Available at SSRN 5382397, 2025.

H. Du, R. Li, and E. Gehringer. Objective Metrics for Evaluating Large Language Models Using External Data Sources. 2025.

J. Fan, Q. Liu, Y. Song, and Z. Wang. Measuring Misinformation in Financial Markets. Available at SSRN 4922648, 2024.

A. L. Hansen and S. Kazinnik. Can ChatGPT Decipher Fedspeak? SSRN Working Paper, 2024.

F. Huang, H. Kwak, and J. An. Is ChatGPT better than Human Annotators? Potential and Limitations of ChatGPT in Explaining Implicit Hate Speech. In Companion Proceedings of the ACM Web Conference 2023, pages 294–297, April 2023.

M. Jha, J. Qian, M. Weber, and B. Yang. ChatGPT and Corporate Policies. NBER Working Paper, 2024.

K. Kirtac and G. Germano. Sentiment Trading with Large Language Models. Finance Research Letters, 62:105227, April 2024. URL http://dx.doi.org/10.1016/j.frl.2024.105227.

X. Li, P. Yu, C. Zhou, T. Schick, O. Levy, L. Zettlemoyer, J. Weston, and M. Lewis. Self-Alignment with Instruction Backtranslation, 2024. URL https://arxiv.org/abs/2308.06259.
T. Liu and Y. Shi. News Sentiment and Investment Risk Management: Innovative Evidence From the Large Language Models. Economics Letters, 247:112124, 2025.

J. Ludwig, S. Mullainathan, and A. Rambachan. Large Language Models: An Applied Econometric Framework. NBER Working Paper, 33344, 2025.

P. Malo, A. Sinha, P. Korhonen, J. Wallenius, and P. Takala. Good Debt or Bad Debt: Detecting Semantic Orientations in Economic Texts. Journal of the Association for Information Science and Technology, 65(4):782–796, 2014.

N. Mirzakhmedova, M. Gohsen, C. H. Chang, and B. Stein. Are Large Language Models Reliable Argument Quality Annotators?, 2024. URL https://arxiv.org/abs/2404.09696.

A. Patwardhan, V. Vaidya, and A. Kundu. Automated Consistency Analysis of LLMs, 2025. URL https://arxiv.org/abs/2502.07036.

M. V. Reiss. Testing the Reliability of ChatGPT for Text Annotation and Classification: A Cautionary Remark, 2023. URL https://arxiv.org/abs/2304.11085.

R. Sennrich, B. Haddow, and A. Birch. Improving Neural Machine Translation Models with Monolingual Data. In K. Erk and N. A. Smith, editors, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany, Aug 2016. Association for Computational Linguistics.

A. Shah, A. Hiray, P. Shah, A. Banerjee, A. Singh, D. Eidnani, S. Chava, B. Chaudhury, and S. Chava. Numerical Claim Detection in Finance: A New Financial Dataset, Weak-Supervision Model, and Market Analysis, 2024. URL https://arxiv.org/abs/2402.11728.

A. H. Shapiro, M. Sudhof, and D. J. Wilson. Measuring News Sentiment. Journal of Econometrics, 228(2):221–243, 2022.

N. Sharma, N. Agarwal, and K. Sirts. Towards Consistent Detection of Cognitive Distortions: LLM-Based Annotation and Dataset-Agnostic Evaluation, 2025. URL https://arxiv.org/abs/2511.01482.
S. Singh, Y. Nan, A. Wang, D. D’Souza, S. Kapoor, A. Üstün, S. Koyejo, Y. Deng, S. Longpre, N. A. Smith, B. Ermis, M. Fadaee, and S. Hooker. The Leaderboard Illusion, 2025. URL https://arxiv.org/abs/2504.20879.

P. Törnberg. ChatGPT-4 Outperforms Experts and Crowd Workers in Annotating Political Twitter Messages with Zero-Shot Learning, 2023. URL https://arxiv.org/abs/2304.06588.

V. Veselovsky, M. H. Ribeiro, and R. West. Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks, 2023. URL https://arxiv.org/abs/2306.07899.

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-Consistency Improves Chain of Thought Reasoning in Language Models, 2023. URL https://arxiv.org/abs/2203.11171.

Y. Zhu, P. Zhang, E.-U. Haq, P. Hui, and G. Tyson. Can ChatGPT Reproduce Human-Generated Labels? A Study of Social Computing Tasks, 2023. URL https://arxiv.org/abs/2304.10145.
Table 1: Validating Classification of Simple Sentiment

The table shows results from tests of the prerequisite properties (annotation backtranslation and separation) and the validation test of the ability of the Claude 4.5 Sonnet model to classify sentiment among the labels {Positive, Neutral, Negative}. The annotation backtranslation property rejection threshold is set as 90% accuracy. All other critical values are based on a 5% significance level. Test statistics and bootstrap procedures are implemented using 100 simulated trajectories.

(a) 10-K Filing
Test                         Label             Statistic   Critical Value   Conclusion
Annotation Backtranslation   Negative          100%        90%              Pass
Separation                   Positive          0.577       0.600            Pass
                             Neutral           0.596       0.621            Pass
Validation                   Negative          0.623       0.620            Pass

(b) Earnings Call Transcript
Test                         Label             Statistic   Critical Value   Conclusion
Annotation Backtranslation   Positive          100%        90%              Pass
Separation                   Neutral           0.626       0.647            Pass
                             Negative          0.600       0.614            Pass
Validation                   Positive          0.626       0.609            Pass

(c) News Article
Test                         Label             Statistic   Critical Value   Conclusion
Annotation Backtranslation   Negative          100%        90%              Pass
Separation                   Positive          0.569       0.583            Pass
                             Neutral           0.560       0.578            Pass
Validation                   Negative          0.605       0.591            Pass

(d) Federal Reserve Governor Speech
Test                         Label             Statistic   Critical Value   Conclusion
Annotation Backtranslation   Negative          100%        90%              Pass
Separation                   Positive          0.572       0.594            Pass
                             Neutral           0.579       0.615            Pass
Validation                   Negative          0.611       0.587            Pass

(e) Reddit Comment
Test                         Label             Statistic   Critical Value   Conclusion
Annotation Backtranslation   Negative          99%         90%              Pass
Separation                   Positive          0.611       0.639            Pass
                             Neutral           0.612       0.647            Pass
Validation                   Negative          0.580       0.575            Pass
Table 2: Validating Classification of Detailed Sentiment

The table shows results from tests of the prerequisite properties (annotation backtranslation and separation) and the validation test of the ability of the Claude 4.5 Sonnet model to classify sentiment among the labels {Positive, Mostly Positive, Neutral, Mostly Negative, Negative}. The annotation backtranslation property rejection threshold is set as 90% accuracy. All other critical values are based on a 5% significance level. Test statistics and bootstrap procedures are implemented using 100 simulated trajectories.

(a) 10-K Filing
Test                         Label             Statistic   Critical Value   Conclusion
Annotation Backtranslation   Negative          99%         90%              Pass
Separation                   Positive          0.566       0.590            Pass
                             Mostly Positive   0.568       0.596            Pass
                             Neutral           0.604       0.640            Pass
                             Mostly Negative   0.639       0.675            Pass
Validation                   Negative          0.620       0.606            Pass

(b) Earnings Call Transcript
Test                         Label             Statistic   Critical Value   Conclusion
Annotation Backtranslation   Positive          100%        90%              Pass
Separation                   Mostly Positive   0.639       0.662            Pass
                             Neutral           0.622       0.643            Pass
                             Mostly Negative   0.592       0.619            Pass
                             Negative          0.587       0.618            Pass
Validation                   Positive          0.630       0.613            Pass

(c) News Article
Test                         Label             Statistic   Critical Value   Conclusion
Annotation Backtranslation   Mostly Negative   7%          90%              Fail
Separation                   Positive          0.558       0.576            Pass
                             Mostly Positive   0.565       0.578            Pass
                             Neutral           0.567       0.582            Pass
                             Negative          0.621       0.639            Pass
Validation                   Mostly Negative   0.605       0.582            Pass

(d) Federal Reserve Governor Speech
Test                         Label             Statistic   Critical Value   Conclusion
Annotation Backtranslation   Negative          97%         90%              Pass
Separation                   Positive          0.599       0.632            Pass
                             Mostly Positive   0.602       0.624            Pass
                             Neutral           0.599       0.633            Pass
                             Mostly Negative   0.627       0.668            Pass
Validation                   Negative          0.625       0.611            Pass

(e) Reddit Comment
Test                         Label             Statistic   Critical Value   Conclusion
Annotation Backtranslation   Mostly Negative   5%          90%              Fail
Separation                   Positive          0.622       0.630            Pass
                             Mostly Positive   0.605       0.631            Pass
                             Neutral           0.591       0.623            Pass
                             Negative          0.619       0.640            Pass
Validation                   Mostly Negative   0.584       0.584            Fail
Table 3: Validating Classification of Clarity

The table shows results from tests of the prerequisite properties (annotation backtranslation and separation) and the validation test of the ability of the Claude 4.5 Sonnet model to classify textual clarity among the labels {Clear, Vague}. The annotation backtranslation property rejection threshold is set as 90% accuracy. All other critical values are based on a 5% significance level. Test statistics and bootstrap procedures are implemented using 100 simulated trajectories.

(a) 10-K Filing

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | Vague | 100% | 90% | Pass |
| Separation | Clear | 0.601 | 0.611 | Pass |
| Validation | Vague | 0.550 | 0.599 | Fail |

(b) Earnings Call Transcript

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | Clear | 100% | 90% | Pass |
| Separation | Vague | 0.597 | 0.612 | Pass |
| Validation | Clear | 0.599 | 0.598 | Pass |

(c) News Article

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | Vague | 100% | 90% | Pass |
| Separation | Clear | 0.543 | 0.569 | Pass |
| Validation | Vague | 0.582 | 0.574 | Pass |

(d) Federal Reserve Governor Speech

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | Clear | 100% | 90% | Pass |
| Separation | Vague | 0.602 | 0.623 | Pass |
| Validation | Clear | 0.584 | 0.589 | Fail |

(e) Reddit Comment

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | Vague | 100% | 90% | Pass |
| Separation | Clear | 0.584 | 0.601 | Pass |
| Validation | Vague | 0.566 | 0.579 | Fail |
Table 4: Validating Classification of Temporal Focus

The table shows results from tests of the prerequisite properties (annotation backtranslation and separation) and the validation test of the ability of the Claude 4.5 Sonnet model to classify the temporal focus among the labels {Backward-Looking, Focusing On The Present, Forward-Looking}. The annotation backtranslation property rejection threshold is set as 90% accuracy. All other critical values are based on a 5% significance level. Test statistics and bootstrap procedures are implemented using 100 simulated trajectories.

(a) 10-K Filing

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | Forward-Looking | 100% | 90% | Pass |
| Separation | Focusing On The Present | 0.605 | 0.632 | Pass |
| Separation | Backward-Looking | 0.592 | 0.625 | Pass |
| Validation | Forward-Looking | 0.611 | 0.607 | Pass |

(b) Earnings Call Transcript

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | Backward-Looking | 100% | 90% | Pass |
| Separation | Forward-Looking | 0.605 | 0.640 | Pass |
| Separation | Focusing On The Present | 0.618 | 0.640 | Pass |
| Validation | Backward-Looking | 0.610 | 0.609 | Pass |

(c) News Article

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | Backward-Looking | 100% | 90% | Pass |
| Separation | Forward-Looking | 0.555 | 0.572 | Pass |
| Separation | Focusing On The Present | 0.557 | 0.582 | Pass |
| Validation | Backward-Looking | 0.622 | 0.587 | Pass |

(d) Federal Reserve Governor Speech

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | Backward-Looking | 100% | 90% | Pass |
| Separation | Forward-Looking | 0.587 | 0.617 | Pass |
| Separation | Focusing On The Present | 0.592 | 0.616 | Pass |
| Validation | Backward-Looking | 0.592 | 0.581 | Pass |

(e) Reddit Comment

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | Forward-Looking | 99% | 90% | Pass |
| Separation | Focusing On The Present | 0.590 | 0.601 | Pass |
| Separation | Backward-Looking | 0.577 | 0.614 | Pass |
| Validation | Forward-Looking | 0.565 | 0.594 | Fail |
Table 5: Validating True Positive Identification of Topic

The table shows results from tests of the prerequisite properties (annotation backtranslation and separation) and the validation test of the ability of the Claude 4.5 Sonnet model to identify a specific topic within a passage. Topics are chosen such that they are present in the passages. Specifically, topics are “economic risks” for the 10-K filing passage, “technological developments” for the earnings call transcript passage, “crypto investing” for the news article passage, “financial crises” for the speech passage, and “interest rates” for the Reddit comment. Possible labels are {True, False}. The annotation backtranslation property rejection threshold is set as 90% accuracy. All other critical values are based on a 5% significance level. Test statistics and bootstrap procedures are implemented using 100 simulated trajectories.

(a) 10-K Filing

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | True | 96% | 90% | Pass |
| Separation | False | 0.554 | 0.566 | Pass |
| Validation | True | 0.606 | 0.585 | Pass |

(b) Earnings Call Transcript

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | True | 100% | 90% | Pass |
| Separation | False | 0.561 | 0.577 | Pass |
| Validation | True | 0.568 | 0.586 | Fail |

(c) News Article

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | True | 94% | 90% | Pass |
| Separation | False | 0.530 | 0.545 | Pass |
| Validation | True | 0.575 | 0.558 | Pass |

(d) Federal Reserve Governor Speech

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | True | 100% | 90% | Pass |
| Separation | False | 0.585 | 0.600 | Pass |
| Validation | True | 0.603 | 0.600 | Pass |

(e) Reddit Comment

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | True | 100% | 90% | Pass |
| Separation | False | 0.539 | 0.557 | Pass |
| Validation | True | 0.583 | 0.572 | Pass |
Table 6: Validating True Negative Identification of Topic

The table shows results from tests of the prerequisite properties (annotation backtranslation and separation) and the validation test of the ability of the Claude 4.5 Sonnet model to identify a specific topic within a passage. Topics are chosen such that they are not present in the passages. Specifically, topics are “investments” for the 10-K filing passage, “geopolitical risks” for the earnings call transcript passage, “interest rates” for the news article passage, “new technologies” for the speech passage, and “stock market” for the Reddit comment. Possible labels are {True, False}. The annotation backtranslation property rejection threshold is set as 90% accuracy. All other critical values are based on a 5% significance level. Test statistics and bootstrap procedures are implemented using 100 simulated trajectories.

(a) 10-K Filing

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | True | 100% | 90% | Pass |
| Separation | False | 0.554 | 0.589 | Pass |
| Validation | True | 0.545 | 0.590 | Fail |

(b) Earnings Call Transcript

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | False | 100% | 90% | Pass |
| Separation | True | 0.544 | 0.578 | Pass |
| Validation | False | 0.619 | 0.586 | Pass |

(c) News Article

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | False | 100% | 90% | Pass |
| Separation | True | 0.524 | 0.530 | Pass |
| Validation | False | 0.526 | 0.521 | Pass |

(d) Federal Reserve Governor Speech

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | False | 98% | 90% | Pass |
| Separation | True | 0.573 | 0.585 | Pass |
| Validation | False | 0.591 | 0.585 | Pass |

(e) Reddit Comment

| Test | Label | Statistic | Critical Value | Conclusion |
|---|---|---|---|---|
| Annotation Backtranslation | False | 100% | 90% | Pass |
| Separation | True | 0.545 | 0.557 | Pass |
| Validation | False | 0.533 | 0.537 | Fail |
Table 7: Overview of Results

The table shows an overview of all validation results reported in detail in Tables 1-6. A green check mark represents cases where both the prerequisite tests (annotation backtranslation and separation) and the validation test pass. A red cross mark represents cases where the prerequisite tests pass, but the validation test fails. A yellow cross mark represents cases where validation fails because one or both prerequisite tests fail. All tests are conducted using the Claude 4.5 Sonnet model.

| | Sentiment | Granular Sentiment | Clarity | Temporal Focus | True Positive Topic | True Negative Topic |
|---|---|---|---|---|---|---|
| 10-K Filing | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ |
| Earnings Call Transcript | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ |
| News Article | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ |
| Federal Reserve Governor Speech | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ |
| Reddit Comment | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
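The mapping from the three test conclusions to the overview marks can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the comparison directions (backtranslation accuracy at or above the 90% threshold, separation statistic below its critical value, validation statistic strictly above its critical value) are inferred from the rows reported in Tables 1-6, and the names `Row`, `separation_passes`, `validation_passes`, and `overview_mark` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One reported test row: a statistic and its critical value."""
    statistic: float
    critical_value: float


def backtranslation_passes(accuracy: float, threshold: float = 0.90) -> bool:
    # Assumption: the prerequisite holds when accuracy reaches the 90% threshold.
    return accuracy >= threshold


def separation_passes(row: Row) -> bool:
    # Assumption inferred from the tables: every "Pass" row has statistic < critical value.
    return row.statistic < row.critical_value


def validation_passes(row: Row) -> bool:
    # Assumption inferred from the tables: "Pass" rows have statistic > critical value;
    # the one reported tie (0.584 vs 0.584) is labelled "Fail", hence the strict inequality.
    return row.statistic > row.critical_value


def overview_mark(backtranslation_accuracy: float,
                  separation_rows: list[Row],
                  validation_row: Row) -> str:
    """Combine the test conclusions into the mark described in the Table 7 caption."""
    prerequisites_ok = (
        backtranslation_passes(backtranslation_accuracy)
        and all(separation_passes(r) for r in separation_rows)
    )
    if not prerequisites_ok:
        return "yellow cross"  # one or both prerequisite tests fail
    return "green check" if validation_passes(validation_row) else "red cross"


# Table 1(a), 10-K filing, simple sentiment
simple_10k = overview_mark(1.00, [Row(0.577, 0.600), Row(0.596, 0.621)], Row(0.623, 0.620))
# Table 3(a), 10-K filing, clarity
clarity_10k = overview_mark(1.00, [Row(0.601, 0.611)], Row(0.550, 0.599))
# Table 2(c), news article, detailed sentiment (backtranslation accuracy 7%)
detailed_news = overview_mark(
    0.07,
    [Row(0.558, 0.576), Row(0.565, 0.578), Row(0.567, 0.582), Row(0.621, 0.639)],
    Row(0.605, 0.582),
)
print(simple_10k, clarity_10k, detailed_news)
```

Under these assumptions, a failed prerequisite takes precedence over the validation conclusion, which is why the news article case is a cross in the overview even though its validation row reports a pass.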
Table 8: Results Overview Across Models

The table shows an overview of validation results for annotations of (a) simple sentiment ({Positive, Neutral, Negative}) and (b) detailed sentiment ({Positive, Mostly Positive, Neutral, Mostly Negative, Negative}). Results are reported across different large language models. A green check mark represents cases where both the prerequisite tests (annotation backtranslation and separation) and the validation test pass. A red cross mark represents cases where the prerequisite tests pass, but the validation test fails. A yellow cross mark represents cases where validation fails because one or both prerequisite tests fail.

(a) Simple Sentiment

| | Claude 4.5 Sonnet | Claude 3 Haiku | Llama 70B | Llama 8B |
|---|---|---|---|---|
| Model Size | Very Large | Large | Medium | Small |
| 10-K Filing | Negative (✓) | Negative (✗) | Negative (✓) | Negative (✓) |
| Earnings Call Transcript | Positive (✓) | Positive (✓) | Positive (✓) | Positive (✓) |
| News Article | Negative (✓) | Negative (✓) | Negative (✓) | Negative (✓) |
| Federal Reserve Governor Speech | Negative (✓) | Negative (✓) | Negative (✗) | Negative (✗) |
| Reddit Comment | Negative (✓) | Neutral (✗) | Neutral (✗) | Neutral (✗) |

(b) Detailed Sentiment

| | Claude 4.5 Sonnet | Claude 3 Haiku | Llama 70B | Llama 8B |
|---|---|---|---|---|
| Model Size | Very Large | Large | Medium | Small |
| 10-K Filing | Negative (✓) | Mostly Negative (✗) | Negative (✓) | Neutral (✗) |
| Earnings Call Transcript | Positive (✓) | Positive (✗) | Positive (✓) | Mostly Positive (✗) |
| News Article | Mostly Negative (✗) | Mostly Negative (✓) | Neutral (✗) | Mostly Negative (✓) |
| Federal Reserve Governor Speech | Negative (✓) | Mostly Negative (✗) | Negative (✗) | Mostly Negative (✗) |
| Reddit Comment | Mostly Negative (✗) | Neutral (✗) | Neutral (✗) | Neutral (✗) |
Table 9: Accuracy of Sentiment Scoring of Financial Phrasebank Data

The table shows the accuracy of sentiment ({Positive, Neutral, Negative}) scores generated by the Claude 4.5 Sonnet model. Accuracy is determined using the LLM validation test and by comparing to the human benchmark provided in the Financial Phrasebank data set. Accuracy is computed for 100 sentences drawn randomly from the set of sentences with (a) 100% agreement among human annotations and (b) 50-66% agreement. Sentences for which the prerequisite tests fail are excluded. The table shows both overall accuracy and the accuracy computed separately for each LLM-generated label.

(a) Sentences with 100% Human Agreement

| | Accuracy, Validity Test | Accuracy, Human Benchmarking |
|---|---|---|
| Overall | 82.14% | 91.67% |
| By LLM annotation: Positive | 65.52% | 82.76% |
| By LLM annotation: Neutral | 89.74% | 100.00% |
| By LLM annotation: Negative | 93.75% | 87.50% |

(b) Sentences with 50-66% Human Agreement

| | Accuracy, Validity Test | Accuracy, Human Benchmarking |
|---|---|---|
| Overall | 68.29% | 64.63% |
| By LLM annotation: Positive | 53.85% | 50.00% |
| By LLM annotation: Neutral | 100.00% | 100.00% |
| By LLM annotation: Negative | 87.50% | 81.25% |
Table 10: Confusion Matrix for Sentiment Scoring of Financial Phrasebank Data

The table shows the distribution of sentences across labels generated by humans (provided by the Financial Phrasebank data set) and an LLM (Claude 4.5 Sonnet model). Numbers in parentheses are the fractions of sentences that pass the validation test. The data set consists of 100 sentences drawn randomly from the set of sentences with (a) 100% agreement among human annotations and (b) 50-66% agreement. Sentences for which the prerequisite tests fail are excluded.

(a) Sentences with 100% Human Agreement

| LLM annotation \ Human annotation | Positive | Neutral | Negative | Total |
|---|---|---|---|---|
| Positive | 24 (66.67%) | 5 (60.00%) | 0 | 29 |
| Neutral | 0 | 39 (89.74%) | 0 | 39 |
| Negative | 0 | 2 (100.00%) | 14 (92.86%) | 16 |
| Total | 24 | 46 | 14 | 84 |

(b) Sentences with 50-66% Human Agreement

| LLM annotation \ Human annotation | Positive | Neutral | Negative | Total |
|---|---|---|---|---|
| Positive | 26 (57.69%) | 26 (50.00%) | 0 | 52 |
| Neutral | 0 | 14 (100.00%) | 0 | 14 |
| Negative | 0 | 3 (100.00%) | 13 (84.62%) | 16 |
| Total | 26 | 43 | 13 | 82 |
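The accuracies in Table 9 appear to follow directly from the cell counts in Table 10, and the short sketch below checks that arithmetic. It rests on two assumptions that are a reading of the tables rather than something the tables state: the number of sentences passing validation in each cell is recovered by rounding count × reported fraction, and "human benchmarking" accuracy is the share of LLM labels that match the human label. The names `PANEL_A`, `PANEL_B`, and `accuracies` are hypothetical.

```python
# Cells are keyed as panel[llm_label][human_label] = (count, fraction passing validation).
# Values are transcribed from Table 10; empty (zero-count) cells are omitted.
PANEL_A = {
    "Positive": {"Positive": (24, 0.6667), "Neutral": (5, 0.6000)},
    "Neutral": {"Neutral": (39, 0.8974)},
    "Negative": {"Neutral": (2, 1.0000), "Negative": (14, 0.9286)},
}
PANEL_B = {
    "Positive": {"Positive": (26, 0.5769), "Neutral": (26, 0.5000)},
    "Neutral": {"Neutral": (14, 1.0000)},
    "Negative": {"Neutral": (3, 1.0000), "Negative": (13, 0.8462)},
}


def accuracies(panel):
    """Return {label: (validity-test accuracy %, human-benchmark accuracy %)}."""
    result = {}
    total_n = total_valid = total_agree = 0
    for llm_label, cells in panel.items():
        n = sum(count for count, _ in cells.values())
        # Assumption: recover the integer number of validated sentences per cell
        # by rounding count * reported fraction.
        valid = sum(round(count * frac) for count, frac in cells.values())
        # Assumption: agreement with the human benchmark is the diagonal cell.
        agree = cells.get(llm_label, (0, 0.0))[0]
        result[llm_label] = (100 * valid / n, 100 * agree / n)
        total_n += n
        total_valid += valid
        total_agree += agree
    result["Overall"] = (100 * total_valid / total_n, 100 * total_agree / total_n)
    return result


for name, panel in (("(a)", PANEL_A), ("(b)", PANEL_B)):
    for label, (validity, human) in accuracies(panel).items():
        print(f"{name} {label:<8} validity {validity:6.2f}%  human {human:6.2f}%")
```

Under these assumptions the computed values agree with Table 9 to two decimals, for example 69/84 = 82.14% and 77/84 = 91.67% overall in panel (a).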
Table 11: Selected Sentences from Financial Phrasebank Data

The table shows selected sentences from the Financial Phrasebank data. The selected sentences represent cases where the LLM and human label are different, but the validation test and its prerequisites pass.

| LLM Label | Human Label | Source Data | Sentence |
|---|---|---|---|
| Positive | Neutral | 100% | “The machinery now ordered will be placed in a new mill with an annual production capacity of 40 000 m3 of overlaid birch plywood.” |
| Positive | Neutral | 50-66% | “In addition, YIT has reserved EPI Russia the right to expand the logistics center by about 100,000 m2.” |
| Negative | Neutral | 100% | “The total restructuring costs are expected to be about EUR 30mn, of which EUR 13.5mn was booked in December 2008.” |
| Negative | Neutral | 50-66% | “Compared with the FTSE 100 index, which rose 51.5 points (or 0.9%) on the day, this was a relative price change of -0.6%.” |
Cite this document
Anne Lundgaard Hansen (2026). Validating Large Language Model Annotations (FEDS 2026-020). Board of Governors of the Federal Reserve System, Finance and Economics Discussion Series. https://whenthefedspeaks.com/doc/feds_2026-020
@techreport{wtfs_feds_2026_020,
author = {Anne Lundgaard Hansen},
title = {Validating Large Language Model Annotations},
type = {Finance and Economics Discussion Series},
number = {2026-020},
institution = {Board of Governors of the Federal Reserve System},
year = {2026},
url = {https://whenthefedspeaks.com/doc/feds_2026-020},
abstract = {This paper proposes a validation framework for LLM-generated measurements when reliable benchmarks are unavailable. Validity is established by testing whether an LLM can reconstruct passages from annotated labels while maintaining semantic consistency with the original text. The framework avoids circular reasoning by establishing testable prerequisite properties that must be met for a validation to be considered successful. Application to news article data demonstrates that the framework serves as a practical alternative to human benchmarking, which offers advantages in objectivity, scalability, and cost-effectiveness while identifying cases where LLMs capture economic meaning that human evaluators miss.},
}