feds · December 14, 2025

LLM on a Budget: Active Knowledge Distillation for Efficient Classification of Large Text Corpora*

Abstract

Large Language Models (LLMs) are highly accurate in classification tasks; however, substantial computational and financial costs hinder their large-scale deployment in dynamic environments. Knowledge Distillation (KD), where an LLM “teacher” trains a smaller and more efficient “student” model, offers a promising solution to this problem. However, the distillation process itself often remains costly for large datasets, since it requires the teacher to label a vast number of samples while incurring significant token consumption. To alleviate this challenge, in this work we explore active learning (AL) as a way to create efficient student models at a fraction of the cost while preserving the LLM’s performance. In particular, we introduce M-RARU (Multi-class Randomized Accept/Reject Uncertainty Sampling), a novel AL algorithm that significantly reduces training costs. M-RARU combines uncertainty with a randomized accept-reject mechanism to select only the most informative data points for the LLM teacher. This focused approach significantly reduces the required API calls and data processing time. We evaluate M-RARU against random sampling across five diverse student models (SVM, LDA, RF, GBDT, and DistilBERT) on multiple benchmark datasets. Experiments demonstrate that our proposed method achieves up to an 80% reduction in sample requirements compared to random sampling, substantially improving classification accuracy while reducing financial costs and overall training time.

Finance and Economics Discussion Series
Federal Reserve Board, Washington, D.C.
ISSN 1936-2854 (Print), ISSN 2767-3898 (Online)

Viviana Luccioli, Rithika Iyengar, Ryan Panley, Flora Haberkorn, Xiaoyu Ge, Leland Crane, Nitish Sinha, Seung Jung Lee

2025-108

Please cite this paper as: Luccioli, Viviana, Rithika Iyengar, Ryan Panley, Flora Haberkorn, Xiaoyu Ge, Leland Crane, Nitish Sinha, and Seung Jung Lee (2025). “LLM on a Budget: Active Knowledge Distillation for Efficient Classification of Large Text Corpora*,” Finance and Economics Discussion Series 2025-108. Washington: Board of Governors of the Federal Reserve System, https://doi.org/10.17016/FEDS.2025.108.

NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. References in publications to the Finance and Economics Discussion Series (other than acknowledgement) should be cleared with the author(s) to protect the tentative character of these papers.

JEL classification: C38, C45, C55

*The views expressed herein are those of the authors, and do not reflect those of anyone else at the Board of Governors of the Federal Reserve System.

Keywords: Knowledge Distillation, Large Language Models (LLM), Active Learning, Uncertainty Sampling, Multi-Class Randomized Accept/Reject Uncertainty Sampling (M-RARU), Text Classification, Machine Learning, Economics Text Corpora

1 Introduction

With the unceasing expansion of unstructured text in the modern data landscape, text classification has become a central tool for extracting insights at scale. In the financial sector, for instance, this capability is critical for a diverse array of tasks, ranging from analyzing market trends in news reports and corporate filings to assessing credit risk and ensuring regulatory compliance (1; 2). As the volume and complexity of this textual data grow, a fundamental challenge arises: balancing the trade-off between a model’s predictive power and its computational and financial cost. Meeting this challenge is crucial for deploying effective text classification systems in real-world, resource-constrained environments where timely analysis is paramount.

Consider, for example, the task of classifying news articles based on their implications for GDP trends, as illustrated in Figure 1. Financial institutions must process thousands of such articles daily to inform investment decisions and economic forecasts. While an LLM can achieve high accuracy in determining whether an article suggests GDP is “falling,” “rising,” or “staying flat,” the computational cost of processing this volume of text at the required speed is prohibitive. Conversely, a traditional classifier might process articles quickly but miss subtle contextual cues that indicate economic direction. This exemplifies the broader challenge we address: how can we develop classifiers that capture the nuanced understanding of LLMs while maintaining the efficiency necessary for real-time, large-scale deployment?

To address this problem, two primary categories of models have been widely adopted: large-scale transformer models and traditional machine learning algorithms.
Transformer architectures, first introduced in (3) and popularized by Large Language Models (LLMs) like GPT, Claude, and Gemini, represent the state of the art in performance (4). By leveraging complex self-attention mechanisms and deep semantic embeddings, they achieve a nuanced understanding of language that often translates to superior classification accuracy. However, this power comes at a steep price. Their immense size, with billions of parameters, makes both training and inference exceedingly slow and expensive, hindering their widespread adoption for many practical applications.

In contrast, traditional machine learning algorithms such as Support Vector Machines (SVMs) (5), Gradient-Boosting Decision Trees (GBDTs) (6), or Random Forests (7) are significantly more efficient, offering rapid training and classification at a fraction of the cost. More importantly, their decisions are far more interpretable, a critical feature in domains where justifying a model’s reasoning is paramount. Yet these models typically require domain-specific supervision and have much smaller and simpler structures, which can limit their ability to capture the complex relationships within text, often leading to lower accuracy compared to LLMs.

Figure 1: Text classification for GDP trends.

A promising approach to bridge this gap is Knowledge Distillation (KD), a technique where a large, high-performing “teacher” model (the LLM) is used to train a smaller, more efficient “student” model (the traditional ML algorithm) (8; 9; 10). The goal is to transfer the teacher’s sophisticated “knowledge” to the student, thereby combining the high accuracy of an LLM with the efficiency and interpretability of a classical algorithm. However, a major bottleneck persists: the distillation process itself. Typically, it requires the expensive teacher model to label a massive dataset to create the training curriculum for the student. This step consumes significant computational resources and incurs high financial costs from API calls, undermining the very efficiency that KD aims to achieve.

Fortunately, this challenge of minimizing labeling costs by selecting only the most valuable data points is precisely the problem addressed by the field of active learning (AL) (11). The core idea of active learning is to allow a machine learning algorithm to intelligently choose the data from which it learns (12). Rather than passively receiving a large, randomly selected training set, an active learning system iteratively queries an oracle (in our case, the teacher LLM) to label only the most informative unlabeled samples. By focusing the labeling effort on the instances where labels are most needed, AL has the potential to achieve high accuracy with a fraction of the labeled data required by traditional methods (13).

In this paper, we propose a novel approach that combines the principles of Knowledge Distillation with an intelligent active learning strategy called M-RARU (Multi-class Randomized Accept/Reject Uncertainty Sampling). Our approach works within an iterative loop: the student model first identifies a pool of candidate samples it is most uncertain about. Then, M-RARU’s accept-reject mechanism strategically selects a subset of these candidates to be sent to the LLM teacher for labeling. This ensures that only the most valuable examples are used for training, dramatically improving the efficiency of the knowledge transfer process. The final student model is therefore not only accurate but also retains the speed, cost-effectiveness, and interpretability of traditional machine learning.

We experimentally evaluated M-RARU against a standard random sampling baseline on multiple benchmark datasets, using five different student models (Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), Random Forest (RF), Gradient Boosted Decision Tree (GBDT), and DistilBERT (14)). The experimental results show that student models trained with M-RARU substantially outperform their randomly sampled counterparts in accuracy and balanced accuracy. More importantly, M-RARU achieves this superior performance while drastically reducing the number of required teacher labels, leading to substantial savings in financial costs and overall training time. The resulting student models also offer much faster inference, providing a practical path to harness LLM power in resource-constrained applications.
Specifically, our contributions in this paper are as follows:

• We propose a novel approach that hybridizes Knowledge Distillation with Active Learning to address the high cost of training performant classifiers, efficiently leveraging an LLM teacher to train a smaller student model (15).
• We introduce Multi-class Randomized Accept/Reject Uncertainty Sampling (M-RARU), a specific AL algorithm that intelligently selects data to create a small yet highly effective training set, maximizing student model performance while minimizing LLM labeling costs.
• We conduct extensive experiments on multiple large, real-world text corpora, demonstrating that our proposed method substantially outperforms a random sampling baseline across a diverse set of student models, verifying our approach as a practical path to developing fast, accurate, and cost-effective classifiers.

The rest of the paper is structured as follows. Section 2 introduces the background and problem definitions. Section 3 presents our solutions. Section 4 describes the experimental environment and presents the evaluation results. Section 5 describes works that are closely related to ours. Finally, Section 6 concludes.

2 Problem Definition and Background

In this section, we formally introduce our problem and provide the necessary background for our approach.

2.1 Knowledge Distillation Task

To frame the knowledge distillation task addressed in this work, we consider a scenario involving high-dimensional text data. Each data item (e.g., a sentence or document) is represented as a high-dimensional vector via an embedding model. A large, complex “teacher” model, which has high performance but is computationally expensive, already exists. The primary challenge is to train a smaller, more efficient “student” model to replicate the teacher’s predictive capabilities. Consequently, the goal of our active learning approach is to strategically select a small, highly informative subset of unlabeled data for the teacher to label. This dataset is then used to train the student model, aiming to achieve performance comparable to the teacher with minimal labeling cost.

2.2 Problem Settings

To formalize the knowledge distillation problem addressed in this work, consider a d-dimensional data space D containing N data items, where each item belongs to one of C possible classes. This formulation targets both binary (C = 2) and multi-class (C > 2) classification tasks.

Further, consider a powerful “teacher” model, M_T, which can provide a high-quality class label for any item in D, and a smaller “student” model, M_S, that we aim to train. The training process uses a small subset of data, L ⊂ D, of size n (where n ≪ N), which is interactively selected from the unlabeled pool and labeled by the teacher model M_T.

The objective is to construct a student model M_S that accurately predicts the class labels for the entire dataset D, effectively mimicking the behavior of M_T, by using a query strategy to build the most informative training set L.

The success of this knowledge transfer is measured by the predictive performance of the student model. We focus on accuracy and balanced accuracy, as they are particularly well-suited for this task.

• Accuracy is the most direct measure of performance, defined as the proportion of all data items that are correctly classified. It provides a clear, overall assessment of the model’s correctness.
• Balanced Accuracy is crucial in scenarios with imbalanced class distributions, which are common in real-world text datasets. It is calculated as the average of the recall for each class, ensuring that the student model is evaluated fairly across all classes and not rewarded for simply predicting the majority class.

Our goal is to design a data selection solution that maximizes these measures for a fixed budget of n labels provided by the teacher.

2.3 Active Learning

Active learning is a paradigm in machine learning that aims to achieve high accuracy while minimizing the amount of labeled data required for training (16). It employs query strategies to iteratively select the most informative unlabeled sample (i.e., data object) from the unlabeled data, obtain its true label from an expert source (in our case an LLM), and then update the model with this new information. The query strategy dictates how data points are chosen.

Numerous query strategies (17) have been proposed in the literature to define the “informativeness” of samples, including Uncertainty Sampling, Query-By-Committee, Expected Model Change, Expected Error Reduction, and Expected Model Output Change. Among these, Uncertainty Sampling is the most commonly used because of its simplicity and efficiency, as pointed out in (17).

Uncertainty Sampling

Uncertainty sampling (18) is a query strategy that can be used with any probability-based classification model (e.g., Naive Bayes, SVM). It selects samples based on the model’s uncertainty about their classification (19). The intuition underlying uncertainty sampling is that patterns with high uncertainty are hard to classify, so obtaining labels for high-uncertainty samples boosts the accuracy of classification models more than, say, random sampling.

In particular, for a classification model with class labels a, b, c, and d, the most uncertain example x is one that could be assigned any class label z(x) with equal probability (e.g., 0.25, 0.25, 0.25, 0.25).

Building on this idea of uncertainty, also known as least confidence, (18) proposes a measure of uncertainty for binary classification models, which extends easily to categorical classification models:

$$ u^{(lc)}(x) = 1 - p(\hat{y} \mid x) \tag{1} $$

where u^{(lc)}(x) is the uncertainty score of x under the least confidence measure, and \hat{y} is the predicted class label of the unlabeled sample x. Accordingly, after measuring the uncertainty of each unlabeled sample, the unlabeled sample with the highest uncertainty is selected:

$$ x^{*} = \arg\max_{x} u(x) \tag{2} $$

where u(x) can be any other measure of informativeness over the unlabeled sample x.

3 Our Approach

In this section, we formally describe our proposed framework, which integrates active learning with knowledge distillation to produce efficient and accurate classifiers.

3.1 Proposed Solution

Our proposed solution is designed to bridge the gap between the high performance of Large Language Models (LLMs) and the efficiency of traditional machine learning classifiers. The framework aims to achieve two primary goals: 1) minimize the financial and computational cost associated with using an LLM “teacher” for labeling, and 2) train a smaller “student” model that achieves the highest possible accuracy by learning from a strategically selected, information-rich dataset.

Algorithm 1 The Knowledge Distillation Process
Require: The raw text corpus D, a teacher model M_T
Ensure: A trained student model M_S
 1: Convert D into a set of embeddings E
 2: L ← ∅ {Initialize the training set for the student}
 3: U ← E {Initialize the unlabeled pool}
 4: M_S ← initialize student model
 5: while U is not empty do
 6:   Randomly select one sample x from U
 7:   Solicit normalized uncertainty score p for x from M_S
 8:   With probability p, add x to the labeling set L
 9:   U ← U − {x}
10: end while
11: Request labels for all samples in L from teacher model M_T
12: Train student model M_S on the labeled set L
13: Return trained student model M_S

As illustrated in Algorithm 1, our framework identifies the most valuable data for training through an iterative selection process. The system first converts the entire raw text corpus into a set of numerical vector representations, or embeddings, to make the data processable by machine learning models (Line 1). It then initializes an empty training set L, the unlabeled pool U, and a student model M_S (Lines 2-4). The core of our approach is a loop that intelligently builds the training set L (Lines 5-10). In each iteration, instead of exhaustively searching the entire unlabeled pool, the framework randomly selects a data sample and queries the current student model for its predictive uncertainty. This uncertainty score is then used to probabilistically decide whether the sample is informative enough to be added to the set L for later labeling by the teacher. This process continues until every sample in the original corpus has been considered.

Once the selection phase is complete, the framework sends only the curated, high-value samples in set L to the powerful but expensive LLM teacher to obtain high-quality labels (Line 11). This small, targeted training set is then used to train the final student model (Line 12). By focusing the teacher’s effort exclusively on the most informative examples, our framework facilitates an efficient knowledge transfer, producing a student model that emulates the teacher’s performance at a fraction of the cost.

A key advantage of this approach is the enhanced interpretability of the final student model.
While LLMs and even DistilBERT operate as complex “black boxes,” the decision-making processes of models like GBDT, Random Forest, and SVM can be readily explained using well-established techniques such as SHAP (SHapley Additive exPlanations) (20) or LIME (Local Interpretable Model-agnostic Explanations) (21). These methods can generate feature-level explanations for individual predictions, revealing which words or phrases most influenced a particular classification. This transparency is invaluable in high-stakes domains like finance or regulation, where understanding why a model made a certain decision is as important as the decision itself.

In the following sections, we present each main component of our approach in detail.

3.2 Data Embedding

In the domain of natural language processing, the representation of text data is a critical first step that profoundly influences the performance of any machine learning model. To this end, embedding methods are employed to transform unstructured text into dense numerical vectors that capture semantic relationships. These methods aim to create feature representations such that the proximity between vectors in the learned vector space reflects the semantic similarity of the corresponding text in its original form.

A large variety of algorithms have been proposed for this task. Well-recognized approaches such as Word2vec (22), GloVe (23), and FastText generate embeddings at the word level, while more advanced transformer-based models like BERT or sentence encoders like the Universal Sentence Encoder (24) create contextualized representations for entire sentences or documents. These methods provide rich representations that preserve the nuances of linguistic context, enabling classifiers to perform complex reasoning.

In our work, for student models that require embeddings, we leverage sentence-level embeddings to ensure that the full semantic meaning of each text sample is captured. Using a single, unified embedding method for all traditional student models also ensures consistency and comparability of results, as different embedding techniques can produce vectors of varying dimensionality (from hundreds to thousands of dimensions), which could otherwise introduce confounding variables into our performance evaluation.

3.3 Query Strategy

The Query Strategy is the component of our framework responsible for minimizing the labeling cost while maximizing the student model’s ultimate accuracy. In the context of our approach, a “query” refers to the process of selecting an unlabeled data sample to be labeled by the teacher LLM. Our framework leverages a specialized form of uncertainty sampling to intelligently build the training set and guide the knowledge distillation process.
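As a point of reference for the query strategy, the least-confidence score of Eq. (1) and the argmax selection of Eq. (2) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the toy probability vectors are assumptions made here, not part of the paper's implementation.

```python
from typing import Sequence


def least_confidence(class_probs: Sequence[float]) -> float:
    """Eq. (1): one minus the probability of the predicted (most likely) class."""
    return 1.0 - max(class_probs)


def most_uncertain_index(pool_probs: Sequence[Sequence[float]]) -> int:
    """Eq. (2): index of the pool item with the highest uncertainty score."""
    scores = [least_confidence(p) for p in pool_probs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy predicted distributions for three unlabeled items (illustrative values).
pool_probs = [
    [0.90, 0.05, 0.05],  # confident prediction, score 0.10
    [0.40, 0.35, 0.25],  # ambiguous prediction, score 0.60
    [0.60, 0.30, 0.10],  # intermediate, score 0.40
]
chosen = most_uncertain_index(pool_probs)  # index 1, the most ambiguous item
```

Note that `most_uncertain_index` scores every item in the pool each time it is called, so its cost grows with the size of the unlabeled pool.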

Uncertainty Sampling

Uncertainty sampling is a widely adopted active learning strategy predicated on a simple yet powerful intuition: a model gains the most information from samples it is least certain about. By prioritizing these ambiguous samples for labeling, a model can resolve confusion at its decision boundary more quickly, leading to faster convergence and higher accuracy with fewer labeled examples. To measure the uncertainty of a data object x, a probabilistic predictive model is needed to report the probability of x belonging to each possible class. The sample for which the model’s prediction is least confident (e.g., closest to a 50% probability in a binary task) is considered the most uncertain and, therefore, the most informative.

Challenges with Traditional Uncertainty Sampling

Despite its effectiveness, traditional uncertainty sampling suffers from two major drawbacks, particularly in the context of large datasets: 1) shortsightedness (25) and 2) low scalability (26).

Shortsightedness arises because the model’s uncertainty is estimated using only the information from the few samples it has already seen. This can create a bias, causing the strategy to repeatedly select samples clustered around a single, noisy region of the decision boundary while ignoring other potentially informative areas of the feature space. Low scalability is a computational bottleneck; conventional uncertainty sampling requires an exhaustive search over the entire unlabeled dataset in every iteration to find the single most uncertain sample. This process incurs prohibitive processing costs and introduces significant delays, making it impractical for large-scale applications.

Randomized Uncertainty

To overcome the first drawback mentioned above, the work in (27) combines uncertainty with some degree of randomness. In particular, an unlabeled object that would be presented to the user as an example is probabilistically selected from the entire set of unlabeled objects.
This probabilistic framework requires that the “informativeness” of each sample be a non-negative, quantitatively meaningful score suitable for normalization. Because uncertainty scores are derived directly from model probabilities, they are a natural fit for creating such a selection distribution, a property not guaranteed by all informativeness metrics used in active learning (17). The probability that an unlabeled object x is selected is proportional to its uncertainty score:

$$ p(x \text{ is selected}) = \frac{u(x)}{\sum_{x_u \in U} u(x_u)} \tag{3} $$

where U is the set of unlabeled objects and u(x) is the uncertainty score of x.

Since the probability that an unlabeled object x is chosen as an example is equal to its normalized uncertainty score, less uncertain objects still have a small chance of being accepted as examples, which reduces the bias introduced by the labeled samples.

Multi-class Randomized Accept/Reject Uncertainty Sampling (M-RARU)

While the Randomized Uncertainty strategy addresses traditional uncertainty sampling’s drawback of shortsightedness, the issue of low scalability remains. To overcome this limitation, the work in (26) and (28) introduced a randomized Accept/Reject mechanism that allows uncertainty estimation to be performed efficiently for binary classification. However, many real-world classification tasks involve multiple classes or labels, so methods designed only for binary classification are not suitable for these knowledge distillation tasks. In this work we introduce Multi-class Randomized Accept/Reject Uncertainty Sampling (M-RARU). M-RARU addresses both shortsightedness and scalability for binary and multi-class classification tasks by introducing a randomized, probabilistic selection mechanism that eliminates the need to perform an exhaustive search over the entire data space. In particular, in each step, M-RARU randomly selects a single sample from the unlabeled pool, calculates its uncertainty score, and then uses this score to make a probabilistic decision on whether to “accept” the sample for labeling or “reject” it and move on.

The probability of an unlabeled data sample x being accepted into the training set L under M-RARU is defined as:

$$ p(x \text{ is accepted}) = 1 - \max_{k \in \{1,\dots,K\}} \Pr(C_k \mid x) \tag{4} $$

where Pr(C_k | x) is the probability of x being assigned the class label C_k by the student model, and K is the total number of classes.
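The acceptance rule of Eq. (4), combined with the random visiting order of Algorithm 1 (Lines 5-10), can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `predict_proba` callable, and the toy probability table standing in for a student model are all hypothetical.

```python
import random
from typing import Callable, List, Sequence


def acceptance_probability(class_probs: Sequence[float]) -> float:
    """Eq. (4): one minus the largest predicted class probability."""
    return 1.0 - max(class_probs)


def select_for_labeling(
    pool: Sequence[int],
    predict_proba: Callable[[int], Sequence[float]],
    rng: random.Random,
) -> List[int]:
    """One accept/reject pass over the pool, visiting items in random order."""
    order = list(pool)
    rng.shuffle(order)  # same as repeatedly drawing from the pool without replacement
    accepted = []
    for item in order:
        p_accept = acceptance_probability(predict_proba(item))
        if rng.random() < p_accept:  # accept with probability p_accept
            accepted.append(item)
    return accepted


# Hypothetical stand-in for a student model: item id -> class probabilities.
toy_probs = {
    0: [1.00, 0.00, 0.00],  # fully confident: acceptance probability 0
    1: [0.34, 0.33, 0.33],  # near-uniform: acceptance probability 0.66
    2: [0.70, 0.20, 0.10],  # acceptance probability 0.30
}
selected = select_for_labeling(list(toy_probs), toy_probs.__getitem__, random.Random(0))
```

Each item is scored once, when it is visited, so no pass over the full pool is needed to rank candidates. Because the largest of K class probabilities is at least 1/K, the acceptance probability under Eq. (4) never exceeds 1 − 1/K; that bound is reached when the predicted distribution is uniform.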
Equation (4) directly captures the model’s uncertainty: when the maximum predicted probability is low (indicating the model is uncertain about all classes), the acceptance probability is high. Conversely, when the model is confident in its prediction (high maximum probability), the acceptance probability is low. This ensures that highly uncertain samples have a high probability of being accepted, while still allowing less uncertain samples a chance to be selected, which helps mitigate the shortsightedness bias. The formula is designed around the model’s prediction confidence because the accept/reject mechanism requires an uncertainty score that can function as a direct probability of acceptance. Using the maximum prediction probability yields a score naturally bounded within the required [0,1] range. In contrast, other common metrics like Shannon entropy produce a score on a different scale (e.g., [0, log(K)]), making them less compatible with this probabilistic decision framework. By randomly visiting unlabeled objects until one is accepted, M-RARU provides an early termination to the costly exhaustive search, directly addressing the scalability problem. This combination of randomization and uncertainty-based acceptance allows the framework to efficiently build a diverse and highly informative training set, preserving the core benefits of uncertainty sampling while adapting it for large-scale knowledge distillation.

4 Experimental Evaluation

In this section, we present the results of our experiments. We begin by introducing the experimental setup and then demonstrate the performance of our proposed scheme against the baseline across various student models and datasets.

4.1 Experiment Setup

Datasets. In our experiments, we used two real-world unstructured text datasets.

Public Comments Dataset: This dataset comprises a vast collection of public responses to Federal Reserve announcements and regulations. For our experiments, we utilize a pool of 125,179 comments sampled from all public comments posted since 2008. The teacher model classifies each comment into one of five categories: Banks and Trades, Consumer/Community, Government, General Public, or Other, based on the commenter’s organizational affiliation and perspective.

LSEG Data & Analytics Global News Archive Database (GNAD): The GNAD dataset consists of professionally authored financial news articles. We utilize 12,288 news headlines for our experiments, focusing specifically on headline text to capture the most salient economic signals. The teacher model predicts whether each headline indicates rising, falling, or flat GDP trends, providing a concise economic sentiment classification task.

Table 1: Experimental Parameters

Parameter                         Value
Experimental Datasets             Public Comments; LSEG Data & Analytics Global News Archive Database (GNAD)
Data Objects (Public Comments)    125,179
Data Objects (GNAD)               12,288
Embedding Dimensions              384
Embedding Model                   all-MiniLM-L6-v2
Teacher Model (Oracle)            gemma-3-27b-it-qat-q4_0-gguf
Initial Labeled Pool              Randomly sampled until at least one sample per class is present
AL Batch Size                     25
Max Labeled Examples              6,275 (Public Comments); 6,150 (GNAD)
Considered AL Schemes             M-RARU; Random Sampling (RANDOM)
Student Models                    SVM; LDA; RF; GBDT; DistilBERT
Performance Measures              Accuracy; Balanced Accuracy
Number of Runs per Result         5 (1 for DistilBERT)

[Figures 2 through 21: Accuracy (even-numbered figures) and Balanced Accuracy (odd-numbered figures) for SVM, LDA, RF, GBDT, and DistilBERT, in that order, on Public Comments (Figures 2 through 11) and GNAD (Figures 12 through 21).]

Learning Representation. To generate learning representations for the text, we employed the SentenceTransformer package. Specifically, we used the all-MiniLM-L6-v2 model, which transforms each text segment into a 384-dimensional dense vector. These embeddings capture semantic relationships and serve as the unified feature space for any student model that requires an embedding (i.e., SVM, RF, GBDT, and LDA) in our experiments.

Active Learning Schemes. We experimented with one baseline scheme and our proposed scheme. In both schemes, selected examples are labeled by a teacher model, a locally deployed gemma-3-27b-it-qat-q4_0, which acts as the oracle. The active learning process begins after an initial set of samples is randomly drawn to ensure at least one representative from each class is present in the training set.
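The seeding step described above, drawing random samples until every class is represented, might look like the following sketch. It assumes the number of classes is known in advance and that each drawn item is labeled by the oracle as it is drawn; `seed_initial_pool`, `label_fn`, and the toy teacher are hypothetical names introduced only for illustration.

```python
import random
from typing import Callable, Hashable, List, Sequence, Tuple


def seed_initial_pool(
    pool: Sequence[int],
    label_fn: Callable[[int], Hashable],
    n_classes: int,
    rng: random.Random,
) -> Tuple[List[Tuple[int, Hashable]], List[int]]:
    """Draw items uniformly at random until every class has been seen once."""
    remaining = list(pool)
    rng.shuffle(remaining)
    labeled: List[Tuple[int, Hashable]] = []
    seen = set()
    while remaining and len(seen) < n_classes:
        item = remaining.pop()
        label = label_fn(item)  # one oracle (teacher) call per drawn item
        labeled.append((item, label))
        seen.add(label)
    return labeled, remaining


def toy_teacher(item: int) -> int:
    """Hypothetical oracle: three classes assigned by item id."""
    return item % 3


labeled, remaining = seed_initial_pool(
    list(range(20)), toy_teacher, n_classes=3, rng=random.Random(1)
)
```

The loop stops as soon as the last missing class appears, so the seed set is as small as the random draw order allows; with strongly imbalanced classes it may still need many draws before a rare class shows up.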

• RandomSampling(RANDOM):Thebaselinescheme,wherethesystemselectsexamples tobelabeledfromtheunlabeledpoolbasedonauniformrandomdistribution. • M-RARU: Our proposed scheme, which uses Multi-class Randomized Accept/Reject Uncertainty Sampling to intelligently query the most informative examples for labeling by the teachermodel. Student Models We evaluated our active learning schemes on five distinct student models to assess the generalizability of our approach. All traditional machine learning models use default scikit-learn configurations for training to ensure reproducibility and fair comparison. The models include: a Support Vector Machine (SVM), trained with default scikit-learn parameters; Linear Discriminant Analysis (LDA), using default scikit-learn configuration; a Random Forest (RF), an ensemble of decision trees with default scikit-learn settings; a Gradient-Boosting Decision Tree (GBDT), implemented using XGBoost for GPU support while maintaining default scikit-learn configuration parameters; and DistilBERT, a distilled version of BERT trained using default configurationsfromtheTransformerslibrary. EvaluationMetricsWeassesstheperformanceofthestudentmodelsusingtwoprimaryclassificationmetrics. 1. Accuracy is the proportion of correctly predicted instances over the total number of instances: TP +TN Accuracy = (5) TP +TN +FP +FN 2. Balanced Accuracy is the average of recall obtained on each class, which is suitable for imbalanceddatasets: K 1 (cid:88) TP i BalancedAccuracy = (6) K TP +FN i i i=1 where TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and falsenegatives,respectively. EnvironmentWeimplementedallalgorithmsinPython3.11. Allexperimentswereconducted on a machine equipped with a 16-core Intel CPU, 128GB of RAM, and a single NVIDIA V100 GPU with 32GB of memory. 
All reported results are averages of 5 complete runs, with the exception of the DistilBERT model, for which a single run was conducted due to computational constraints.
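As a minimal sketch of Equations 5 and 6, the snippet below computes both metrics from label lists in plain Python. The function names `accuracy` and `balanced_accuracy` and the toy labels are illustrative assumptions, not part of the paper's code. For multi-class labels, accuracy is computed here as the fraction of exact matches, which coincides with Equation 5 in the binary case.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall TP_i / (TP_i + FN_i) over the classes in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    support = Counter(y_true)  # TP_i + FN_i for each class i
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # TP_i
    return sum(hits[c] / support[c] for c in support) / len(support)


# Toy, imbalanced three-class example (labels are made up for illustration).
y_true = ["rising"] * 8 + ["falling", "flat"]
y_pred = ["rising"] * 8 + ["rising", "flat"]

acc = accuracy(y_true, y_pred)           # 9 of 10 predictions are correct
bal = balanced_accuracy(y_true, y_pred)  # per-class recalls are 1.0, 0.0, 1.0
print(f"accuracy={acc:.3f} balanced_accuracy={bal:.3f}")
```

In this toy case the single missed minority example barely moves accuracy but removes a full third of balanced accuracy, which is why the second metric is the more informative one under class imbalance.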

Parameters. Table 1 provides a comprehensive list of the parameters and settings used throughout our experiments.

4.2 Experimental Results

Accuracy Comparison. Figures 2 through 21 present the primary results of our study, illustrating the performance of each student model under the M-RARU and RANDOM sampling schemes across both datasets. The y-axis of each plot represents either Accuracy or Balanced Accuracy, while the x-axis indicates the number of samples labeled by the teacher model. The accuracy thresholds shown in each model-dataset configuration represent the thresholds that are achievable with M-RARU within the given sample budget constraint (up to a cap of 90%).

Our results demonstrate that M-RARU consistently outperforms RANDOM sampling across all model configurations, with the magnitude of improvement varying significantly based on the inherent uncertainty estimation capabilities of each model type. The variations in performance gains can be attributed to fundamental differences in how each model architecture estimates prediction uncertainty, which is a critical factor for active learning effectiveness.

Tree-based Models (RF and GBDT) exhibit the most dramatic yet inconsistent improvements with M-RARU. For instance, GBDT on Public Comments requires only 1,825 samples with M-RARU whereas RANDOM needs more than 6,275 to reach 90% accuracy (71% reduction in samples). However, there was little to no difference in necessary samples to reach the accuracy thresholds for RF. Tree-based models’ native probabilistic outputs through ensemble voting mechanisms influence this performance. In Random Forests, the variance across individual tree predictions provides a naturally calibrated uncertainty estimate, while GBDT’s sequential boosting process inherently focuses on difficult examples, aligning closely with M-RARU’s uncertainty-driven selection. The discrete decision boundaries created by tree splits also produce clear regions of high uncertainty at class boundaries, making these models well-suited for identifying informative samples through active learning.
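The reduction percentage quoted above follows from simple arithmetic on the two sample counts. A small sketch, in which the helper name `sample_reduction` is an illustrative assumption and "more than 6,275" is treated as exactly 6,275, so the result is a lower bound on the true reduction:

```python
def sample_reduction(n_active: int, n_random: int) -> float:
    """Relative reduction in teacher-labeled samples, as a fraction between 0 and 1."""
    if n_random <= 0 or not 0 <= n_active <= n_random:
        raise ValueError("expected 0 <= n_active <= n_random and n_random > 0")
    return 1.0 - n_active / n_random


# GBDT on Public Comments at the 90% accuracy threshold (counts from the text).
gbdt = sample_reduction(1825, 6275)
print(f"GBDT reduction: {gbdt:.0%}")
```

Because every avoided sample is one fewer call to the teacher model, the same fraction is also a rough estimate of the saving in teacher labeling cost, under the assumption that each labeling request costs about the same.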
Linear Models (SVM and LDA) show substantial but more moderate improvements, typically achieving 50-70% reductions in labeling requirements. To reach an accuracy of 90% on the Public Comments data, LDA requires 2,250 samples using M-RARU compared to more than 6,275 with RANDOM (64% reduction), and SVM requires only 875 samples using M-RARU compared to 2,075 using RANDOM (58% reduction). These gains arise from the models’ geometric interpretation of uncertainty. SVM’s distance from the decision hyperplane provides a natural uncertainty

metric that aligns well with M-RARU’s sampling strategy, and is particularly effective at identifying support vectors that define class boundaries. LDA, as a generative model, offers well-calibrated posterior probabilities through its Gaussian assumptions, though its linear nature limits the complexity of uncertainty patterns it can capture compared to tree-based methods.

Table 2: Student Model Inference and Training Comparison

| Model | Training (ms) | Training Speedup vs DistilBERT | Inference (ms) | Inference Speedup vs DistilBERT |
|---|---|---|---|---|
| DistilBERT (CUDA) | 13.3 | 1.0× | 2.80 | 1.0× |
| GBDT (CUDA) | 0.3 | 44× | 0.08 | 35× |
| RF | 1.5 | 9× | 0.12 | 23× |
| LDA | 4.9 | 3× | 0.22 | 13× |
| SVM | 1.5 | 9× | 0.55 | 5× |

DistilBERT demonstrates the most modest improvements, with M-RARU typically requiring 10-20% fewer samples than RANDOM. This limited benefit stems from several factors inherent to transformer architectures. First, as reported in (29), DistilBERT’s softmax outputs require additional calibration to produce reliable uncertainty estimates, as neural networks are known to be overconfident in their predictions. Second, the model’s deep semantic understanding means it already performs well on randomly selected samples, reducing the relative benefit of strategic selection. Third, transformer models lack native uncertainty quantification mechanisms, unlike ensemble methods or Bayesian approaches, and require post-hoc techniques like temperature scaling or Monte Carlo dropout for uncertainty estimation. The computational overhead of these calibration methods further limits the practical benefits of active learning for transformer models.

Dataset Complexity Impact. The GNAD dataset consistently requires more samples across all models to achieve comparable accuracy levels, reflecting its more challenging classification task. News headlines, by nature, are extremely concise and often ambiguous, requiring sophisticated inference to determine GDP impact.
Here, M-RARU’s advantages become even more pronounced as many configurations with RANDOM sampling fail to reach higher accuracy thresholds within the 6,150 sample budget, while M-RARU is capable of achieving these targets. For example, Random Forest with RANDOM cannot reach 75% accuracy on GNAD within the dataset limit, whereas M-RARU achieves this with only 2,200 examples.

Balanced Accuracy Analysis. When examining balanced accuracy metrics, which better account for class imbalance, the benefits of M-RARU become even more apparent. The strategic sampling inherently addresses class imbalance by focusing on decision boundaries where minority classes are often found. For instance, RF on Public Comments requires over 6,275 examples

with RANDOM to achieve 80% balanced accuracy, while M-RARU needs only 1,200, representing an 81% reduction. This improvement is particularly valuable in real-world applications where minority classes often represent critical but rare events.

The consistent pattern across all experiments reveals that M-RARU’s effectiveness scales with model uncertainty quality: models with naturally calibrated uncertainties (tree ensembles) benefit most, followed by models with geometric uncertainty interpretations (SVM, LDA), while models requiring uncertainty calibration (DistilBERT) show modest but still meaningful improvements. These results validate our hypothesis that combining knowledge distillation with intelligent active learning can dramatically reduce the cost of creating high-performance classifiers.

Table 3: Sampling Efficiency: M-RARU vs Traditional Uncertainty Sampling

| Model | Public Comments Acc. Rate | Public Comments Speedup | GNAD Acc. Rate | GNAD Speedup |
|---|---|---|---|---|
| SVM | 18.2% | 912× | 31.3% | 154× |
| LDA | 0.2% | 10× | 1.9% | 9× |
| RF | 35.7% | 1,788× | 42.9% | 211× |
| GBDT | 13.7% | 686× | 33.1% | 163× |
| DistilBERT | 5.9% | 295× | 8.3% | 41× |

4.3 Training Efficiency Analysis

Table 2 illustrates the computational efficiency gains achieved by traditional machine learning models compared to the transformer-based DistilBERT baseline. These measurements represent averages across 1,000 batches of 32 samples each.

In terms of training efficiency, GBDT demonstrates exceptional performance with a 44x speedup compared to DistilBERT, requiring only 0.3ms per batch versus 13.3ms for the transformer model. This dramatic improvement stems from GBDT’s sequential tree construction algorithm, which efficiently leverages gradient information without the computational overhead of backpropagation through deep neural networks. Random Forest and SVM both achieve 9x training speedups, completing batch training in 1.5ms through parallelizable training procedures.
LDA shows a 3x speedup with 4.9ms training time, as its statistical approach requires matrix operations that, while efficient, are more computationally intensive than tree-based methods.

For inference performance, the advantages become even more pronounced. GBDT achieves a 35x speedup with inference times of 0.08ms per batch, making it ideal for real-time applications.

Random Forest delivers 23x faster inference at 0.12ms through simple tree traversal operations, while LDA provides 13x faster inference at 0.22ms via straightforward linear transformations.

These efficiency gains have profound implications for model development. The time saved by faster models can be directly reinvested into hyperparameter tuning, which is a critical process for maximizing predictive performance (30; 31). Within a fixed time budget, a practitioner can execute hundreds of GBDT experiments in the time required for a single DistilBERT run. This enables thorough exploration of the hyperparameter space, dramatically increasing the probability of finding optimal configurations.

The combination of M-RARU’s sample efficiency and traditional models’ computational speed creates a multiplicative advantage: M-RARU reduces labeling time while efficient models accelerate training, enabling rapid iteration cycles. For instance, within a single workday, one could test hundreds of combinations of learning rates, tree depths, and regularization parameters for GBDT. The same search would take weeks with transformer models. This capability ensures that knowledge distilled from the teacher LLM is leveraged to its fullest extent, producing models that are not only fast but optimally tuned for peak performance.

4.4 Sampling Efficiency Analysis

Previously, (26) and (28) have shown that the randomized accept/reject mechanism achieves comparable performance to traditional exhaustive uncertainty sampling. To further strengthen the comparison, in Table 3, we quantify the computational efficiency gains of M-RARU over traditional uncertainty sampling when reaching 85% accuracy for Public Comments and 75% accuracy for GNAD. The calculations assume a batch size of 25 samples, where traditional uncertainty sampling must perform exhaustive searches through the entire unlabeled pool (125,179 samples for Public Comments, 12,288 for GNAD) after training each batch to identify the most uncertain samples.
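To make the cost of the exhaustive baseline concrete, the sketch below scores every item in the unlabeled pool and keeps the most uncertain batch. It is a simplified illustration rather than the authors' implementation: `toy_predict_proba` is a hypothetical stand-in for a student model, least confidence (one minus the top class probability) is assumed as the uncertainty score, and the pool is synthetic.

```python
import heapq
import random


def toy_predict_proba(x):
    """Hypothetical three-class student: maps a feature in [0, 1] to probabilities."""
    return [x, (1.0 - x) / 2.0, (1.0 - x) / 2.0]


def least_confidence(probs):
    """Uncertainty score: one minus the probability of the most likely class."""
    return 1.0 - max(probs)


def exhaustive_uncertainty_batch(pool, predict_proba, batch_size):
    """Score every pool item, then return the indices of the most uncertain ones.

    Also returns the number of student inferences, which equals the pool size.
    """
    scored = [(least_confidence(predict_proba(x)), i) for i, x in enumerate(pool)]
    top = heapq.nlargest(batch_size, scored)  # ranks on the uncertainty score
    return [i for _, i in top], len(scored)


rng = random.Random(0)
pool = [rng.random() for _ in range(1000)]  # synthetic unlabeled pool
batch, n_inferences = exhaustive_uncertainty_batch(pool, toy_predict_proba, 25)
print(len(batch), n_inferences)  # 25 items selected after 1000 inferences
```

The point of the sketch is the second return value: every batch costs one inference per pool item, so the per-batch cost grows linearly with the size of the unlabeled pool.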
In contrast, M-RARU employs the accept/reject mechanism described in Equation 4, where the uncertainty score directly serves as the acceptance probability, eliminating the need for exhaustive ranking. The acceptance rates shown reflect the average probability of accepting a sample during the active learning process until these accuracy thresholds are reached.

The results reveal striking variations in acceptance rates across models: tree-based methods (RF and GBDT) maintain healthy acceptance rates of 13.7-42.9%, yielding speedups of 163-1,788× when reaching target accuracy, while SVM shows intermediate rates of 18.2-31.3% with speedups of 154-912×. Most notably, LDA exhibits pathologically low acceptance rates of 0.2% on Public Comments and 1.9% on GNAD, resulting in minimal speedups of 10× and 9× respectively.
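The accept/reject step and the speedups in Table 3 can be illustrated with a short sketch. The assumptions here are ours rather than the paper's: candidates are visited in uniformly random order, the acceptance probability is the least-confidence score (one minus the top class probability), and the speedup is estimated as pool size times acceptance rate divided by batch size, that is, the inferences per batch of an exhaustive scan divided by the expected number of candidates examined. That estimate agrees with the Table 3 figures to within rounding of the published rates. The function names and the toy student are hypothetical.

```python
import random


def toy_predict_proba(x):
    """Hypothetical three-class student: maps a feature in [0, 1] to probabilities."""
    return [x, (1.0 - x) / 2.0, (1.0 - x) / 2.0]


def accept_reject_batch(pool, predict_proba, batch_size, rng):
    """Visit the pool in random order and accept items with probability = uncertainty.

    Returns (accepted pool indices, number of student inferences performed).
    Fewer than batch_size indices are returned if the pool runs out first.
    """
    order = list(range(len(pool)))
    rng.shuffle(order)  # random order, each item examined at most once
    accepted, inferences = [], 0
    for idx in order:
        if len(accepted) == batch_size:
            break
        probs = predict_proba(pool[idx])
        inferences += 1
        p_accept = 1.0 - max(probs)  # least-confidence score (assumed)
        if rng.random() < p_accept:
            accepted.append(idx)
    return accepted, inferences


def estimated_speedup(pool_size, acceptance_rate, batch_size):
    """Exhaustive cost per batch (pool_size) over expected accept/reject cost."""
    return pool_size * acceptance_rate / batch_size


rng = random.Random(0)
pool = [rng.random() for _ in range(5000)]  # synthetic unlabeled pool
batch, n_inferences = accept_reject_batch(pool, toy_predict_proba, 25, rng)
print(f"accepted {len(batch)} items after {n_inferences} inferences")

# RF rows of Table 3, using the pool sizes and batch size stated in the text.
print(round(estimated_speedup(125_179, 0.357, 25)))  # about 1,788 (Public Comments)
print(round(estimated_speedup(12_288, 0.429, 25)))   # about 211 (GNAD)
```

Under these assumptions the cost of filling a batch depends on the acceptance rate rather than on the pool size, which is why a very low acceptance rate erodes most of the advantage over an exhaustive scan.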

This poor performance stems from LDA’s generative modeling approach, which produces overly confident posterior probabilities concentrated in narrow regions of the feature space. When LDA assigns high confidence to most samples (leaving few truly uncertain), the acceptance probability $p = 1 - \max_k \Pr(C_k \mid x)$ becomes vanishingly small for the vast majority of the pool, and thus more candidates must be examined before a batch is filled. Overall, as the results show, thanks to the adoption of the randomized accept/reject mechanism, M-RARU requires far fewer inferences than traditional active learning sampling strategies that require exhaustive search.

5 Related Works

In this section, we present the works that are closely related to our research. We begin by introducing the literature on Knowledge Distillation, with a particular focus on its application to text classification. Then, we discuss established principles in Active Learning for efficient data selection. Finally, we survey the emerging intersection of these two fields, which provides the context for our proposed methodology.

Knowledge Distillation for Text Classification

The concept of Knowledge Distillation (KD) was formally introduced as a method to compress large, complex models into smaller, more efficient ones without a significant loss in performance (8). The fundamental idea is to train a compact "student" model to mimic the behavior of a larger, pre-trained "teacher" model. This is typically achieved by using the softened class probabilities produced by the teacher as soft labels to guide the student’s training process. In the domain of Natural Language Processing (NLP), this technique gained significant traction with the advent of large-scale transformer models. For instance, works like DistilBERT (14) and TinyBERT (32) demonstrated that it was possible to create much smaller and faster versions of BERT that retained over 95% of the original model’s performance on standard NLP benchmarks.
Specifically for text classification, KD has been explored in various contexts. Some approaches focus on distilling knowledge across different domains, training a student model for a target domain using teachers with expertise in related source domains (33). Others have adapted KD for industrial applications, developing performance-guided strategies to create efficient classifiers at scale by carefully selecting the knowledge to be transferred (34). There is also research on distilling knowledge between different modalities, such as from text-based models to speech-based models (35). Despite these advancements, a common challenge in nearly all KD applications is the high cost associated with

the initial step: requiring the powerful but slow and expensive teacher model to label a very large, randomly sampled dataset to create the training set for the student (36).

Active Learning for Efficient Model Training

Active Learning (AL) is a subfield of machine learning that aims to reduce the total amount of labeled data required to train a model by allowing the learning algorithm to intelligently choose the data from which it learns (17). The core principle is that not all data points are equally informative. By iteratively selecting the most valuable samples for labeling, an AL system can achieve a desired level of performance with significantly fewer labels than required by passive, random sampling approaches. A wide variety of query strategies have been developed to identify informative samples. The most common approach is uncertainty sampling, where the algorithm queries the instances about which it is least certain of the correct label (18). Other popular strategies include Query-by-Committee (QBC), which uses an ensemble of models and selects samples on which the committee members disagree the most (37), and Expected Model Output Change (EMOC), which prioritizes samples that are expected to cause the greatest change to the current model if their labels were known (38). Another related technique is importance sampling, which has a rich history in statistics. In machine learning, it is used to prioritize data points that have a larger impact on the model’s loss function, thereby reducing training time and improving final accuracy (39). Recent work has extended this to create task-adaptive pretraining schemes by sampling data that is most relevant to the target task (40). Our work draws inspiration from these principles, but applies them to the unique problem of cost-effective knowledge transfer from a teacher model. More recently, (26) and (28) introduced a Randomized Accept/Reject mechanism into Uncertainty Sampling, which addresses the scalability issues of traditional uncertainty sampling through probabilistic selection.
However, their implementations were limited to binary classification tasks and are thus unsuited to the particular objective of this work.

Integrating Active Learning and Knowledge Distillation

The high cost of data annotation in standard KD has naturally led researchers to explore the integration of AL. The goal of this hybrid approach, often termed Active Knowledge Distillation (AKD), is to use AL query strategies to select a small, highly informative subset of unlabeled data for the teacher LLM to label, thereby minimizing expensive API calls and computational overhead (41; 42). Several strategies have been proposed within this emerging area. Some methods use

traditional uncertainty metrics, where the student model identifies confusing samples and requests teacher labels only for those (43). Others have developed more sophisticated metrics that consider both the student’s uncertainty and the teacher’s confidence, aiming to select samples that are not only hard for the student but also confidently labeled by the teacher (44). Furthermore, research has shown that co-training frameworks, where the student and teacher models are trained simultaneously in an active learning loop, can yield more robust results (45). A comprehensive survey of data selection methods highlights the critical role that strategic sampling plays in the overall efficiency of training modern language models (46). However, many existing AKD methods still rely on deterministic uncertainty sampling, which can be prone to selecting outliers and may not sufficiently explore the data space. These methods often lack a mechanism to balance exploration (sampling from diverse regions) and exploitation (sampling from regions of high uncertainty). Our proposed algorithm, M-RARU, addresses this specific gap. It integrates a randomized accept-reject mechanism with uncertainty sampling, providing a principled way to manage the exploration-exploitation trade-off and cost-effectively select a diverse and highly informative training dataset for the student model.

6 Conclusion

In this work, we study the problem of cost-effective model training for large-scale text classification. To address this, we proposed a novel approach that combines Knowledge Distillation with Active Learning for efficient knowledge transfer. This approach effectively transfers a Large Language Model teacher’s knowledge to a smaller student model, creating highly accurate classifiers that achieve a level of performance difficult to obtain with traditional training methods alone. Our proposed method enables knowledge transfer for any student model as long as it can provide a measure of predictive uncertainty.
In addition, we described in detail the key component of this approach, namely, Multi-class Randomized Accept/Reject Uncertainty Sampling (M-RARU), an intelligent query strategy that optimizes the selection of training instances for the LLM teacher. We implemented our approach and experimentally verified its performance with five distinct student models on multiple real-world datasets. The results have shown that our proposed method exhibits substantially better performance when compared to the random sampling baseline while achieving desired classification accuracy. Specifically, M-RARU achieves up to 80% reduction in sample requirements compared to random sampling, substantially reducing the required training data and associated labeling costs while achieving the same, or greater, accuracy as the baseline alternative.

References

[1] T. Loughran and B. McDonald, “When is a liability not a liability? textual analysis, dictionaries, and 10-ks,” The Journal of Finance, vol. 66, no. 1, pp. 35–65, 2011.
[2] A. H. Shapiro, M. Sudhof, and D. J. Wilson, “Measuring news sentiment,” Federal Reserve Bank of San Francisco, Working Paper 2020-01, 2020.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
[4] T. Wu, Y. Wang, and N. Quach, “Advancements in natural language processing: Exploring transformer-based architectures for text understanding,” 2025. [Online]. Available: https://arxiv.org/abs/2503.20227
[5] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[6] J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of statistics, pp. 1189–1232, 2001.
[7] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[8] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” 2015. [Online]. Available: https://arxiv.org/abs/1503.02531
[9] A. Ghorbani, J. He, J. Ma, A. Pagnoni, and A. Anandkumar, “A survey of knowledge distillation in natural language processing,” 2020.
[10] R. Tang, Y. Lu, L. Liu, L. Jiang, X. Liu, and J. Han, “Distilling task-specific knowledge from bert into simple neural networks,” in Proceedings of the 3rd Workshop on Neural Generation and Translation, 2019, pp. 153–159.
[11] P. shuai Ren, P. bo Wang, C. yuan Zhang, J. chen Gu, X. dan Liang, G. nan Dong, and E. jin Zhou, “A survey of deep active learning,” 2021.
[12] S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,” in Proceedings of the Seventeenth International Conference on Machine Learning, ser. ICML ’00. Morgan Kaufmann Publishers Inc., 2000, pp. 999–1006.
[13] L. Ein-Dor, A. Halfon, Y. Kantor, Y. Mass, O. Pereg, I. Roth, R. Rinott, S. Shalev-Shwartz, and A. Globerson, “Active-learning for bert: An empirical study,” 2020.
[14] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,” 2019. [Online]. Available: https://arxiv.org/abs/1910.01108
[15] K. Dasgupta, S. Roy, and S. Paul, “Distilling reasoning capabilities from large language models,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Findings, 2022, pp. 4225–4235.
[16] C. Fan, Q. Wu, Y. Zhao, and L. Mo, “Integrating active learning and semi-supervised learning for improved data-driven hvac fault diagnosis performance,” Applied Energy, vol. 356, p. 122356, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0306261923017208
[17] B. Settles, “Active learning literature survey,” University of Wisconsin–Madison, Tech. Rep., 2009.
[18] D. D. Lewis and W. A. Gale, “A sequential algorithm for training text classifiers,” in ACM SIGIR, 1994.
[19] M. Barandas, D. Folgado, R. Santos, R. Simão, and H. Gamboa, “Uncertainty-based rejection in machine learning: Implications for model development and interpretability,” Electronics, vol. 11, no. 3, 2022. [Online]. Available: https://www.mdpi.com/2079-9292/11/3/396
[20] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” in Advances in neural information processing systems, 2017, pp. 4765–4774.
[21] M. T. Ribeiro, S. Singh, and C. Guestrin, “"Why should I trust you?": Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 2016, pp. 1135–1144.
[22] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013.
[23] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in ACL EMNLP, 2014.
[24] Y. Yang, D. Cer, A. Ahmad, M. Guo, J. Law, N. Constant, G. H. Ábrego, S. Yuan, C. Tar, Y. Sung, B. Strope, and R. Kurzweil, “Multilingual universal sentence encoder for semantic retrieval,” in ACL, 2020.
[25] R. J. Brachman, W. W. Cohen, and T. Dietterich, Synthesis Lectures on Artificial Intelligence and Machine Learning, 2012.
[26] X. Ge, Y. Xue, Z. Luo, M. A. Sharaf, and P. K. Chrysanthis, “Request: A scalable framework for interactive construction of exploratory queries,” in IEEE Big Data, 2016.
[27] Y. Xue and M. Hauskrecht, “Robust learning of classification models from noisy soft-label information.” in ECCV, 2016.
[28] X. Ge, X. Zhang, and P. K. Chrysanthis, “Exnav: An interactive big data exploration framework for big unstructured data,” in 2020 IEEE International Conference on Big Data (Big Data), 2020, pp. 503–512.
[29] S. Desai and G. Durrett, “Calibration of pre-trained transformers,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP, 2020.
[30] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” in Journal of Machine Learning Research, vol. 13, no. Feb, 2012, pp. 281–305.
[31] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimization of machine learning algorithms,” in Advances in neural information processing systems, vol. 25, 2012.
[32] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu, “Tinybert: Distilling bert for natural language understanding,” 2019. [Online]. Available: https://arxiv.org/abs/1909.10351
[33] S. Zhang, L. Jiang, and J. Tan, “Cross-domain knowledge distillation for text classification,” Neurocomputing, vol. 509, pp. 11–20, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S092523122201058X
[34] F. D. Palo, P. Singhi, and B. Fadlallah, “Performance-guided llm knowledge distillation for efficient text classification at scale,” 2024. [Online]. Available: https://www.amazon.science/publications/performance-guided-llm-knowledge-distillation-forefficient-text-classification-at-scale
[35] J. Ni, Y. Ma, W. Wang, Q. Chen, D. Ng, H. Lei, T. H. Nguyen, C. Zhang, B. Ma, and E. Cambria, “Adaptive knowledge distillation between text and speech pre-trained models,” 2023. [Online]. Available: https://arxiv.org/abs/2303.03600
[36] Z. Yuan, W. Zhou, and H. Li, “Revisiting knowledge distillation: An inheritance and exploration framework,” 2021. [Online]. Available: https://arxiv.org/abs/2106.05942
[37] H. S. Seung, M. Opper, and H. Sompolinsky, “Query by committee,” in ACM Workshop on Computational Learning Theory, 1992.
[38] A. Freytag, E. Rodner, and J. Denzler, “Selecting influential examples: Active learning with expected model output changes.” in ECAI, 2016.
[39] A. Katharopoulos and F. Fleuret, “Not all samples are created equal: Deep learning with importance sampling,” 2019. [Online]. Available: https://arxiv.org/abs/1803.00942
[40] D. Grangier, S. Fan, S. Seto, and P. Ablin, “Task-adaptive pretrained language models via clustered-importance sampling,” 2025. [Online]. Available: https://arxiv.org/abs/2410.03735
[41] Z. Wang, S. Mueller, E. Faerman, M. Basan, E. Kiciman, and D. M. Pennock, “Active learning for optimal intervention design in causal models,” 2023. [Online]. Available: https://arxiv.org/abs/2306.01222
[42] Q. Zhang, Z.-D. Chen, X.-C. Li, and D.-C. Zhan, “Active knowledge distillation,” 2023. [Online]. Available: https://arxiv.org/abs/2305.15535
[43] D. Kothadiya, N. Vyas, A. Dani, and A. Rajwade, “Task-agnostic active learning for final-layer fine-tuning,” 2022. [Online]. Available: https://arxiv.org/abs/2210.08323
[44] A. Kuznetsov, V. V. Maiorova, R. N. Valeev, A. A. Shelmanov, A. V. Drobintsev, and E. A. Tutubalina, “Active knowledge distillation,” 2021. [Online]. Available: https://arxiv.org/abs/2112.00122
[45] P. yeh Chiang, Y.-T. Lee, C.-H. Kuo, and H. yi Lee, “Active-knowledge-distillation,” https://github.com/pingyehchiang/Active-Knowledge-Distillation, 2020, GitHub repository.
[46] A. Albalak, Y. Elazar, S. M. Xie, S. Longpre, N. Lambert, X. Wang, N. Muennighoff, B. Hou, L. Pan, H. Jeong, C. Raffel, S. Chang, T. Hashimoto, and W. Y. Wang, “A survey on data selection for language models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.16827
