feds · May 2, 2024

Manufacturing Sentiment: Forecasting Industrial Production with Text Analysis

Abstract

This paper examines the link between industrial production and the sentiment expressed in natural language survey responses from U.S. manufacturing firms. We compare several natural language processing (NLP) techniques for classifying sentiment, ranging from dictionary-based methods to modern deep learning methods. Using a manually labeled sample as ground truth, we find that deep learning models--partially trained on a human-labeled sample of our data--outperform other methods for classifying the sentiment of survey responses. Further, we capitalize on the panel nature of the data to train models which predict firm-level production using lagged firm-level text. This allows us to leverage a large sample of "naturally occurring" labels with no manual input. We then assess the extent to which each sentiment measure, aggregated to monthly time series, can serve as a useful statistical indicator and forecast industrial production. Our results suggest that the text responses provide information beyond the available numerical data from the same survey and improve out-of-sample forecasting; deep learning methods and the use of naturally occurring labels seem especially useful for forecasting. We also explore what drives the predictions made by the deep learning models, and find that a relatively small number of words--associated with very positive/negative sentiment--account for much of the variation in the aggregate sentiment index.

Finance and Economics Discussion Series
Federal Reserve Board, Washington, D.C.
ISSN 1936-2854 (Print) ISSN 2767-3898 (Online)

Manufacturing Sentiment: Forecasting Industrial Production with Text Analysis
Tomaz Cajner, Leland D. Crane, Christopher Kurz, Norman Morin, Paul E. Soto, Betsy Vrankovich
2024-026

Please cite this paper as: Cajner, Tomaz, Leland D. Crane, Christopher Kurz, Norman Morin, Paul E. Soto, and Betsy Vrankovich (2024). “Manufacturing Sentiment: Forecasting Industrial Production with Text Analysis,” Finance and Economics Discussion Series 2024-026. Washington: Board of Governors of the Federal Reserve System, https://doi.org/10.17016/FEDS.2024.026.

NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. References in publications to the Finance and Economics Discussion Series (other than acknowledgement) should be cleared with the author(s) to protect the tentative character of these papers.

Manufacturing Sentiment: Forecasting Industrial Production with Text Analysis∗

Tomaz Cajner, Leland D. Crane, Christopher Kurz, Norman Morin, Paul E. Soto, Betsy Vrankovich

April 2024

JEL codes: C1, E17, O14

Keywords: Industrial Production, Natural Language Processing, Machine Learning, Forecasting

∗ All authors are at the Federal Reserve Board of Governors.
We thank the Institute for Supply Management, including Kristina Cahill, Tom Derry, Debbie Fogel-Monnissen, Rose Marie Goupil, Paul Lee, Susan Marty, and Denis Wolowiecki, for access to and help with the manufacturing survey data that underlie the work described by this paper. We are thankful for comments and suggestions from Stephen Hansen, Andreas Joseph, Juri Marcucci, Arthur Turrell, and participants at the Society for Government Economists Annual Conference, the ESCoE Conference on Economic Measurement, the Government Advances in Statistical Programming Conference, the Society for Economic Measurement Conference, and the Nontraditional Data, Machine Learning, and Natural Language Processing in Macroeconomics Conference. The analysis and conclusions set forth here are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors.

1 Introduction

In recent years there has been an explosion of interest in natural language processing (NLP) within finance and macroeconomics. The use of text data to forecast and assist in model estimation is becoming increasingly commonplace. Still, there are many open questions around the use of NLP in empirical work. For example, which of the numerous available methods work best, and work best in specific contexts? Are off-the-shelf tools appropriate, or are there greater returns to specializing models to the data at hand? How useful is text for forecasting real output indicators, such as manufacturing output? What explains the predictions made by complicated NLP models?

This paper addresses these questions, using a novel dataset and a variety of NLP methods ranging from traditional dictionaries to fine-tuned transformer neural networks. Our primary data source is the monthly survey microdata underlying the Institute for Supply Management’s (ISM) Manufacturing Report on Business. The survey is taken by purchasing managers at a representative sample of U.S. manufacturing firms. Part of the survey consists of categorical-response questions about aspects of their current operations, including production, inventories, backlogs, employment, and new orders. The answers to these questions are of the form “worse/the same/better than last month”, and are aggregated into the widely-reported ISM diffusion indexes. But the survey also includes free-response text boxes, where purchasing managers can provide further comments either in general or about specific aspects of their businesses; these comments are a novel source of signal about the economy and our focus in this paper.1

Our first step is to quantify the text into an economically important and interpretable measure. We focus on sentiment, given that waves of optimism and pessimism have historically been linked to business cycle fluctuations (Keynes, 1937).
We begin by evaluating various NLP methods in terms of their ability to correctly classify the sentiment expressed in individual comments. Our context is fairly specific: the data are manufacturing-sector purchasing managers opining about the business outlook for their firm, without much discussion of financial conditions. While there are numerous sentiment classification models available, many were developed with other data in mind, such as social media posts (Nielsen, 2011). Even within economics and finance, most work has focused on finance-related language (Araci, 2019; Correa et al., 2021; Huang et al., 2022). The lack of results for manufacturing-specific datasets motivates our assessment of a variety of NLP techniques.

1 While ISM collects these responses through the survey, this text is confidential and not incorporated into the publicized indexes. A sample of responses is published in the monthly ISM Report on Business (see https://www.ismworld.org/supply-management-news-and-reports/reports/ism-report-on-business/).

One common approach is to count the frequency of words within a sentiment dictionary. Economists initially used positive and negative words from psychology literature, but have since moved on to using domain-specific words (e.g., Correa et al., 2021) and using simple word counts to measure other types of tone, such as uncertainty (see Baker et al., 2016 and Gentzkow et al., 2019). While this method is transparent, it may fail to capture negation and synonyms, and it often requires context-specific dictionaries that may not be available.

More recently developed techniques employ deep learning methods that account for the nuances of language. We focus on variants of BERT (see Devlin et al., 2018), a precursor of popular large language models like ChatGPT. These models are pre-trained: the parameters are set by exposing the model to a large corpus of text (such as the entirety of Wikipedia) and attempting to predict missing words or the relationship between sentences. The pre-trained models can be used to classify sentiment directly, or they can be further trained (“fine-tuned”) on a specific dataset. The latter approach attempts to get the best of both worlds: a solid ability to parse language from the exposure to a large quantity of training data, plus the context-specific nuance from the fine-tuning data.

While deep learning gets enormous attention, it is ex ante unclear whether it should outperform carefully curated dictionaries in our context. Comparing the accuracy of these different methods on a sample of hand-coded comments from our dataset, we find that deep learning does have an advantage on our data, in part because the brevity of the comments means that many comments have no overlap with dictionary terms.
In addition, we find that there is value in specializing the models to our data: the models fine-tuned on our data have the highest sentiment classification accuracy on a hold-out sample. These results point to the advantages of using pre-trained models, as well as carefully specializing them to the task at hand. Our hope is that these results help guide other economists when deciding between NLP approaches.

The sentiment measures based on free-form textual responses in the ISM data aggregate into indexes that closely mirror both the diffusion index based on the responses to the categorical survey and aggregate manufacturing output, as measured by the manufacturing component of industrial production. We further investigate the relationship between the average sentiment expressed by purchasing managers and manufacturing output econometrically. Our baseline forecasting model asks whether sentiment can help forecast manufacturing output and includes (among other controls) some of the ISM diffusion indexes, so the test is whether the sentiment indexes have additional information beyond the ISM categorical responses data. We find that most dictionary-based text variables do not help predict manufacturing output, with the exception of a curated financial stability-specific dictionary. On the other hand, sentiment variables from the deep learning models are predictive of future manufacturing output. Out-of-sample forecasting exercises show that the financial stability dictionary and deep learning techniques significantly reduce the mean squared forecast errors as well. Overall, our results suggest that purchasing managers’ survey responses contain useful forward-looking information, and that sentiment-based measures can improve the accuracy of forecasts of manufacturing output.

The exercises described above rely on a manually-labeled sample of the data, both to assess the accuracy of different methods and to help fine-tune some of the deep-learning based methods. However, the panel microdata allow for a different approach. Since firms are in the survey for multiple months, we can link the text (and other) data from a given month to next month’s firm-level production data. Fitting a model to these data lets us forecast firm-level production using firm-level lagged information. This methodology has two advantages. First, it gives us a much larger training sample size as compared to the manually labeled data. Second, it aligns the training data objective very precisely with the aggregate forecasting objective. On this second point, we do our best when manually labeling data to discern whether the comment is indicative of rising or falling industrial production. But there are plenty of ambiguous cases, so there are some clear advantages to letting the data speak, and seeing what text is actually associated with future (firm-level) changes in production. We find that fine-tuning in this way is competitive with using the manual labels, and in some cases preferable.
Finally, we make progress on the explainability of deep learning models. These models are notoriously opaque, a consequence of their very high parameter count and extremely nonlinear architecture. This can make it difficult to trust the outputs of such models, as it is not initially clear if the seemingly good predictions are based on solid foundations. We use a standard machine learning interpretability method, Shapley decompositions, to score the contribution of each individual word in each comment. Our results point to a sensible interpretation of our deep learning models. First, the score for each word is roughly constant over time: words do not dramatically change their average connotation (though the underlying deep learning model allows for this). Second, there are fat tails to the scores: most words have scores very close to zero (neutral), with a relatively small number of words having extreme sentiment. For example, the most positive words include “brisk”, “excellent”, “booming”, “improve”, and “efficient”; among the most negative words are “unstable”, “insufficient”, “fragile”, “inconsistent”, and “questionable”. The close-to-neutral words contribute very little to aggregate sentiment, even after accounting for the fact that they occur very frequently. Finally, we find that changes in our aggregated sentiment index are largely accounted for by changes in the frequency of the words with the most extreme (positive or negative) sentiment scores, with the vast majority of words playing little role. Thus, while it may be difficult to manually construct a domain-specific dictionary from scratch, it is possible to extract a fairly simple, interpretable dictionary from the deep learning model.

Our paper contributes to two strands of literature. First, our comparison of NLP techniques for measuring sentiment adds to the growing body of literature incorporating NLP into economic and financial research. Since the seminal work of Tetlock (2007), many studies have used dictionary-based methods (Baker et al., 2016; Hassan et al., 2019; Young et al., 2021; Cowhey et al., 2022), and refined lexicons for specific contexts have been shown to improve performance in measurement and forecasting (Correa et al., 2021; Gardner et al., 2022; Sharpe et al., 2023). Machine learning techniques have also been used to select word lists (Manela and Moreira, 2017; Soto, 2021). More recent papers incorporate more sophisticated machine learning methods to extract the tense and topic of texts (Angelico et al., 2022; Hanley and Hoberg, 2019; Hansen et al., 2018; Kalamara et al., 2022). Advances in NLP, particularly the use of deep learning techniques, have significantly improved sentiment classification (Heston and Sinha, 2017; Araci, 2019; Huang et al., 2022; Bybee, 2023; Jha et al., 2024).
Second, we contribute to the literature on forecasting industrial production (D’Agostino and Schnatz, 2012; Lahiri and Monokroussos, 2013; Ardia et al., 2019; Cimadomo et al., 2022; Andreou et al., 2017). Our analysis of the relationship between sentiment and industrial production provides new insights into the role of unstructured text data in economic forecasting (Marcucci, 2024). By comparing various NLP techniques, we are able to identify which methods are most effective for classifying sentiment and incorporating them into predictive models of industrial production.

The paper most similar to ours is Shapiro et al. (2022), who find that domain-specific dictionaries can improve predictions of human-rated sentiment. We find broadly similar results using a financial stability (rather than a general purpose) dictionary to measure sentiment, but move one step further by providing a robust comparison to large language models. Our paper differs from theirs in two important ways. First, we focus on creating a sentiment index from firm-level data, rather than beginning the analysis at an aggregate macroeconomic level. Instead of measuring consumer sentiment through newspaper articles, we measure manufacturing sentiment from a panel of survey responses. Our unique micro-level data allow us to understand the value of text beyond categorical responses and naturally occurring labels. Second, Shapiro et al. (2022) compare lexicon-based sentiment approaches only to baseline BERT, which at the time was the most developed transfer-learning based model. We also consider newer deep learning models based on BERT, particularly those fine-tuned on domain-specific and naturally occurring data. We apply interpretability techniques to these ‘black box’ models and show that aggregate sentiment indexes derived from deep learning hinge on the frequencies of relatively few words.

The remainder of the paper is structured as follows. Section 2 presents our data. Section 3 reviews how we measure sentiment from the textual survey data and Section 4 overviews the resulting indexes. Section 5 presents the empirical strategy and findings, and Section 6 evaluates the mechanisms through which firm survey responses predict industrial production. Section 7 concludes.

2 Data

The primary data for this study come from the Institute for Supply Management (ISM). Each month, ISM conducts a survey of purchasing managers from a sample of manufacturing firms in the United States.2 Diffusion indexes based on the responses (described below) are published very rapidly, and are closely watched by markets. As highlighted in Bok et al. (2018), not only does such survey data provide important signal about the state of the economy, but the ISM data in particular provides the “earliest available information for the national economy on any given quarter”. In addition, the ISM data have a long time series, which is conducive to time-series modeling.3 The timeliness and relevance of the data motivates our exploration of the free-response text.
2 ISM also surveys non-manufacturing firms and hospitals separately.

3 ISM series extend back to 1948, but most statistical analyses use data that starts in 1972.

The ISM survey includes a series of questions about the respondents’ operations, including their production levels, new orders, backlog, employment, supplier delivery times, input inventories, exports, and imports. These questions have a categorical response, where the purchasing managers specify whether these metrics have increased, decreased, or stayed the same between last month and the current month. The categorical responses are aggregated into publicly-released diffusion indexes, discussed more below. In addition to the categorical response, purchasing managers can provide further explanation in accompanying text boxes. There are free response questions accompanying nearly every categorical question, asking for the reason for the response. In addition there is a “General Remarks” field at the beginning, where the respondent can put any general remarks they wish. Ten to twelve of these text responses are featured in the ISM’s data release to provide context for the diffusion indexes, but otherwise are not released publicly.

The ISM manufacturing survey dates back to the 1930s. The dataset we analyze covers firm-month observations from November 2001 to January 2020. Most recently, the sample covers roughly 350 responses per month. The dark-shaded area of Figure 1 shows the percentage of firms in the sample with text responses over time. The figure illustrates that the majority of respondents provide text in addition to their quantitative survey answers. The black line in Figure 1 presents the average word count over the sample period. The word counts range from 10 to 33 words on average per month. The mean word count appears to fluctuate over the business cycle and jumps dramatically in 2018. The sudden increase in word count in 2018 is mostly due to heightened tensions surrounding trade policy at the time. Indeed, after removing responses that contain the word “tariff,” we observe a smoother increase in word counts (see Figure A1 in the appendix for further details).

Table 1 provides a summary of the text responses. Nearly 49 percent of the general remarks sections contain text, while the next most common sections containing text are those related to employment, production, and new orders. The last row shows statistics for all the text fields concatenated together: 69 percent of firm-month observations have any text at all, and the text is about 17 words long on average.
The average word count is highest for the General Remarks section, with an average of 8 words used in these responses. When considering only those responses that contain text, the average word count for the General Remarks section increases to 16 words.

Turning from ISM’s survey microdata, we use several time series in our forecasting exercises. Our focus is on forecasting the manufacturing industrial production (IP) index. We use real-time data on the right-hand side, reflecting what policymakers knew at the time, and forecast the fully revised series. In addition to IP series, we use the ISM diffusion indexes as regressors. The diffusion indexes are aggregations of the categorical response questions in the survey. For example, the production diffusion index is a weighted average of the responses to the production question (paraphrasing, “Is production higher/the same/lower than last month?”), with the “Higher” responses getting weight 100, “Same” responses getting weight 50, and “Lower” responses getting weight 0. The formula for the diffusion index in period $t$, with $N_t$ total firms responding, is shown in equation (1):

$$D_t = \frac{1}{N_t} \sum_{i=1}^{N_t} \left[ 100 \cdot \mathbf{1}\{\text{Response } i \text{ is “Higher”}\} + 50 \cdot \mathbf{1}\{\text{Response } i \text{ is “Same”}\} \right] \tag{1}$$

These diffusion indexes have values between 0 and 100, with 0 indicating that all respondents say things are worse and 100 indicating that all respondents say things are better.4 ISM publishes indexes for each question, as well as a “PMI Composite”, which is an equally weighted average of the diffusion indexes for new orders, production, employment, supplier deliveries, and inventories.

3 Measuring Sentiment

Our goal is to extract useful information from the ISM survey text responses. We focus on sentiment analysis: measuring the extent to which the purchasing manager’s response is positive or negative. Even focusing on sentiment analysis, the wide range of NLP techniques available can make it challenging to choose an appropriate method. In this section we discuss the methods we use, leaving a complete description of the approaches to the Appendix.

3.1 Dictionaries

One of the simplest methods for measuring sentiment is dictionary-based analysis, which involves counting the frequency of a predetermined list of sentiment words in the text. We use common sentiment dictionaries such as the Harvard (Tetlock, 2007) and AFINN (Nielsen, 2011) word lists. However, we also recognize that certain words that may be considered negative in other contexts may not be considered negative in the context of finance, such as “taxing” or “liability”. As such, we also apply finance-specific word lists, including the sentiment word list from Loughran and McDonald (2011) (henceforth, “LM”) and the financial stability word list from Correa et al. (2021). For all dictionaries, we score comments on a scale of -1 to +1, using the percent of total words in the comment that are positive less the percent of total words that are negative.
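As an illustration only, the diffusion index in equation (1) and the dictionary score just described can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code used in the paper: the function names, the whitespace tokenization, and the two toy word lists are hypothetical, and the "percent" of positive and negative words is treated as a share so that the score falls between -1 and +1.

```python
from typing import Iterable, Set

# Weights from the text: "Higher" = 100, "Same" = 50, "Lower" = 0.
DIFFUSION_WEIGHTS = {"higher": 100.0, "same": 50.0, "lower": 0.0}

# Hypothetical toy word lists, NOT taken from any of the cited dictionaries.
TOY_POSITIVE = {"strong", "improve", "brisk"}
TOY_NEGATIVE = {"weak", "unstable", "insufficient"}


def diffusion_index(responses: Iterable[str]) -> float:
    """Average of the response weights across firms, as in equation (1)."""
    values = [DIFFUSION_WEIGHTS[r.strip().lower()] for r in responses]
    if not values:
        raise ValueError("at least one response is required")
    return sum(values) / len(values)


def dictionary_score(comment: str, positive: Set[str], negative: Set[str]) -> float:
    """Share of positive words minus share of negative words, in [-1, 1]."""
    # Assumed tokenization: split on whitespace, strip punctuation, lowercase.
    words = [w.strip(".,;:!?\"'()").lower() for w in comment.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    n_pos = sum(w in positive for w in words)
    n_neg = sum(w in negative for w in words)
    return (n_pos - n_neg) / len(words)


if __name__ == "__main__":
    print(diffusion_index(["Higher", "Same", "Lower", "Higher"]))
    print(dictionary_score("Orders are strong but supply is unstable and weak.",
                           TOY_POSITIVE, TOY_NEGATIVE))
    print(dictionary_score("slight up-tick in production",
                           TOY_POSITIVE, TOY_NEGATIVE))
```

With these toy lists, the last comment contains no dictionary word and therefore scores exactly zero, which illustrates why short comments can be hard for dictionary methods.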
When we require discrete classifications, as in Figure 2, we classify the comment as positive if the score is greater than zero, negative if it is less than zero, and neutral if it equals zero.

4 The responses are “better”, “same”, or “worse” for the new orders question, production, and new export orders. For employment, inventories, prices, and imports the responses are “higher”, “same”, and “lower”. For backlogs the choices are “greater”, “same”, and “less”.

3.2 Deep Learning Models

Another approach to sentiment analysis involves fitting a model to the data. We try several variations on this theme. Unlike the dictionary methods, all of these approaches require labeled data: a sample of observations that have already been classified, which is used to fit the model and classify the remaining observations. We create a labeled dataset from a randomly selected subsample of 1,000 responses with text from the individual questions.5 Each response was classified for sentiment by two economists using the following question as a guide: “Is this comment consistent with manufacturing IP rising month over month?” The classifications were either positive, neutral, or negative, where “neutral” includes cases where it is impossible to determine the sentiment. Both economists agreed on the sentiment classification for roughly 700 cases. This subsample is further split into a “training” dataset, used to fit the models, and a “test” dataset, used to assess the relative merits of the models.6

Deep learning models have gained popularity in recent years, driven by their impressive performance on language-related tasks. Much of the progress has occurred within a particular class of deep learning models called transformers (see, e.g., Devlin et al., 2018, Radford et al., 2018, Chung et al., 2022, Ouyang et al., 2022, and Touvron et al., 2023). The defining feature of transformers, relative to other neural network architectures, is a mechanism called attention: a way to let the words within a sentence interact, allowing the context of a particular word to influence its meaning. A full explanation of transformers and the attention mechanism is beyond the scope of this paper, but we do provide a brief summary in the Appendix.
The important points are that (unlike dictionaries and bag-of-words approaches) transformers take into account interactions between words, word order, and context-dependent meanings (polysemy).

One notable transformer model is “BERT”, or Bidirectional Encoder Representations from Transformers, developed by Devlin et al. (2018). It is important to note that BERT is a pre-trained model: Devlin et al. (2018) specified the architecture and then trained the model on a corpus including the entirety of (English) Wikipedia and a number of books.

5 Note that the categorical responses can be considered a kind of label for the corresponding text. In Section 4.1 we investigate how well models can predict the categorical response from the associated text.

6 The test data consist of observations from 2018m1 to 2020m1 and are not used by any of the models during training.
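The construction of the labeled sample and the date-based split in footnote 6 can be made concrete with a short sketch. It assumes, as one reading of the text, that only the comments on which both economists agree are kept, and that the test set is everything from 2018m1 onward; the record layout and the names `LabeledComment` and `agreed_train_test` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LabeledComment:
    month: str    # survey month as "YYYY-MM" (assumed format)
    text: str
    label_a: str  # first economist: "positive", "neutral", or "negative"
    label_b: str  # second economist, same label set


def agreed_train_test(
    rows: List[LabeledComment], test_start: str = "2018-01"
) -> Tuple[List[LabeledComment], List[LabeledComment]]:
    """Keep comments where both labels agree, then split by survey month."""
    agreed = [r for r in rows if r.label_a == r.label_b]
    # "YYYY-MM" strings sort chronologically, so string comparison suffices.
    train = [r for r in agreed if r.month < test_start]
    test = [r for r in agreed if r.month >= test_start]
    return train, test


if __name__ == "__main__":
    sample = [
        LabeledComment("2015-03", "orders are brisk", "positive", "positive"),
        LabeledComment("2016-07", "hard to say", "neutral", "negative"),
        LabeledComment("2018-01", "demand is fragile", "negative", "negative"),
        LabeledComment("2019-11", "backlog improving", "positive", "positive"),
    ]
    train, test = agreed_train_test(sample)
    print(len(train), len(test))
```

Splitting by date rather than at random keeps the test period strictly after the training period, which matches the out-of-sample spirit of the forecasting exercises.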

The model is large by the standards of the economics literature, with roughly 110 million parameters. We use several versions of BERT in this paper.

By default, the off-the-shelf BERT model produces sentence embeddings: given a sentence-length piece of text, it returns a 768-dimensional vector representing the sentence. Intuitively, sentences with similar meaning ought to have embedding vectors close to each other. BERT can be used as a classifier by adding an additional layer on top of it, essentially a logistic regression that takes the embedding vector as the input and returns class probabilities. Note that this requires some labeled data to fit the logit. BERT saves researchers the cost of training a large language model, while still allowing them to adapt the model for their specific needs, a practice known as “fine-tuning”.

In the financial domain, specialized BERT models have been developed to account for the unique characteristics of financial and economic text. Two prominent examples are Huang et al. (2022) (which we refer to as FinBERTv1) and Araci (2019) (which we refer to as FinBERTv2). FinBERTv1 uses the BERT architecture but is trained from scratch on SEC filings, equity reports, and earnings conference call transcripts. The sentiment classification layer is trained on the human-labeled Analyst Tone dataset (Huang et al., 2014).7 FinBERTv2 was initialized with the pretrained BERT weights and further pre-trained on a corpus of Reuters news articles, which tend to focus on financial news. The sentiment classification layer was trained on the human-labeled Financial PhraseBank dataset from Malo et al. (2014).8 While FinBERTv1 and FinBERTv2 can do a good job parsing financial news and regulatory filings, our data are more focused on topics like order backlogs, production difficulties, inventories, and delivery times, which are not commonly found in financial corpora.
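The classification layer mentioned above (a logistic regression on top of the sentence embedding) can be illustrated with a small sketch. With three sentiment classes, the natural reading is a multinomial logit, that is, a linear map followed by a softmax; that reading, the function names, and the toy 4-dimensional embedding (standing in for BERT's 768 dimensions) are assumptions made for illustration, and the weights are made up rather than fitted.

```python
import math
from typing import List, Sequence

LABELS = ("negative", "neutral", "positive")


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(
    embedding: Sequence[float],
    weights: Sequence[Sequence[float]],  # one row of coefficients per class
    biases: Sequence[float],             # one intercept per class
) -> List[float]:
    """Multinomial-logit head: linear scores per class, then softmax."""
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


def predict(embedding, weights, biases) -> str:
    """Return the label with the highest predicted probability."""
    probs = class_probabilities(embedding, weights, biases)
    return LABELS[probs.index(max(probs))]


if __name__ == "__main__":
    # Toy 4-dimensional "embedding" and made-up coefficients.
    emb = [1.0, 0.0, 0.5, -0.5]
    W = [[-1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
    b = [0.0, 0.0, 0.0]
    print(class_probabilities(emb, W, b), predict(emb, W, b))
```

In practice the coefficients would be estimated from labeled comments; during fine-tuning, the parameters of the underlying language model are typically updated along with this layer.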
After reviewing the text responses from the ISM survey, we found examples suggesting that FinBERTv1 and FinBERTv2 have some difficulty with the language. For example, the comment “slight up-tick inventory to account for slight up-tick in production” is coded as positive by the economists: it implies increased production, and an increase in input inventories to support that higher level of production. But this passage is classified as neutral by FinBERTv1 and negative by FinBERTv2. These issues motivate our use of the human-labeled dataset to fine-tune or train from scratch our own models. First, we estimate our own transformer model using the training dataset and a relatively small number of parameters. We call this model TF-Small (TF for “transformer”).9 Second, we fine-tune BERT with our manually labeled training examples, and call the resulting model Fine-Tuned BERT: Human Labeled Data. This model benefits from the large size and extensive training of the base BERT model, but is explicitly tuned on the language relevant for our task. As we shall see below, this results in good performance.

7 Specifically, the model is yiyanghkust/finbert-tone from the Huggingface model hub, a classification fine-tuned version of “FinBERT-FinVocab uncased” in Huang et al. (2022).

8 This model is ProsusAI/finbert on the Huggingface model hub.

An alternative to fine-tuning on manually-labeled data is to capitalize on the panel structure of the firm-level responses, specifically with regards to the text and the future reporting of the categorical variable measuring production. We estimate a model that uses firm $f$’s text in month $t$ to predict the value of firm $f$’s production in month $t+1$ (i.e., $\text{text}_{f,t}$ predicting whether the firm reported $\text{PROD}_{f,t+1}$ as higher/lower/same). This strategy provides two benefits as compared to the previous approach. First, we obtain a larger dataset for fine-tuning, without having to manually label observations. Second, this approach directly aligns with our ultimate forecasting exercise, where we want to use the text at time $t$ to predict aggregate manufacturing production. We label this BERT-based model fine-tuned on the production categorical responses as Fine-Tuned BERT: Production Data.10

Overall, we propose nine models for sentiment classification. The four dictionary-based methods are the Harvard, AFINN, Loughran and McDonald (2011), and financial stability (Correa et al., 2021) dictionaries, and the five transformer models are FinBERTv1, FinBERTv2, TF-Small, Fine-Tuned BERT: Human Labeled Data, and Fine-Tuned BERT: Production Data.

4 ISM Text-Derived Sentiment Indexes

Before evaluating the marginal value of text for forecasting aggregate series, we document the accuracy of each sentiment model on the microdata and compare our preferred sentiment indexes to the aggregate ISM composite purchasing manager index (PMI) and to US manufacturing production.
9We use the Keras library to build a simple encoder-only transformer model with input embedding dimension of 12 and an output sentiment layer with similar dimensions. 10The training data for Fine-Tuned BERT: Production Data includes only firm-level responses with text and for firms that appear for at least two consecutive years in the sample. The target variable for the fine-tuning is the production categorical response in month t+1. 11
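To make the construction of the naturally occurring labels behind Fine-Tuned BERT: Production Data concrete, the following is a minimal sketch of pairing a firm's text in month t with the same firm's production response in month t+1. The record layout, the integer month index, and the function name `build_training_pairs` are illustrative assumptions, not the authors' code.

```python
from collections import namedtuple

# Hypothetical record layout: one row per firm-month survey response.
Response = namedtuple("Response", ["firm", "month", "text", "prod"])

def build_training_pairs(responses):
    """Pair firm f's text in month t with the same firm's PROD answer in month t+1."""
    # Index production answers by (firm, month) for direct lookup.
    prod_by_key = {(r.firm, r.month): r.prod for r in responses}
    pairs = []
    for r in responses:
        if not r.text:  # responses without text cannot serve as model inputs
            continue
        target = prod_by_key.get((r.firm, r.month + 1))
        if target is not None:  # keep only firms observed again in the next month
            pairs.append((r.text, target))
    return pairs

# Toy panel; months are integer indexes (e.g., months since the sample start).
panel = [
    Response("A", 1, "orders picking up", "same"),
    Response("A", 2, "", "higher"),
    Response("A", 3, "business is slow", "higher"),
    Response("B", 1, "demand is weak", "same"),
    Response("B", 3, "steady", "same"),
]
pairs = build_training_pairs(panel)
print(pairs)
```

Only the first response of firm A yields a training pair here, because it is the only one that has both text and an observed production answer one month later.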

4.1 Comment-Level Classification Results

Figure 2 and Table 2 present accuracy information for each model as compared to the human-labeled test dataset (from Section 3.2).11 The confusion matrices in Figure 2 tabulate the percent of observations with a given human “true” classification (which varies across rows) and the model-based predicted classification (which varies across columns) for each model. Overall accuracy is reported at the top of each matrix. Table 2 presents overall accuracy rates and also provides the ratio of neutral class predictions to the true number of neutral comments.

We begin by considering whether the categorical response for each comment is predictive of the human-rated sentiment. For example, if the human label for a new orders textual response is positive, we would like to know how often the categorical response was that new orders are higher than last month. We find an overall accuracy of 85.6 percent (upper left block of Figure 2), suggesting that the sentiment in the text responses, as measured by the manual label, is highly correlated with the categorical response, but not fully redundant. That is, the textual responses and the categorical responses may each contain information related to economic activity.

The Harvard, AFINN, Loughran-McDonald, and Stability dictionaries all have accuracy scores below 30 percent (as seen in Figure 2 and Table 2). The low accuracy arises because they predict over half of responses to be neutral, while in reality only a couple percent of responses are neutral. The last column of Table 2 shows that the dictionaries predict roughly 37 to 50 times as many neutrals as actually appear in the data. Dictionary-based methods can only produce a positive or negative classification if either positive or negative words appear in the text, and the short comments in our data often do not contain any of the words in the dictionaries.
FinBERTv1 and FinBERTv2 perform better, with accuracies of 70.3 percent and 56.8 percent, respectively. Both of these models are better able to classify actual neutral responses, but both tend to over-predict neutral classifications (though less so than the dictionaries). The best performing model is Fine-Tuned BERT: Human Labeled Data, with an accuracy score of 82.9 percent. The improved performance is largely due to having seen examples of manufacturing-specific text, as well as survey-specific examples of positive, negative, and neutral responses. The TF-Small model scores lower, with an accuracy of 67.6 percent.

11 While the test dataset contains 141 observations, we report predictions for only the 111 observations for which a categorical response is provided. Recall that the General Remarks response is not associated with a categorical question. This ensures the evaluation sample for the categorical response is similar to the evaluation sample for the sentiment models.
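The evaluation metrics reported in Figure 2 and Table 2 can be sketched as follows. This is an illustrative implementation on made-up labels; expressing each confusion-matrix cell as a percent of all observations (rather than of each row) is an assumption, as are the function names.

```python
def accuracy(true_labels, predicted_labels):
    """Percent of observations whose predicted class matches the human label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * matches / len(true_labels)

def neutral_ratio(true_labels, predicted_labels, neutral="neutral"):
    """Number of predicted neutrals divided by the number of true neutrals."""
    true_neutral = sum(t == neutral for t in true_labels)
    predicted_neutral = sum(p == neutral for p in predicted_labels)
    return predicted_neutral / true_neutral

def confusion_matrix(true_labels, predicted_labels, classes):
    """Percent of all observations in each (true row, predicted column) cell."""
    counts = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    n = len(true_labels)
    return {t: {p: 100.0 * counts[t][p] / n for p in classes} for t in classes}

# Hypothetical labels for five comments.
classes = ["negative", "neutral", "positive"]
truth = ["positive", "positive", "negative", "negative", "neutral"]
preds = ["positive", "neutral", "negative", "neutral", "neutral"]

acc = accuracy(truth, preds)
ratio = neutral_ratio(truth, preds)
cm = confusion_matrix(truth, preds, classes)
print(acc, ratio, cm["positive"]["neutral"])
```

A classifier that over-predicts the neutral class, as the dictionaries do, shows up as a neutral ratio well above one and as mass in the "neutral" column of the confusion matrix.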

In Figure 2, we see that Fine-Tuned BERT: Production Data has the lowest performance, with an accuracy of 4.5 percent on the test dataset. This result can be attributed to a difference between the training and test samples. Nearly 55 percent of observations in the data used to train Fine-Tuned BERT: Production Data have (next month’s) production flat, often with no associated text. However, our test data are derived from observations with text and include each type of question (production, new orders, etc.) as its own observation; as a result, only 5 percent of the observations have the outcome of “flat”/neutral. This small share of neutral observations is also reflected in the human labels, where neutral labels are similarly rare. As a consequence, the model disproportionately labels text as “flat”/neutral in this test sample, even though it is well calibrated on the training data.

4.2 Sentiment Indexes and Aggregates

We next run the nine sentiment classifiers on all available observations, and average the sentiment scores by month. Here the scores are, in the case of the dictionaries, the fraction of positive words minus the fraction of negative words. For the transformer-based models, the scores are the predicted probability of the text being positive less the probability of the text being negative. Across all models the firm-month level scores are between -1 and 1; these are averaged by month to get aggregate sentiment. We seasonally adjust the series using X-13 on the default settings. The forecasting models in Section 5 use these seasonally adjusted time series.12

Table 3 collects the summary statistics for the monthly series. The dictionary-based monthly averages tend to have a mean close to zero and a small standard deviation, a result of the infrequent usage of words appearing in the dictionaries.
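The two comment-level scores and the monthly averaging described in Section 4.2 can be sketched as below. The whitespace tokenization, the tiny word lists, and the function names are assumptions made for illustration, and the X-13 seasonal adjustment step is not shown.

```python
def dictionary_score(text, positive_words, negative_words):
    """Fraction of positive words minus fraction of negative words."""
    words = text.lower().split()
    if not words:
        return 0.0
    positive = sum(w in positive_words for w in words)
    negative = sum(w in negative_words for w in words)
    return (positive - negative) / len(words)

def net_probability(p_positive, p_negative):
    """Transformer-style score: Pr[positive] minus Pr[negative]."""
    return p_positive - p_negative

def monthly_index(scored):
    """Average comment-level scores by month; `scored` holds (month, score) pairs."""
    totals = {}
    for month, score in scored:
        running_sum, count = totals.get(month, (0.0, 0))
        totals[month] = (running_sum + score, count + 1)
    return {m: s / n for m, (s, n) in sorted(totals.items())}

# Hypothetical inputs.
score = dictionary_score("demand is strong", {"strong"}, {"weak"})
net = net_probability(0.7, 0.1)
index = monthly_index([("2019-01", 0.5), ("2019-01", -0.5), ("2019-02", 0.25)])
print(score, net, index)
```

Both scores lie between -1 and 1 by construction, so the monthly averages do as well.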
In contrast to the dictionary-based series, the transformer models have larger (in absolute value) means and standard deviations, largely due to the predicted probabilities that appear closer to the extremes of 1 and -1.

Table 4 shows the correlation matrix between our main variable of interest, the growth rate of manufacturing industrial production (IP Growth) as measured by the Federal Reserve Board’s Industrial Production statistics,13 and our sentiment measures. IP Growth has a correlation of nearly 0.30 with each of the dictionary-based sentiment measures and with TF-Small, while the other deep learning based sentiment indexes exhibit stronger correlations, above 0.40. The index with the highest correlation with IP Growth is Fine-Tuned BERT: Production Data. This might seem puzzling given the poor performance on human-labeled sentiment in Figure 2. However, our aggregate monthly sentiment measure is calculated as the percentage of positive responses less the percentage of negative responses. While this model favors neutral predictions for any given text, it labels text as positive or negative in such a way that the net sentiment (percentage positive minus percentage negative) correlates highly with the best performing model for matching human sentiment, Fine-Tuned BERT: Human Labeled, as indicated by the correlation of 0.94 in Table 4. We focus on both Fine-Tuned BERT models, since one scores highly on the human-labeled benchmark (Fine-Tuned BERT: Human Labeled) and the other correlates highly with IP Growth, our main variable of interest (Fine-Tuned BERT: Production Data).

12 Note that the comment-level sentiment scores all range between -1 and +1.
13 These data are released monthly in the Federal Reserve Board’s G.17 statistical release on industrial production and capacity utilization, available at https://www.federalreserve.gov/releases/g17/.

Figure 3 presents a plot of the two fine-tuned BERT models and the ISM PMI aggregate. Note that the ISM PMI is on a different y-axis than the net sentiment indexes. It is apparent that the sentiment indexes capture much of the dynamics of the ISM PMI. Recall that the PMI is a composite of the categorical responses, while the sentiment indexes include no direct information from the categorical responses. It is interesting that indicators based on text alone can recreate the broad dynamics of the PMI. This is reinforced by high correlations: 0.76 for the human-labeled BERT model and 0.86 for the production-labeled BERT model. On the other hand, it is perhaps not surprising that the series comove, as the textual responses are a supplement to the categorical answers.

The ISM manufacturing survey is by definition a report on the manufacturing sector by purchasing managers. It therefore makes sense to present the text-derived sentiment measures in a figure with manufacturing industrial production. Figure 4 presents the two fine-tuned BERT models alongside the growth rate of manufacturing industrial production.
Manufacturing production at the monthly frequency is quite volatile. Despite this volatility, the two fine-tuned BERT models exhibit meaningful correlations with it, at 0.42 and 0.48 for the human- and production-tuned BERT models, respectively. We now turn to the more formal task of predicting activity with the text-derived sentiment measures.

5 Empirical Results

Our forecasting exercises focus on predicting monthly manufacturing output growth. The real-time data flow is important to understand, and is as follows:

• The ISM data for a reference month t are typically released on the first business day of month t+1.
• The first IP data for reference month t are typically released around the 15th of month t+1.
• The IP estimates for a reference month t are revised over the subsequent months and years, as more product data become available and benchmark revisions are incorporated. The monthly revisions are part of the subsequent month’s IP releases, so the first monthly revision to IP for reference month t is released around the 15th of month t+2, the second revision occurs around the 15th of t+3, etc.

Our baseline forecasting model is as follows:

$$\Delta IP_t^{current} = \alpha + \beta_1 \Delta IP_{t-1}^{t*} + \beta_2 \Delta IP_{t-2}^{t*} + \beta_3 \Delta IP_{t-3}^{t*} + \delta x_t^{t*} + \epsilon_t \qquad (2)$$

where $\Delta IP_t^{current}$ is the fully revised, current-vintage growth rate of manufacturing output in month t. The superscript t* denotes a variable as reported on the eve of the month t G.17 IP data release: the real-time vintage relevant for forecasting $\Delta IP_t$ just prior to its first print. Thus $\Delta IP_{t-1}^{t*}$ is the estimate of month t−1 from the initial month t−1 data release (released around the middle of month t), and $\Delta IP_{t-2}^{t*}$ ($\Delta IP_{t-3}^{t*}$) is the (twice) revised estimate of month t−2 (t−3) from the month t−1 data release (again, released around the middle of month t). The vector $x_t^{t*}$ collects the ISM metrics for month t. These are available well before the month t IP data, and so may be particularly useful for forecasting. In the baseline model, $x_t$ contains only the composite PMI index, an average of five of the ISM diffusion indexes.14

Table 5 shows in-sample results for the baseline model as well as versions that add sentiment indexes. In column (1), we see that the baseline model has an R-squared of 0.219 with a positive and statistically significant relationship between PMI and IP growth. The strong relationship between PMI and IP growth illustrates the importance of the ISM categorical data as a leading indicator for production.
The subsequent columns show that the aggregate sentiment indexes based on the LM, Harvard, and AFINN dictionaries are not statistically significant, and only lead to small improvements in R-squared. The only dictionary-based index that leads to a positive and significant effect on IP is the Stability dictionary, shown in column (5). The relatively good performance of the Stability dictionary is likely due to the fact that it includes several words related to the business cycle whose frequency would coincide with declines in the manufacturing sector, such as “contagion”, “recession”, and “spillover” for negative words and “healthy”, “improve”, and “resilient” for positive words.

14 In unreported regressions, we forecast future industrial production, $\Delta IP_{t+1}$, after IP has been published for month t. The specification is: $\Delta IP_{t+1}^{current} = \alpha + \beta_1 \Delta IP_t^{t*} + \beta_2 \Delta IP_{t-1}^{t*} + \beta_3 \Delta IP_{t-2}^{t*} + \delta x_t^{t*} + \epsilon_{t+1}$. The results hold, with slightly less significance for the transformer-based models in the out-of-sample exercise.

Moving to columns (6)-(10), all five transformer-based sentiment measures are positively and significantly related to manufacturing growth. The largest gain in R-squared is seen in column (10) with Fine-Tuned BERT: Production Data. This model’s fine-tuning task is to predict future firm-level (ISM-derived) production based on firm-level text, so it is reassuring that aggregate sentiment from the model predicts aggregate output.15 The measure significantly improves on our baseline model, with a nearly 3 percentage point increase in R-squared. We also observe that the PMI index loses some of its economic and statistical significance when we include Fine-Tuned BERT: Production Data. This can be attributed to the fact that this sentiment index targets future firm-level production in training, and so overlaps with the PMI measure, which aggregates firm-level production into a diffusion index.

Next, we assess the out-of-sample performance of the sentiment indexes. We use two setups: the first treats the dates 2001m11-2017m12 as in-sample, and 2018m1-2020m1 as out-of-sample. The second focuses on the Global Financial Crisis (GFC), using 2001m11-2007m11 as in-sample, and the NBER-dated recession as the out-of-sample period: 2007m12-2009m6. In both cases, we respect the out-of-sample dates both when fitting the forecasting regressions and when training our upstream deep learning models. In other words, for these exercises we only use labeled observations from the in-sample dates (i.e., manually labeled sentiment observations from 2001m11-2017m12 to predict 2018m1-2020m1 industrial production; 2001m11-2007m11 to predict 2007m12-2009m6).
Table 6 shows the results from the expanding window exercise of forecasting IP growth over the period 2018m1-2020m1, using Diebold-Mariano (DM) tests to compare the forecasts of the baseline model with those of the text-augmented models. Each cell displays the out-of-sample RMSE and DM test statistic. In the top row, which reports our preferred specification, we see that the LM, Harvard, and AFINN dictionary-based text measures reduce the RMSE slightly, though statistically insignificantly. The Stability dictionary reduces the RMSE by about 9 percent. Similarly, the transformer-based sentiment measures reduce the out-of-sample forecast errors, with FinBERTv1 and both Fine-Tuned BERT models statistically significant at the 5 percent level. The other rows in the table show alternative specifications: only including the PMI index as a control, only using lagged manufacturing growth as a control, replacing the PMI composite with new orders, and including several ISM diffusion indexes as controls. In nearly all cases, the Stability dictionary and transformer-based models significantly reduce the out-of-sample RMSE. In the strictest specification, including three revised lags of IP growth and several ISM measures, we observe reductions in the out-of-sample RMSE of nearly 2 percent for FinBERTv1 and Fine-Tuned BERT: Human Labeled Data, significant at the 1 percent level. The largest gain in forecasting is achieved by Fine-Tuned BERT: Production Data, with a reduction in RMSE of nearly 8 percent.

15 Note that the ISM data is not an input to IP, so the datasets are independent.

We extend our forecasting analysis by considering the period leading into a recession. Specifically, we rerun the out-of-sample exercise with 2001m11 to 2007m11 as the in-sample period, and 2007m12 to 2009m6 as the out-of-sample period. Importantly, we ensure that TF-Small and the Fine-Tuned BERT models are trained using only data from 2001m11 to 2007m11. Table 7 shows the results. For dictionary-based methods, the LM and Harvard variables are slightly significant in reducing the RMSE in a few specifications. However, given that the strictest specification including lags and other ISM variables does not lead to any significant reductions in RMSE, we conclude that the dictionary-based methods do not help with forecasting during the GFC. On the other hand, we find that TF-Small and both Fine-Tuned BERT models economically and statistically improve out-of-sample forecast errors during the GFC, with an RMSE reduction in the range of 11-17 percent for the strictest specification with three lags and several ISM variables. The forecast errors for the two FinBERT models are not statistically or economically different from those of the baseline model. Overall, we find that sentiment variables generated from the transformer models, particularly those trained on hand-labeled and naturally occurring data, are best for improving forecasting performance during the GFC.
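The forecast comparisons in Tables 6 and 7 rest on out-of-sample RMSEs and Diebold-Mariano statistics. The sketch below shows one common way to compute both from two series of one-step-ahead forecast errors. The squared-error loss and the plain (non-HAC) variance estimate are assumptions, since the text does not state which variant is used, and the error values are made up.

```python
import math

def rmse(errors):
    """Root mean squared forecast error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def diebold_mariano(errors_baseline, errors_augmented):
    """DM statistic under squared-error loss for one-step-ahead forecasts.

    Positive values indicate that the augmented model has the smaller
    squared errors on average. No autocovariance correction is applied,
    which is an assumption suited to one-step-ahead forecasts only.
    """
    d = [b * b - a * a for b, a in zip(errors_baseline, errors_augmented)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)
    return mean_d / math.sqrt(var_d / n)

# Hypothetical forecast errors for four out-of-sample months.
baseline_errors = [1.0, -2.0, 1.0, 2.0]
augmented_errors = [1.0, -1.0, 0.0, 1.0]

rmse_base = rmse(baseline_errors)
rmse_aug = rmse(augmented_errors)
dm_stat = diebold_mariano(baseline_errors, augmented_errors)
print(rmse_base, rmse_aug, dm_stat)
```

In an expanding window exercise, each error would come from re-estimating the forecasting regression on all data available up to the forecast date and then predicting the next month.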
6 Interpretation

The results in Section 5 suggest that the sentiment indexes, and fine-tuned versions of BERT in particular, provide additional forecasting power. However, BERT is very much a black box, and it is far from obvious what drives its behavior. Machine learning models can easily make predictions based on irrelevant or unintuitive data features, an outcome we want to avoid (or at least understand). In this section we provide supporting evidence to help interpret the BERT results. We draw on the active field of research in interpretable machine learning, where many methods have been proposed to deal with these issues. We will use one such method, Shapley decompositions, to interpret our results.

Our goal is to boil down the BERT-based predictions into simple lists of relevant words with associated scores. If we can accomplish this, we will have a dictionary that is both interpretable and approximates the BERT-based models. There are several obstacles. First, it is not immediately clear how to calculate the marginal contribution of a given word to the net positive score of a comment. Second, the marginal contribution of a word can vary across comments and across time: BERT allows for context to influence the meaning of words. Finally, it is ex ante unclear whether aggregate changes in the sentiment index reflect changes in the use of many words, or of a relatively small, interpretable group. To address the first challenge we rely on Shapley decompositions (discussed below) to calculate the marginal contributions of words to comment-level scores. On the second point, we find that most words’ contributions do not vary much over time, so we can treat them as approximately constant. Finally, we also find that changes in aggregate sentiment are mostly attributable to changes in the volume of extreme (positive or negative) sentiment words. Taken together, these facts suggest that the BERT-based indexes can indeed be approximated by a dictionary-based approach.

6.1 Shapley Decompositions

Shapley decompositions are used in machine learning to deal with the nonlinear relationships between the dependent variable and independent variables (Lundberg and Lee, 2017), drawing on cooperative game theory results from Shapley (1953). Given (1) an observation, and (2) the prediction of the model, the Shapley decomposition estimates the contribution of each feature to the prediction. Each contribution is relative to a “null value” for the feature; for numeric data the null value might be the mean of the feature in sample. We discuss the null value for text data below.
Roughly speaking, the Shapley decomposition calculates the marginal contribution of switching a given feature from its null value to the observed value, averaging across all possible null/observed permutations for the other features. The averaging across permutations ensures that the resulting contributions have good properties, including additivity: the contributions to the prediction add up to the prediction exactly.

In our context, an observation is a single ISM comment, and the features are the individual words. BERT provides three predictions for each observation: the probability of being in the negative, neutral, and positive classes. Rather than deal with this vector, we calculate the net positive score, Pr[positive class] − Pr[negative class], and use this as the prediction. The net positive score is analogous to the diffusion index formula, and reduces the model output to a single number between -1 and +1.

To understand how the Shapley decomposition operates in our context, consider the example comment “Business continues to be slow”. Fine-tuned BERT predicts this comment is positive with probability 0.078, with a net positive probability of -0.76. The Shapley decomposition replaces subsets of the words with a special token, [MASK].16 BERT interprets [MASK] as meaning that there is a real, unknown word in that place in the comment. BERT continues to make predictions for the class of the comment even when words are masked; these predictions are based on the remaining unmasked words and the positions of the words in the comment. The marginal contribution of the word “slow” can be calculated as the difference between the net positive probability of “Business continues to be slow” and “Business continues to be [MASK]”. However, another plausible estimate of the marginal contribution would be, e.g., the difference between “[MASK] continues to be slow” and “[MASK] continues to be [MASK]”. The Shapley decomposition iterates over the various masking permutations to arrive at an average marginal contribution.17

It is worth noting here that the Shapley decomposition is not a structural explanation, nor does it imply any causal relationship. It is an accounting identity that can be imposed on any model. For our purposes, it is useful for linearizing the relationship between tokens and the aggregate sentiment index.

After running the Shapley decomposition on all the comments, we obtain Shapley scores for each token in each comment. The Shapley scores for the tokens in a given comment add up to the net positive probability for that comment. The contribution of a token can vary across comments, because BERT’s predictions are not a linear function of the tokens. This is part of the advantage of BERT: tokens may have different meanings depending on the context.
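The masking logic can be illustrated with an exact Shapley computation on the example comment. In this sketch, `toy_net_score` is a hypothetical stand-in for the net positive probability from fine-tuned BERT, with made-up weights and one interaction term; nothing here reproduces the model's actual outputs, and the SHAP package used in practice approximates this enumeration by sampling.

```python
from itertools import combinations
from math import factorial

MASK = "[MASK]"

def toy_net_score(tokens):
    """Hypothetical net positive score; NOT the output of any real model."""
    score = 0.0
    if "slow" in tokens:
        score -= 0.6
    if "business" in tokens:
        score += 0.1
    if "slow" in tokens and "continues" in tokens:
        score -= 0.2  # interaction: context changes the contribution of a word
    return score

def shapley_values(tokens, score_fn):
    """Exact Shapley value of each token, using [MASK] as the null value."""
    n = len(tokens)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of `size` visible tokens (excluding i).
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                visible = set(subset)
                without_i = [tokens[j] if j in visible else MASK for j in range(n)]
                with_i = list(without_i)
                with_i[i] = tokens[i]  # unmask token i only
                values[i] += weight * (score_fn(with_i) - score_fn(without_i))
    return values

tokens = ["business", "continues", "to", "be", "slow"]
phi = shapley_values(tokens, toy_net_score)
for token, value in zip(tokens, phi):
    print(f"{token:10s} {value:+.3f}")
```

In this sketch the token-level values sum to the score of the full comment minus the score of the fully masked comment, which is zero for the toy function. The interaction term is split evenly between "continues" and "slow", which is how the decomposition allocates context effects across tokens.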
However, in order to get a handle on interpretability we average that variation away and work with time-invariant word-level Shapley scores. We can also examine the distribution of words across scores: Figure 5 plots the density of words across Shapley scores. The density is winsorized at the top and bottom 5 percent of the distribution to make the central mass visible. The weighted density, in black, shows the distribution weighted by the number of occurrences in the corpus. The vermilion unweighted density counts each unique token in the vocabulary equally. Note that many tokens have scores close to zero, particularly in the weighted plot. As the Shapley scores can range between positive and negative 1, it might be puzzling why so much mass is concentrated on the (-0.01, 0.01) interval. Part of the reason is simply the length of the comments: if comments are on average 16 words long, a random word will, on average, only contribute 1/16th to the comment’s score (which is bounded on (-1, 1)). In addition, many of the tokens are filler words or word parts, e.g., the token “the” has a Shapley score of 0.003.

16 In NLP, “tokens” are the basic unit of observation; they can be words or word parts.
17 In practice, calculating every permutation requires $2^N$ model evaluations for a sequence with N tokens, which can become very costly even for short comments. The SHAP package for Python circumvents this issue by sampling.

6.2 Approximate Sentiment Index

Next, we make two approximations. First, we replace the Shapley scores for each word in each comment with that word’s average Shapley score across all time. This amounts to imposing that each word has a time-invariant sentiment score. Second, motivated by the aforementioned densities, we focus only on the tails of the distribution. In particular, we keep only the top and bottom 5 percent of words, on the theory that they contain the most information about sentiment. This restriction reduces the number of words in the vocabulary to about 1,000 (from more than 10,000).

We recalculate an approximate sentiment index by adding up the (time-averaged) word-level Shapley scores for the top and bottom 5 percent of words, then dividing by the number of comments. Figure 6 shows the results. The approximate sentiment index is closely aligned with the standard index, although it does not fall by as much during the Great Recession. In general, it appears that the much simpler approximate sentiment index captures the important features of the original. This is useful, given that the approximate sentiment index is essentially a dictionary-based method: it is constructed by adding up the (time-averaged) word-level sentiment scores of each word. Table 8 shows the words with the most positive and negative sentiment scores.
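The two approximations behind the approximate sentiment index can be sketched as follows: average each word's Shapley score over all of its occurrences, keep only the words in the two tails, and sum the retained scores over each month's comments before dividing by the number of comments. The function names, the whitespace tokenization, and the toy scores are illustrative assumptions; the toy example uses a 25 percent tail so that a four-word vocabulary retains one word per tail.

```python
def average_word_scores(token_scores):
    """Time-invariant score per word: mean over all (word, score) occurrences."""
    totals = {}
    for word, score in token_scores:
        running_sum, count = totals.get(word, (0.0, 0))
        totals[word] = (running_sum + score, count + 1)
    return {w: s / n for w, (s, n) in totals.items()}

def tail_dictionary(word_scores, tail_share=0.05):
    """Keep the most negative and most positive `tail_share` of words."""
    ranked = sorted(word_scores, key=word_scores.get)
    k = max(1, int(len(ranked) * tail_share))
    return {w: word_scores[w] for w in ranked[:k] + ranked[-k:]}

def approximate_index(comments_by_month, dictionary):
    """Sum retained word scores over a month's comments, per comment."""
    index = {}
    for month, comments in comments_by_month.items():
        total = sum(dictionary.get(w, 0.0)
                    for comment in comments
                    for w in comment.lower().split())
        index[month] = total / len(comments)
    return index

# Hypothetical token-level Shapley scores pooled across comments and months.
occurrences = [("slow", -0.7), ("slow", -0.5), ("strong", 0.6),
               ("the", 0.0), ("to", 0.01)]
word_scores = average_word_scores(occurrences)
dictionary = tail_dictionary(word_scores, tail_share=0.25)
index = approximate_index(
    {"2008-12": ["business is slow", "the market"], "2019-01": ["strong demand"]},
    dictionary,
)
print(dictionary, index)
```

Words outside the retained tails contribute zero, so the resulting index behaves like a weighted dictionary method.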
The words in each group appear quite reasonable, which reassures us that BERT is indeed picking up on meaningful semantics in the comments.

7 Conclusion

In this paper, we examine the relationship between manufacturing sentiment and industrial production growth, an important indicator for macroeconomic forecasting. To evaluate the effectiveness of the sentiment measures, we compare dictionary-based and deep learning methods to human-labeled sentiment scores. Our results show that context-specific dictionary-based methods and deep learning techniques perform best in mimicking human sentiment classifications for individual comments. Notably, indexes based on the average sentiment measures of free-form textual responses closely mirror the ISM diffusion index and manufacturing production. In addition, when estimating out-of-sample industrial production growth, we find that sentiment measures based on financial-stability focused words and fine-tuned deep learning models significantly improve forecasting accuracy. Our preferred deep learning model, one based on naturally occurring labels, is consistent with a view that most words have nearly neutral sentiment, with aggregate sentiment and changes in aggregate sentiment hinging on the frequencies of relatively few words with extreme sentiment scores.

Our comparison of different sentiment measures can assist future researchers in choosing the most appropriate methodology for text analysis. Our findings suggest that deep learning techniques benefit from both manual and naturally occurring labels, and that context-specific dictionaries outperform general purpose dictionaries in out-of-sample exercises. With the advent of large language models, such as ChatGPT, we hope future research can test whether fine-tuning or curating dictionaries is still needed with generative artificial intelligence. Furthermore, the improvements to industrial production forecasts we find using survey responses suggest that other macroeconomic variables may also benefit from the inclusion of unstructured and non-traditional data such as text.

References

Andreou, Elena, Patrick Gagliardini, Eric Ghysels, and Mirco Rubin, “Is Industrial Production Still the Dominant Factor for the US Economy?,” CEPR Discussion Papers 12219 August 2017.
Angelico, Cristina, Juri Marcucci, Marcello Miccoli, and Filippo Quarta, “Can We Measure Inflation Expectations Using Twitter?,” Journal of Econometrics, 2022, 228 (2), 259–277.
Araci, Dogu, “FinBERT: Financial Sentiment Analysis with Pre-trained Language Models,” arXiv preprint arXiv:1908.10063, 2019.
Ardia, David, Keven Bluteau, and Kris Boudt, “Questioning the News About Economic Growth: Sparse Forecasting Using Thousands of News-Based Sentiment Values,” International Journal of Forecasting, 2019, 35 (4), 1370–1386.
Baker, Scott R., Nicholas Bloom, and Steven J. Davis, “Measuring Economic Policy Uncertainty,” The Quarterly Journal of Economics, 2016, 131 (4), 1593–1636.
Bok, Brandyn, Daniele Caratelli, Domenico Giannone, Argia M. Sbordone, and Andrea Tambalotti, “Macroeconomic Nowcasting and Forecasting with Big Data,” Annual Review of Economics, 2018, 10 (1), 615–643.
Bybee, J Leland, “The Ghost in the Machine: Generating Beliefs with Large Language Models,” arXiv preprint arXiv:2305.02823, 2023.
Chung, Hyung Won, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei, “Scaling Instruction-Finetuned Language Models,” arXiv preprint arXiv:2210.11416, 2022.
Cimadomo, Jacopo, Domenico Giannone, Michele Lenza, Francesca Monti, and Andrej Sokol, “Nowcasting with Large Bayesian Vector Autoregressions,” Journal of Econometrics, 2022, 231 (2), 500–519.

Correa, Ricardo, Keshav Garud, Juan M. Londono, and Nathan Mislang, “Sentiment in Central Banks’ Financial Stability Reports,” Review of Finance, 2021, 25 (1), 85–120.
Cowhey, Maureen, Seung Jung Lee, Thomas Popeck Spiller, and Cindy M. Vojtech, “Sentiment in Bank Examination Reports and Bank Outcomes,” FEDS Working Paper, Board of Governors of the Federal Reserve System 2022.
D’Agostino, Antonello and Bernd Schnatz, “Survey-based nowcasting of US growth: a real-time forecast comparison over more than 40 years,” Working Paper Series 1455, European Central Bank August 2012.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv preprint arXiv:1810.04805, 2018.
Gardner, Ben, Chiara Scotti, and Clara Vega, “Words speak as loudly as actions: Central bank communication and the response of equity prices to macroeconomic announcements,” Journal of Econometrics, 2022, 231 (2), 387–409.
Gentzkow, Matthew, Bryan Kelly, and Matt Taddy, “Text as Data,” Journal of Economic Literature, 2019, 57 (3), 535–74.
Hanley, Kathleen Weiss and Gerard Hoberg, “Dynamic Interpretation of Emerging Risks in the Financial Sector,” The Review of Financial Studies, 2019, 32 (12), 4543–4603.
Hansen, Stephen, Michael McMahon, and Andrea Prat, “Transparency and Deliberation Within the FOMC: A Computational Linguistics Approach,” The Quarterly Journal of Economics, 2018, 133 (2), 801–870.
Hassan, Tarek A., Stephan Hollander, Laurence Van Lent, and Ahmed Tahoun, “Firm-Level Political Risk: Measurement and Effects,” The Quarterly Journal of Economics, 2019, 134 (4), 2135–2202.
Heston, Steven L. and Nitish Ranjan Sinha, “News vs. Sentiment: Predicting Stock Returns from News Stories,” Financial Analysts Journal, 2017, 73 (3), 67–83.
Huang, Allen H, Amy Y Zang, and Rong Zheng, “Evidence on the Information Content of Text in Analyst Reports,” The Accounting Review, 2014, 89 (6), 2151–2180.

Huang, Allen H., Hui Wang, and Yi Yang, “FinBERT: A Large Language Model for Extracting Information from Financial Text,” Contemporary Accounting Research, 2022, 40 (2), 806–841.
Jha, Manish, Jialin Qian, Michael Weber, and Baozhong Yang, “ChatGPT and Corporate Policies,” Technical Report, National Bureau of Economic Research 2024.
Kalamara, Eleni, Arthur Turrell, Chris Redl, George Kapetanios, and Sujit Kapadia, “Making Text Count: Economic Forecasting Using Newspaper Text,” Journal of Applied Econometrics, 2022, 37 (5), 896–919.
Keynes, John Maynard, “The General Theory of Employment,” The Quarterly Journal of Economics, 1937, 51 (2), 209–223.
Lahiri, Kajal and George Monokroussos, “Nowcasting US GDP: The Role of ISM Business Surveys,” International Journal of Forecasting, 2013, 29 (4), 644–658.
Loughran, Tim and Bill McDonald, “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks,” The Journal of Finance, 2011, 66 (1), 35–65.
Lundberg, Scott M. and Su-In Lee, “A Unified Approach to Interpreting Model Predictions,” arXiv preprint arXiv:1705.07874, 2017.
Malo, Pekka, Ankur Sinha, Pekka Korhonen, Jyrki Wallenius, and Pyry Takala, “Good Debt or Bad Debt: Detecting Semantic Orientations in Economic Texts,” Journal of the Association for Information Science and Technology, 2014, 65 (4), 782–796.
Manela, Asaf and Alan Moreira, “News Implied Volatility and Disaster Concerns,” Journal of Financial Economics, 2017, 123 (1), 137–162.
Marcucci, Juri, “Macroeconomic Forecasting with Text-Based Data,” Working paper, Bank of Italy 2024.
Nielsen, Finn Årup, “A New ANEW: Evaluation of a Word List for Sentiment Analysis in Microblogs,” arXiv preprint arXiv:1103.2903, 2011.
Ouyang, Long, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe, “Training Language Models to Follow Instructions with Human Feedback,” arXiv preprint arXiv:2203.02155, 2022.
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever, “Language Models are Unsupervised Multitask Learners,” Technical Report 2018.
Shapiro, Adam Hale, Moritz Sudhof, and Daniel J. Wilson, “Measuring News Sentiment,” Journal of Econometrics, 2022, 228 (2), 221–243.
Shapley, L. S., “A Value for n-Person Games,” in Harold William Kuhn and Albert William Tucker, eds., Contributions to the Theory of Games (AM-28), Volume II, Princeton: Princeton University Press, 1953, pp. 307–318.
Sharpe, Steven A., Nitish R. Sinha, and Christopher A. Hollrah, “The Power of Narrative Sentiment in Economic Forecasts,” International Journal of Forecasting, 2023, 39 (3), 1097–1121.
Soto, Paul E., “Breaking the Word Bank: Measurement and Effects of Bank Level Uncertainty,” Journal of Financial Services Research, 2021, 59 (1), 1–45.
Tetlock, Paul C., “Giving Content to Investor Sentiment: The Role of Media in the Stock Market,” The Journal of Finance, 2007, 62 (3), 1139–1168.
Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample, “LLaMA: Open and Efficient Foundation Language Models,” arXiv preprint arXiv:2302.13971, 2023.
Young, Henry L., Anderson Monken, Flora Haberkorn, and Eva Van Leemput, “Effects of Supply Chain Bottlenecks on Prices using Textual Analysis,” FEDS Notes, Board of Governors of the Federal Reserve System 2021.

Tables

Table 1: Survey Summary Statistics

                           (1)                 (2)                (3)
Field                      Fraction W/ Text    Mean Word Count    Mean Word Count Cond. on Text
General Remarks            0.49                 7.86              15.94
Production                 0.27                 1.42               5.27
New Orders                 0.27                 1.45               5.41
Backlog                    0.19                 1.13               6.05
Employment                 0.21                 1.29               6.21
Supplier Speed             0.12                 0.89               7.23
Input Inventories          0.23                 1.51               6.50
Exports                    0.10                 0.59               5.88
Imports                    0.12                 0.79               6.43
All Text (Appended)        0.69                16.92              24.42

Notes: Summary statistics derived from the ISM survey. Column (1) reports the fraction of firm-month observations containing any text. Column (2) shows the mean word count across all firm-month observations, while column (3) shows the mean word count of only those responses containing any text. Each row corresponds to one of the question types on the ISM survey.

Table 2: Sentiment Classification Accuracy

Model                                   Accuracy    Predicted neutrals, relative to actual
AFINN                                   27.93       37.00
LM                                      20.72       43.50
Harvard                                 24.32       37.50
Stability                               11.71       50.00
FinBERT (v1)                            70.27       13.00
FinBERT (v2)                            56.76       17.50
TF-Small                                67.57        1.50
Fine-Tuned BERT: Human Labeled Data     82.88        0.00
Fine-Tuned BERT: Production Data         4.50       54.00

Notes: Accuracy and other statistics for sentiment classification models. All evaluations are done on the hold-out data. “Accuracy” is the percent of observations correctly classified by the model. The third column shows the ratio of neutral class predictions to the true number of neutral comments.

Table 3: Summary Statistics

Notes: Summary statistics for the variables used in the in-sample and out-of-sample analysis. LM, Harvard, AFINN, and Stability measure the average net sentiment when applying dictionary word counts of the Loughran and McDonald (2011), Harvard, AFINN, and Stability (Correa et al., 2021) word lists, respectively. FinBERT (v1) and FinBERT (v2) measure the average net sentiment of applying the FinBERT model from Huang et al. (2022) and Araci (2019), respectively. TF-Small and Fine-Tuned BERT: Human Labeled Data are sentiment scores derived from a fine-tuned transformer and a fine-tuned BERT model using a sample of human-labeled ISM responses. Fine-Tuned BERT: Production Data is a fine-tuned BERT model using panel data of the firm-level responses to predict the categorical variable for production in t+1 using the text in month t.

Table 4: Correlation Matrix

Notes: Correlation matrix between manufacturing industrial production growth (IP Growth) and our nine sentiment measures. LM, Harvard, AFINN, and Stability measure the average net sentiment when applying dictionary word counts of the Loughran and McDonald (2011), Harvard, AFINN, and Stability (Correa et al., 2021) word lists, respectively. FinBERT (v1) and FinBERT (v2) measure the average net sentiment of applying the FinBERT model from Huang et al. (2022) and Araci (2019), respectively. TF-Small and Fine-Tuned BERT: Human Labeled Data are sentiment scores derived from a fine-tuned transformer and a fine-tuned BERT model using a sample of human-labeled ISM responses. Fine-Tuned BERT: Production Data is a fine-tuned BERT model using panel data of the firm-level responses to predict the categorical variable for production in t+1 using the text in month t.

Table 5: In-sample Regression Results

Notes: This table reports in-sample regressions of the month-to-month percentage change of industrial production on a set of real-time predictors of IP from 2001m11-2020m1. ISM Sentiment is a text measure of the survey response sentiment using either dictionary-based methods (columns 2-5), transfer learning of financial BERT models (columns 6-7), or fine-tuned models trained on a random selection of human-labeled ISM survey responses (columns 8-9) or naturally occurring labels predicting future production (column 10). We standard normalize the sentiment measures in the regressions so that the coefficient can be interpreted as the increase in the change of industrial production in response to a one standard deviation increase in sentiment. ISM PMI is the monthly diffusion index of PMI released by the ISM at the beginning of the month. $\text{IP Growth}_{t-1}$ is the estimate of month $t-1$ from the initial month $t-1$ data release, and $\text{IP Growth}_{t-2}$ ($\text{IP Growth}_{t-3}$) is the (twice) revised estimate of month $t-2$ ($t-3$) from the month $t-1$ data release. Significance levels are indicated by *** (1 percent), ** (5 percent), and * (10 percent).

Table 6: Out-of-sample Regression Results (2018m1-2020m1)

Notes: This table reports out-of-sample mean squared errors of regressions of month-to-month percentage change of industrial production on a set of real-time predictors of IP from 2018m1-2020m1. The text measures represent the survey response sentiment using either dictionary-based methods (columns 2-5), transfer learning of financial BERT models (columns 6-7), or fine-tuned models trained on a random selection of human-labeled ISM survey responses (columns 8-9) or naturally occurring labels predicting future production (column 10). PMI, New Orders, and Inventories are monthly diffusion indexes released by the ISM at the beginning of the month. The 3 lags are the initial estimate of IP Growth for month t-1, the revised estimate of IP Growth for month t-2, and the twice-revised estimate of IP Growth for month t-3. The P-values are calculated using the Diebold-Mariano out-of-sample error statistics. Significance levels are indicated by *** (1 percent), ** (5 percent), and * (10 percent).

Table 7: Out-of-sample Regression Results (Global Financial Crisis)

Notes: This table reports out-of-sample mean squared errors of regressions of month-to-month percentage change of industrial production on a set of real-time predictors of IP from 2007m12-2009m6. The text measures represent the survey response sentiment using either dictionary-based methods (columns 2-5), transfer learning of financial BERT models (columns 6-7), or fine-tuned models trained on a random selection of human-labeled ISM survey responses (columns 8-9) or naturally occurring labels predicting future production (column 10). PMI, New Orders, and Inventories are monthly diffusion indexes released by the ISM at the beginning of the month. The 3 lags are the initial estimate of IP Growth for month t-1, the revised estimate of IP Growth for month t-2, and the twice-revised estimate of IP Growth for month t-3. The P-values are calculated using the Diebold-Mariano out-of-sample error statistics. Significance levels are indicated by *** (1 percent), ** (5 percent), and * (10 percent).

Table 8: Average Net Positive Scores

Positive Words    Score     Negative Words    Score
specials          0.055     weak              -0.063
improved          0.053     inability         -0.064
excellent         0.051     fragile           -0.064
booming           0.049     decline           -0.066
upbeat            0.048     downward          -0.066
improves          0.048     declining         -0.068
improvement       0.047     downs             -0.069
improve           0.046     weakening         -0.070
increase          0.045     depressed         -0.071
good              0.044     weaken            -0.072
rum               0.043     discontinued      -0.073
launch            0.041     slow              -0.075
brisk             0.040     offs              -0.075
increased         0.040     insufficient      -0.076
increasing        0.036     instability       -0.080
heightened        0.033     slowing           -0.081
upgrade           0.033     slug              -0.084
advantages        0.033     erosion           -0.085
lift              0.032     errors            -0.093
doubled           0.032     unstable          -0.105

Notes: Words are those with the most positive and most negative scores, among words appearing more than 5 times in the data. The “score” is the net positive probability from the Shapley decomposition: the average marginal contribution of the word toward a positive classification, minus the average marginal contribution toward a negative classification.

Figures

Figure 1: ISM Survey Responses

[Figure: time series, 2002 to 2020. Left axis: Percent of Firms, showing Empty Responses and Non-Empty Responses. Right axis: Mean Words per Response, showing Number of Words.]

Notes: This figure shows the percentage of firms and the word counts for the ISM survey responses. [Left Axis] The light (dark) grey region shows the percent of firms that provided empty (non-empty) responses on their monthly response. [Right Axis] The black line shows the mean number of words per response across all respondents for a given month.

Figure 2: Confusion Matrices and Accuracy Scores

[Figure: ten confusion-matrix heatmaps. In each panel the rows are Actual Values and the columns are Predicted Values, both ordered Negative, Neutral, Positive. The counts below list the predicted Negative, Neutral, and Positive cells for each actual class; the figure also reports each cell as a percent of all responses.]

Panel                                   Accuracy    Actual Negative    Actual Neutral    Actual Positive
Categorical Responses                   85.6%       33, 3, 3           0, 0, 2           5, 3, 62
AFINN Dictionary                        27.9%       11, 25, 3          0, 2, 0           5, 47, 18
LM Dictionary                           20.7%       17, 22, 0          0, 2, 0           3, 63, 4
Harvard Dictionary                      24.3%       10, 25, 4          0, 2, 0           7, 48, 15
Stability Dictionary                    11.7%       8, 31, 0           0, 2, 0           0, 67, 3
FinBERT v1                              70.3%       34, 4, 1           0, 1, 1           6, 21, 43
FinBERT v2                              56.8%       27, 9, 3           0, 2, 0           12, 24, 34
TF-Small                                67.6%       26, 1, 12          0, 0, 2           19, 2, 49
Fine-Tuned BERT: Human Labeled Data     82.9%       33, 0, 6           1, 0, 1           11, 0, 59
Fine-Tuned BERT: Production Data         4.5%       0, 39, 0           0, 2, 0           0, 67, 3

Notes: This figure shows the confusion matrix for nine manufacturing sentiment measures applied to the training dataset of manually labeled ISM survey responses. The rows of each matrix refer to the actual values, while the columns refer to the predicted values. Values along the diagonal are correctly classified, while values on the off-diagonals are incorrect. The shaded color refers to the percentage of responses within a given cell, according to the heatmap legend on the right.
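The accuracy scores in the panels, and the neutral-prediction ratio reported in Table 2, can be recomputed from the cell counts. The sketch below is only an arithmetic check using two of the panels; the function names are illustrative and are not taken from the paper.

```python
def accuracy(cm):
    """Percent of observations on the diagonal of a confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return 100.0 * correct / total


def neutral_ratio(cm, neutral=1):
    """Number of neutral predictions divided by the number of truly neutral items."""
    predicted_neutral = sum(row[neutral] for row in cm)  # column sum
    actual_neutral = sum(cm[neutral])                    # row sum
    return predicted_neutral / actual_neutral


# Rows are actual classes, columns are predicted classes,
# both ordered Negative, Neutral, Positive (counts from Figure 2).
AFINN = [[11, 25, 3], [0, 2, 0], [5, 47, 18]]
FT_BERT_HUMAN = [[33, 0, 6], [1, 0, 1], [11, 0, 59]]

afinn_accuracy = round(accuracy(AFINN), 2)
afinn_neutral_ratio = neutral_ratio(AFINN)
bert_accuracy = round(accuracy(FT_BERT_HUMAN), 2)
```

With only two truly neutral comments in the evaluation sample, the ratio is very sensitive to how often a model falls back on the neutral class, which is why the dictionary methods show such large values.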

Figure 3: ISM PMI and Sentiment Indexes

[Figure: time series, 2002 to 2018. Left axis: PMI. Right axis: Net Sentiment (Standard Normalized). Series: PMI; FT BERT: Human Labeled (Corr: 0.76); FT BERT: Production Data (Corr: 0.84).]

Notes: This figure shows two manufacturing sentiment measures alongside the ISM PMI (red). Fine-Tuned BERT: Human Labeled Data and Fine-Tuned BERT: Production Data are fine-tuned BERT models trained on human-labeled sentiment and future firm-level production data, respectively. Correlations between the two sentiment measures and the ISM PMI are provided in parentheses.

Figure 4: Industrial Production and Sentiment Indexes

[Figure: time series, 2002 to 2018. Left axis: IP Growth. Right axis: Net Sentiment (Standard Normalized). Series: IP Growth; FT BERT: Human Labeled (Corr: 0.42); FT BERT: Production Data (Corr: 0.48).]

Notes: This figure shows two manufacturing sentiment measures alongside IP Growth (red). Fine-Tuned BERT: Human Labeled Data and Fine-Tuned BERT: Production Data are fine-tuned BERT models trained on human-labeled sentiment and future firm-level production data, respectively. Correlations to IP Growth are provided in parentheses.

Figure 5: Token PDFs

[Figure: two density curves. Horizontal axis: Average Shapley Score (−.05 to .05). Vertical axis: PDF. Series: Weighted by count; Unweighted.]

Note: Distribution of tokens across Shapley scores. For this graph, Shapley scores are Winsorized at the top and bottom 5 percent of the distribution. “Unweighted” gives the distribution of unique tokens by score, “Weighted” gives the distribution weighted by number of appearances in the data.

Figure 6: Approximate Sentiment Index

[Figure: time series, 2004 to 2020. Vertical axis: Sentiment. Series: Sentiment Index; Constant word-level sentiment, top and bottom 5% only.]

Notes: The sentiment index (black line) is based on BERT fine-tuned using future firm-level production data. The approximate index (red dashed line) is calculated by (1) estimating Shapley decompositions to obtain each word’s contribution to the score of each comment, (2) averaging those scores over comments to get time-invariant word scores, and (3) only keeping the top and bottom 5 percent of words.

Appendix

Figure A1: What Explains the 2018 Increase in Text Responses?

[Figure: time series, 2002 to 2020. Left axis: Percent of Firms, showing Empty Responses, Non-Empty Responses w/o Tariff, and Non-Empty Responses w/ Tariff. Right axis: Mean Words per Response, showing # of Words (responses w/o Tariff) and # of Words (All responses).]

Notes: This figure shows the percentage of firms and the word counts for the ISM survey responses. [Left Axis] The light grey (dark grey and red) regions show the percent of firms that provided empty (non-empty) responses on their monthly response. The red region highlights the firms that included the word “tariff” in their response. [Right Axis] The solid (dotted) black line shows the mean number of words per response across all respondents (excluding responses using the word “tariff”) for a given month.

1 Methods

1.1 Dictionary Based Methods

A bag-of-words dictionary method is a mapping of the form $f : \mathbb{R}^V \to \mathbb{R}$, where $x^d \in \mathbb{R}^V$ is a $V$-dimensional vector and $V$ is the size of $S$, the set of unique tokens across a corpus. The elements of $x^d$, e.g. $x^d_{w_i}$, represent the number of occurrences of the word $w_i$ in document $d$. To implement a dictionary method, we select a subset of the unique words across the corpus, $D \subset S$. Then, the function $f$ is simply a weighted sum of the elements in $x^d$, i.e.

$$f(x^d) = \sum_{w_i \in D} t_{w_i} x^d_{w_i},$$

where $t_{w_i}$ represents the weight given to word $w_i$. Typically, for sentiment analysis, there are three weights: +1 for positive words, 0 for neutral words, and -1 for negative words inside of $D$.

1.2 BERT Models

This subsection describes the basics of BERT, one of the most popular transformer-based models. It is difficult to explain transformer-based models briefly, in part because they are fundamentally complex. Existing descriptions of these models are either very terse, assume extensive knowledge of deep learning terminology and history, or are vague. Our goal is to provide a reasonably succinct overview of the architecture, accessible to someone not specialized in deep learning.

These models are called “transformers” because the input is transformed into a representation in a latent space.[18] This aspect of the architecture is not particularly unique; the main distinguishing feature of transformers is attention, a mechanism that allows the interpretation of words in a sentence to be influenced by the other words in the sentence. Transformers gained popularity in part because they showed excellent performance on a wide variety of language tasks and, relatedly, the design allows for extreme parallelism.

Section 1.2.1 describes in detail the mechanics of what happens when text is fed to a BERT model used for sentiment classification (or more broadly, any type of classification). Section 1.2.2 goes over how BERT models are trained.
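As a concrete illustration of the dictionary method in Section 1.1, the weighted sum $f(x^d)$ can be sketched in a few lines. The word list and its weights below are hypothetical and are not one of the dictionaries evaluated in the paper; splitting on whitespace is also a simplifying assumption.

```python
from collections import Counter

# Hypothetical toy dictionary D with weights t_w: +1 for positive words,
# -1 for negative words. Words outside D implicitly carry weight 0.
TOY_WEIGHTS = {"strong": 1, "improved": 1, "weak": -1, "slow": -1}


def dictionary_score(text, weights):
    """Return f(x^d): the weighted sum of word counts over dictionary words."""
    counts = Counter(text.lower().split())  # x^d: occurrences of each word in d
    return sum(weights[word] * n for word, n in counts.items() if word in weights)


# "improved" contributes +1 and the two occurrences of "weak" contribute -2.
score = dictionary_score("Orders improved but exports remain weak weak", TOY_WEIGHTS)
```

Because the score depends only on word counts, negation and context ("not weak") are invisible to this kind of method, which is one motivation for the transformer models described in this section.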
Section 1.2.3 discusses how the BERT model is further trained and specialized (“fine-tuned”) to perform specific tasks or use additional data.

[18] In the original transformer paper the application was machine translation. The input, in one language, was transformed (“encoded”) into an abstract representation and this was then “decoded” back into the second natural language. The BERT architecture only includes the encoding step, and classification or other tasks use the abstract representation as an input. GPT-like models are considered decoder-only models, which seek to generate the next word in a sequence using a representation of the sequence so far.

1.2.1 BERT at Inference

BERT at inference can generally be described in five steps. First, the input text is partitioned into atomic units, e.g. words, in a process known as tokenization. Each token is represented in an abstract vector space that captures the syntactic and semantic meaning of the token. Second, the word order is taken into account using positional embeddings. Third, the model determines how much attention each word should place on the other words in the sequence through the defining characteristic of transformers, known as the attention mechanism. Fourth, a normalization step adds the attention output back to the input embeddings. Lastly, the new representation of the input sequence is used for sentiment classification. We guide the reader through these five steps below.

Step 1: Creating the Input Embeddings

A transformer-based sentiment model can be defined as a mapping of a fixed number of tokens, $L$, such that $f_T : \mathbb{R}^{V \times L} \to \mathbb{R}$, where the input $x$ is a $V \times L$ matrix. A token is a word, a part of a word, or a single character. $V$ is the size of the vocabulary, the set of tokens that are valid inputs.[19] The columns of $x$ are dummy vectors of size $V$: in the column for a given position, element $i$ equals 1 if the token in that position is $w_i$, and zero otherwise. Many pre-trained BERT models fix $L$, the sequence length, at 512 tokens. If a sequence contains fewer than 512 tokens, then the remaining positions are “padded”, in other words filled with a special “end of sequence” token that will mask any parameters associated with those positions. If a sequence has more than 512 tokens, only the first 512 are used.

Transformers, like most NLP methods, represent words as vectors, called embeddings. In large-scale, general versions of BERT, such as the base version released by Google,[20] the word (token) is represented as a 768-dimensional vector.
The high dimensionality should help capture the fact that words’ meanings have many dimensions, so two words can be similar in many ways but still distinct along important dimensions.

[19] BERT has a vocabulary of 30,522 tokens. These tokens include most common words, and “token” is sometimes used interchangeably with “word”. But, importantly, the vocabulary also includes many word parts, such as common word endings, and all single characters. Thus BERT can process any text, since unfamiliar words can be built up from word fragments and single characters.

[20] https://github.com/google-research/bert/blob/master/README.md
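The input representation in Step 1 can be made concrete with a toy example. The sketch below pads a short token sequence to a fixed length, builds the $V \times L$ indicator matrix, and multiplies it by an embedding matrix so that each indicator column picks out one embedding. The vocabulary, the `[PAD]` token name, the sizes, and the embedding values are all made up for illustration; BERT itself uses $V = 30{,}522$, $L = 512$, and 768 dimensions.

```python
# Toy sizes, chosen only for illustration: V = 4 tokens, L = 5 positions, N = 2.
VOCAB = {"[PAD]": 0, "orders": 1, "are": 2, "strong": 3}  # hypothetical vocabulary
SEQ_LEN = 5

# N x V embedding matrix with made-up values; column v is the embedding of token v.
EMBEDDINGS = [
    [0.0, 0.5, -0.1, 0.9],
    [0.0, -0.3, 0.2, 0.4],
]


def to_ids(tokens, vocab, seq_len, pad_token="[PAD]"):
    """Map tokens to vocabulary indices, truncating or padding to seq_len."""
    ids = [vocab[t] for t in tokens][:seq_len]
    return ids + [vocab[pad_token]] * (seq_len - len(ids))


def one_hot_matrix(ids, vocab_size):
    """Build the V x L indicator matrix x: entry (i, j) is 1 if position j holds token i."""
    return [[1 if token_id == i else 0 for token_id in ids] for i in range(vocab_size)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a_ik * b[k][j] for k, a_ik in enumerate(row)) for j in range(len(b[0]))]
            for row in a]


ids = to_ids(["orders", "are", "strong"], VOCAB, SEQ_LEN)
x = one_hot_matrix(ids, len(VOCAB))
x_prime = matmul(EMBEDDINGS, x)  # N x L matrix: one embedding column per position
```

In practice, libraries index the embedding table directly instead of forming the indicator matrix, but the matrix product shows why the two descriptions are equivalent.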

At inference time the embeddings are fixed. The first step of $f_T$ is to convert the $V \times L$ input into an $N \times L$ matrix, where each token indicator column (of length $V$, the size of the vocabulary) is converted into a length-$N$ word embedding vector. Define the $N \times L$ matrix as $x'$.

Step 2: Adjust to Generate Positional Embeddings

Transformer models do not inherently account for the order of the inputs anywhere in their architecture, yet word order is critical for understanding the meaning of text. Adding an index number of the input token (e.g. 1 for the first token, 2 for the second token, etc.) would create two difficulties. First, this method leads to unbounded positional adjustments. Second, the model may not be able to generalize to sequence lengths that are rarely seen, especially longer sequences. The model could see plenty of first-word adjustments, second-word adjustments, etc., but larger values would become rarer. The typical solution to account for positions is to use sine and cosine functions. For the input token in position $k$, $x'_k$, an $N$-dimensional vector $p^k$ is generated. For $0 \le i < N/2$:

$$p^k_{2i} = \sin\left(\frac{k}{10000^{2i/N}}\right), \qquad p^k_{2i+1} = \cos\left(\frac{k}{10000^{2i/N}}\right) \tag{3}$$

where $p^k_i$ is the $i$-th element of $p^k$ and $N$ is the dimension of the target embedding. We adjust the column vectors of $x'$ for their position by adding $p^k$ to $x'_k$. Call the adjusted matrix $y \in \mathbb{R}^{N \times L}$.

Step 3: Attention Mechanism

Next we enter the transformer block. This is a mapping $f : \mathbb{R}^{N \times L} \to \mathbb{R}^{N \times L}$. Note that the output and input are exactly the same size. This step of the transformer model is arguably the most important, as the final representation of the word vectors captures the meaning of the text well. We begin by creating a set of key, value, and query matrices. This step mimics a look-up in a database table. They are defined as follows:

$$K(y_i) = W_k y_i, \qquad V(y_i) = W_v y_i, \qquad Q(y_i) = W_q y_i \tag{4}$$

where $y_i \in \mathbb{R}^N$ (a column vector from the input $y$) and $W_k, W_v, W_q \in \mathbb{R}^{M \times N}$. Ultimately, the resulting vectors $K(y_i), V(y_i), Q(y_i) \in \mathbb{R}^M$ are transformations of the input vector, $y_i$. This can be thought of as projecting the individual vector $y_i$ into an abstract $M$-dimensional space. Using the entire $L$-length input sequence, an $L \times L$ matrix is then created:

$$\alpha = \begin{pmatrix} \alpha_{11} & \cdots & \alpha_{1L} \\ \vdots & \ddots & \vdots \\ \alpha_{L1} & \cdots & \alpha_{LL} \end{pmatrix} \tag{5}$$

where $\alpha_{i,j} = \mathrm{softmax}_j\left(\frac{Q(y_i) \cdot K(y_j)}{\sqrt{M}}\right)$ (i.e. the rows of the $\alpha$ matrix sum to 1). Essentially, $\alpha_{i,j}$ measures how similar the query (a transformation of the $i$-th word in consideration) is to the other keys (a transformation of the other words in the sequence). Each value vector is then weighted depending on the attention of that word with the other words in the sequence:

$$u'_i = W_0 \sum_{j=1}^{L} \alpha_{i,j} V(y_j) \tag{6}$$

where $W_0 \in \mathbb{R}^{N \times M}$ and $u'_i \in \mathbb{R}^N$. This assumes there is just one head. However, we can have multiple heads such that equations (4), (5), and (6) are repeated with $H$ different sets of parameters. For example, for head $h$, the $W$ matrices in (4) will be different: $W_{k,h}$, $W_{v,h}$, and $W_{q,h}$. This will lead to $\alpha^{(h)}$ in (5), and (6) will become:

$$u'_i = \sum_{h=1}^{H} W_{0,h} \sum_{j=1}^{L} \alpha^{(h)}_{i,j} V^{(h)}(y_j) \tag{7}$$

with $W_{0,h} \in \mathbb{R}^{N \times M}$.

The last step of the attention mechanism is to add the resulting vector $u'_i$ from (7) back to the input vector $y_i$, and then pass the result through a layer normalization function, which is analogous to a standard normalization procedure but adjusted with an additional scaling and shifting parameter.[21]

$$u_i = \mathrm{LayerNorm}(y_i + u'_i) \tag{8}$$

Step 4: Feed Forward and Normalize

Next the resulting vector, $u_i$, is passed to a ReLU network, then added to itself, and finally normalized once more:

$$z'_i = W_2 \, \mathrm{ReLU}(W_1 u_i) \tag{9}$$

$$z_i = \mathrm{LayerNorm}(u_i + z'_i) \tag{10}$$

where $W_1 \in \mathbb{R}^{P \times N}$ and $W_2 \in \mathbb{R}^{N \times P}$. The final vector $z_i \in \mathbb{R}^N$ is the transformed input vector $y_i$ that accounts for the position of the $i$-th word and the attention the word emits and receives from other words in the sequence.

Step 5: Sentiment Classification

The last step entails a mapping $f : \mathbb{R}^{N \times L} \to \mathbb{R}$. Typically, this is a neural network that takes as input a matrix and outputs a probability distribution across 3 categories: positive, negative, and neutral.

1.2.2 Training

As with most deep learning models, BERT is estimated using stochastic gradient descent. Model weights are adjusted using a learning rate, $\lambda$, such that $w_{i+1} = w_i - \lambda \frac{\partial \mathcal{L}}{\partial w_i}$, where $\mathcal{L}$ is the loss function. If $\lambda$ is too large, updates may overshoot and the optimum may be missed. Setting $\lambda$ too small leads to smaller adjustments and more time needed for convergence. To accelerate the process and improve the efficiency of finding optimum weights, an extension of gradient descent, known as Adaptive Moment Estimation (or the ADAM optimizer), is typically used.

[21] The layer normalization function has two parameters, $\gamma$ (scale) and $\beta$ (shift), and is defined as follows: $\mathrm{LayerNorm}(x; \gamma, \beta) = \gamma * \frac{x - \mu}{\sigma} + \beta$, where $\mu$ and $\sigma$ are the mean and standard deviation of the elements of $x$.
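The learning-rate trade-off described above can be seen on a one-dimensional toy problem. The sketch below applies the plain update $w_{i+1} = w_i - \lambda \, \partial \mathcal{L} / \partial w_i$ to a hypothetical quadratic loss; it is neither stochastic nor ADAM, and the loss, starting point, and step counts are assumptions chosen only to make the behavior visible.

```python
def gradient_descent(grad, w0, lr, steps):
    """Apply w_{i+1} = w_i - lr * grad(w_i) for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w


def grad(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


w_small = gradient_descent(grad, w0=0.0, lr=0.001, steps=50)  # barely moves from 0
w_good = gradient_descent(grad, w0=0.0, lr=0.1, steps=50)     # settles near 3
w_large = gradient_descent(grad, w0=0.0, lr=1.1, steps=50)    # overshoots and diverges
```

On this loss each step multiplies the distance to the optimum by $(1 - 2\lambda)$, so any $\lambda$ above 1 makes the distance grow in absolute value, while a very small $\lambda$ shrinks it only slightly per step.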

For financial text sentiment classification, two popular BERT models have been pre-trained on large corpora and are publicly available: Huang et al. (2022) (which we refer to as FinBERTv1) and Araci (2019) (which we refer to as FinBERTv2). FinBERTv1 was trained on nearly 10,000 sentences from SEC filings, equity reports, and earnings conference call transcripts that were hand labeled for sentiment. FinBERTv2 was trained on nearly 5,000 randomly selected sentences from financial news articles, and nearly 1,000 financial news tweets, all of which were manually labeled for sentiment.

1.2.3 Fine-Tuning

We fine-tune BERT in three different ways. The first two use human-labeled responses from a sample of the ISM survey, while the last exploits the panel structure of the survey to fine-tune a model with naturally occurring data.

TF-Small and Fine-Tuned BERT: Human Labeled Data

We first create a dataset of responses that were hand-labeled for sentiment. We format the ISM survey responses at the firm-month-question level and randomly select 1,000 text responses. Each response was classified for sentiment by two economists using the following question as a guide: “Is this comment consistent with manufacturing IP rising month over month?” The classifications were either positive, neutral, or negative. We keep only the 700 responses for which both economists agreed on the sentiment. Then, we split our sample such that 90 percent is used for fine-tuning, and 10 percent is left over as an unseen test set for sentiment model comparisons (i.e. it is never used for training).

We use this human-rated training data to train two types of models. For the first model, we train a plain vanilla transformer model from scratch using a simple architecture (with only one head and embedding dimensions of size 12-16). We call this model simply the TF-Small model (TF for “transformer”). The second model uses the pre-trained, off-the-shelf BERT as the baseline transformer, but we fine-tune the last layer using our human-labeled dataset. We call this model Fine-Tuned BERT: Human Labeled Data. Note that for the TF-Small model, we estimate the entire attention mechanism weights, whereas for Fine-Tuned BERT, we are further tuning the attention weights that were pre-trained on a large dataset.
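The construction of the human-labeled sample (keep only responses on which both raters agree, then hold out 10 percent) can be sketched as follows. The data are synthetic, generated so that 700 of 1,000 responses agree as in the text; the function names, the seed, and the use of a random shuffle for the split are assumptions, not details taken from the paper.

```python
import random


def keep_agreed(responses):
    """Keep (text, label) pairs for responses where both raters chose the same label."""
    return [(text, a) for text, a, b in responses if a == b]


def split_train_test(items, test_share=0.10, seed=0):
    """Shuffle with a fixed seed and hold out test_share of the items as a test set."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


# Synthetic sample of (text, rater 1 label, rater 2 label); raters disagree on 30 percent.
sample = [
    (f"comment {i}", "positive", "positive" if i % 10 < 7 else "neutral")
    for i in range(1000)
]

agreed = keep_agreed(sample)
train, test = split_train_test(agreed)
```

Holding the test items out before any fine-tuning is what allows the same set to be reused to compare models that were never trained on ISM text, such as the dictionary methods.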

Fine-Tuned BERT: Production Data (Naturally Occurring Data)

The third fine-tuned model capitalizes on the panel structure of the ISM survey. Since the same firms appear in the survey over time, we use pre-trained BERT to estimate a model that predicts the production categorical response ($PROD_{f,t+1}$) from the previous month’s text ($text_{f,t}$). The availability of the target variable, $PROD_{f,t+1}$, is beneficial for us in two ways. First, rather than manually labeling hundreds or thousands of responses, we obtain naturally occurring data that is several orders of magnitude larger than our human-labeled dataset, at nearly zero effort and with instantaneous availability. Second, the task of predicting the production categorical response one month ahead aligns closely with our downstream task of forecasting aggregate industrial production.
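The pairing of month-$t$ text with the month-$t+1$ production response can be sketched as a simple panel join. The data layout (a dictionary keyed by firm and an integer month index), the field names, and the category labels below are hypothetical; the paper does not describe its implementation at this level.

```python
def build_lagged_pairs(panel):
    """Pair each firm's text in month t with its production response in month t+1.

    `panel` maps (firm, month_index) to {"text": str, "prod": str}, where
    month_index counts months as consecutive integers.
    """
    pairs = []
    for (firm, month), row in sorted(panel.items()):
        next_row = panel.get((firm, month + 1))
        # Skip empty comments and months with no response in the following month.
        if row["text"] and next_row is not None:
            pairs.append((row["text"], next_row["prod"]))
    return pairs


# Toy panel: firm A responds in months 0 to 2; firm B skips month 1.
panel = {
    ("A", 0): {"text": "orders picking up", "prod": "same"},
    ("A", 1): {"text": "", "prod": "higher"},
    ("A", 2): {"text": "demand is soft", "prod": "lower"},
    ("B", 0): {"text": "backlog growing", "prod": "higher"},
    ("B", 2): {"text": "steady", "prod": "same"},
}

pairs = build_lagged_pairs(panel)
```

Only firm A in month 0 yields a training pair here: its month-1 comment is empty, its month-2 comment has no following month, and firm B never responds in two consecutive months.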
