feds · May 28, 2026

Skill and Efficiency in the U.S. Mutual Fund Industry

Abstract

We propose a new measure of mutual fund manager ability: "efficiency" is the ability to accrue the risk premium associated with a risk factor. The familiar abnormal return, or alpha, is shown to be the sum of two distinct measures of ability: "aggregate efficiency" which is the beta-weighted sum of the fund's (in)efficiencies across risk factors, and "skill," the component that is unrelated to factor exposures. Using a panel of U.S. equity mutual fund returns from 1999-2023, we document significant heterogeneity in mutual fund manager skill and efficiency. We employ regression trees and their extensions to capture this heterogeneity. We find that efficiency is more persistent than skill, and we show that future abnormal returns can be better predicted by decomposing lagged abnormal returns into skill and efficiency.

Finance and Economics Discussion Series Federal Reserve Board, Washington, D.C. ISSN 1936-2854 (Print) ISSN 2767-3898 (Online) Skill and Efficiency in the U.S. Mutual Fund Industry Dong Hwan Oh and Andrew J. Patton 2026-032 Please cite this paper as: Oh, Dong Hwan, and Andrew J. Patton (2026). “Skill and Efficiency in the U.S. Mutual Fund Industry,” Finance and Economics Discussion Series 2026-032. Washington: Board of Governors of the Federal Reserve System, https://doi.org/10.17016/FEDS.2026.032. NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. References in publications to the Finance and Economics Discussion Series (other than acknowledgement) should be cleared with the author(s) to protect the tentative character of these papers.

Skill and Efficiency in the U.S. Mutual Fund Industry∗

Dong Hwan Oh† and Andrew J. Patton‡

March 25, 2026

Keywords: mutual fund performance, trading costs, machine learning, heterogeneity.

J.E.L. Classification: G11, G12, G23, C58

∗ We thank Tim Bollerslev, Frank Diebold, Rob Stambaugh, and seminar participants at Duke University and the University of Pennsylvania for helpful comments. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. The second author also thanks the School of Banking and Finance at UNSW Sydney, where part of the work on this paper was completed, for their hospitality.
† Federal Reserve Board. Email: donghwan.oh@frb.gov
‡ Department of Economics, Duke University and School of Economics, Singapore Management University. Email: andrew.patton@duke.edu

1 Introduction

U.S. mutual funds collectively manage more than $21 trillion in assets¹ and play an essential role in channeling household savings into equities, fixed income securities, and other asset classes. Given this scale, understanding whether and how mutual funds add value is a long-standing question in financial economics. Since Jensen (1968), academics have debated whether mutual fund managers possess genuine skill, or whether observed fund performance reflects exposure to systematic risk factors (Fama and French, 1993; Carhart, 1997) or luck (Barras et al., 2010; Fama and French, 2010; Harvey and Liu, 2022). Parallel to studies of manager skill, a growing literature (e.g., Keim and Madhavan, 1997; Wermers, 2000; Novy-Marx and Velikov, 2016) studies mutual funds’ implementation costs, which detract from their abnormal returns. For example, Patton and Weller (2020) use Fama-MacBeth (1973) regressions to obtain an estimate of the “all-in” costs of implementing factor strategies, and show that, on average, mutual funds fail to fully capture the risk premia implied by factor models.

The revolution in how fund managers are evaluated, from raw returns to Sharpe (1964) or Treynor (1966) ratios, to the now-ubiquitous “risk-adjusted returns” (or alpha) of Jensen (1968), has immensely benefited investors, helping them to distinguish between risk premia and skill. But, given implementation costs, capacity constraints, and similar impediments faced by fund managers, what if accruing the full market risk premium on a portfolio is a non-trivial task? In such a world, one component of skill would be adding value beyond risk exposures: the traditional alpha. The other component would be accumulating the risk premium associated with those risk exposures: the component that enters the “risk-adjustment” term in alpha.
This paper argues that both of these are types of ability that a fund manager may possess, and shows that these abilities have distinct properties in the cross section and time series of U.S. mutual fund returns.

¹ Source: Federal Reserve Bank of St. Louis, series BOGZ1LM654090000Q.

We propose a decomposition of the familiar abnormal return, or alpha, into two components: “efficiency,” which measures how well the manager is able to accrue the risk premia associated with a given risk factor, and “skill,” the part of the fund’s abnormal return that is unrelated to its exposures.² In conventional performance evaluation, full efficiency is implicitly assumed and alpha is attributed entirely to skill. However, in the presence of implementation costs, accruing risk premia is itself a form of ability, and by measuring it we obtain important new insights into manager heterogeneity and performance persistence.

Using data on all U.S. equity mutual funds from 1999 to 2023, and the Fama-French (1992)-Carhart (1997) four-factor model,³ we document four main results. First, we show that there is statistically significant heterogeneity in both skill and efficiency across funds. Larger funds, for example, tend to be less efficient in harvesting premia, while funds that are well explained by conventional factors (“high R²” funds) tend to be highly efficient. These differences can be large: e.g., funds in the top quintile of R² earn 1.3% more per year per unit exposure to the market factor than funds in the bottom quintile. Thus the heterogeneity in alpha documented in past work (e.g., Fama and French, 2010; Banegas et al., 2013; Koijen, 2014) is also present in the components of alpha, skill and efficiency.

Second, we show that skill and efficiency are negatively related in the cross-section of funds: higher skilled funds tend to be less efficient, and vice versa. For example, funds in the top quintile of skill are 2.5 times more likely to be in the bottom quintile of efficiency than in the top quintile of efficiency. This is consistent with each of these tasks requiring manager attention or effort, so that good performance in one task detracts from performance in the other; see, e.g., Kacperczyk et al. (2016).
Third, we find evidence that efficiency is more persistent than skill, with an autoregressive coefficient of 0.13 compared with 0.05. (Unsurprisingly, the persistence of alpha lies between those of its two components, at 0.10.) This is consistent with efficiency being a more fundamental property of the manager (or manager’s investment process, liquidity management, execution costs, etc.) and with skill being more attributable to luck.

² Clearly, we use “efficiency” to describe a property that is related to, but distinct from, the mean-variance efficiency property of Markowitz (1952).
³ In Section 5 we also consider the Fama-French (2015)-Carhart (1997) six-factor model and find similar results.

Finally, we show that forecasts of future abnormal returns can be significantly improved by decomposing past abnormal returns into skill and efficiency. Connecting with evidence that mutual fund performance persistence is mostly driven by poorly-performing funds (Brown and Goetzmann, 1995; Carhart, 1997), we find further gains in forecasts of future abnormal returns by decomposing skill and efficiency into positive and negative components. We find that positive skill and negative efficiency are key for predicting future abnormal returns.

We capture the heterogeneity in mutual fund efficiency using a variety of methods. First, we use ex ante sorts of funds by observed characteristics like total net assets or the time series R² from the factor model (measuring how closely the fund is tracking factors). We also sort funds using the characteristics of the stocks held by the mutual funds (market capitalization, book-to-market ratio, and momentum), following the work of Daniel and Titman (1997). Given the preponderance of possible sort variables and split choices, we also use regression trees (Breiman et al., 1984, 2017) and random forests (Breiman, 2001) to capture unknown forms of heterogeneity.

Our use of regression trees and their extensions connects this paper to the literature on machine learning methods in finance. Early work in this emerging area focused on machine learning methods for prediction (e.g., see Freyberger et al., 2020; Gu et al., 2020; Bianchi et al., 2021; Aleti et al., 2025; Kaniel et al., 2023; Kelly and Xiu, 2023).
In contrast, our use of these methods is focused on capturing unobserved heterogeneity through the lens of an economic model, linking this paper to the work of Bonhomme and Manresa (2015), Fan et al. (2022), Patton and Weller (2022) and Farrell et al. (2025).

This paper also connects to studies that consider the appropriate set of factors to include in a risk-adjustment model. For example, Berk and van Binsbergen (2015) note that the factors commonly used in academic work do not account for transaction costs, and DeMiguel et al. (2025) note that the usual factors involve both a long and a short leg, with the latter typically being outside of the opportunity set for typical mutual fund investors. Our focus is primarily on the heterogeneity in efficiency across funds for a given set of factors, rather than whether mutual funds are able to accrue the full risk premia associated with those factors. That is, the full risk premium may be out of reach for all mutual funds, but perhaps some are closer to achieving it than others. For comparison, though, we also consider “long-only” versions of the Fama-French (1992)-Carhart (1997) factors, and find similar results to those obtained using the original factors.

The remainder of the paper is organized as follows. Section 2 describes our data and presents evidence that mutual funds, on average, fail to accrue all of the “on-paper” risk premia for the factors to which they are exposed. Section 3 develops our decomposition of abnormal returns into skill and efficiency and presents results on the cross-sectional heterogeneity in skill and efficiency. Section 4 shows how skill and efficiency are related to each other in the cross-section and the time series, and considers the predictive power of each for future abnormal returns. Section 5 considers three extensions of our main analysis: to “long-only” versions of the factors used in the paper, to an alternative factor model (the six-factor model of Fama-French (2015)-Carhart (1997)), and to a model that allows for partial homogeneity of efficiency across mutual funds. Section 6 concludes. Additional details and analyses are presented in the appendix.
2 Skill, efficiency, and abnormal returns

In this section we introduce the sample of mutual funds used in our analysis, and then show that mutual funds do not fully capture the risk premia associated with standard risk factors, i.e., are “inefficient” at accruing risk premia. Finally, we show that in the presence of inefficiencies, the familiar abnormal return is composed of two components.

2.1 Data description

We use data on 4,853 U.S. domestic equity mutual funds available in CRSP, over the period 1999–2023. We follow Berk and van Binsbergen (2015), Pastor et al. (2015) and Patton and Weller (2020) for the data cleaning and filtering steps, which are described in detail in Appendix A.

Summary statistics on the characteristics of these funds and their monthly returns are presented in Table 1. We see that approximately half of our full sample of funds (2,376) is available in each month of the sample, and that the average age of the funds in our sample is 14.5 years. The average (inflation-adjusted) total net assets (TNA) of mutual funds in our sample grew from $1.8B to $3.2B from the start to the end of our sample. The average TNA masks the well-known right-skewness of mutual fund TNA, as shown by the cross-sectional quantiles of TNA. For example, in December 2023 the mean TNA was well above the 75th percentile of the distribution.

In Panel B of Table 1 we see that the average fund earned an annualized average net-of-fees return of 4.9%, with an annualized standard deviation of 15.4% and a Sharpe ratio of 0.37. Being domestic equity mutual funds, it is no surprise that they are highly correlated with the market, with an average correlation of 0.80, and well explained by the Fama-French (1992)-Carhart (1997) four-factor model, with an average R² of 0.78 and a median R² of 0.89.

Figure 1 plots the average number of funds in each cross section in our sample period as well as the average TNA per fund. We observe that the number of funds has grown from just over 1,000 at the start of our sample, to around 2,700 in 2018, declining to about 2,400 at the end of our sample.
The average size of mutual funds in our sample (in 2023 dollars) was around $1.5B for the first decade of our sample, then rose strongly to over $3.5B in 2021, declining to around $2.6B in December 2023.

Table 1: Summary statistics. This table presents summary statistics on our sample of 4,853 United States domestic equity mutual funds, over the period 1999-2023. Panel A provides information on the time series of the number of active funds for each date as well as cross-sectional information on fund lifetimes and total net assets (TNA) at sample start, middle, and end (all measured in December 2023 dollars). Panel B reports distributional information on fund excess returns, such as the mean return, return volatility and the Sharpe ratio. ρ_MKT is the correlation with the S&P 500 index return, and R²_FF4 is the coefficient of determination from the Fama-French (1992) model augmented with the momentum factor of Carhart (1997).

Panel A: Fund characteristics

| | Funds (number) | Age (years) | TNA 1/99 ($m) | TNA 12/10 ($m) | TNA 12/23 ($m) |
|---|---|---|---|---|---|
| Mean | 2375.8 | 14.5 | 1799.9 | 1899.2.5 | 3209.4 |
| Std Dev | 257.9 | 10.0 | 6862.4 | 7768.1 | 12880.9 |
| Q25 | 2170.0 | 6.4 | 72.4 | 74.2 | 88.1 |
| Q50 | 2378.5 | 11.6 | 246.1 | 287.9 | 397.3 |
| Q75 | 2585.0 | 21.1 | 959.2 | 1118.2 | 1560.9 |

Panel B: Fund return characteristics

| | Mean Ret (% pa) | Ret Vol (% pa) | Sharpe (Annualized) | ρ_MKT | R²_FF4 |
|---|---|---|---|---|---|
| Mean | 4.89 | 15.43 | 0.37 | 0.80 | 0.78 |
| Std Dev | 7.15 | 7.32 | 0.40 | 0.26 | 0.25 |
| Q25 | 2.26 | 10.87 | 0.22 | 0.79 | 0.75 |
| Q50 | 5.95 | 15.60 | 0.45 | 0.90 | 0.89 |
| Q75 | 8.92 | 18.92 | 0.58 | 0.95 | 0.94 |

Figure 1: Number and average size of mutual funds, 1999-2023. This figure shows the average total net assets (TNA) of the 4,853 funds in our sample, measured in millions of December 2023 dollars, displayed on the left y-axis. On the right y-axis we show the average number of funds in our sample for each month.

[Figure: time series plot titled “Number and average size of mutual funds”; left axis: Average TNA; right axis: Number of funds; horizontal axis: 2000 to 2020.]

To construct the market equity (ME), book-to-market ratio (BM), and momentum (MO) variables at the mutual fund group level, we begin by obtaining quarterly mutual fund holdings data (including ticker, PERMNO, the number of shares held, and the market value of holdings) from the CRSP Mutual Funds Holdings database. For each stock identified by ticker and PERMNO, we collect monthly stock prices, returns, and shares outstanding from the CRSP monthly stock file. Additionally, we retrieve the quarterly accounting data necessary for computing book value from the Compustat/CRSP merged database, using GVKEY and PERMNO as identifiers. Market equity (ME) is calculated as the product of a stock’s price and the number of shares outstanding. Momentum (MO) is defined as the cumulative return over the past twelve months, excluding the most recent month (i.e., months t−12 to t−2). If the 12-month return is unavailable, we use the 6-month return as a proxy. The book-to-market ratio (BM) is computed monthly as book equity⁴ divided by ME.

After calculating monthly ME, BM, and MO for each stock, we aggregate these measures to the mutual fund level. Because CRSP holdings are reported quarterly, we assume that mutual funds maintain constant shareholdings between reporting dates. For each month, stock-level portfolio weights are calculated as the product of the number of shares held and the stock’s month-end price, divided by the total value of all holdings in the mutual fund, i.e., the sum of these products across all stocks in the fund. This methodology follows established practices in Daniel et al. (1997) and Ferson and Wang (2021), among others. Using these weights, mutual fund-level ME, BM, and MO are calculated as the monthly weighted averages of the corresponding stock-level characteristics. Finally, we aggregate to the fund group level by computing TNA-weighted averages of each fund’s characteristics, where TNA denotes the total net assets of the component mutual funds.

⁴ Book equity is assumed to remain constant within each quarter and is defined as the sum of Total Parent Stockholders’ Equity (SEQQ) and Deferred Taxes and Investment Tax Credit (TXDITCQ), minus Preferred/Preference Stock (PSTKQ).

For our main analyses we adopt the popular four-factor model composed of the market factor, a size factor (“SMB”) and a value factor (“HML”) following Fama and French (1992), augmented with the momentum factor of Carhart (1997), all value-weighted.⁵ In Section 5 we consider long-only versions of the FF4 factors, as well as the Fama-French (2015)-Carhart (1997) six-factor model.

We use the procedure of Fama and MacBeth (1973) to estimate risk exposures and risk premia associated with the factors. The first stage of this procedure is typically a time series regression using the last 36 or 60 months of returns (see Ferson (2019) and Kaniel et al. (2023), for example), which yields estimates of the risk exposures (betas) and abnormal returns (alphas) for each fund. However, since late 1998 daily returns on mutual funds have become available, facilitating the implementation of a superior method for measuring factor exposures: “realized betas” (Barndorff-Nielsen and Shephard, 2004; Andersen et al., 2006). This method is used by Lewellen and Nagel (2006) for stock portfolios and by Akbas and Genc (2020) for mutual funds. We substitute the standard 36- or 60-month window of monthly returns with a three-month window (approximately 65 observations) of daily returns; see Appendix B for details. This approach captures changes in funds’ betas more quickly and facilitates more accurate estimates of risk premia in the second stage.
If any of our mutual funds engage in factor timing (Bollen and Busse, 2001; Ammann et al., 2020), our realized betas can capture this effect, in contrast with estimates of beta using only monthly data over a longer span of time.

⁵ We obtain returns on these factors from Kenneth French’s website: https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
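As a rough illustration of the realized-beta first stage described above, the following sketch regresses one fund’s daily excess returns on daily factor returns over a single window of about 65 trading days. The function name `realized_betas`, the simulated data, and the choice of plain OLS with an intercept are assumptions made for illustration; the estimator actually used in the paper is described in its Appendix B and may differ.

```python
import numpy as np

def realized_betas(fund_excess, factors):
    """OLS betas of daily fund excess returns on daily factor returns.

    fund_excess: shape (T,) daily excess returns for one fund
    factors:     shape (T, K) daily factor returns (e.g. MKT, SMB, HML, MOM)
    Returns (alpha_daily, betas), where betas has shape (K,).
    """
    T = fund_excess.shape[0]
    X = np.column_stack([np.ones(T), factors])  # intercept plus K factors
    coef, *_ = np.linalg.lstsq(X, fund_excess, rcond=None)
    return coef[0], coef[1:]

# Simulated three-month window: 65 trading days, 4 factors (hypothetical data).
rng = np.random.default_rng(0)
T, K = 65, 4
factors = rng.normal(0.0, 0.01, size=(T, K))
true_betas = np.array([1.0, 0.2, -0.1, 0.05])
fund_excess = factors @ true_betas + rng.normal(0.0, 0.001, size=T)

alpha_daily, betas = realized_betas(fund_excess, factors)
```

Re-running the function on each rolling three-month window would give a monthly series of lagged betas for use in the second-stage cross-sectional regressions.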

2.2 Do mutual funds accrue the full risk premium?

We adopt the method described in Patton and Weller (2020) to answer the question posed in the section title: “on-paper” risk premia are estimated using a sample of 199 stock portfolios based on sorts on size and other characteristics.⁶ We compare these with the risk premia estimates obtained by estimating the same model on our sample of mutual funds. The difference between the “on-paper” risk premium and that achieved by mutual funds is the average (in)efficiency of mutual funds for that factor.

Table 2 reveals economically and statistically significant inefficiencies for the average mutual fund. The annualized on-paper market risk premium for our sample period is 8.8%, while the premium earned by mutual funds for exposure to the market factor is only 6.9%, an efficiency loss of 1.9%, significant at the 0.01 level. The “value” factor (HML) risk premium is 1.4%, while mutual funds earn -0.3%, which is an efficiency loss of 1.7%, significant at the 0.10 level. Similarly, the momentum factor earned 1.8% over our sample period while the premium earned by mutual funds is -0.1%, a difference significant at the 0.10 level. The SMB factor did not generate significant risk premia over our sample period, and mutual funds exhibit no significant inefficiency in their exposure to that factor. A joint test that mutual funds are fully efficient for all four factors rejects the null at the 0.01 level, indicating strong evidence that mutual funds do not fully capture the risk premia associated with the factors to which they are exposed.⁷,⁸ This motivates our analysis of whether some mutual funds are better than others at accruing risk premia.

⁶ Specifically, we use 50 portfolios based on one-way decile sorts on size, book-to-market (BM), momentum, investment, and operating profitability (OP); 100 portfolios based on two-way quintile sorts on size-BM, size-momentum, size-investment, and size-OP; and 49 industry portfolios formed using SIC codes.
⁷ In Table A2 we re-do this analysis using monthly data to estimate betas and find qualitatively similar results: the significance of the inefficiency for the MOM factor is slightly lower, though it remains strong for the market factor and the value factor, and the joint test of full efficiency for all four factors again rejects the null at the 0.01 level.
⁸ For comparability, the results in Table 2 suppress the intercept in both regressions, thus imposing that all portfolios have zero alpha and forcing the factor loadings to explain the cross-section of average returns. Mutual funds, however, may have alphas that differ from zero, and in Table A3 an intercept is included in the mutual fund regressions. That table shows that the estimated inefficiencies are slightly larger, and remain strongly significant, in that specification.

Table 2: Differences in earned risk premia. This table presents estimates of risk premia associated with the four risk factors in the Fama-French (1992)-Carhart (1997) model, estimated using the Fama-MacBeth (1973) procedure. Factor loadings (“betas”) are estimated using the prior three months of daily returns, as described in Appendix B. The first column uses 199 stock portfolios described in Section 2.1, and represents the “on-paper” risk premia for these factors. The second column presents estimates using our sample of 4,853 mutual funds, also described in Section 2.1. The third column presents the difference between the two estimates, and can be interpreted as the average “(in)efficiency” of mutual funds in our sample for a specific risk factor. The row labeled R² reports the average of the cross-sectional R²s across the 300 months in the sample. The row labeled N̄ reports the average number of assets used to estimate the risk premia each period. The last row reports the p-value on the joint test that all four differences are zero. All t-statistics and p-values use Newey-West (1987) standard errors with five lags.

| | Stocks | Mutual funds | Difference |
|---|---|---|---|
| MKT | 8.834 | 6.944 | -1.890 |
| (t-stat) | (2.718) | (2.047) | (-4.884) |
| SMB | 2.575 | 3.152 | 0.577 |
| (t-stat) | (1.378) | (1.729) | (0.744) |
| HML | 1.378 | -0.281 | -1.658 |
| (t-stat) | (0.560) | (-0.106) | (-1.662) |
| MOM | 1.821 | -0.051 | -1.873 |
| (t-stat) | (0.576) | (-0.016) | (-1.750) |
| R² | 0.324 | 0.510 | |
| N̄ | 199 | 2,364 | |
| Joint p-val | | | 0.000 |
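The comparison in Table 2 can be sketched as two sets of Fama-MacBeth cross-sectional regressions, one on stock portfolios and one on mutual funds, followed by a difference in the time-series means of the estimated premia. The sketch below is a minimal illustration on simulated data in arbitrary units: it suppresses the intercept, as in Table 2, and omits the Newey-West standard errors. The function names `fama_macbeth_premia` and `simulate`, and all numerical inputs, are hypothetical.

```python
import numpy as np

def fama_macbeth_premia(returns, betas):
    """Second-stage Fama-MacBeth estimates without an intercept.

    returns: shape (T, N) excess returns for N assets over T months
    betas:   shape (T, N, K) lagged factor loadings for each month
    Returns an array of shape (T, K) of monthly premia estimates.
    """
    T, N, K = betas.shape
    lam = np.empty((T, K))
    for t in range(T):
        # Cross-sectional regression of month-t returns on lagged betas.
        lam[t], *_ = np.linalg.lstsq(betas[t], returns[t], rcond=None)
    return lam

def simulate(rng, T, N, premia, noise):
    """Simulated panel in which assets earn `premia` per unit of beta."""
    K = premia.shape[0]
    betas = rng.normal(1.0, 0.3, size=(T, N, K))
    returns = betas @ premia + rng.normal(0.0, noise, size=(T, N))
    return returns, betas

rng = np.random.default_rng(1)
paper_premia = np.array([8.0, 2.0, 1.5, 2.0])   # hypothetical "on-paper" premia
shortfall = np.array([-2.0, 0.0, -1.5, -2.0])   # hypothetical premia lost by funds
fund_premia = paper_premia + shortfall

r_s, b_s = simulate(rng, T=300, N=199, premia=paper_premia, noise=1.0)
r_f, b_f = simulate(rng, T=300, N=500, premia=fund_premia, noise=1.0)

lam_stocks = fama_macbeth_premia(r_s, b_s).mean(axis=0)
lam_funds = fama_macbeth_premia(r_f, b_f).mean(axis=0)
inefficiency = lam_funds - lam_stocks  # negative values indicate premia lost
```

In this simulation `inefficiency` recovers the assumed shortfall up to sampling error, which mirrors how the “Difference” column is read.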

2.3 Decomposing the abnormal return

The canonical model for measuring the performance of mutual fund managers is:

$$E[R_{i,t} - R^f_t] = \alpha_i + \beta_i^\top \lambda \qquad (1)$$

where $\beta_i$ measures the factor exposures of the fund, $\lambda$ contains the risk premia for the factors, and $\alpha_i$ measures the abnormal return of the manager of fund $i$. However, the results in Table 2 reveal that mutual funds do not, in fact, earn the entire risk premium, $\lambda$, for their exposures; instead, they typically earn less. If we denote the amount a given fund $i$ actually earns as $\lambda_i = \lambda + \eta_i$, where $\eta_i$ is typically negative, then this suggests an extension of the canonical model:

$$E[R_{i,t} - R^f_t] = \psi_i + \beta_i^\top (\lambda + \eta_i) \qquad (2)$$

The intercept, $\psi_i$, denotes the “skill” of the fund, the component of excess returns that is unrelated to its factor exposures. The (in)efficiency of the fund is captured by the vector $\eta_i$, which measures how much, if any, of the risk premia associated with the factors, denoted by $\lambda$, is lost. The aggregate efficiency of the fund is the beta-weighted sum of the individual efficiencies, $\Delta_i = \beta_i^\top \eta_i$. Combining equations (1) and (2) reveals that, in the presence of inefficiencies in accruing risk premia, the familiar abnormal return, $\alpha_i$, is in fact composed of two components, skill and the aggregate (in)efficiency:

$$\alpha_i = \psi_i + \Delta_i, \quad \text{where } \Delta_i \equiv \beta_i^\top \eta_i \qquad (3)$$

If all mutual funds are assumed to be fully efficient for all factor exposures, i.e. $\eta_i = 0 \; \forall \, i$, as is implicitly assumed in most of the literature to date, then equation (2) simplifies to equation (1) and we have $\psi_i = \alpha_i$. However, Table 2 presents strong evidence (significant at the 0.01 level) that $\eta_i \neq 0$, and so skill and efficiency each contribute to the abnormal return of a fund.

Equation (3) reveals another important feature for the analysis in this paper: skill and efficiency cannot be separately identified using only time series data on a single fund. The usual time series regression used to estimate factor exposures provides an estimate of abnormal returns, $\alpha_i$, but without further information we cannot determine its constituent parts. In the next section we show how to use a panel of mutual funds to separately identify skill and efficiency, either via ex ante sorts of mutual funds or by machine learning methods.

3 Estimating mutual fund skill and efficiency

As discussed in the previous section, we cannot, without further assumptions, separately identify the skill and efficiency for a single mutual fund given only its returns. In this section we present a variety of methods for measuring the skill and efficiency of a panel of mutual funds.

3.1 Homogeneous factor efficiency

The simplest method for estimating skill and efficiency is the one used to obtain Table 2, where we estimate the average efficiency for a cross-section of mutual funds. This method is optimal if all mutual funds are equally efficient for each factor. (We test, and reject, this restriction in the next subsection.)

Note that even though this approach assumes that all funds are equally efficient for each factor, the aggregate (in)efficiency will still vary across funds, because average efficiency differs across factors (e.g., the average inefficiency is -1.9% for the market factor but not different from zero for the size factor) and factor exposures vary across mutual funds:

$$\Delta_{i,t} = \beta_{i,t-1}^\top \bar{\eta}_t \qquad (4)$$

Thus, even with this simple specification, we can capture the cross-sectional variation in efficiency arising from cross-sectional variations in factor loadings.

Given that accruing all of the risk premia associated with a set of risk factors is difficult, and Table 2 shows that the average mutual fund is unable to do so, this raises the question of whether some mutual funds are able to do it better than others, i.e., are some mutual funds more efficient than others?

3.2 Are all mutual funds equally good at accruing risk premia?

We first allow for heterogeneity in efficiency by grouping funds by observable characteristics and examining the efficiency of the funds in each group separately. This imposes within-group homogeneity but allows for cross-group heterogeneity, thereby generalizing the approach presented in the previous section.

We first consider sorting funds by their time series R² from the Fama-French (1992)-Carhart (1997) model. Recall from Table 1 that the median fund has an R² of 0.89, revealing that these four factors explain much of the variation in fund returns, on average. However, there is significant cross-sectional variation in R², with the lowest quintile having an average R² of around 44% compared with the top quintile with an average of over 97%. Clearly some funds hew very closely to these four factors, while others take on more idiosyncratic risks.

Table 3 provides our first evidence of heterogeneity in skill and efficiency across mutual funds. We present tests of equal efficiency across quintile groups using the usual asymptotic χ² p-values as well as p-values constructed using a simulation-based approach described in Appendix C.

Table 3: Heterogeneity in mutual fund efficiency: R² sorts. This table presents estimates of risk premia associated with the four risk factors in the Fama-French (1992)-Carhart (1997) model, as well as abnormal returns in the fifth row, estimated using the Fama-MacBeth (1973) procedure. Factor loadings (“betas”) are estimated using the prior three months of daily returns, as described in Appendix B. The first column uses all mutual funds. The next five columns report parameter estimates separately for each quintile of mutual funds sorted by their time series R² from the Fama-French (1992)-Carhart (1997) model. The last two columns present the p-values (asymptotic and simulation-based) associated with a test that the parameter estimates in that row are equal across all quintiles. The row labeled R² reports the average of the cross-sectional R²s across the 300 months in the sample. The row labeled N̄ reports the average number of assets used to estimate the risk premia each period. The row labeled R̄²_FF4 reports the average R² from the Fama-French (1992)-Carhart (1997) model for the funds in that column. The last row reports the p-value on the joint test that all quintiles have the same parameters. All t-statistics and asymptotic p-values use Newey-West (1987) standard errors with five lags.

| | All funds | R²_FF4 quintile: Low | 2 | 3 | 4 | High | p-value: Asymp. | p-value: Sim. |
|---|---|---|---|---|---|---|---|---|
| MKT | 5.085 | 4.834 | 5.170 | 6.138 | 6.980 | 6.207 | 0.079 | 0.071 |
| (t-stat) | (1.696) | (1.512) | (1.796) | (1.999) | (2.286) | (2.154) | | |
| SMB | 3.149 | 1.323 | 3.997 | 2.635 | 2.689 | 2.597 | 0.103 | 0.121 |
| (t-stat) | (1.722) | (0.675) | (2.095) | (1.442) | (1.484) | (1.375) | | |
| HML | -0.490 | -1.222 | -0.316 | -0.192 | -0.470 | 0.257 | 0.619 | 0.654 |
| (t-stat) | (-0.183) | (-0.448) | (-0.112) | (-0.072) | (-0.166) | (0.093) | | |
| MOM | 0.256 | -0.176 | -0.614 | -0.783 | -0.094 | -1.173 | 0.755 | 0.788 |
| (t-stat) | (0.08) | (-0.06) | (-0.183) | (-0.233) | (-0.028) | (-0.332) | | |
| Const | 1.802 | 2.110 | 1.886 | 0.991 | -0.112 | 0.446 | 0.002 | 0.003 |
| (t-stat) | (2.208) | (2.636) | (1.692) | (1.048) | (-0.099) | (0.43) | | |
| R² | 0.543 | 0.464 | 0.476 | 0.547 | 0.579 | 0.670 | | |
| N̄ | 2364 | 473 | 473 | 473 | 473 | 472 | | |
| R̄²_FF4 | 0.826 | 0.442 | 0.850 | 0.916 | 0.945 | 0.975 | | |
| Joint p-val | | | | | | | 0.003 | 0.047 |

Table 3 reveals that high R² funds tend to be better at earning the market risk premium than low R² funds: 6.2% versus 4.8%, and the test that efficiency is equal across all quintiles rejects the null at the 0.10 level. High R² funds are also better at earning the size risk premium: 2.6% versus 1.3%, though the null of homogeneity is not rejected for this factor. Table 3 also reveals important heterogeneity in skill, the component of abnormal returns that is unrelated to factor exposures: low R² funds have much higher skill than high R² funds, 2.1% compared with 0.4%, and a test that skill is equal across all R² quintiles rejects the null at the 0.01 level. The results in Table 3 are consistent with economic intuition: funds that track the four factors closely should aim to capture all, or most, of the risk premia associated with those factors, while funds that pursue more idiosyncratic strategies earn their abnormal returns via activities orthogonal to their risk exposures.⁹

As discussed in Section 2.1, there are vast differences in the total net assets (TNA) across funds: for example, in the last month of our sample the TNA of a fund at the 75th percentile is over 18 times as large as that of a fund at the 25th percentile, and for the 90th and 10th percentiles the ratio is nearly 250. This motivates Table 4, which sorts mutual funds into quintiles based on their TNA at the end of the previous month.

Table 4 reveals significant heterogeneity in efficiency across TNA quintiles. The efficiency for the market factor, in the first row, is almost monotonically declining in TNA, going from 5.9% for the smallest funds to 4.9% for the largest, though the p-value for the test that these five estimates are equal is around 0.20 using either the asymptotic or the simulation-based approach. For value (HML) we reject the null of equal efficiency at the 0.10 level, and for momentum we reject at the 0.05 level.
The average skill for each quintile, captured via the intercept reported in the bottom row of the top panel, is almost monotonically increasing in TNA. 9Amihud and Goyenko (2013) consider the time series R2 as a fund characteristic and find funds with lower R2 have higher abnormal returns than funds that more closely track systematic factors. The results in Table 3 can be interpreted as a decomposition of their result. 15

Table 4: Heterogeneity in mutual fund efficiency: TNA sorts. This table presents estimates of risk premia associated with the four risk factors in the Fama-French (1992)-Carhart (1997) model, as well as abnormal returns in the fifth row, estimated using the Fama-MacBeth (1973) procedure. Factor loadings (“betas”) are estimated using the prior three months of daily returns, as described in Appendix B. The first column uses all mutual funds. The next five columns report parameter estimates separately for each quintile of mutual funds sorted by total net assets. The last two columns present the p-values (asymptotic and simulation-based) associated with a test that the parameter estimates in that row are equal across all quintiles. The row labeled $R^2$ reports the average of the cross-sectional $R^2$s across the 300 months in the sample. The row labeled $\bar{N}$ reports the average number of assets used to estimate the risk premia each period. The row labeled TNA reports the average total net assets (in millions of December 2023 dollars) of the funds in that column. The last row reports the p-value on the joint test that all quintiles have the same parameters. All t-statistics and asymptotic p-values use Newey-West (1987) standard errors with five lags.

                                     Total Net Asset Quintile                  p-values
              All funds     Small         2         3         4     Large    Asymp.      Sim.
MKT               5.078     5.889     4.985     4.649     4.811     4.910     0.217     0.228
(t-stat)        (1.691)   (1.999)   (1.657)   (1.497)   (1.597)   (1.638)
SMB               3.146     3.108     3.206     3.342     3.550     2.745     0.441     0.466
(t-stat)        (1.719)   (1.719)   (1.773)   (1.756)   (1.903)   (1.448)
HML              -0.507    -1.161     0.105    -0.083    -0.946    -0.359     0.075     0.082
(t-stat)        (-0.19)  (-0.442)   (0.039)  (-0.029)  (-0.347)  (-0.128)
MOM               0.260     0.046    -0.741     1.757    -1.003     0.795     0.006     0.011
(t-stat)        (0.081)   (0.014)  (-0.227)   (0.547)  (-0.294)    (0.24)
Const             1.799     0.995     1.867     2.020     1.994     2.061     0.178     0.203
(t-stat)        (2.208)    (1.34)   (2.391)    (2.49)   (2.142)    (2.12)
$R^2$             0.543     0.500     0.555     0.578     0.582     0.600
$\bar{N}$          2344       469       469       469       469       468
TNA                2186        25        98       301       950     9,562
Joint p-val                                                                   0.002     0.044

Combining all coefficients and testing whether all quintiles share the same parameters, we reject the null, with a p-value of 0.04. Not all funds are equally skilled or efficient.

We next present results when mutual funds are sorted based on the value-weighted characteristics of the stocks in their portfolios. Like Daniel et al. (1997) and Ferson and Wang (2021), we focus on three stock characteristics: market capitalization, book-to-market ratio, and momentum. Table 5 presents results using the market capitalization sort and reveals that funds that invest in larger-cap stocks tend to be better at accruing the MKT risk premium, with the estimates ranging from 4.2% for the smallest quintile to 5.5% for the largest. The strongest source of heterogeneity across quintiles is average skill, with an estimate of 3.2% for funds with the smallest market-cap stocks in their portfolios, and 1.4% for funds with the largest-cap stocks. A joint test of equal parameters across all quintiles rejects the null at the 0.01 level.

Table 6 presents results sorting funds by the value/growth characteristic of the stocks in their portfolios. We find that funds in the second quintile, so neither “value” nor “growth” focused, are the most efficient for the MKT factor, earning 5.2%, compared with 4.4% for growth funds and 3.8% for value funds, and we reject the null of equal efficiency for MKT at the 0.05 level. We also see a U-shaped pattern in average skill, with the middle quintile having skill of 1.7% compared with 2.2% for the growth quintile and 2.9% for the value quintile. The joint test rejects the null across all parameters at the 0.05 level.

We present the results based on sorts using the MOM characteristic of fund holdings in Table A5 in the Appendix. Using the usual asymptotic p-value we reject the null of equal efficiency across quintile portfolios at the 0.05 level; however, the simulation-based p-value is 0.13 and thus fails to reject the null of equal efficiency. Given the size distortion of the joint test reported in Appendix C, we focus on the simulation-based p-value rather than the asymptotic p-value, and conclude that we do not have evidence to reject the null of equal efficiency when using momentum as a sorting variable.
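The quintile-sort estimates discussed above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`cross_section_ols`, `premia_by_quintile`, `newey_west_tstat`), the array layout, and the assumption of a balanced panel with no missing observations are all hypothetical choices made for illustration. Each month, funds are ranked on a lagged characteristic, a cross-sectional regression of excess returns on lagged betas is run within each quintile, and the monthly estimates are then averaged, with a Bartlett-kernel (Newey-West) t-statistic for the time-series mean.

```python
import numpy as np

def cross_section_ols(ret_t, beta_t):
    """One month's second-stage regression; returns [const, premia...]."""
    X = np.column_stack([np.ones(len(ret_t)), beta_t])
    return np.linalg.lstsq(X, ret_t, rcond=None)[0]

def premia_by_quintile(excess_ret, betas, sort_var, n_groups=5):
    """Monthly cross-sectional estimates within groups sorted on sort_var.

    excess_ret : (T, N) fund excess returns in month t
    betas      : (T, N, K) factor loadings estimated before month t
    sort_var   : (T, N) sorting characteristic known before month t
    Returns an array of shape (n_groups, T, K + 1).
    Assumes a balanced panel with no missing values (illustration only).
    """
    T, N, K = betas.shape
    ranks = np.argsort(np.argsort(sort_var, axis=1), axis=1)  # 0..N-1 per month
    group = ranks * n_groups // N                             # 0..n_groups-1
    est = np.empty((n_groups, T, K + 1))
    for g in range(n_groups):
        for t in range(T):
            m = group[t] == g
            est[g, t] = cross_section_ols(excess_ret[t, m], betas[t, m])
    return est

def newey_west_tstat(x, lags=5):
    """t-statistic of the mean of x using Bartlett-kernel long-run variance."""
    x = np.asarray(x, dtype=float)
    T = x.shape[0]
    u = x - x.mean()
    lrv = u @ u / T
    for l in range(1, lags + 1):
        weight = 1.0 - l / (lags + 1)
        lrv += 2.0 * weight * (u[l:] @ u[:-l]) / T
    return x.mean() / np.sqrt(lrv / T)
```

Under these assumptions, `premia_by_quintile(...).mean(axis=1)` gives one row of average estimates per quintile, and `newey_west_tstat(est[g, :, k])` gives the t-statistic for coefficient `k` in quintile `g`. The tests of equality across quintiles and the simulation-based p-values reported in the tables are not covered by this sketch.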

Table 5: Heterogeneity in mutual fund efficiency: Sorts by ME of holdings. This table presents estimates of risk premia associated with the four risk factors in the Fama-French (1992)-Carhart (1997) model, as well as abnormal returns in the fifth row, estimated using the Fama-MacBeth (1973) procedure. Factor loadings (“betas”) are estimated using the prior three months of daily returns, as described in Appendix B. The first column uses all mutual funds for which we have holdings data. The next five columns report parameter estimates separately for each quintile of mutual funds sorted by the value-weighted average of the market capitalization (ME) of their stock holdings. The last two columns present the p-values (asymptotic and simulation-based) associated with a test that the parameter estimates in that row are equal across all quintiles. The row labeled $R^2$ reports the average of the cross-sectional $R^2$s across the 300 months in the sample. The row labeled $\bar{N}$ reports the average number of assets used to estimate the risk premia each period. The last row reports the p-value on the joint test that all quintiles have the same parameters. All t-statistics and asymptotic p-values use Newey-West (1987) standard errors with five lags.

                                           ME Quintile                          p-vals
              All funds     Small         2         3         4     Large    Asymp.      Sim.
MKT               5.085     4.223     5.728     4.404     6.125     5.459     0.036     0.050
(t-stat)        (1.696)   (1.495)   (1.975)   (1.369)   (2.067)   (1.757)
SMB               3.149     2.686     1.887     3.397     0.819     3.294     0.154     0.191
(t-stat)        (1.722)   (1.411)   (1.017)   (1.958)   (0.431)   (1.427)
HML              -0.490     1.433    -1.166    -0.311    -2.002    -0.937     0.110     0.129
(t-stat)       (-0.183)   (0.471)  (-0.431)  (-0.114)  (-0.744)  (-0.342)
MOM               0.256    -0.403     0.605     0.381     0.502     0.217     0.964     0.973
(t-stat)         (0.08)  (-0.111)   (0.185)   (0.122)   (0.168)   (0.071)
Const             1.802     3.242     1.820     1.748     0.890     1.427     0.023     0.031
(t-stat)        (2.208)   (2.799)   (1.863)   (2.351)   (0.934)   (1.857)
$R^2$             0.543     0.506     0.483     0.613     0.503     0.528
$\bar{N}$          2364       473       473       473       473       472
Joint p-val                                                                   0.000     0.010

Table 6: Heterogeneity in mutual fund efficiency: Sorts by BM of holdings. This table presents estimates of risk premia associated with the four risk factors in the Fama-French (1992)-Carhart (1997) model, as well as abnormal returns in the fifth row, estimated using the Fama-MacBeth (1973) procedure. Factor loadings (“betas”) are estimated using the prior three months of daily returns, as described in Appendix B. The first column uses all mutual funds for which we have holdings data. The next five columns report parameter estimates separately for each quintile of mutual funds sorted by the value-weighted average of the book-to-market ratio (BM) of their stock holdings. The last two columns present the p-values (asymptotic and simulation-based) associated with a test that the parameter estimates in that row are equal across all quintiles. The row labeled $R^2$ reports the average of the cross-sectional $R^2$s across the 300 months in the sample. The row labeled $\bar{N}$ reports the average number of assets used to estimate the risk premia each period. The last row reports the p-value on the joint test that all quintiles have the same parameters. All t-statistics and asymptotic p-values use Newey-West (1987) standard errors with five lags.

                                           BM Quintile                          p-vals
              All funds    Growth         2         3         4     Value    Asymp.      Sim.
MKT               5.085     4.413     5.239     4.433     4.963     3.778     0.025     0.035
(t-stat)        (1.696)    (1.53)   (1.715)   (1.393)   (1.696)   (1.343)
SMB               3.149     1.554     2.348     3.940     3.201     3.385     0.285     0.315
(t-stat)        (1.722)   (0.772)   (1.212)   (2.284)    (1.82)   (1.934)
HML              -0.490    -1.602    -1.903    -1.190     0.135     0.914     0.168     0.192
(t-stat)       (-0.183)  (-0.649)  (-0.795)  (-0.407)   (0.052)   (0.328)
MOM               0.256     0.089    -0.179    -0.178     0.195    -0.088     0.997     0.995
(t-stat)         (0.08)   (0.031)  (-0.053)  (-0.054)   (0.061)  (-0.026)
Const             1.802     2.245     1.736     1.713     1.822     2.875     0.282     0.336
(t-stat)        (2.208)   (1.985)   (2.306)    (2.41)   (1.941)   (2.519)
$R^2$             0.543     0.480     0.581     0.613     0.505     0.462
$\bar{N}$          2364       473       473       473       473       472
Joint p-val                                                                   0.000     0.014

The tables in this section answer the question posed in the title firmly in the negative: accruing the “on-paper” risk premium associated with risk factors is difficult (Table 2) and funds vary significantly in their ability to do so (Tables 3 to 6). We show in Section 5.1 that these conclusions also hold when using “long-only” versions of the Fama-French (1992)-Carhart (1997) factors.

The abundance of variables available for ex ante sorting immediately raises the further question of whether one variable is more powerful than the others, and whether a combination of variables could perform better than any single variable. In the next section we adopt a popular method from machine learning to address these questions.

3.3 Regression trees for learning about skill and efficiency

The analysis in the previous section relies on the selected sorting variable (e.g., $R^2_{FF4}$ or TNA) and thresholds (quintiles, in the previous section) being a useful way of grouping mutual funds. An alternative to selecting the sorting variable and thresholds ex ante is to let the data determine the optimal grouping of funds. “Regression trees” (Breiman et al., 1984; Breiman et al., 2017) are perfectly suited to this task. Given a set of state variables, possibly a large number, a regression tree sequentially selects the “splitting” variable and the split threshold that maximize the fit of the overall model. The procedure continues to split the data until the number of splits has reached a pre-determined maximum, or until another constraint (e.g., a constraint on the minimum number of observations in each split) is binding. Details on our implementation of regression trees are presented in Appendix D.

Regression trees naturally nest the ex ante splits considered in the previous section: the tree could select the same splitting variable (e.g., TNA) for all splits, and could select the quintiles of that variable as thresholds.
However, with a variety of splitting variables to choose from, and flexibility on the threshold for each variable, such an outcome is unlikely.

We consider a total of 16 state variables across four categories: past performance (fund returns over the past 3 and 12 months, and cross-sectional ranks of fund returns over the past 3 and 12 months); flows (net flows into/out of the fund over the past 1, 3, and 12 months, and cross-sectional ranks of these flows); fund characteristics (cross-sectional ranks of TNA, $R^2_{FF4}$, and $R^2_{CAPM}$); fund holdings characteristics (cross-sectional ranks of the value-weighted average of the stocks in each fund’s portfolio, according to size, value and momentum). We consider a grid of 19 thresholds for each of these variables, based on their 0.05, 0.1, ..., 0.95 quantiles.

It is possible to estimate a separate tree for each cross-section in our sample; however, this can lead to over-fitting, as the tree structure changes more frequently than is economically plausible. To address this, we impose that the tree structure is the same within each calendar year.¹⁰ This allows for low-frequency changes in the tree, and improves estimation accuracy by using 12 months of data to estimate the structure of the tree.

Regression trees, like other machine learning methods, over-fit in-sample and require regularization. We use five-fold cross validation to choose the number of splits in the tree as follows: we randomly assign funds into five equally-sized groups, and then use four of these groups to estimate the tree structure with the number of leaves ranging from one (corresponding to no splits at all) up to ten, and finally we evaluate the fit on the fifth group. We repeat this so that each one of the groups is used as the out-of-sample group, and we sum the out-of-sample fit across all five “folds.” We choose the number of leaves that maximizes the out-of-sample fit. Table A4 in Appendix E reports the number of leaves selected for each year. The optimal number of leaves ranges from one (where no tree structure is used, and the model reduces to simple OLS) to ten, and the average number of leaves is 4.6.

As an example, Figure 2 presents the optimal tree for 2023.
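
The split search described in this section can be sketched for a single split as follows. This is a minimal illustration under stated assumptions, not the implementation in Appendix D: the names `ols_sse` and `best_split`, the use of the sum of squared residuals as the fit criterion, and the `min_leaf` constraint are hypothetical. For each state variable and each of the 19 quantile thresholds, the sample is divided in two, a separate regression of returns on a constant and the betas is fitted in each part, and the split with the smallest total sum of squared residuals is kept.

```python
import numpy as np

def ols_sse(X, y):
    """Sum of squared residuals from an OLS fit of y on X (X has a constant)."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    return float(resid @ resid)

def best_split(returns, betas, state, min_leaf=20):
    """Search state variables and quantile thresholds for the best single split.

    returns : (n,) fund returns
    betas   : (n, K) factor loadings
    state   : (n, S) candidate splitting variables
    Returns (variable index, threshold, total SSE); the index and threshold
    are None if no admissible split improves on the unsplit regression.
    """
    n, n_state = state.shape
    X = np.column_stack([np.ones(n), betas])
    grid = np.linspace(0.05, 0.95, 19)          # 19 quantile thresholds
    best = (None, None, ols_sse(X, returns))    # start from the no-split fit
    for j in range(n_state):
        for thr in np.quantile(state[:, j], grid):
            left = state[:, j] <= thr
            if left.sum() < min_leaf or (~left).sum() < min_leaf:
                continue                        # skip splits with tiny leaves
            sse = ols_sse(X[left], returns[left]) + ols_sse(X[~left], returns[~left])
            if sse < best[2]:
                best = (j, float(thr), sse)
    return best
```

Applying `best_split` recursively to the resulting leaves, and choosing the number of leaves by five-fold cross validation, would give a tree of the kind described above; those steps are omitted here to keep the sketch short.
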
The optimal number of leaves for 2023 is four. The first split separates funds into the top 20% and bottom 80% of funds by performance over the past three months. The top-performing funds are then split again by past three-month performance, into the top 1% of performers and the next 19%. On the other branch, the lower-performing 80% of funds are split according to how well the Fama and French (1992)-Carhart (1997) model explains their returns.

¹⁰ Naturally we allow the regression parameters to vary each month, capturing variation in realized factor returns.

Figure 2: Regression tree for 2023. This figure describes the optimal tree for the last year of our sample period, 2023. The number of leaves is determined by five-fold cross-validation. The sample is split using the state variable and threshold given in the rectangles. The number of fund-months in each “leaf,” as well as the proportion of the sample it represents, is reported.

We also consider random forests (Breiman, 2001), a popular extension of regression trees, to capture heterogeneity in mutual fund efficiency. Random forests have been found to produce forecasts with lower variance, and to have greater robustness to noise. Using the optimal number of leaves from the cross-validation of the regression tree, we implement the random forest using 500 bootstrap samples of the original data, obtaining the regression tree forecast on each sample, and then averaging the resulting forecasts. We randomly select five of the 16 state variables to use in each bootstrapped tree.

Table 7 compares the cross-sectional $R^2$ from the seven methods for capturing variation

in mutual fund efficiency. We show the average $R^2$ across all 300 months, as well as the 25th, 50th, and 75th percentiles. Panel A shows that the average fit using standard OLS is 54%. This rises to 56% when splitting funds by TNA quintiles, and rises further to 58% when splitting funds by $R^2_{FF4}$ quintiles. Sorting by the fund holdings characteristics ME and BM also leads to $R^2$s of between 58% and 59%. The regression tree and random forest extensions have the highest average fits, at 61% and 62%. Note that the tree achieves a better fit with fewer parameters than the quintile sorts, as the latter estimate a total of 25 parameters (four risk premia and an intercept, for each of the five quintiles), while the tree model has an average of 4.56 leaves, and so is based on an average of 22.8 parameters. It is noteworthy that the two machine learning-based methods, when regularized using cross validation, do not have dramatically higher fits than the best ex-ante sorts, indicating that the ex-ante sorts capture much of the heterogeneity in the sample.

Panel B of Table 7 presents results when using monthly betas rather than realized betas. We see that the improvements in fit across the cross-sectional models for efficiency are qualitatively the same for these betas. Comparing the average $R^2$s across the two panels, we observe that using realized betas yields $R^2$s that are around 3.5 percentage points higher than when using monthly betas in the same model, consistent with the former being more accurate measures of mutual fund factor exposures.

Table 7: Cross-sectional fits. This table presents summary statistics on the $R^2$ from Fama-MacBeth (1973) second-stage regressions for mutual fund returns. These regressions are run each month and this table presents the mean and three quartiles of these values across the 300 months in our sample. The first column presents results for the model assuming homogeneity in efficiency, using OLS on the entire sample of mutual funds. The next five columns use OLS separately on quintiles of funds sorted by their time series $R^2$ from the Fama-French (1992)-Carhart (1997) four-factor model, their total net assets, or the ME, BM and MO characteristics of their stock holdings. The last two columns present results using the regression tree and random forest models described in Section 3.3. Panel A presents results using realized betas, which are constructed using daily data in the preceding three months, and Panel B presents results using betas estimated using a 36-month window of returns.

                                   Ex-ante quintile sorts
                Homog. $R^2_{FF4}$         TNA          ME          BM          MO        Tree      Forest
Panel A: Realized betas
Mean             0.543       0.581       0.556       0.587       0.583       0.579       0.606       0.622
Q25              0.407       0.450       0.418       0.445       0.458       0.437       0.485       0.502
Q50              0.562       0.607       0.576       0.602       0.603       0.592       0.632       0.644
Q75              0.707       0.729       0.714       0.740       0.740       0.738       0.750       0.760
Panel B: Monthly betas
Mean             0.502       0.548       0.517       0.550       0.544       0.546       0.576       0.594
Q25              0.352       0.410       0.372       0.417       0.412       0.416       0.439       0.463
Q50              0.516       0.573       0.533       0.574       0.558       0.562       0.600       0.614
Q75              0.669       0.695       0.682       0.703       0.692       0.702       0.721       0.733

4 Decomposing mutual fund returns

In this section we decompose mutual fund returns into components related to risk premia, efficiency and skill. Recall the proposed model for mutual fund returns:

$$R_{i,t} = R^f_t + \psi_i + \beta_{i,t-1}^\top (\lambda_t + \eta_{i,t}) + \varepsilon_{i,t} \tag{5}$$

We estimate “on-paper” risk premia using the Fama-MacBeth (1973) procedure on a panel of 199 stock portfolios, as displayed in Table 2, and denote these as $\lambda^{\text{Stocks}}_t$. The abnormal return is estimated as the difference between the average return on the fund and the average benchmark return on the fund, equal to the risk free rate plus the risk premia associated with the fund’s factor exposures:

$$\alpha_i = E[R_{i,t}] - E\left[R^f_t + \beta_{i,t-1}^\top \lambda^{\text{Stocks}}_t\right] \tag{6}$$

We estimate $\lambda_{i,t}$ using one of the methods presented in Section 3, and from those we obtain the efficiency for each fund in our sample:

$$\eta_{i,t} = \lambda_{i,t} - \lambda^{\text{Stocks}}_t \tag{7}$$

Finally, we estimate skill as

$$\psi_i = \alpha_i - \Delta_i \tag{8}$$

where

$$\Delta_i \equiv E[\beta_{i,t-1}^\top \eta_{i,t}] \tag{9}$$

is the average aggregate efficiency of fund $i$. We next investigate the properties of these components in the cross-section and time series of mutual fund returns.

4.1 Skill vs. Efficiency

In the presence of inefficiencies in accruing risk premia, the familiar abnormal return is the sum of two terms: the average aggregate (in)efficiency, $\Delta_i \equiv E[\beta_{i,t-1}^\top \eta_{i,t}]$, and a “skill” component unrelated to factor exposures, $\psi_i$. Figure 3 plots average aggregate efficiency against skill for all funds in our sample, using the random forest model to capture heterogeneity. The red circle in Figure 3 denotes the average skill (0.95%) and efficiency (-2.64%) across all funds and all periods. Interestingly, Figure 3 reveals that these measures of fund manager ability are negatively related in the cross section: the correlation is -0.17, and the OLS fitted line has slope -0.12 with a t-statistic of -11.6.¹¹ Thus some mutual funds achieve their abnormal returns by increased efficiency, while others do so via skill.

¹¹ If the estimates of individual mutual fund efficiency were pure noise, uncorrelated in the cross section, then the observed skill-efficiency relationship would also be negative, with correlation equal to $-\sqrt{V[\Delta_i]/V[\psi_i]}$, which equals -0.68 in our sample, compared with the actual correlation of -0.17. The difference arises because our estimates of efficiency are positively correlated with abnormal returns, with correlation of 0.46, despite being computed in a completely different step (cross-sectional rather than time series), consistent with the efficiency estimates not being noise.
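
The decomposition in equations (6) to (9) can be sketched numerically as follows. This is a minimal illustration, assuming a balanced panel and replacing expectations with time-series averages; the function name `decompose_alpha` and the array layout are hypothetical and do not come from the paper.

```python
import numpy as np

def decompose_alpha(excess_ret, betas, lam_fund, lam_stocks):
    """Split each fund's abnormal return into aggregate efficiency and skill.

    excess_ret : (T, N) fund returns minus the risk-free rate
    betas      : (T, N, K) lagged loadings beta_{i,t-1}
    lam_fund   : (T, N, K) fund-specific premia lambda_{i,t}
    lam_stocks : (T, K) on-paper premia lambda^Stocks_t
    Returns (alpha, delta, psi), each of shape (N,).
    """
    eta = lam_fund - lam_stocks[:, None, :]                     # eq. (7)
    benchmark = np.einsum("tnk,tk->tn", betas, lam_stocks)      # beta' lambda^Stocks
    alpha = (excess_ret - benchmark).mean(axis=0)               # eq. (6)
    delta = np.einsum("tnk,tnk->tn", betas, eta).mean(axis=0)   # eq. (9)
    psi = alpha - delta                                         # eq. (8)
    return alpha, delta, psi
```

By construction `alpha` equals `psi + delta` for every fund, which is the identity underlying the "zero abnormal return" line where skill and efficiency sum to zero.
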

Figure 3: Skill and efficiency of US mutual funds. This figure plots skill ($\psi_i$) and aggregate efficiency ($\beta_i^\top \eta_i$), measured in annualized percent, for all mutual funds in our sample, averaged over all months for which they are in the sample. The thin dashed line denotes the “zero abnormal return” line, where skill and efficiency sum to zero. The thick dashed line is the OLS regression line. The red circle denotes the average skill and efficiency across all funds and all periods.

[Figure 3: scatter plot. Horizontal axis: Skill. Vertical axis: Efficiency. Legend: AbRet=0, OLS fit.]

Relatively few funds exhibit both positive efficiency and positive skill, and thus the north-east quadrant of Figure 3 is sparsely populated.

Next we investigate the statistical significance of these measures of fund manager ability. Table 8 presents a two-way contingency table, categorizing funds as having negative, zero, or positive skill, and negative, full, or positive efficiency, by testing whether each of these two measures is equal to zero at the 0.10 level. The right column of Table 8 presents the proportion of funds with positive, zero, and negative abnormal returns. We see that one-third of funds (33.2%) are statistically fully efficient, a negligible fraction (2.9%, when the level of the test is 10%) are positively efficient, and nearly two-thirds of funds are significantly inefficient. In contrast, around two-thirds of funds have zero skill, about 30% exhibit positive skill, and a negligible proportion are negatively skilled. Just over half of funds live on the anti-diagonal of this contingency table, consistent with the negative correlation between these measures of ability presented in Figure 3. The modal cell in the table is the zero skill/negative efficiency combination, where 39.7% of funds reside.¹² This reveals that the negative average abnormal return in our sample of mutual funds, -1.58%, is driven primarily by funds’ inability to fully capture the risk premia associated with the factors to which they are exposed. We show in Section 5.1 that similar results are obtained when we use “long-only” versions of the factors.

¹² Table A6 reports very similar figures when using the regression tree or homogeneous (OLS) models.

Table 8: Skill, efficiency, and abnormal returns: Statistical significance. This table presents results on the proportion of mutual funds that have skill, efficiency, and abnormal returns that are significantly different from zero at the 10% level. Estimates of skill and efficiency are based on the random forest model described in Section 3.3. “Full efficiency” corresponds to an estimate that is not different from zero. The left panel shows a two-way contingency table for skill and efficiency, and the right column shows the results for abnormal returns.

                              Efficiency                       Abnormal
Skill         Negative      Full  Positive     Total            Return
Negative         0.008     0.023     0.001     0.033             0.312
Zero             0.397     0.267     0.006     0.671             0.610
Positive         0.233     0.042     0.022     0.296             0.078
Total            0.639     0.332     0.029         1                 1

4.2 The persistence of skill and efficiency

There is a large literature (e.g., see Brown and Goetzmann, 1995 and Carhart, 1997 for early work in this area) on whether the performance of mutual fund managers is persistent,

that is, whether past performance is predictive of future performance. We investigate this in our sample for each of the three measures of ability: skill, efficiency, and abnormal returns. Given that our tree-based models impose the same tree structure within each calendar year, we study the persistence of yearly aggregates of these measures.¹³ We do so via a simple panel autoregressive (AR) model of the form:

$$X_{i,y+1} = c + \phi X_{i,y} + e_{i,y+1} \tag{10}$$

where $X_{i,y}$ denotes one of the measures of ability for fund $i$ in year $y$: skill ($\psi_{i,y}$), efficiency ($\Delta_{i,y}$) or abnormal return ($\alpha_{i,y}$).

Table 9 reveals a striking difference in the persistence of these measures of ability. Skill is essentially uncorrelated from year to year, with an AR coefficient of 0.05, and an $R^2$ of 0.2%. Efficiency, on the other hand, is more persistent, with an AR coefficient of 0.13 and $R^2$ of 1.8%. The abnormal return, which is the sum of these two measures, lies in the middle, with an AR coefficient of 0.10 and an $R^2$ of 1.1%.¹⁴ Thus the component of abnormal returns that drives its persistence from year to year is efficiency.

4.3 Predicting abnormal returns with skill and efficiency

Mutual fund investors cannot consume skill and efficiency separately; what they take to the bank is the sum of these, the abnormal return. We next show that even if interest lies only in abnormal returns, investors can improve their decisions by decomposing abnormal returns into skill and efficiency.

The differences in persistence across the measures of ability documented in the previous section suggest that they may have differing predictive power for future abnormal returns.

¹³ Doing this at the monthly level would introduce look-ahead bias in all months other than December.

¹⁴ The regression tree and homogeneous models lead to similar results, see Table A7.
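
The persistence regression in equation (10) can be sketched as a pooled OLS regression of each fund's measure in year y+1 on its value in year y. The function name `panel_ar1` and the convention of marking missing fund-years with NaN are assumptions made for illustration; the Thompson (2011) two-way clustered standard errors used in the paper are not computed here.

```python
import numpy as np

def panel_ar1(x):
    """Pooled AR(1) fit for a yearly ability measure.

    x : (N, Y) array, one row per fund and one column per year,
        with NaN where a fund is not in the sample.
    Returns (c, phi, r2) from regressing x[i, y+1] on a constant and x[i, y].
    """
    lag = x[:, :-1].ravel()
    lead = x[:, 1:].ravel()
    ok = ~np.isnan(lag) & ~np.isnan(lead)     # keep pairs observed in both years
    lag, lead = lag[ok], lead[ok]
    X = np.column_stack([np.ones(lag.size), lag])
    coef = np.linalg.lstsq(X, lead, rcond=None)[0]
    resid = lead - X @ coef
    dev = lead - lead.mean()
    r2 = 1.0 - (resid @ resid) / (dev @ dev)
    return float(coef[0]), float(coef[1]), float(r2)
```

Running this separately on yearly skill, efficiency, and abnormal return panels would produce the three sets of coefficients compared in this section, under the assumptions stated above.
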

Table 9: Persistence of measures of mutual fund manager ability. This table presents parameter estimates from a panel autoregressive model of order one for annualized measures of mutual fund manager ability, given in the column titles, estimated using the tree model described in Section 3.3. All t-statistics use Thompson (2011) standard errors, clustered by firm and time.

                                            Abnormal
                   Skill    Efficiency        Return
AR(1)              0.047         0.131         0.102
(t-stat)         (5.796)      (18.501)      (10.293)
Const              1.871        -2.579        -0.897
(t-stat)        (40.890)     (-76.936)     (-23.784)
Obs.              43,039        43,039        43,039
$R^2$ (%)          0.234         1.805         1.108

We investigate this in Table 10 via simple linear regressions.¹⁵ The first column of Table 10 replicates the AR model for abnormal returns in Table 9 and is included for ease of reference. The second column decomposes lagged abnormal returns into lagged skill and lagged efficiency, and reveals that they differ in their ability to predict future abnormal returns. The coefficient on efficiency is almost double that on skill, at 0.15 compared with 0.09. A test that these coefficients are equal, and thus that this model collapses to the AR(1) model in column 1, rejects the null at the 0.01 level.¹⁶ Thus if abnormal returns are the sum of a persistent component (efficiency) and a transitory component (skill), a better forecast of the sum can be obtained by emphasizing the persistent component and de-emphasizing the transitory component.

¹⁵ Kaniel et al. (2023) consider sophisticated neural networks for forecasting mutual fund abnormal returns, based on past returns and a large number of stock and fund characteristics. This paper complements the work by Kaniel et al. in that we find evidence that one component of abnormal returns (efficiency) is more persistent than the other, and thus a sophisticated forecasting framework like theirs may be improved by exploiting this feature.

¹⁶ Recall that since abnormal returns are the sum of skill and efficiency, we cannot include all three measures of ability in the same regression, as that leads to perfect multicollinearity.

Table 10: Predicting abnormal returns. This table presents results from panel regressions to predict one-year-ahead abnormal returns. The first specification is a first-order autoregression, and corresponds to the last column in Table 9. The second specification uses lagged skill and efficiency as predictors. The third specification decomposes lagged abnormal returns into positive and negative components. The fourth specification decomposes skill and efficiency into their positive and negative components. All t-statistics use Thompson (2011) standard errors, clustered by firm and time.

                           (1)         (2)         (3)         (4)
Abnormal Return          0.102
(t-stat)              (10.293)
Skill                                0.092
(t-stat)                           (9.189)
Efficiency                           0.153
(t-stat)                          (11.816)
AbRet+                                           0.069
(t-stat)                                       (3.967)
AbRet−                                           0.129
(t-stat)                                       (7.151)
Skill+                                                       0.162
(t-stat)                                                  (12.055)
Skill−                                                       0.007
(t-stat)                                                   (0.318)
Eff+                                                         0.000
(t-stat)                                                   (0.000)
Eff−                                                         0.228
(t-stat)                                                  (15.618)
Const                   -0.897      -0.734      -0.739      -0.793
(t-stat)             (-23.784)   (-14.692)    (-9.993)    (-9.194)
Obs.                    43,039      43,039      43,039      43,039
$R^2$ (%)                1.108       1.290       1.163       1.691
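
The four specifications in Table 10 can be sketched as regressor sets built from lagged skill and lagged efficiency. This is a hedged illustration only: the names `pos_neg`, `design`, and `fit_r2` are hypothetical, and the positive and negative components are assumed here to be `max(x, 0)` and `min(x, 0)`, which is one natural definition and may differ from the paper's. Clustered standard errors are again omitted.

```python
import numpy as np

def pos_neg(x):
    """Split x into positive and negative parts so that x = pos + neg."""
    x = np.asarray(x, dtype=float)
    return np.maximum(x, 0.0), np.minimum(x, 0.0)

def design(skill, eff, spec):
    """Regressors (with a constant) for specifications 1 to 4.

    skill, eff : 1-D numpy arrays of lagged skill and lagged efficiency
    """
    ab = skill + eff                      # lagged abnormal return
    if spec == 1:
        cols = [ab]
    elif spec == 2:
        cols = [skill, eff]
    elif spec == 3:
        cols = list(pos_neg(ab))
    elif spec == 4:
        cols = [*pos_neg(skill), *pos_neg(eff)]
    else:
        raise ValueError("spec must be 1, 2, 3 or 4")
    return np.column_stack([np.ones(len(ab))] + cols)

def fit_r2(X, y):
    """OLS coefficients and R-squared of y on X."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    dev = y - y.mean()
    return coef, 1.0 - (resid @ resid) / (dev @ dev)
```

Because specification 2 nests specification 1, and specification 4 nests specification 2, the in-sample fit cannot fall when moving to the richer model; the question the table addresses is how large the gain is and which components carry it.
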

Motivated by evidence (see, for example, Brown and Goetzmann, 1995, and Carhart, 1997) that persistence in mutual fund manager performance is mostly driven by the continued poor performance of poorly-performing funds, we next investigate whether there is asymmetry in the predictive power of these measures of ability. In column three of Table 10 we decompose lagged abnormal returns into positive and negative components. Consistent with the existing literature, we see that the coefficient on lagged negative abnormal returns is larger than that on lagged positive abnormal returns (0.13 versus 0.07, a difference that is significant at the 0.01 level). It is interesting to note that while decomposing lagged abnormal returns into positive and negative components improves the performance of the model, increasing the $R^2$ from 1.11% to 1.16%, the decomposition in column 2, into skill and efficiency, leads to a larger improvement in performance, with an $R^2$ of 1.29%.

In the last column of Table 10 we decompose lagged skill and efficiency into positive and negative parts. The fit of this model is improved relative to treating negative and positive parts the same (the $R^2$ rises from 1.29% to 1.69%). Table 10 shows that it is positive skill and negative efficiency that are predictive for future abnormal returns. The coefficients on negative skill and positive efficiency are not different from zero (t-statistics of 0.3 and 0.0 respectively). This is strong evidence for combining the known asymmetry in the predictive power of positive and negative past performance with the decomposition of abnormal returns into skill and efficiency proposed in this paper.¹⁷

¹⁷ Table A8 presents corresponding results using the regression tree and homogeneous models, and shows that the conclusions are very similar using these other two methods.

4.4 Common and relative (in)efficiency

Finally, we consider disaggregating the efficiency of a given fund into a component that is common to all funds and the “relative efficiency” of the fund when compared with the average fund. Recall that the usual Fama-MacBeth (1973) second-stage regression can

be interpreted as estimating the average risk premia earned by mutual funds, which was dubbed the “homogeneous” model in Section 3.1. Denote this estimate as $\bar{\lambda}_t$. With the fund-specific estimates of risk premia captured by the regression tree or random forest models, $\lambda_{i,t}$, we can then obtain:

$$
\begin{aligned}
\Delta_{i,t} &\equiv \beta_{i,t-1}^\top (\lambda_{i,t} - \lambda^{\text{Stocks}}_t) && (11) \\
&= \beta_{i,t-1}^\top (\bar{\lambda}_t - \lambda^{\text{Stocks}}_t) + \beta_{i,t-1}^\top (\lambda_{i,t} - \bar{\lambda}_t) && (12) \\
&\equiv \bar{\Delta}_{i,t} + \tilde{\Delta}_{i,t}
\end{aligned}
$$

All of the cross-sectional variation in $\bar{\Delta}_{i,t}$ is due to variation in factor exposures and the differences across factors in mutual funds’ average efficiency. $\tilde{\Delta}_{i,t}$ measures variation in efficiency relative to the average mutual fund, as captured by the regression tree or random forest models.

Table 11 shows that average efficiency, $\bar{\Delta}_{i,t}$, is useful for predicting future abnormal returns, with a t-statistic of 4.4. Columns 3 and 4 of Table 11 show that relative efficiency, $\tilde{\Delta}_{i,t}$, computed using the tree-based models, is also useful, with coefficients more than double that of average efficiency and t-statistics of 3.6 and 8.8. This table reveals that the cross-sectional variation in efficiency captured by the tree-based models has predictive power beyond the average efficiency captured by the homogeneous model. That is, by exploiting information on mutual funds’ characteristics (such as TNA, $R^2_{FF4}$, and characteristics of the stocks they hold) we can obtain better estimates of the risk premia earned by individual funds.
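
The split in equations (11) and (12) is a simple add-and-subtract identity, which the following sketch makes explicit. The function name `split_efficiency` and the array shapes are hypothetical; the inputs are assumed to be already-estimated premia from the homogeneous model, a tree-based model, and the stock portfolios.

```python
import numpy as np

def split_efficiency(betas, lam_fund, lam_bar, lam_stocks):
    """Common (average) and relative components of aggregate efficiency.

    betas      : (T, N, K) lagged loadings beta_{i,t-1}
    lam_fund   : (T, N, K) fund-specific premia lambda_{i,t}
    lam_bar    : (T, K) homogeneous-model premia
    lam_stocks : (T, K) on-paper premia from stock portfolios
    Returns (common, relative), each of shape (T, N).
    """
    common = np.einsum("tnk,tk->tn", betas, lam_bar - lam_stocks)
    relative = np.einsum("tnk,tnk->tn", betas, lam_fund - lam_bar[:, None, :])
    return common, relative
```

The two returned arrays sum to the total aggregate efficiency of each fund-month, and the relative component is identically zero when every fund is assigned the homogeneous premia.
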

Table 11: Decomposing mutual fund efficiency: Average and relative efficiency. This table presents results from panel regressions to predict one-year-ahead abnormal returns. The first specification is a first-order autoregression, and corresponds to the last column in Table 9. The second specification adds lagged efficiency estimated using the “OLS” method. The third and fourth specifications add the difference between efficiency estimated using a random forest or a regression tree and that estimated using OLS. All t-statistics use Thompson (2011) standard errors, clustered by firm and time.

                                      (1)         (2)         (3)         (4)
Abnormal return                     0.102       0.096       0.091       0.084
(t-stat)                         (10.293)     (9.486)     (9.162)     (8.297)
Average efficiency                              0.053       0.051       0.056
(t-stat)                                      (4.438)     (4.248)     (4.679)
Relative efficiency (Forest)                                0.102
(t-stat)                                                  (3.574)
Relative efficiency (Tree)                                              0.128
(t-stat)                                                              (8.792)
Const                              -0.897      -0.753      -0.763      -0.761
(t-stat)                        (-23.784)   (-13.829)   (-13.972)   (-14.088)
Obs.                               43,039      43,039      43,039      43,039
$R^2$ (%)                           1.108       1.217       1.308       1.680

5 Extensions

In this section we consider three extensions of the analysis presented above. First, we adopt “long-only” versions of the FF4 factors, and consider whether the heterogeneity in efficiency across mutual funds remains. Second, we consider a larger factor model, the six-factor model of Fama-French (2015)-Carhart (1997). Finally, we consider a method for imposing partial homogeneity on the estimated risk premia, allowing some factors to have premia that vary across mutual funds while other factors have premia that are homogeneous in the cross section.

5.1 Long-only factors

Motivated by questions about whether factors that involve short positions, such as the famous SMB, HML and MOM factors, are the right benchmarks against which to judge mutual fund performance (see, e.g., Berk and van Binsbergen, 2015; DeMiguel et al., 2025), we next consider “long-only” versions of the Fama and French (1992)-Carhart (1997) model. We construct these as the returns on the long legs of these factors (S, H and U) in excess of the risk-free rate. We denote these long-only factors as SMB+, HML+ and MOM+, and we compute realized betas using these factors for use in the second-stage regressions of the Fama and MacBeth (1973) procedure.

Table 12 shows that mutual funds are unable to fully capture the risk premia associated with the long-only factors, with inefficiencies ranging from 1.6% for SMB+ to 2.7% for MOM+. A joint test that all differences in risk premia are zero rejects the null at the 0.01 significance level. Thus even for long-only factors there are impediments to accruing all of the on-paper risk premia.18

18 Compared with Table 2, there are more factors with significant inefficiencies, but some caution is needed before concluding that mutual funds are more inefficient with long-only factors than the original factors. The long-only factors are more correlated with each other than the original factors (correlations ranging from 0.83 to 0.92 compared with 0.03 to 0.35) and this collinearity makes estimating individual factor loadings less precise. A safer way to judge the joint significance of the inefficiency in Tables 2 and 12 is via the F-statistics associated with the p-values reported in the lower-right corners of those tables, which are 46 and 37 respectively. Both are significant at the 0.01 level, but the evidence against the null is stronger for the original factors than the long-only factors.

Next we examine whether mutual funds differ in their ability to accrue the risk premia associated with the long-only factors. In Table 13 we sort funds into quintiles by the market capitalization of the stocks they hold (the same stratification used in Table 5) and test whether risk premia are equal across all quintiles. We find a similar pattern as with the original factors: funds that hold larger-cap stocks are generally better at accruing the market risk premium than funds that hold smaller-cap stocks, but also tend to exhibit less skill. A joint test across all factors rejects the null with a p-value of 0.03. Results for sorts

Table 12: Differences in earned risk premia, long-only factors. This table presents estimates of risk premia associated with “long only” versions of the four risk factors in the Fama-French (1992)-Carhart (1997) model, estimated using the Fama-MacBeth (1973) procedure. Factor loadings (“betas”) are estimated using the prior three months of daily returns, as described in Appendix B. The first column uses 199 stock portfolios described in Section 2.1, and represents the “on-paper” risk premia for these factors. The second column presents estimates using our sample of 4,853 mutual funds, also described in Section 2.1. For both samples the intercept is suppressed. The third column presents the difference between the two estimates, and can be interpreted as the average “(in)efficiency” of mutual funds in our sample for a specific risk factor. The row labeled $R^2$ reports the average of the cross-sectional $R^2$s across the 300 months in the sample. The row labeled $\bar{N}$ reports the average number of assets used to estimate the risk premia each period. The last row reports the p-value on the joint test that all four differences are zero. All t-statistics and p-values use Newey-West (1987) standard errors with five lags.

                    Stocks  Mutual funds    Difference
MKT                  8.758         6.883        -1.875
  (t-stat)         (2.705)       (2.035)      (-4.568)
SMB+                11.502         9.873        -1.629
  (t-stat)         (2.675)       (2.347)      (-2.358)
HML+                11.008         8.521        -2.487
  (t-stat)         (2.586)       (2.005)      (-3.644)
MOM+                11.236         8.579        -2.657
  (t-stat)         (3.005)       (2.187)      (-3.685)
$R^2$                0.320         0.507
$\bar{N}$              199         2,364
Joint p-val                                      0.000

based on other characteristics of the fund are presented in Tables A9 to A12 in Appendix E. We find heterogeneity in efficiency when sorting by $R^2$, TNA, BM and MO, with joint p-values of 0.10, 0.06, 0.01 and 0.09, respectively.

To combine all of the characteristics and optimally choose the thresholds for stratifying funds, we again use regression trees and random forests. We use the same methods

Table 13: Heterogeneity in mutual fund efficiency: Sorts by ME of holdings, long-only factors. This table presents estimates of risk premia associated with “long only” versions of the four risk factors in the Fama-French (1992)-Carhart (1997) model, as well as abnormal returns in the fifth row, estimated using the Fama-MacBeth (1973) procedure. Factor loadings (“betas”) are estimated using the prior three months of daily returns, as described in Appendix B. The first column uses all mutual funds for which we have holdings data, and differs slightly from the second column in Table A3 due to missing data. The next five columns report parameter estimates separately for each quintile of mutual funds sorted by the value-weighted average of the market capitalization (ME) of their stock holdings. The last two columns present the p-values (asymptotic and simulation-based) associated with a test that the parameter estimates in that row are equal across all quintiles. The row labeled $R^2$ reports the average of the cross-sectional $R^2$s across the 300 months in the sample. The row labeled $\bar{N}$ reports the average number of assets used to estimate the risk premia each period. The last row reports the p-value on the joint test that all quintiles have the same parameters. All t-statistics and asymptotic p-values use Newey-West (1987) standard errors with five lags.

                                          ME Quintile                          p-vals
             All funds     Small         2         3         4     Large    Asymp.      Sim.
MKT              5.225     4.512     5.940     4.520     6.145     5.707     0.055     0.076
  (t-stat)     (1.747)   (1.594)   (2.074)    (1.42)   (2.087)   (1.822)
SMB+             8.115     6.877     7.517     7.879     6.722     8.917     0.310     0.339
  (t-stat)     (2.096)   (1.913)   (2.079)   (1.982)   (1.682)    (1.99)
HML+             6.623     6.254     6.152     5.807     5.579     7.212     0.557     0.597
  (t-stat)     (1.727)   (1.798)   (1.708)   (1.353)   (1.461)   (1.771)
MOM+             6.950     5.850     6.831     6.153     6.763     7.842     0.487     0.497
  (t-stat)     (1.897)   (1.654)   (1.942)   (1.641)   (1.803)   (1.934)
Const            1.635     3.175     1.528     1.589     0.805     1.230     0.028     0.043
  (t-stat)     (1.934)   (2.587)   (1.498)   (2.064)   (0.827)   (1.512)
$R^2$            0.539     0.499     0.479     0.613     0.501     0.527
$\bar{N}$         2364       473       473       473       473       472
Joint p-val                                                                  0.002     0.029

Table 14: Predicting abnormal returns relative to long-only factors. This table presents results from panel regressions to predict one-year-ahead abnormal returns based on the long-only factor model. The first specification is a first-order autoregression. The second specification uses lagged skill and efficiency as predictors. The third specification decomposes lagged abnormal returns into positive and negative components. The fourth specification decomposes skill and efficiency into their positive and negative components. All t-statistics use Thompson (2011) standard errors, clustered by firm and time.

                                      (1)        (2)        (3)        (4)
Abnormal Return                     0.086
  (t-stat)                        (8.781)
Skill                                          0.073
  (t-stat)                                   (7.346)
Efficiency                                     0.154
  (t-stat)                                  (11.820)
AbRet+                                                    0.030
  (t-stat)                                              (1.721)
AbRet−                                                    0.130
  (t-stat)                                              (7.460)
Skill+                                                               0.153
  (t-stat)                                                        (10.938)
Skill−                                                              -0.016
  (t-stat)                                                        (-0.780)
Eff+                                                                 0.003
  (t-stat)                                                         (0.087)
Eff−                                                                 0.237
  (t-stat)                                                        (14.949)
Const                              -0.926     -0.728     -0.660     -0.811
  (t-stat)                      (-23.770)  (-14.696)   (-8.831)   (-9.122)
Obs.                               43,039     43,039     43,039     43,039
R2 (%)                              0.795      1.100      0.940      1.574

described in Section 3.3, but focusing on long-only factor premia rather than the premia associated with the original factors. Similar to the fits achieved for the original factor premia, we find that the average cross-sectional $R^2$ increases from the base case assuming homogeneous risk premia when we allow for heterogeneity: from 0.54 for the base case to 0.58 using ME quintiles, 0.59 using regression trees, and 0.61 using random forests.

We use the random forest to decompose mutual fund abnormal returns, relative to the long-only factors, into skill and efficiency components. Similar to the original factors, we find that efficiency is the more persistent component (see Table A13), with an AR(1) coefficient of 0.14, compared with just 0.03 for skill.

Finally, we investigate whether there are gains for predicting abnormal returns using skill and efficiency separately, and report the results in Table 14. Column 1 reports the results of an AR(1) for the abnormal return for reference. Column 2 shows that the coefficient on lagged efficiency is roughly double that on skill, a difference that is significant at the 0.01 level. Comparing the fit in columns 2 and 3 we see that the gains from decomposing lagged abnormal returns into skill and efficiency components are greater than those from decomposing them into positive and negative components. Similar to the results using the original factors, column 4 shows that future abnormal returns are predicted by the positive component of lagged skill and the negative component of lagged efficiency, with the other two signed components having coefficients that are not different from zero.

5.2 An alternative factor model

The analysis above focused on the widely-used Fama-French (1992)-Carhart (1997) four-factor model. In this section we consider a six-factor model, augmenting the four-factor model with the profitability factor (RMW) and investment factor (CMA) of Fama and French (2015). Table 15 compares the risk premia on these six factors as estimated using stock portfolios

Table 15: Differences in earned risk premia, Fama-French (2015)-Carhart (1997) model. This table presents estimates of risk premia associated with the six risk factors in the Fama-French (2015)-Carhart (1997) model, estimated using the Fama-MacBeth (1973) procedure. Factor loadings (“betas”) are estimated using the prior three months of daily returns, as described in Appendix B. The first column uses 199 stock portfolios described in Section 2.1, and represents the “on-paper” risk premia for these factors. The second column presents estimates using our sample of 4,853 mutual funds, also described in Section 2.1. The third column presents the difference between the two estimates, and can be interpreted as the average “(in)efficiency” of mutual funds in our sample for a specific risk factor. The row labeled $R^2$ reports the average of the cross-sectional $R^2$s across the 300 months in the sample. The row labeled $\bar{N}$ reports the average number of assets used to estimate the risk premia each period. The last row reports the p-value on the joint test that all six differences are zero. All t-statistics and p-values use Newey-West (1987) standard errors with five lags.

                    Stocks  Mutual funds    Difference
MKT                  8.940         6.979        -1.961
  (t-stat)         (2.766)       (2.075)      (-5.144)
SMB                  3.003         3.182         0.179
  (t-stat)         (1.607)       (1.758)       (0.257)
HML                  0.956        -0.378        -1.335
  (t-stat)         (0.400)      (-0.142)      (-1.390)
MOM                  1.933         0.237        -1.696
  (t-stat)         (0.612)       (0.072)      (-1.627)
RMW                  3.170         1.284        -1.886
  (t-stat)         (1.856)       (0.708)      (-2.040)
CMA                  1.197         0.581        -0.617
  (t-stat)         (0.674)       (0.319)      (-0.971)
$R^2$                0.389         0.550
$\bar{N}$              199         2,364
Joint p-val                                      0.000

with what is earned by mutual funds. Similar to Table 2, we find a significant gap for the MKT factor, of 2.0% per year. We also observe a significant gap for the profitability factor, which earned an “on-paper” premium of 3.2% over this sample period, while mutual funds exposed to this factor accrued only 1.3% on average, a gap of 1.9%, significant at the 5% level. The remaining four factors did not generate significant on-paper premia over this period, and mutual funds exhibited no significant inefficiencies with respect to those factors.

More relevant for this paper is whether mutual funds exhibit differing abilities to accrue the risk premia associated with these factors. In Table 16, we sort funds according to the market capitalization of the stocks they hold (the “ME” characteristic of Daniel and Titman, 1997). We see that funds which hold larger-cap stocks generally do better at accruing the market risk premium. We also observe heterogeneity in funds’ ability to accrue the investment factor (CMA) premium, with funds that hold mid-cap stocks earning 1.4% compared with 0.5% and 0.7% for the funds holding small-cap and large-cap stocks. As for the four-factor analysis in Table 5, we again see that mutual funds that hold small-cap stocks exhibit greater skill than funds that hold large-cap stocks. The joint test of equal risk premia and equal skill across all ME quintiles rejects the null at the 0.01 level.

Table 17 shows that decomposing lagged abnormal returns into skill and efficiency components leads to better forecasts of future abnormal returns, when combined with sign information. Columns 3 to 5 reveal only marginal gains from decomposing lagged abnormal returns into skill and efficiency components relative to simply using the lagged abnormal return. However, columns 6 to 8 show that decomposing lagged ability into positive and negative components, in the spirit of Brown and Goetzmann (1995) and Carhart (1997), leads to sizeable improvements: applying this sign decomposition to lagged abnormal returns (column 2) leads to a small improvement in fit (an $R^2$ of 1.75% compared with 1.73%), with past poor performance being more informative than past positive performance. Applying

Table 16: Heterogeneity in mutual fund efficiency: Sorts by ME of holdings, Fama-French (2015)-Carhart (1997) model. This table presents estimates of risk premia associated with the six risk factors in the Fama-French (2015)-Carhart (1997) model, as well as abnormal returns in the seventh row, estimated using the Fama-MacBeth (1973) procedure. Factor loadings (“betas”) are estimated using the prior three months of daily returns, as described in Appendix B. The first column uses all mutual funds for which we have holdings data. The next five columns report parameter estimates separately for each quintile of mutual funds sorted by the value-weighted average of the market capitalization (ME) of their stock holdings. The last two columns present the p-values (asymptotic and simulation-based) associated with a test that the parameter estimates in that row are equal across all quintiles. The row labeled $R^2$ reports the average of the cross-sectional $R^2$s across the 300 months in the sample. The row labeled $\bar{N}$ reports the average number of assets used to estimate the risk premia each period. The last row reports the p-value on the joint test that all quintiles have the same parameters. All t-statistics and asymptotic p-values use Newey-West (1987) standard errors with five lags.

                                          ME Quintile                          p-vals
             All funds     Small         2         3         4     Large    Asymp.      Sim.
MKT              5.127     4.455     5.784     4.449     6.202     5.488     0.070     0.091
  (t-stat)     (1.729)   (1.615)    (2.07)   (1.393)   (2.216)   (1.767)
SMB              3.189     2.330     2.498     3.070     1.353     4.075     0.088     0.123
  (t-stat)     (1.757)   (1.239)   (1.445)    (1.77)   (0.789)   (1.972)
HML             -0.605     0.030    -1.116    -0.256    -1.482    -0.928     0.571     0.599
  (t-stat)    (-0.227)   (0.011)  (-0.423)   (-0.09)  (-0.581)  (-0.364)
MOM              0.585     0.531     0.475     0.510     0.509     0.622     1.000     1.000
  (t-stat)     (0.184)   (0.148)   (0.148)    (0.16)   (0.173)   (0.211)
RMW              1.663     2.302     1.936     1.199     1.633     1.409     0.937     0.960
  (t-stat)     (0.929)   (1.085)   (1.019)   (0.585)    (0.82)    (0.75)
CMA              0.532     0.492     0.520     1.418    -0.244     0.655     0.064     0.073
  (t-stat)     (0.281)   (0.267)   (0.278)    (0.76)   (-0.13)   (0.361)
Const            1.809     3.447     1.805     1.758     0.850     1.469     0.003     0.001
  (t-stat)     (2.259)   (3.144)   (1.929)   (2.484)   (0.912)   (1.964)
$R^2$            0.578     0.539     0.534     0.651     0.556     0.568
$\bar{N}$         2364       473       473       473       473       472
Joint p-val                                                                  0.000     0.000

Table 17: Predicting abnormal returns: Forest, Tree and Homogeneous models, from the Fama-French (2015)-Carhart (1997) model. This table presents results from panel regressions to predict one-year-ahead abnormal returns. The first specification is a first-order autoregression, and corresponds to (1) in Table 10. The second specification decomposes lagged abnormal returns into positive and negative components and corresponds to (3) in Table 10. Specifications (3) to (5) use lagged skill and efficiency as predictors, as measured by the Forest, Tree and Homogeneous models. Specifications (6) to (8) decompose skill and efficiency into their positive and negative components. All t-statistics use Thompson (2011) standard errors, clustered by firm and time.

                   (1)       (2)       (3)       (4)       (5)       (6)       (7)       (8)
                                    Forest      Tree      Hom.    Forest      Tree      Hom.
AbRet            0.127
  (t-stat)    (12.681)
AbRet+                     0.107
  (t-stat)               (5.940)
AbRet−                     0.142
  (t-stat)               (8.085)
Skill                                0.122     0.121     0.126
  (t-stat)                        (11.935)  (11.730)   (12.43)
Eff                                  0.145     0.145     0.127
  (t-stat)                        (10.995)  (11.371)   (8.691)
Skill+                                                             0.189     0.187     0.185
  (t-stat)                                                      (13.502)  (13.199)  (13.782)
Skill−                                                             0.023     0.028     0.029
  (t-stat)                                                       (0.892)   (1.117)   (1.171)
Eff+                                                              -0.096    -0.080    -0.237
  (t-stat)                                                      (-2.509)  (-2.331)   (-4.82)
Eff−                                                               0.260     0.246     0.278
  (t-stat)                                                      (16.826)  (16.035)  (18.290)
Const           -0.851    -0.763    -0.793    -0.795    -0.848    -0.609    -0.661    -0.445
  (t-stat)   (-23.134) (-10.711) (-16.123) (-17.069) (-15.655)  (-6.791)  (-7.869)  (-4.581)
Obs.            43,039    43,039    43,039    43,039    43,039    43,039    43,039    43,039
R2 (%)           1.728     1.746     1.754     1.761     1.728     2.459     2.477     2.804

the sign decomposition to skill and efficiency separately leads to large gains, with an $R^2$ of between 2.5% and 2.8% depending on the model. As in the case when using a four-factor model, we see that lagged positive skill and lagged negative efficiency are the most useful predictors of future abnormal returns.

5.3 Allowing for partial homogeneity

Recall that in Table 5 we found significant heterogeneity in the risk premia earned for the market factor and in skill, but not for the other three factors. One might then wish to estimate a model where we impose that the risk premia for some factors are constant in the cross section, but allow for heterogeneity in others. We now present methods to do so.

When using ex ante sorts and allowing all parameters to vary across groups, the model can be estimated separately for each group. When imposing that some parameters are common across all groups, the model must be estimated across all funds simultaneously. This can be accomplished by interacting the intercept and factor exposures with indicators for membership of a group. Let $\gamma_{i,t}$ denote the group to which fund $i$ belongs in month $t$. Then each period we estimate:

$$R_{i,t} - R^{f}_{t} = \sum_{g=1}^{G}\left(\lambda_{0,g,t} + \lambda_{g,t}^{\top}\beta^{Het}_{i,t-1}\right)\mathbf{1}\{\gamma_{i,t} = g\} + \lambda_{t}^{\top}\beta^{Hom}_{i,t-1} + u_{i,t} \qquad (13)$$

where “Het” denotes the subset of factors for which we want to allow heterogeneous risk premia, and “Hom” denotes the subset of factors for which we want to impose homogeneous risk premia.

Table 18 presents results when we sort funds using the average market capitalization of the stocks in their portfolios (ME), and allow for heterogeneity only in market risk premia and skill. As in Table 5, where complete heterogeneity was allowed, we see that funds that invest in larger-cap stocks tend to be better at accruing the market risk premium, with

Table 18: Partial heterogeneity in mutual fund efficiency: Sorts by ME of holdings. This table presents estimates of risk premia associated with the four risk factors in the Fama-French (1992)-Carhart (1997) model, as well as abnormal returns in the fifth row, estimated using the Fama-MacBeth (1973) procedure. Using the method described in Section 5.3, we impose that the risk premia associated with the SMB, HML and MOM factors are homogeneous across mutual funds, and allow for heterogeneity only in the MKT risk premia and the constant. Factor loadings (“betas”) are estimated using the prior three months of daily returns, as described in Appendix B. The first column uses all mutual funds for which we have holdings data. The next five columns report parameter estimates separately for each quintile of mutual funds sorted by the value-weighted average of the market capitalization (ME) of their stock holdings. The last two columns present the p-values (asymptotic and simulation-based) associated with a test that the parameter estimates in that row are equal across all quintiles. The row labeled $R^2$ reports the average of the cross-sectional $R^2$s across the 300 months in the sample. The row labeled $\bar{N}$ reports the average number of assets used to estimate the risk premia each period. The last row reports the p-value on the joint test that all quintiles have the same parameters. All t-statistics and asymptotic p-values use Newey-West (1987) standard errors with five lags.

                                          ME Quintile                          p-vals
             All funds     Small         2         3         4     Large    Asymp.      Sim.
MKT              5.085     4.537     5.354     4.739     5.880     5.464     0.388     0.384
  (t-stat)     (1.696)   (1.608)   (1.887)   (1.475)   (1.986)   (1.778)
SMB              3.149         –         –     2.436         –         –
  (t-stat)     (1.722)         –         –    (1.45)         –         –
HML             -0.490         –         –    -0.750         –         –
  (t-stat)    (-0.183)         –         –  (-0.282)         –         –
MOM              0.256         –         –     0.211         –         –
  (t-stat)      (0.08)         –         –   (0.067)         –         –
Const            1.802     3.334     1.800     1.816     1.003     1.303     0.024     0.040
  (t-stat)     (2.208)   (2.747)   (1.794)   (2.372)   (1.096)     (1.7)
$R^2$            0.543                         0.591
$\bar{N}$         2364                           473
Joint p-val                                                                  0.001     0.000

the estimates ranging from 4.5% for the smallest quintile to 5.5% for the largest, though the test of homogeneity does not reject the null. Funds that invest in small-cap stocks are found to have higher average skill, at 3.3% compared with 1.3% for funds investing in large-cap stocks. A joint test of equal skill and equal market risk premia across all quintiles rejects the null at the 0.01 level.

When using regression trees or random forests to capture heterogeneity in risk premia for a subset of factors, we need a different approach from the simple one above using indicators for group membership, since group membership (i.e., the “leaf” to which a fund belongs) is itself estimated as part of the model. To accomplish this, we use a method motivated by the Frisch-Waugh (1933)-Lovell (1963) (FWL) theorem. We “partial out” the coefficients for which homogeneity is imposed from both mutual fund returns and from the regressors (including the intercept, if applicable) whose coefficients are allowed to vary. Concretely, we first regress returns and the heterogeneous regressors on the subset of regressors whose coefficients are restricted to be constant, and retain the corresponding residuals. In a second stage, we apply regression trees or random forests to these residualized outcomes and residualized regressors to estimate the heterogeneous components. If the tree collapses to a single leaf, the second stage reduces to OLS and by the FWL theorem the resulting coefficient estimates coincide exactly with those obtained from the model estimated jointly in a single stage under homogeneity. If instead the tree contains multiple leaves, the estimated risk premia will differ across leaves for the factors for which heterogeneity is allowed, while homogeneity is imposed for the other factors. We implement this second stage using the same methods described in Section 3.3, allowing only the intercept and the market risk premium to vary in the cross section.
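The partialling-out step can be illustrated with a minimal sketch on simulated data for a single cross section. It covers only the single-leaf case, where the second stage is plain OLS, and checks the FWL equivalence stated above; the tree or forest second stage is not implemented here. All variable names, the helper `residualize`, and the simulated coefficients are assumptions made for illustration.

```python
import numpy as np

def residualize(target, controls):
    """Return residuals from an OLS regression of target on controls."""
    coef, *_ = np.linalg.lstsq(controls, target, rcond=None)
    return target - controls @ coef

rng = np.random.default_rng(0)
n_funds = 500

# Regressors whose coefficients may vary: intercept and market beta.
beta_het = np.column_stack([np.ones(n_funds), rng.normal(1.0, 0.2, n_funds)])
# Regressors whose coefficients are held constant: SMB, HML, MOM betas.
beta_hom = rng.normal(0.0, 0.5, (n_funds, 3))

# Simulated excess returns (coefficients are arbitrary illustrative values).
excess_ret = (beta_het @ np.array([0.2, 0.6])
              + beta_hom @ np.array([0.1, -0.05, 0.02])
              + rng.normal(0.0, 0.1, n_funds))

# Single-stage estimation under full homogeneity.
joint, *_ = np.linalg.lstsq(np.column_stack([beta_het, beta_hom]),
                            excess_ret, rcond=None)

# Stage 1: partial the homogeneous regressors out of returns and of the
# heterogeneous regressors (including the intercept).
ret_resid = residualize(excess_ret, beta_hom)
het_resid = residualize(beta_het, beta_hom)

# Stage 2 with a single leaf: OLS on the residualized data.
two_stage, *_ = np.linalg.lstsq(het_resid, ret_resid, rcond=None)

print(joint[:2], two_stage)
```

With multiple leaves, the same residualized inputs would be passed to the tree or forest, so that only the intercept and market premium can differ across leaves.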
We find an average of 7.1 leaves to be optimal, compared with 4.6 when we allow heterogeneity in all risk premia. This is perhaps indicative that some non-market factors do have heterogeneous risk premia, and the regression tree attempts to overcome the imposed homogeneity by increasing its complexity. It is also consistent with a decrease in the implicit penalty for estimation error that appears in our cross-validation approach for selecting the optimal number of leaves, as fewer parameters are estimated in the partially homogeneous tree model, thereby allowing for a larger optimal tree size.

In Table 19 we consider predicting one-year-ahead abnormal returns using lagged measures of ability from the random forest model with heterogeneity in only the intercept and the market risk premium. Columns 1 and 3 use lagged abnormal returns, which do not depend on the model for efficiency, and so are identical to the corresponding columns in Table 10. In column 2 we again see a significant gain from decomposing lagged abnormal returns into skill and efficiency, and the fit is approximately the same as using these measures from a model allowing complete heterogeneity. When decomposing skill and efficiency into positive and negative components, reported in column 4, we see a small improvement in fit relative to the fully heterogeneous model ($R^2$ of 1.8% versus 1.7%), and we continue to observe that for predicting future abnormal returns the most important components are lagged positive skill and lagged negative efficiency.

Table 19: Predicting abnormal returns, allowing for partial homogeneity of lambdas. This table presents results from panel regressions to predict one-year-ahead abnormal returns. Skill and efficiency are estimated using a random forest model that allows for heterogeneity only in the intercept and the market risk premium, as described in Section 5.3. The first specification is a first-order autoregression, and corresponds to the last column in Table 9. The second specification uses lagged skill and efficiency as predictors. The third specification decomposes lagged abnormal returns into positive and negative components. The fourth specification decomposes skill and efficiency into their positive and negative components. All t-statistics use Thompson (2011) standard errors, clustered by firm and time.

                                      (1)        (2)        (3)        (4)
Abnormal Return                     0.102
  (t-stat)                       (10.293)
Skill                                          0.094
  (t-stat)                                   (9.358)
Efficiency                                     0.148
  (t-stat)                                  (11.911)
AbRet+                                                    0.069
  (t-stat)                                              (3.967)
AbRet−                                                    0.129
  (t-stat)                                              (7.151)
Skill+                                                               0.174
  (t-stat)                                                        (13.462)
Skill−                                                               0.002
  (t-stat)                                                         (0.109)
Eff+                                                                 0.008
  (t-stat)                                                         (0.235)
Eff−                                                                 0.217
  (t-stat)                                                        (17.179)
Const                              -0.897     -0.748     -0.739     -0.894
  (t-stat)                      (-23.784)  (-14.772)   (-9.993)  (-10.416)
Obs.                               43,039     43,039     43,039     43,039
R2 (%)                              1.108      1.275      1.163      1.750

6 Conclusion

The idea that the performance of a fund manager should be gauged by risk-adjusted performance goes back over half a century (Jensen, 1968) and has been extensively used by investors and academics. The number of risks that are considered in such models has grown (e.g., see Fama and French, 1992; Carhart, 1997; Fama and French, 2015 and others) but the assumption that simply earning the risk premia associated with the risks to which a fund is exposed is not a skill has gone largely unquestioned.

Work on implementation costs (e.g., Keim and Madhavan, 1997; Wermers, 2000; Novy-Marx and Velikov, 2016; Patton and Weller, 2020) has revealed that earning the risk premium associated with certain factors, momentum in particular, is difficult, especially for large investors. We define “efficiency” as how well the manager is able to accrue the risk premia associated with a given risk factor, and we show that the familiar abnormal return, or alpha, is made up of two components: “aggregate efficiency,” which is the sum of the fund’s (in)efficiencies across risk factors, weighted by the fund’s exposures to those factors, and “skill,” a component that is unrelated to factor exposures.

Using a variety of methods, including ex ante sorts and machine learning tools, we find significant heterogeneity in efficiency in our sample of nearly 5,000 U.S. mutual funds. Put simply, some funds earn a higher risk premium for exposure to the same risk factor. These differences can be economically large, in addition to being statistically significant. For example, funds with returns that are well-explained by the Fama and French (1992)-Carhart (1997) model accrue 6.2% per year for exposure to the market factor, while funds with more idiosyncratic returns earn only 4.8% for the same exposure.

We study the properties of skill and efficiency in both the cross section and the time series. Across funds, we find that skill and efficiency are negatively correlated, consistent with these being distinct types of manager ability, and each requiring effort to achieve. In the time series we find that efficiency is more persistent than skill, consistent with the former being a more fundamental property of the manager, and with the latter being more attributable to luck. Consistent with this, we find that abnormal returns can be better predicted using lagged skill and efficiency separately, rather than combined.
In fact, the out-of-sample predictive gains from decomposing lagged abnormal returns into skill and efficiency are over three times as large as the gains from exploiting the well-known result (see Brown and Goetzmann, 1995; Carhart, 1997) that abnormal return persistence is driven by poor past performers. Combining these two ideas leads to even greater predictive gains.

Appendix A Data cleaning methods

Following Berk and van Binsbergen (2015) and Patton and Weller (2020), we clean the CRSP mutual fund data in two stages: first at the individual fund level to address missing and erroneous data, and then at the fund group level to apply higher-level sample filters. For simplicity, we refer to ‘fund groups’ as ‘funds’ throughout the main text.

A.1 Fund-level data cleaning

Our data cleaning begins at the individual fund level, addressing inconsistencies and missing values prior to aggregation. We use mutual fund monthly return data from CRSP, covering January 1970 through December 2023, resulting in 9,399,372 observations. We first flag fund-months with returns reported less frequently than monthly. For these, we mark the return as missing if neither adjacent month contains a valid, non-zero return. These infrequent annual returns represent 0.71% of all fund-month observations.

We next clean the total net asset (TNA) values, which are crucial for value-weighting fund returns. Following established practices, we treat TNAs less than or equal to $100,000 as bottom-coded and mark them as missing. We also exclude extremely high values exceeding $1 trillion. These filters result in 12.29% of TNA observations being marked as missing.

To recover missing TNAs, we use a three-step interpolation process. First, we project missing TNAs by compounding the last known TNA using cumulative fund returns since that point, though this does not capture cash flows. Second, when a subsequent valid TNA exists, we calculate the discrepancy between the predicted and actual TNA as the ratio of actual TNA to predicted TNA (minus one) to infer implied flows, which we assume are evenly distributed over time. Specifically, the predicted TNA is adjusted by multiplying it by $(1 + \text{discrepancy})^{s/\Delta t}$, where $s$ is the number of months since the last valid TNA and $\Delta t$ is the total gap length. If no subsequent TNA is available, we assume zero discrepancy. Third, we apply the first two steps in reverse to impute TNAs prior to the first observed value. After interpolation, we again flag any TNAs below $100,000 or above $1 trillion as invalid. These procedures reduce missing TNA values to 3.6% of the dataset.

To adjust for share class differences in fees, we convert net returns into gross returns by adding one-twelfth of the annual expense ratio, following Fama and French (2010). Initially, 19.02% of observations in the fund summary file have missing or non-positive expense ratios. We first fill these gaps using the nearest non-missing value within each CRSP fund number group, reducing the missing rate to 10.6%. We then merge the monthly return and summary files by fund number and calendar quarter, assigning expense ratios to 74.88% of fund-month observations. For the remaining unmatched records, we merge again using fund number and calendar year, averaging the expense ratios within each fund-year. This additional merge raises coverage to 85.49% (or 8,035,470 observations). We remove 136 observations with expense ratios over 50%, as these are likely data errors.

Following Patton and Weller (2020), we remove return outliers, specifically any observation with an absolute return exceeding 100%, which is likely due to reporting mistakes. This results in the exclusion of 162 such observations. We also eliminate any fund-month observation ever flagged as an ETF, ETN, or VAU fund during the fund’s life, based on the “et_flag” variable. These filters exclude approximately 14.3% of the total observations.

A.2 Aggregating funds into fund groups

To identify funds belonging to the same fund group, we begin by imputing missing fund names using the nearest valid name within each CRSP fund number group.
Out of 2,713,579 observations in the fund summary file, this procedure recovers names for 23,184 entries. We remove 285 observations where a name cannot be recovered. A similar approach is used to fill missing ticker symbols, recovering 136,537 out of 379,100 missing tickers. Unlike missing names, we retain records with missing tickers.

We then follow the two-step grouping algorithm detailed by Patton and Weller (2020), Berk and van Binsbergen (2015), and Pastor et al. (2015):

1. Parsing fund names: We identify share classes using three mutually exclusive rules:
• If the fund name contains a semicolon and the part following the last semicolon does not include a forward slash, we use the portion before the last semicolon as the group name.
• If the name contains a forward slash and the substring after the last slash lacks spaces or semicolons, we use the part before the slash.
• If neither condition applies, we assume the name does not include a share class indicator.
Before applying these rules, we adjust for fund names where slashes are not share class delimiters, e.g., “Franklin/Templeton” or “M/M.” To address these, we replace slashes with backslashes in known patterns like T/F, T/E, M/M, L/S, Small/Mid, Long/Short, S/T, and L/T.

2. Creating equivalence classes: We define funds as equivalent if they share an adjusted name or a ticker symbol. Equivalence is treated as transitive: if Fund A shares a name with Fund B, and Fund B shares a ticker with Fund C, all three are grouped together.

Applying this procedure reduces the 72,070 unique fund IDs in the CRSP monthly return file to 28,192 fund groups. Of the total observations in the monthly return file, 5,615 could not be matched to a group and were removed.
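The two grouping steps above can be illustrated with a short sketch. This is not the authors' code: the function names, the tuple layout of the input, and the case-sensitive pattern replacement are assumptions made for illustration. Transitive grouping is implemented here with a union-find structure, which is one standard way to build equivalence classes.

```python
# Illustrative sketch only; names and input layout are hypothetical.
# Patterns in which "/" is not a share-class delimiter (list taken from the text).
PROTECTED = ["T/F", "T/E", "M/M", "L/S", "Small/Mid", "Long/Short", "S/T", "L/T"]


def adjusted_name(name):
    """Strip a share-class suffix from a fund name using the three parsing rules."""
    # Pre-step: turn protected slashes into backslashes (case-sensitive here,
    # which is a simplifying assumption).
    for pat in PROTECTED:
        name = name.replace(pat, pat.replace("/", "\\"))
    # Rule 1: semicolon present and no slash after the last semicolon.
    if ";" in name:
        head, tail = name.rsplit(";", 1)
        if "/" not in tail:
            return head.strip()
    # Rule 2: slash present and no space or semicolon after the last slash.
    if "/" in name:
        head, tail = name.rsplit("/", 1)
        if " " not in tail and ";" not in tail:
            return head.strip()
    # Rule 3: no share-class indicator.
    return name.strip()


def group_funds(funds):
    """funds: list of (fund_id, name, ticker_or_None). Returns {fund_id: group_id}."""
    parent = {fid: fid for fid, _, _ in funds}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    first_seen = {}  # maps ("name", value) or ("ticker", value) to a fund id
    for fid, name, ticker in funds:
        keys = [("name", adjusted_name(name))]
        if ticker:  # records with missing tickers are kept, but add no link
            keys.append(("ticker", ticker))
        for key in keys:
            if key in first_seen:
                union(fid, first_seen[key])
            else:
                first_seen[key] = fid
    return {fid: find(fid) for fid, _, _ in funds}


example = [
    (1, "Alpha Growth Fund; Class A", "AGFAX"),
    (2, "Alpha Growth Fund; Class C", "AGFCX"),
    (3, "Alpha Growth Fd/I", "AGFCX"),      # linked to fund 2 by ticker only
    (4, "Beta T/F Income Fund", None),      # protected slash, no share class
]
groups = group_funds(example)
```

In this made-up example, funds 1 and 2 share an adjusted name and funds 2 and 3 share a ticker, so all three land in one group by transitivity, while fund 4 remains on its own.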

A.3 Fund group-level cleaning

We construct fund group returns and TNAs using component fund IDs. Group returns are value-weighted using one-month lagged TNAs. If a group contains only a single fund, we retain its observation even when the lagged TNA is missing. Group-level TNA is the sum of current TNAs across all component funds. These steps result in 2,548,148 monthly fund-group observations.

To address incubation bias, as noted in Fama and French (2010), we retain only observations after a fund group surpasses a TNA threshold of $10 million (in December 2023 dollars). We do not drop post-threshold data if the TNA later falls below this cutoff. Excluding fund groups that never reach $10 million eliminates 2.22% of group-month observations, while dropping pre-threshold months removes an additional 4.76%.

We apply additional filters to exclude non-equity and international funds. Specifically, we drop any fund whose name includes terms such as “ETF,” “ETN,” “exchange-traded fund,” “exchange traded fund,” “exchange-traded note,” “exchange traded note,” “iShares,” and “PowerShares” (not case sensitive). We also remove funds with terms suggesting non-domestic equity exposure or fixed-income strategies, for instance: “international,” “intl,” “bond,” “emerging,” “frontier,” “rate,” “fixed income,” “commodity,” “oil,” “gold,” “metal,” “world,” “global,” “China,” “Europe,” “Japan,” “real estate,” “absolute return,” “government,” “exchange,” “euro,” “India,” “Israel,” “treasury,” “Australia,” “Asia,” “pacific,” “money,” “cash,” “yield,” “U.K.,” “UK,” “kingdom,” “municipal,” “Ireland,” “LIBOR,” “govt,” “obligation,” “mm,” “m/m,” “diversified” (but not “diversified equity”), and “short term” (not case sensitive). These exclusions reduce the number of valid fund groups from 14,268 to 6,366, leaving 953,001 non-missing monthly return observations for the full period from December 1970 to December 2023.

Next, we restrict our sample to fund groups with at least two years of monthly data during the 1970–2023 period. This filter yields a sample of 5,722 mutual fund groups with 942,779 valid monthly return observations.

To construct realized betas, we require daily returns for mutual fund groups. Using the fund groups and their component funds identified earlier, we retrieve daily returns and TNAs for the component funds from CRSP and compute daily fund group returns as (daily) TNA-weighted averages of the component funds’ daily returns. We then restrict the sample to fund groups with at least two years of daily data during the 1999–2023 period. This filter yields a sample of 5,046 mutual fund groups with valid daily return observations.

We finally impose three additional filters on the sample. First, we remove a small number of funds that have a full-sample annualized standard deviation of less than 1% (these are cash funds that were not detected using the keyword check above). Second, given our focus on factor models, we eliminate funds with fewer than 12 monthly returns over our sample period. Finally, as we rely on the daily mutual fund data to compute realized betas (see Appendix B), but our main focus is on monthly mutual fund returns, we remove funds with mismatches between these databases. Specifically, we construct monthly fund returns from the daily return database, and we remove a small number of funds where the correlation between the monthly database returns and the constructed monthly returns is less than 0.99. This leaves us with a total of 4,853 funds.

B Mutual fund realized betas

The standard method for estimating factor exposures is to use a rolling window of 36 or 60 months of returns; see Ferson (2019) and Kaniel et al. (2023) for example. The beta for month $t$ is estimated using data from $t-36$ to $t-1$:

$$R_{i,s} - R^f_s = \alpha_{i,t-1} + \beta_{i,t-1}^\top F_s + \varepsilon_{i,s}, \quad \text{for } s = t-36, \ldots, t-1 \qquad (14)$$
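As a minimal sketch of the rolling-window regression in equation (14), the snippet below estimates the factor exposures by ordinary least squares over the 36 observations preceding each period. The function names and array layout are hypothetical choices for illustration, and the data in the usage lines are simulated rather than taken from the paper.

```python
import numpy as np


def window_betas(excess_ret, factors):
    """OLS of excess returns on a constant and factor returns over one window.

    excess_ret: array of shape (T,); factors: array of shape (T, K).
    Returns (alpha, betas), with betas of shape (K,).
    """
    X = np.column_stack([np.ones(len(excess_ret)), factors])
    coef, *_ = np.linalg.lstsq(X, excess_ret, rcond=None)
    return coef[0], coef[1:]


def rolling_betas(excess_ret, factors, window=36):
    """Row t holds the beta estimated from rows t-window, ..., t-1 (NaN if unavailable)."""
    T, K = factors.shape
    out = np.full((T, K), np.nan)
    for t in range(window, T):
        _, out[t] = window_betas(excess_ret[t - window:t], factors[t - window:t])
    return out


# Simulated example with known exposures (1.0, 0.5) and no noise.
rng = np.random.default_rng(0)
F = rng.normal(size=(60, 2))
r = 0.1 + F @ np.array([1.0, 0.5])
betas = rolling_betas(r, F, window=36)
```

The same `window_betas` helper could be applied to a window of daily returns instead of monthly returns; only the rows passed in would change.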

In the popular Fama and MacBeth (1973) procedure, the estimated factor exposures, $\beta_{i,t-1}$, are then used as regressors in the second-stage, cross-sectional, regressions:

$$R_{i,t} - R^f_t = \lambda_{0,t} + \lambda_t^\top \beta_{i,t-1} + u_{i,t}, \quad \text{for } i = 1, 2, \ldots, N_t \qquad (15)$$

where $N_t$ denotes the number of funds available at time $t$.

With the availability of daily mutual fund returns, we can exploit advances in the econometrics of high frequency data by estimating factor exposures over shorter windows of time, obtaining more accurate estimates of these important quantities; see Barndorff-Nielsen and Shephard (2004) and Andersen et al. (2006). Lewellen and Nagel (2006) use this “short window regression” approach for stock portfolios, and Akbas and Genc (2020) use it for U.S. mutual funds. Let $d$ denote days and let $D_t$ denote the number of trade days in months $\{t-2, t-1, t\}$, a number that is typically around 65; then we run the regression:

$$R_{i,d} - R^f_d = \tilde{\alpha}_{i,t-1} + \beta_{i,t-1}^\top F_d + \varepsilon_{i,d}, \quad \text{for } d = 1, 2, \ldots, D_{t-1} \qquad (16)$$

The slope coefficients in equation (16) are timely and accurate estimates of the exposures of the fund, and avoid the need to specify a parametric model for their movements over time.

C Simulation-based testing of homogeneity

In Section 3.2 we consider tests of equal risk premia across ex ante groups of mutual funds. For a single risk factor, e.g. MKT, we consider:

$$H_0: \lambda_g^{MKT} = \lambda_h^{MKT} \text{ for all } g, h \qquad (17)$$

$$\text{vs. } H_1: \lambda_g^{MKT} \neq \lambda_h^{MKT} \text{ for some } g, h \qquad (18)$$

where $g$ and $h$ indicate groups of mutual funds. Given that we use quintile sorts, we test a total of $q = 4$ restrictions. For the joint test, across all parameters, we test a total of $(K+1) \times 4 = 20$ restrictions. Asymptotic p-values for these tests are obtained from the $\chi^2_q$ distribution.

To confirm that the asymptotic critical values lead to tests with satisfactory size control, we construct “placebo” groupings of mutual funds and run the tests on those groups. The placebo groupings are formed in an identical way to those formed in Section 3.2, except that we use iid draws from the $U(0,1)$ distribution as the sorting variable, with the same missing data structure present in the real sorting variable imposed on the placebo sorting variable.

Table A1: Comparing asymptotic and simulation-based tests. This table considers tests of equal parameters across quintile groups of mutual funds. The first five rows test each parameter of the model separately, while the last row tests all parameters jointly. Placebo groupings of mutual funds are formed by randomly assigning funds to groups, and averaging results across 1,000 random groupings. The first column of this table presents the rejection rates for these placebo groupings using $\chi^2$ critical values at the 0.05 nominal level. The next two columns present the asymptotic $\chi^2$ critical values for these tests and the critical values obtained from the distribution of test statistics across the 1,000 placebo groupings. The last five columns report the test statistics using each of the five sorting variables considered in Section 3.2.

| | Placebo rejection rate | Critical value (Asymp.) | Critical value (Sim.) | Test statistic: $R^2_{FF4}$ | TNA | ME | BM | MO |
|---|---|---|---|---|---|---|---|---|
| MKT | 0.067 | 9.488 | 10.658 | 17.532 | 6.302 | 11.367 | 5.051 | 3.833 |
| SMB | 0.058 | 9.488 | 9.950 | 8.365 | 5.769 | 10.284 | 11.159 | 1.071 |
| HML | 0.060 | 9.488 | 9.978 | 7.708 | 3.750 | 6.676 | 5.023 | 2.520 |
| MOM | 0.067 | 9.488 | 10.554 | 2.643 | 8.512 | 7.530 | 6.442 | 2.309 |
| Const | 0.061 | 9.488 | 10.205 | 1.894 | 14.416 | 0.588 | 0.150 | 3.757 |
| Joint | 0.243 | 31.410 | 40.043 | 42.200 | 42.752 | 51.118 | 47.948 | 36.333 |
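The placebo construction can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, the sort is shown for a single cross-section, and reading the simulated critical value as the $(1 - 0.05)$ empirical quantile of the placebo test statistics is an assumption about how the "Sim." column is obtained.

```python
import numpy as np


def placebo_quintiles(sort_var, rng):
    """Return quintile labels 1..5 from iid U(0,1) draws, with 0 marking missing.

    The placebo variable is missing exactly where the real sorting variable
    is missing, so the missing-data structure is preserved.
    """
    sort_var = np.asarray(sort_var, dtype=float)
    observed = ~np.isnan(sort_var)
    placebo = np.where(observed, rng.uniform(size=sort_var.shape), np.nan)
    labels = np.zeros(sort_var.shape, dtype=int)
    cuts = np.quantile(placebo[observed], [0.2, 0.4, 0.6, 0.8])
    # searchsorted counts how many cut points lie at or below each draw (0..4).
    labels[observed] = np.searchsorted(cuts, placebo[observed], side="right") + 1
    return labels


def placebo_rejection_rate(placebo_stats, asymptotic_cv):
    """Share of placebo test statistics exceeding the asymptotic critical value."""
    return float(np.mean(np.asarray(placebo_stats) > asymptotic_cv))


def simulated_critical_value(placebo_stats, level=0.05):
    """Empirical (1 - level) quantile of the placebo test statistics (assumed definition)."""
    return float(np.quantile(placebo_stats, 1.0 - level))


# Made-up sorting variable with five missing entries.
rng = np.random.default_rng(0)
real_sort = rng.normal(size=100)
real_sort[:5] = np.nan
labels = placebo_quintiles(real_sort, rng)
```

Repeating `placebo_quintiles` 1,000 times, computing the homogeneity test statistic for each grouping, and passing the resulting statistics to the two helper functions would produce quantities of the kind reported in the first and third columns of Table A1.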
Thus in this analysis all groups are random subsets of the population of funds, and the rejection rate of the test across 1,000 random groupings should approximately equal the nominal size of the test. The first column of Table A1 shows that the rejection rates for tests of homogeneity for a single variable (the first five rows) are close to the 5% nominal level, ranging from 5.8% to 6.7%; however, the joint test across all variables has a rejection rate of 24.3%, far above the nominal level.¹⁹ Figure A1 compares the asymptotic distribution and the simulation-based distribution for the joint test, and shows that the latter is a right-shift of the former.

Given the importance of the joint test for determining whether mutual funds significantly differ in their ability to accrue risk premia, and the poor size control of this test when using the asymptotic critical value, we base our conclusions for tests based on ex ante sorts on the simulation-based test results.

¹⁹ The size control of the tests of homogeneity for a single factor is better because they only involve estimating a 4×4 asymptotic covariance matrix, while the joint test across all parameters involves a 20×20 asymptotic covariance matrix. The latter is a challenging task with only 300 months of data. The simulation-based critical values capture the difficulty of estimating this matrix.

Figure A1: Distribution of joint tests of homogeneity. This figure plots the asymptotic distribution of the joint test of homogeneity of all lambdas across quintile groups formed using a sorting variable, and the simulation-based distribution using random group assignments. The F-statistics associated with tests of homogeneity across quintiles formed using one of five sorting variables (MO, $R^2_{FF4}$, TNA, ME and BM) are presented as vertical lines.

[Figure A1 omitted. Horizontal axis: F statistic. Legend: Asymp. dist, Sim. dist, MO, $R^2_{FF4}$, TNA, BM, ME.]

D Regression tree details

This appendix provides technical details on the two tree-based machine learning methods employed in this paper: regression trees and random forests. These methods adapt the partitioning logic of classification and regression trees (CART, Breiman et al., 1984, 2017) to capture cross-sectional heterogeneity in mutual fund efficiency.

D.1 Regression tree

Regression trees recursively partition the feature space into mutually exclusive regions and estimate a parametric model within each region. The CART algorithm constructs such partitions greedily by selecting splits that maximize the improvement in fit at the next stage. We use the sum of squared errors (SSE) from the leaf-specific Fama and MacBeth (1973) second-stage cross-sectional regressions as our measure of fit.

For a calendar year $y$, let $M_y$ denote the 12 months in that year. For each $t \in M_y$, our data consist of the monthly cross-section $\{R_{i,t} - R^f_t, \beta_{i,t-1}, z_{i,t-1}\}_{i=1}^{N_t}$, where $\beta_{i,t-1}$ are the realized betas estimated from the prior three months of daily returns (see Appendix B), $z_{i,t-1} \in \mathbb{R}^{16}$ is a vector of 16 predetermined state variables used to split fund-month observations, and $N_t$ is the number of mutual funds present in month $t$. For each year, we estimate a single regression tree using the pooled fund-month observations from all 12 monthly cross-sections in that year. If $z^{(k)}_{i,t-1}$ is missing, we replace it with the cross-sectional median of $\{z^{(k)}_{i,t-1}\}_{i=1}^{N_t}$ in month $t$.

A regression tree $\mathcal{T}$ partitions the state space into $J(\mathcal{T})$ terminal regions or “leaves”, denoted $\{P_j\}_{j=1}^{J(\mathcal{T})}$, based on the state variable $z_{i,t-1}$. In month $t$, fund $i$ is assigned to leaf $j(i,t)$ if $z_{i,t-1} \in P_{j(i,t)}$. Within each leaf, we estimate the Fama–MacBeth cross-sectional regression with month-$t$ coefficients:

$$R_{i,t} - R^f_t = \lambda_{0,j,t} + \lambda_{j,t}^\top \beta_{i,t-1} + u_{i,t}, \quad \forall\, i : z_{i,t-1} \in P_j. \qquad (19)$$

For a given tree $\mathcal{T}$, define the annual SSE as

$$\mathrm{SSE}(\mathcal{T}) \equiv \sum_{j=1}^{J(\mathcal{T})} \sum_{t \in M_y} \sum_{i:\, z_{i,t-1} \in P_j} \hat{u}_{i,t}(\mathcal{T})^2. \qquad (20)$$

The tree structure is estimated once for each calendar year using the pooled observations from all 12 months in that year, while the leaf-specific premia $\{\lambda_{j,t}\}_{j=1}^{J(\mathcal{T})}$ are estimated separately for each month $t$ conditional on that annual partition.

Starting from the stump tree $\mathcal{T}^{(0)}$ with one leaf, we grow the tree sequentially. At splitting step $s \geq 1$, suppose the current tree $\mathcal{T}^{(s-1)}$ has leaves $\{P^{(s-1)}_1, \ldots, P^{(s-1)}_{J(\mathcal{T}^{(s-1)})}\}$. For each current leaf $P^{(s-1)}_m$, each splitting variable $k \in \{1, \ldots, 16\}$, and each candidate threshold $c$, we proceed as follows.

First, we split $P^{(s-1)}_m$ using the $k$th state variable at threshold $c$, producing child leaves:

$$P^{(s)}_{m,k,c,L} = \{z \in P^{(s-1)}_m : z^{(k)} \leq c\}, \qquad P^{(s)}_{m,k,c,R} = \{z \in P^{(s-1)}_m : z^{(k)} > c\}.$$

We evaluate 19 candidate thresholds, given by the empirical percentiles $\{0.05, 0.10, \ldots, 0.95\}$ of the pooled sample $\{z^{(k)}_{i,t-1} : z_{i,t-1} \in P^{(s-1)}_m,\ t \in M_y,\ i = 1, \ldots, N_t\}$, that is, over all fund-month observations in the candidate leaf during year $y$. We discard a candidate split if, in either child leaf, any month contains fewer than 20 funds.

Second, for each triplet $(m, k, c)$, we re-estimate (19) separately in the two child leaves $P^{(s)}_{m,k,c,L}$ and $P^{(s)}_{m,k,c,R}$, while leaving the regression fits in all other leaves unchanged.

Third, let $\mathcal{T}^{(s)}_{m,k,c}$ denote the candidate tree obtained by replacing $P^{(s-1)}_m$ with its two child leaves and leaving all other leaves unchanged. We compute the resulting annual SSE, $\mathrm{SSE}(\mathcal{T}^{(s)}_{m,k,c})$, and select the split that minimizes this quantity:

$$(m_s, k_s, c_s) = \arg\min_{m,k,c} \mathrm{SSE}\left(\mathcal{T}^{(s)}_{m,k,c}\right).$$

We then set $\mathcal{T}^{(s)} = \mathcal{T}^{(s)}_{m_s,k_s,c_s}$. This procedure is repeated until the tree reaches the number of leaves selected by cross-validation.

We select the number of leaves $J^* \in \{1, \ldots, 10\}$ by five-fold cross-validation. To construct folds, within each of the 12 months in year $y$, we randomly partition fund-month observations into five approximately equally sized groups. We then combine corresponding groups across all months (e.g., group 1 from each month forms fold 1), so that each fold contains a balanced representation of observations from all 12 months. For each candidate $J$, we estimate the annual tree using four folds, evaluate the out-of-sample SSE on the held-out fold, and repeat for all five folds. We select the $J$ that minimizes the total out-of-sample SSE across all folds.

Fund $i$’s efficiency in month $t$ under the regression-tree specification is

$$\eta^{Tree}_{i,t} = \lambda^{Tree}_{j(i,t),t} - \lambda^{Stocks}_t,$$

where $j(i,t)$ is defined by $z_{i,t-1} \in P_{j(i,t)}$, and $\lambda^{Stocks}_t$ is the “on-paper” premium vector estimated from the 199 stock portfolios.
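A single splitting step of the procedure described in this section can be sketched as below, for one candidate leaf. This is a simplified illustration under stated assumptions rather than the authors' code: the function names and array layout are hypothetical, and the example data are simulated. Because the fits in all other leaves are unchanged, comparing splits across several leaves amounts to comparing each leaf's SSE reduction (parent SSE minus the children's SSE).

```python
import numpy as np


def leaf_sse(y, B, month):
    """Sum over months of squared residuals from month-by-month OLS of y on [1, B]."""
    total = 0.0
    for t in np.unique(month):
        rows = month == t
        X = np.column_stack([np.ones(rows.sum()), B[rows]])
        coef, *_ = np.linalg.lstsq(X, y[rows], rcond=None)
        total += float(np.sum((y[rows] - X @ coef) ** 2))
    return total


def min_funds_per_month(mask, month, all_months):
    """Smallest monthly count of observations selected by mask."""
    return min(int(np.sum(mask & (month == t))) for t in all_months)


def best_split(y, B, Z, month, min_funds=20):
    """Search one leaf for the (variable, threshold) pair with the lowest child SSE.

    y: (n,) excess returns; B: (n, K) lagged betas; Z: (n, p) state variables;
    month: (n,) month labels. Returns (sse, k, c), or None if no split is admissible.
    """
    months = np.unique(month)
    levels = np.arange(1, 20) * 0.05  # 0.05, 0.10, ..., 0.95
    best = None
    for k in range(Z.shape[1]):
        for c in np.unique(np.quantile(Z[:, k], levels)):
            left = Z[:, k] <= c
            right = ~left
            # Discard the split if any month has too few funds in either child.
            if min(min_funds_per_month(left, month, months),
                   min_funds_per_month(right, month, months)) < min_funds:
                continue
            sse = (leaf_sse(y[left], B[left], month[left])
                   + leaf_sse(y[right], B[right], month[right]))
            if best is None or sse < best[0]:
                best = (sse, k, float(c))
    return best


# Simulated leaf: the premium on the single beta jumps when state variable 0 exceeds 0.5.
rng = np.random.default_rng(0)
n = 400
month = np.repeat([1, 2], 200)
Z = rng.uniform(size=(n, 2))
B = rng.normal(size=(n, 1))
premium = np.where(Z[:, 0] <= 0.5, 1.0, 3.0)
y = premium * B[:, 0] + 0.01 * rng.normal(size=n)
split = best_split(y, B, Z, month)
```

In the simulated example the search should pick state variable 0 with a threshold near 0.5, since that is where the premium changes by construction.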

D.2 Random forest

Random forests (Breiman, 2001) reduce the sensitivity of a single regression tree to sampling variation by averaging across many bootstrap trees. Let $J^*$ denote the number of leaves selected for the single-tree model in Section D.1. For each $b = 1, \ldots, B$ (with $B = 500$), we draw a bootstrap sample $D^{[b]}$ by resampling fund-month observations with replacement separately within each of the 12 months in year $y$, then stacking the month-specific bootstrap samples. This stratified bootstrap procedure maintains the monthly structure of the data, with each bootstrap sample containing the same number of observations as the original sample.

Using $D^{[b]}$, we estimate a fixed-size tree $\mathcal{T}^{[b]}$ with $J^*$ leaves by the same CART/SSE algorithm as in Section D.1. As in the single-tree case, the tree structure is held fixed within each calendar year. To reduce the correlation between the trees, we follow the literature (e.g., Hastie et al., 2009) and use only a subset of the state variables to estimate each bootstrap tree. Specifically, we randomly select 5 of the 16 state variables and restrict all candidate splits in that tree to this subset.

Given tree $\mathcal{T}^{[b]}$, fund $i$ in month $t$ is assigned to a leaf $j^{[b]}(i,t)$, and the corresponding leaf-specific Fama and MacBeth (1973) second-stage cross-sectional regression yields

$$\eta^{[b]}_{i,t} = \lambda^{[b]}_{j^{[b]}(i,t),t} - \lambda^{Stocks}_t.$$

The random-forest efficiency estimate is then the average across bootstrap trees:

$$\eta^{Forest}_{i,t} = \frac{1}{B} \sum_{b=1}^{B} \eta^{[b]}_{i,t}.$$

Thus, the random forest preserves the interpretation of the single-tree model while producing more stable estimates by averaging estimates across many bootstrap trees.
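The resampling ingredients of the random forest can be sketched in a few lines. The helper names below are hypothetical and the tree-fitting step itself is omitted; the sketch only shows the month-stratified bootstrap, the random choice of 5 of the 16 state variables, and the final averaging across trees.

```python
import numpy as np


def stratified_bootstrap_indices(month, rng):
    """Resample row indices with replacement separately within each month, then stack.

    Every month keeps its original number of observations.
    """
    parts = []
    for t in np.unique(month):
        rows = np.flatnonzero(month == t)
        parts.append(rng.choice(rows, size=rows.size, replace=True))
    return np.concatenate(parts)


def random_feature_subset(n_features, n_keep, rng):
    """Draw n_keep distinct state-variable indices for one bootstrap tree."""
    return np.sort(rng.choice(n_features, size=n_keep, replace=False))


def forest_average(eta_by_tree):
    """Average per-tree efficiency estimates, stacked along the first axis, over trees."""
    return np.mean(np.asarray(eta_by_tree, dtype=float), axis=0)


# Made-up year of data: 12 months with 50 fund-month observations each.
rng = np.random.default_rng(0)
month = np.repeat(np.arange(12), 50)
idx = stratified_bootstrap_indices(month, rng)
subset = random_feature_subset(16, 5, rng)
avg = forest_average([[1.0, 2.0], [3.0, 4.0]])
```

For each bootstrap replication, one would fit a tree on the rows in `idx` using only the columns in `subset`, compute the per-tree efficiency estimates, and pass the stacked results to `forest_average`.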

E Additional tables

Table A2: Differences in earned risk premia, monthly betas. This table presents estimates of risk premia associated with the four risk factors in the Fama-French (1992)-Carhart (1997) model, estimated using the Fama-MacBeth (1973) procedure. Factor loadings (“betas”) are estimated using the prior 36 monthly returns, for comparison with the results using “realized betas” presented in Table 2. The first column uses 199 stock portfolios described in Section 2.1, and represents the “on-paper” risk premia for these factors. The second column presents estimates using our sample of 4,853 mutual funds, also described in Section 2.1. The third column presents the difference between the two estimates, and can be interpreted as the average “(in)efficiency” of mutual funds in our sample for a specific risk factor. The row labeled $R^2$ reports the average of the cross-sectional $R^2$s across the 300 months in the sample. The row labeled $\bar{N}$ reports the average number of assets used to estimate the risk premia each period. The last row reports the p-value on the joint test that all four differences are zero. All t-statistics and p-values use Newey-West (1987) standard errors with five lags.

| | Stocks | Mutual funds | Difference |
|---|---|---|---|
| MKT | 8.667 | 6.555 | -2.112 |
| (t-stat) | (2.697) | (1.991) | (-5.074) |
| SMB | 3.097 | 3.43 | 0.333 |
| (t-stat) | (1.722) | (2.009) | (0.491) |
| HML | 2.212 | 0.996 | -1.216 |
| (t-stat) | (0.846) | (0.367) | (-1.623) |
| MOM | 2.654 | -0.110 | -2.764 |
| (t-stat) | (0.904) | (-0.034) | (-1.632) |
| $R^2$ | 0.274 | 0.479 | |
| $\bar{N}$ | 199 | 2,240 | |
| Joint p-val | | | 0.000 |

Table A3: Differences in earned risk premia, with intercept. This table presents estimates of risk premia associated with the four risk factors in the Fama-French (1992)-Carhart (1997) model, estimated using the Fama-MacBeth (1973) procedure. Factor loadings (“betas”) are estimated using the prior three months of daily returns, as described in Appendix B. The first column uses 199 stock portfolios described in Section 2.1, and represents the “on-paper” risk premia for these factors. The second column presents estimates using our sample of 4,853 mutual funds, also described in Section 2.1, and an intercept (not reported) is included in this specification. The third column presents the difference between the two estimates, and can be interpreted as the average “(in)efficiency” of mutual funds in our sample for a specific risk factor. The row labeled $R^2$ reports the average of the cross-sectional $R^2$s across the 300 months in the sample. The row labeled $\bar{N}$ reports the average number of assets used to estimate the risk premia each period. The last row reports the p-value on the joint test that all four differences are zero. All t-statistics and p-values use Newey-West (1987) standard errors with five lags.

| | Stocks | Mutual funds | Difference |
|---|---|---|---|
| MKT | 8.834 | 5.085 | -3.749 |
| (t-stat) | (2.718) | (1.696) | (-4.695) |
| SMB | 2.575 | 3.149 | 0.573 |
| (t-stat) | (1.378) | (1.722) | (0.79) |
| HML | 1.378 | -0.49 | -1.867 |
| (t-stat) | (0.56) | (-0.183) | (-1.931) |
| MOM | 1.821 | 0.256 | -1.565 |
| (t-stat) | (0.576) | (0.08) | (-1.472) |
| $R^2$ | 0.324 | 0.543 | |
| $\bar{N}$ | 199 | 2364 | |
| Joint p-val | | | 0.000 |

Table A4: Optimal number of leaves in the regression tree. This table presents the number of leaves selected by five-fold cross-validation for each of the years in our sample period. The maximum number of leaves considered is ten. Selecting one leaf corresponds to using OLS on the full sample of funds and not splitting into subgroups.

| Year | Number of leaves | Year | Number of leaves |
|---|---|---|---|
| 1999 | 3 | 2012 | 6 |
| 2000 | 6 | 2013 | 10 |
| 2001 | 4 | 2014 | 7 |
| 2002 | 3 | 2015 | 3 |
| 2003 | 2 | 2016 | 4 |
| 2004 | 4 | 2017 | 5 |
| 2005 | 4 | 2018 | 6 |
| 2006 | 1 | 2019 | 4 |
| 2007 | 2 | 2020 | 6 |
| 2008 | 10 | 2021 | 4 |
| 2009 | 3 | 2022 | 4 |
| 2010 | 5 | 2023 | 4 |
| 2011 | 4 | Average | 4.56 |

Table A5: Heterogeneity in mutual fund efficiency: Sorts by MO of holdings. This table presents estimates of risk premia associated with the four risk factors in the Fama-French (1992)-Carhart (1997) model, as well as abnormal returns in the fifth row, estimated using the Fama-MacBeth (1973) procedure. Factor loadings (“betas”) are estimated using the prior three months of daily returns, as described in Appendix B. The first column uses all mutual funds for which we have holdings data, and differs slightly from the second column in Table A3 due to missing data. The next five columns report parameter estimates separately for each quintile of mutual funds sorted by the value-weighted average of the twelve-month momentum (MO) of their stock holdings. The last two columns present the p-values (asymptotic and simulation-based) associated with a test that the parameter estimates in that row are equal across all quintiles. The row labeled $R^2$ reports the average of the cross-sectional $R^2$s across the 300 months in the sample. The row labeled $\bar{N}$ reports the average number of assets used to estimate the risk premia each period. The last row reports the p-value on the joint test that all quintiles have the same parameters. All t-statistics and p-values use Newey-West (1987) standard errors with five lags.

| | All funds | MO quintile: Down | 2 | 3 | 4 | Up | p-val (Asymp.) | p-val (Sim.) |
|---|---|---|---|---|---|---|---|---|
| MKT | 5.085 | 5.266 | 4.669 | 4.405 | 4.610 | 4.636 | 0.899 | 0.906 |
| (t-stat) | (1.696) | (1.949) | (1.669) | (1.386) | (1.578) | (1.526) | | |
| SMB | 3.149 | 3.658 | 2.581 | 3.382 | 2.526 | 2.806 | 0.641 | 0.667 |
| (t-stat) | (1.722) | (1.852) | (1.355) | (1.914) | (1.315) | (1.49) | | |
| HML | -0.490 | -0.338 | 0.064 | -0.048 | -0.864 | 0.185 | 0.679 | 0.706 |
| (t-stat) | (-0.183) | (-0.132) | (0.025) | (-0.017) | (-0.338) | (0.067) | | |
| MOM | 0.256 | -0.473 | -2.051 | -0.519 | -2.082 | -0.448 | 0.440 | 0.470 |
| (t-stat) | (0.08) | (-0.158) | (-0.695) | (-0.152) | (-0.736) | (-0.152) | | |
| Const | 1.802 | 1.413 | 1.847 | 1.755 | 2.481 | 2.717 | 0.429 | 0.487 |
| (t-stat) | (2.208) | (1.292) | (1.812) | (2.382) | (2.593) | (2.827) | | |
| $R^2$ | 0.543 | 0.482 | 0.498 | 0.618 | 0.501 | 0.473 | | |
| $\bar{N}$ | 2364 | 473 | 473 | 473 | 473 | 472 | | |
| Joint p-val | | | | | | | 0.014 | 0.131 |

Table A6: Skill, efficiency, and abnormal returns: Tree and Homogeneous models. This table presents results on the proportion of mutual funds that have skill, efficiency, and abnormal returns that are significantly different from zero at the 10% level. Estimates of skill and efficiency are based on the regression tree model described in Section 3.3, or standard OLS using the full sample of mutual funds. “Full efficiency” corresponds to an estimate that is not different from zero. The left panel shows a two-way contingency table for skill and efficiency, and the right column shows the results for abnormal returns. Note that abnormal returns are not affected by the model used to estimate skill and efficiency, and so the right columns in each panel are identical.

Panel A: Regression tree estimates

| Skill | Efficiency: Negative | Efficiency: Full | Efficiency: Positive | Total | Abnormal Return |
|---|---|---|---|---|---|
| Negative | 0.008 | 0.024 | 0.002 | 0.033 | 0.312 |
| Zero | 0.377 | 0.298 | 0.018 | 0.693 | 0.610 |
| Positive | 0.205 | 0.042 | 0.026 | 0.274 | 0.078 |
| Total | 0.590 | 0.364 | 0.046 | 1 | 1 |

Panel B: Homogeneous (OLS) estimates

| Skill | Efficiency: Negative | Efficiency: Full | Efficiency: Positive | Total | Abnormal Return |
|---|---|---|---|---|---|
| Negative | 0.009 | 0.024 | 0.002 | 0.034 | 0.312 |
| Zero | 0.393 | 0.242 | 0.006 | 0.642 | 0.610 |
| Positive | 0.261 | 0.056 | 0.008 | 0.324 | 0.078 |
| Total | 0.662 | 0.322 | 0.016 | 1 | 1 |

Table A7: Persistence of measures of manager ability: Regression tree and Homogeneous models. This table presents parameter estimates from a panel autoregressive model of order one for annualized measures of mutual fund manager ability, given in the column titles, estimated using the random forest and boosted regression tree models described in Section 3.3. All t-statistics use Thompson (2011) standard errors, clustered by firm and time.

| | Regression tree: Skill | Regression tree: Efficiency | Homogeneous: Skill | Homogeneous: Efficiency | Abnormal Return |
|---|---|---|---|---|---|
| AR(1) | 0.019 | 0.096 | 0.043 | 0.156 | 0.102 |
| (t-stat) | 2.292 | 13.272 | 4.779 | 25.376 | 10.293 |
| Const | 1.892 | -2.642 | 1.924 | -2.552 | -0.897 |
| (t-stat) | 40.967 | -73.443 | 41.552 | -88.168 | -23.784 |
| Obs. | 43,039 | 43,039 | 43,039 | 43,039 | 43,039 |
| $R^2$ (%) | 0.039 | 0.975 | 0.196 | 2.701 | 1.108 |

Table A8: Predicting abnormal returns: Forest, Tree and Homogeneous models. This table presents results from panel regressions to predict one-year-ahead abnormal returns. The first specification is a first-order autoregression, and corresponds to (1) in Table 10. The second specification decomposes lagged abnormal returns into positive and negative components and corresponds to (3) in Table 10. Specifications (3) to (5) use lagged skill and efficiency as predictors, as measured by the Forest, Tree and Homogeneous models. Specifications (6) to (8) decompose skill and efficiency into their positive and negative components. All t-statistics use Thompson (2011) standard errors clustered by firm and time.

| | (1) | (2) | (3) Forest | (4) Tree | (5) Hom. | (6) Forest | (7) Tree | (8) Hom. |
|---|---|---|---|---|---|---|---|---|
| AbRet | 0.102 | | | | | | | |
| (t-stat) | (10.293) | | | | | | | |
| AbRet+ | | 0.069 | | | | | | |
| (t-stat) | | (3.967) | | | | | | |
| AbRet− | | 0.129 | | | | | | |
| (t-stat) | | (7.151) | | | | | | |
| Skill | | | 0.092 | 0.084 | 0.096 | | | |
| (t-stat) | | | (9.189) | (8.311) | (9.486) | | | |
| Eff | | | 0.153 | 0.171 | 0.149 | | | |
| (t-stat) | | | (11.816) | (14.106) | (10.623) | | | |
| Skill+ | | | | | | 0.162 | 0.148 | 0.164 |
| (t-stat) | | | | | | (12.055) | (10.905) | (12.373) |
| Skill− | | | | | | 0.007 | 0.005 | 0.004 |
| (t-stat) | | | | | | (0.318) | (0.202) | (0.166) |
| Eff+ | | | | | | 0.000 | 0.103 | -0.274 |
| (t-stat) | | | | | | (0.000) | (3.790) | (-5.206) |
| Eff− | | | | | | 0.228 | 0.203 | 0.304 |
| (t-stat) | | | | | | (15.618) | (14.048) | (20.505) |
| Const | -0.897 | -0.739 | -0.734 | -0.673 | -0.753 | -0.793 | -0.925 | -0.375 |
| (t-stat) | (-23.784) | (-9.993) | (-14.692) | (-14.38) | (-13.829) | (-9.194) | (-11.763) | (-3.942) |
| Obs. | 43,039 | 43,039 | 43,039 | 43,039 | 43,039 | 43,039 | 43,039 | 43,039 |
| $R^2$ (%) | 1.108 | 1.163 | 1.290 | 1.594 | 1.217 | 1.691 | 1.861 | 2.311 |

Table A9: Heterogeneity in mutual fund efficiency: $R^2$ sorts, long-only factors. This table presents estimates of risk premia associated with “long only” versions of the four risk factors in the Fama-French (1992)-Carhart (1997) model, as well as abnormal returns in the fifth row, estimated using the Fama-MacBeth (1973) procedure. Factor loadings (“betas”) are estimated using the prior three months of daily returns, as described in Appendix B. The first column uses all mutual funds. The next five columns report parameter estimates separately for each quintile of mutual funds sorted by their time series $R^2$ from the Fama-French (1992)-Carhart (1997) model. The last two columns present the p-values (asymptotic and simulation-based) associated with a test that the parameter estimates in that row are equal across all quintiles. The row labeled $R^2$ reports the average of the cross-sectional $R^2$s across the 300 months in the sample. The row labeled $\bar{N}$ reports the average number of assets used to estimate the risk premia each period. The row labeled $\bar{R}^2_{FF4}$ reports the average $R^2$ from the Fama-French (1992)-Carhart (1997) model for the funds in that column. The last row reports the p-value on the joint test that all quintiles have the same parameters. All t-statistics and asymptotic p-values use Newey-West (1987) standard errors with five lags.

| | All funds | $R^2_{FF4}$ quintile: Low | 2 | 3 | 4 | High | p-val (Asymp.) | p-val (Sim.) |
|---|---|---|---|---|---|---|---|---|
| MKT | 5.225 | 5.241 | 5.324 | 6.261 | 7.028 | 6.246 | 0.156 | 0.194 |
| (t-stat) | (1.747) | (1.675) | (1.864) | (2.04) | (2.293) | (2.159) | | |
| SMB+ | 8.115 | 6.159 | 8.807 | 8.802 | 9.562 | 8.799 | 0.166 | 0.173 |
| (t-stat) | (2.096) | (1.538) | (2.329) | (2.208) | (2.495) | (2.391) | | |
| HML+ | 6.623 | 4.971 | 7.169 | 7.809 | 8.334 | 7.863 | 0.180 | 0.195 |
| (t-stat) | (1.727) | (1.207) | (1.932) | (1.974) | (2.177) | (2.066) | | |
| MOM+ | 6.950 | 4.988 | 7.276 | 7.892 | 8.726 | 7.178 | 0.041 | 0.059 |
| (t-stat) | (1.897) | (1.344) | (2.07) | (2.086) | (2.336) | (2.071) | | |
| Const | 1.635 | 1.922 | 1.758 | 0.804 | -0.200 | 0.401 | 0.006 | 0.008 |
| (t-stat) | (1.934) | (2.355) | (1.502) | (0.805) | (-0.171) | (0.373) | | |
| $R^2$ | 0.539 | 0.461 | 0.472 | 0.542 | 0.574 | 0.666 | | |
| $\bar{N}$ | 2364 | 473 | 473 | 473 | 473 | 472 | | |
| $\bar{R}^2_{FF4}$ | 0.826 | 0.442 | 0.850 | 0.916 | 0.945 | 0.975 | | |
| Joint p-val | | | | | | | 0.014 | 0.099 |

Table A10: Heterogeneity in mutual fund efficiency: TNA sorts, long-only factors. This table presents estimates of risk premia associated with “long only” versions of the four risk factors in the Fama-French (1992)-Carhart (1997) model, as well as abnormal returns in the fifth row, estimated using the Fama-MacBeth (1973) procedure. Factor loadings (“betas”) are estimated using the prior three months of daily returns, as described in Appendix B. The first column uses all mutual funds. The next five columns report parameter estimates separately for each quintile of mutual funds sorted by total net assets. The last two columns present the p-values (asymptotic and simulation-based) associated with a test that the parameter estimates in that row are equal across all quintiles. The row labeled $R^2$ reports the average of the cross-sectional $R^2$s across the 300 months in the sample. The row labeled $\bar{N}$ reports the average number of assets used to estimate the risk premia each period. The row labeled TNA reports the average total net assets (in millions of December 2023 dollars) of the funds in that column. The last row reports the p-value on the joint test that all quintiles have the same parameters. All t-statistics and asymptotic p-values use Newey-West (1987) standard errors with five lags.

| | All funds | Total Net Asset quintile: Small | 2 | 3 | 4 | Large | p-val (Asymp.) | p-val (Sim.) |
|---|---|---|---|---|---|---|---|---|
| MKT+ | 5.222 | 6.072 | 5.080 | 4.753 | 4.994 | 5.098 | 0.187 | 0.187 |
| (t-stat) | (1.744) | (2.045) | (1.694) | (1.535) | (1.671) | (1.718) | | |
| SMB+ | 8.111 | 8.923 | 8.058 | 7.826 | 8.294 | 7.478 | 0.042 | 0.065 |
| (t-stat) | (2.093) | (2.319) | (2.074) | (1.97) | (2.133) | (1.964) | | |
| HML+ | 6.615 | 6.829 | 7.110 | 6.313 | 6.446 | 6.304 | 0.620 | 0.639 |
| (t-stat) | (1.723) | (1.802) | (1.867) | (1.546) | (1.694) | (1.608) | | |
| MOM+ | 6.937 | 7.519 | 6.604 | 7.031 | 6.576 | 6.768 | 0.514 | 0.520 |
| (t-stat) | (1.892) | (2.12) | (1.808) | (1.89) | (1.754) | (1.815) | | |
| Const | 1.630 | 0.831 | 1.764 | 1.870 | 1.792 | 1.862 | 0.170 | 0.169 |
| (t-stat) | (1.93) | (1.052) | (2.224) | (2.205) | (1.864) | (1.853) | | |
| $R^2$ | 0.540 | 0.495 | 0.553 | 0.573 | 0.579 | 0.597 | | |
| $\bar{N}$ | 2344 | 469 | 469 | 469 | 469 | 468 | | |
| TNA | 2186 | 25 | 98 | 301 | 950 | 9562 | | |
| Joint p-val | | | | | | | 0.004 | 0.058 |

Table A11: Heterogeneity in mutual fund efficiency: Sorts by BM of holdings, long-only factors. This table presents estimates of risk premia associated with “long only” versions of the four risk factors in the Fama-French (1992)-Carhart (1997) model, as well as abnormal returns in the fifth row, estimated using the Fama-MacBeth (1973) procedure. Factor loadings (“betas”) are estimated using the prior three months of daily returns, as described in Appendix B. The first column uses all mutual funds for which we have holdings data, and differs slightly from the second column in Table A3 due to missing data. The next five columns report parameter estimates separately for each quintile of mutual funds sorted by the value-weighted average of the book-to-market ratio (BM) of their stock holdings. The last two columns present the p-values (asymptotic and simulation-based) associated with a test that the parameter estimates in that row are equal across all quintiles. The row labeled $R^2$ reports the average of the cross-sectional $R^2$s across the 300 months in the sample. The row labeled $\bar{N}$ reports the average number of assets used to estimate the risk premia each period. The last row reports the p-value on the joint test that all quintiles have the same parameters. All t-statistics and asymptotic p-values use Newey-West (1987) standard errors with five lags.

| | All funds | BM quintile: Growth | 2 | 3 | 4 | Value | p-val (Asymp.) | p-val (Sim.) |
|---|---|---|---|---|---|---|---|---|
| MKT | 5.225 | 4.892 | 5.271 | 4.591 | 5.082 | 3.924 | 0.181 | 0.206 |
| (t-stat) | (1.747) | (1.677) | (1.722) | (1.455) | (1.769) | (1.439) | | |
| SMB+ | 8.115 | 6.196 | 6.992 | 8.475 | 8.173 | 7.293 | 0.028 | 0.036 |
| (t-stat) | (2.096) | (1.593) | (1.719) | (2.185) | (2.193) | (2.101) | | |
| HML+ | 6.623 | 4.842 | 4.636 | 5.691 | 6.932 | 6.667 | 0.256 | 0.289 |
| (t-stat) | (1.727) | (1.293) | (1.203) | (1.344) | (1.964) | (1.899) | | |
| MOM+ | 6.950 | 6.266 | 6.275 | 6.006 | 7.051 | 5.936 | 0.486 | 0.519 |
| (t-stat) | (1.897) | (1.658) | (1.618) | (1.611) | (1.999) | (1.853) | | |
| Const | 1.635 | 1.768 | 1.647 | 1.501 | 1.669 | 2.681 | 0.325 | 0.372 |
| (t-stat) | (1.934) | (1.44) | (2.095) | (2.022) | (1.733) | (2.273) | | |
| $R^2$ | 0.539 | 0.479 | 0.580 | 0.613 | 0.503 | 0.456 | | |
| $\bar{N}$ | 2364 | 473 | 473 | 473 | 473 | 472 | | |
| Joint p-val | | | | | | | 0.000 | 0.006 |

Table A12: Heterogeneity in mutual fund efficiency: Sorts by MO of holdings, long-only factors. This table presents estimates of risk premia associated with “long only” versions of the four risk factors in the Fama-French (1992)-Carhart (1997) model, as well as abnormal returns in the fifth row, estimated using the Fama-MacBeth (1973) procedure. Factor loadings (“betas”) are estimated using the prior three months of daily returns, as described in Appendix B. The first column uses all mutual funds for which we have holdings data, and differs slightly from the second column in Table A3 due to missing data. The next five columns report parameter estimates separately for each quintile of mutual funds sorted by the value-weighted average of the twelve-month momentum (MO) of their stock holdings. The last two columns present the p-values (asymptotic and simulation-based) associated with a test that the parameter estimates in that row are equal across all quintiles. The row labeled R² reports the average of the cross-sectional R²s across the 300 months in the sample. The row labeled N̄ reports the average number of assets used to estimate the risk premia each period. The last row reports the p-value on the joint test that all quintiles have the same parameters. All t-statistics and p-values use Newey-West (1987) standard errors with five lags.

                                         MO Quintiles                          p-vals
             All funds      Down         2         3         4        Up    Asymp.      Sim.
MKT              5.225     5.360     4.811     4.542     4.792     4.978     0.905     0.921
(t-stat)       (1.747)   (1.997)   (1.728)   (1.446)   (1.654)   (1.636)
SMB              8.115     8.752     7.290     7.921     7.189     7.584     0.389     0.446
(t-stat)       (2.096)   (2.402)    (1.99)   (2.006)   (1.917)   (1.935)
HML              6.623     7.137     6.084     5.988     5.831     6.455     0.752     0.771
(t-stat)       (1.727)   (1.988)   (1.729)   (1.391)   (1.588)   (1.604)
MOM              6.950     6.681     4.915     5.605     5.124     6.970     0.073     0.095
(t-stat)       (1.897)   (2.079)   (1.417)   (1.491)    (1.45)    (1.86)
Const            1.635     1.228     1.675     1.553     2.269     2.266     0.569     0.605
(t-stat)       (1.934)   (1.083)   (1.626)   (2.024)   (2.342)   (2.117)
R²               0.539     0.476     0.496     0.617     0.498     0.473
N̄                 2364       473       473       473       473       472
Joint p-val                                                                  0.011     0.088
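One way to read the row-wise equality tests in Table A12 is as a Wald test on the time series of quintile-specific estimates. The sketch below is an assumption about how such an asymptotic test could be set up, not a description of the authors' procedure: it differences each quintile's per-period estimate against the first quintile, estimates a Newey-West (1987) long-run covariance with five lags, and compares the statistic with a chi-square distribution. The helper names are hypothetical, the closed-form chi-square tail used here is valid only for even degrees of freedom (five quintiles give four), and the data are simulated.

```python
import math

import numpy as np


def newey_west_cov(X, lags=5):
    """Newey-West long-run covariance matrix of the columns of X (T x M)."""
    X = np.asarray(X, dtype=float)
    T = X.shape[0]
    U = X - X.mean(axis=0)
    S = U.T @ U / T
    for k in range(1, lags + 1):
        weight = 1.0 - k / (lags + 1.0)
        G = U[k:].T @ U[:-k] / T
        S += weight * (G + G.T)
    return S


def chi2_sf_even(x, df):
    """Upper-tail probability of a chi-square variable with even df."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires even df")
    half = x / 2.0
    terms = sum(half**j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * terms


def equality_wald_test(lam, lags=5):
    """Test that the mean of each column of lam (T x Q) is the same."""
    lam = np.asarray(lam, dtype=float)
    T, Q = lam.shape
    # Differences against the first quintile; all zero in mean under the null.
    d = lam[:, 1:] - lam[:, [0]]
    dbar = d.mean(axis=0)
    S = newey_west_cov(d, lags)
    wald = float(T * dbar @ np.linalg.solve(S, dbar))
    return wald, chi2_sf_even(wald, Q - 1)


# Simulated per-period premia for five quintiles (made-up numbers);
# the last quintile is shifted so the null of equality is false.
rng = np.random.default_rng(1)
T, Q = 300, 5
common = rng.standard_normal((T, 1))
lam = 0.6 + common + rng.standard_normal((T, Q))
lam[:, -1] += 1.0
wald, pval = equality_wald_test(lam)
print(round(wald, 2), pval)
```

Differencing against one quintile removes the component that is common to all quintiles in each period, so the test responds only to differences across quintiles. The simulation-based p-values in the table would require a separate resampling scheme that this sketch does not attempt.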

Table A13: Persistence of measures of manager ability using long-only factors. This table presents parameter estimates from a panel autoregressive model of order one for annualized measures of mutual fund manager ability, given in the column titles, estimated using the random forest, regression tree, and homogeneous models described in Sections 3.1 and 3.3. All t-statistics use Thompson (2011) standard errors, clustered by firm and time.

                   Skill  Efficiency       AbRet
Panel A: Random forest
AR(1)              0.030       0.136       0.086
(t-stat)         (3.702)    (18.651)     (8.781)
Const              1.706      -2.403      -0.926
(t-stat)        (37.300)   (-71.444)   (-23.770)
Obs.              43,039      43,039      43,039
R² (%)             0.097       1.943       0.795
Panel B: Regression Tree
AR(1)              0.015       0.146       0.086
(t-stat)         (1.917)    (22.036)     (8.781)
Const              2.067      -2.675      -0.926
(t-stat)        (43.714)   (-77.065)   (-23.770)
Obs.              43,039      43,039      43,039
R² (%)             0.026       2.215       0.795
Panel C: Homogeneous
AR(1)              0.028       0.192       0.086
(t-stat)         (3.209)    (30.104)     (8.781)
Const              1.832      -2.366      -0.926
(t-stat)        (39.758)   (-80.072)   (-23.770)
Obs.              43,039      43,039      43,039
R² (%)             0.088       4.078       0.795
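The panel AR(1) regressions in Table A13 can be sketched as a pooled regression of each ability measure on a constant and its own lag, with standard errors clustered by firm and time. The Python sketch below uses the basic two-way formula (firm-clustered plus time-clustered minus heteroskedasticity-robust covariance) and omits any small-sample corrections or additional terms; the function names and the simulated panel are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def cluster_meat(X, u, groups):
    """Sum over clusters g of (X_g' u_g)(X_g' u_g)'."""
    scores = X * u[:, None]
    _, idx = np.unique(np.asarray(groups), return_inverse=True)
    idx = idx.ravel()
    sums = np.zeros((idx.max() + 1, X.shape[1]))
    np.add.at(sums, idx, scores)  # accumulate scores within each cluster
    return sums.T @ sums


def panel_ar1_twoway(y, y_lag, firm, time):
    """Pooled OLS of y on a constant and y_lag with firm-and-time clustering."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(y_lag, dtype=float)])
    XtX_inv = np.linalg.inv(X.T @ X)
    coef = XtX_inv @ (X.T @ y)
    u = y - X @ coef
    # Two-way clustering: V = V_firm + V_time - V_white (no corrections).
    meat = (cluster_meat(X, u, firm) + cluster_meat(X, u, time)
            - cluster_meat(X, u, np.arange(len(y))))
    V = XtX_inv @ meat @ XtX_inv
    se = np.sqrt(np.diag(V))
    return coef, se, coef / se


# Simulated annual panel (made-up parameters): 200 funds, 25 years,
# persistence 0.2, constant -1, plus a common shock in each year.
rng = np.random.default_rng(2)
n_firms, n_years, rho, const = 200, 25, 0.2, -1.0
shock = 0.3 * rng.standard_normal(n_years)
y = np.empty((n_firms, n_years))
y[:, 0] = const / (1 - rho) + rng.standard_normal(n_firms)
for t in range(1, n_years):
    y[:, t] = const + rho * y[:, t - 1] + shock[t] + rng.standard_normal(n_firms)

# Stack fund-year observations; row order is fund-major to match the labels.
firm = np.repeat(np.arange(n_firms), n_years - 1)
time = np.tile(np.arange(1, n_years), n_firms)
coef, se, t_stat = panel_ar1_twoway(y[:, 1:].ravel(), y[:, :-1].ravel(), firm, time)
print(np.round(coef, 3), np.round(t_stat, 2))
```

Here `coef[0]` corresponds to the "Const" row and `coef[1]` to the "AR(1)" row. Subtracting the heteroskedasticity-robust term avoids double counting the observations that fall in both a firm cluster and a time cluster.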

Table A14: Persistence of measures of manager ability using Fama-French (2015)-Carhart (1997) factors. This table presents parameter estimates from a panel autoregressive model of order one for annualized measures of mutual fund manager ability, given in the column titles, estimated using the random forest, regression tree, and homogeneous models described in Sections 3.1 and 3.3. All t-statistics use Thompson (2011) standard errors, clustered by firm and time.

                   Skill  Efficiency       AbRet
Panel A: Random forest
AR(1)              0.047       0.086       0.127
(t-stat)         (5.844)    (12.659)    (12.681)
Const              1.756      -2.564      -0.851
(t-stat)        (41.047)   (-76.882)   (-23.134)
Obs.              43,039      43,039      43,039
R² (%)             0.235       0.786       1.728
Panel B: Regression Tree
AR(1)              0.045       0.058       0.127
(t-stat)         (5.410)     (7.341)    (12.681)
Const              1.632      -2.514      -0.851
(t-stat)        (38.529)   (-66.700)   (-23.134)
Obs.              43,039      43,039      43,039
R² (%)             0.224       0.359       1.728
Panel C: Homogeneous
AR(1)              0.068       0.125       0.127
(t-stat)         (7.633)    (20.520)    (12.681)
Const              1.866      -2.603      -0.851
(t-stat)        (41.777)   (-84.919)   (-23.134)
Obs.              43,039      43,039      43,039
R² (%)             0.496       1.706       1.728

References

Akbas, F. and Genc, E. (2020). Do mutual fund investors overweight the probability of extreme payoffs in the return distribution? Journal of Financial and Quantitative Analysis, 55(1):223–261.

Aleti, S., Bollerslev, T., and Siggaard, M. (2025). Intraday market return predictability culled from the factor zoo. Management Science.

Amihud, Y. and Goyenko, R. (2013). Mutual fund’s R² as predictor of performance. Review of Financial Studies, 26(3):667–694.

Ammann, M., Fischer, S., and Weigert, F. (2020). Factor exposure variation and mutual fund performance. Financial Analysts Journal, 76(4):101–118.

Andersen, T., Bollerslev, T., Diebold, F., and Wu, J. (2006). Realized beta: Persistence and predictability. In Fomby, T. and Terrell, D., editors, Advances in Econometrics: Econometric Analysis of Economic and Financial Time Series in Honor of R.F. Engle and C.W.J. Granger, pages 1–40. Emerald Publishing Limited.

Banegas, A., Gillen, B., Timmermann, A., and Wermers, R. (2013). The cross section of conditional mutual fund performance in European stock markets. Journal of Financial Economics, 108(3):699–726.

Barndorff-Nielsen, O. E. and Shephard, N. (2004). Econometric analysis of realized covariation: High frequency based covariance, regression, and correlation in financial economics. Econometrica, 72(3):885–925.

Barras, L., Scaillet, O., and Wermers, R. (2010). False discoveries in mutual fund performance: Measuring luck in estimated alphas. Journal of Finance, 65(1):179–216.

Berk, J. B. and van Binsbergen, J. H. (2015). Measuring skill in the mutual fund industry. Journal of Financial Economics, 118(1):1–20.

Bianchi, D., Büchner, M., and Tamoni, A. (2021). Bond risk premiums with machine learning. Review of Financial Studies, 34(2):1046–1089.

Bollen, N. P. and Busse, J. A. (2001). On the timing ability of mutual fund managers. Journal of Finance, 56(3):1075–1094.

Bonhomme, S. and Manresa, E. (2015). Grouped patterns of heterogeneity in panel data. Econometrica, 83(3):1147–1184.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. (1984). Classification and Regression Trees. CRC Press.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (2017). Classification and regression trees. Routledge.

Brown, S. J. and Goetzmann, W. N. (1995). Performance persistence. Journal of Finance, 50(2):679–698.

Carhart, M. M. (1997). On persistence in mutual fund performance. Journal of Finance, 52(1):57–82.

Daniel, K., Grinblatt, M., Titman, S., and Wermers, R. (1997). Measuring mutual fund performance with characteristic-based benchmarks. Journal of Finance, 52(3):1035–1058.

Daniel, K. and Titman, S. (1997). Evidence on the characteristics of cross sectional variation in stock returns. Journal of Finance, 52(1):1–33.

DeMiguel, V., Martin-Utrera, A., and Uppal, R. (2025). Rethinking mutual fund performance: From traditional alpha to achievable alpha. Available at SSRN 5052445.

Fama, E. F. (1991). Efficient capital markets: II. Journal of Finance, 46(5):1575–1617.

Fama, E. F. and French, K. R. (1992). The cross-section of expected stock returns. Journal of Finance, 47(2):427–465.

Fama, E. F. and French, K. R. (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33(1):3–56.

Fama, E. F. and French, K. R. (2010). Luck versus skill in the cross-section of mutual fund returns. Journal of Finance, 65(5):1915–1947.

Fama, E. F. and French, K. R. (2015). A five-factor asset pricing model. Journal of Financial Economics, 116(1):1–22.

Fama, E. F. and MacBeth, J. D. (1973). Risk, return, and equilibrium: Empirical tests. Journal of Political Economy, 81(3):607–636.

Fan, J., Ke, Z. T., Liao, Y., and Neuhierl, A. (2022). Structural deep learning in conditional asset pricing. Available at SSRN 4117882.

Farrell, M. H., Liang, T., and Misra, S. (2025). Deep learning for individual heterogeneity. arXiv 2010.14694.

Ferson, W. (2019). Empirical Asset Pricing: Models and methods. MIT Press.

Ferson, W. and Wang, J. L. (2021). A panel regression approach to holdings-based fund performance measures. Review of Asset Pricing Studies, 11(4):695–734.

Freyberger, J., Neuhierl, A., and Weber, M. (2020). Dissecting characteristics nonparametrically. Review of Financial Studies, 33(5):2326–2377.

Frisch, R. and Waugh, F. V. (1933). Partial time regressions as compared with individual trends. Econometrica, 1(4):387–401.

Gu, S., Kelly, B., and Xiu, D. (2020). Empirical asset pricing via machine learning. Review of Financial Studies, 33(5):2223–2273.

Harvey, C. R. and Liu, Y. (2022). Luck versus skill in the cross section of mutual fund returns: Reexamining the evidence. Journal of Finance, 77(3):1921–1966.

Hastie, T., Tibshirani, R., and Friedman, J. H. (2009). The Elements of Statistical Learning: Data mining, Inference, and Prediction, volume 2. Springer.

Jensen, M. C. (1968). The performance of mutual funds in the period 1945-1964. Journal of Finance, 23(2):389–416.

Kacperczyk, M., Van Nieuwerburgh, S., and Veldkamp, L. (2016). A rational theory of mutual funds’ attention allocation. Econometrica, 84(2):571–626.

Kaniel, R., Lin, Z., Pelger, M., and Van Nieuwerburgh, S. (2023). Machine-learning the skill of mutual fund managers. Journal of Financial Economics, 150(1):94–138.

Keim, D. B. and Madhavan, A. (1997). Transactions costs and investment style: An inter-exchange analysis of institutional equity trades. Journal of Financial Economics, 46(3):265–292.

Kelly, B. and Xiu, D. (2023). Financial machine learning. Foundations and Trends in Finance, 13(3-4):205–363.

Koijen, R. S. (2014). The cross-section of managerial ability, incentives, and risk preferences. Journal of Finance, 69(3):1051–1098.

Lewellen, J. and Nagel, S. (2006). The conditional CAPM does not explain asset-pricing anomalies. Journal of Financial Economics, 82(2):289–314.

Lovell, M. C. (1963). Seasonal adjustment of economic time series and multiple regression analysis. Journal of the American Statistical Association, 58(304):993–1010.

Markowitz, H. (1952). Portfolio selection. Journal of Finance, 7(1):77–91.

Newey, W. K. and West, K. D. (1987). A simple, positive definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55:703–708.

Novy-Marx, R. and Velikov, M. (2016). A taxonomy of anomalies and their trading costs. Review of Financial Studies, 29(1):104–147.

Pastor, L., Stambaugh, R. F., and Taylor, L. A. (2015). Scale and skill in active management. Journal of Financial Economics, 116(1):23–45.

Patton, A. J. and Weller, B. M. (2020). What you see is not what you get: The costs of trading market anomalies. Journal of Financial Economics, 137:515–549.

Patton, A. J. and Weller, B. M. (2022). Risk price variation: The missing half of empirical asset pricing. Review of Financial Studies, 35(11):5127–5184.

Sharpe, W. F. (1964). Capital asset prices: A theory of market equilibrium under conditions of risk. Journal of Finance, 19(3):425–442.

Thompson, S. B. (2011). Simple formulas for standard errors that cluster by both firm and time. Journal of Financial Economics, 99(1):1–10.

Treynor, J. L. (1966). How to rate management investment funds. Harvard Business Review, 43.

Wermers, R. (2000). Mutual fund performance: An empirical decomposition into stock-picking talent, style, transactions costs, and expenses. Journal of Finance, 55(4):1655–1695.

Cite this document

APA

Dong Hwan Oh and Andrew J. Patton (2026). Skill and Efficiency in the U.S. Mutual Fund Industry (FEDS 2026-032). Board of Governors of the Federal Reserve System, Finance and Economics Discussion Series. https://whenthefedspeaks.com/doc/feds_2026-032

BibTeX

@techreport{wtfs_feds_2026_032,
  author = {Dong Hwan Oh and Andrew J. Patton},
  title = {Skill and Efficiency in the U.S. Mutual Fund Industry},
  type = {Finance and Economics Discussion Series},
  number = {2026-032},
  institution = {Board of Governors of the Federal Reserve System},
  year = {2026},
  url = {https://whenthefedspeaks.com/doc/feds_2026-032},
  abstract = {We propose a new measure of mutual fund manager ability: "efficiency" is the ability to accrue the risk premium associated with a risk factor. The familiar abnormal return, or alpha, is shown to be the sum of two distinct measures of ability: "aggregate efficiency" which is the beta-weighted sum of the fund's (in)efficiencies across risk factors, and "skill," the component that is unrelated to factor exposures. Using a panel of U.S. equity mutual fund returns from 1999-2023, we document significant heterogeneity in mutual fund manager skill and efficiency. We employ regression trees and their extensions to capture this heterogeneity. We find that efficiency is more persistent than skill, and we show that future abnormal returns can be better predicted by decomposing lagged abnormal returns into skill and efficiency.},
}