feds · September 22, 2020

Revealing Cluster Structures Based on Mixed Sampling Frequencies

Abstract

This paper proposes a new nonparametric mixed data sampling (MIDAS) model and develops a framework to infer clusters in a panel regression with mixed frequency data. The nonparametric MIDAS estimation method is more flexible and substantially simpler to implement than competing approaches. We show that the proposed clustering algorithm successfully recovers true membership in the cross-section, both in theory and in simulations, without requiring prior knowledge of the number of clusters. This methodology is applied to a mixed-frequency Okun’s law model for state-level data in the U.S. and uncovers four meaningful clusters based on the dynamic features of state-level labor markets. Accessible materials (.zip)

Finance and Economics Discussion Series Divisions of Research & Statistics and Monetary Affairs Federal Reserve Board, Washington, D.C. Revealing Cluster Structures Based on Mixed Sampling Frequencies Yeonwoo Rho, Yun Liu, and Hie Joo Ahn 2020-082 Please cite this paper as: Rho, Yeonwoo, Yun Liu, and Hie Joo Ahn (2020). “Revealing Cluster Structures Based on Mixed Sampling Frequencies,” Finance and Economics Discussion Series 2020-082. Washington: Board of Governors of the Federal Reserve System, https://doi.org/10.17016/FEDS.2020.082. NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. References in publications to the Finance and Economics Discussion Series (other than acknowledgement) should be cleared with the author(s) to protect the tentative character of these papers.

Revealing Cluster Structures Based on Mixed Sampling Frequenciesi Yeonwoo Rho1, Yun Liu2, and Hie Joo Ahn3 ii 1Michigan Technological University 2Quicken Loans 3Federal Reserve Board September 9, 2020 Abstract Thispaperproposesanewnonparametricmixeddatasampling(MIDAS)modelanddevelopsaframeworktoinferclustersinapanelregressionwithmixedfrequencydata. ThenonparametricMIDASestimationmethodismoreflexibleandsubstantiallysimplertoimplementthan 5 competing approaches. We show that the proposed clustering algorithm successfully recovers true membership in the cross-section, both in theory and in simulations, without requiring prior knowledge of the number of clusters. This methodology is applied to a mixed-frequency Okun’s law model for state-level data in the U.S. and uncovers four meaningful clusters based on the dynamic features of state-level labor markets. 10 key words: Clustering; forecasting; mixed data sampling regression model; panel data; penalized regression. iOpinionsexpressedhereinarethoseoftheauthorsaloneanddonotnecessarilyreflecttheviewsoftheFederal Reserve System. We thank Gianni Amisano, Eric Ghysels and the participants of Midwest Econometrics Group 2018 Meeting, and the joint meeting of 11th International Conference of the ERCIM WG on Computational and Methodological Statistics and 12th International Conference on Computational and Financial Econometrics. Rho andLiuarepartiallysupportedbyNSF-CPSgrant#1739422. Aprimitiveversionofthispaperhadbeencirculated underthetitle”PanelNonparametricMIDASModel: AClusteringApproach”bythefirsttwoauthors. iiEmails: Y.Rho(yrho@mtu.edu),Y.Liu(yliu26@mtu.edu),andH.J.Ahn(HieJoo.Ahn@frb.gov) 1

1 Introduction Following technological advances, the diffusion of social media, and the efforts of statistical agencies and private companies, new data sources have recently become available for empirical 15 research in economics. In many cases, these data are characterized by large time series and cross sectionaldimensions,withdetailedinformationoneconomicagentsoftenatalevelofdisaggregation more granular than that of traditional data sources. To cope with the changing data environment, new methods have been developed in the econometrics literature. For instance, to utilize the high-frequency data for forecasting, regression models concatenating low- and high-frequency vari- 20 ables—socalledmixeddatasampling(MIDAS)models(e.g.,Ghyselsetal.[2007])—arenowwidely used in practical applications. In addition, due to the increasing availability of richer cross-section dataithasbecomeparticularlyimportanttoefficientlysummarizeandidentifythemostimportant features of subjects in the cross-section, with the aim of properly account for unit heterogeneity. In this respect, clustering algorithms developed in the machine learning literature are increasingly 25 being used in econometric applications. Currently,thereisnostatisticalmethodologythatallowsaresearchertoidentifydistinctgroups in a panel data and mixed-frequency regression setting. This paper aims at filling this gap in the literaturebyproposinganewnonparametricmethod. Thekeyinnovationofourproposedapproach is the following. First, we propose a new non-parametric MIDAS model based on the Fourier 30 flexible form and polynomials. In parametric MIDAS models, arbitrary parametric functions (e.g., exponentialAlmonlagfunction,betafunction)areusedtomodelthecoefficientsonhigh-frequency variables. As these parametric functions are highly nonlinear in general, complicated numerical optimization is required for estimation. As this estimation is numerically costly and challenging, practitionersoftengiveMIDASmodelsawideberth. Ourmodelrequiresjustordinaryleastsquares 35 (OLS), which eschews estimation difficulties. In addition, it does not require any arbitrary choice ofparametricfunctionalformsinspecifyingthedistributedlagsstructureofthecoefficientsonhigh frequencyvariables. Tomodelthecoefficientsonhigh-frequencyvariables,theFourierflexibleform and polynomials are used, which allows the trajectory of coefficients on high-frequency variables to be flexibly determined by the data.iii 40 iiiArecentstudybyBabiietal.[2019]developsageneralframeworkwithdictionariesformachine-learningtime- 2

Breitung and Roling [2015] have recently proposed a nonparametric MIDAS model that, like ours, does not require to specify any functional form for the coefficients on the high-frequency. Unlikeourapproach,however,BreitungandRoling[2015]’smethodologyrequiresthecarefulsetting of two tuning parameters, which is much more computationally demanding. AnothercontributionofthispaperistoextendthenewMIDASmodeltothepaneldataframe- 45 work, andtodeviseanovelclusteringmethodforthispanelsetting. Thisclusteringalgorithmuses the idea of penalized regression with penalty function as in Ma and Huang [2017], and is adapted to general panel data settings including mixed-frequency panel data. An important feature of the proposedmethodisthattheidentificationofclustersisentirelydata-driven, whilepreviousstudies (e.g. Andreou et al. [2010]) arbitrarily divide observations into different groups. However, this ap- 50 proach crucially depends on prior knowledge and homogeneity within each group is not necessarily guaranteed.iv Simulations conducted in this paper show many desirable features of our method. First, the proposednon-parametricMIDASmodelprovidesthebestone-step-aheadforecastamongparametric and non-parametric MIDAS models widely used in empirical studies. Second, our clustering 55 methodyieldsmorepreciseparameterestimationandbetterforecastingpropertiesthancompeting approaches. Third, the proposed clustering algorithm is faster than competitors (e.g. K-means clustering) when the number of underlying clusters is unknown. Asarelevantempiricalapplication,weuseourmethodtoexploreheterogeneityinlabormarket dynamics across states in the U.S. using a mixed-frequency panel Okun’s law model. Okun’s law is 60 an empirical relationship that relates changes in unemployment rate to GDP growth. Usually an Okun’slawmodelisspecifiedatquarterlyfrequency, asGDPgrowthisavailableonlyquarterly. In seriesregressionmodels,andrecommendstouseLegendrepolynomialsforthecoefficientsonhigh-frequencyvariables inaMIDASmodel. OuruseoftrigometricfunctionsfortheMIDAScoefficientsisaspecialcaseofBabiietal.[2019]’s framework,andinheritsadvantagessimilartothoseofLegendrepolynomialsintheestimation. ivIt should be noted that the application of proposed clustering algorithm is not limited to datasets with mixed samplingfrequencies. Itcanbeappliedtoageneralpaneldatasettingwithafixedsamplingfrequency,inwhichcase theproposedmethodcanbeanappealingalternativetoSuetal.[2016]’smethod. Therearethreeaspectswhyour approachmayhaveadvantagesoverSuetal.[2016]’smethod. First,ourpenaltyfunctionismoregeneralthantheirs. Second, Suetal.[2016]’smethodrequirestopre-specifyapossiblerangesforthenumberofclustersinadditionto the turning parameter. If there is no prior knowledge of the number of clusters and if the size of cross-section is large, then finding right clusters becomes very computationally challenging as the various possibilities need be considered. Ourmethodrequiresafewuser-chosenparameters,buttherangeofchoicesiswelldefinedanddoesnot yieldintractablepossibilities. Lastly,Suetal.[2016]’smethodrequiresthenumberofclusterstobefixedregardless ofthesizeofcross-section. Thetheorybehindourclusteringalgorithmdoesflexiblyallowthenumberofgroupsto adjusttoachangeinthesizeofcross-section. 3

our application, we include weekly initial claims of unemployment insurance (UI) benefits, which is known as the most timely indicator of job losses, as the high-frequency variable in a mixed frequency Okun’s law model. By doing so, the model can better characterize the sudden rise in 65 unemployment rate at the onset of a recession, picking up sudden bursts in layoffs. An additional desirable feature of the mixed-frequency Okun’s law model is that it can be used to nowcast the unemployment rate at the state level on a weekly basis. The algorithm identifies four clusters of states based on the responsiveness of unemployment rate to GDP growth and on the pattern of coefficients on weekly initial claims within the quarter. 70 The coefficients on GDP growth and initial claims most likely reflect structural aspects of statelevel labor markets (e.g. the industry composition) and local labor-market practices in hiring and firing. Hence, the clusters identified by the model most likely capture relevant heterogeneity in the functioning of labor markets in different states. We relate the identified clusters to observable state–level attributes such as the small-firms 75 employment share, industry composition, the relevance of oil production, and the share of longtermunemploymentoutoftotalunemployment. Eachclusterexhibitsmultidimensionalattributes, suggesting that the differences in labor-market dynamics across states cannot be determined or accurately summarized by one or two observable factors. Another way to say this is that the clustering algorithm is able to capture a state’s unobserved attributes which are not fully reflected 80 in the data but are nevertheless crucial for unemployment dynamics. In this regard, our proposed methodology can reveal similarities and differences across states in the functioning of their labor markets purely based on the data, and can provide a new understanding of regional heterogeneity in labor market dynamics. The rest of this paper is organized as follows. Section 2 introduces the proposed nonparametric 85 MIDAS approach using the Fourier flexible form and polynomials. Subsection 2.1 discusses the nonparametricMIDASestimationinanon-panelsetting. Subsection2.2demonstratestheproposed method’s estimation and forecasting accuracy in finite samples. Section 3 presents the clustering algorithmusingtheFourier-transformedhigh-frequencyvariableandotherpossiblecovariates. The proposed clustering approach delivers accurate estimates, as proved in theory and shown in the 90 finite sample simulations. Section 4.1 provides an empirical application of the method. Technical 4

proofs are relegated to Section B in the appendix. The following notation will be used throughout the paper. The p-norm of a vector x=(x ,..., 1 x )(cid:48) is||x|| =( (cid:80)m xp)1/p. Foranm×nmatrixAwithits(i,j)thelementbeinga , ||A|| indim p i=1 i ij p catesthep-norminducedbythecorrespondingvectornorm. Thatis, ||A|| =sup ||Ax|| /||x|| . 95 p x(cid:54)=0 p p Inparticular, ||A|| =max (cid:80)m |a |and||A|| =max (cid:80)n |a |. Forasymmetric 1 j=1,...,n i=1 ij ∞ i=1,...,m j=1 ij and positive definite matrix A, let λ (A) and λ (A) indicate the smallest and largest eigenvalmin max ues of A, respectively. It is worth noting that ||A|| = λ (A). I is a p×p identity matrix and 2 max p ⊗ denotes the Kronecker product. For any real number x, (cid:98)x(cid:99) denotes the largest integer that is smaller than or equal to x. The symbol 1{·} denotes the indicator function. 100 2 Nonparametric MIDAS Inthissection,weintroduceournonparametricMIDASapproachusingtheFourierflexibleform [Gallant, 1981]. We first introduce the framework, then confirm that the proposed nonparmetric MIDAS model is a good approximation of popular parametric MIDAS models in finite samples. 2.1 Nonparametric MIDAS with the Fourier flexible form and polyno- 105 mials Consider the following MIDAS model with the forecast lead h≥0: q m−1 (cid:88) (cid:88) y = α z + β∗ x +ε =z(cid:48)α+x (cid:48)β∗+ε , (1) t+h i t,i j/m t,j t+h t t t+h i=1 j=0 fort=1,...,T. Here,z istheq-vectoroflow-frequencycovariatesattimet,andα=(α ,...,α )(cid:48) t 1 q is the corresponding coefficient vector. The vector x = (x ,...,x )(cid:48) is the high-frequency t t,0 t,m−1 variable at t and β∗ = (β∗ ,...,β∗ )(cid:48) are the coefficients that aggregate x to the low- 0/m m−1/m t frequency. In a parametric MIDAS model, the coefficients β∗ can be written as a multiple of j/m ω (θ),whereweightsω (θ)areassumedtobegeneratedby,forexample,anexponentialAlmonlag j j function α∗exp(θ j+θ j2+···+θ jQ) β∗ =α∗ω (θ)= 1 2 Q , j/m j (cid:80)m−1exp(θ i+θ i2+···+θ jQ) i=0 1 2 Q 5

and α∗ and θ =(θ ,θ ,...,θ ) are parameters that need to be estimated from data. However, the 1 2 Q form of ω (·) is somewhat limited, and it requires nonlinear estimation. In this paper, we propose j to model β∗ using the Fourier flexible form and polynomials. The MIDAS coefficients β∗ are 110 j/m j/m assumed to be generated by L K (cid:88) (cid:88) β∗ = β (j/m)l+ {β sin(2πk·j/m)+β cos(2πk·j/m)}, (2) j/m l 1,k 2,k l=0 k=1 for some positive integers L and K. The Fourier flexible form has been frequently used in macroeconomics and finance since Gallant [1981]. It has been demonstrated that the Fourier flexible form iscapableofapproximatingmostformsofnonlineartimetrendstoanydegreeofaccuracyifasufficient number of parameters is used, and that a small K is often enough to reasonably approximate 115 smooth functions with finite numbers of breaks [Becker et al., 2004, 2006, Enders and Lee, 2012, Rodrigues and Robert Taylor, 2012, Gu¨ri¸s, 2017, Perron et al., 2017]. In addition to the Fourier flexible form, we also consider a few polynomial trends to cover wider range of nonlinear functions, following suggestions in Perron et al. [2017]. The MIDAS model (1) with the Fourier flexible form (2) can be expressed as y=Zα+Xβ∗+ε=Zα+X(cid:101)β+ε=Wγ+ε, where y =(y ,...,y )(cid:48), ε=(ε ,...,ε )(cid:48), Z=[z ,···,z ](cid:48), X=[x ,···,x ](cid:48), W =(Z, 120 1+h T+h 1+h T+h 1 T 1 T X(cid:101)), X(cid:101) =XM(cid:48) =[x (cid:101)1 ,···,x (cid:101)T ](cid:48), and   (0/m)0 (1/m)0 ··· ((m−1)/m)0    . . . . . .   . . .       (0/m)L (1/m)L ··· ((m−1)/m)L        sin(2π·1·0/m) sin(2π·1·1/m) ··· sin(2π·1·(m−1)/m) M = . (3)   cos(2π·1·0/m) cos(2π·1·1/m) ··· cos(2π·1·(m−1)/m)    . . .   . . .   . . .      sin(2π·K·0/m) sin(2π·K·1/m) ··· sin(2π·K·(m−1)/m)     cos(2π·K·0/m) cos(2π·K·1/m) ··· cos(2π·K·(m−1)/m) 6

Here,thematrixM canbeunderstoodasaFouriertransformoperator. ThisFouriertransformation summarizes the information in an m-dimensional vector x into a (2K+L+1)-dimensional vector t x =Mx =(x ,x ,···,x ,x(s),x(c),···,x(s) ,x(c) )(cid:48), where x , x(s) and x(c) are transformed (cid:101)t t (cid:101)t,0 (cid:101)t,1 (cid:101)t,L (cid:101)t,1 (cid:101)t,1 (cid:101)t,K (cid:101)t,K (cid:101)t,l (cid:101)t,k (cid:101)t,k high-frequency data for l = 0,...,L and k = 1,...,K, and are defined as x = (j/m)lx , 125 (cid:101)t,l t,j x(s) =sin(2πk·j/m)x , and x(c) =cos(2πk·j/m)x .v. (cid:101)t,k t,j (cid:101)t,k t,j Unlike parametric MIDAS models, this model is linear. Noting that β∗ = M(cid:48)β, the ordinary least squares (OLS) estimator of β∗ can be written as (cid:18) 1 (cid:19)−1(cid:18) 1 (cid:19) β(cid:99)∗ =M(cid:48)Dγ =M(cid:48)D(W(cid:48)W)−1W(cid:48)y=β∗+M(cid:48)D W(cid:48)W W(cid:48)ε , (4) (cid:98) T T (cid:2) (cid:3) where D = 0 ,I , and I is an identity matrix. Under some regularity (L+1+2K)×q L+1+2K L+1+2K conditions, β can be estimated consistently by the OLS estimator β(cid:99)∗. 130 2.2 Simulation: Nonparametric MIDAS This section presents simulation results to check the estimation and forecasting accuracy of the proposed nonparametric MIDAS estimation using the Fourier flexible form in finite samples. Our method is compared with the nonparametric MIDAS approach proposed by Breitung and Roling [2015], where a smoothness condition was imposed on β∗ , without involving a weight function 135 j/m ω (·), similarly to our method. However, Breitung and Roling [2015] requires choosing a tuning j parameter. See Section A.1 in the supplementary material for more details about Breitung and Roling [2015]. The data is generated in a similar setting as Breitung and Roling [2015]. For j =0,...,m−1, t=1,...,T, 140 m−1 (cid:88) y =α + β∗x +ε , x =c+dx +u , (5) t+h 0 j t,j t+h t,j t,j−1 t,j j=0 where ε i ∼ id N(0,0.125), u i ∼ id N(0,1), α = 0.5, β∗ = α ω (θ), α ∈ {0.2,0.3,0.4}, T ∈ t+h t,j 0 j 1 j 1 {100,200,400}, and the frequency ratio m ∈ {20,40,60,150,365}. For the AR(1) high-frequency vNote that x (cid:101)t can effectively summarize the information in xt, because relatively small K and L are enough to capture main characteristics of a nonparametric trend function. For instance, Enders and Lee [2012] reported that evenasinglefrequencyK=1allowsformultiplesmoothbreaks. 7

regressor, c=0.5 and d=0.9 are considered. Four MIDAS weight functions ω (θ) are considered: j exp{θ j+θ j2} • Exponential Decline: ω (θ ,θ )= 1 2 , θ =7×10−4, θ =−6×10−3; j 1 2 (cid:80)m−1exp{θ i+θ i2} 1 2 i=0 1 2 exp{θ j−θ j2} • Hump-Shaped: ω (θ ,θ )= 1 2 , θ =0.08, θ =θ /10,θ /20,θ /30; 145 j 1 2 (cid:80)m−1exp{θ i−θ i2} 1 2 1 1 1 i=0 1 2 θ +θ (j−1) • Linear Decline: ω (θ ,θ )= 1 2 , θ =1, θ =0.05; j 1 2 θ (m)+θ (m)(m+1)/2 1 2 1 2 (cid:26) (cid:18) (cid:19)(cid:27) θ j • Cyclical: ω (θ ,θ )= 1 sin θ +2π , θ =0.01, θ =5,5/2,5/3; j 1 2 m 2 m−1 2 1 • Discrete: ω =(0,0,···,0,5/m,···,5/m)wherethevalue5/misassignedtothelastonefifth j elements and 0 to the rest. For our method, K =3 and L=2 are used, and the tuning parameter in Breitung and Roling 150 [2015] is chosen to minimize AIC. For the evaluation of the estimation accuracy, the root mean square errors (RMSE) of estimators of β∗ = (β∗,...,β∗ )(cid:48) are considered. Our estimator β(cid:98) 0 m−1 is brought back to the original scale by taking M(cid:48)β(cid:98). The RMSE of our method is calculated as RMSE =(cid:107)M(cid:48)β(cid:98)−β∗(cid:107) 2 .ThenumberofMonte-Carlo(MC)replicationsis1000. Forthecomparison offorecastingaccuracy,therootmeansquareforecasterror(RMSFE)oftheone-step-aheadforecast 155 is considered. The number of MC replication is 250. The RMSFE is calculated as following: 1. Obtain the estimated parameter β(cid:99)∗ in the regression model y =x (cid:48)β∗+ε for t=1, T/2 t+h t t+h ···,T/2. 2. Calculate the one-step-ahead forecast using β(cid:99)∗ , that is, y =x (cid:48)β(cid:99)∗ . T/2 (cid:98)T/2+h+1 T/2+1 T/2 3. Repeatsteps1-2andobtainy =x β(cid:99)∗ fork =2,...,T/2. Here,β(cid:99)∗ 160 (cid:98)T/2+h+k T/2+k T/2+k−1 T/2+k−1 is calculated using (y ,x(cid:48)) for all t=k,...,T/2+k−1. t+h t 4. Oncetheestimatedresponsesy fort=T/2+1,...,T arecalculated,calculatetheRMSFE (cid:98)t+h (cid:113) of the predicted response: RMSFE = (2/T) (cid:80)T/2(y −y )2. k=1 (cid:98)T/2+h+k T/2+h+k Table 1 presents the medians of RMSEs of β estimation using Breitung and Roling [2015] (B&R) and our method (Fourier). For both methods, the estimation accuracy generally increases 165 as the frequency ratio or the sample size become larger. Also for all five shapes of MIDAS weights, 8

our approach substantially improves estimation accuracy compared with B&R’s method. This improvement is more substantial when the sample size T or the frequency ratio m is relatively large. This finding implies that our approach tends to capture the flexibility of various shapes of MIDAS weights more precisely than B&R’s approach. Another notable feature is that α does not 170 1 have much effect on the accuracy of the estimation for both methods. It seems that for these two nonparametric methods, the MIDAS shape matters, but not the magnitude of the signal. Table 2 presents the median one-step ahead RMSFEs. For both methods, the forecasts become more accurate as the sample size T, or the frequency ratio m increases for all five MIDAS shapes. In general, the Fourier flexible form tends to provide slightly more precise forecasts compared with 175 the B&R’s method. These results show that the proposed the Fourier flexible form approach tends to deliver more accurate estimation and forecasting compared with a competing nonparametric method. 9

seireSemiTelgniSrofsdohtemSADIMciretemarapnoNfoycaruccAnoitamitsEretemaraP :1 elbaT 563 051 06 04 02=m 4.0 3.0 2.0 4.0 3.0 2.0 4.0 3.0 2.0 4.0 3.0 2.0 4.0 3.0 2.0=1α dohteM T 2344.0 5544.0 5344.0 0284.0 1184.0 8584.0 0075.0 8165.0 4075.0 4356.0 1656.0 8656.0 8619.0 5919.0 6609.0 R&B 9711.0 8980.0 8160.0 8811.0 1390.0 7170.0 2231.0 6031.0 1431.0 9922.0 0432.0 0432.0 1585.0 9285.0 5965.0 reiruoF 001 6873.0 5973.0 4183.0 9893.0 2693.0 2693.0 1494.0 4194.0 0494.0 1195.0 6085.0 1185.0 1088.0 0978.0 0368.0 R&B pxE 6511.0 2780.0 4950.0 7211.0 3780.0 7260.0 7190.0 0290.0 8190.0 6951.0 0651.0 0651.0 8604.0 7514.0 2614.0 reiruoF 002 enilceD 6213.0 3413.0 0313.0 8143.0 7043.0 0143.0 3444.0 1244.0 1444.0 8735.0 5345.0 5345.0 9848.0 1448.0 3838.0 R&B 6411.0 2680.0 3850.0 4011.0 0480.0 3850.0 9460.0 1460.0 9460.0 3901.0 6801.0 6801.0 1582.0 8182.0 0582.0 reiruoF 004 7344.0 1144.0 5644.0 0284.0 8284.0 8684.0 0265.0 6965.0 2965.0 7356.0 4556.0 3656.0 2719.0 2719.0 2509.0 R&B 0420.0 2320.0 7220.0 9540.0 8640.0 5640.0 7031.0 1431.0 1431.0 8922.0 9332.0 9332.0 1585.0 9285.0 5965.0 reiruoF 001 7773.0 5973.0 9773.0 4104.0 3993.0 2204.0 5194.0 5394.0 5394.0 3195.0 4085.0 7185.0 6778.0 6978.0 9368.0 R&B pmuH 6710.0 4610.0 3610.0 5230.0 1230.0 2230.0 1290.0 0290.0 0290.0 7951.0 0651.0 0651.0 9604.0 7514.0 2614.0 reiruoF 002 depahS 3313.0 2513.0 3313.0 2933.0 2143.0 7043.0 0244.0 0444.0 0444.0 8735.0 2445.0 5345.0 0948.0 9348.0 0938.0 R&B 6310.0 4210.0 6110.0 1220.0 6220.0 2220.0 0460.0 0560.0 1560.0 2901.0 2211.0 5801.0 1582.0 8182.0 0582.0 reiruoF 004 7344.0 1144.0 5644.0 2284.0 8284.0 8684.0 1665.0 3165.0 6865.0 8446.0 1156.0 6046.0 7419.0 1919.0 4609.0 R&B 3220.0 5220.0 2220.0 9540.0 8640.0 5640.0 0531.0 7031.0 1431.0 4612.0 1022.0 4322.0 1585.0 9285.0 4965.0 reiruoF 001 7773.0 5973.0 9773.0 4104.0 3993.0 3204.0 3594.0 7194.0 5394.0 4585.0 9285.0 6385.0 7878.0 6878.0 5368.0 R&B raeniL 5510.0 3510.0 8510.0 5230.0 1230.0 1230.0 2290.0 0290.0 0290.0 8941.0 7351.0 1551.0 8604.0 7514.0 2614.0 reiruoF 002 enilceD 3313.0 2513.0 3313.0 3933.0 2143.0 6043.0 6144.0 6144.0 3344.0 9635.0 4135.0 4925.0 3848.0 1448.0 9738.0 R&B 0110.0 9010.0 7801.0 1220.0 6220.0 2220.0 9460.0 0460.0 1560.0 0601.0 2501.0 6401.0 1582.0 8182.0 0582.0 reiruoF 004 7344.0 1144.0 5644.0 0284.0 8284.0 0784.0 2665.0 1165.0 8965.0 9656.0 8356.0 8756.0 7529.0 6529.0 4419.0 R&B 3220.0 5220.0 2220.0 9540.0 8640.0 5640.0 0531.0 7031.0 1431.0 4032.0 7922.0 0432.0 4965.0 5285.0 9865.0 reiruoF 001 7773.0 5973.0 9773.0 4104.0 2993.0 2204.0 4694.0 5194.0 5394.0 3985.0 7985.0 6975.0 7088.0 4778.0 7768.0 R&B lacilcyC 5510.0 3510.0 8510.0 5230.0 1230.0 1230.0 2290.0 0290.0 6919.0 9951.0 7951.0 0651.0 1604.0 9514.0 3614.0 reiruoF 002 3313.0 2513.0 3313.0 1933.0 1143.0 7043.0 5044.0 1244.0 5344.0 0435.0 0935.0 0845.0 2748.0 6548.0 6248.0 R&B 0110.0 9010.0 9010.0 1220.0 6220.0 2220.0 9460.0 0460.0 7056.0 1011.0 2901.0 5801.0 0582.0 8182.0 8482.0 reiruoF 004 0444.0 1144.0 5644.0 9484.0 9384.0 6784.0 2606.0 8685.0 7975.0 5557.0 3117.0 3386.0 0454.1 5162.1 8380.1 R&B 3860.0 3350.0 3930.0 4361.0 2621.0 8090.0 5904.0 7713.0 6532.0 6916.0 0784.0 4073.0 4622.1 5699.0 4587.0 reiruoF 001 1873.0 6973.0 1873.0 8404.0 8004.0 3304.0 5725.0 5015.0 9105.0 5876.0 5736.0 4406.0 3923.1 7151.1 4600.1 R&B etercsiD 7560.0 3050.0 6530.0 9851.0 9021.0 1480.0 6593.0 8203.0 2312.0 8295.0 2654.0 0623.0 6241.1 7209.0 9776.0 reiruoF 002 4313.0 3513.0 5313.0 4143.0 9243.0 8043.0 3174.0 4854.0 5054.0 5116.0 7385.0 6365.0 4002.1 8560.1 4449.0 R&B 5460.0 0940.0 6330.0 9651.0 5811.0 7080.0 3093.0 8592.0 0302.0 0185.0 0244.0 5503.0 3601.1 2948.0 3506.0 reiruoF 004 .001ybdeilpitlumrehtrufsihcihw,selpmasCM0001fosESMRfonaidemehtstroperllechcaE 10

seireSemiTelgniSrofsdohtemSADIMciretemarapnoNfoycaruccAgnitsaceroFdaehA-petS-enO :2 elbaT 563 051 06 04 02=m 4.0 3.0 2.0 4.0 3.0 2.0 4.0 3.0 2.0 4.0 3.0 2.0 4.0 3.0 2.0=1α dohteM T 0223.0 6813.0 7813.0 1832.0 0232.0 4242.0 1202.0 4891.0 7691.0 0002.0 1002.0 0991.0 7012.0 6702.0 6012.0 R&B 6254.0 4053.0 8952.0 0881.0 0361.0 5151.0 9041.0 6631.0 0831.0 8731.0 0831.0 5731.0 7631.0 6631.0 8531.0 reiruoF 001 3062.0 1362.0 2062.0 5981.0 0981.0 7981.0 0381.0 2481.0 8481.0 1881.0 7881.0 8881.0 2602.0 6502.0 4402.0 R&B pxE 9824.0 6433.0 7542.0 1241.0 0731.0 3431.0 0131.0 9031.0 6131.0 6921.0 5921.0 7131.0 3031.0 9131.0 0131.0 reiruoF 002 enilceD 4291.0 3291.0 6291.0 1861.0 4861.0 0861.0 3471.0 7571.0 0671.0 7581.0 7581.0 7381.0 2302.0 8202.0 5102.0 R&B 0224.0 7523.0 6732.0 0041.0 9431.0 6131.0 3821.0 0821.0 6721.0 2821.0 9821.0 0821.0 7721.0 3821.0 0821.0 reiruoF 004 8223.0 9813.0 1813.0 1832.0 0232.0 3242.0 1202.0 4891.0 8691.0 1002.0 1002.0 0991.0 7012.0 7702.0 5012.0 R&B 6141.0 5831.0 9731.0 1731.0 0631.0 0831.0 7931.0 3631.0 6731.0 4731.0 9731.0 4731.0 7631.0 5631.0 8531.0 reiruoF 001 1062.0 1362.0 1062.0 5981.0 0981.0 7981.0 0381.0 1481.0 0581.0 0881.0 4881.0 7881.0 3602.0 7502.0 5402.0 R&B pmuH 6231.0 4231.0 6921.0 8921.0 2131.0 1131.0 5031.0 9031.0 4131.0 6921.0 4921.0 7131.0 2031.0 0231.0 0131.0 reiruoF 002 depahS 5291.0 3291.0 6291.0 1861.0 4861.0 0861.0 3471.0 6571.0 7571.0 6581.0 9581.0 7381.0 1302.0 8202.0 7102.0 R&B 4921.0 5821.0 5721.0 8721.0 9721.0 1821.0 3721.0 4721.0 3721.0 8721.0 7821.0 9721.0 7721.0 3821.0 9721.0 reiruoF 004 8223.0 9813.0 1813.0 1832.0 0232.0 3242.0 1202.0 6891.0 4791.0 2002.0 1002.0 1991.0 7012.0 6702.0 6012.0 R&B 5041.0 7731.0 8731.0 6731.0 2631.0 0831.0 2041.0 0631.0 7731.0 7731.0 8731.0 5731.0 8631.0 6631.0 9531.0 reiruoF 001 3812.0 1362.0 1062.0 5981.0 0981.0 7981.0 0381.0 4481.0 1581.0 1881.0 3881.0 7881.0 2602.0 6502.0 5402.0 R&B raeniL 4131.0 7031.0 2921.0 0031.0 3131.0 4131.0 6031.0 9031.0 3131.0 6921.0 6921.0 7131.0 4031.0 9131.0 1131.0 reiruoF 002 enilceD 5291.0 3291.0 6291.0 1861.0 4861.0 0861.0 3471.0 7571.0 7571.0 6581.0 7581.0 8381.0 3702.0 8202.0 5102.0 R&B 3821.0 5721.0 1721.0 1821.0 0821.0 1821.0 7721.0 6721.0 4721.0 0821.0 8821.0 0821.0 7721.0 4821.0 1821.0 reiruoF 004 8223.0 9813.0 1813.0 1832.0 0232.0 3242.0 1202.0 3891.0 1791.0 0002.0 9991.0 0991.0 9112.0 8702.0 2012.0 R&B 5931.0 4731.0 0831.0 3731.0 2631.0 0831.0 8931.0 7631.0 5731.0 1731.0 1831.0 6731.0 8631.0 7631.0 3531.0 reiruoF 001 1062.0 1362.0 1062.0 0981.0 0981.0 7981.0 2381.0 2481.0 9481.0 0881.0 0981.0 0981.0 3602.0 1602.0 5302.0 R&B lacilcyC 8031.0 7031.0 3921.0 9031.0 9031.0 0131.0 5031.0 0131.0 4131.0 7921.0 5921.0 5131.0 1031.0 0231.0 0131.0 reiruoF 002 5291.0 3291.0 6291.0 4861.0 4861.0 0861.0 5471.0 6571.0 8571.0 6581.0 9581.0 9381.0 9202.0 6202.0 7102.0 R&B 6721.0 6721.0 0721.0 8721.0 8721.0 1821.0 4721.0 3721.0 3721.0 8721.0 6821.0 8721.0 8721.0 2821.0 0821.0 reiruoF 004 6323.0 8813.0 7713.0 2832.0 5232.0 0242.0 9202.0 9891.0 7691.0 5002.0 1102.0 8991.0 0412.0 6112.0 3112.0 R&B 6261.0 2151.0 3341.0 6651.0 9741.0 2441.0 7051.0 7341.0 9041.0 7441.0 2341.0 0041.0 7241.0 6041.0 5631.0 reiruoF 001 4062.0 3362.0 3062.0 4981.0 8881.0 9981.0 6381.0 5481.0 3581.0 0981.0 8981.0 1981.0 0902.0 3802.0 9302.0 R&B etercsiD 2351.0 2541.0 0631.0 8941.0 9041.0 1631.0 1141.0 7631.0 6331.0 8631.0 8331.0 5331.0 4531.0 9431.0 0231.0 reiruoF 002 3291.0 3291.0 7291.0 2861.0 2861.0 0861.0 0571.0 7571.0 8571.0 7581.0 1681.0 4481.0 9402.0 9302.0 3202.0 R&B 1151.0 1141.0 0231.0 4641.0 7831.0 9231.0 1731.0 8231.0 0031.0 5531.0 1331.0 6921.0 7131.0 2131.0 0921.0 reiruoF 004 .selpmasCM052fosEFSMRfonaidemehtstroperllechcaE 11

3 Panel Data and Clustering In this section, a clustering procedure of MIDAS coefficients for panel data is proposed. The 180 high-frequency regressors are aggregated using nonparametric MIDAS coefficient functions introduced in Section 2 for each cross-section object. These coefficients are further clustered using a penalized regression approach. The linearity of our MIDAS model confers a great advantage to the proposedclusteringprocedure,astheclusteringalonewouldrequirequiteheavycomputations. We first review relevant literature on clustering. 185 3.1 Literature Review on Clustering Based on Penalized Regression Clustering using penalized regression based on similarity in the coefficients is a recent development. Suetal.[2016],toourbestknowledge,isthefirststudythatdevelopsaclusteringalgorithm for panel data using penalized regression. Su et al. [2016] modified the traditional Lasso penalty in regressionmodelsintoclassifier-Lasso(C-Lasso)thatpenalizesthedifferencebetweentheestimated 190 parameters of each subject and the estimated average parameters of groups. C-Lasso requires a predetermined maximum for the number of groups and a choice of tuning parameter. Ma and Huang [2017] introduced a penalized method for cross-sectional data, with clustering based on intercepts. The penalty functions used in Ma and Huang [2017] are minimax concave penalty (MCP) [Zhang, 2010] and smoothly clipped absolute deviations penalty (SCAD) [Fan and 195 Li, 2001], which not only share the sparsity properties like Lasso but are also asymptotically unbiased. Later on, Ma and Huang [2016] extended their work, increasing the number of parameters usedinclustering. However,neitherMaandHuang[2016]norMaandHuang[2017]canbeapplied toapaneldatasetting. Indeed,theirmethodisbasedonstrongassumptionsthatmakeitnontrivial to extend to panel data. 200 ZhuandQu[2018]istheonlystudy,tothebestofourknowledge,thatextendsMaandHuang’s clustering procedure to a data environment similar to panel data. Zhu and Qu [2018] applied Ma and Huang [2017]’s algorithm to repeated cross-section data with one dependent variable and one covariate. In their model, the dependent variable is assumed to vary smoothly in response to the covariate,andthissmoothfunctionisestimatedusinganonparametricB-spline. Strictlyspeaking, 205 12

Zhu and Qu [2018]’s method is not designed for panel data. However, if a covariate is allowed to vary over time, Zhu and Qu [2018]’s method can be applied to simple panel data. Lv et al. [2019] further extended Zhu and Qu [2018]’s approach allowing for one random effect as an additional covariate. TheclusteringprocedureweproposeisbasedonMaandHuang[2017]andMaandHuang[2016]. 210 However, it should also be noted that this extension is nontrivial. In particular, the assumption (C3) in Ma and Huang [2016] requires all variables on the right-hand side of the equation to be non-random and have length exactly 1. This assumption may be appropriate for a clinical trial setting,forwhichMaandHuang[2017],MaandHuang[2016],ZhuandQu[2018],Lvetal.[2019], and other related papers are developed. However, this assumption is too strong for a more general 215 panel data setting where time-varying regressors are included. The theory we present circumvent this issue. 3.2 Nonparametric MIDAS for Panel Data and Clustering Suppose there are n subjects in the cross-section of panel data. For simplicity, all subjects are assumed to have the same sample size T and frequency ratio m. For the i-th subject, let z i,t be the q-vector of covariates including the intercept at time t (t = 1,...,T), and let α be the i corresponding coefficient. Consider the following MIDAS model with lead h≥0: y =z(cid:48) α +x(cid:48) β∗+ε , t=1,...,T, i=1,...,n, i,t+h i,t i i,t i i,t+h or equivalently, y =Z α +X β∗+ε , i=1,...,n, (6) i i i i i i where y = (y ,...,y )(cid:48), ε = (ε ,...,ε )(cid:48), β∗ = (β∗ ,...,β∗ )(cid:48), X is a T ×m 220 i i,1+h i,T+h i i,1+h i,T+h i i,0 i,m−1 i matrix with t-th row being x(cid:48) =(x ,x ,...,x ), and Z is a T×q matrix with t-th row i,t i,t,0 i,t,1 i,t,m−1 i being z(cid:48) =(z ,...,z ). i,t i,t,1 i,t,q We assume that the MIDAS coefficients β∗ takes the Fourier flexible form as in (2). For each i subject i=1,...,n, X(cid:101)i =X i M(cid:48), where M is from (3). Let W i =(Z i ,X(cid:101)i ) and γ i =(α(cid:48) i ,β(cid:48) i )(cid:48). The 13

equation (6) can be rewritten as 225     α α i i y i =(Z i ,X i )    +ε i =(Z i ,X(cid:101)i )    +ε i =W i γ i +ε i (7) β∗ β i i Concatenating the y in (7) into y, a vector of length nT, we have: i y=Wγ+ε, (8) where y = (y(cid:48),...,y(cid:48))(cid:48), W = diag(W ,...,W ), γ = (γ(cid:48),...,γ(cid:48) )(cid:48), and ε = (ε(cid:48),...,ε(cid:48) )(cid:48). Let 1 n 1 n 1 n 1 n p=q+2K+L+1. In our formulation, γ is a vector of length p and γ is of length npvi. i Remark 1. The arguments in this section should still be valid with different sample sizes and different frequency ratios for different subjects/time periods, at the expense of more complicated notation and slight changes in the results. The major complication arises from the need of using different M for each i and t. That is, x =M x , where i,t (cid:101)i,t i,t i,t   (0/m )0 (1/m )0 ··· {(m −1)/m }0 i,t i,t i,t i,t    . . . . . .   . . .       (0/m )L (1/m )L ··· {(m −1)/m }L   i,t i,t i,t i,t      sin(2π·1·0/m i,t ) sin(2π·1·1/m i,t ) ··· sin{2π·1·(m i,t −1)/m i,t } M =  i,t   cos(2π·1·0/m ) cos(2π·1·1/m ) ··· cos{2π·1·(m −1)/m }  i,t i,t i,t i,t   . . .   . . .   . . .      sin(2π·K·0/m ) sin(2π·K·1/m ) ··· sin{2π·K·(m −1)/m }  i,t i,t i,t i,t    cos(2π·K·0/m ) cos(2π·K·1/m ) ··· cos{2π·K·(m −1)/m } i,t i,t i,t i,t shouldbeused, andy isavectoroflength (cid:80)n T ratherthannT. Asthismakesthenotationfor i=1 i thesubsequentproofsmorecomplicatedwithoutaddingfundamentaldifferences,thisgeneralization 230 is not pursued in this paper. In contrast, it is necessary to use the same L and K for all subjects viTheframeworkintroducedinthissectionandinSection2.1considersonlyonehigh-frequencyvariable. However, ourframeworkcaneasilyextendtoaccommodatemorethanonehigh-frequencyvariables. Forinstance,iftwohighfrequency variables are considered, the length of γ will be q+2K+L+1+2K(cid:48)+L(cid:48)+1, where 2K(cid:48) and L(cid:48)+1 i arethenumbersoftrigonometricfunctionsandpolynomialsconsideredforthesecondhigh-frequencyvariable. The subsequentclusteringprocedureisalsostraightforward. 14

i=1,...,n, to allow for direct comparison of coefficients γ . i Considertheestimationofparametersin(8)ifthesubjectscanbeseparatedintoasmallnumber of groups. Denote the number of groups by G. The advantage of the proposed procedure is that it does not require any prior knowledge of group information or the number of groups. The only 235 informationrequiredisthefeaturesofcluster. Forexample,ifwearewillingtoassumethatacluster has the same parameters of interest–that is, all elements in γ are the same within a group–the i clusters are identified solely based on parameter estimates.vii An OLS solution of (8) would minimize 1||y−Wγ||2, but this would not reflect the relevant 2 2 groupinformation. Torevealclusters,weproposeapenalizedregressionmethodtoforceallelements 240 in γ to have similar values within a group. Our method is based on the assumption that if two i subjectsiandj belongtothesamegroup,thedifferenceoftheirgroup-specificparameterwouldbe zero, i.e., η =γ −γ =0. In this case, the OLS estimator of η would also be somewhat close ij i j ij toazerovector,thoughitwouldnotbeexactlyzero. However,sinceiandj areinthesamegroup, η should be better estimated to be exactly zero, rather than “somewhat close” to zero. This can 245 ij be forced by imposing a penalty for small values of η . In particular, if the number of groups N ij is much smaller than the number of subjects n, only a small number of η would be nonzero. The ij following penalized objective function is considered: 1 (cid:88) Q(γ)= ||y−Wγ||2+ ρ (γ −γ ,λ ), (9) 2 2 θ i j 1 1≤i<j≤n whereρ (·,λ )isanappropriatepenaltyfunction,andθandλ aretuningparametersthatdiscipline θ 1 1 clustering. Clusteringusingapenalizedregressionasin(9)hasbeenexploredinanumberofpapers [Ma and Huang, 2017, 2016, Zhu and Qu, 2018, Lv et al., 2019]. As illustrated in the previouslymentioned papers, this optimization problem can be solved using the alternating direction method ofmultipliers(ADMM)algorithm,whichcanalsobeimplementedinoursetting. SectionA.2inthe supplementarymaterialintroducestheADMMalgorithminoursetting, provingthattheproposed viiIt is possible to relax this assumption by letting some of γ be individual-specific, rather than assuming all i parameters are strongly tied with groups. If there are subject-specific coefficients, a similar argument would still work, although some ratesand conditionswouldchange. Inparticular, thenumberof coefficientsthat aresubjectspecific should be added following a similar argument to Ma and Huang [2016, 2017]. However, for brevity, this directionwillbenotelaboratedinthispaper. 15

algorithmisconvergent. Thetuningparameterλ canbechosenbyminimizinginformationcriteria 1 such as (cid:16) (cid:17) (cid:18) (cid:107)y−Wγ(cid:107)2(cid:19) log(n)· G(cid:98)p BIC =log (cid:98) 2 + . λ1 n n The rest of this section presents theoretical properties of the estimators that solve the optimization problem in (9). Let G be the true number of groups and G be the set of subject 250 g indices that corresponds to the g-th group, for g = 1,...,G. Assume that each subject belongs to exactly one group; that is, G ,...,G are mutually exclusive and G ∪...∪G = {1,...,n}. 1 G 1 G Denote |G | be the number of elements in G , for g =1,...,G. Define g =min |G | and g g min g=1,...,G g g =max |G |. Letγ0 bethetrueparameterofthei-thsubject,andϕ0 thetruecommon max g=1,...,G g i g vector for the group G . The common value for the γ s of the group G is denoted by ϕ ; that is, 255 g i g g γ =ϕ for all i∈G and for any g =1,···,G. Set γ0 =(γ0(cid:48) ,···,γ0(cid:48) )(cid:48), ϕ0 =(ϕ0(cid:48) ,···,ϕ0(cid:48) )(cid:48), and i g g 1 n 1 G ϕ=(ϕ(cid:48) 1 ,···,ϕ(cid:48) G )(cid:48). Denote the estimated group by G(cid:98)g ={i:γ (cid:98)i =ϕ (cid:98)g ,1≤i≤n}, for g =1,...,G(cid:98), where G(cid:98) is the estimated number of groups. For an estimate γ (cid:98) of γ, the corresponding estimated group parameter for the g-th group is defined as ϕ (cid:98)g = |G(cid:98)g |−1(cid:80) i∈G(cid:98)g γ (cid:98)i . Note that ϕ (cid:98)1 ,···,ϕ (cid:98)G(cid:98) are the distinct values, since the clustering algorithm would lead to η =0 [Ma and Huang, 2017]. 260 (cid:98)ij Let Π be an n×G matrix with (i,g)-th element being 1 if i-th subject belongs to the g-th group, and 0 otherwise. Then γ = (Π⊗I )ϕ = Γϕ, where Γ = (Π⊗I ). An oracle estimator of p p 1 γ0 canbedefinedasγor =Γϕor, whereϕor =argmin ||y−WΓϕ||2 =(Γ(cid:48)W(cid:48)WΓ)−1Γ(cid:48)W(cid:48)y. (cid:98) (cid:98) (cid:98) ϕ∈RGp2 2 The matrix Γ(cid:48)W(cid:48)WΓ is invertible as long as n (cid:29) G. Here, the estimator γor is called an oracle (cid:98) estimator since it utilizes the knowledge of the true group memberships in Π, which is not feasible 265 in practice. Asymptotic properties of this oracle estimator γor will be presented in Theorem 1. (cid:98) Then the asymptotic equivalence of our estimator γ and the oracle estimator will be introduced in (cid:98) Theorem 2. Assumption 1. The number of clusters is much smaller than the number of subjects, i.e., G(cid:28)n. In this paper, the case with G≥2 is considered. The smallest group size g is smaller than n/G. 270 min Assumption2. Assumeλ ( (cid:80) W(cid:48)W )≥c|G |T,λ ( (cid:80) W(cid:48)W )≤c(cid:48)nT,andmax min i∈Gg i i g max i∈Gg i i 1≤i≤n λ (W(cid:48)W ) ≤ c(cid:48)(cid:48)T for some constants c, c(cid:48) and c(cid:48)(cid:48) that do not depend on g = 1,...,G. Further, max i i 16

assume that for any (cid:15)>0, there exist M ,...,M >0 such that 1 4 (cid:18) (cid:19) (cid:18) √ (cid:19) (cid:112) P sup ||Z(cid:48)Z || > qTM <(cid:15), P sup ||X(cid:48)X || > mTM <(cid:15), i i ∞ 1 i i ∞ 2 i=1,...,n i=1,...,n (cid:18) √ (cid:19) (cid:18) (cid:19) (cid:112) P sup ||Z(cid:48)X || > mTM <(cid:15), P sup ||X(cid:48)Z || > qTM <(cid:15). i i ∞ 3 i i ∞ 4 i=1,...,n i=1,...,n Assumption3. Thepenaltyfunctionρ (t)=λ−1ρ (t,λ)issymmetric,nondecreasing,andconcave θ θ in t, on t ∈ [0,∞). There exists a positive constant c such that ρ(t) is constant for all t ≥ c λ. ρ ρ Assume that ρ(t) is differentiable, ρ(cid:48)(t) is continuous except for a finite number of t, ρ(0)=0, and ρ(cid:48)(0+)=1. Assumption 4. There exists a constant c˜>0 such that (cid:40) (cid:32) n T (cid:33)(cid:41) (cid:32) n T (cid:33) (cid:88)(cid:88) (cid:88)(cid:88) E exp ν ε ≤exp c˜ ν2 i,t i,t i,t i=1 t=1 i=1 t=1 for any real numbers ν , for i=1,...,n and t=1,...,T. 275 i,t Assumption 1 assures sparsity, which is often necessary for the validity of the penalized regression such as (9). We also limit our interest to the case with more than one cluster, but similar arguments also works for the homogeneous caseviii. Assumption 2 is reasonable considering the usual assumption that the smallest eigenvalue of W(cid:48)W is bounded by cT where T is the sample i i sizeandcissomeconstant. Thisconditioncanberelaxedallowingdifferentc fordifferentgroups. 280 g In such a case, our results would not hold if the number of clusters G grows to infinity. It would still work as long as G is finite, by choosing c = min c in the statement of Theorem 1. g=1,...,G g Assumption 3 is adapted from Ma and Huang [2017] and is conventional in the literature. Popular penaltyfunctionssuchasMCPandSCADpenaltysatisfythisassumption. Assumption4holdsfor any independent subgaussian vector ε, which is commonly assumed in high-dimensional settings. 285 Remark 2. Assumption 2 is more appropriate for time series data than those in Ma and Huang [2017, 2016]. For instance, assumption (C3) in Ma and Huang [2016] requires, for a given t, (cid:80)n z2 =n,forl=1,...,q and (cid:80)n x˜2 1{i∈G }=|G |forj =1,...,2K+L+1,iftheclusi=1 i,t,l i=1 i,t,j g g tering is solely based on x˜ , but not on z . Here, x are the elements of Fourier transformed i,t,j i,t,l (cid:101)i,t,j viiiTheextensiontoahomogeneouscasecanbedonesimilarlytothatofMaandHuang[2017]. 17

high-frequency variable x =Mx ix. If we were to extend these assumptions to a panel setting, 290 (cid:101)i,t i,t one might modify them to (cid:80)T (cid:80)n z2 =nT, for l=1,...,q and (cid:80)T (cid:80)n x˜2 1{i∈G }= t=1 i=1 i,t,l t=1 i=1 i,t,j g |G |T for j = 1,...,2K +L+1. These assumptions are unnecessarily strong for panel data. The g former assumption, (cid:80)T (cid:80)n z2 = nT, cannot be satisfied for a time series z unless it is t=1 i=1 i,t,l i,t,l properly standardized. Standardizing relevant variables before clustering is often necessary, but only for the variables that are involved in clustering. In this case, clustering is not based on z , 295 i,t,l standardizing this variable would add a redundant step that would not even affect the clustering results. Thelatterassumption, (cid:80)T (cid:80)n x˜2 1{i∈G }=|G |T, isalsotoostrong, asitrequires t=1 i=1 i,t,j g g standardizing x˜ within its true cluster, even before any clustering can be done. To remedy i,t,j the issues in Ma and Huang [2017, 2016], we lifted these strong assumptions and replaced them with Assumption 2 above, which is more appropriate for time series. Lemmas in Section B.1 of the 300 supplementarymaterialaddresstheissuesinproofsduetotheabsenceofthesestrongassumptions. The following theorem provides conditions for the convergence of the oracle estimator γor. (cid:98) Theorem 1. If Assumptions 1–4 hold, then P(||γor−γ0|| ≤φ )≥1−e−ι, (cid:98) ∞ n,T,G,ζ √ 2c˜ (mM˜g )1/2(Gp)3/4 √ √ where φ = B1/2 max (Gp+2 Gp ζ+2ζ)1/2, B =[q1/2+m1/2(L+ n,T,G,ζ c q,m g T3/4 q,m min 1+2K)]1/2, M˜ =max{M ,M ,M ,M }, ι=min{ζ,−log((cid:15))}−log(2), for(cid:15)choseninAssumption 1 2 3 4 2. Furthermore, if g3 /g (cid:29) n5/3T1/3, for any vector c ∈ RGp such that (cid:107)c (cid:107) = 1, the min max n n 2 asymptotic distribution of γˆor is c(cid:48) (γˆor−γ0)→N(0,σ2), n γ where σ2 =Var(γˆor−γ0). γ The proof of Theorem 1 can be found in Section B in the supplementary material. Theorem 1 implies that with an appropriate choice of ζ , the oracle estimator converges to the true 305 n,T,G ixNote that this setting is slightly different from our setting, where both z and x˜ are considered in the i,t,l i,t,l clusteringprocedure. TreatmentsforvariablesthatarenotincludedintheclusteringisdescribedinSectionA.5in supplementarymaterial. 18

parameter in probability. Corollary 1. Under the assumptions of Theorem 1, the oracle estimator γor converges to the true (cid:98) parameter γ0 in probability if one of the following conditions holds: 1. The number n is fixed, and T →∞. 2. The number n → ∞, and G is fixed. The number T is either fixed or T → ∞. Further, the 310 size of the smallest group is large enough such that g min =O(n1/2+α˜4) for a positive constant α˜ <1/2. 4 3. The number n → ∞, and G → ∞. The number T is either fixed or T → ∞. Further, the size of the smallest group is large enough such that g min =O(n5/7+α˜5) for a positive constant α˜ <2/7. 315 5 Corollary1statesthattheoracleestimatorisconsistentifnisfixed,orifthesizeofthesmallest group grows somewhat comparably to the increase of n. More specifically, if n is fixed, increasing information across time is necessary for consistent estimation. On the contrary, when increasing information across panel can be obtained, T can be held fixed, as long as all the groups have reasonable sizes. 320 Theorem 2 demonstrates that the proposed estimator of the parameter γ is equivalent to the oracle estimator with probability approaching to 1, which implies that our estimator converges to the true parameter without prior knowledge of the true group memberships. For our clustering algorithm to work properly, groups should be distinctive enough. Assumption 5 states that the pairwise differences of the true parameters should be large enough for different groups. 325 Assumption 5. The minimal difference of the common values between two panels is b = min (cid:107)γ0−γ0(cid:107) = min(cid:107)ϕ0−ϕ0 (cid:107) >aλ +2pφ , n,T,G i j 2 g g(cid:48) 2 1 n,T,G i∈Gg,j∈G g(cid:48),g(cid:54)=g(cid:48) g(cid:54)=g(cid:48) for some constant a>0. Theorem 2. Assume the conditions of Theorem 1 and Assumption 5 hold. For λ (cid:29) pφ , 1 n,T,G where φ is given in Theorem 1, the local minimizer γ of (9) is almost surely the same as the n,T,G (cid:98) oracle estimator γor, if one of the following conditions hold: (cid:98) 19

1. Suppose n → ∞, and T is fixed. The size of the smallest group is large enough such that 330 √ (p+2 p+2)1/2n1/2 (cid:28)g min =O(n7/9+α˜0) for a positive constant α˜ 0 <2/9; 2. Suppose n,T → ∞, and G is fixed. The size of the smallest group is large enough such that g min =O(n1/2+α˜4) for a positive constant α˜ 4 <1/2; 3. Suppose n,T,G → ∞. The size of the smallest group is large enough such that one of the following conditions is met: 335 (cid:110) √ (cid:111) (a) Forapositiveconstantα˜ 3 <2/9,max T n7 1 / / 1 1 3 3 ,(p+2 p+2)1/2n1/2 (cid:28)g min =O(n7/9+α˜3); or, (b) for a positive constant α˜ 5 <2/7, g min =O(n5/7+α˜5). That is, if one of the above conditions holds, as nT →∞, P(γ =γor)→1. (cid:98) (cid:98) Theorem 2 demonstrates that our estimator with prior knowledge of the group information is, asymptotically, as good as the oracle estimator with probability 1, under the presented set of 340 assumptions.x There are a couple of differences between the two estimators, γ and γor. The first (cid:98) (cid:98) note-worthydifferenceisthattheoracleestimatorconvergestothetrueparameterwithprobability 1, even when n is fixed, whereas the non-oracle estimator does need n→∞. This is expected since the oracle estimator already knows the true group membership, so that increasing the information in time domain only can make the estimator precise enough. On the contrary, the non-oracle 345 estimatorneedsincreasinginformationincross-sectiontoestimatethegroupmembershipscorrectly. The other difference is that the non-oracle needs stronger assumption on the minimum group size. Again, this is natural, since the non-oracle estimator lacks the group information. xThe theoretical results presented in this section handle the case with G ≥ 2. In the homogeneous panel case, similarargumentscanbemade,followingMaandHuang[2016]. Thedetailsarenotpresentedinthispaper,butare availableuponrequest. 20

3.3 Simulation: Clustering The simulation settings in this section are designed to achieve two goals. The first goal is 350 demonstrating the clustering accuracy of the proposed method in finite samples, and providing a guidanceonthechoiceofthetuningparametersinvolvedinourmethod,inparticular,θandλ . The 1 other goal is comparing the performance of the proposed method with other clustering algorithms usingthepenalizedregressionidea. Toourbestknowledge,theonlyothersuchclusteringalgorithm is Su et al. [2016] (SSP, hereafter). This method is employed in our MIDAS context by taking the 355 Fourier transformation and applying SSP’s penalty function as in equation (A.4) in Section A.4 in thesupplementarymaterial,ratherthan(9). Thismethodislabeledas“Fourier-SSP.”Inaddition, our penalty function (9) does not limit how the MIDAS part should be handled. Therefore, B&R’s approachcanbeadaptedinplaceoftheFourierflexibleformforourmethod. Thismethodislabeled as “B&R-clust.” Our method is labeled as “F-clust”. Sections A.3 and A.4 in the supplementary 360 material provide algorithms and relevant details of theses two additional clustering methods. Section3.3.1providesthefinitesampleperformanceofF-clustandB&R-clustfordifferentvalues of θ. Section 3.3.2 provides guidance on the choice of θ and λ for our method. Using the optimal 1 θ suggested in Section 3.3.2, Section 3.3.3 compares the three methods, F-clust, B&R-clust, and Fouier-SSP, in term of parameter estimation accuracy and forecasting accuracy. 365 In all simulation settings, two clusters with the exponential decline and the cyclical function shapes shown in Section 2.2 are considered. In each cluster, 15 independent time series are generated. That is, there are 30 coefficient vectors, and two groups are expected after clustering. Each data process follows (5) shown in Section 2.2; θ ∈ {2,2.5}, λ ∈ {1,1.5,···,4.5}, β = 0, 1 0 T ∈ {100,200,400}, m = {20,40}, and α ∈ {0.2,0.3,0.4} are considered. Notice that α are all 370 1 i null-vectors,thatis,γ =β . TheADMMalgorithmusedfortheoptimizationproblem(9)requires i i one additional parameter, λ . See Algorithm 1 in Section A.2 in the supplementary material for 2 more details. Following the choice of Zhu and Qu [2018], λ =1 is used. The clustering algorithm 2 wasforcedtostopatthe3,000-thiterationifthestoppingconditionscannotbesatisfiedduringthe process. For the Fourier flexible form and polynomials, K =3 and L=2 is used. 375 Remark 3. The B&R-clust method involves an additional tuning parameter (θ in Section A.3 γ∗ 21

in the supplementary material) to mange the B-spline-like smoothing. According to the simulation results in Breitung and Roling [2015], the choice of θ is sensitive to the sample size. In all our γ∗ simulations, comparisons are made with the optimal choice of B&R-clust for each pair of θ and λ . The optimal θ is chosen in [0,100] that minimizes AIC. It should be noted that by involving 380 1 γ∗ an additional parameter, the B&R-clust method is much more time-consuming than our proposed method,F-clust. Thisadditionalparameteralsotendstomakeitdifficulttofindtheoptimalvalues for other tuning parameters by increasing the dimension of the parameter space by one. Remark4. Ingeneral,Fourier-basedmethodsaremuchfasterthanothermethodsbasedonB&R’s nonparametric MIDAS estimation. This is because our the Fourier flexible form and polynomials 385 reducesthenumberofparametersfrommtoq+2K+L+1. SmallvaluesofK andLaregenerally acceptable. This makes the Fourier-based clustering computationally fast, when m is much larger than K and L. 3.3.1 Clustering Performance This section explores the clustering accuracy of the proposed method and B&R-clust over a 390 range of θ and λ . As measures of clustering accuracy, the Rand index [Rand, 1971], the adjusted 1 Randindex(ARI)[HubertandArabie,1985],JaccardIndex[Jaccard,1912],theestimatednumber of groups G(cid:98), and the median of RMSE of γ (cid:98) are presented. In particular, the first three measures (Rand, ARI, and Jaccard) assess the similarity of the estimated clusters and the true clusters, TP +TN Rand−E(Rand) and defined as Rand = , ARI = , and Jaccard = 395 TP +TN +FP +FN max(Rand)−E(Rand) TP .Here,TP,TN,FP,andFNindicatetruepositives,truenegatives,falsepositives, TP +FP +FN andfalsenegatives,respectively. TheestimatednumberofclustersandmedianRMSEofestimated γ are also presented. The RMSE of F-clust is calculated as RMSE = (cid:112) n−1(cid:80)n (cid:107)M(cid:48)γ −γ∗(cid:107)2; (cid:98) i=1 (cid:98)i i 2 200 Monte Carlo samples are generated to evaluate performance. Table 3 reports clustering indexes, the number of clusters, and medians of RMSE of estimated 400 γ, for T =100, m=20, and α =0.4. When θ =2, B&R-clust reveals much better clustering per- 1 formance than our method in general. In particular, B&R-clust presents almost perfect clustering, when λ exceeds 2.5. On the contrary, our method exhibits poor clustering performance; the Rand 1 and Jaccard indexes for our method are about half of that of B&R-clust, and the ARI is almost 22

Table 3: The Influence of Tuning Parameters (θ and λ ) on the Clustering Performance 1 θ λ Method Rand ARI Jaccard Clusters RMSE λ 1 1,BIC F-clust 0.531 0.030 0.026 26.45 0.5246 1 B&R-clust 0.530 0.756 0.027 27.05 0.4270 F-clust 0.545 0.057 0.059 23.62 0.5741 1.5 B&R-clust 0.950 0.899 0.899 3.57 0.5984 F-clust 0.526 0.020 0.021 26.32 0.6197 2 B&R-clust 0.950 0.899 0.899 3.63 0.7630 F-clust 0.483 0.000 0.483 1.00 0.6620 2.5 B&R-clust 0.995 0.989 0.989 2.17 0.8954 F-clust 0.517 0.007 0.517 1.07 0.6937 2 3 B&R-clust 0.998 0.996 0.996 2.05 1.0139 F-clust 0.483 0.000 0.483 1.00 0.7408 3.5 B&R-clust 0.999 0.998 0.998 2.01 1.1296 F-clust 0.483 0.000 0.480 1.20 0.7676 4 B&R-clust 0.995 0.989 0.989 2.12 1.2880 F-clust 0.483 0.000 0.483 1.00 0.8055 4.5 B&R-clust 0.984 0.967 0.966 2.47 1.3130 F-clust 0.498 0.029 0.497 1.06 0.7094 3.157 BIC B&R-clust 0.951 0.905 0.931 2.51 0.8385 2.226 F-clust 0.671 0.326 0.319 13.52 0.5308 1 B&R-clust 0.962 0.924 0.922 3.19 0.4368 F-clust 0.906 0.810 0.805 5.43 0.5789 1.5 B&R-clust 0.985 0.983 0.966 2.13 0.6120 F-clust 0.968 0.935 0.933 3.00 0.6321 2 B&R-clust 0.998 0.996 0.995 2.06 0.7533 F-clust 0.999 0.998 0.998 2.01 0.6618 2.5 B&R-clust 1.000 1.000 1.000 2.00 0.8597 F-clust 1.000 1.000 1.000 2.00 0.6897 2.5 3 B&R-clust 1.000 1.000 1.000 2.00 0.9593 F-clust 1.000 1.000 1.000 2.00 0.7325 3.5 B&R-clust 1.000 1.000 1.000 2.00 1.0465 F-clust 1.000 1.000 1.000 2.00 0.7736 4 B&R-clust 1.000 1.000 1.000 2.00 1.1210 F-clust 1.000 1.000 1.000 2.00 0.8146 4.5 B&R-clust 1.000 1.000 1.000 2.00 1.1837 F-clust 0.998 0.996 0.996 2.05 0.6534 2.190 BIC B&R-clust 0.994 0.987 0.987 2.18 0.4388 1.107 200MCsamples,T =100,α1=0.4,m=20. Eachcellinthe“RMSE”columnreportsthemedianofRMSEsof200MCsamples,whichisfurthermultipliedby100. 23

zero. Nonetheless, it is interestig that in terms of RMSE, our method still is comparable or some- 405 times better than B&R-clust. However, when θ = 2.5, with a proper choice of tuning parameter λ , the two clustering methods seem to have similar clustering performance. In particular, when 1 λ exceeds 1.5, both methods result in almost perfect clustering. RMSEs seem to be comparable 1 as well. Two conclusions can be drawn from this simulation exercise. One is that choosing the right 410 range of tuning parameters affects the result greatly. The other is that upon the right choice of tuning parameters, both methods lead to reliable clusters, and they both estimate the coefficients quite precisely. 3.3.2 Selection of Tuning Parameters The clustering performance shown above raises the need to carefully choose the tuning param- 415 eters, θ and λ , as they affect the clustering performance considerably. In particular, when θ = 2, 1 ourclusteringmethoddoesnotworkwellforthesimulationsettingsweconsidered,nomatterwhat λ is. Therefore, it is important for users to make a wise choice of θ. Due to the limited knowledge 1 about the true groups in practice, the BIC alone cannot be the right guide to choose the parameters appropriately. In this section, we shall propose a strategy to select the tuning parameters by 420 calculating the globally convex interval. Let c∗(λ ) be the minimal eigenvalue of the corresponding design matrix W(Π∗⊗I )/n, where θ 1 p Π∗ containstheestimatedgroupinformationwiththegivenparameters. NotethatΠ∗ issimilarto Π, except that it is built with estimated groups from data rather than the true groups. Following the arguments in Breheny and Huang [2011], it can be shown that a subset of the globally convex 425 regions of θ and λ is given by λ ≥λ∗ and θ that satisfy: 1 1 1 λ∗ =inf{λ :θ >1/c∗(λ )} if the MCP penalty is used, 1 1 θ 1 (10) λ∗ =inf{λ :θ >1+1/c∗(λ )} if the SCAD panelty is used. 1 1 θ 1 Here is one strategy to find a convex region: Step 1 For a given θ, choose λ that minimizes BIC. Denote it as λ . 1 1,BIC 24

Table 4: Selection of λ given θ 1 Sample θ =2 θ >2 1 4.5 3.5 λ 1,BIC 2 5.0 4.0 1 0.1452 0.0681 c∗(λ ) θ 1,BIC 2 0.1420 0.0694 1 (6.89,∞) (14.69,∞) Globally Convex Interval of θ 2 (7.04,∞) (14.41,∞) Step 2 Find c∗(λ ) and λ∗. θ 1,BIC 1 Step 3 Check if λ >λ∗ and θ satisfy (10). If not, increase the value of θ and go back to Step 1. 430 1 1 Table 4 presents examples of subsets of convex intervals for an MCP penalty, determined from the simulation settings in Section 3.3.1. Two random samples are considered. Let us take sample 1 as an example. When θ = 2, the BIC-chosen λ is 4.5, and the subset of 1 the globally convex interval for θ is calculated as (6.89,∞). Since θ is not in this region, increase the value of θ. Repeat the process with θ=2.1. The convex interval for θ is (14.68,∞), which 435 does not include θ. We need to increase θ again. As a matter of fact, for our simulation setting, the clustering results was the same for all θ = 2.1,2.2,...,16, which successfully identify the true clusters. Thedesignmatricesarealsothesameasaresult, whichleadstothe(almost)samechoice of λ and c∗(λ ). Therefore, for this dataset, sample 1, as long as θ is greater 2, we would 1,BIC θ 1,BIC have an optimal clustering results with a BIC-chosen λ . This observation is consistent with our 440 1 simulation results. Our method performed well when θ >2 but not when θ =2. 3.3.3 Comparison of the three clustering methods This section compares the three clustering methods (F-clust, B&R-clust, and Fourier-SSP) and thesubject-wisenonparametricMIDASusingtheFourierflexibleformandpolynomials(F-noclust). These methods are compared in terms of the accuracy for parameter estimation in RMSE and for 445 forecastinginRMSFE.ForF-clustandB&R-clust,θ =2.5isconsidered,followingthesuggestionin Section 3.3.2. The frequency ratios m selected inTable 5are 20and 40 tosaveworkloadon B&R’s method. 250 samples are generated in MC simulation. Other than that, the sample size T and the scale α of weights are the same as those considered in Sections 2.2 and 3.3.1. In Fourier-SSP 1 25

method, the maximum number of groups is fixed as two for the grid search to save the calculation 450 load, taking advantage of prior knowledge of the true number of clusters. However, in practice, it could be a problem if this number is improperly chosen. Table 5: Parameter Estimation Accuracy in a Panel Setting m 20 40 α Method T 100 200 400 100 200 400 1 F-clust 0.4031 0.3466 0.3059 0.1587 0.1442 0.1304 B&R-clust 0.3945 0.3279 0.2261 0.1487 0.1103 0.1005 0.2 Fourier-SSP 9.6804 8.4929 7.9253 8.8707 7.5578 6.2652 F-noclust 8.2571 5.5480 3.7683 13.7938 8.4324 5.5765 F-clust 0.5163 0.4691 0.4315 0.2152 0.2012 0.1828 B&R-clust 0.4306 0.3531 0.2404 0.1670 0.1221 0.0922 0.3 Fourier-SSP 7.4175 6.8505 6.2241 7.2194 5.8779 4.3612 F-noclust 8.2573 5.5478 3.7685 13.7938 8.3948 5.5765 F-clust 0.6392 0.5966 0.5558 0.2744 0.2603 0.2145 B&R-clust 0.4496 0.3663 0.2482 0.1789 0.1364 0.0959 0.4 Fourier-SSP 5.8207 5.4699 5.1157 5.8455 4.8590 4.6615 F-noclust 8.2573 5.5478 3.7685 13.7938 8.3948 5.5765 EachcellreportsthemedianofRMSEsof250MCsamples,whichisfurthermultipliedby100. Table5presentsmedianRMSEsofγ. Inparticular,theRMSEofallFourier-basedmethodsare (cid:98) (cid:113) calculated as RMSE = n−1(cid:80)n (cid:107)M(cid:48)β(cid:98) −β∗(cid:107)2. Measures of clustering accuracy are not prei=1 i i 2 sented in this table, because all three clustering methods have perfectly identified the true clusters 455 using BIC. In terms of estimation accuracy, F-clust and B&R-clust tend to outperform Fourier- SSP and the subject-level linear regression. Fourier-SSP and the subject-level linear regression do become more accurate as the sample size increases, but not to the extent that they exceed the accuracy of the other two methods based on the penalized regression with (9). The B&R-clust seems to have the best performance for all settings, while our approach is quite close to the B&R-clust. 460 All three cluster-based method tend to improve as the scale α increases, whereas F-noclust is not 1 affected. This is consistent with the results in Table 5, where Fourier method is not affected much by a different α . In contrast, the three cluster-based methods tend to perform better if the signal 1 is stronger. Computation is the fastest in F-noclust since it does not involve the penalized optimization. 465 F-clust is the next fastest method, followed by Fourier-SSPxi. B&R-clust is the slowest, taking xiIn our simulations, these two methods have similar computation time. This is because we limit the maximum 26

at least three times the computation time of our method. F-noclust does not utilize the group information, and parameter estimation tends to be less accurate than in other methods, especially when the sample size T is smaller or the frequency ratio m is larger. The quality does get better at a faster rate than Fourier-SSP as T increases, but to achieve the same amount of accuracy as 470 F-clust or B&R cluster, one would need T (cid:29)400, which is often not possible in practice. When T isrelativelysmallforagivenm, usingtheneighborinformationinthesameclustercanbeoneway to improve the quality of the parameter estimation. Therefore, our method (F-clust) successfully identifiestrueclustersandsavecomputationtimesubstantially,withoutloosingtoomuchaccuracy in parameter estimation. 475 Table 6: One-Step-Ahead Forecasting Accuracy in a Panel Setting m 20 40 α Method T 100 200 400 100 200 400 1 F-clust 0.7700 0.7437 0.7164 0.7591 0.7173 0.7081 B&R-clust 0.9942 0.7925 0.7214 0.7781 0.7336 0.7139 0.2 Fourier-SSP 2.5779 2.6228 2.7425 2.6582 2.5276 2.4786 F-noclust 0.1619 0.1401 0.1319 0.2916 0.1617 0.1398 F-clust 0.7911 0.7591 0.7192 0.7836 0.7150 0.7051 B&R-clust 0.9937 0.8214 0.7197 0.8010 0.7243 0.7144 0.3 Fourier-SSP 2.4774 2.4952 2.5131 2.5103 2.3803 2.3066 F-noclust 0.1619 0.1401 0.1319 0.2916 0.1621 0.1398 F-clust 0.8072 0.7722 0.7290 0.8058 0.7252 0.7176 B&R-clust 1.0281 0.8336 0.7289 0.8166 0.7277 0.7257 0.4 Fourier-SSP 2.2377 2.2315 2.2493 2.2844 2.1698 2.0982 F-noclust 0.1619 0.1401 0.1319 0.2916 0.1621 0.1398 EachcellreportsthemedianofRMSFEsof250MCsamples. Table 6 presents the median RMSFEs of the one-step-ahead forecast. The RMSFEs are computed in a similar way as presented in Section 2.2, replacing β(cid:99)∗ with the one obtained from the (cid:113) penalized regression (9), and RMSFE = (nT/2)−1(cid:80)T/2 (cid:80)n (y −y )2. It is k=1 j=1 (cid:98)j,T/2+h+k j,T/2+h+k worth noting that F-noclust outperforms all the cluster-based approaches. This is somewhat ex- 480 pected, as β(cid:99)∗ from the subject-level regression is supposed to be the most efficient estimator of β∗ among all unbiased estimators under our set of assumptions. Nonetheless, F-clust and B&Rclust provide reasonably accurate forecast compared to the Fourier-SSP method. This observation demonstrates that our penalty functions in (9) may perform better than SSP’s penalty functions, number of groups of Fourier-SSP to 2, utilizing the true group information, which saves the computation time considerably. Inreality,ourmethodisfasterwhenthetruegroupinformationcannotbeused. 27

both in terms of estimation and forecast accuracy in a setting similar to ours. Overall, if one is interested in identifying clusters in a panel MIDAS data without prior knowl- 485 edgeongroupstructures, itseemsthatourmethodperformsreasonablywellwithoutrequiringtoo heavy computations. 4 Heterogeneity in Labor Market Dynamics across States: Through the Lens of a Mixed-Frequency Okun’s Law Model With the new method, we explore the heterogeneity in labor market dynamics across states 490 through the lens of a mixed-frequency Okun’s law model. 4.1 Panel Data of State-Level Labor Markets and Model Description Okun’s law refers to the empirical negative correlation between output growth and unemployment rate. A popular specification often adopted in the literature (e.g., Knotek II [2007]) is the following. Let u be the first-differenced unemployment rate and y be the growth rates of GDP. 495 t t Okun’s law is a linear relationship between these two variables u =δ+αy +ε , t t t where δ is a constant, ε is an error term and the coefficient α has a negative sign.xii t It has been observed that an Okun’s law model might encounter difficulties in dealing with a sudden and abrupt rise in the unemployment rate due to a burst of job losses at the inception of an economic downturn (e.g., Lee [2000], Moazzami and Dadgostar [2011], Kargı [2016]). In other 500 words, an Okun’s law model with GDP growth as the sole explanatory variable is likely to have difficulties in explaining the nonlinear feature of unemployment dynamics. Weekly initial claims have the highest frequency among the publicly available labor market indicators, and thus can capture the magnitude of job loss in a timely manner. In this regard, the Okun’s law model with weekly initial claims can better capture the non-linearity in unemployment dynamics and also can 505 be used to nowcast the unemployment rate on a weekly basis in real time. xiiThisspecificationisoftenreferredtoasthedifferencedversionofOkun’slaw. 28

The variables that we use for the mixed-frequency Okun’s law model are the quarterly growth rate of log GDP in state i in quarter t (y ), the first-differenced unemployment rate of state i in i,t quarter t (u ), and the log of initial claims in week j of quarter t in state i (x ).xiii i,t i,t,j Weconsider50statesandtheDistrictofColumbia,atotal51cross-sectionunits(orsubjects).xiv 510 The sample period is from 2005 to the second quarter of 2018, as the quarterly real GDP at the state level is available from 2005. The mixed-frequency Okun’s law model is specified as follows: u =δ +α y +x(cid:48) β∗+ε , i,t i i i,t i,t i i,t where x =(x ,···,x )(cid:48) is the collection of weekly initial claims of the corresponding quari,t i,t,1 i,t,mt ter. Onecomplicationofthemixed-frequencyOkun’slawmodelisthatthedistributedlagstructure of weekly initial claims coefficient β∗ is not well defined, as a quarter has a different number of i weeks ranging from 12 to 14. In this case, the construction of a MIDAS model usually requires a more complicated parameterization to cope with these irregular frequencies. Notably, our method does not require such a procedure. The proposed MIDAS model can flexibly handle the changing number of MIDAS parameters, as the algorithm allows the Fourier transformation matrix M to i,t vary over time as noted in Remark 1. The Fourier-transformed log initial claims can be written as x =M x , and now the model is re-specified as follows: (cid:101)i,t i,t i,t u =δ +α y +x˜(cid:48) β +ε , i,t i i i,t i,t i i,t where β = (β ,···,β )(cid:48) and L and K are the number of parameters in Fourier approxii i,1 i,2K+L+1 mation. We cluster states based on α and β , as these parameters capture the dynamic features i i of unemployment in state i. States that share similar values for (α ,β(cid:48)) are allocated to the same 515 i i group. xiiiThe state-level GDP growth is from Bureau of Economic Analysis, the state-level unemployment rate is from LocalAreaUnemploymentStatisticsbyBureauofLaborStatistics,andthestate-levelinitialclaimsarefromDepart of Labor. We seasonally adjust initial claims using seasonal-trend decomposition using LOESS (STL), and use the seasonallyadjustedclaimsfortheestimationofOkun’slawmodel. xivInsomestates,thereisasmallnumberofweekswhentheinitialclaimsdataarenotreleased,due,forinstance, totheshutdownofalocalagencycollectingthedata. 29

In our clustering algorithm, L = 1 and K = 2 are chosen to effectively summarize the highfrequencyinformation.xv Thisselectionensuresthatthetotalnumberofparametersissmallerthan in the conventional MIDAS model. For this dataset, we could not find a global convex area for θ and λ . Instead, we conducted a grid search on a range of λ for each θ =2,3,5,8,10. Among the 520 1 1 values on the grid, we found that θ = 2 and λ = 2.6 minimize both AIC and BIC criteria and 1 therefore our reported results are based on these values. 4.2 Clustering Analysis for the State-Level Labor Markets in the United States Table 7 summarizes estimation results. The algorithm identifies four clusters. There are 24 525 statesincluster1,19statesincluster2,7statesincluster3,and1stateincluster4. Clusters1,2, and 3 account for 47.0%, 36.4%, and 15.3% of national payroll employment, respectively. Cluster 4 consistsofasinglestate—Louisiana—andconstitutes1.4%ofaggregateemployment. Theclusters are determined jointly by the coefficients on GDP growth and the coefficients on log weekly initial claims. Based on the absolute size of coefficients on GDP growth and log initial claims (columns 530 5 and 6 of Table 7), the labor markets of cluster 3 are the most cyclically sensitive, while those in clusters 2 and 1 are moderately and weakly cyclical, respectively. Quite differently, the coefficient onGDPgrowthincluster4isclosetozeroandstatisticallyinsignificant,butthesumofcoefficients onloginitialclaimsispositiveandstatisticallysignificant: hence,incluster4GDPgrowthratedoes not affect the unemployment rate, while initial claims do. The clusters are further distinguished 535 by the pattern of coefficients on log weekly initial claims. The estimated trajectories within the quarter are plotted in Figure 1, and summarized in Column 7 of Table 7. These trajectories are quite distinct across clusters as shown in Figure 1. The coefficients exhibit an uptrend in cluster 1, while those in cluster 2 and 3 show “W ” and “M ” shapes, respectively. The coefficients of Cluster 4 show an “N ” shape. This result suggests that both the trajectory and the size of these 540 coefficients are important in distinguishing clusters. The large coefficients on certain weeks’ initial claims suggest that those who file for UI benefits during these weeks are more likely to raise the state’s unemployment rate than others. This might xvTheclusteringresultsweresimilartounreportedresultswithL=2andK=3. 30

Table 7: Summary of identified clusters Number Member Emp. GDP Sum of IC IC coeff.’s of states states share Coeff. Coeff. Shape Cluster 1 24 SouthCarolina,NorthCarolina,Florida, 47.0% -0.141 0.862 Upward Wisconsin,Colorado,RhodeIsland, (0.00925) sloping Iowa,SouthDakota,Kansas,NorthDakota, Hawaii,Indiana,Wyoming,Oklahoma, NewHampshire,NewJersey,Maine, Michigan,Vermont,Nebraska, California,Delaware,NewYork,Alaska Cluster 2 19 Georgia,Oregon,Ohio,Utah,Tennessee, 36.4% -0.203 0.972 W-shape Texas,NewMexico,WestVirginia, (0.0150) Missouri,Mississippi,Arkansas, Massachusetts,Kentucky, DistrictofColumbia,Massachusetts, Idaho,Pennsylvania,Montana,Connecticut Cluster 3 7 Alabama,Arizona,Illinois,Washington, 15.3% -0.313 1.468 M-shape Nevada,Minnesota,Virginia (0.0280) Cluster 4 1 Louisiana 1.4% 0.0250 1.244 N-shape (0.0625) NotetoTable7: Theabbreviation”Emp. share”referstotheshareoutofaggregatepayrollemployment;”GDP Coeff.”referstothecoefficientonGDPgrowth;”ICcoeff’sshapereferstotheshapeofcoefficientsontheweekly initialclaimsthroughthecorrespondingquarter. Numbersintheparenthesesarethestandarderrors. be related hirings and layoffs practices and the timing of regular employment turnovers in different statesorregions. Layoffsrelatedtotemporaryhiringsmightbeconcentratedinparticularweeksin 545 somestates. Workerspreviouslyhiredbythefirmswhoperiodicallylayoffandrecalltheirworkers typically file for UI claims during specific weeks of quarter and are pretty quickly re-employed. On the other hand, those who file for UI benefits outside these weeks might be more likely to be permanentjobloserswhotendtostayunemployedforalongerperiod. Therefore,theinitialclaims filed by these workers tend to be more strongly correlated with the unemployment rate than those 550 filed by temporary job losers. As an example, in cluster 2 more permanent job losers might file for UI claims during weeks 1, 5, 6, 11 and 12 than in other weeks. Quite differently, in cluster 1 where the coefficients on initial claims exhibit an upward trend, temporary layoffs might be concentrated earlyinthequarter. Insynthesis,eachcluster’scoefficientspatternmightreflecttheseinstitutional factors. Hence, the different shapes of coefficients can be interpreted as the outcomes of labor 555 market conventions that differ across clusters.xvi xviThe large positive coefficients observed on week 5,6, 11, and 12 might also be related to the reference week of Current Population Survey that usually falls in the second week of each month. The number of those file for the 31

Figure 1: Coefficients on log initial claims by cluster NotetoFigure1: Authors’calculations.Theshadedareadenotesthe95%confidenceintervals. Figure 2 displays the geographical locations of clusters. Cluster 1, denoted as light blue, is composed of (1) agricultural states in the Midwest region, (2) manufacturing states in the Eastnorth central region, and (3) states in the Northeast. Far from the states in cluster 1, however, California, Alaska, Florida, and North and South Carolina also belong to Cluster 1.xvii Cluster 560 2 denoted as pink is broadly composed of (1) agricultural states in the West (mountain region), (2) states in the central South region, and (3) manufacturing states in the middle Atlantic region of the Northeast. Overall, states that belong to the same cluster are adjacent with each other in cluster 1 and 2. Quite differently, states in cluster 3 denoted as orange are widely dispersed. UI claims in the first half of month might be highly correlated with a change in the unemployment rate captured by the survey, if the recent filing of UI claims make the survey respondent more likely to report unemployment as their labor force status. However, this feature is not clearly observed in other clusters. Therefore, the pattern of coefficientsoninitialclaimsislesslikelytobetheoutcomeofreference-weekeffect. xviiWefollowtheCensusBureau’sdivisionofregions. 32

This observation suggests that the geographical proximity is not a necessary condition for the 565 identification of a cluster, but is only partially correlated with cluster membership. This might be because adjacent states often share similar structural characteristics, such as available natural resources, oil production, and industrial structure. Figure2: Statesbycluster(cluster1=lightblue,cluster2=pink,cluster3=orange,cluster4=red) NotetoFigure2: Authors’calculations. We further relate the clusters to observable state characteristics in order to find an economic interpretation. Tothisend,weconsiderfivevariables: (1)thesmallfirmsshare,(2)theemployment 570 share of manufacturing, (3) the employment share of finance industry, (4) the GDP share of oil production, and (5) the fraction of long-term unemployment on total unemployment. The first four characteristics are considered in Hamilton and Owyang [2012] as possible explanations for heterogeneous regional business cycles.xviii We also include the share of long-term unemployment, xviiiFollowingthisstudy,wealsoanalyzetheclustersbasedontheseattributes. Thestate-leveldataoffourvariables 33

Table 8: Features of each cluster Small-firm Manufacturing Finance Oil- Long-term share intensive intensive producing unemployment Cluster 1 0.54 0.46 0.46 0.17 0.38 Cluster 2 0.26 0.53 0.37 0.26 0.63 Cluster 3 0.14 0.43 0.43 0 0.43 Cluster 4 1 0 0 1 0 Note to Table 8: Authors’ calculations. Numbers in red are larger than 0.5; those in blue are between 0.4 and 0.5. a component that is likely to reflect the structural unemployment.xix The unemployment rate of 575 a state where the share of long-term unemployment is high might be less responsive to changes in labor demand. Wefindthatthefourclustersaremoderatelydistinctinthesefiveobservabledimensions. Table 8 and Figure 3 summarize the observable features of each cluster. The feature of each cluster is computed from the fraction of states in the cluster whose particular observable characteristic is 580 more prominent than the average of all states. For example, according to the second column of Table 8, the fraction of states in cluster 1 whose small firms share is larger than the average of all states is 0.54, and that in cluster 2 is 0.26. Cluster1issummarilydescribedassmall-firm/manufacturing/financeintensive. Morethanhalf of the states in this cluster have the share of small firms higher than the average of all states. At 585 the same time, a little less than a half of the states in this cluster have above average employment share of manufacturing. Cluster2ischaracterizedaslong-term-unemploymentproneandmanufacturingintensive. About 60% of states in this cluster have a higher than average share of long-term unemployment, and a little more than half of the states have manufacturing shares in employment above the average. 590 Cluster 3 is characterized as manufacturing-finance intensive and long-term unemployment prone. Threeoutofsevenstateshavelargerthanaveragefractionofemploymentinmanufacturing and finance. In addition, the three states have above-average long-term unemployment shares. Louisiana(cluster4)isanoil-producingstate,whoseshareofsmallfirmsislargerthanaverage. Summing up, clusters are heterogeneous in multiple dimensions, characterized by differences 595 arefromHamiltonandOwyang[2012]. xixThefractioniscalculatedbasedonthemicrodatafromtheCurrentPopulationSurvey(CPS). 34

Figure 3: Features of each cluster NotetoFigure3: Authors’calculations. in several observable attributes, as shown in Figure 3. In synthesis, the empirical application demonstrates that our algorithm is be able to reveal meaningful heterogeneity in labor-market dynamics across states without requiring prior knowledge, which in many cases derives from data limitations or from theories lacking empirical support. 5 Conclusion 600 This paper proposed a new clustering method in a panel MIDAS setting, grouping subjects with similar MIDAS coefficients. The clustering is purely data driven, and relies on appropriate adaptation of an existing penalized regression approach. The major advantage of our method is that it does not require prior knowledge of true group membership, not even of the number of groups. A penalized regression already requires at least two tuning parameters, which are often 605 difficult to choose, yet this choice is crucial, as we show in our simulations. In our nonparametric MIDAS specification, based on the Fourier flexible form and polynomials, only a single tuning parameter is required. We provide a strategy for choosing the tuning parameter based on a convex region approach. We show that our proposed clustering method works well both asymptotically 35

and in finite samples. In addition, our non-parametric approach is very simple and fast, and 610 leads to parameter estimates that are as accurate as competing methods, in many cases even more accurate. As an empirical example, we provide an application to labor market dynamics at state level in the United States. The application, based on a mixed frequency Okun’s law model, allows us to categorize the states in four meaningful clusters that correspond to relevant and measurable differences along different dimensions. 615 References E. Andreou, E. Ghysels, and A. Kourtellos. Regression models with mixed sampling frequencies. Journal of Econometrics, 158:246–261, 2010. A. Babii, E. Ghysels, and J. Striaukas. Estimationand HAC-based Inferencefor Machine Learning Time Series Regressions. mimeo, 2019. 620 R. Becker, W. Enders, and S. Hurn. A general test for time dependence in parameters. Journal of Applied Econometrics, 19(7):899–906, 2004. R. Becker, W. Enders, and J. Lee. A stationarity test in the presence of an unknown number of smooth breaks. Journal of Time Series Analysis, 27(3):381–409, 2006. P. Breheny and J. Huang. Coordinate descent algorithms for nonconvex penalized regression, with 625 applications to biological feature selection. The Annals of Applied Statistics, 5(1):232–253, 2011. J. Breitung and C. Roling. Forecasting inflation rates using daily data: A nonparametric midas approach. Journal of Forecasting, 34(7):588–603, 2015. W.EndersandJ.Lee. Aunitroottestusingafourierseriestoapproximatesmoothbreaks. Oxford Bulletin of Economics and Statistics, 74(4):574–599, 2012. 630 J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001. A.R.Gallant. Onthebiasinflexiblefunctionalformsandanessentiallyunbiasedform: thefourier flexible form. Journal of Econometrics, 15(2):211–245, 1981. 36

E. Ghysels, A. Sinko, and R. Valkanov. Midas regressions: Further results and new directions. 635 Econometric Reviews, 26(1):53–90, 2007. B.Gu¨ri¸s. ANewNonlinearUnitRootTestwithFourierFunction. MPRAPaper82260, University Library of Munich, Germany, Oct. 2017. J. D. Hamilton and M. T. Owyang. The propagation of regional recessions. Review of Economics and Statistics, 94(4):935–947, 2012. 640 L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985. P. Jaccard. The distribution of the flora in the alpine zone. 1. New phytologist, 11(2):37–50, 1912. B. Kargı. Okun’s law and long term co-integration analysis for oecd countries (1987-2012). EMAJ: Emerging Markets Journal, 6(1), 2016. E. S. Knotek II. How useful is okun’s law? Economic Review-Federal Reserve Bank of Kansas 645 City, 92(4):73, 2007. J. Lee. The robustness of okun’s law: Evidence from oecd countries. Journal of macroeconomics, 22(2):331–356, 2000. Y. Lv, X. Zhu, Z. Zhu, and A. Qu. Nonparametric cluster analysis on multiple outcomes of longitudinal data. Statistica Sinica, page Accepted, 2019. 650 S. Ma and J. Huang. Estimating subgroup-specific treatment effects via concave fusion. arXiv preprint arXiv:1607.03717, 2016. S. Ma and J. Huang. A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association, 112(517):410–423, 2017. B. Moazzami and B. Dadgostar. Okun’s law revisited: evidence from oecd countries. Int. Bus. 655 Econ. Res. J., 8:21–24, 02 2011. doi: 10.19030/iber.v8i8.3156. P. Perron, M. Shintani, and T. Yabu. Testing for flexible nonlinear trends with an integrated or stationary noise component. Oxford Bulletin of Economics and Statistics, 79(5):822–850, 2017. 37

W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971. 660 P. M. M. Rodrigues and A. M. Robert Taylor. The flexible fourier form and local generalised least squares de-trended unit root tests. Oxford Bulletin of Economics and Statistics, 74(5):736–759, 2012. L. Su, Z. Shi, and P. C. B. Phillips. Identifying latent structures in panel data. Econometrica, 84 (6):2215–2264, 2016. 665 C. Zhang. Nearly unbiased variable selection under minimax concave penalty. Ann. Statist., 38(2): 894–942, 04 2010. X. Zhu and A. Qu. Cluster analysis of longitudinal profiles with subgroups. Electronic Journal of Statistics, 12:171–193, 01 2018. 38

Supplementary Material to “Revealing Cluster Structures Based on Mixed Sampling Frequencies” Yeonwoo Rho1*, Yun Liu2, and Hie Joo Ahn3 1Michigan Technological University 2Quicken Loans 3Federal Reserve Board September 9, 2020 This supplementary material consists of two parts: Section A consists of details of algorithms used in the main paper; and Section B presents all proofs. A Algorithms This section contains details of algorithms introduced in the main paper. Section A.1 introduce details of B&R’s nonparametric MIDAS in our setting in Section 2. Section A.2 present our clusteringalgorithm(F-clust). Detailsonhowtosolvetheoptimizationproblemin(9)inoursettingis presented using the alternating direction method of multipliers (ADMM) algorithm. The proposed algorithm is also shown to be convergent. Sections A.3 and A.4 present the details of the two competing clustering methods. In particular, Section A.3 introduces how to combine the penalized regression approach with objective function (9) and the B&R’s method (B&R-clust). Section A.4 present the algorithm combining Su’s penalty function and the Fourier transformation for MIDAS (Fourier-Su). Section A.5 presents the algorithm to exclude a part of parameters from clustering. A.1 Breitung and Roling [2015]’s Nonparametric MIDAS The nonparametric MIDAS in Breitung and Roling [2015] is based on the discrete form of the cubic smooth spline. The least-squares objective function is penalized by the sum of the second *Address of correspondence: Yeonwoo Rho, Department of Mathematical Sciences, Michigan Technological University,Houghton,MI49931,USA.(yrho@mtu.edu) Emails: Y.Rho(yrho@mtu.edu),Y.Liu(yliu26@mtu.edu),andH.J.Ahn(HieJoo.Ahn@frb.gov) 1

difference of weights to balance the goodness of fit and the smoothness of weights. Assume that the MIDAS model is shown in (1). The penalized least-squares objective function is T (cid:32) m−1 (cid:33)2 m Q = (cid:88) y −α − (cid:88) x β∗ +λ (cid:88)(cid:0) (cid:53)2β∗(cid:1)2 , BR t+h 0 t,i i BR i t=1 i=0 i=2 where (cid:53)2β∗ = (β∗ − 2β∗ + β∗ ) indicates the second difference of weights. The smoothed i i i−1 i−2 least-squares (SLS) estimator [Breitung and Roling, 2015] becomes β(cid:98) ∗ BR =argmin{(y−Xβ∗)(cid:48)(y−Xβ∗)+λ BR (Dβ∗)(cid:48)Dβ∗}, β∗ where   0 1 −2 1 0 ··· 0     0 0 1 −2 1 ··· 0 D = . (m−2)×(m+1) . . . . . . . . . . . . . . . . . . . . .     0 0 0 ··· 1 −2 1 The tuning parameter λ can be chosen using an information criteria. For example, Breitung BR and Roling [2015] proposed to use the modified Akaike information criterion (AIC), 2(s +1) AIC =log{(y−y )(cid:48)(y−y )}+ λBR , λBR (cid:98)BR (cid:98)BR T −s +2 λBR where y =X(X(cid:48)X+λ D(cid:48)D)−1X(cid:48)y. (cid:98)BR BR A.2 Clustering algorithm for the Fourier Transformed data The optimization problem in (9) is not trivial. The alternating direction method of multipliers (ADMM) algorithm by Boyd et al. [2011] has been successfully employed solving this optimization problem [Ma and Huang, 2017, Zhu and Qu, 2018]. This section introduces the ADMM algorithm in our setting and proves that it is convergent. 2

By introducing η =γ −γ , minimizing (9) is equivalent to minimizing ij i j 1 (cid:88) Q(γ,η)= ||y−Wγ||2+ ρ(η ,λ ) subject to η =γ −γ , 2 2 ij 1 ij i j 1≤i<j≤n where η = (η(cid:48) ,...,η(cid:48) )(cid:48). Following Boyd et al. [2011], this constrained optimization problem 12 n−1,n can be solved using a variant of the augmented Lagrangian Q (γ,η,ξ)= 1 ||y−Wγ||2+ (cid:88) ρ(η ,λ )+ λ 2 (cid:88) ||γ −γ −η ||2+ (cid:88) ξ(cid:48) (γ −γ −η ), λ2 2 2 ij 1 2 i j ij 2 ij i j ij i<j i<j i<j (A.1) where ξ = (ξ(cid:48) ,ξ(cid:48) ,...,ξ(cid:48) )(cid:48) and ξ are p-vectors of Lagrangian multipliers. As proposed by 12 13 n−1,n ij Boyd et al. [2011], the optimization problem in (A.1) can be solved using the alternating direction methodofmultipliers(ADMM)algorithm. Atthe(s+1)-thstepoftheADMMalgorithm,estimated parameters γs+1, ηs+1 and ξs+1 are updated as γs+1 =argminQ (γ,ηs,ξs), γ λ2 ηs+1 =argminQ (γs+1,η,ξs), (A.2) η λ2 ξs+1 =ξs +λ (ηs+1−γs+1+γs+1), ij ij 2 ij i j where ηs and ξs are the estimates in the s-th iteration. By collecting terms related to γ, the first function in (A.2) is equivalent to minimizing 1 λ Qγ (γ,η,ξ)= (cid:107)y−Wγ(cid:107)2+ 2(cid:107)Dγ−(η+ξ/λ )(cid:107)2, λ2 2 2 2 2 2 where D = (e −e )(cid:48) ⊗I , D = (D(cid:48) ,D(cid:48) ,···,D(cid:48) )(cid:48), e is an n-dimension vector with the ij i j p 12 13 n−1,n i i-th element as one and the rest as zeros, and I is an identity matrix with rank p. Therefore, p γs+1 =(W(cid:48)W +λ D(cid:48)D)−1{W(cid:48)y+λ D(cid:48)(ηs+ξs/λ )}. 2 2 2 The MCP is shown to be nearly unbiased and is applicable here to update ηs+1 [Zhu and Qu, 2018]. The penalty function of MCP is ρ(γ − γ ,λ ) = ρ ((cid:107)γ − γ (cid:107) ,λ ) where ρ (x, i j 1 θ i j 2 1 θ 3

t)=t (cid:82)x (1− u) du. As a consequence, when the MCP is selected, ηs+1 can be updated by 0 θt + ij   η˜s ij +1 if (cid:107)η˜s ij +1(cid:107) 2 ≥θλ 1 , ηs ij +1 =  θλ θλ − 2 1 (cid:32) 1− (cid:107) λ η˜ 1 s / + λ 1(cid:107) 2 (cid:33) η˜s ij +1 if (cid:107)η˜s ij +1(cid:107) 2 <θλ 1 , 2 ij 2 + whereη˜s+1 =γs+1−γs+1−ξs /λ andθ >1/λ fortheglobalconvexityofthesecondminimization ij i j ij 2 2 function in (A.2) [Wang et al., 2018, Forthcoming]. Iftheminimizationfunctionofηs+1 isnon-convex, assigningappropriateinitialvaluesbecomes essential. Aproperstartwillleadtoanidealsolution. InspiredbyZhuandQu[2018],theclustering method can be initialized as shown in the following algorithm. Algorithm 1: F-clust Algorithm Initialization: ξ0 =0, γ0 =(W(cid:48)W)−1(W(cid:48)y), η0 =argmin Q (γ,η,ξ), where λ and θ >1/λ are η λ2 2 2 fixed. for s=0,1,2,··· do γs+1 =(W(cid:48)W +λ D(cid:48)D)−1{W(cid:48)y+λ D(cid:48)(ηs+ξs/λ )}. 2 2 2 ηs+1 =argmin Q (γs+1,η,ξs), η λ2 ξs+1 =ξs +λ (ηs+1−γs+1+γs+1), for all 1≤i<j ≤n. ij ij 2 ij i j if the stopping criteria are true then Break end end The estimated number of groups, G(cid:98), can be obtained by η. If γ (cid:98)i =γ (cid:98)j , γ i and γ j are expected to be in the same cluster. However, as a penalty η has been imposed in the clustering algorithm, ij theequalityoftwoestimatedparametersisnotachievable. Asaresult,theMCPpenaltyisutilized on η . Two parameters γ and γ are clustered in the same group if η =0. (cid:98)ij i j (cid:98)ij InAlgorithm1,thestoppingcriteriaaredefinedasthefollowing. Letκs+1 =γs+1−γs+1−ηs+1, ij i j ij κ=(κ(cid:48) ,···,κ(cid:48) )(cid:48) andτs+1 =−λ { (cid:80) (ηs+1−ηs )− (cid:80) (ηs+1−ηs )},τ =(τ ,···,τ )(cid:48). 12 n−1,n k 2 i=k ij ij j=k ij ij 1 n At any step s∗, if for some small values (cid:15)κ and (cid:15)τ, (cid:107)κs∗(cid:107) ≤ (cid:15)κ and (cid:107)τs∗(cid:107) ≤ (cid:15)τ, the algorithm 2 2 4

stops. Following Zhu and Qu [2018], define(cid:15)κ and (cid:15)τ as (cid:15)κ = √ np(cid:15)abs+(cid:15)rel(cid:107)D(cid:48)ξs∗ (cid:107) , (cid:15)τ = (cid:112) |I|p(cid:15)abs+(cid:15)relmax{(cid:107)Dηs∗ (cid:107) ,(cid:107)ηs∗ (cid:107) }, 2 2 2 where I = {(i,j) : 1 ≤ i < j ≤ n}, |I| indicates the cardinality of I. Here, (cid:15)abs and (cid:15)rel are predetermined small values. Proposition1. Theaboveclusteringalgorithmensuresconvergence,thatis,(cid:107)κs+1(cid:107)2 →0and(cid:107)τs+1(cid:107)2 → 2 2 0, as s→∞. Proof of Proposition 1. (cid:107)κs+1(cid:107)2 − s − → − ∞ → 0 can be shown similarly to the proof of Proposition 1 in 2 Ma and Huang [2017]. The proof of (cid:107)τs+1(cid:107)2 − s − → − ∞ →0 can be done by ignoring the penalty term in 2 the objective function in the proof of Theorem 3.1 in Zhu and Qu [2018]. Proposition 1 demonstrates that the clustering algorithm is convergent as the number of iteration, s, approaches infinity. The stopping criteria can be satisfied at some step eventually. A.3 Comparable Clustering Methods 1: B&R-clust Recall that in (6), the MIDAS regression model without Fourier transformation of each subject is y =Z α +X β∗+ε , i=1,···,n. i i i i i i For more than one subject, the penal MIDAS model can be written as   α y i =(Z i ,X i )  i   =W(cid:102)i γ∗ i , or y=W(cid:102)γ∗+ε, β∗ i where W(cid:102)i =(Z i ,X i ) is the raw observations, γ∗ i =(α i (cid:48),β i ∗(cid:48))(cid:48), γ∗ =(γ 1 ∗(cid:48),···,γ n ∗(cid:48))(cid:48). Refer to the main idea of Breitung and Roling [2015], the cubic smoothing spline penalty is considered. The penalized objective function will be given as 1 1 Q(γ∗)= (cid:107)y−Wγ∗(cid:107)2+ θ γ∗(cid:48)Aγ∗, 2 2 2 γ∗ 5

where θ is the pre-determined smoothing parameter, A=I ⊗(A(cid:48)A). A is defined as γ∗ n   1 −2 1 0 ··· 0     0 1 −2 1 ··· 0 A = . (m−2)×m . . . . . . . . . . . . . . . . . .     0 0 ··· 1 −2 1 According to Zhu and Qu [2018], our goal is to solve the constrained optimization function Q (γ∗,η,ξ)=Q(γ∗)+ (cid:88) ρ(η ,λ )+ λ 2 (cid:88) ||γ∗−γ∗−η ||2+ (cid:88) ξ(cid:48) (γ∗−γ∗−η ). (A.3) λ2 ij 1 2 i j ij 2 ij i j ij i<j i<j i<j The clustering algorithm of (A.3) is similar to Algorithm 1. Algorithm 2: B&R-clust Algorithm Initialization: (cid:16) (cid:17)−1(cid:16) (cid:17) ξ0 =0, γ0 = W(cid:102)(cid:48)W(cid:102)+θ γ∗ A W(cid:102)(cid:48)y , η0 =argmin η Q λ2 (γ,η,ξ), where λ 2 and θ >1/λ are fixed. 2 for s=0,1,2,··· do γs+1 =(W(cid:48)W +λ D(cid:48)D+θ A)−1{W(cid:48)y+λ D(cid:48)(ηs+ξs/λ )}. 2 γ∗ 2 2 ηs+1 =argmin Q (γs+1,η,ξs), η λ2 ξs+1 =ξs +λ (ηs+1−γs+1+γs+1), for all 1≤i<j ≤n. ij ij 2 ij i j if the stopping criteria are true then Break end end Note that Algorithm 2 follows the same main idea of Zhu and Qu [2018]. However, in Zhu and Qu [2018], the model introduces B-splines to approximate observations, while Algorithm 2 simply uses all high-frequency regressors. Moreover, an additional tuning parameter, θ , is required to γ∗ be predetermined. Refer to Breitung and Roling [2015], Zhu and Qu [2018], the selection of θ is γ∗ 6

based on the minimum of AIC given by AIC = (cid:88) n (cid:26) log (cid:18) (cid:107)y i −W i γ (cid:98)i (cid:107)2 2 (cid:19) + 2·df i (cid:27) , θγ∗ T T i=1 where df =tr{W (W(cid:48)W +θ A(cid:48)A)−1W(cid:48)}. The selection of λ here, is by minimizing i i i i γ∗ i 1 (cid:110) (cid:111) BIC =log (cid:18) (cid:107)y−Wγ (cid:98) (cid:107)2 2 (cid:19) + log(n) G(cid:98)( n 1 (cid:80)n i=1 df i ) . λ1 n n With fixed λ , AIC can be obtained for different values of θ . Then, fix θ with minimum 1 θγ∗ γ∗ γ∗ BIC, BIC can be calculated based on the determined θ . λ1 γ∗ A.4 Comparable Clustering Methods 2: Fourier-SSP Su et al. [2016] introduced C-Lasso for clusters to identify relatively large differences between parameters and group averages rather than the traditional Lasso for each subject to select relevant covariates. The penalized profile likelihood (PPL) function mentioned in Su et al. [2016] is n T 1 (cid:88)(cid:88) Q(γ∗)= φ(w ;γ∗,µ (γ∗)). nT it i (cid:98)i i i=1 t=1 By introducing the group Lasso penalty, the PPL criterion function becomes Q =Q(γ∗)+ λ PPL (cid:88) N (cid:89) G0 (cid:107)β −α (cid:107) , G,λPPL N i g 2 i=1g=1 where λ is a tuning parameter. The C-Lasso estimation γ and α, respectively. Without any PPL (cid:98) (cid:98) prior knowledge of the true clusters, PPL C-Lasso estimation requires a predetermination of a reasonable maximum value, G , of groups. An appropriate choice of (λ ,G ) can be found by 0 PPL 0 minimizing IC based on all possible values of clusters less than G as long as predetermined values 0 of λ . To start the algorithm, Su et al. [2016] suggested a natural initial value as α(0) = 0 for PPL (cid:98)g all g = 1,···,G and γ∗(0) as the quasi-maximum likelihood estimation (QMLE) of γ∗ in each 0 (cid:98) i subjects. More details can be found in Su et al. [2016]. 7

Algorithm 3: SSP – PPL Algorithm Given G and λ 0 PPL Initialization: α(0) =(α(0),···,α(0)) (cid:48) , γ∗(0) =(γ∗(0) ,···,γ∗(0) ) (cid:48) s.t. (cid:98) (cid:98)1 (cid:98)G0 (cid:98) (cid:98)1 (cid:98)n (cid:80)n (cid:107)γ∗(0) −α(0)(cid:107)=(cid:54) 0 for all g =2,···,G . i=1 (cid:98)i (cid:98)g 0 for s=1,2,··· do for g =1,2,···G do 0 Obtain the estimator (γ∗(s,G) ,α(s)) of (γ∗,α ) by minimizing the following (cid:98) (cid:98)g g objective function Q(s,g) (γ∗,α ). G,λPPL g if g =1 then λ Q(s,g) (γ∗,α )=Q(γ∗)+ PPL (cid:80)N (cid:107)γ∗−α (cid:107) (cid:81)G (cid:107)γ ∗(s−1,k) −α(s−1)(cid:107) ; G,λPPL g N i=1 i g k=2 i k else if g (cid:54)=G then Q(s,g) (γ∗,α )= G,λPPL g Q(γ∗)+ λ PPL (cid:80)N (cid:107)γ∗−α (cid:107) (cid:81)g−1(cid:107)γ∗(s,j) −α(s)(cid:107) (cid:81)G (cid:107)γ ∗(s−1,k)−α(s−1)(cid:107); N i=1 i g j=1 (cid:98)i j k=g+1 i k else Q(s,g) (γ∗,α )=Q(γ∗)+ λ PPL (cid:80)N (cid:107)γ∗−α (cid:107) (cid:81)G−1(cid:107)γ∗(s,k) −α(s)(cid:107) ; G,λPPL g N i=1 i g k=1 (cid:98)i k end end if the stopping criteria are true then Break end end Su et al. [2016] provided a stopping criteria for this algorithm: (cid:13) (cid:13)2 (cid:80)G (cid:13)α(s)−α(s−1)(cid:13) Q(cid:98) ( G s , − λ 1 P ) PL −Q(cid:98) ( G s , ) λPPL ≤(cid:15) tl and (cid:80)G g=1 (cid:13) (cid:13) (cid:13) α (cid:98) (s g −1) (cid:13) (cid:13) 2 (cid:98) + g 0.00 (cid:13) 01 ≤(cid:15) tl , g=1(cid:13)(cid:98)g (cid:13) where (cid:15) is a predetermined small value indicating the tolerance level. tl 8

A.5 Algorithm for dropping a part of regressors in clustering In the framework shown in Section 3, the procedure concentrates on clustering weights of Z i and X(cid:101)i at the same time. To cluster part of weights, a selection matrix C s is introduced∗. The modified penalized objective function: 1 (cid:88) Q(γ)= ||y−Wγ||2+ ρ(C γ −C γ ,λ ), 2 2 s i s j 1 1≤i<j≤n where C is a matrix of 1s and 0s that picks up the coefficient of interest. For example, if one is s interested in clustering Fourier transformed weights only, the matrix C is the same as D in (4). s The group-specified parameter is η =C γ −C γ , and the constrained optimization problem is (cid:101)ij s i s j 1 (cid:88) Q (γ,η,ξ)= ||y−Wγ||2+ ρ(η ,λ ) λ2 2 2 ij 1 i<j + λ 2 (cid:88) ||C γ −C γ −η ||2+ (cid:88) ξ(cid:48) (C γ −C γ −η ). 2 s i s j ij 2 ij s i s j ij i<j i<j Equivalently, 1 λ Qγ λ2 (γ,η,ξ)= 2 (cid:107)y−Wγ(cid:107)2 2 + 2 2(cid:107)D(cid:101)γ−(η+ξ/λ 2 )(cid:107)2 2 , whereD ij =(e i −e j )(cid:48)⊗I p andD(cid:101) =(D 1 (cid:48) 2 C s (cid:48),D 1 (cid:48) 3 C s (cid:48),···,D n (cid:48) −1,n C s (cid:48))(cid:48). Thecorrespondingalgorithm can be summarized as Algorithm 4. B Proofs B.1 Lemmas Assumptionsonregressors(inoursetting,W)madeinMaandHuang[2016]andrelatedpapers can be somewhat too strong for our panel setting. For example, (C3) in Ma and Huang [2016] assumes that each column of W, taking only the rows that correspond to the k-th group, should be nonrandom, and the sum of squares of all its elements is assumed to be equal to the size of k-th ∗Althoughwedonotprovideaformalproofforthisargument,thevalidityofthisalgorithmcanbeprovedina similarmanner,followingMaandHuang[2016]’sargument. Tokeepthepaperconcise,wedonotpresentthedetail inthispaper. 9

Algorithm 4: F-clust excluding some coefficients from clustering Initialization: ξ0 =0, γ0 =(W(cid:48)W)−1(W(cid:48)y), η0 =argmin Q (γ,η,ξ), where λ and θ >1/λ are η λ2 2 2 fixed. for s=0,1,2,··· do (cid:16) (cid:17)−1(cid:110) (cid:111) γs+1 = W(cid:48)W +λ 2 D(cid:101)(cid:48)D(cid:101) W(cid:48)y+λ 2 D(cid:101)(cid:48)(ηs+ξs/λ 2 ) . ηs+1 =argmin Q (γs+1,η,ξs), η λ2 ξs+1 =ξs +λ (ηs+1−Cγs+1+Cγs+1), for all 1≤i<j ≤n. ij ij 2 ij i j if the stopping criteria are true then Break end end group, i.e., |G |. This type of assumption could be realistic for data involved with an experimental k design, but not suitable for panel data setting, where columns of W generally consists of random variables. In this proof, we circumvent this issue by using the following lemmas. Lemma 1. Suppose a random vector ε = (ε ,ε ,...,ε )(cid:48) of length nT as in (8) satisfies 1,1 1,2 n,T Assumption 4. Let A ∈ Ra×nT be a nonrandom matrix with a positive integer a. Let Σ = A(cid:48)A. For any ζ >0, (cid:104) (cid:112) (cid:105) P (cid:107)Aε(cid:107)2 >2c˜{tr(Σ)+2 tr(Σ2)ζ+2(cid:107)Σ(cid:107) ζ} ≤e−ζ. 2 2 Proof of Lemma 1. When a=nT, this lemma is a special case of Theorem 2.1 in Hsu et al. [2012]. This can be easily seen by recognizing their µ, σ2, and α are 0, 2c˜, and (ν ,ν ,...,ν )(cid:48), 1,1 1,2 n,T respectively. If a < nT, a similar argument can still be used. Consider a singular value decomposition of A=USV(cid:48), where U and V are a×a and nT ×nT orthogonal matrices, respectively. Let ρ=(ρ , 1 ...,ρ )(cid:48) denotethe nonzeroeigenvaluesof A(cid:48)A and AA(cid:48). S isan a×nT matrix, whereitsdiagonal a √ elements are equal to ρ for i = 1,...,a and all other entries are zero. Let z be a vector of a i independent standard Gaussian random variables. Since U is orthogonal, y =U(cid:48)z is also an a×1 vector of a independent standard Gaussian random variables. Let y = (y ,...,y )(cid:48). Applying 1 a Lemma2.4ofHsuetal.[2012]on(cid:107)A(cid:48)z(cid:107)2 =Z(cid:48)AA(cid:48)z =z(cid:48)USV(cid:48)VS(cid:48)U(cid:48)z =ySS(cid:48)y(cid:48) = (cid:80)a ρ y2,then i=1 i i E (cid:8) exp (cid:0) γ(cid:107)A(cid:48)z(cid:107)2(cid:1)(cid:9) ≤exp (cid:18) (cid:107)ρ(cid:107) γ+ (cid:107)ρ(cid:107)2 2 γ2 (cid:19) (B.1) 1 1−2(cid:107)ρ(cid:107) γ ∞ 10

for any 0 ≤ γ < 1/(2(cid:107)ρ(cid:107) ). For any λ ∈ R and δ ≥ 0, using similar arguments as in (2.3) and ∞ (2.4) of Hsu et al. [2012], Assumption 4, and (B.1), (cid:18) λ2δ (cid:19) (cid:26) (cid:107)ρ(cid:107)2(λ2c˜)2 (cid:27) P((cid:107)Aε(cid:107)2 >δ)≤exp − exp (cid:107)ρ(cid:107) (λ2c˜)+ 2 . 2 1 1−2(cid:107)ρ(cid:107) (λ2c˜) ∞ Letδ =2c˜((cid:107)ρ(cid:107) 1 +τ), λ2 = 1 c˜2(cid:107)ρ 1 (cid:107)∞ (cid:16) 1− (cid:113) (cid:107)ρ(cid:107)2 2 + (cid:107)ρ 2 (cid:107) (cid:107) 2 2 ρ(cid:107)∞τ (cid:17) , andτ =2 (cid:112) (cid:107)ρ(cid:107)2 2 ζ+2(cid:107)ρ(cid:107) ∞ ζ. Thedesired proof is concluded by using similar arguments as Hsu et al. [2012] and observing (cid:107)ρ(cid:107) = (cid:80)a ρ = 1 i=1 i tr(Σ), (cid:107)ρ(cid:107)2 = (cid:80)a ρ2 =tr(Σ2), and (cid:107)ρ(cid:107) =max ρ =(cid:107)Σ(cid:107) . 2 i=1 i ∞ i i 2 A similar proof works for a > nT. In this case, without loss of generality, the only nonzero √ element in S are the first nT diagonal elements of S. Let ρ , i = 1,...,nT, be the nonzero i diagonal elements of S. Then ||A(cid:48)z||2 = (cid:80)nT ρ y2, where y are independent standard Gaussian i=1 i i i random variables. The rest of the proof is the same. Lemma2. SupposeconditionsofLemma1hold. ForanynT×npmatrixW satisfyingAssumption 2, P (cid:104) (cid:107)W(cid:48)ε(cid:107)2 >2c˜(np+2 (cid:112) npζ∗+2ζ∗)(cid:107)W(cid:48)W(cid:107) (cid:12) (cid:12)W (cid:105) ≤e−ζ∗ and 2 2 (cid:12) (cid:104) (cid:112) (cid:12) (cid:105) P (cid:107)Γ(cid:48)W(cid:48)ε(cid:107)2 >2c˜(Gp+2 Gpζ+2ζ)(cid:107)Γ(cid:48)W(cid:48)WΓ(cid:107) (cid:12)W ≤e−ζ 2 2 (cid:12) hold for any ζ∗ >0 and ζ >0. Proof of Lemma 2. Fix a nT ×nP matrix W that satisfies Assumption 2. Using Lemma 1, for any ζ∗ >0, P (cid:104) (cid:107)W(cid:48)ε(cid:107)2 2 >2c˜(tr(WW(cid:48))+2 (cid:112) tr((WW(cid:48))2)ζ∗+2(cid:107)WW(cid:48)(cid:107) 2 ζ∗) (cid:12) (cid:12)W (cid:105) ≤e−ζ∗ , and P (cid:104) (cid:107)Γ(cid:48)W(cid:48)ε(cid:107)2 2 >2c˜(tr(ΓWW(cid:48)Γ(cid:48))+2 (cid:112) tr((ΓWW(cid:48)Γ(cid:48))2)ζ+2(cid:107)ΓWW(cid:48)Γ(cid:48)(cid:107) 2 ζ) (cid:12) (cid:12)W (cid:105) ≤e−ζ. Since (cid:107)WW(cid:48)(cid:107) is the maximum eigenvalue of WW(cid:48), using the fact that WW(cid:48) is symmetric and 2 positive definite with rank np, it can be easily seen that λ (WW(cid:48))=λ (W(cid:48)W), max max (cid:107)WW(cid:48)(cid:107) =(cid:107)W(cid:48)W(cid:107) =(cid:107)diag(W(cid:48)W ,···,W(cid:48)W )(cid:107) ≤max(cid:107)W(cid:48)W (cid:107) , and 2 2 1 1 n n 2 i i 2 i tr(WW(cid:48))=tr(W(cid:48)W)≤np(cid:107)W(cid:48)W(cid:107) , tr((WW(cid:48))2)=tr((W(cid:48)W)2)≤np(cid:107)W(cid:48)W(cid:107)2. 2 2 11

Therefore (cid:112) (cid:112) tr(WW(cid:48))+2 tr[(WW(cid:48))2]ζ∗+2(cid:107)WW(cid:48)(cid:107) ζ∗ ≤(np+2 npζ∗+2ζ∗)(cid:107)W(cid:48)W(cid:107) . 2 2 Similarly, (cid:107)WΓΓ(cid:48)W(cid:48)(cid:107) =(cid:107)Γ(cid:48)W(cid:48)WΓ(cid:107) , 2 2 tr(WΓΓ(cid:48)W(cid:48))=tr(Γ(cid:48)W(cid:48)WΓ)≤Gpλ (Γ(cid:48)W(cid:48)WΓ)=Gp(cid:107)Γ(cid:48)W(cid:48)WΓ(cid:107) , and max 2 tr{(WΓΓ(cid:48)W(cid:48))2}=tr{(Γ(cid:48)W(cid:48)WΓ)2}≤Gp{λ (Γ(cid:48)W(cid:48)WΓ)}2 =Gp(cid:107)Γ(cid:48)W(cid:48)WΓ(cid:107)2. max 2 Therefore for any ζ >0, (cid:112) (cid:112) (cid:112) tr(Γ(cid:48)W(cid:48)WΓ)+2 tr{(Γ(cid:48)W(cid:48)WΓ)2} ζ+2(cid:107)Γ(cid:48)W(cid:48)WΓ(cid:107) ζ ≤(Gp+2 Gpζ+2ζ)(cid:107)Γ(cid:48)W(cid:48)WΓ(cid:107) . 2 2 As a result, given any matrix W, the inequalities in the statement have been validated. Lemma 3. Suppose Assumptions 2 and 4 hold for W and ε. Define S =2c˜(Gp+2 (cid:112) Gpζ+2ζ)g mM˜(cid:112) GpTB , ζ max q,m S =2c˜(np+2 (cid:112) npζ∗+2ζ∗)mM˜ √ TB √ p, ζ∗ q,m where B =(q1/2+m1/2(L+1+2K)), p=q+L+1+2K, M˜ =max(M ,M ,M ,M ) and c˜ q,m 1 2 3 4 given in Assumption 2 and 4, then P (cid:2) (cid:107)W(cid:48)ε(cid:107)2 >S (cid:3) ≤e−ι∗ and P (cid:2) (cid:107)Γ(cid:48)W(cid:48)ε(cid:107)2 >S (cid:3) ≤e−ι where 2 ζ∗ 2 ζ ι=min(ζ,−log((cid:15)))−log(2) and ι∗ =min(ζ∗,−log((cid:15)))−log(2) for any ζ and ζ∗ in Lemma 2. Proof of Lemma 3. Using the law of iterated expectations, E (cid:2) P (cid:0) (cid:107)W(cid:48)ε(cid:107)2 2 >S ζ∗ (cid:12) (cid:12)W (cid:1)(cid:3) =P [(cid:107)W(cid:48)ε(cid:107) 2 >S ζ∗ ] =E (cid:104) I {(cid:107)W(cid:48)ε(cid:107)2 2 >Sζ∗} (cid:12) (cid:12)(cid:107)WW(cid:48)(cid:107) 2 ≤M∗ (cid:105) P((cid:107)WW(cid:48)(cid:107) 2 ≤M∗) +E (cid:104) I {(cid:107)W(cid:48)ε(cid:107)2 2 >Sζ∗} (cid:12) (cid:12)(cid:107)WW(cid:48)(cid:107) 2 >M∗ (cid:105) P((cid:107)WW(cid:48)(cid:107) 2 >M∗) =P (cid:2) (cid:107)W(cid:48)ε(cid:107)2 2 >S ζ∗ (cid:12) (cid:12)(cid:107)WW(cid:48)(cid:107) 2 ≤M∗(cid:3) P((cid:107)WW(cid:48)(cid:107) 2 ≤M∗) +P (cid:2) (cid:107)W(cid:48)ε(cid:107)2 2 >S ζ∗ (cid:12) (cid:12)(cid:107)WW(cid:48)(cid:107) 2 >M (cid:3) P((cid:107)WW(cid:48)(cid:107) 2 >M∗). 12

Since(cid:107)M(cid:107) ≤mand(cid:107)M(cid:48)(cid:107) ≤L+1+2K asallelementsofM in(3)smallerthan1inmagnitude, ∞ ∞ (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:88) Z i (cid:48)Z i (cid:13) (cid:13) (cid:13) = (cid:88) (cid:107)Z i (cid:48)Z i (cid:107) ∞ ≤M 1 |G g | (cid:112) qT, (cid:13)i∈Gg (cid:13) ∞ i∈Gg (cid:13) (cid:13) (cid:13) (cid:13) √ (cid:13) (cid:13) (cid:13) (cid:88) Z i (cid:48)X˜ i (cid:13) (cid:13) (cid:13) ≤ (cid:88) (cid:107)Z i (cid:48)X i (cid:107) ∞ (cid:107)M(cid:48)(cid:107) ∞ ≤M 3 |G g | mT(L+1+2K), (cid:13)i∈Gg (cid:13) ∞ i∈Gg (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:88) X˜ i (cid:48)Z i (cid:13) (cid:13) (cid:13) ≤(cid:107)M(cid:107) ∞ (cid:88) (cid:107)Z i (cid:48)X i (cid:107) ∞ ≤M 4 |G g |m (cid:112) qT, and (cid:13)i∈Gg (cid:13) ∞ i∈Gg (cid:13) (cid:13) (cid:13) (cid:13) √ (cid:13) (cid:13) (cid:13) (cid:88) X˜ i (cid:48)X˜ i (cid:13) (cid:13) (cid:13) ≤(cid:107)M(cid:107) ∞ (cid:88) (cid:107)X i (cid:48)X i (cid:107) ∞ (cid:107)M(cid:48)(cid:107) ∞ ≤M 2 |G g |m mT(L+1+2K) (cid:13)i∈Gg (cid:13) ∞ i∈Gg holdwithprobabilityatleast1−(cid:15)forany(cid:15)>0definedinAssumption2. Therefore,withprobability at most 1−(cid:15), (cid:107)WW(cid:48)(cid:107) =(cid:107)W(cid:48)W(cid:107) =(cid:107)diag(W(cid:48)W ,···,W(cid:48)W )(cid:107) ≤sup(cid:107)W(cid:48)W (cid:107) 2 2 1 1 n n 2 i i 2 i (cid:13) (cid:13) √ √ (cid:13) (cid:13)Z i (cid:48)Z i Z i (cid:48)X˜ i (cid:13) (cid:13) ≤ psup(cid:107)W(cid:48)W (cid:107) = psup(cid:13) (cid:13) i i ∞ (cid:13) (cid:13) i i (cid:13)X˜(cid:48)Z X˜(cid:48)X˜ (cid:13) (cid:13) i i i i(cid:13) ∞ √ √ ≤M˜m TB p. q,m Since tr(WW(cid:48))=tr(W(cid:48)W)≤np(cid:107)W(cid:48)W(cid:107) and tr((WW(cid:48))2)=tr((W(cid:48)W)2)≤np(cid:107)W(cid:48)W(cid:107)2, 2 2 (cid:112) (cid:112) tr(WW(cid:48))+2 tr[(WW(cid:48))2]ζ∗+2(cid:107)WW(cid:48)(cid:107) ζ∗ ≤(np+2 npζ∗+2ζ∗)(cid:107)WW(cid:48)(cid:107) . 2 2 √ √ Since ||WW(cid:48)|| is bounded in probability, for any (cid:15) > 0, there exists some M∗ = M˜m TB p 2 q,m such that P[(cid:107)WW(cid:48)(cid:107) >M∗]≤(cid:15). Therefore 2 P (cid:2) (cid:107)W(cid:48)ε(cid:107)2 2 >S ζ∗ (cid:12) (cid:12)W,(cid:107)WW(cid:48)(cid:107) 2 ≤M∗(cid:3) ≤e−ζ∗ , 1−(cid:15)<P((cid:107)WW(cid:48)(cid:107) 2 ≤M∗)≤1, P (cid:2) (cid:107)W(cid:48)ε(cid:107)2 2 >S ζ∗ (cid:12) (cid:12)W,(cid:107)WW(cid:48)(cid:107) 2 >M∗(cid:3) ≤1, P((cid:107)WW(cid:48)(cid:107) 2 >M∗)≤(cid:15), √ and P (cid:2) (cid:107)W(cid:48)ε(cid:107)2 >S (cid:3) ≤e−ζ∗ +(cid:15) where S =2c˜(np+2 npζ∗+2ζ∗)M∗. 2 ζ∗ ζ∗ Without loss of generality, let ζ˜∗ = min{ζ∗,−log((cid:15))} for a large constant ζ∗ > 1 and for small 13

positiveconstant(cid:15)≤1,thene−ζ∗+(cid:15)=e−ζ∗+elog((cid:15)) =e−ζ˜∗(1+e−|ζ∗+log((cid:15))|)≤2e−ζ˜∗ =elog(2)−ζ˜∗. Take ι∗ = ζ˜∗ −log(2), then P (cid:2) (cid:107)W(cid:48)ε(cid:107)2 >S (cid:3) ≤ e−ι∗. For large enough ζ˜∗, log(2) is negligible. 2 ζ∗ Similarly, S in P (cid:2) (cid:107)Γ(cid:48)W(cid:48)ε(cid:107)2 >S (cid:3) ≤e−ι can be found as the following. ζ 2 ζ A straightforward calculation derives that (cid:32) (cid:33) (cid:88) (cid:88) Γ(cid:48)W(cid:48)WΓ=diag W(cid:48)W ,..., W(cid:48)W . i i i i i∈G1 i∈GG It follows that, with probability 1−(cid:15), (cid:13) (cid:13) (cid:13) (cid:13) (cid:107)Γ(cid:48)W(cid:48)WΓ(cid:107) ∞ = 1≤ m g a ≤ x G (cid:13) (cid:13) (cid:13) (cid:13)i (cid:88) ∈Gg W i (cid:48)W i (cid:13) (cid:13) (cid:13) (cid:13) ∞ ≤ 1≤ m g a ≤ x G i (cid:88) ∈Gg (cid:107)W i (cid:48)W i (cid:107) ∞ ≤g max 1≤ su i≤ p n (cid:107)W i (cid:48)W i (cid:107) ∞ √ ≤g mM˜ TB , max q,m and therefore, (cid:107)Γ(cid:48)W(cid:48)WΓ(cid:107) ≤ (cid:112) Gp(cid:107)Γ(cid:48)W(cid:48)WΓ(cid:107) ≤g mM˜(cid:112) GpTB . 2 ∞ max q,m √ For any (cid:15) > 0, there exists some M = g mM˜ TB , such that P[(cid:107)WΓΓ(cid:48)W(cid:48)(cid:107) > M] ≤ (cid:15), max q,m 2 then P (cid:2) (cid:107)Γ(cid:48)W(cid:48)ε(cid:107)2 2 >S ζ (cid:12) (cid:12)W,(cid:107)WΓΓ(cid:48)W(cid:48)(cid:107)2 2 ≤M (cid:3) ≤e−ζ, 1−(cid:15)<P((cid:107)WΓΓ(cid:48)W(cid:48)(cid:107) 2 ≤M)≤1, P (cid:2) (cid:107)W(cid:48)ε(cid:107) 2 >S ζ (cid:12) (cid:12)W,(cid:107)WΓΓ(cid:48)W(cid:48)(cid:107) 2 >M (cid:3) ≤1, P((cid:107)WΓΓ(cid:48)W(cid:48)(cid:107) 2 >M)≤(cid:15). √ Therefore, P (cid:2) (cid:107)Γ(cid:48)W(cid:48)ε(cid:107)2 >S (cid:3) ≤ e−ζ +(cid:15) where S = 2c˜(Gp+2 Gpζ +2ζ)M. Similarly, take 2 ζ ζ ι=min{ζ,−log((cid:15))}−log(2), then P (cid:2) (cid:107)Γ(cid:48)W(cid:48)ε(cid:107)2 >S (cid:3) ≤e−ι. 2 ι B.2 Convergence of the Oracle Estimator Theorem 1 and Corollary 1 are proved in this section. 14

Proof of Theorem 1. The definition of Γ and y=Wγor+ε lead to γˆor−γ0 = Γ(Γ(cid:48)W(cid:48)WΓ)−1Γ(cid:48)W(cid:48)ε   (cid:80) W(cid:48)ε  i∈G1 i i  = Γ (cid:8) diag (cid:0)(cid:80) i∈G1 W i (cid:48)W i ,..., (cid:80) i∈GG W i (cid:48)W i (cid:1)(cid:9)−1   . . .    ,   (cid:80) W(cid:48)ε i∈GG i i where for any g ∈{1,...,G},     (cid:80) Z(cid:48)Z ( (cid:80) Z(cid:48)X )M(cid:48) (cid:80) Z(cid:48)ε (cid:88) W(cid:48)W = i∈Gg i i i∈Gg i i  and (cid:88) W(cid:48)ε = i∈Gg i i . i i   i i   i∈Gg M( (cid:80) i∈Gg X i (cid:48)Z i ) M( (cid:80) i∈Gg X i (cid:48)X i )M(cid:48) i∈Gg M( (cid:80) i∈Gg X i (cid:48)ε i ) Assumption 2 implies that λ (Γ(cid:48)W(cid:48)WΓ)≥cg T, min min so that (cid:112) (cid:112) (cid:107)(Γ(cid:48)W(cid:48)WΓ)−1(cid:107) ≤ Gp(cid:107)(Γ(cid:48)W(cid:48)WΓ)−1(cid:107) ≤ Gp(cg T)−1. (B.2) ∞ 2 min Forallp-norms,(cid:107)A⊗B(cid:107)=(cid:107)A(cid:107)(cid:107)B(cid:107)holds(forexample,seep. 433ofLangvilleandStewart[2004]), (cid:107)Γ(cid:107) ≤(cid:107)Π(cid:107) (cid:107)I (cid:107) =1. (B.3) ∞ ∞ p ∞ Lemma 3, equations (B.2) and (B.3), and the triangle inequality imply that for any ι>0, (cid:107)γor−γ0(cid:107) ≤(cid:107)Γ(cid:107) (cid:107)(Γ(cid:48)W(cid:48)WΓ)−1(cid:107) (cid:107)Γ(cid:48)W(cid:48)ε(cid:107) (cid:98) ∞ ∞ ∞ ∞ ≤(Gp)1/2(cg T)−1(cid:107)Γ(cid:48)W(cid:48)ε(cid:107) ≤(Gp)1/2(cg T)−1S1/2, min 2 min ζ with probability at least 1−eι. Therefore, √ φ := 2c˜(mM˜g max )1/2(Gp)3/4 B1/2(Gp+2 (cid:112) Gp (cid:112) ζ+2ζ)1/2, (B.4) n,T,G,ζ c g T3/4 q,m min 15

where B is defined in Lemma 3. Therefore, with probability at least 1−e−ι, q,m (cid:107)γor−γ0(cid:107) ≤φ . (cid:98) ∞ n,T,G,ζ This proves the first part of Theorem 1. The remaining proof is for the asymptotic normality of γor. Let V = W (Π ⊗I ) be a T ×Gp matrix, where Π is the i-th row of the matrix Π, (cid:98) i i i· p i· V =WΓ=(V(cid:48),···,V(cid:48))(cid:48). Then, for any c ∈RGp with (cid:107)c (cid:107) =1, 1 n n n 2 n n T (cid:88) (cid:88) (cid:88) c(cid:48) (γˆor−γ0)= c(cid:48) (V(cid:48)V)−1V(cid:48)ε = c(cid:48) (V(cid:48)V)−1 v(cid:48) ε . n n i i n it it i=1 i=1 t=1 Since {ε } is assumed to be an i.i.d. subgaussian distributed sequence with mean 0 and variance i proxy 2c˜, then E(ε )=0. Hence, i E (cid:2) c(cid:48) (γˆor−γ0) (cid:3) =0. n Suppose that Assumption 2 and 4 hold where λ (V(cid:48)V)=λ (Γ(cid:48)W(cid:48)WΓ)≤c∗|G |T ≤c∗g T max max g max and Var(ε )=O(2c˜), then it Var(ε ) σ2 :=Var[c(cid:48) (γˆor−γ0)]≥ it . γ n c∗g T max Moreover, for any (cid:15)>0, applying Cauchy-Schwarz inequality, n (cid:88) E (cid:0) (c(cid:48) (V(cid:48)V)−1V(cid:48)ε )21{|c(cid:48) (V(cid:48)V)−1V ε |>(cid:15)σ } (cid:1) n i i n i i γ i=1 n ≤ (cid:88)(cid:8) E(c(cid:48) (V(cid:48)V)−1V(cid:48)ε )4(cid:9)1/2(cid:8) E (cid:0)1{|c(cid:48) (V(cid:48)V)−1V(cid:48)ε |>(cid:15)σ }2(cid:1)(cid:9)1/2 n i i n i i γ i=1 (B.5) n = (cid:88)(cid:8) E(c(cid:48) (V(cid:48)V)−1V(cid:48)ε )4(cid:9)1/2(cid:8) E (cid:0)1{|c(cid:48) (V(cid:48)V)−1V(cid:48)ε |>(cid:15)σ } (cid:1)(cid:9)1/2 n i i n i i γ i=1 n = (cid:88)(cid:8) E(c(cid:48) (V(cid:48)V)−1V(cid:48)ε )4(cid:9)1/2(cid:8) P(|c(cid:48) (V(cid:48)V)−1V(cid:48)ε |>(cid:15)σ ) (cid:9)1/2 . n i i n i i γ i=1 16

The first term can be derived as (cid:2) E(c(cid:48) (V(cid:48)V)−1V(cid:48)ε )4(cid:3)1/2 = (cid:2) E(c(cid:48) (V(cid:48)V)−1V(cid:48)ε ε(cid:48)V(cid:48)(V(cid:48)V)−1c )2(cid:3)1/2 n i i n i i i i n = (cid:2) {c(cid:48) (V(cid:48)V)−1V }2E(ε ε(cid:48))2{V(cid:48)(V(cid:48)V)−1c (cid:9)2 ]1/2 n i i i i n =c(cid:48) (V(cid:48)V)−1V [E(ε ε(cid:48))2]1/2V(cid:48)(V(cid:48)V)−1c n i i i i n ≤(cid:107)c(cid:48) n (V(cid:48)V)−1V i (cid:107)2 2 (cid:13) (cid:13)E(ε i ε(cid:48) i )2(cid:13) (cid:13) 1 2 /2 . √ For any n×n matrix A, (cid:107)A(cid:107) ≤ n(cid:107)A(cid:107) . Since E(εk)≤(2σ2)k/2kΓ(k/2) for k ≥1, then 2 ∞ it √ √ (cid:32) T T (cid:33) √ (cid:13) (cid:13)E(ε i ε(cid:48) i )2(cid:13) (cid:13) 2 ≤ T (cid:13) (cid:13)E(ε i ε(cid:48) i )2(cid:13) (cid:13) ∞ = T τ= m 1, a ·· x ·,T E ε iτ (cid:88) ε it (cid:88) ε2 it ≤ T(16+T)4c˜2. t=1 t=1 According to Assumption 2, (cid:107)V (cid:107) is bounded and let the upper bound be some constant c , then i ∞ 2 √ (cid:107)V (cid:107) ≤ Gpc . Following 2, (cid:107)(V(cid:48)V)−1(cid:107) ≥(cg T)−1, i 2 2 2 min (cid:8) E(c(cid:48) (V(cid:48)V)−1V ε )4(cid:9)1/2 ≤(cid:107)c(cid:48) (cid:107)2(cid:107)(V(cid:48)V)−1(cid:107)2(cid:107)V (cid:107)2T1/4(16+T)1/22c˜2 n i it n 2 2 i 2 c2Gp(16+T)1/22c˜ ≤ 2 . c2g2 T3/4 min Then, by Chebyshev’s inequality, the second term of (B.5) can be derived as E[c(cid:48) (V(cid:48)V)−1V ε ]2 P(|c(cid:48) (V(cid:48)V)−1V ε |>(cid:15)σ )≤ n i i , (B.6) n i i γ (cid:15)2σ2 γ where E(c(cid:48) (V(cid:48)V)−1V ε )2 =E(c(cid:48) (V(cid:48)V)−1V ε ε(cid:48)V(cid:48)(V(cid:48)V)−1c ) n i i n i i i i n c2Gp2c˜ ≤(cid:107)c (cid:107)2(cid:107)(V(cid:48)V)−1(cid:107)2(cid:107)V (cid:107)2(cid:107)E(ε ε(cid:48))(cid:107) ≤ 2 , n 2 2 i 2 i i 2 c2g2 T2 min then, (B.6) becomes c2Gp2c˜ P(|c(cid:48) (V(cid:48)V)−1V ε |>(cid:15)σ )≤ 2 . n i i γ c2g2 T2(cid:15)2σ2 min γ 17

Therefore, the following inequality can be derived. n σ−2 (cid:88) E (cid:0) (c(cid:48) (V(cid:48)V)−1V ε )21{|c(cid:48) (V(cid:48)V)−1V ε |>(cid:15)σ } (cid:1) γ n i i n i i γ i=1 √ ≤σ−2 (cid:88) n c2 2 Gp(16+T)1/22c˜c 2 (Gp)1/2 2c˜ = c3 2 p3/2(2c˜)3/2G3/2(16+T)1/2n γ c2g2 T3/4 cg T(cid:15)σ c3(cid:15)g3 T7/4σ3 i=1 min min ϕ min γ (B.7) (2c˜)3/2(n/g )3/2n(16+T)1/2 c˜3n5/2(16+T)1/2 ≤C min =C σ γ 3g m 3 in T7/4 σ ϕ 3g m 9/ i 2 n T7/4 (cid:32) (cid:33) n5/2(16+T)1/2c∗3/2g3/2T3/2 g3/2n5/2T1/4 max max =C =O . g9/2T7/4 g9/2 min min g3 Suppose that min (cid:29)n5/3T1/6, then (B.7) further implies that g max n σ−2 (cid:88) E (cid:0) (c(cid:48) (V(cid:48)V)−1V ε )21{|c(cid:48) (V(cid:48)V)−1V ε |>(cid:15)σ } (cid:1) =O(1). γ n i i n i i γ i=1 By the Lindeberg-Feller Central Limit Theorem, c(cid:48) (γˆor−γ0)→N(0,σ2). n γ Proof of Corollary 1. In the following proof, let m and q be fixed for simplification. It further √ indicates that p is fixed. Let C = 2c˜m1/2p3/4B1/2, (B.4) can be simplified as q,m c q,m φ =C g m 1/ a 2 x G3/4 (Gp+2 (cid:112) Gp (cid:112) ζ+2ζ)1/2. n,T,G q,m g T3/4 min The rest of the proof suggests a large enough ζ for each situation that allows φ and ι to n,T,G,ζ approach infinity. We often use these somewhat trivial inequalities g ≤ n and G ≤ n/g in max min the following proofs, particularly when n→∞. 1. Consider T → ∞ with n fixed. Let ζ → ∞ and ζ = o(T3/2). Since G ≤ n (cid:28) ζ, then √ √ (Gp+2 Gp ζ+2ζ)1/2 =O(2ζ1/2). Therefore, φ =C T−3/4O(ζ1/2) T − → → ∞ 0, n,T,G 1 where C 1 =2C q,m g m 1/ g a 2 x m G in 3/4 , which is free of T. 18

2. Consider n→∞ with T fixed. (a) Consider G(cid:28)ζ →∞. √ √ i. When G is fixed, then (Gp+2 Gp ζ +2ζ)1/2 = O(2ζ1/2). For some constant α˜ 0 <1/2, let g min =O(n1/2+α˜0), ζ =o(n2α˜0) and ζ →∞, then n1/2 φ ≤C O(ζ1/2) n − → → ∞ 0, n,T,G 3g min where C =2C G3/4, which is free of n. 3 q,mT3/4 ii. When G → ∞, for some constant α˜ 2 < 2/7, let g min = O(n5/7+α˜2), ζ = o(n7α˜2/2) √ √ andζ →∞,then(Gp+2 Gpζ+2ζ)1/2 =O((p+2 p+2)1/2ζ1/2). SinceG≤n/g , min then n1/2G3/4 n5/4 φ ≤C O(ζ1/2)≤C O(ζ1/2) n, − G → →∞ 0, n,T,G 4 g min 4 g7/4 min √ where C =C 1 (p+2 p+2)1/2, which is free of n and G. 4 q,mT3/4 (b) Consider G → ∞. Let g min = O(n7/9+α˜1) for some α˜ 1 < 2/9, ζ = O(G) and ζ → ∞, √ √ √ then Gp+2 Gp ζ+2ζ =O((p+2 p+2)G)=O(G). Therefore, n1/2G3/4 φ ≤C O(G1/2) n − → → ∞ 0, n,T,G 2 g min √ where C =C 1 (p+2 p+2)1/2, which is free of n. 2 q,mT3/4 3. Consider T,n→∞. (a) Consider G(cid:28)ζ →∞, √ i. When G is fixed, then (Gp+2 Gpζ +2ζ)1/2 = O(2ζ1/2). Let g min = O(n1/2+α˜0) for some positive constant α˜ 0 <1/2 and ζ =o(n2α˜0T3/2), ζ →∞, then n1/2 φ ≤C O(ζ1/2) n, − T → →∞ 0, n,T,G 6 g T3/4 min where C =2C G3/4. 6 q,m 19

ii. When G → ∞, for some positive constant α˜ 2 < 2/7, let g min = O(n5/7+α˜2) and √ G≤n/g min , ζ =o(n7α˜2/2T3/2) and ζ →∞, then (Gp+2 Gpζ +2ζ)1/2 =O((p+ √ 2 p+2)1/2ζ1/2). Since G≤n/g , then min n1/2G3/4 n5/4 φ ≤C O(ζ1/2)≤C O(ζ1/2) n,T − ,G → →∞ 0, n,T,G 7 g min T3/4 7 g7/4T3/4 min √ where C =C (p+2 p+2)1/2, which is freen of n,T and G. 7 q,m (b) Consider G → ∞. Let g min = O(n7/9+α˜1) for some constant α˜ 1 < 2/9, ζ = O(G) and √ √ ζ →∞, then Gp+2 Gpζ+2ζ =O((p+2 p+2)G)=O(G). Since G≤n/g , min n1/2G3/4 n7/4 φ ≤C O(G1/2)≤C O(1) n,T − ,G → →∞ 0, n,T,G 5 g min T3/4 5 g9/4T3/4 min √ where C =C (p+2 p+2p)1/2, which is free from n,T and G. 5 q,m Combining items 2 and 3 above, we can summarize the choice of ζ as follows: Case 1. The number n is fixed. Let ζ =o(T3/2) as T →∞; Case 2. The number n→∞. Whether T is fixed or T →∞, (a) when G is fixed, and g min = O(n1/2+α˜4) for some constant α˜ 4 < 1/2. Let ζ = o(n2α˜4T3/2) approaching infinity; (b) when G→∞, i. supposeg min =O(n7/9+α˜3)forsomeconstantα˜ 3 <2/9. Letζ =O(G)approaching infinity; ii. supposeg min =O(n5/7+α˜5)forsomeconstantα˜ 5 <2/7. Letζ =o(n7α˜5/2T3/2)(cid:29)G approaching infinity. B.3 Convergence of the Calculated Estimator (G ≥ 2) Proof of Theorem 2. This can be done similarly to the proof of Theorem 4.2 in Ma and Huang [2016]. 20

Define M := {γ ∈ Rnp : γ = γ ,∀i,j ∈ G ,g = 1,···,G} and the scaled penalty function G i j g as ρ˜ ((cid:107)γ −γ (cid:107))=λ−1ρ ((cid:107)γ −γ (cid:107),λ ). Let the least-squares objective function and the penalty θ i j 1 θ i j 1 function be 1 (cid:88) L(γ)= (cid:107)y−Wγ(cid:107)2, P(γ)=λ ρ˜ ((cid:107)γ −γ (cid:107) ) 2 2 1 θ i j 2 i<j (B.8) 1 (cid:88) LG(ϕ)= (cid:107)y−WΓϕ(cid:107)2, PG(ϕ)=λ |G (cid:107)G |ρ˜ ((cid:107)ϕ −ϕ (cid:107) ). 2 2 1 g g(cid:48) θ g g(cid:48) 2 g<g(cid:48) Let Q γ)=L(γ)+P(γ), Q γ)G(ϕ)=LG(ϕ)+PG(ϕ) and define ( ( (cid:5) F : M → RGp. The g-th vector component of F(γ) equals to the common value of γ for G i i∈G . g (cid:5) F∗ : Rnp → RGp. F∗(γ) = {|G |−1(cid:80) γ(cid:48),g = 1,···,G}(cid:48), which implies the average of g i∈Gg i each cluster vectors. It results in that F(γ)=F∗(γ) if γ ∈M . Hence, for every γ ∈M , P(γ)=PG(F(γ)), and for G G every ϕ∈RGp, P(F−1(ϕ))=PG(ϕ). Hence, Q(γ)=QG(F(γ)), QG(ϕ)=Q(F−1(ϕ)). (B.9) Theorem 1 results in that for some ι>0, P(sup(cid:107)γor−γ0(cid:107) ≤psup(cid:107)γor−γ0(cid:107) =p(cid:107)γor−γ0(cid:107) ≤pφ )≥1−eι, (cid:98)i i 2 (cid:98)i i ∞ (cid:98) ∞ n,T,G,ζ i i there exists an event E in which sup (cid:107)γor −γ0(cid:107) ≤ pφ = φ˜ , such that P(EC) ≤ e−ι. 1 i (cid:98)i i 2 n,T,G n,T,G 1 Consider the neighborhood of the true parameter γ0, Θ:={γ ∈Rnp :sup(cid:107)γ −γ0(cid:107) ≤φ˜ }. i i 2 n,T,G i It implies that γor ∈ Θ on the event E . For any γ ∈ Rnp, let γ∗ = F−1(F∗(γ)), then γ∗ = (cid:98) 1 i 1 (cid:80) γ which implies that γ∗ is a vector with duplicated group average of γ . Through two |Gg| i∈Gg i i steps as the following, the statement can be proved that with probability approximating to 1, γor (cid:98) 21

is a strictly local minimizer of Q(γ). i. In E , Q(γ∗)>Q(γor) for any γ ∈Θ and γ∗ (cid:54)=γor. This indicates that the oracle estimator 1 (cid:98) (cid:98) γor is the minimizer over all duplicated group average γ∗. (cid:98) ii. There exists an event E such that for large enough ι∗, P(EC) ≤ e−ι∗. In E ∩E , there 2 2 1 2 exists a neighborhood Θ of γor such that Q(γ)≥Q(γ∗) for all γ∗ ∈Θ ∩Θ for sufficiently n (cid:98) n large n. It means that for all γ, the duplicated group average γ∗ is the minimizer. Then, it results in Q(γ) > Q(γor) for any γ ∈ Θ ∩Θ and γ (cid:54)= γor in E ∩E . Hence, over (cid:98) n (cid:98) 1 2 E ∩E , for large enough ι and ι∗, γor is a strictly local minimizer of Q(γ) with the probability 1 2 (cid:98) P(E ∩E )≥1−e−ι−e−ι∗. 1 2 First, show PG(F∗(γ))=C for any γ ∈Θ, where C is a constant which does not depend on γ. It implies that when γ is close enough to the true parameter γ0, the penalty term would not affect the objective function with respect to different values of γ. Let F∗(γ)=ϕ. Consider the triangle inequality (cid:107)ϕ −ϕ (cid:107) ≥(cid:107)ϕ0−ϕ0 (cid:107) −2sup (cid:107)ϕ −ϕ0(cid:107) . Since γ ∈Θ, then g g(cid:48) 2 g g(cid:48) 2 g g g 2 (cid:13) (cid:13)2 (cid:13) (cid:13) su g p(cid:107)ϕ g −ϕ0 g (cid:107)2 2 =su g p (cid:13) (cid:13) (cid:13) |G g |−1 (cid:88) γ i −ϕ0 g (cid:13) (cid:13) (cid:13) (cid:13) i∈Gg (cid:13) 2 (cid:13) (cid:13)2 (cid:13) (cid:13)2 (cid:13) (cid:13) (cid:13) (cid:13) =su g p (cid:13) (cid:13) (cid:13) |G g |−1 (cid:88) (γ i −γ0 i ) (cid:13) (cid:13) (cid:13) =su g p|G g |−2(cid:13) (cid:13) (cid:13) (cid:88) (γ i −γ0 i ) (cid:13) (cid:13) (cid:13) (B.10) (cid:13) i∈Gg (cid:13) 2 (cid:13)i∈Gg (cid:13) 2 ≤|G g |−1sup (cid:88) (cid:13) (cid:13)(γ i −γ0 i ) (cid:13) (cid:13) 2 2 ≤sup (cid:13) (cid:13)(γ i −γ0 i ) (cid:13) (cid:13) 2 2 ≤φ˜2 n,T,G , g i i∈Gg Since b :=min (cid:107)ϕ0−ϕ0 (cid:107), then for all g (cid:54)=g(cid:48) and b >aλ+2φ˜ , n,T,G g(cid:54)=g(cid:48) g g(cid:48) n,T,G n,T,G (cid:107)ϕ0−ϕ0 (cid:107) ≥(cid:107)ϕ0−ϕ0 (cid:107) −2sup(cid:107)ϕ −ϕ0(cid:107) ≥b −2φ˜ >aλ , g g(cid:48) 2 g g(cid:48) 2 g g 2 n,T,G n,T,G 1 g forsomea>0. ThenbyAssumption6, ρ((cid:107)ϕ −ϕ (cid:107) )isaconstant, andfurthermore, PG(F∗(ϕ)) g g(cid:48) 2 is a constant. Therefore, PG(F∗(γ)) = C, and QG(F∗(γ)) = LG(T∗(γ))+C for all γ ∈ Θ. Since ϕor is the unique global minimizer of LG(ϕ), then LG(T∗(γ)) > LG(ϕor) for all T∗(γ) (cid:54)= ϕor and (cid:98) n (cid:98) (cid:98) hence QG(T∗(γ))>QG(ϕor) for all T∗(γ)(cid:54)=ϕor. By the property of the clustering algorithm, for (cid:98) (cid:98) 22

the g-th group, ϕor =|G |−1(cid:80) γor, which implies that, along with the definition of operation (cid:98)g g i∈Gg (cid:98)i F, ϕor equals to the g-th component of F(γor) for all i≤g ≤G. Then, by (B.9), (cid:98)g (cid:98) QG(ϕor)=QG(T(γor))=Q(γor). (cid:98) (cid:98) (cid:98) Furthermore, QG(T∗(γ))=Q(T−1(T∗(γ)))=Q(γ∗). Therefore, Q(γ∗)>Q(γor) for all γ∗ (cid:54)=γor. n (cid:98) (cid:98) Second, for a positive sequence r , let Θ :={γ :sup (cid:107)γ −γor(cid:107) ≤r }. For any γ ∈Θ ∩Θ, n n i i i (cid:98)i 2 n n by the first order Taylor’s expansion, dQ(γm) dL(γm) (cid:88) n ∂P(γm) Q(γ)−Q(γ∗)= (γ−γ∗)= (γ−γ∗)+ (γ−γ∗), dγ(cid:48) dγ(cid:48) ∂γ(cid:48) i=1 i dL(γm) ∂P(γm) and let S = (γ−γ∗) and S = (cid:80)n (γ −γ∗). Since 1 dγ(cid:48) i 2 i=1 ∂γ(cid:48) i i i i dL(γ) 1 = (−2y(cid:48)W +2γ(cid:48)W(cid:48)W)=−(y(cid:48)−γ(cid:48)W)W and γ 2 i n ∂P(γ) (cid:88) 1 =λ ρ˜(cid:48)((cid:107)γ −γ (cid:107) ) 2(γ −γ ) ∂γ 1 θ i j 2 2(cid:107)γ −γ (cid:107) i j i i=1 i j 2 =λ (cid:88) n ρ˜(cid:48)((cid:107)γ −γ (cid:107) ) γ i −γ j , 1 θ i j 2 (cid:107)γ −γ (cid:107) i=1 i j 2 we have (cid:88) n ∂P(γm) S =−(y(cid:48)−γm(cid:48)W)W(γ −γ∗) and S = (γ −γ∗). 1 2 ∂γ(cid:48) i i i=1 i 23

Let γm =ϑγ+(1−ϑ)γ∗ for some constant ϑ∈(0,1). Then, (cid:88) S =λ ρ˜(cid:48)((cid:107)γm−γm(cid:107) )(cid:107)γm−γm(cid:107)−1(γm−γm)(cid:48)(γ −γ∗) 2 1 θ i j 2 i j 2 i j i i i<j (cid:88) +λ ρ˜(cid:48)((cid:107)γm−γm(cid:107) )(cid:107)γm−γm(cid:107)−1(γm−γm)(cid:48)(γ −γ∗) 1 θ i j 2 i j 2 i j i i i>j (cid:88) =λ 1 ρ˜(cid:48) θ ((cid:107)γm i −γm j (cid:107) 2 )(cid:107)γm i −γm j (cid:107)− 2 1(γm i −γm j )(cid:48)(γ i −γ∗ i ) (B.11) i<j (cid:88) +λ ρ˜(cid:48)((cid:107)γm−γm(cid:107) )(cid:107)γm−γm(cid:107)−1(γm−γm)(cid:48)(γ −γ∗) 1 θ j i 2 j i 2 j i j j i<j (cid:88) =λ ρ˜(cid:48)((cid:107)γm−γm(cid:107) )(cid:107)γm−γm(cid:107)−1(γm−γm)(cid:48)[(γ −γ∗)−(γ −γ∗)]. 1 θ i j 2 i j 2 i j i i j j i<j Consider separating S into two parts, i,j ∈ G , and i ∈ G , j ∈ G for g (cid:54)= g(cid:48). When i,j ∈ G , 2 g g g(cid:48) g since γ∗ =T−1(T∗(γ))∈M , then γ∗ =γ∗. Thus, the RHS of (B.11) becomes G i j G (cid:88) (cid:88) S =λ ρ˜(cid:48)((cid:107)γm−γm(cid:107) )(cid:107)γm−γm(cid:107)−1(γm−γm)(cid:48)(γ −γ ) 2 1 θ i j 2 i j 2 i j i j g=1i,j∈G,i<j (cid:88) (cid:88) +λ ρ˜(cid:48)((cid:107)γm−γm(cid:107) )(cid:107)γm−γm(cid:107)−1(γm−γm)(cid:48)[(γ −γ∗)−(γ −γ∗)]. 1 θ i j 2 i j 2 i j i i j j g<g(cid:48)i∈Gg,j∈G g(cid:48) (B.12) Furthermore, by (B.10), for any γ ∈ Θ ∩Θ, F∗(γ) = ϕ, and therefore, for all i ∈ G , γ∗ = ϕ . n g i g This lead to sup(cid:107)γ∗−γ0(cid:107)2 =sup(cid:107)ϕ −ϕ0(cid:107)2 ≤φ˜2 , (B.13) i i 2 g g 2 n,T,G i g where the inequality in (B.13) is obtained by (B.10). Since γm =ϑγ +(1−ϑ)γ∗, by the triangle i i i inequality, sup(cid:107)γm−γ0(cid:107) =sup(cid:107)ϑγ +(1−ϑ)γ∗−γ0(cid:107) i i 2 i i i 2 i i =sup(cid:107)ϑγ +(1−ϑ)γ∗−(ϑ+1−ϑ)γ0(cid:107) i i i 2 i ≤ϑsup(cid:107)γ −γ0(cid:107) +(1−ϑ)sup(cid:107)γ∗−γ0(cid:107) i i 2 i i 2 i i ≤ϑφ˜ +(1−ϑ)φ˜ =φ˜ . n,T,G n,T,G n,T,G 24

Hence, for g (cid:54)=g(cid:48), i∈G , j ∈G , g g(cid:48) (cid:107)γm−γm(cid:107) =(cid:107)γm−γ0−γm+γ0(cid:107) ≥(cid:107)γ0−γ0(cid:107) −2 max (cid:107)γm−γ0(cid:107) i j 2 i i j j 2 i j 2 k k 2 1≤k≤n ≥ min (cid:107)γ0−γ0(cid:107) −2 max (cid:107)γm−γ0(cid:107) ≥b −2φ˜ >aλ . i j 2 k k 2 n,T,G n,T,G 1 i∈Gg,j(cid:48)∈G g(cid:48) 1≤k≤n Since ρ˜ (x) is constant for all x≥aλ , then ρ˜(cid:48)((cid:107)γm−γm(cid:107) )=0. Therefore, following γm−γm = θ 1 θ i j 2 i j ϑ(γ −γ ) for i,j ∈G , (B.12) becomes i j g S =λ (cid:88) G (cid:88) ρ˜(cid:48) θ ((cid:107)γm i −γm j (cid:107) 2 ) (γm−γm)(cid:48)(γ −γ ) 2 1 (cid:107)γm−γm(cid:107) i j i j g=1i,j∈G,i<j i j 2 +λ (cid:88) (cid:88) ρ˜(cid:48) θ ((cid:107)γm i −γm j (cid:107) 2 ) (γm−γm)(cid:48)[(γ −γ∗)−(γ −γ∗)] 1 (cid:107)γm−γm(cid:107) i j i i j j g<g(cid:48)i∈Gg,j∈G g(cid:48) i j 2 =λ (cid:88) G (cid:88) ρ˜(cid:48) θ ((cid:107)γm i −γm j (cid:107) 2 ) (γm−γm)(cid:48)(γ −γ ) 1 (cid:107)γm−γm(cid:107) i j i j g=1i,j∈Gg,i<j i j 2 =λ (cid:88) G (cid:88) ρ˜(cid:48) θ ((cid:107)γm i −γm j (cid:107) 2 ) ϑ(γ −γ )(cid:48)(γ −γ ) 1 (cid:107)ϑ(γ −γ )(cid:107) i j i j g=1i,j∈Gg,i<j i j 2 G (cid:88) (cid:88) =λ ρ˜(cid:48)((cid:107)γm−γm(cid:107) )(cid:107)γ −γ (cid:107) . 1 θ i j 2 i j 2 g=1i,j∈Gg,i<j Furthermore, similarly to (B.10), for all i ∈ G , γ∗ = ϕ , sup (cid:107)γ∗−γor(cid:107)2 = sup (cid:107)ϕ −ϕor(cid:107)2 ≤ g i g i i (cid:98)i 2 g g (cid:98)g 2 sup (cid:107)γ −γor(cid:107)2. Then, since γ∗ =γ∗, i i (cid:98)i 2 i j sup(cid:107)γm−γm(cid:107) =sup(cid:107)γm−γ∗−γm+γ∗(cid:107) i j 2 i i j j 2 i i ≤(cid:107)γ∗−γ∗(cid:107) +2sup(cid:107)γm−γ∗(cid:107) ≤2sup(cid:107)γm−γ∗(cid:107) i j 2 i i 2 i i 2 i i =2sup(cid:107)ϑγ +(1−ϑ)γ∗−γ∗(cid:107) i i i 2 i =2ϑsup(cid:107)γ −γ∗(cid:107) ≤2sup(cid:107)γ −γ∗(cid:107) i i 2 i i 2 i i ≤2(sup(cid:107)γ −γor(cid:107) +sup(cid:107)γ∗−γor(cid:107) ) i (cid:98)i 2 i (cid:98)i 2 i i ≤4sup(cid:107)γ −γor(cid:107) ≤4r . i (cid:98)i 2 n i 25

Hence, ρ˜(cid:48)((cid:107)γm − γm(cid:107) ) ≥ ρ˜(cid:48)(4r ), because ρ(x) is nondecreasing and concave as assumed in θ i j 2 θ n Assumption 3. Then, G (cid:88) (cid:88) S ≥λ ρ˜(cid:48)(4r )(cid:107)γ −γ (cid:107) . (B.14) 2 1 θ n i j 2 g=1i,j∈Gk,i<j Let U =(U(cid:48),···,U(cid:48))(cid:48) =[(y−Wγm)(cid:48)W](cid:48), then 1 n   γ −γ∗ 1 1     S =−U(cid:48)(γ−γ∗)=−(U(cid:48),···,U(cid:48))(cid:48)   γ 2 −γ∗ 2  =− (cid:88) n U(cid:48)(γ −γ∗) 1 1 n  .  i i i   . .   i=1   γ −γ∗ n n   G (cid:88) (cid:88) 1 (cid:88) =− |G | U i (cid:48) |G g |γ i − γ j g g=1i∈Gg j∈Gg =− (cid:88) G (cid:88) 1 U(cid:48) (cid:88) (cid:0) γ −γ (cid:1) =− (cid:88) G (cid:88) U i (cid:48)(γ i −γ j ) |G | i i j |G | g g g=1i∈Gg j∈Gg g=1i,j∈Gg =− (cid:88) G (cid:88) U i (cid:48)(γ i −γ j ) + (cid:88) G (cid:88) U j (cid:48)(γ i −γ j ) 2|G | 2|G | g g g=1i,j∈Gg g=1i,j∈Gg =− (cid:88) G (cid:88) (U j −U i )(cid:48)(γ j −γ i ) 2|G | g g=1i,j∈Gg =− (cid:88) G (cid:88) (U j −U i )(cid:48)(γ j −γ i ) . (B.15) |G | g g=1i,j∈Gg,i<j Inaddition, U =W(cid:48)(y −W γm)=W(cid:48)(W γ0+ε −W γm)=W(cid:48)(ε +W (γ0−γm)),andthen, i i i i i i i i i i i i i i i i sup(cid:107)U (cid:107) ≤sup{(cid:107)W(cid:48)ε (cid:107) +(cid:107)W(cid:48)W (γ0−γm)(cid:107) } i 2 i i 2 i i i i 2 i i √ ≤sup(cid:107)W(cid:48)ε (cid:107) +sup p(cid:107)W(cid:48)W (cid:107) φ˜ i i 2 i i ∞ n,T,G i i ≤sup(cid:107)W(cid:48)ε (cid:107) +m (cid:112) pT(q1/2+m1/2(L+1+2K))φ˜ i i 2 n,T,G i ≤sup √ p(cid:107)W(cid:48)ε (cid:107) +m (cid:112) pT(q1/2+m1/2(L+1+2K))φ˜ i i ∞ n,T,G i ≤ √ p(cid:107)W(cid:48)ε(cid:107) +m (cid:112) pT(q1/2+m1/2(L+1+2K))φ˜ 2 n,T,G = √ p(cid:107)W(cid:48)ε(cid:107) +m (cid:112) pTB φ˜ , 2 q,m n,T,G 26

(cid:104) √ √ √ (cid:105) whereB =q1/2+m1/2(L+1+2K). ByLemma3,P (cid:107)W(cid:48)ε(cid:107)2 >2c˜(np+2 npζ∗+2ζ∗)mM˜ TB p ≤ q,m 2 q,m e−ι∗, where B = (q1/2+m1/2(L+1+2K)), p = q+L+1+2K, M˜ = max(M ,M ,M ,M ) q,m 1 2 3 4 and c˜given in Assumption 2 and 4. ι∗ is defined in Lemma 3. Then, over the event E , 2 (cid:12)(U −U )(cid:48)(γ −γ )(cid:12) (cid:12) (cid:12) (cid:12) j i |G | j i (cid:12) (cid:12) (cid:12) ≤g m −1 in (cid:107)U j −U i (cid:107) 2 (cid:107)γ j −γ i (cid:107) 2 ≤g m −1 in 2sup(cid:107)U i (cid:107) 2 (cid:107)γ i −γ j (cid:107) 2 g i ≤2g−1 T1/4(mp)1/2(cid:107)γ −γ (cid:107) min i j 2 (cid:16) p1/4B˜1/2(np+2 (cid:112) npζ∗+2ζ∗)1/2+T1/4m1/2B φ˜ (cid:17) . (B.16) q,m q,m n,T,G Therefore, by (B.14), (B.15) and (B.16), Q(γ)−Q(γ∗) G (cid:88) (cid:88) ≥ (cid:107)γ −γ (cid:107) i j 2 g=1i,j∈Gg,i<j (cid:110) λ ρ˜(cid:48)(4r )−2g−1 T1/4(mp)1/2(p1/4B˜1/2(np+2 (cid:112) npζ∗+2ζ∗)1/2 1 θ n min q,m (cid:111) +T1/4m1/2B φ˜ ) q,m n,T,G G (cid:88) (cid:88) ≥ (cid:107)γ −γ (cid:107) i j 2 g=1i,j∈Gg,i<j (cid:110) λ ρ˜(cid:48)(4r )−B g−1 T1/4(np+2 (cid:112) npζ∗+2ζ∗)1/2−B g−1 T1/2φ˜ (cid:111) , 1 θ n 1 min 2 min n,T,G where B =2(mpB˜ )1/2p1/4 and B =2mp1/2B . 1 q,m 2 q,m Let r = o(1), then ρ˜(cid:48)(4r ) → 1. Suppose that the following condition is true over the event n θ n E ∩E , 1 2 (cid:112) B g−1 (np+2 npζ∗+2ζ∗)1/2T1/4 →0, B pg−1 T1/2φ →0, (B.17) 1 min 2 min n,T,G then P (Q(γ)−Q(γ∗)≥0)≥1−eι−eι∗. Once (B.17) holds, Q(γ)−Q(γ∗)≥0 with probability approaching to 1 as ι,ι∗ →∞. Note that ζ∗ =ζ∗ can be chosen as any sequence of numbers, as long as ζ∗ →∞ to ensure n,T,G ι∗ →∞. In the following argument, conditions on n, T, G, and other numbers that satisfies (B.17) are spelled out: 27

1. As T →∞ with n fixed, the proposed estimator does not converge to the oracle estimator. 2. As n→∞ with T fixed, if conditions in Corollary 1 are satisfied, the second part of (B.17) is true. It is enough discuss the conditions for first part of (B.17). Choose ζ∗ such that ζ∗ ≤n √ √ and ζ∗ →∞ as n→∞. Let g (cid:29)(p+2 p+2)1/2n1/2. Since (np+2 npζ∗+2ζ∗)1/2 = min √ (p+2 p+2)1/2O(n1/2), (cid:112) √ B g−1 (np+2 npζ∗+2ζ∗)1/2T1/4 ≤B T1/4g−1 (p+2 p+2)1/2O(n1/2)→0. 1 min 1 min 3-1. Let T,n→∞. Consider the first part of (B.17). Choose ζ∗ such that ζ∗ ≤n and ζ∗ →∞ as √ n→∞. Let g (cid:29)(p+2 p+2)1/2n1/2T1/4. Then min (cid:112) √ B g−1 (np+2 npζ∗+2ζ∗)1/2T1/4 ≤B g−1 (p+2 p+2)1/2n1/2T1/4 →0. 1 min 1 min 3-2. Let T,n→∞. Consider the second part of (B.17). (a) Suppose G is fixed. Choose ζ such that ζ =o(n4α˜1T1/2) and ζ →∞ as n,T →∞. Let √ g min =O(n1/4+α˜1)forsomepositiveconstantα˜ 1 <3/4. Then,(Gp+2 Gpζ+2ζ)1/2 = O(2ζ1/2), and n1/2 B pg−1 T1/2φ ≤B pC O(ζ1/2) n, − T → →∞ 0, 2 min n,T,G 2 6 g2 T1/4 min where C =2C G3/4. 6 q,m (b) Suppose G → ∞. Choose ζ such that ζ ≤ G and ζ → ∞ as n,T,G → ∞. Let √ √ n7/13 (cid:28)g <n/G. Then,G(cid:28) T1/13 andGp+2 Gpζ+2ζ ≤(p+2 p+2)G=O(G). T1/13 min n6/13 Further, since G≤n/g , min n1/2G3/4T1/2 B pg−1 T1/2φ ≤B pC O(G1/2) 2 min n,T,G 2 5 g2 T3/4 min n7/4 n,T,G→∞ ≤B pC O(1) −→ 0, 2 5 g13/4T1/4 min √ where C =C (p+2 p+2p)1/2, which is free from n,T and G. 5 q,m 28

(c) Suppose G→∞. Let g min =O(n5/11+α˜7) for a positive constant α˜ 7 <6/11. Choose ζ √ √ such that G(cid:28)ζ and ζ =o(n11α˜7/2T1/2). Then, (Gp+2 Gpζ+2ζ)1/2 =o((p+2 p+ 2)1/2ζ1/2). Since G≤n/g , min n5/4 B pg−1 T1/2φ ≤B pC O(ζ1/2) n,T − ,G → →∞ 0, 2 min n,T,G 2 7 g11/4T1/4 min √ where C =C (p+2 p+2)1/2, which is free of n,T and G. 7 q,m CombiningtheabovecalculationsandtheproofofCorollary1, theconditionsfor(B.17)canbe summarized as follows: √ 1. Suppose n → ∞ with T fixed. Let (p+2 p+2)1/2n1/2 (cid:28) g min = O(n7/9+α˜0) ≤ n/2, then (B.17) holds; 2. Supposen,T →∞andGisfixed. Letg min =O(n1/2+α˜4)forsomeconstantα˜ 4 <1/2. Then, (B.17) holds by choosing ζ and ζ∗ such that ζ = o(min(n1+4α˜4T1/2,n2α˜4T3/2)) approaching infinity and ζ∗ ≤n approaching infinity; 3. Suppose n,T,G→∞. (cid:110) √ (cid:111) (a) Let max T n7 1 / / 1 1 3 3 ,(p+2 p+2)1/2n1/2 (cid:28) g min = O(n7/9+α˜3) for some constant α˜ 3 < 2/9. Then, (B.17) holds by choosing ζ =O(G) and ζ∗ ≤n approaching infinity; (b) Let g min = O(n5/7+α˜5) for some constant α˜ 5 < 2/7. Then, (B.17) holds by choosing ζ =o(min{n10/7+11/2α˜5T1/2,n7α˜5/2T3/2}). References S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends(cid:13)R in Machine learning, 3(1):1–122, 2011. 29

J. Breitung and C. Roling. Forecasting inflation rates using daily data: A nonparametric midas approach. Journal of Forecasting, 34(7):588–603, 2015. D. Hsu, S. Kakade, and T. Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab., 17:1–6, 2012. doi: 10.1214/ECP.v17-2079. A. N. Langville and W. J. Stewart. The kronecker product and stochastic automata networks. Journal of Computational and Applied Mathematics, 167(2):429 – 447, 2004. ISSN 0377-0427. S. Ma and J. Huang. Estimating subgroup-specific treatment effects via concave fusion. arXiv preprint arXiv:1607.03717, 2016. S. Ma and J. Huang. A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association, 112(517):410–423, 2017. L. Su, Z. Shi, and P. C. B. Phillips. Identifying latent structures in panel data. Econometrica, 84 (6):2215–2264, 2016. H. Wang, Z. Shi, and C.-S. Leung. Admm-mcp framework for sparse recovery with global convergence. IEEE Transactions on Signal Processing, Sep 2018, Forthcoming. X. Zhu and A. Qu. Cluster analysis of longitudinal profiles with subgroups. Electronic Journal of Statistics, 12:171–193, 01 2018. 30

Cite this document
APA
Yeonwoo Rho, Yun Liu, & and Hie Joo Ahn (2020). Revealing Cluster Structures Based on Mixed Sampling Frequencies (FEDS 2020-082). Board of Governors of the Federal Reserve System, Finance and Economics Discussion Series. https://whenthefedspeaks.com/doc/feds_2020-082
BibTeX
@techreport{wtfs_feds_2020_082,
  author = {Yeonwoo Rho and Yun Liu and and Hie Joo Ahn},
  title = {Revealing Cluster Structures Based on Mixed Sampling Frequencies},
  type = {Finance and Economics Discussion Series},
  number = {2020-082},
  institution = {Board of Governors of the Federal Reserve System},
  year = {2020},
  url = {https://whenthefedspeaks.com/doc/feds_2020-082},
  abstract = {This paper proposes a new nonparametric mixed data sampling (MIDAS) model and develops a framework to infer clusters in a panel regression with mixed frequency data. The nonparametric MIDAS estimation method is more flexible and substantially simpler to implement than competing approaches. We show that the proposed clustering algorithm successfully recovers true membership in the cross-section, both in theory and in simulations, without requiring prior knowledge of the number of clusters. This methodology is applied to a mixed-frequency Okun’s law model for state-level data in the U.S. and uncovers four meaningful clusters based on the dynamic features of state-level labor markets. Accessible materials (.zip)},
}