This article is a quantitative study of stock selection based on machine learning. Cross-sectional data for the sample stocks of the CSI All Share, CSI 300, and CSI 500 indices are obtained from the JoinQuant platform for each trading day from January 1, 2014, to July 31, 2018, and used as the test sample. Stocks that are suspended or hit the limit up or limit down on a given day are excluded. The CSI All Share, CSI 300, and CSI 500 index sample stocks are chosen as the data sample mainly because together they reflect the price performance of large-, mid-, and small-cap companies in the Chinese A-share market.
1.1 Factor Selection and Model Construction#
1.1.1 Factor Selection#
Different combinations of factors have different effects on stocks' excess returns; the key to the success of quantitative investment strategies, and of multi-factor models in particular, therefore lies in factor selection. To explain stock returns effectively, candidate factors must cover enough dimensions while also having reasonable economic significance, and the cost of acquiring the factor data must be taken into account. Among the many studies on quantitative factors, the "101 Formulaic Alphas" published by WorldQuant is particularly well known. After testing factor correlations and related properties, this article considers factors from four categories: valuation, capital structure, profitability, and growth, and selects the candidate factors shown in Table 1.
Table 1 Factors and Their Descriptions#
No | Major Factor | Factor Name | Factor Description |
---|---|---|---|
1 | Valuation | EP | Earnings yield, the reciprocal of the price-to-earnings ratio (TTM) |
2 | Valuation | BP | Book-to-market ratio, the reciprocal of the price-to-book ratio |
3 | Valuation | PS | Price-to-sales ratio |
4 | Valuation | DP | Dividend yield, dividends payable divided by total market value |
5 | Valuation | RD | R&D ratio, research and development expenses divided by total market value |
6 | Valuation | CFP | Cash yield, the reciprocal of the price-to-cash-flow ratio |
7 | Capital Structure | log_NC | Logarithm of net assets |
8 | Capital Structure | LEV | Financial leverage, assets/liabilities |
9 | Capital Structure | CMV | Logarithm of circulating market value |
10 | Capital Structure | FACR | Fixed asset ratio, fixed assets/total assets |
11 | Profitability | NI_p | Net profit margin for positive net profit, the absolute value of (net profit/total operating income) |
12 | Profitability | NI_n | Net profit margin for negative net profit, the absolute value of (net profit/total operating income) |
13 | Profitability | GPM | Gross profit margin |
14 | Profitability | ROE | Return on equity |
15 | Profitability | ROA | Return on assets |
16 | Profitability | OPTP | Proportion of main business |
17 | Growth | PEG | Price-to-earnings growth ratio, calculated as price-to-earnings ratio/(net profit growth rate*100) |
18 | Growth | g | Year-on-year growth rate of operating income |
19 | Growth | G_p | Year-on-year growth rate of net profit |
1.1.2 Model Construction#
At a given point in time, the market value of a stock can be explained by multiple factors. Following the basic idea of multi-factor models, we take the market value (in logarithm) as the sample label in each cross-section and compute the corresponding factor exposures from the factors in Table 1, which serve as the original features of the sample.
The smaller (more negative) the residual from regressing market value on the factors in Table 1, the further the stock's market value lies below its theoretical value, which means the stock is more likely to rise in the future, i.e., its expected return is higher. The model is constructed as follows:
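The model equation itself does not survive in this version of the text. A minimal sketch consistent with the description above (log market value regressed on the Table 1 factor exposures, with the residual as the signal) is given below; the notation is assumed rather than taken from the original.

```latex
% Minimal sketch of the cross-sectional multi-factor model implied by the text
% (all notation assumed, not taken from the original):
%   MV_i      : market value of stock i on the cross-section date
%   X_{i,j}   : exposure of stock i to factor j from Table 1
%   epsilon_i : residual, later used as the stock-selection signal
\ln(MV_i) = \sum_{j=1}^{19} \beta_j X_{i,j} + \varepsilon_i
```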
1.2 Data Preprocessing#
This article obtains the required market value and factor data through the JoinQuant platform.
Before model construction and analysis, to improve data quality, preprocessing is necessary. There are various methods for data preprocessing, including data cleaning, data integration, data transformation, and data reduction. This article mainly conducts the following preprocessing steps.
1.2.1 Handling Missing Values#
For the initial data with missing values, this article uniformly fills them with 0.
1.2.2 Handling Outliers#
Excessively large or small factor values may distort the analysis, especially regressions, so outliers in the factors must be handled. The approach is Winsorization: factor values beyond upper and lower limits are pulled back to those limits, where the limits are set by the criterion used to identify outliers, thereby reducing the influence of extreme values. Three common criteria are the MAD, 3σ, and percentile methods. The MAD criterion works as follows:
- Compute the median of the factor values, x_median.
- Compute each value's absolute deviation from the median, |x_i - x_median|.
- Take the median of these absolute deviations as MAD = median(|x_i - x_median|).
- Choose a parameter n and clip the factor values to the interval [x_median - n·MAD, x_median + n·MAD].
This article uniformly adopts the MAD (n=5) standard for outlier processing. A comparison before and after outlier processing is illustrated using the factor g on the cross-sectional data of July 11, 2018, as shown in Figure 1.
Figure 1 Comparison of Outlier Handling of Factor g (Source: JoinQuant)
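As an illustration, the following is a minimal sketch of MAD winsorization on one cross-section, assuming the factor values are held in a pandas Series; the function name and data layout are illustrative, not taken from the article's source code.

```python
import pandas as pd

def winsorize_mad(factor: pd.Series, n: float = 5.0) -> pd.Series:
    """Clip factor values to [median - n*MAD, median + n*MAD] (MAD criterion)."""
    med = factor.median()                # median of the factor values
    mad = (factor - med).abs().median()  # median absolute deviation
    return factor.clip(lower=med - n * mad, upper=med + n * mad)

# Example: apply to the growth factor g on one cross-section with n = 5
# g_clean = winsorize_mad(g_raw, n=5)
```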
1.2.3 Standardization#
Different factors describe objects measured in different units, so their values can differ greatly in magnitude. For example, if we used unstandardized market value as the factor exposure and company A's market value were 100 times company B's, could we say the market value factor contributes 100 times as much to A's return as to B's? Clearly not. Factors must therefore be standardized before being used in a strategy. Among the many standardization methods, this article adopts the z-score method, illustrated in Figure 2, which shifts the factor values to mean 0 and scales them to standard deviation 1. The processed data become dimensionless, so that different indicators can be compared and regressed together.
Figure 2 Description of z-score Method (Source: Quantitative Investment Training Camp WeChat Official Account)
Using the factor log_NC on the cross-sectional data of July 11, 2018, as an example, the comparison before and after standardization is shown in Figure 3.
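For completeness, a matching sketch of the z-score step under the same pandas assumption:

```python
import pandas as pd

def standardize_zscore(factor: pd.Series) -> pd.Series:
    """Cross-sectional z-score standardization: mean 0, standard deviation 1."""
    return (factor - factor.mean()) / factor.std()
```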
1.2.4 Neutralization#
When factors are used for stock selection, the influence of other variables can bias the selected stocks in undesirable ways. For example, the price-to-book ratio may be highly correlated with market value; if it is used without market-value neutralization, the selected stocks may be concentrated in a particular size range. Similarly, emerging and declining industries tend to have characteristic valuation levels (e.g., price-to-earnings ratios), so industry membership also affects valuation factors and introduces unintended industry tilts. To remove these biases arising from industry and market capitalization, the factors are neutralized, stripping out the influence of other variables when a given factor is used and making the selected stocks more diversified.
Figure 3 Comparison of Standardization Processing of Factor log_NC (Source: JoinQuant)
For factors, market risk (e.g., bull and bear markets) and industry risk (companies in the same industry are similarly affected) are the main considerations for neutralization. There are two ways to handle these:
- Include both market factors and industry factors in the model.
- Only include industry factors while incorporating market factors into the industry factors.
The difference between the first and second methods is that the first method calculates the industry factor's return as the excess return of the industry relative to the market, while the second method calculates the return as the absolute return of the industry. For validating the effectiveness of style factors, there is no difference between the two methods; for regression, the former is a regression with an intercept term, while the latter is a regression that passes through the origin. This article adopts the second method, and the model adjustment is as follows:
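The adjusted equation is likewise missing here. Under the second method (industry factors absorbing the market factor, regression through the origin), a plausible form is the following sketch, with notation again assumed:

```latex
% Sketch of the industry-augmented model under the second approach
% (industry dummies absorb the market factor; regression through the origin):
%   I_{i,k} = 1 if stock i belongs to industry k in Table 2, else 0
\ln(MV_i) = \sum_{j} \beta_j X_{i,j} + \sum_{k=1}^{11} \gamma_k I_{i,k} + \varepsilon_i
```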
If a company belongs to a given industry, the corresponding industry dummy variable is set to 1; otherwise it is 0. This article does not split a company's industry membership proportionally, i.e., each company is assigned to exactly one industry. The industry classification follows JoinQuant's first-level classification, shown in Table 2:
Table 2 Overview of Industry Classification#
Industry Code | Industry Name | Start Date |
---|---|---|
HY001 | Energy | 1999/12/30 |
HY002 | Materials | 1999/12/30 |
HY003 | Industrials | 1999/12/30 |
HY004 | Consumer Discretionary | 1999/12/30 |
HY005 | Consumer Staples | 1999/12/30 |
HY006 | Healthcare | 1999/12/30 |
HY007 | Financials | 1999/12/30 |
HY008 | Information Technology | 1999/12/30 |
HY009 | Telecommunication Services | 1999/12/30 |
HY010 | Utilities | 1999/12/30 |
HY011 | Real Estate | 1999/12/30 |
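As a sketch of the neutralization step, the function below regresses a factor on industry dummies and log circulating market value and keeps the residual; the use of scikit-learn, the column layout, and the function name are assumptions for illustration, and the article's own implementation may differ.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def neutralize(factor: pd.Series, industry: pd.Series, log_cmv: pd.Series) -> pd.Series:
    """Regress a factor on industry dummies and log circulating market value,
    returning the residual as the neutralized factor."""
    dummies = pd.get_dummies(industry)                    # one column per industry code
    X = pd.concat([dummies, log_cmv.rename("log_cmv")], axis=1).astype(float)
    model = LinearRegression(fit_intercept=False).fit(X, factor)  # through the origin
    return factor - model.predict(X)
```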
2.1 Backtesting Parameter Settings#
This article uses historical backtesting for the empirical analysis. Differences in backtesting parameter settings can lead to significant variations in test results, and setting the parameters objectively bears on how faithfully the backtest reflects the strategy's true effectiveness and on the final choice of strategy. The backtesting parameters and global settings are as follows:
(1) Investment Amount
The initial investment amount is assumed to be 1 million yuan.
(2) Backtesting Period
Unless specified otherwise, the default backtesting period is: January 1, 2014—July 31, 2018.
(3) Commission and Stamp Duty
To make the backtesting results more closely resemble real trading costs, commission and stamp duty ratios are set in the empirical analysis.
The two major changes in stamp duty over the past decade include:
- From April 24, 2008, it was adjusted from 3‰ to 1‰.
- From September 19, 2008, it changed from bilateral collection to unilateral collection, with the tax rate remaining at 1‰. The transferor pays the stock transaction stamp duty at a rate of 1‰, and the transferee is no longer charged.
Regarding commissions, due to differences among brokers and clients, commission rates vary. This article has set them separately.
(4) Slippage
Slippage refers to the difference between the order price and the actual transaction price. Since slippage has a minimal impact on the final result, this article sets slippage to 0.
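For reference, the commission, stamp duty, and slippage assumptions above roughly correspond to JoinQuant initialization calls of the following form; set_order_cost and set_slippage are provided by the JoinQuant strategy environment, and the commission rate shown is a placeholder, since the article only states that commissions are set separately.

```python
# Sketch of JoinQuant backtest cost settings (rates are illustrative placeholders).
def initialize(context):
    set_slippage(FixedSlippage(0))                      # slippage set to 0, as stated above
    set_order_cost(OrderCost(open_tax=0,
                             close_tax=0.001,           # 1 per mille stamp duty, seller only
                             open_commission=0.0003,    # placeholder commission rate
                             close_commission=0.0003,
                             min_commission=5),
                   type='stock')
```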
(5) Position Management
At each rebalancing, the remaining funds are divided equally among the stocks to be purchased, and all available funds are invested.
(6) Usable Stock Pool
- Unless otherwise specified, the usable stock pool in this article defaults to the CSI All Share.
- In actual situations, stocks that are suspended on the same day cannot be traded, so stocks that are suspended on that day are excluded from the overall backtesting.
- This article does not buy stocks under a limit down condition and does not sell stocks under a limit up condition.
(7) Benchmark
For the usable stock pool, the daily price of the CSI All Share is selected as the benchmark for judging the effectiveness of the strategy and for calculating a series of risk values.
(8) Test Set Extraction
The factor feature values on the trading day serve as the original features of the sample, and the logarithm of the market value on that day serves as the label of the sample.
(9) Training Set (Including Validation Set) Composition
Taking day T as an example, cross-sections are taken at intervals of 21 trading days. Unless otherwise specified, the default training set (including the validation set) consists of the features and labels from T-63 to T, and 3-fold cross-validation is used.
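One plausible reading of this setup, sketched below, is that cross-sections are sampled every 21 trading days within [T-63, T]; the flat (date, code, factors, log market value) data layout is an assumption for illustration.

```python
import pandas as pd

def build_training_set(panel: pd.DataFrame, trade_days: list, t: int,
                       n_intervals: int = 3, step: int = 21):
    """Collect cross-sections sampled every `step` trading days from T - n_intervals*step to T."""
    days = [trade_days[t - i * step] for i in range(n_intervals, -1, -1)]
    sample = panel[panel["date"].isin(days)]
    X = sample.drop(columns=["date", "code", "log_mcap"])  # factor exposures as features
    y = sample["log_mcap"]                                  # log market value as the label
    return X, y
```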
(10) Machine Learning Algorithm Parameter Settings
Unless otherwise specified, the parameter settings for the machine learning algorithms are as shown in Table 3; fixed parameters are used.
Table 3 Parameter Settings for Various Machine Learning Algorithms#
Artificial Intelligence Algorithm | Parameter Settings |
---|---|
Linear Regression | No parameters |
Ridge Regression | alpha:100 |
SVR | C:100,gamma:1 |
Random Forest | n_estimators:500 |
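Assuming the algorithms are the scikit-learn implementations (the article does not name the library), the fixed parameters in Table 3 correspond roughly to the following constructors:

```python
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

models = {
    "Linear Regression": LinearRegression(),                  # no tunable parameters
    "Ridge Regression": Ridge(alpha=100),
    "SVR": SVR(C=100, gamma=1),
    "Random Forest": RandomForestRegressor(n_estimators=500),
}
```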
(11) Others
- The backtesting uses market orders and assumes that there are no issues with buying or selling.
- The financial report and market value data used are taken from the day before the backtesting date to avoid look-ahead bias (the use of future information).
- Since this article focuses on stock selection strategies, stop-loss and take-profit strategies or timing strategies are not used as auxiliary measures to ensure that parameters outside the stock selection strategy remain consistent.
2.2 Empirical Validation of Model and Factor Effectiveness#
Before conducting an in-depth study of the strategy, the effectiveness of the strategy must first be studied and empirically validated to ensure that subsequent research is conducted under correct premises. The empirical validation method for the effectiveness of the strategy is as follows:
- Using the linear regression algorithm, regress the market value label (in logarithm) on the factor feature values in the model, and take the difference between the actual and predicted values (the residual) as a new factor value (a sketch of this procedure follows the list).
- Sort the stocks in ascending order of this new factor.
- Divide the stocks into 10 groups, rebalancing every 10 days and every 30 days.
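A minimal sketch of this grouping procedure, reusing the scikit-learn assumption from above; the function name and interface are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def decile_groups(X: pd.DataFrame, log_mcap: pd.Series, n_groups: int = 10) -> pd.Series:
    """Regress log market value on the factors, rank by residual, split into groups."""
    model = LinearRegression().fit(X, log_mcap)
    residual = log_mcap - model.predict(X)   # actual minus predicted = new factor
    ranks = residual.rank(method="first")    # ascending: largest downward deviation first
    return pd.qcut(ranks, n_groups, labels=range(1, n_groups + 1))
```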
Based on the above method, the backtesting conditions are shown in Table 4 for grouped backtesting. The results of the grouped backtesting are shown in Table 5, Figure 4, and Figure 5.
Table 4 Backtesting Conditions#
Stock Pool | Rebalancing Period | Algorithm | Number of Holdings | Factor Combination |
---|---|---|---|---|
CSI 300, CSI 500, CSI All Share | 10 days, 30 days | Linear Regression | 10% (grouped) | All Factors |
Table 5 Grouped Backtesting Returns and Rankings#
Figure 4 Excess Returns of Different Groups (CSI All Share, Rebalancing Period: 10 Days)
Figure 5 Backtesting Results of Different Groups (CSI All Share, Rebalancing Period: 10 Days)
From Table 5, Figure 4, and Figure 5, we can see:
- According to the assumed model, the CSI 300 does not yield a significant monotonic strategy return, while the CSI All Share can achieve a relatively clear monotonic strategy return. The CSI 500 is positioned between the CSI 300 and the CSI All Share. This indicates that the strategy is not suitable for investing in the CSI 300 but can be used for investing in the CSI All Share or the CSI 500.
- The CSI All Share not only achieves a relatively clear monotonic strategy return but also obtains a relatively monotonic Sharpe ratio and information ratio.
- The CSI All Share cannot achieve a monotonic maximum drawdown, indicating that the model does not have an effective mechanism for risk control.
Using the CSI All Share as the stock pool, rebalancing every 30 days, the grouped backtesting is conducted using ridge regression, SVR, and random forest algorithms. The backtesting conditions are shown in Table 6, and the backtesting results are shown in Table 7.
Table 6 Backtesting Conditions#
Stock Pool | Rebalancing Period | Algorithm | Number of Holdings | Factor Combination |
---|---|---|---|---|
CSI All Share | 30 days | Ridge Regression, SVR, Random Forest | 10% (grouped) | All Factors |
Table 7 Grouped Backtesting Rankings#
Group | Ridge Regression | Ranking | SVR | Ranking | Random Forest | Ranking |
---|---|---|---|---|---|---|
1 | 111.91% | 2 | 159.14% | 1 | 96.34% | 1 |
2 | 115.73% | 1 | 96.40% | 2 | 62.81% | 3 |
3 | 84.34% | 3 | 79.04% | 3 | 57.74% | 5 |
4 | 69.23% | 4 | 59.84% | 4 | 47.91% | 8 |
5 | 49.12% | 5 | 49.03% | 5 | 54.28% | 7 |
6 | 34.19% | 8 | 22.80% | 8 | 56.70% | 6 |
7 | 31.04% | 9 | 30.92% | 6 | 69.18% | 2 |
8 | 35.06% | 7 | 17.04% | 10 | 64.71% | 4 |
9 | 15.25% | 10 | 30.65% | 7 | 24.80% | 9 |
10 | 37.7% | 6 | 17.44% | 9 | 19.82% | 10 |
This article selects a 30-day rebalancing period for several reasons. First, some of the trading data used here is essentially monthly, so rebalancing at a daily or shorter frequency would be meaningless. Second, the financial data are quarterly, which in principle would call for a quarterly rebalancing period; however, a quarterly period would increase trading costs and could miss many investment opportunities, which may also hurt returns, and since the financial data are used with a lag, a quarterly period is not advisable. Finally, for some indicators, such as price-to-earnings and price-to-book ratios, monthly data work better: they make the model tests more effective and avoid certain extreme values.
From Table 7, in terms of returns, the ridge regression, SVR, and random forest algorithms, like linear regression, can also achieve a reasonably clear monotonic strategy return, with SVR showing the most pronounced trend. The trend for random forest is less prominent: the separation between the top and bottom groups is relatively clear, while the middle groups are harder to distinguish.
2.3 Empirical Comparison of Different Factor Combinations#
All factors are divided into three different combinations, as shown in Table 8.
Table 8 Different Factor Combinations#
No | Factor Name | Combination 1 | Combination 2 | Combination 3 (All Factors) |
---|---|---|---|---|
1 | EP | ● | ● | ● |
2 | BP | | ● | ● |
3 | PS | | | ● |
4 | DP | | | ● |
5 | RD | ● | ● | ● |
6 | CFP | | | ● |
7 | log_NC | ● | ● | ● |
8 | LEV | ● | ● | ● |
9 | CMV | | ● | ● |
10 | FACR | | | ● |
11 | NI_p | ● | ● | ● |
12 | NI_n | ● | ● | ● |
13 | GPM | | ● | ● |
14 | ROE | | ● | ● |
15 | ROA | | | ● |
16 | OPTP | | | ● |
17 | PEG | ● | ● | ● |
18 | g | | | ● |
19 | G_p | | | ● |
Total Count | | 7 | 11 | 19 |
Using different algorithms, the top 50 stocks are selected after sorting. Rebalancing occurs every 30 days. Detailed backtesting conditions are shown in Table 9.
Table 9 Backtesting Conditions#
Stock Pool | Rebalancing Period | Algorithm | Number of Holdings | Factor Combination |
---|---|---|---|---|
CSI All Share | 30 days | Linear Regression, Ridge Regression, SVR, Random Forest | 50 | Combination 1, Combination 2, Combination 3 (All Factors) |
Based on the above method, backtesting is conducted using different algorithms, and the results are shown in Table 10.
Table 10 Backtesting Results of Different Algorithms#
Figure 6 Excess Return Chart of Different Algorithms (Factor Combination 3)
Figure 7 Backtesting Result Indicators of Different Algorithms (Factor Combination 3)
From Table 10, Figure 6, and Figure 7, we can see:
- The strategy that balances return and risk best is the SVR algorithm with the all-factor combination, which achieves a return rate of 243%, a Sharpe ratio of 0.987, and an information ratio of 1.742. These three indicators perform the best among all investment combinations, while its maximum drawdown is on par with or even lower than other combinations.
- In the long term, this investment strategy can outperform the benchmark across different algorithms. Overall, the SVR algorithm shows the best return performance. The lowest return of SVR is still slightly higher than the highest return of ridge regression and linear regression. The random forest follows, while ridge regression and linear regression are comparable.
- As the number of factors increases, the returns of linear regression and ridge regression decrease, while the returns of SVR increase. The performance of random forest is unstable. The performance of the Sharpe ratio and information ratio is similar to that of the returns.
- The maximum drawdown of different factor combinations under different algorithms does not differ significantly, indicating that no algorithm or factor combination has a particular advantage. This suggests that the investment strategy itself has shortcomings in risk control.
- From the scoring of the test set and validation set, the random forest model has the highest fitting degree and prediction accuracy. The fitting degree of SVR is unstable. The fitting degrees of linear regression and ridge regression are basically consistent.
- As the number of factors increases, the scores of linear regression, ridge regression, and random forest increase, indicating that the fitting degree improves with the increase in the number of factors. The scoring of SVR shows no obvious relationship with the number of factors. SVR also has the largest difference between validation set scores and test set scores.
2.4 Empirical Comparison of Different Number of Holdings#
In the foreign quantitative investment community, investors generally hold portfolios of 50 to 60 stocks, with major foreign investment funds commonly holding around 60. The reasoning is that as portfolio size grows, non-systematic risk falls correspondingly until it approaches zero. However, an excessively large portfolio raises costs and lowers returns: by the law of diminishing marginal utility, adding holdings reduces risk quickly when the portfolio is small, but once the portfolio passes a certain size the rate of risk reduction slows, so the portfolio cannot grow without limit. Moreover, as the portfolio grows, its efficiency declines, and the large capital requirement and high transaction costs become prominent. This article therefore compares returns and risks under different numbers of holdings, setting the number of holdings to 5, 10, 30, and 50 based on factor combination 3 (all factors); the detailed backtesting conditions are shown in Table 11 and the results in Table 12.
Table 11 Backtesting Conditions#
Stock Pool | Rebalancing Period | Algorithm | Number of Holdings | Factor Combination | Training Set |
---|---|---|---|---|---|
CSI All Share | 30 days | Linear Regression, Ridge Regression, SVR, Random Forest | 5, 10, 30, 50 | Combination 3 (All Factors) | [T-84, T] |
Table 12 Comparison of Different Numbers of Holdings#
From Table 12, we can see:
- The best performance in terms of returns is the SVR algorithm (with 5 holdings), achieving the highest return rate of 698%, with the Sharpe ratio and information ratio also reaching their peaks at 1.61 and 2.03, respectively. However, the maximum drawdown of this algorithm is also very high, exceeding 50%, and the strategy's volatility is much higher than the benchmark's volatility. If considering that the maximum drawdown should not exceed 50%, the SVR algorithm (with 30 holdings) is the most ideal, achieving a return rate of 263%, generally higher than other situations, with a Sharpe ratio of 1.07, also generally higher than other situations. Under this number of holdings, the maximum drawdown decreases to 44%, close to the average.
- Under the same number of holdings, the ranking of return rates and Sharpe ratios generally maintains the order of SVR > Random Forest > Ridge Regression > Linear Regression.
- As the number of holdings decreases, the return of each algorithm increases, with SVR showing the largest increase. This indicates that the more diversified the holdings, the more the returns are diluted. Volatility behaves in the opposite way: the more concentrated the holdings, the greater the volatility.
- Linear regression and ridge regression show a gradual reduction in maximum drawdown as the number of holdings decreases. Among them, ridge regression achieves the minimum maximum drawdown when the number of holdings is 5, with the maximum profit-loss ratio. In contrast, SVR and random forest show a significant increase in maximum drawdown as the number of holdings decreases.
- Combining the indicators of returns and risks, the optimal number of holdings is 5 when using linear regression and ridge regression algorithms, while the optimal number of holdings is 30 when using SVR and random forest algorithms.
2.5 Empirical Comparison of Different Market Styles#
Different investment strategies are constructed for different market styles. Previous studies have been conducted by many scholars, such as empirical analysis of the cyclical division of the Chinese stock market, momentum and reversal strategies in bull and bear markets, and research on the correlation of bull and bear market cycles. Recently, research applying artificial intelligence and machine learning to market selection has also emerged, such as comparative studies on the adaptability of machine learning to A-shares. Based on the above research, this article sets different backtesting periods according to the switching of different market styles, as shown in Table 13.
Table 13 Backtesting Periods for Different Market Styles#
Period Number | Backtesting Period | Period Length | Market Style |
---|---|---|---|
Period 1 | April 1, 2014—September 30, 2014 | 6 months | Consolidation—Uptrend |
Period 2 | October 1, 2014—April 30, 2015 | 7 months | Uptrend—Uptrend |
Period 3 | March 1, 2015—September 30, 2015 | 7 months | Uptrend—Downtrend |
Period 4 | July 1, 2015—November 30, 2015 | 5 months | Downtrend—Uptrend |
Period 5 | August 1, 2017—March 31, 2018 | 8 months | Consolidation—Downtrend |
Period 6 | October 1, 2017—December 31, 2017 | 3 months | Consolidation—Consolidation |
Period 7 | September 1, 2015—January 31, 2016 | 5 months | Downtrend—Downtrend |
The backtesting conditions are shown in Table 14, and the backtesting results are shown in Table 15.
Table 14 Backtesting Conditions#
Stock Pool | Rebalancing Period | Algorithm | Algorithm Parameters | Number of Holdings | Factor Combination | Backtesting Period |
---|---|---|---|---|---|---|
CSI All Share | 30 days | Linear Regression, Ridge Regression, SVR, Random Forest | Fixed Parameters | 5 | All Factors | Period 1, Period 2, Period 3, Period 4, Period 5, Period 6, Period 7 |
Table 15 Backtesting Results#
To compare the returns and risks of different algorithms more clearly, Table 15 is simplified, as shown in Tables 16 and 17.
Table 16 Annualized Return Rankings#
Table 17 Maximum Drawdown Rankings#
From Tables 16 and 17, we can see:
- In periods without significant market style switches, overall, linear models outperform SVR and random forest algorithms. During long-term consolidation phases, SVR and random forest algorithms are essentially ineffective. Caution should be exercised when using machine learning algorithms for quantitative investment during prolonged consolidation, and other technical analysis theories should be combined for investment decisions. In a continuously declining market environment, the SVR algorithm can also achieve excess returns.
- In periods with significant market style switches, SVR and random forest algorithms generally outperform linear models, with SVR particularly outstanding. However, during the transition from consolidation to decline, SVR and random forest algorithms are essentially ineffective.
- In terms of risk, during periods without significant market style switches, linear models generally outperform SVR and random forest algorithms. During sustained uptrends, SVR and random forest algorithms can essentially catch up with linear models; in other situations, their risks are much higher than those of linear models.
- In periods with significant market style switches, no algorithm demonstrates particularly high risk prediction capability.
- Overall, during periods without significant market style switches, linear models outperform SVR and random forest algorithms; conversely, SVR algorithms are superior (except during the transition from consolidation to decline).
2.6 Empirical Comparison of Different Parameters#
Using the CSI All Share as the usable stock pool and setting the number of holdings to 5, both fixed parameters and grid search (with standard 3-fold cross-validation) methods are used to verify the generalization ability of different parameter models and adjust supervised model parameters to achieve optimal generalization performance. The backtesting conditions are shown in Table 18, algorithm parameters are shown in Table 19, and backtesting results are shown in Table 20.
Table 18 Backtesting Conditions#
Stock Pool | Rebalancing Period | Algorithm | Algorithm Parameters | Number of Holdings | Factor Combination |
---|---|---|---|---|---|
CSI All Share | 30 days | Ridge Regression, SVR, Random Forest | Fixed Parameters, Grid Search | 5 | Combination 1, Combination 2, All Factors |
Table 19 Overview of Algorithm Parameters#
Algorithm | Fixed Parameters | Grid Search |
---|---|---|
Ridge Regression | (Refer to Table 3) | alpha:[1,10,100] |
SVR | (Refer to Table 3) | C:[10,100],gamma:[0.1,1,10] |
Random Forest | (Refer to Table 3) | n_estimators:[100,500,1000] |
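Assuming scikit-learn again, the grid-search variant with standard 3-fold cross-validation might look like the following sketch, with the parameter grids taken from Table 19; X_train and y_train stand for the rolling-window features and labels.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Parameter grid for SVR as listed in Table 19; Ridge and Random Forest are analogous.
param_grid = {"C": [10, 100], "gamma": [0.1, 1, 10]}
search = GridSearchCV(SVR(), param_grid, cv=3)   # standard 3-fold cross-validation
# search.fit(X_train, y_train)
# best_model = search.best_estimator_
```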
Table 20 Returns and Scores of Different Parameters#
From Table 20, we can see:
- Ridge regression shows higher returns with fixed parameter settings than with grid search. SVR and random forest show unstable performance.
- The model fitting degree (test set scores) of ridge regression and random forest under different parameter settings does not show significant improvement with grid search compared to fixed parameters. However, the test set scores of SVR with grid search are significantly better than those with fixed parameters.
- Overall, using grid search can enhance the model fitting degree of SVR. However, using grid search does not lead to a significant increase in returns and may even result in a decrease.
2.7 Empirical Comparison of Different Training Set Lengths#
Adjusting the length of the rolling training set, the backtesting conditions are shown in Table 21, and backtesting is conducted. The backtesting results are shown in Table 22.
Table 21 Backtesting Conditions#
Table 22 Backtesting Results of Different Algorithms#
From Table 22, we can see:
- The best return performance comes from the SVR algorithm with a training set length of 3 (i.e., 3 rolling months of 21 trading days); this strategy also has the best Sharpe ratio and information ratio.
- In terms of returns, a longer training set does not necessarily raise the return rate. The optimal training set length is 3 months for linear regression, ridge regression, and SVR, and 9 months for random forest.
- In terms of risk, increasing the training set length does not necessarily reduce risk. The maximum drawdown of linear regression and ridge regression shows a downward trend as the training set length increases. The performance of SVR and random forest is unstable.
3.1 Research Summary#
This article constructs a stock selection strategy based on four machine learning algorithms, mainly using the constituent stocks of the CSI All Share as the stock pool. By selecting stocks with investment value through machine learning algorithms, the expectation is that this portfolio can achieve stable excess returns over a period in the future. This research enriches the construction methods of stock selection strategies and provides some references for how machine learning can be applied to investment decisions.
This article selects the factor cross-sectional data of the CSI All Share constituent stocks during the trading days from October 2013 to July 2018, with the data from October 2013 to June 2018 serving as the training set (including the validation set) and the data from January 2014 to July 2018 serving as the test set. The empirical research process is mainly divided into four parts: factor selection, strategy construction, data preprocessing, and empirical analysis.
The quantitative stock selection model constructed in this article achieves a cumulative return rate of up to 698% (SVR algorithm, with 5 holdings) between January 2014 and July 2018, with an annualized return rate of 59%, far exceeding the performance of the benchmark (CSI All Share) during the same period (return rate: 42%), indicating that the strategy has a good stock selection effect.
Through grouped backtesting comparative analysis, it can be found that under the CSI All Share, the strategy performance shows a significant decreasing trend with changes in quantile groups, indicating that the model can effectively distinguish between strong and weak stocks when investing in CSI All Share constituent stocks.
Comparative analysis with linear models (linear regression and ridge regression) shows that the investment strategy in this article can continuously adapt to changes in market conditions through non-linear models (SVR and random forest), better uncovering stocks with excess returns. Non-linear models perform well in terms of return rates, Sharpe ratios, and information ratios, and the investment strategy based on non-linear models can stably outperform linear models; from the perspective of drawdown, non-linear models do not have a significant advantage over linear models, and sometimes the drawdown is on par with or even greater than that of linear models.
Compared to SVR, although random forest can achieve higher excess returns in certain cases, overall, the return rate of random forest is lower than that of SVR. However, under the factor conditions in this article, the predictive ability of random forest is undoubtedly superior to that of SVR.
The regularization advantage of ridge regression does not translate into a significant improvement over linear regression in this strategy. Two explanations are plausible. First, as the amount of training data available to the models grows, both models improve and linear regression eventually catches up with ridge regression; this is consistent with the view that with enough training data, regularization matters less and ridge regression and linear regression perform similarly. Second, the preprocessing steps of removing extreme values and standardizing reduce the multicollinearity among factors and make extreme samples less likely, further diminishing the value of regularization. In this article's strategy, therefore, regularization does not noticeably help indicators such as return rates.
As the number of factors increases, regardless of the algorithm, fitting quality and return rates move in opposite directions. This inverse relationship can be explained as follows: since the quantitative investment strategy targets the stocks that deviate most below their predicted values, an algorithm with a lower fitting degree leaves larger gaps between actual and predicted values, making such deviating stocks easier to identify and buy.
Different factor combinations and different numbers of holdings have a certain impact on strategy performance. The linear model shows a decrease in return rates as the number of factors increases. Non-linear models exhibit greater return capabilities as the number of factors increases. Fewer holdings lead to higher return rates, but the associated risks also increase.
In this article, we classify market styles into 9 categories and empirically test 7 different market styles, namely sustained consolidation, sustained uptrend, sustained downtrend, consolidation to uptrend switch, consolidation to downtrend switch, uptrend to downtrend switch, and downtrend to uptrend switch. Each market style tests the performance of artificial intelligence algorithms regarding return rates and maximum drawdowns. The test results show that during periods without significant market style switches, linear model algorithms outperform SVR and random forest algorithms; conversely, SVR performs best, and its returns far exceed those of other algorithms.
Finally, regarding algorithm optimization, whether using grid search to optimize algorithm parameters or increasing the length of the rolling training set to optimize model generalization, neither leads to a significant increase in return rates and may even cause declines. The return rate of the SVR algorithm is highly sensitive to whether grid search is used or the parameter values. When using the SVR algorithm, it is essential to thoroughly discuss the rationality of its parameters. From the perspectives of return rates and Sharpe ratios, the optimal rolling training set length for linear models and SVR algorithms is 3 trading months, while for random forest algorithms, it is 9 trading months.
In summary, the stock selection strategy based on machine learning should be used in conjunction with the best strategies corresponding to different algorithms as shown in Table 23.
Table 23 Best Strategies Corresponding to Different Algorithms#
3.2 Limitations and Future Improvement Directions#
Research on how machine learning can be applied to quantitative stock selection has always been a hot topic in the investment field. This article attempts to establish a more effective quantitative stock selection strategy by introducing machine learning stock selection based on traditional multi-factor models, but there are still the following shortcomings:
First, the strategy in this article shows deficiencies in risk control, especially during the rapid market style switches in the 2015 bull and bear markets. The quantitative investment strategy did not consider risk controls such as stop-loss during high-risk market conditions, leading to significant maximum drawdowns. If conditions for taking profits and losses or timing models for stock buying and selling could be introduced, or if hedging mechanisms could be utilized to hedge risks, it would likely help control risks. Due to time constraints, this article did not conduct in-depth research in this area.
Second, the range of factors used in the multi-factor model is relatively narrow, mainly based on fundamentals. Factors outside of fundamentals, such as technical factors, have not been considered.
Finally, due to practical constraints, the evaluation of the quantitative investment strategy in this article is primarily based on historical data backtesting results, without subsequent real trading simulations and real-time trading tracking. The strategy still needs further validation in real trading.
Special thanks: This article is developed based on JoinQuant, and the source code used can be referenced at the following link:
https://www.joinquant.com/view/community/detail/7a63b350815f79bfd4d83ab22d0f291a