1. Training Set, Validation Set, and Test Set#
In artificial intelligence and machine learning, the original data is typically divided into three parts, as shown in Figure 1: the training set, the validation set, and the test set.
- Training Set: Used to build the machine learning model.
- Validation Set: May be carved out of the training set or held out separately. When held out separately, it is typically used to evaluate the trained model during development or to select model parameters.
- Test Set: Data the model has never seen, reserved for evaluating the final model (a minimal splitting sketch follows Figure 1).
Figure 1 Training Set, Validation Set, and Test Set
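To make the split concrete, here is a minimal sketch using scikit-learn's train_test_split, applied twice to produce all three sets; the 60/20/20 proportions and the synthetic data are illustrative assumptions, not a prescribed recipe.

```python
# A minimal three-way split sketch; proportions and data are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(100, 5)   # 100 synthetic samples with 5 features
y = rng.rand(100)      # synthetic continuous target

# Hold out the test set (20%) first; it stays untouched until final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Split the remainder into training (60% overall) and validation (20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```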
2. Generalization, Overfitting, and Underfitting#
If an algorithm makes accurate predictions on unseen data, we say it generalizes from the training set to the test set. We generally want to build an algorithm that generalizes as accurately as possible, and the only way to measure its performance on new data is to evaluate it on the test set. As a rule, simpler models generalize better to new data. Building a model that is too complex for the available information is called overfitting: the fit pays too much attention to the details of the training set, producing a model that performs well on the training data but cannot generalize to the test set or to new data. Conversely, if the model is too simple, it may fail to capture the structure and variation in the data and will perform poorly even on the training set; this is called underfitting.
The more complex the model, the better its predictions on the training set; but an overly complex model attends too closely to each individual training point and cannot generalize well to new data. Between overfitting and underfitting there is an optimal point, shown in Figure 2, where the model generalizes best. Finding it is the ultimate goal of model selection.
Figure 2 Relationship between Model Complexity and Prediction Error
(Source: CNBLOG[2])
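The trade-off in Figure 2 can be reproduced in a few lines. Below is a small sketch that fits polynomials of increasing degree to noisy data; the sine-based data and the chosen degrees are assumptions made purely for illustration.

```python
# Sketch of the complexity/error trade-off: low degree underfits,
# high degree overfits. Data and degrees are illustrative assumptions.
import numpy as np

rng = np.random.RandomState(0)
f = lambda x: np.sin(2 * np.pi * x)          # assumed "true" function
x_train = np.sort(rng.uniform(0, 1, 30))
x_test = np.sort(rng.uniform(0, 1, 30))
y_train = f(x_train) + rng.normal(scale=0.2, size=30)
y_test = f(x_test) + rng.normal(scale=0.2, size=30)

for degree in (1, 4, 12):                    # underfit / balanced / overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Typically, training error keeps falling as the degree grows, while test error bottoms out and then rises again: the U-shape of Figure 2.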
3. Cross-Validation#
Cross-validation, also known as rotation estimation, is a statistical method for evaluating generalization performance. It is more stable and thorough than a single split into training and test sets. In cross-validation, the data is split into training and test sets multiple times, or the test set is held fixed while the training set is repeatedly split into training and validation sets.
The most common form is k-fold cross-validation, where k is a number specified by the user. In k-fold cross-validation, the training set is first divided into k parts of approximately equal size, each called a fold. A series of models is then trained: the k-th fold serves as the validation set for measuring accuracy while the remaining folds (1 to k-1) serve as the training set for the first model; then folds k-1, k-2, ..., 2, 1 take turns as the validation set, with the other folds used for training, producing a different model each time. Accuracy is computed on each iteration, yielding k accuracy values in total. Taking k=10 as an example, ten-fold cross-validation is illustrated in Figure 3.
Figure 3 Example of Ten-fold Cross-Validation
(Source: CSDN[3])
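As a sketch of the procedure in practice, assuming scikit-learn and synthetic data, cross_val_score below trains and evaluates one model per fold:

```python
# Ten-fold cross-validation sketch; the SVR model and data are assumptions.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# cross_val_score trains k models, each evaluated on a different held-out fold.
scores = cross_val_score(SVR(), X, y, cv=10)  # k = 10, as in Figure 3
print(scores)          # one score per fold
print(scores.mean())   # the usual summary: the average over the k folds
```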
An advantage of cross-validation is that every example is evaluated exactly once: each example belongs to one fold, and each fold is used once as the validation set. A model therefore has to generalize well to all samples in the dataset to achieve high scores across all folds (the fold scores are usually averaged). Another advantage over a single split is that it uses the data more efficiently: with ten-fold cross-validation, 90% of the data is used to fit each model. The disadvantage is computational cost: because k models are trained instead of one, cross-validation is roughly k times slower than a single split. Note also that cross-validation is not a method for building models; its purpose is only to evaluate how well a given algorithm generalizes on a particular dataset, and it does not itself return a model.
4. Grid Search#
A model's generalization ability can be improved by tuning its parameters. Finding the values of a model's important parameters (those that give the best generalization performance) is challenging, but it is necessary for almost every model and dataset. There is a substantial body of research on parameter tuning, including non-parametric noise estimation and genetic-algorithm parameter optimization. The most commonly used method is grid search, which simply tries all possible combinations of the parameters of interest. Take SVR as an example: it has two important parameters, the kernel width gamma and the regularization parameter C. Suppose we want to try C values of 0.001, 0.01, 0.1, 1, 10, and 100, and gamma values of 0.001, 0.01, 0.1, 1, and 10. That gives 30 parameter combinations in total, as shown in Figure 4, and all of these combinations form the parameter grid for SVR.
Figure 4 Parameter Grid for SVR
To obtain a better estimate of generalization performance, grid search is usually combined with cross-validation. The major drawback of grid search with cross-validation is the training time it requires: for the grid in Figure 4, evaluating every combination of C and gamma with 5-fold cross-validation means training 30 × 5 = 150 models.
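A sketch of this combination over exactly the Figure 4 grid, assuming scikit-learn's GridSearchCV and synthetic data:

```python
# Grid search with 5-fold cross-validation over the Figure 4 grid.
# The synthetic data is an assumption for illustration.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],  # 6 values
    "gamma": [0.001, 0.01, 0.1, 1, 10],   # 5 values -> 30 combinations
}

# 30 combinations x 5 folds = 150 model fits, plus one refit of the best model.
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```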
5. Rolling Forecast#
As the name suggests, the rolling forecast method feeds sample data to the model on a rolling basis. Its essence is to keep the model up to date by passing the latest data into the modeling process through a "time window". The premise of this approach is that new data matters more than old data, because newer data better reflects current market conditions.
There are two main variants of this algorithm; a sketch of the first follows this paragraph. In the first, the time window is fixed: its size does not change as new data and samples arrive, so old data and samples are continually cleared to make room for new ones. The window size is therefore critical to this variant and needs to be optimized. In the second, the time window grows as new data and samples are added; because new data keeps arriving, the window must be adjusted in a timely manner according to the characteristics of the new data. Under this mechanism, old data remains in the algorithm while the latest data is appended to model training. Although this appears to account comprehensively for how well the model matches market demand, it has two major defects. First, because old data is retained, the computational cost of the model grows as new data is added, lowering training efficiency. Second, some of the old data no longer reflects market demand, and its presence degrades the algorithm's accuracy during training. These trade-offs need to be weighed carefully in use.
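The fixed-window variant can be sketched as follows. The linear model, window size, lag count, and synthetic series are all illustrative assumptions rather than a prescribed implementation:

```python
# Fixed-size rolling window: at each step the oldest observation is
# dropped and the newest is added before retraining. All choices below
# (window=50, lags=5, linear model, random-walk series) are assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
series = np.cumsum(rng.normal(size=200))   # synthetic "market" series
window = 50                                # fixed window size (a tuning choice)
lags = 5                                   # predict the next value from 5 lags

def make_xy(s):
    """Turn a 1-D series into (lag-features, next-value) pairs."""
    X = np.array([s[i:i + lags] for i in range(len(s) - lags)])
    return X, s[lags:]

predictions = []
for t in range(window, len(series) - 1):
    # Only the most recent `window` points enter training; older data is cleared.
    X_train, y_train = make_xy(series[t - window:t])
    model = LinearRegression().fit(X_train, y_train)
    predictions.append(model.predict(series[t - lags:t].reshape(1, -1))[0])

print(len(predictions), "one-step-ahead forecasts")
```

Because the window size is constant, the training cost per step stays flat, which is precisely the efficiency argument made for the fixed-window variant above.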
Research on past multi-factor models shows that the vast majority were never upgraded or optimized as technology and circumstances changed. These models were built on the assumption that market operating principles do not change, at least in the short term, so their builders saw no need to update the models' data and samples frequently, which seems reasonable on its face. In reality, however, market conditions change constantly, and rules that held before may not apply in the next stage of development. Such models therefore cannot keep pace with market developments, and builders must adjust and update the data and samples in their models in a timely manner. Using a rolling time window mechanism is recommended to help builders avoid these problems.
References:
[1] https://blog.csdn.net/lhx878619717/article/details/49079785
[2] https://www.cnblogs.com/sthinker/p/6837597.html
[3] https://blog.csdn.net/lhx878619717/article/details/49079785