What Can Random Forest Models Tell Us About The Stock Market?

Sep 25, 2020

1) What are Random Forest Models?

Random Forest models are a form of machine learning that, generally speaking, produces great results with very little hyper-parameter tuning.

The idea behind them is actually incredibly simple. Basically, Random Forest Models are a collection of decision trees that each ask “questions” about the data to predict values, and the forest averages the answers of its trees.

The above decision tree is a model I’ve generated using stock and economic data. The top label represents the question being asked (basically asking if a value is larger or smaller than a certain number) and depending on the answer you move either left or right down the decision tree. The “value” represents the predicted stock price.

The model continues this process of asking questions and moving either right or left down the model until it hits a final specific answer on what the closing stock price should be.
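To make the question-and-answer walk concrete, here is a minimal sketch of a single tree traversal in plain Python. The tree, its feature names (`earnings_per_share`, `total_liabilities`), thresholds, and prices are all made up for illustration and are not taken from the model in this article.

```python
# A hand-built toy tree. Every name and number here is hypothetical.
toy_tree = {
    "feature": "earnings_per_share",
    "threshold": 2.0,
    "left": {"value": 15.0},  # reached when earnings_per_share <= 2.0
    "right": {
        "feature": "total_liabilities",
        "threshold": 5_000_000.0,
        "left": {"value": 120.0},
        "right": {"value": 60.0},
    },
}


def predict_one(node, row):
    """Walk from the root to a leaf and return the leaf's value."""
    while "value" not in node:
        if row[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["value"]


def predict_forest(trees, row):
    """A forest's prediction is the average of its trees' predictions."""
    return sum(predict_one(tree, row) for tree in trees) / len(trees)


example_row = {"earnings_per_share": 3.5, "total_liabilities": 1_000_000.0}
tree_prediction = predict_one(toy_tree, example_row)
```

A real forest learns its questions and thresholds from the training data instead of having them written by hand, but the prediction step works the same way.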

2) How Accurate Is The Model At Predicting Price?

The model is trained on about 350,000 rows of data with about 106 different features, including company fundamentals, daily stock volume, and economic indicators like unemployment and GDP growth.

Running the model we get the scores:

The baseline score is basically the probability of guessing the correct price by simply picking the most common price. Since the baseline is nearly 0% this means we can’t guess the answer by sheer luck.
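As a rough sketch of that baseline, the snippet below finds the most common price in a list and measures how often guessing it would be exactly right. The prices are invented, and the helper name `most_common_baseline` is hypothetical.

```python
from collections import Counter


def most_common_baseline(prices):
    """Return the most common price and the share of rows it matches."""
    most_common_price, count = Counter(prices).most_common(1)[0]
    return most_common_price, count / len(prices)


# Invented closing prices, for illustration only.
closing_prices = [12.5, 48.1, 12.5, 301.0, 77.3, 19.9, 150.2, 64.0]
guess, hit_rate = most_common_baseline(closing_prices)
```

With hundreds of thousands of rows and prices that rarely repeat exactly, the hit rate of this kind of guess shrinks toward zero.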

The training score is mostly used to determine whether or not our model is over-fitting to the data we use for training. Since our training score is quite a bit larger than the validation score, the model is probably a little over-fit.

The validation score is what tells us how good our model is at making predictions when it is presented with new and unseen data. The score is calculated through a statistical measure called R-Squared, which basically shows us how much of the variation in the actual values is explained by our predictions. A score of 1 means perfect predictions, and a score of 0 means the model does no better than always guessing the average price.
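For readers who want to see the calculation, here is a minimal sketch of R-Squared in plain Python. The numbers are invented and only show how the score responds to prediction error.

```python
def r_squared(actual, predicted):
    """1 minus (squared prediction error / squared spread around the mean)."""
    mean_actual = sum(actual) / len(actual)
    residual_sum = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total_sum = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - residual_sum / total_sum


# Invented prices, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

score = r_squared(actual, predicted)
perfect_score = r_squared(actual, actual)
mean_only_score = r_squared(actual, [25.0] * 4)  # always guess the average
```

Perfect predictions score 1, and always guessing the average price scores 0, so a score can be read as how far the model has moved from the "just guess the mean" end toward the perfect end.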

The validation R-Squared score is about 0.57, which would be considered moderately strong for predicting stock prices. Ideally, you would want something between 0.7 and 0.9, but this is still pretty good considering it is predicting prices almost exclusively on fundamentals data and economic data.

The model also has a mean absolute error of 30, which means our predictions are off by about 30 dollars on average. However, since stock prices range from $1 to $10,000+, it is difficult to interpret how significant this error is.
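The sketch below shows why the same 30 dollar error can mean very different things. It computes the mean absolute error next to a percentage version of it, which is an extra metric added here for illustration and not one the article reports. All prices are invented.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the miss, in dollars."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average size of the miss, as a fraction of the actual price."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Invented examples: both groups are off by exactly 30 dollars per stock.
cheap_actual, cheap_predicted = [5.0, 10.0], [35.0, 40.0]
pricey_actual, pricey_predicted = [1000.0, 2000.0], [1030.0, 1970.0]

cheap_mae = mean_absolute_error(cheap_actual, cheap_predicted)
pricey_mae = mean_absolute_error(pricey_actual, pricey_predicted)
cheap_mape = mean_absolute_percentage_error(cheap_actual, cheap_predicted)
pricey_mape = mean_absolute_percentage_error(pricey_actual, pricey_predicted)
```

Both groups have the same dollar error, but the miss is several times the price for the cheap stocks and only a couple of percent for the expensive ones.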

3) Explaining The Model Through Feature Importance

Now, even though the model isn’t 100% accurate it can still help us figure out what are the most important features for determining stock prices.

Through some math we can assign scores to the model's features that will tell us which features are best at explaining stock prices.

Based on the chart, the model tells us that the most important feature for determining price is Earnings Per Share, by quite a large margin. What is not clear, however, is whether Earnings Per Share is negatively or positively impacting share prices. All the feature importances can tell us is which features are significant, not in which direction they are significant.
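The article does not say which importance measure sits behind the chart. As one way such scores can be computed, here is a minimal sketch of permutation importance: shuffle one feature's column and see how much the error grows. The toy model, the feature names `eps` and `building_permits`, and the data are all hypothetical.

```python
import random


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def permutation_importance(model, rows, targets, feature, seed=0):
    """Increase in error after shuffling a single feature's values."""
    rng = random.Random(seed)
    base_error = mean_absolute_error(targets, [model(r) for r in rows])
    shuffled = [r[feature] for r in rows]
    rng.shuffle(shuffled)
    shuffled_rows = [{**r, feature: v} for r, v in zip(rows, shuffled)]
    new_error = mean_absolute_error(targets, [model(r) for r in shuffled_rows])
    return new_error - base_error


def toy_model(row):
    # Hypothetical stand-in for a trained model: it only looks at "eps"
    # and ignores "building_permits" completely.
    return 20.0 * row["eps"]


rows = [
    {"eps": float(i), "building_permits": float((i * 7) % 10)}
    for i in range(1, 11)
]
targets = [toy_model(r) for r in rows]

importances = {
    feature: permutation_importance(toy_model, rows, targets, feature)
    for feature in ("eps", "building_permits")
}
```

Shuffling the ignored feature changes nothing, so its score is zero, while shuffling the feature the model relies on makes the error jump. Note that the score is only a size: it never says whether a larger feature value pushes the price up or down.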

Now, through SHAP plots we can explain which features are negatively and positively impacting stock prices.

To explain what is going on in the chart: red means larger feature values and blue means smaller feature values. The SHAP value is how much that feature pushes the predicted stock price above or below the model's average prediction. Dots to the right of the center line push the predicted price higher, and dots to the left push it lower.
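To show what a SHAP value measures, here is a brute-force sketch that computes exact Shapley values for a tiny hypothetical model by trying every order in which features can be "switched on". The `shap` library uses much faster methods for tree models; the pricing rule, the feature names, and the "average company" baseline below are assumptions made for illustration.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Average contribution of each feature over all feature orderings."""
    features = list(instance)
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for ordering in orderings:
        current = dict(baseline)  # start from the baseline company
        previous = model(current)
        for f in ordering:
            current[f] = instance[f]  # switch this feature to its real value
            updated = model(current)
            totals[f] += updated - previous
            previous = updated
    return {f: total / len(orderings) for f, total in totals.items()}


def toy_model(row):
    # Hypothetical pricing rule: EPS raises the price, liabilities lower it.
    return 50.0 + 20.0 * row["eps"] - 5.0 * row["liabilities"]


baseline = {"eps": 2.0, "liabilities": 4.0}  # stand-in for an average company
instance = {"eps": 5.0, "liabilities": 1.0}  # the company being explained

values = shapley_values(toy_model, instance, baseline)
gap = toy_model(instance) - toy_model(baseline)
```

In this toy case the high EPS pushes the prediction up by 60 dollars and the low liabilities push it up by another 15, and the two contributions add up to the full gap between this company's predicted price and the baseline prediction. Each dot in a SHAP plot is one such contribution for one company.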

Through the SHAP plot we can see a lot of detail on how a range of feature values can push stock prices up or down.

4) Testing Out The Model's Feature Importances

If the model's feature importances are accurate then we should be able to see distinct patterns if we separate companies by how high or low their values are for a given feature.
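A minimal sketch of that kind of split, assuming a simple cut at the median (the article does not say how the groups were formed). The tickers, liabilities, and price changes are invented.

```python
from statistics import mean, median


def split_by_feature(companies, feature):
    """Split companies into a low and a high group around the median."""
    cutoff = median(c[feature] for c in companies)
    low = [c for c in companies if c[feature] <= cutoff]
    high = [c for c in companies if c[feature] > cutoff]
    return low, high


# Invented companies, for illustration only.
companies = [
    {"ticker": "AAA", "liabilities": 1.0, "price_change": 0.30},
    {"ticker": "BBB", "liabilities": 2.0, "price_change": 0.20},
    {"ticker": "CCC", "liabilities": 8.0, "price_change": 0.05},
    {"ticker": "DDD", "liabilities": 9.0, "price_change": -0.10},
]

low_group, high_group = split_by_feature(companies, "liabilities")
low_average = mean(c["price_change"] for c in low_group)
high_average = mean(c["price_change"] for c in high_group)
```

If a feature really matters, the two group averages should sit clearly apart; if they look the same, the feature is probably not doing much work.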

Above we can see that companies with lower liabilities generally outperform higher liability companies (except near the end, where one company ballooning in price skews the result). This is mostly consistent with the SHAP plot displayed further above.

Here we have separated companies by asset depreciation and we can clearly see that lower depreciation companies outperform higher depreciation companies. However, our model placed a very low importance on depreciation which suggests to me that the model isn’t accurately detecting the relationship.

5) Should The Model Be Ignoring Economic Data?

I find it strange that the model places such low confidence in economic data. Is economic data really that unimportant as a predictor?

Scatter plot between building permits and closing price

We can see here there is a very slight positive linear relationship between building permits granted and average stock prices, though it is pretty insignificant overall.

However, when we compare that slope against the slope for a high importance feature, we can see a much stronger positive correlation for the high importance feature. So maybe the model is right in not using economic data to predict stock prices.
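The comparison above can be sketched by computing the fitted slope and the correlation for each feature. The numbers below are invented: one feature that tracks price closely and one that barely does.

```python
def slope_and_correlation(x, y):
    """Least-squares slope of y on x, and the Pearson correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


# Invented data, for illustration only.
feature_values = [1.0, 2.0, 3.0, 4.0, 5.0]
price_strong = [2.0, 4.0, 6.0, 8.0, 10.0]  # moves in lockstep with the feature
price_weak = [3.0, 1.0, 4.0, 2.0, 4.0]     # only loosely related

strong_slope, strong_corr = slope_and_correlation(feature_values, price_strong)
weak_slope, weak_corr = slope_and_correlation(feature_values, price_weak)
```

One caution when comparing features this way: the slope depends on the units of each feature, while the correlation does not, so the correlation is usually the safer number to compare across features measured on different scales.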

6) Conclusion

Overall I’d say the Random Forest model has done a good job at finding general patterns to look for in stock prices.

Using feature importance you can draw your attention to the clearly strongest predictors, like fundamental data, while ignoring weaker predictors like economic data.

And through SHAP plots you can get a clearer picture of how things are negatively or positively impacting prices.

7) Bonus: Moving Averages Make For Surprisingly Strong Predictors

With the inclusion of a 40 day moving average, my model completely over-fits to one metric, but the performance is surprisingly good.

Model Importance with inclusion of 40 day moving average

Though it seems silly to predict prices almost entirely through moving averages, it is hard to argue with the results. I get an R-Squared score of .998, which is absolutely fantastic. It would seem that the strongest predictor of stock prices is, well, previous stock prices.
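For reference, here is a minimal sketch of a trailing moving average in plain Python. It uses a 3 day window on invented prices to keep the example short, and it assumes the window includes the current day's close, which the article does not specify.

```python
def moving_average(prices, window):
    """Trailing mean over `window` days; None until a full window exists."""
    averages = []
    running_total = 0.0
    for i, price in enumerate(prices):
        running_total += price
        if i >= window:
            running_total -= prices[i - window]  # drop the day that left the window
        averages.append(running_total / window if i >= window - 1 else None)
    return averages


# Invented closing prices, for illustration only.
prices = [10.0, 11.0, 12.0, 13.0, 14.0]
three_day_average = moving_average(prices, 3)
```

Because each average is built from the most recent closes, it stays very close to the price it is being used to predict, which helps explain why it dominates the other features.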