BTC/USD Price Prediction Using Linear Regression
DISCLAIMER!
Before proceeding, please make sure that you note the following important information:
NOT FINANCIAL ADVICE!
My content is intended to be used, and must be used, for informational and educational purposes only. I am not an attorney, CPA, or financial advisor, nor am I holding myself out to be, and the information contained on this blog/notebook is not a substitute for financial advice. None of the information contained here constitutes an offer (or solicitation of an offer) to buy or sell any security or financial instrument, to make any investment, or to participate in any particular trading strategy. Always seek advice from a professional who is aware of the facts and circumstances of your individual situation, or independently research and verify any information that you find on my blog/notebook and wish to rely upon in making any investment decision or otherwise. I accept no liability whatsoever for any loss or damage you may incur.
import pandas as pd
import numpy as np
import yfinance as yf
from scipy import stats
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error, mean_absolute_error
from datetime import datetime
import plotly.express as px
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
from IPython.display import HTML
import warnings
warnings.simplefilter("ignore")
# BTC price and volume
# Gold price is imported later, when the independent variables are created
# Read data
Df = yf.download('BTC-USD', '2012-01-01', '2022-05-31', interval= '1mo', auto_adjust=True)
# Keep only the Close and Volume columns
Df = Df[['Close','Volume']]
# Drop rows with missing values
Df = Df.dropna()
Before any modeling can be done, the data needs to be prepared, at the very least by arranging the data set in a way that makes sense.
df = Df.reset_index()
for i in ['Close', 'Volume']:
    df[i] = df[i].astype('float64')
df
Because I'm working with monthly data, I'll drop the day from each date to avoid confusion.
date_format = "%Y/%m"
df['Date'] = df['Date'].dt.strftime(date_format)
df
Bitcoin is often referred to as "digital gold" by its backers, so I'll add the gold price and volume data as potential independent variables. I'll then explore their relationship with the BTC price to see whether or not they are good predictors.
gold = yf.download('GLD', '2014-10-01', '2022-05-31', interval= '1mo', auto_adjust=True)
gold = gold[['Close','Volume']]
gld = gold.reset_index()
for i in ['Close', 'Volume']:
    gld[i] = gld[i].astype('float64')
gld
The other two independent variables will be the moving averages and volume.
Moving averages are often used by technical analysts to keep track of price trends for specific securities. I'll use the 3- and 6-month exponential moving averages. Whether the average is simple, weighted, or exponential probably doesn't make much of a difference here (though this could be a good hypothesis to test).
Volume is also a well-known indicator of price movement. Trading volume is the total number of shares/units of a security traded during a given period of time.
df['ema3'] = df['Close'].ewm(span=3, adjust=False).mean()
df['ema6'] = df['Close'].ewm(span=6, adjust=False).mean()
df
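For reference, with `adjust=False` pandas computes the exponential moving average recursively: the first value is the first observation, and each later value is `alpha * x_t + (1 - alpha) * previous`, where `alpha = 2 / (span + 1)`. The sketch below reproduces that recursion in plain Python. The helper name `ema` and the sample prices are made up for illustration.

```python
def ema(values, span):
    """Exponential moving average using the recursive (adjust=False) form."""
    alpha = 2 / (span + 1)           # smoothing factor implied by the span
    out = [values[0]]                # the first EMA value is the first observation
    for x in values[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out


prices = [1.0, 2.0, 3.0]             # made-up prices, not real BTC data
ema3 = ema(prices, span=3)           # alpha = 0.5 for span=3
print(ema3)                          # [1.0, 1.5, 2.25]
```

A shorter span gives a larger `alpha`, so the 3-month EMA reacts to new prices faster than the 6-month one.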
df['Gold Close'] = gld['Close']
df['Gold Volume'] = gld['Volume']
df
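One thing to watch here: assigning `gld['Close']` to `df` matches rows by index position, not by date, so the two series only line up if both downloads start in the same month. A minimal sketch of a date-based alternative is shown below. The frame names (`btc_toy`, `gold_toy`) and all numbers are made up for illustration, and the `Month` key is an assumption about how one might align the two tables.

```python
import pandas as pd

# Made-up toy frames: the BTC series starts one month before the gold series.
btc_toy = pd.DataFrame({
    "Date": pd.to_datetime(["2014-09-01", "2014-10-01", "2014-11-01"]),
    "Close": [387.0, 338.0, 378.0],
})
gold_toy = pd.DataFrame({
    "Date": pd.to_datetime(["2014-10-01", "2014-11-01"]),
    "Close": [112.0, 114.0],
})

# Positional assignment pairs row 0 with row 0, regardless of the dates.
positional = btc_toy.copy()
positional["Gold Close"] = gold_toy["Close"]

# A left merge on a year/month key pairs each BTC row with gold from the same month.
btc_toy["Month"] = btc_toy["Date"].dt.strftime("%Y/%m")
gold_toy["Month"] = gold_toy["Date"].dt.strftime("%Y/%m")
aligned = btc_toy.merge(
    gold_toy[["Month", "Close"]].rename(columns={"Close": "Gold Close"}),
    on="Month",
    how="left",
)
print(positional[["Date", "Gold Close"]])
print(aligned[["Month", "Gold Close"]])
```

In the positional version the October gold price lands on the September BTC row. In the merged version September has no gold value and the later months match.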
Now I'm going to generate the dependent/target variable that I'm going to try to predict.
df['Next Month Close'] = df['Close'].shift(-1)
df
btc_close = df['Close']  # Save the close price in case I need it later
# Then drop the 'Close' column (the month before the target) so that only the data I need is left
df.drop(columns='Close', inplace=True)
Df = df.dropna()
# Now I should have a good, clean dataframe ready for some EDA
Df
Intuitively, I know that traders like to use the EMA lines and volume to predict the BTC price. But, as I mentioned before, here I get the chance to explore whether the gold price and its volume can help predict the BTC price.
The Pearson correlation coefficient and p value
Pearson Correlation:
Correlation is a measure of how closely two sets of data are related. The most common measure of correlation in statistics is the Pearson correlation, whose full name is the Pearson Product Moment Correlation (PPMC). It measures the linear relationship between two sets of data. In simple terms, it answers the question: can I draw a straight line to represent the data?
It is a number between -1 and 1 that measures the strength and direction of the relationship between two variables, where:
- 1 is a perfect positive linear relationship.
- 0 is no linear relationship.
- -1 is a perfect negative linear relationship.
P-Value:
A p-value is the probability of obtaining results at least as extreme as the observed ones, assuming that the null hypothesis is true. The lower the p-value, the stronger the evidence against the null hypothesis. A p-value of 0.05 or lower is generally considered statistically significant: if the variables were truly uncorrelated, a correlation this strong would show up by chance in at most 5% of samples.
By convention, when the
- p-value is < 0.001: there is strong evidence that the correlation is significant.
- p-value is < 0.05: there is moderate evidence that the correlation is significant.
- p-value is < 0.1: there is weak evidence that the correlation is significant.
- p-value is > 0.1: there is no evidence that the correlation is significant.
Two things to keep in mind when interpreting the results:
- The null hypothesis is that the two variables are uncorrelated.
- The p-value is printed in scientific notation; in decimal form, e.g., 4.2e-7 = 0.00000042.
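To make the coefficient itself concrete, here is a minimal plain-Python sketch of the Pearson formula: the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The helper name `pearson_r` and the sample data are made up. In the notebook the value comes from `stats.pearsonr`.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors of the covariance and variances cancel, so plain sums suffice.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]      # perfectly linear, upward
falling = [10 - 3 * x for x in xs]    # perfectly linear, downward
print(pearson_r(xs, rising), pearson_r(xs, falling))
```

The two toy series sit exactly on a straight line, so they give the two extreme values of the coefficient.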
# Use only the first 90 rows, which avoids errors from the NaN in the final row of 'Next Month Close'
pearson_coef, p_value = stats.pearsonr(df['Gold Close'][:90], df['Next Month Close'][:90])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
In this case,
The p-value is < 0.001, so there is strong evidence that the correlation between the gold price and the next-month BTC price is statistically significant, and the linear relationship is quite strong (0.77, close to 1).
pearson_coef, p_value = stats.pearsonr(df['Gold Volume'][:90], df['Next Month Close'][:90])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
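The coefficient is about 0.08, almost 0, so there is essentially no linear relationship between gold volume and the next-month BTC price. With 90 observations, a coefficient that small corresponds to a p-value well above 0.1, so there is no evidence that the correlation is statistically significant.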
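As a rough sanity check on how the coefficient and the sample size translate into a p-value, the sketch below converts r into a t statistic and then uses a normal approximation for the two-sided p-value. This is an approximation I am assuming is adequate for about 90 rows; `stats.pearsonr` uses an exact distribution under a normality assumption, so its numbers will differ slightly. The helper name `approx_p_value` is made up, and the inputs are the rounded coefficients quoted in the text.

```python
import math


def approx_p_value(r, n):
    """Approximate two-sided p-value for a Pearson r from n observations.

    Assumption: n is large enough that the t statistic can be treated
    as roughly standard normal.
    """
    t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the two-sided normal tail.
    return math.erfc(abs(t_stat) / math.sqrt(2))


print(approx_p_value(0.08, 90))   # weak correlation (gold volume)
print(approx_p_value(0.77, 90))   # strong correlation (gold close)
```

With the same sample size, the weak coefficient gives a large p-value and the strong one gives a p-value far below 0.001.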
Visually we can see that there is almost no linear relationship between gold volume and btc price
fig = px.scatter(
    df, x='Gold Volume', y='Next Month Close', opacity=0.65,
    trendline='ols', trendline_color_override='firebrick'
)
fig.update_layout(template = "plotly_dark")
fig.show()
So we now know that we can use the gold price but not its volume. Gold volume would not have added anything to the model; if anything, it would probably have hurt the prediction.
df.drop(columns='Gold Volume', inplace=True)
df.head()
What about the other independent variables?
print('EMA 3')
pearson_coef, p_value = stats.pearsonr(df['ema3'][:90], df['Next Month Close'][:90])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
print('\nEMA 6')
pearson_coef, p_value = stats.pearsonr(df['ema6'][:90], df['Next Month Close'][:90])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
print('\nVolume')
pearson_coef, p_value = stats.pearsonr(df['Volume'][:90], df['Next Month Close'][:90])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
The other independent variables are all statistically significant, with p-values < 0.001, and their linear relationships with the target are very strong.
NOTE:
Causation is the relationship between cause and effect between two variables.
It is important to know the difference between correlation and causation. Correlation does not imply causation. Determining correlation is much simpler than determining causation as causation may require independent experimentation.
Before we continue, let's clarify the objective again: I'm using the 3- and 6-month EMAs, the BTC volume, and the gold close price of the preceding month to predict the BTC close price of the following month.
x = Df[['ema3','ema6','Volume','Gold Close']]
# Define the dependent variable
y = Df['Next Month Close']
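I am going to split the data: the first 80% will be used to train the model and the remaining 20% will be used to test the predictions of the model trained on that 80%. Because this is a time series, the split is chronological rather than random.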
t = .8
t = int(t*len(Df))
# Train dataset
x_train = x[:t]
y_train = y[:t]
# Test dataset
x_test = x[t:]
y_test = y[t:]
print("number of test samples :", y_test.shape[0])
print("number of training samples:",y_train.shape[0])
reg = linear_model.LinearRegression()
reg.fit(x_train,y_train)
The constant came back negative, which is confusing, but I'll get back to this later.
reg.intercept_
coeff_df = pd.DataFrame(reg.coef_.T, x.columns, columns=['Coefficient'])
coeff_df
print("Linear Regression model")
print("BTC Price (y) = %.2f (constant) + %.2f * EMA3 (x1) + %.2f * EMA6 (x2) + %.4f * Volume (x3) + %.2f * Gold Close (x4)" % (reg.intercept_,reg.coef_[0], reg.coef_[1],reg.coef_[2],reg.coef_[3]))
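The printed equation is all the model does at prediction time: for each row, `reg.predict` adds the constant to the weighted sum of the features. A minimal sketch of that arithmetic follows. The helper name `predict_by_hand` and every number in it are made up for illustration and are not the fitted values from this notebook.

```python
def predict_by_hand(intercept, coefficients, features):
    """Linear regression prediction: intercept plus the weighted sum of the features."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Made-up numbers for illustration only.
intercept = -10.0
coefficients = [2.0, 3.0]
features = [1.0, 4.0]
print(predict_by_hand(intercept, coefficients, features))  # -10 + 2*1 + 3*4 = 4.0
```

Note that the constant is only the prediction when every feature is exactly 0, which never happens with real prices and volumes.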
In this step I will evaluate the model's accuracy, but first I need to make the predictions.
predicted_price = reg.predict(x_test)
The R square is 0.05, which means the model's predictive power is poor. In fact, it is worse than I expected: it explains almost none of the variation in the target variable.
test_r2_score = r2_score(y_test,predicted_price)
print('The test R-square is: ', test_r2_score)
I will also look at other evaluation methods
test_r2_score = r2_score(y_test,predicted_price)
print('The test R-square is: ', test_r2_score)
test_mse = mean_squared_error(y_test, predicted_price)
print('The test mean square error of target variable and predicted value is: ', test_mse)
test_mae = mean_absolute_error(y_test, predicted_price)
print('The test mean absolute error of target variable and predicted value is: ', test_mae)
test_rmse=np.sqrt(test_mse)
print('The test root mean square error of target variable and predicted value is: ', test_rmse)
Mean Square Error (MSE) is the average of the squared differences between the actual and predicted values. There is no correct value for MSE. Simply put, the lower the value the better, and 0 means the model is perfect.
Mean Absolute Error (MAE) is the average of the absolute differences between the actual and predicted values. In this case the average difference is $8821.
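The definitions of these metrics are easier to compare side by side in code. Below is a minimal plain-Python sketch of what `mean_squared_error`, `mean_absolute_error`, and `r2_score` compute. The helper name `regression_metrics` and the sample values are made up.

```python
import math


def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE and R square for two equal-length sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e ** 2 for e in errors) / n          # mean of the squared errors
    mae = sum(abs(e) for e in errors) / n          # mean of the absolute errors
    rmse = math.sqrt(mse)                          # back in the units of the target
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)           # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot                       # can be negative on test data
    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}


# Made-up values, chosen so the arithmetic is easy to follow.
metrics = regression_metrics(actual=[3, 5, 7], predicted=[2, 5, 9])
print(metrics)
```

Because MSE squares each error, one large miss weighs much more heavily on it than on MAE. RMSE and MAE are both in dollars, which makes them easier to read against the price.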
I'm going to evaluate further and see what other insights I can gather from the predicted price. I'll start by creating a data frame with the predicted price and the actual price so that I can plot them side by side.
btc = pd.DataFrame()
btc['Date'] = Df['Date'][t:]
btc['Predicted Close'] = predicted_price
btc['Actual Close'] = y_test  # y_test already holds the next-month close for these rows
btc
fig = px.line(btc, x="Date", y=['Predicted Close', 'Actual Close'],
              title='Predicted Close Vs Actual Close')
fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=1, label="1m", step="month", stepmode="backward"),
            dict(count=6, label="6m", step="month", stepmode="backward"),
            dict(count=1, label="YTD", step="year", stepmode="todate"),
            dict(count=1, label="1y", step="year", stepmode="backward"),
            dict(step="all")
        ])
    )
)
fig.update_layout(template = "plotly_dark")
We have already seen summary measures of these residuals (MSE and MAE). Now I'll look at the residual for each month, in absolute terms and in %, to see how far off each prediction is. (Remember, the residual is the difference between the observed value and the estimated value.)
btc['Residual'] = btc['Actual Close'] - btc['Predicted Close']  # Observed minus predicted, in absolute $ terms
btc['Residual in %'] = np.absolute(btc['Residual'] / btc['Actual Close'] * 100)  # The difference in %
btc
As you can see from the residuals, the differences are pretty large, but this is partly due to BTC being very volatile. A difference of 7-10% could be considered good (this is subjective, based on my observations from BTC trading), but expecting a residual of less than 5% consistently would be unrealistic for an asset class this volatile.
Conclusion
If the model had explained at least 50-60% of the variation in the target (an R square of 0.50-0.60), I would have proceeded with backtesting and then taken the model live by predicting the close price of this month (June 2022).
Linear regression is not a good model for predicting BTC/USD prices; maybe it would have done better at predicting returns instead. There were many red flags, and based on the R square alone I would never take this model live and risk real money on it.
The MSE was way too high and very far from 0. Since MSE measures how close the fitted line is to the data points, this is another sign of a poor fit.
Another red flag was the constant being negative. This means that when all the independent variables are 0, the predicted price of BTC is -1262. A negative constant doesn't mean the regression made a mistake; rather, it reflects the data being modeled. Realistically, any security price can go to 0, but no security price can fall below 0 and turn negative. The value of your position in that asset can turn negative, but not the actual asset price. This is also why I think predicting returns instead of price would have been more accurate and much more realistic.
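A minimal sketch of what a return-based target could look like, assuming simple (not log) month-over-month returns. The helper name `monthly_returns` and the prices are made up, and this is only an illustration of the idea, not something tested in this notebook.

```python
def monthly_returns(closes):
    """Simple return from each month to the next: (next / current) - 1."""
    return [nxt / cur - 1 for cur, nxt in zip(closes, closes[1:])]


closes = [100.0, 110.0, 99.0]        # made-up closing prices
returns = monthly_returns(closes)    # roughly +10% then -10%
print(returns)
```

A return can be negative without implying a negative price, so a negative prediction or constant would no longer be an impossible value for the target.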
This is also a good example of how a machine learning model is only as useful as the features selected, and selecting the right features depends on the knowledge one has of the data set!