DISCLAIMER!

Before proceeding, please make sure that you note the following important information:

NOT FINANCIAL ADVICE!

My content is intended to be used, and must be used, for informational and educational purposes only. I am not an attorney, CPA, or financial advisor, nor am I holding myself out to be, and the information contained in this blog/notebook is not a substitute for financial advice. None of the information contained here constitutes an offer (or solicitation of an offer) to buy or sell any security or financial instrument, to make any investment, or to participate in any particular trading strategy. Always seek advice from a professional who is aware of the facts and circumstances of your individual situation, or independently research and verify any information that you find on my blog/notebook and wish to rely upon in making any investment decision or otherwise. I accept no liability whatsoever for any loss or damage you may incur.

import pandas as pd
import numpy as np
import yfinance as yf
from scipy import stats
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error, mean_absolute_error
from datetime import datetime
import plotly.express as px
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
from IPython.display import HTML
import warnings
warnings.simplefilter("ignore")

Import Data

#BTC price and volume

#Gold price will be imported later, when creating the independent variables

# Read data
Df = yf.download('BTC-USD', '2012-01-01', '2022-05-31', interval='1mo', auto_adjust=True)

# Keep only the Close and Volume columns
Df = Df[['Close','Volume']]

# Drop rows with missing values
Df = Df.dropna()
[*********************100%***********************]  1 of 1 completed

Data Cleaning and Wrangling

Before any modeling can be done, a few steps are needed to prepare the data before feeding it to the model, starting with arranging the data set in a way that makes sense.

df = Df.reset_index()
for i in ['Close', 'Volume']: 
      df[i]  =  df[i].astype('float64')
df
Date Close Volume
0 2014-10-01 338.321014 9.029944e+08
1 2014-11-01 378.046997 6.597334e+08
2 2014-12-01 320.192993 5.531023e+08
3 2015-01-01 217.464005 1.098812e+09
4 2015-02-01 254.263000 7.115187e+08
... ... ... ...
87 2022-01-01 38483.125000 9.239790e+11
88 2022-02-01 43193.234375 6.713360e+11
89 2022-03-01 45538.675781 8.309438e+11
90 2022-04-01 37714.875000 8.301159e+11
91 2022-05-01 31792.310547 1.105689e+12

92 rows × 3 columns

Because I'm working with monthly data, I'll drop the days in the date to avoid confusion

date_format = "%Y/%m"
df['Date'] = df['Date'].dt.strftime(date_format)
df
Date Close Volume
0 2014/10 338.321014 9.029944e+08
1 2014/11 378.046997 6.597334e+08
2 2014/12 320.192993 5.531023e+08
3 2015/01 217.464005 1.098812e+09
4 2015/02 254.263000 7.115187e+08
... ... ... ...
87 2022/01 38483.125000 9.239790e+11
88 2022/02 43193.234375 6.713360e+11
89 2022/03 45538.675781 8.309438e+11
90 2022/04 37714.875000 8.301159e+11
91 2022/05 31792.310547 1.105689e+12

92 rows × 3 columns

Bitcoin is often referred to as "digital gold" by its backers, so I'll add gold price and volume data as potential independent variables and explore further to see their relationship with BTC and whether or not they will be good predictors.

gold = yf.download('GLD', '2014-10-01', '2022-05-31', interval= '1mo', auto_adjust=True)
gold = gold[['Close','Volume']]
gld = gold.reset_index()
for i in ['Close', 'Volume']: 
      gld[i]  =  gld[i].astype('float64')
gld
[*********************100%***********************]  1 of 1 completed
Date Close Volume
0 2014-10-01 112.660004 155183900.0
1 2014-11-01 112.110001 147594200.0
2 2014-12-01 113.580002 153722200.0
3 2015-01-01 123.449997 198034100.0
4 2015-02-01 116.160004 125686200.0
... ... ... ...
87 2022-01-01 168.089996 211125100.0
88 2022-02-01 178.380005 254601300.0
89 2022-03-01 180.649994 377087100.0
90 2022-04-01 176.910004 195346400.0
91 2022-05-01 171.139999 179902200.0

92 rows × 3 columns

The other two independent variables will be the moving averages and volume

Moving averages are often used by technical analysts to keep track of price trends for specific securities. I'll use the 3- and 6-month exponential moving averages; whether the average is simple, weighted, or exponential generally doesn't make much of a difference (though this could be a good hypothesis to test).

Volume is also a well-known indicator of price movement. Trading volume is the total number of shares/units of a security traded during a given period of time.
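For reference, the exponential moving average computed below (pandas' ewm with adjust=False) follows the standard recursion, where $P_{t}$ is the close price in month $t$:

$$ EMA_{t} = \alpha \, P_{t} + (1 - \alpha)\, EMA_{t-1}, \qquad \alpha = \frac{2}{span + 1} $$

So the 3-month EMA uses $\alpha = 0.5$ and the 6-month EMA uses $\alpha \approx 0.29$.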

df['ema3'] = df['Close'].ewm(span=3, adjust=False).mean()
df['ema6'] = df['Close'].ewm(span=6, adjust=False).mean()
df
Date Close Volume ema3 ema6
0 2014/10 338.321014 9.029944e+08 338.321014 338.321014
1 2014/11 378.046997 6.597334e+08 358.184006 349.671295
2 2014/12 320.192993 5.531023e+08 339.188499 341.248923
3 2015/01 217.464005 1.098812e+09 278.326252 305.881804
4 2015/02 254.263000 7.115187e+08 266.294626 291.133574
... ... ... ... ... ...
87 2022/01 38483.125000 9.239790e+11 44519.703176 46247.967191
88 2022/02 43193.234375 6.713360e+11 43856.468776 45375.186387
89 2022/03 45538.675781 8.309438e+11 44697.572278 45421.897642
90 2022/04 37714.875000 8.301159e+11 41206.223639 43219.891173
91 2022/05 31792.310547 1.105689e+12 36499.267093 39954.868137

92 rows × 5 columns

df['Gold Close'] = gld['Close']
df['Gold Volume'] = gld['Volume']
df
Date Close Volume ema3 ema6 Gold Close Gold Volume
0 2014/10 338.321014 9.029944e+08 338.321014 338.321014 112.660004 155183900.0
1 2014/11 378.046997 6.597334e+08 358.184006 349.671295 112.110001 147594200.0
2 2014/12 320.192993 5.531023e+08 339.188499 341.248923 113.580002 153722200.0
3 2015/01 217.464005 1.098812e+09 278.326252 305.881804 123.449997 198034100.0
4 2015/02 254.263000 7.115187e+08 266.294626 291.133574 116.160004 125686200.0
... ... ... ... ... ... ... ...
87 2022/01 38483.125000 9.239790e+11 44519.703176 46247.967191 168.089996 211125100.0
88 2022/02 43193.234375 6.713360e+11 43856.468776 45375.186387 178.380005 254601300.0
89 2022/03 45538.675781 8.309438e+11 44697.572278 45421.897642 180.649994 377087100.0
90 2022/04 37714.875000 8.301159e+11 41206.223639 43219.891173 176.910004 195346400.0
91 2022/05 31792.310547 1.105689e+12 36499.267093 39954.868137 171.139999 179902200.0

92 rows × 7 columns

Now I'm going to generate the dependent/target variable that I'm going to try to predict.

df['Next Month Close'] = df['Close'].shift(-1)
df
Date Close Volume ema3 ema6 Gold Close Gold Volume Next Month Close
0 2014/10 338.321014 9.029944e+08 338.321014 338.321014 112.660004 155183900.0 378.046997
1 2014/11 378.046997 6.597334e+08 358.184006 349.671295 112.110001 147594200.0 320.192993
2 2014/12 320.192993 5.531023e+08 339.188499 341.248923 113.580002 153722200.0 217.464005
3 2015/01 217.464005 1.098812e+09 278.326252 305.881804 123.449997 198034100.0 254.263000
4 2015/02 254.263000 7.115187e+08 266.294626 291.133574 116.160004 125686200.0 244.223999
... ... ... ... ... ... ... ... ...
87 2022/01 38483.125000 9.239790e+11 44519.703176 46247.967191 168.089996 211125100.0 43193.234375
88 2022/02 43193.234375 6.713360e+11 43856.468776 45375.186387 178.380005 254601300.0 45538.675781
89 2022/03 45538.675781 8.309438e+11 44697.572278 45421.897642 180.649994 377087100.0 37714.875000
90 2022/04 37714.875000 8.301159e+11 41206.223639 43219.891173 176.910004 195346400.0 31792.310547
91 2022/05 31792.310547 1.105689e+12 36499.267093 39954.868137 171.139999 179902200.0 NaN

92 rows × 8 columns

btc_close = df['Close'] # Save the close price just in case I need it later

# Then remove the current month's BTC close price so that only the relevant columns remain
df.drop(columns='Close', inplace=True)
Df = df.dropna()

# Now I should have a clean dataframe ready for some EDA
Df
Date Volume ema3 ema6 Gold Close Gold Volume Next Month Close
0 2014/10 9.029944e+08 338.321014 338.321014 112.660004 155183900.0 378.046997
1 2014/11 6.597334e+08 358.184006 349.671295 112.110001 147594200.0 320.192993
2 2014/12 5.531023e+08 339.188499 341.248923 113.580002 153722200.0 217.464005
3 2015/01 1.098812e+09 278.326252 305.881804 123.449997 198034100.0 254.263000
4 2015/02 7.115187e+08 266.294626 291.133574 116.160004 125686200.0 244.223999
... ... ... ... ... ... ... ...
86 2021/12 9.570472e+11 50556.281353 49353.904068 170.960007 151214100.0 38483.125000
87 2022/01 9.239790e+11 44519.703176 46247.967191 168.089996 211125100.0 43193.234375
88 2022/02 6.713360e+11 43856.468776 45375.186387 178.380005 254601300.0 45538.675781
89 2022/03 8.309438e+11 44697.572278 45421.897642 180.649994 377087100.0 37714.875000
90 2022/04 8.301159e+11 41206.223639 43219.891173 176.910004 195346400.0 31792.310547

91 rows × 7 columns

Exploratory Data Analysis

Intuitively, I know that traders like to use the EMA lines and volume to predict BTC price. But as I mentioned before, here I get the chance to explore whether gold price and its volume can help predict BTC price.

The Pearson correlation coefficient and p-value

Pearson Correlation:

Correlation between two sets of data is a measure of how well they are related. The most common measure of correlation in statistics is the Pearson correlation; its full name is the Pearson Product Moment Correlation (PPMC). It shows the linear relationship between two sets of data. In simple terms, it answers the question: can I draw a line graph to represent the data?

It is a number between –1 and 1 that measures the strength and direction of the relationship between two variables, where:

  • 1: Perfect positive linear correlation.
  • 0: No linear correlation, the two variables most likely do not affect each other.
  • -1: Perfect negative linear correlation.
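For two samples $x$ and $y$, the coefficient is computed as:

$$ r = \frac{\sum_{i}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum_{i}(x_{i} - \bar{x})^{2}}\;\sqrt{\sum_{i}(y_{i} - \bar{y})^{2}}} $$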
P-Value:

A p-value measures the probability of obtaining the observed results, assuming that the null hypothesis is true. The lower the p-value, the greater the statistical significance of the observed difference. A p-value of 0.05 or lower is generally considered statistically significant, which (loosely speaking) means we can be 95% confident that the correlation between the variables is significant.

By convention:

• p-value < 0.001: strong evidence that the correlation is significant.
• p-value < 0.05: moderate evidence that the correlation is significant.
• p-value < 0.1: weak evidence that the correlation is significant.
• p-value > 0.1: no evidence that the correlation is significant.

Two things to keep in mind when interpreting the results:

• The null hypothesis is that the two variables are uncorrelated.
• The p-value is printed in scientific notation; its decimal form is e.g. 4.2e-7 = 0.00000042.

# Exclude the last rows to avoid errors from the NaN value in 'Next Month Close'
pearson_coef, p_value = stats.pearsonr(df['Gold Close'][:90], df['Next Month Close'][:90])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
    
    The Pearson Correlation Coefficient is 0.7706711768570538  with a P-value of P = 6.555477090301474e-19
    

In this case, the p-value is < 0.001, so there is strong evidence that the correlation between gold price and BTC price is statistically significant, and the linear relationship is quite strong (0.77, close to 1).

    pearson_coef, p_value = stats.pearsonr(df['Gold Volume'][:90], df['Next Month Close'][:90])
    print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
    
    The Pearson Correlation Coefficient is 0.08476385450593574  with a P-value of P = 0.42700282567060693
    

The p-value is > 0.1, so there is no evidence that the correlation between gold volume and BTC price is statistically significant, and there is essentially no linear relationship (0.08, close to 0).

Visually, we can see that there is almost no linear relationship between gold volume and BTC price:

    fig = px.scatter(
        df, x='Gold Volume', y='Next Month Close', opacity=0.65,
        trendline='ols', trendline_color_override='firebrick'
    )
    fig.update_layout(template = "plotly_dark")
    fig.show()
    

So we now know that we can use gold price but not gold volume; volume would not have added anything to the model, and if anything it would probably have hurt the prediction.

    df.drop(columns='Gold Volume', inplace=True)
    df.head()
    
    Date Volume ema3 ema6 Gold Close Next Month Close
    0 2014/10 9.029944e+08 338.321014 338.321014 112.660004 378.046997
    1 2014/11 6.597334e+08 358.184006 349.671295 112.110001 320.192993
    2 2014/12 5.531023e+08 339.188499 341.248923 113.580002 217.464005
    3 2015/01 1.098812e+09 278.326252 305.881804 123.449997 254.263000
    4 2015/02 7.115187e+08 266.294626 291.133574 116.160004 244.223999

What about the other independent variables?

    print('EMA 3')
    pearson_coef, p_value = stats.pearsonr(df['ema3'][:90], df['Next Month Close'][:90])
    print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
    
    print('\nEMA 6')
    pearson_coef, p_value = stats.pearsonr(df['ema6'][:90], df['Next Month Close'][:90])
    print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
    
    print('\nVolume')
    pearson_coef, p_value = stats.pearsonr(df['Volume'][:90], df['Next Month Close'][:90])
    print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
    
    EMA 3
    The Pearson Correlation Coefficient is 0.9518278081533298  with a P-value of P = 5.913915911148407e-47
    
    EMA 6
    The Pearson Correlation Coefficient is 0.9357719658423489  with a P-value of P = 1.3124438502636034e-41
    
    Volume
    The Pearson Correlation Coefficient is 0.7995337431531997  with a P-value of P = 3.465784507142274e-21
    

The other independent variables are all statistically significant, with p-values < 0.001, and their linear relationships with the target are very strong.

    NOTE:

Causation is a cause-and-effect relationship between two variables.

It is important to know the difference between correlation and causation: correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.

    Model Development

Before we continue, let's clarify the objective again: I'm using the 3- and 6-month EMAs, BTC's volume for the preceding month, and gold's close price for the preceding month to predict BTC's close price for the following month.

    x = Df[['ema3','ema6','Volume','Gold Close']]
    
    # Define the dependent variable
    y = Df['Next Month Close']
    

I am going to split the data: 80% will be used to train the model and 20% will be used to test the predictions made from that 80%.

    t = .8
    t = int(t*len(Df))
    
    # Train dataset
    x_train = x[:t]
    y_train = y[:t]
    
    # Test dataset
    x_test = x[t:]
    y_test = y[t:]
    
    print("number of test samples :", y_test.shape[0])
    print("number of training samples:",y_train.shape[0])
    
    number of test samples : 19
    number of training samples: 72
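Note that this is a chronological split (the last ~20% of months form the test set) rather than a random one, which is the sensible choice for time-ordered data. For reference, an equivalent split could also be done with scikit-learn (a sketch of an alternative, not what was run above; variable names here are placeholders):

from sklearn.model_selection import train_test_split

# shuffle=False keeps the rows in time order, so the last ~20% of months become the test set
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2, shuffle=False)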
    
    reg = linear_model.LinearRegression()
    reg.fit(x_train,y_train)
    
    LinearRegression()

The constant (intercept) came back negative, which is confusing, but I'll get back to this later.

    reg.intercept_ 
    
    -1262.8899087871669
    coeff_df = pd.DataFrame(reg.coef_.T, x.columns, columns=['Coefficient']) 
    coeff_df 
    
    Coefficient
    ema3 1.777304e+00
    ema6 -9.488261e-01
    Volume 9.555041e-10
    Gold Close 1.521999e+01

The multiple linear regression formula:

$$ y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{3} + \beta_{4} X_{4} $$

    print("Linear Regression model")
    print("BTC Price (y) = %.2f (constant) + %.2f * EM3 (x1) + %.2f * EMA6 (x2) + %.4f * Volume (x3) + %.2f * Gold Close (x4)" % (reg.intercept_,reg.coef_[0], reg.coef_[1],reg.coef_[2],reg.coef_[3]))
    
    Linear Regression model
BTC Price (y) = -1262.89 (constant) + 1.78 * EMA3 (x1) + -0.95 * EMA6 (x2) + 0.0000 * Volume (x3) + 15.22 * Gold Close (x4)
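As a quick sanity check (a sketch, not part of the original workflow), plugging one row of features into this formula by hand should reproduce reg.predict for that row:

# Sketch: reproduce the model's prediction for the first test row by hand
first_row = x_test.iloc[0]
manual_pred = reg.intercept_ + np.dot(reg.coef_, first_row.values)
print(manual_pred, reg.predict(x_test.iloc[[0]])[0])  # the two values should match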
    

    Model Evaluation

    In this step, I will evaluate the model's accuracy but before that happens I'm going to make the predictions first

    predicted_price = reg.predict(x_test)
    

The R-square is 0.05, which means the model's predictive power is poor; in fact, it is worse than I expected and explains little to nothing of the variation in the target variable.
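For reference, the R-square reported by r2_score is defined as:

$$ R^{2} = 1 - \frac{\sum_{i}(y_{i} - \hat{y}_{i})^{2}}{\sum_{i}(y_{i} - \bar{y})^{2}} $$

where $\hat{y}_{i}$ are the predictions and $\bar{y}$ is the mean of the actual test values.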

    test_r2_score = r2_score(y_test,predicted_price)
    print('The test R-square is: ', test_r2_score)
    
    The test R-square is:  0.053315838215583056
    

    I will also look at other evaluation methods

    test_r2_score = r2_score(y_test,predicted_price)
    print('The test R-square is: ', test_r2_score)
    
    test_mse = mean_squared_error(y_test, predicted_price)
    print('The test mean square error of target variable and predicted value is: ', test_mse)
    
    test_mae = mean_absolute_error(y_test, predicted_price)
    print('The test mean absolute error of target variable and predicted value is: ', test_mae)
    
    test_rmse=np.sqrt(test_mse)
    print('The test root mean square error of target variable and predicted value is: ', test_rmse)
    
    The test R-square is:  0.053315838215583056
    The test mean square error of target variable and predicted value is:  107191024.05803142
    The test mean absolute error of target variable and predicted value is:  8821.373009960726
    The test root mean square error of target variable and predicted value is:  10353.309811747711
    

Mean Squared Error (MSE) is the average of the squared differences between actual and predicted values. There is no single 'correct' value for MSE; simply put, the lower the value the better, and 0 means the model is perfect.

Mean Absolute Error (MAE) is the average of the absolute differences between actual and predicted values; in this case the average difference is about $8,821.
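For reference, the three error metrics used here are:

$$ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}, \qquad MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_{i} - \hat{y}_{i}\right|, \qquad RMSE = \sqrt{MSE} $$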

I'm going to evaluate further and see what other insights I can gather from the predicted prices. I'll start by creating a data frame with the predicted and actual prices so that I can plot them side by side.

btc = pd.DataFrame()

btc['Date'] = Df['Date'][t:]
btc['Predicted Close'] = predicted_price
btc['Actual Close'] = btc_close[t:].shift(-1)  # the actual close of the following month (the target)
btc
    
    Date Predicted Close Actual Close
    72 2020/10 13607.631908 19625.835938
    73 2020/11 17737.624104 29001.720703
    74 2020/12 25451.933856 33114.359375
    75 2021/01 31623.812022 45137.769531
    76 2021/02 40797.895059 58918.832031
    77 2021/03 52011.667637 57750.175781
    78 2021/04 55687.583652 37332.855469
    79 2021/05 43935.783612 35040.835938
    80 2021/06 35882.237824 41626.195312
    81 2021/07 36483.673993 47166.687500
    82 2021/08 40423.918840 43790.894531
    83 2021/09 39815.250546 61318.957031
    84 2021/10 50330.042450 57005.425781
    85 2021/11 51684.668062 46306.445312
    86 2021/12 45279.181353 38483.125000
    87 2022/01 37422.063673 43193.234375
    88 2022/02 36986.624235 45538.675781
    89 2022/03 38624.254997 37714.875000
    90 2022/04 34450.675253 31792.310547
    fig = px.line(btc, x="Date", y=btc.columns,              
                  title='Predicted Close Vs Actual Close')
    fig.update_xaxes(
        rangeslider_visible=True,
        rangeselector=dict(
            buttons=list([
                dict(count=1, label="1m", step="month", stepmode="backward"),
                dict(count=6, label="6m", step="month", stepmode="backward"),
                dict(count=1, label="YTD", step="year", stepmode="todate"),
                dict(count=1, label="1y", step="year", stepmode="backward"),
                dict(step="all")
            ])
        )
    )
    fig.update_layout(template = "plotly_dark")
    

We have already seen aggregate error measures (MSE, MAE); now I'll look at the month-by-month differences in absolute and percentage terms to see how far off things are each month. (Remember, a residual is the difference between the observed value and the estimated value.)

    btc['Residual'] = btc_close[t:] - btc['Actual Close'] #The difference in absolute $ terms
    btc['Residual in %'] = np.absolute(btc['Residual']/btc['Actual Close']*100) #The difference in % 
    btc
    
    Date Predicted Close Actual Close Residual Residual in %
    72 2020/10 13607.631908 19625.835938 -5844.840820 29.781360
    73 2020/11 17737.624104 29001.720703 -9375.884766 32.328719
    74 2020/12 25451.933856 33114.359375 -4112.638672 12.419502
    75 2021/01 31623.812022 45137.769531 -12023.410156 26.637138
    76 2021/02 40797.895059 58918.832031 -13781.062500 23.389911
    77 2021/03 52011.667637 57750.175781 1168.656250 2.023641
    78 2021/04 55687.583652 37332.855469 20417.320312 54.689951
    79 2021/05 43935.783612 35040.835938 2292.019531 6.540996
    80 2021/06 35882.237824 41626.195312 -6585.359375 15.820229
    81 2021/07 36483.673993 47166.687500 -5540.492188 11.746621
    82 2021/08 40423.918840 43790.894531 3375.792969 7.708892
    83 2021/09 39815.250546 61318.957031 -17528.062500 28.585063
    84 2021/10 50330.042450 57005.425781 4313.531250 7.566878
    85 2021/11 51684.668062 46306.445312 10698.980469 23.104733
    86 2021/12 45279.181353 38483.125000 7823.320312 20.329223
    87 2022/01 37422.063673 43193.234375 -4710.109375 10.904739
    88 2022/02 36986.624235 45538.675781 -2345.441406 5.150438
    89 2022/03 38624.254997 37714.875000 7823.800781 20.744602
    90 2022/04 34450.675253 31792.310547 5922.564453 18.628921

As you can see from the residuals, the differences are pretty large, but this is due to BTC being very volatile. Anything within a 7-10% difference (this is subjective, based on my observations from trading BTC) could be considered good, but expecting a residual consistently below 5% from an asset class this volatile would be very unlikely.
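For completeness, the model's own residuals (predicted close minus actual close) can also be pulled out directly from the data frame above; a minimal sketch:

# Sketch: model residuals, i.e. predicted close minus actual close for each test month
model_residuals = btc['Predicted Close'] - btc['Actual Close']
print(model_residuals.abs().mean())  # this average should line up with the test MAE reported earlier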

    Conclusion

If the model had been at least 50-60% accurate (an R-square of 0.50-0.60), I would have proceeded with backtesting and then taken the model live by predicting the close price of the current month (June 2022).

Linear regression is not a good model for predicting BTC/USD prices; it might have been more effective at predicting returns instead. There were many red flags, and based on the R-square alone I would never take this model live and risk real money on it.

The MSE was far too high and very far from 0; since MSE measures how close the fitted line is to the data points, this confirms a poor fit.

Another red flag was the negative constant: it implies that when all the independent variables are 0, the predicted BTC price would be -1262. A negative constant doesn't mean the regression made a mistake; rather, it reflects the data being modeled. Realistically, any security's price can go to 0, but no security's price can fall below 0 and turn negative (the value of a position in the asset can turn negative, but not the asset's price itself). This is also why I think predicting returns instead of prices would have been more accurate and much more realistic.
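As a rough illustration of how the target could be switched from prices to returns (a sketch only, reusing the column names and the btc_close series from this notebook; not a tested model):

# Sketch: predict next month's return instead of next month's price
returns_df = Df.copy()
returns_df['Close'] = btc_close  # re-attach the current month close saved earlier
returns_df['Next Month Return'] = (returns_df['Next Month Close'] - returns_df['Close']) / returns_df['Close']
returns_df = returns_df.dropna()

y_ret = returns_df['Next Month Return']  # new target: percentage change over the next month
x_ret = returns_df[['ema3', 'ema6', 'Volume', 'Gold Close']]

The features would probably also need to be expressed as changes or ratios rather than raw levels for this to be meaningful, but that is beyond the scope of this notebook.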

This is also a good example of how a machine learning model is only as useful as the features selected, and selecting the right features depends on the knowledge one has of the data set!