DISCLAIMER!

Before proceeding, please make sure that you note the following important information:

NOT FINANCIAL ADVICE!

My content is intended to be used, and must be used, for informational and educational purposes only. I am not an attorney, CPA, or financial advisor, nor am I holding myself out to be, and the information contained in this blog/notebook is not a substitute for financial advice. None of the information contained here constitutes an offer (or solicitation of an offer) to buy or sell any security or financial instrument, to make any investment, or to participate in any particular trading strategy. Always seek advice from a professional who is aware of the facts and circumstances of your individual situation, or independently research and verify any information that you find on my blog/notebook and wish to rely upon in making any investment decision or otherwise. I accept no liability whatsoever for any loss or damage you may incur.

import pandas as pd
import numpy as np
import yfinance as yf
from scipy import stats
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error, mean_absolute_error
from datetime import datetime
import plotly.express as px
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
from IPython.display import HTML
import warnings
warnings.simplefilter("ignore")

Import Data

#BTC price and volume

#Gold price will be imported later, when creating the independent variables

# Read data
Df = yf.download('BTC-USD', '2012-01-01', '2022-05-31', interval='1mo', auto_adjust=True)

# Keep only the Close and Volume columns
Df = Df[['Close','Volume']]

# Drop rows with missing values
Df = Df.dropna()
[*********************100%***********************]  1 of 1 completed

Data Cleaning and Wrangling

Before any modeling can be done, a few steps are needed to prepare the data before feeding it to the model, starting with arranging the data set in a way that makes sense.

df = Df.reset_index()
for i in ['Close', 'Volume']: 
      df[i]  =  df[i].astype('float64')
df
Date Close Volume
0 2014-10-01 338.321014 9.029944e+08
1 2014-11-01 378.046997 6.597334e+08
2 2014-12-01 320.192993 5.531023e+08
3 2015-01-01 217.464005 1.098812e+09
4 2015-02-01 254.263000 7.115187e+08
... ... ... ...
87 2022-01-01 38483.125000 9.239790e+11
88 2022-02-01 43193.234375 6.713360e+11
89 2022-03-01 45538.675781 8.309438e+11
90 2022-04-01 37714.875000 8.301159e+11
91 2022-05-01 31792.310547 1.105689e+12

92 rows × 3 columns

Because I'm working with monthly data, I'll drop the days in the date to avoid confusion

date_format = "%Y/%m"
df['Date'] = df['Date'].dt.strftime(date_format)
df
Date Close Volume
0 2014/10 338.321014 9.029944e+08
1 2014/11 378.046997 6.597334e+08
2 2014/12 320.192993 5.531023e+08
3 2015/01 217.464005 1.098812e+09
4 2015/02 254.263000 7.115187e+08
... ... ... ...
87 2022/01 38483.125000 9.239790e+11
88 2022/02 43193.234375 6.713360e+11
89 2022/03 45538.675781 8.309438e+11
90 2022/04 37714.875000 8.301159e+11
91 2022/05 31792.310547 1.105689e+12

92 rows × 3 columns

Bitcoin is often referred to as "digital gold" by its backers, so I'll add gold price and volume data as potential independent variables and explore further to see their relationship with BTC and whether or not they will be good predictors.

gold = yf.download('GLD', '2014-10-01', '2022-05-31', interval= '1mo', auto_adjust=True)
gold = gold[['Close','Volume']]
gld = gold.reset_index()
for i in ['Close', 'Volume']: 
      gld[i]  =  gld[i].astype('float64')
gld
[*********************100%***********************]  1 of 1 completed
Date Close Volume
0 2014-10-01 112.660004 155183900.0
1 2014-11-01 112.110001 147594200.0
2 2014-12-01 113.580002 153722200.0
3 2015-01-01 123.449997 198034100.0
4 2015-02-01 116.160004 125686200.0
... ... ... ...
87 2022-01-01 168.089996 211125100.0
88 2022-02-01 178.380005 254601300.0
89 2022-03-01 180.649994 377087100.0
90 2022-04-01 176.910004 195346400.0
91 2022-05-01 171.139999 179902200.0

92 rows × 3 columns

The other two independent variables will be the moving averages and volume

Moving averages are often used by technical analysts to keep track of price trends for specific securities. I'll use the 3- and 6-month exponential moving averages; whether the average is simple, weighted, or exponential generally doesn't make much of a difference (though this could be a good hypothesis to test).

Volume is also a well-known indicator of price movement. Trading volume is the total number of shares/units of a security traded during a given period of time.
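For reference, the exponential moving average computed below (pandas' ewm with adjust=False) follows the standard recursion, where $P_{t}$ is the close price in month $t$:

$$ EMA_{t} = \alpha \, P_{t} + (1 - \alpha)\, EMA_{t-1}, \qquad \alpha = \frac{2}{span + 1} $$

So the 3-month EMA uses $\alpha = 0.5$ and the 6-month EMA uses $\alpha \approx 0.29$.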

df['ema3'] = df['Close'].ewm(span=3, adjust=False).mean()
df['ema6'] = df['Close'].ewm(span=6, adjust=False).mean()
df
Date Close Volume ema3 ema6
0 2014/10 338.321014 9.029944e+08 338.321014 338.321014
1 2014/11 378.046997 6.597334e+08 358.184006 349.671295
2 2014/12 320.192993 5.531023e+08 339.188499 341.248923
3 2015/01 217.464005 1.098812e+09 278.326252 305.881804
4 2015/02 254.263000 7.115187e+08 266.294626 291.133574
... ... ... ... ... ...
87 2022/01 38483.125000 9.239790e+11 44519.703176 46247.967191
88 2022/02 43193.234375 6.713360e+11 43856.468776 45375.186387
89 2022/03 45538.675781 8.309438e+11 44697.572278 45421.897642
90 2022/04 37714.875000 8.301159e+11 41206.223639 43219.891173
91 2022/05 31792.310547 1.105689e+12 36499.267093 39954.868137

92 rows × 5 columns

df['Gold Close'] = gld['Close']
df['Gold Volume'] = gld['Volume']
df
Date Close Volume ema3 ema6 Gold Close Gold Volume
0 2014/10 338.321014 9.029944e+08 338.321014 338.321014 112.660004 155183900.0
1 2014/11 378.046997 6.597334e+08 358.184006 349.671295 112.110001 147594200.0
2 2014/12 320.192993 5.531023e+08 339.188499 341.248923 113.580002 153722200.0
3 2015/01 217.464005 1.098812e+09 278.326252 305.881804 123.449997 198034100.0
4 2015/02 254.263000 7.115187e+08 266.294626 291.133574 116.160004 125686200.0
... ... ... ... ... ... ... ...
87 2022/01 38483.125000 9.239790e+11 44519.703176 46247.967191 168.089996 211125100.0
88 2022/02 43193.234375 6.713360e+11 43856.468776 45375.186387 178.380005 254601300.0
89 2022/03 45538.675781 8.309438e+11 44697.572278 45421.897642 180.649994 377087100.0
90 2022/04 37714.875000 8.301159e+11 41206.223639 43219.891173 176.910004 195346400.0
91 2022/05 31792.310547 1.105689e+12 36499.267093 39954.868137 171.139999 179902200.0

92 rows × 7 columns

Now I'm going to generate the dependent/target variable that I'm going to try to predict.

df['Next Month Close'] = df['Close'].shift(-1)
df
Date Close Volume ema3 ema6 Gold Close Gold Volume Next Month Close
0 2014/10 338.321014 9.029944e+08 338.321014 338.321014 112.660004 155183900.0 378.046997
1 2014/11 378.046997 6.597334e+08 358.184006 349.671295 112.110001 147594200.0 320.192993
2 2014/12 320.192993 5.531023e+08 339.188499 341.248923 113.580002 153722200.0 217.464005
3 2015/01 217.464005 1.098812e+09 278.326252 305.881804 123.449997 198034100.0 254.263000
4 2015/02 254.263000 7.115187e+08 266.294626 291.133574 116.160004 125686200.0 244.223999
... ... ... ... ... ... ... ... ...
87 2022/01 38483.125000 9.239790e+11 44519.703176 46247.967191 168.089996 211125100.0 43193.234375
88 2022/02 43193.234375 6.713360e+11 43856.468776 45375.186387 178.380005 254601300.0 45538.675781
89 2022/03 45538.675781 8.309438e+11 44697.572278 45421.897642 180.649994 377087100.0 37714.875000
90 2022/04 37714.875000 8.301159e+11 41206.223639 43219.891173 176.910004 195346400.0 31792.310547
91 2022/05 31792.310547 1.105689e+12 36499.267093 39954.868137 171.139999 179902200.0 NaN

92 rows × 8 columns

btc_close = df['Close'] # Save the close price just in case I need it later

# Then remove the current month's BTC close price so that only the relevant columns remain
df.drop(columns='Close', inplace=True)
Df = df.dropna()

# Now I should have a clean dataframe ready for some EDA
Df
Date Volume ema3 ema6 Gold Close Gold Volume Next Month Close
0 2014/10 9.029944e+08 338.321014 338.321014 112.660004 155183900.0 378.046997
1 2014/11 6.597334e+08 358.184006 349.671295 112.110001 147594200.0 320.192993
2 2014/12 5.531023e+08 339.188499 341.248923 113.580002 153722200.0 217.464005
3 2015/01 1.098812e+09 278.326252 305.881804 123.449997 198034100.0 254.263000
4 2015/02 7.115187e+08 266.294626 291.133574 116.160004 125686200.0 244.223999
... ... ... ... ... ... ... ...
86 2021/12 9.570472e+11 50556.281353 49353.904068 170.960007 151214100.0 38483.125000
87 2022/01 9.239790e+11 44519.703176 46247.967191 168.089996 211125100.0 43193.234375
88 2022/02 6.713360e+11 43856.468776 45375.186387 178.380005 254601300.0 45538.675781
89 2022/03 8.309438e+11 44697.572278 45421.897642 180.649994 377087100.0 37714.875000
90 2022/04 8.301159e+11 41206.223639 43219.891173 176.910004 195346400.0 31792.310547

91 rows × 7 columns

Exploratory Data Analysis

Intuitively, I know that traders like to use the EMA lines and volume to predict BTC price. But as I mentioned before, here I get the chance to explore whether gold price and its volume can help predict BTC price.

The Pearson correlation coefficient and p-value

Pearson Correlation:

Correlation between two sets of data is a measure of how well they are related. The most common measure of correlation in statistics is the Pearson correlation; its full name is the Pearson Product Moment Correlation (PPMC). It shows the linear relationship between two sets of data. In simple terms, it answers the question: can I draw a line graph to represent the data?

It is a number between –1 and 1 that measures the strength and direction of the relationship between two variables, where:

  • 1: Perfect positive linear correlation.
  • 0: No linear correlation, the two variables most likely do not affect each other.
  • -1: Perfect negative linear correlation.
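For two samples $x$ and $y$, the coefficient is computed as:

$$ r = \frac{\sum_{i}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum_{i}(x_{i} - \bar{x})^{2}}\;\sqrt{\sum_{i}(y_{i} - \bar{y})^{2}}} $$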
P-Value:

A p-value measures the probability of obtaining the observed results, assuming that the null hypothesis is true. The lower the p-value, the greater the statistical significance of the observed difference. A p-value of 0.05 or lower is generally considered statistically significant, which (loosely speaking) means we can be 95% confident that the correlation between the variables is significant.

By convention:

• p-value < 0.001: strong evidence that the correlation is significant.
• p-value < 0.05: moderate evidence that the correlation is significant.
• p-value < 0.1: weak evidence that the correlation is significant.
• p-value > 0.1: no evidence that the correlation is significant.

Two things to keep in mind when interpreting the results:

• The null hypothesis is that the two variables are uncorrelated.
• The p-value is printed in scientific notation; its decimal form is e.g. 4.2e-7 = 0.00000042.

# Exclude the last rows to avoid errors from the NaN value in 'Next Month Close'
pearson_coef, p_value = stats.pearsonr(df['Gold Close'][:90], df['Next Month Close'][:90])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
    
    The Pearson Correlation Coefficient is 0.7706711768570538  with a P-value of P = 6.555477090301474e-19
    

In this case, the p-value is < 0.001, so there is strong evidence that the correlation between gold price and BTC price is statistically significant, and the linear relationship is quite strong (0.77, close to 1).

    pearson_coef, p_value = stats.pearsonr(df['Gold Volume'][:90], df['Next Month Close'][:90])
    print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
    
    The Pearson Correlation Coefficient is 0.08476385450593574  with a P-value of P = 0.42700282567060693
    

The p-value is > 0.1, so there is no evidence that the correlation between gold volume and BTC price is statistically significant, and there is essentially no linear relationship (0.08, close to 0).

Visually, we can see that there is almost no linear relationship between gold volume and BTC price:

    fig = px.scatter(
        df, x='Gold Volume', y='Next Month Close', opacity=0.65,
        trendline='ols', trendline_color_override='firebrick'
    )
    fig.update_layout(template = "plotly_dark")
    fig.show()
    

So we now know that we can use gold price but not gold volume; volume would not have added anything to the model, and if anything it would probably have hurt the prediction.

    df.drop(columns='Gold Volume', inplace=True)
    df.head()
    
    Date Volume ema3 ema6 Gold Close Next Month Close
    0 2014/10 9.029944e+08 338.321014 338.321014 112.660004 378.046997
    1 2014/11 6.597334e+08 358.184006 349.671295 112.110001 320.192993
    2 2014/12 5.531023e+08 339.188499 341.248923 113.580002 217.464005
    3 2015/01 1.098812e+09 278.326252 305.881804 123.449997 254.263000
    4 2015/02 7.115187e+08 266.294626 291.133574 116.160004 244.223999

What about the other independent variables?

    print('EMA 3')
    pearson_coef, p_value = stats.pearsonr(df['ema3'][:90], df['Next Month Close'][:90])
    print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
    
    print('\nEMA 6')
    pearson_coef, p_value = stats.pearsonr(df['ema6'][:90], df['Next Month Close'][:90])
    print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
    
    print('\nVolume')
    pearson_coef, p_value = stats.pearsonr(df['Volume'][:90], df['Next Month Close'][:90])
    print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
    
    EMA 3
    The Pearson Correlation Coefficient is 0.9518278081533298  with a P-value of P = 5.913915911148407e-47
    
    EMA 6
    The Pearson Correlation Coefficient is 0.9357719658423489  with a P-value of P = 1.3124438502636034e-41
    
    Volume
    The Pearson Correlation Coefficient is 0.7995337431531997  with a P-value of P = 3.465784507142274e-21
    

The other independent variables are all statistically significant, with p-values < 0.001, and their linear relationships with the target are very strong.

    NOTE:

Causation is a cause-and-effect relationship between two variables.

It is important to know the difference between correlation and causation: correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.

    Model Development

Before we continue, let's clarify the objective again: I'm using the 3- and 6-month EMAs, BTC's volume for the preceding month, and gold's close price for the preceding month to predict BTC's close price for the following month.

    x = Df[['ema3','ema6','Volume','Gold Close']]
    
    # Define the dependent variable
    y = Df['Next Month Close']
    

I am going to split the data: 80% will be used to train the model and 20% will be used to test the predictions made from that 80%.

    t = .8
    t = int(t*len(Df))
    
    # Train dataset
    x_train = x[:t]
    y_train = y[:t]
    
    # Test dataset
    x_test = x[t:]
    y_test = y[t:]
    
    print("number of test samples :", y_test.shape[0])
    print("number of training samples:",y_train.shape[0])
    
    number of test samples : 19
    number of training samples: 72
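Note that this is a chronological split (the last ~20% of months form the test set) rather than a random one, which is the sensible choice for time-ordered data. For reference, an equivalent split could also be done with scikit-learn (a sketch of an alternative, not what was run above; variable names here are placeholders):

from sklearn.model_selection import train_test_split

# shuffle=False keeps the rows in time order, so the last ~20% of months become the test set
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2, shuffle=False)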
    
    reg = linear_model.LinearRegression()
    reg.fit(x_train,y_train)
    
    LinearRegression()

The constant (intercept) came back negative, which is confusing, but I'll get back to this later.

    reg.intercept_ 
    
    -1262.8899087871669
    coeff_df = pd.DataFrame(reg.coef_.T, x.columns, columns=['Coefficient']) 
    coeff_df 
    
    Coefficient
    ema3 1.777304e+00
    ema6 -9.488261e-01
    Volume 9.555041e-10
    Gold Close 1.521999e+01

The multiple linear regression formula:

$$ y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{3} + \beta_{4} X_{4} $$

    print("Linear Regression model")
    print("BTC Price (y) = %.2f (constant) + %.2f * EM3 (x1) + %.2f * EMA6 (x2) + %.4f * Volume (x3) + %.2f * Gold Close (x4)" % (reg.intercept_,reg.coef_[0], reg.coef_[1],reg.coef_[2],reg.coef_[3]))
    
    Linear Regression model
BTC Price (y) = -1262.89 (constant) + 1.78 * EMA3 (x1) + -0.95 * EMA6 (x2) + 0.0000 * Volume (x3) + 15.22 * Gold Close (x4)
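As a quick sanity check (a sketch, not part of the original workflow), plugging one row of features into this formula by hand should reproduce reg.predict for that row:

# Sketch: reproduce the model's prediction for the first test row by hand
first_row = x_test.iloc[0]
manual_pred = reg.intercept_ + np.dot(reg.coef_, first_row.values)
print(manual_pred, reg.predict(x_test.iloc[[0]])[0])  # the two values should match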
    

    Model Evaluation

    In this step, I will evaluate the model's accuracy but before that happens I'm going to make the predictions first

    predicted_price = reg.predict(x_test)
    

The R-square is 0.05, which means the model's predictive power is poor; in fact, it is worse than I expected and explains little to nothing of the variation in the target variable.
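For reference, the R-square reported by r2_score is defined as:

$$ R^{2} = 1 - \frac{\sum_{i}(y_{i} - \hat{y}_{i})^{2}}{\sum_{i}(y_{i} - \bar{y})^{2}} $$

where $\hat{y}_{i}$ are the predictions and $\bar{y}$ is the mean of the actual test values.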

    test_r2_score = r2_score(y_test,predicted_price)
    print('The test R-square is: ', test_r2_score)
    
    The test R-square is:  0.053315838215583056
    

    I will also look at other evaluation methods

    test_r2_score = r2_score(y_test,predicted_price)
    print('The test R-square is: ', test_r2_score)
    
    test_mse = mean_squared_error(y_test, predicted_price)
    print('The test mean square error of target variable and predicted value is: ', test_mse)
    
    test_mae = mean_absolute_error(y_test, predicted_price)
    print('The test mean absolute error of target variable and predicted value is: ', test_mae)
    
    test_rmse=np.sqrt(test_mse)
    print('The test root mean square error of target variable and predicted value is: ', test_rmse)
    
    The test R-square is:  0.053315838215583056
    The test mean square error of target variable and predicted value is:  107191024.05803142
    The test mean absolute error of target variable and predicted value is:  8821.373009960726
    The test root mean square error of target variable and predicted value is:  10353.309811747711
    

Mean Squared Error (MSE) is the average of the squared differences between actual and predicted values. There is no single 'correct' value for MSE; simply put, the lower the value the better, and 0 means the model is perfect.

Mean Absolute Error (MAE) is the average of the absolute differences between actual and predicted values; in this case the average difference is about $8,821.
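For reference, the three error metrics used here are:

$$ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}, \qquad MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_{i} - \hat{y}_{i}\right|, \qquad RMSE = \sqrt{MSE} $$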

I'm going to evaluate further and see what other insights I can gather from the predicted prices. I'll start by creating a data frame with the predicted and actual prices so that I can plot them side by side.

btc = pd.DataFrame()

btc['Date'] = Df['Date'][t:]
btc['Predicted Close'] = predicted_price
btc['Actual Close'] = btc_close[t:].shift(-1)  # the actual close of the following month (the target)
btc
    
    Date Predicted Close Actual Close
    72 2020/10 13607.631908 19625.835938
    73 2020/11 17737.624104 29001.720703
    74 2020/12 25451.933856 33114.359375
    75 2021/01 31623.812022 45137.769531
    76 2021/02 40797.895059 58918.832031
    77 2021/03 52011.667637 57750.175781
    78 2021/04 55687.583652 37332.855469
    79 2021/05 43935.783612 35040.835938
    80 2021/06 35882.237824 41626.195312
    81 2021/07 36483.673993 47166.687500
    82 2021/08 40423.918840 43790.894531
    83 2021/09 39815.250546 61318.957031
    84 2021/10 50330.042450 57005.425781
    85 2021/11 51684.668062 46306.445312
    86 2021/12 45279.181353 38483.125000
    87 2022/01 37422.063673 43193.234375
    88 2022/02 36986.624235 45538.675781
    89 2022/03 38624.254997 37714.875000
    90 2022/04 34450.675253 31792.310547
    fig = px.line(btc, x="Date", y=btc.columns,              
                  title='Predicted Close Vs Actual Close')
    fig.update_xaxes(
        rangeslider_visible=True,
        rangeselector=dict(
            buttons=list([
                dict(count=1, label="1m", step="month", stepmode="backward"),
                dict(count=6, label="6m", step="month", stepmode="backward"),
                dict(count=1, label="YTD", step="year", stepmode="todate"),
                dict(count=1, label="1y", step="year", stepmode="backward"),
                dict(step="all")
            ])
        )
    )
    fig.update_layout(template = "plotly_dark")
    

We have already seen aggregate error measures (MSE, MAE); now I'll look at the month-by-month differences in absolute and percentage terms to see how far off things are each month. (Remember, a residual is the difference between the observed value and the estimated value.)

    btc['Residual'] = btc_close[t:] - btc['Actual Close'] #The difference in absolute $ terms
    btc['Residual in %'] = np.absolute(btc['Residual']/btc['Actual Close']*100) #The difference in % 
    btc
    
    Date Predicted Close Actual Close Residual Residual in %
    72 2020/10 13607.631908 19625.835938 -5844.840820 29.781360
    73 2020/11 17737.624104 29001.720703 -9375.884766 32.328719
    74 2020/12 25451.933856 33114.359375 -4112.638672 12.419502
    75 2021/01 31623.812022 45137.769531 -12023.410156 26.637138
    76 2021/02 40797.895059 58918.832031 -13781.062500 23.389911
    77 2021/03 52011.667637 57750.175781 1168.656250 2.023641
    78 2021/04 55687.583652 37332.855469 20417.320312 54.689951
    79 2021/05 43935.783612 35040.835938 2292.019531 6.540996
    80 2021/06 35882.237824 41626.195312 -6585.359375 15.820229
    81 2021/07 36483.673993 47166.687500 -5540.492188 11.746621
    82 2021/08 40423.918840 43790.894531 3375.792969 7.708892
    83 2021/09 39815.250546 61318.957031 -17528.062500 28.585063
    84 2021/10 50330.042450 57005.425781 4313.531250 7.566878
    85 2021/11 51684.668062 46306.445312 10698.980469 23.104733
    86 2021/12 45279.181353 38483.125000 7823.320312 20.329223
    87 2022/01 37422.063673 43193.234375 -4710.109375 10.904739
    88 2022/02 36986.624235 45538.675781 -2345.441406 5.150438
    89 2022/03 38624.254997 37714.875000 7823.800781 20.744602
    90 2022/04 34450.675253 31792.310547 5922.564453 18.628921

As you can see from the residuals, the differences are pretty large, but this is due to BTC being very volatile. Anything within a 7-10% difference (this is subjective, based on my observations from trading BTC) could be considered good, but expecting a residual consistently below 5% from an asset class this volatile would be very unlikely.
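For completeness, the model's own residuals (predicted close minus actual close) can also be pulled out directly from the data frame above; a minimal sketch:

# Sketch: model residuals, i.e. predicted close minus actual close for each test month
model_residuals = btc['Predicted Close'] - btc['Actual Close']
print(model_residuals.abs().mean())  # this average should line up with the test MAE reported earlier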

    Conclusion

If the model had been at least 50-60% accurate (an R-square of 0.50-0.60), I would have proceeded with backtesting and then taken the model live by predicting the close price of the current month (June 2022).

Linear regression is not a good model for predicting BTC/USD prices; it might have been more effective at predicting returns instead. There were many red flags, and based on the R-square alone I would never take this model live and risk real money on it.

The MSE was far too high and very far from 0; since MSE measures how close the fitted line is to the data points, this confirms a poor fit.

Another red flag was the negative constant: it implies that when all the independent variables are 0, the predicted BTC price would be -1262. A negative constant doesn't mean the regression made a mistake; rather, it reflects the data being modeled. Realistically, any security's price can go to 0, but no security's price can fall below 0 and turn negative (the value of a position in the asset can turn negative, but not the asset's price itself). This is also why I think predicting returns instead of prices would have been more accurate and much more realistic.
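As a rough illustration of how the target could be switched from prices to returns (a sketch only, reusing the column names and the btc_close series from this notebook; not a tested model):

# Sketch: predict next month's return instead of next month's price
returns_df = Df.copy()
returns_df['Close'] = btc_close  # re-attach the current month close saved earlier
returns_df['Next Month Return'] = (returns_df['Next Month Close'] - returns_df['Close']) / returns_df['Close']
returns_df = returns_df.dropna()

y_ret = returns_df['Next Month Return']  # new target: percentage change over the next month
x_ret = returns_df[['ema3', 'ema6', 'Volume', 'Gold Close']]

The features would probably also need to be expressed as changes or ratios rather than raw levels for this to be meaningful, but that is beyond the scope of this notebook.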

This is also a good example of how a machine learning model is only as useful as the features selected, and selecting the right features depends on the knowledge one has of the data set!