Can We Accurately Predict the Price Movement of Bitcoin?

Bitcoin is a speculative instrument (I hesitate to call it an asset outright, as the financial community has not settled on that label yet) that was created in 2009. It belongs to a family of speculative instruments called cryptocurrencies, which share many properties: transactions are recorded on a blockchain database, there is no physical counterpart to the instrument, etc.

In its early days, Bitcoin (henceforth BTC) was a very obscure name and was rarely mentioned in the mainstream. Since 2015, however, BTC became far more popular as an investment vehicle and many became rich off of it. Since then, many have tried to predict BTC movements in the hope of becoming rich as well. However, after the pre-pandemic peak on Dec 17, 2017, BTC became less attractive to the common investor as it smelled like a speculative trap. Since BTC does not produce anything in the real economy, it is not backed by any economic phenomenon, and so in the pre-pandemic era it seemed to follow a trajectory similar to the Tulip Mania of 1600s Netherlands.

However, since the pandemic began, we have seen massive BTC price inflation, and hence more people are once again interested in using BTC as an investment vehicle to become rich. In this notebook, we explore the properties of BTC (namely, can we predict prices or their movement so as to generate trading profits?) and eventually try to build a predictive network that can assist us in becoming rich as well.

This notebook also serves as a fantastic introduction to the data science pipeline, particularly its applications in a finance setting, as defined by:

  1. Data Curation
  2. Data Management and Representation
  3. Exploratory Data Analysis (and Hypothesis Testing)
  4. Machine Learning
  5. Generating Insights

Thus, it is up to you, the reader, to decide what to take away from this notebook. It serves a dual purpose: exploring the nature of BTC and introducing you to the overall data science pipeline.

Set Up

In this notebook, we will be using the following libraries. Because TensorFlow is typically not pre-installed on most machines, I have included code below that, when executed, will install TensorFlow into the current Jupyter kernel.
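For example, a minimal install cell might look like this (a sketch; the original cell is not shown, and it assumes a pip-based environment):

```python
# Install TensorFlow into the environment used by the current Jupyter kernel.
# %pip targets the kernel's own Python, unlike a plain shell pip call.
%pip install tensorflow
```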

Importing the Libraries

pandas Pandas is a data storage and manipulation library. Data is stored in 2-D dataframes which are similar to Excel sheets. For us, we are using pandas dataframes to store the Date, Open Price, High Price, Low Price, Close Price, and Volume data.

numpy Numpy is a scientific computing library. Numpy stores data in n-dimensional arrays which are very similar to MATLAB's matrices/vectors. Being a scientific computing library, Numpy optimizes computation speed. For us, using the scientific computing stack of numpy is ideal for training any ML model, since a lot of computation is bound to occur.

matplotlib Matplotlib is a plotting library. For us, we use it to plot any data that we need to observe visually.

sklearn Scikit Learn is a popular Machine Learning library that contains, for us, a large number of pre-designed models whose hyperparameters we can tune and deploy with ease.

tensorflow Tensorflow is a popular Deep learning library that contains various functions and modules we can use to design deep networks with relative ease.

datetime Datetime is a library that allows us to convert string dates into date objects, which have greater functionality (for example, we can directly compare two datetime objects to determine which date is earlier).

statsmodels Statsmodels is a popular statistical library whose OLS functionality we use to generate linear regressions.
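For reference, a typical import cell for this stack might look like the following (a sketch; the exact imports in the original notebook may differ):

```python
import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import tensorflow as tf
from sklearn.model_selection import train_test_split
```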

Data Collection and Curation

Nature of Our Datasets

Bitcoin data is relatively sparse. Some datasets cover 2014-2017, some cover 2020-2021, but generally speaking there isn't one dataset that encompasses all of the price/volume data of Bitcoin we need. Hence, I used 2 overlapping datasets to build a richer aggregate dataset, giving us a full dataset going back to 2013.

  1. Dataset 1: BTC-USD.csv Our first dataset (dataset1), stored in BTC-USD.csv, contains the price and volume data of Bitcoin between April 10, 2020 and May 15, 2021.

BTC-USD.csv can be retrieved from https://finance.yahoo.com/quote/BTC-USD/history?period1=1410825600&period2=1621123200&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true

  2. Dataset 2: BTCUSD_day.csv Our second dataset (dataset2), stored in BTCUSD_day.csv, contains price and volume data of Bitcoin between April 28, 2013 and April 10, 2020.

BTCUSD_day.csv can be retrieved from https://www.kaggle.com/prasoonkottarathil/btcinusd?select=BTCUSD_day.csv

Loading the Data
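A sketch of the loading step (assuming both CSVs sit in the working directory; the original cell is not shown):

```python
import pandas as pd

# Load both datasets; each has a Date column plus OHLC prices and volume.
dataset1 = pd.read_csv("BTC-USD.csv")      # Yahoo Finance, Apr 2020 - May 2021
dataset2 = pd.read_csv("BTCUSD_day.csv")   # Kaggle, Apr 2013 - Apr 2020

dataset1.head()
```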

Cleaning the Data

Although our data is relatively well curated with Open, High, Low, and Close prices along with volume data, there are still lingering issues with our datasets.

In this section, we clean up the data such that we can then explore it with ease and eventually train a model to predict prices. This section is broken down into the issues addressed below.

Issue 1: Volume Mismatch

As we can see above,

dataset1 consists of the following columns: Date, Open, High, Low, Close, Adjusted Close, and Volume

while,

dataset2 consists of the following columns: Date, Symbol, Open, High, Low, Close, Volume BTC, Volume USD

There are 2 volumes present in dataset2, while only 1 volume is present in dataset1. This is naturally a problem, as any predictive model we build will need a singular definition of volume. Hence, we need to "standardize" which volume is the correct volume.

If we look closer, we can see that Volume as defined in dataset1 corresponds to Volume USD as defined in dataset2. Hence, we can clean up this data by removing the BTC volume from dataset2.
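A sketch of that standardization, assuming the column names listed above:

```python
# Keep the USD-denominated volume and align the column name with dataset1.
dataset2 = dataset2.drop(columns=["Volume BTC"])
dataset2 = dataset2.rename(columns={"Volume USD": "Volume"})
```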

Issue 2: Extraneous Columns and Reverse Order

There are 2 primary issues with dataset2: it contains an extraneous Symbol column, and its rows are ordered from latest to oldest.

The only issue with dataset1 is that we have an extraneous column of Adj. Close which we need to remove.
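First, a sketch of the dataset2 cleanup (the original cell is not shown, so the exact calls are an assumption):

```python
# dataset2: drop the extraneous Symbol column and restore oldest-to-latest order.
dataset2 = dataset2.drop(columns=["Symbol"])
dataset2 = dataset2.iloc[::-1].reset_index(drop=True)
```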

As we can now see, the extraneous Symbol column has been removed and we have reset the order to oldest-to-latest for dataset2.

Below, we now drop the Adj Close column from dataset1
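A sketch of that step:

```python
# dataset1: drop the extraneous Adj Close column.
dataset1 = dataset1.drop(columns=["Adj Close"])
```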

Issue 3: Last Row Dataset 2

As we can see below, we have 2 copies of the data for April 10th, 2020. Because dataset2's April 10th, 2020 row has Volume = 0, we will remove that row, as it is illogical.

NOTE: Volume = 0 is possible for any trading instrument; however, it is highly unlikely given that people were regularly trading BTC at the time. The Volume = 0 we see in dataset2 is therefore most likely corrupted data, so we go forward keeping only dataset1's April 10th, 2020 row.
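A sketch of the removal, with the row index taken from the output referenced above:

```python
# Drop dataset2's duplicated April 10th, 2020 row (the one with Volume = 0);
# index 1646 is the index shown in the output above.
dataset2 = dataset2.drop(index=1646)
```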

As you can see above, we have dropped index 1646, which corresponded to the April 10th, 2020 row of dataset2.

Issue 4: Merging the Two Datasets

As we want to eventually train a full-fledged model, we need to merge our datasets such that we only have 1 dataset which we can pass into a model.

Having only 1 dataset is also useful for initial observations of the data's distribution and will help us greatly in our Exploratory Data Analysis.
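A sketch of the merge, assuming the two cleaned dataframes now share identical columns:

```python
# Stack dataset2 (2013-2020) on top of dataset1 (2020-2021) into one dataframe.
df = pd.concat([dataset2, dataset1], ignore_index=True)
df.tail()
```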

Issue 5: Ensuring Type Compatibility

We want to ensure that our Open, High, Low, Close prices and Volume are all numerical values. As shown below, we can see that they are all of numpy.float64 types.

A key fact above is that currently, the Date column is filled with some object type of data, most likely string. We confirm this via:
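A sketch of that check:

```python
# Inspect the column dtypes; Date shows up as a generic object (string) column.
print(df.dtypes)
print(type(df["Date"].iloc[0]))
```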

Given this, we want to convert these string values into datetime objects. Datetime objects allow us to access the year, month, and day as individual fields, which saves us a lot of headache later: in their current form, we would have to parse the string every time we want one of those fields. If we preprocess now and convert all of these strings to datetime objects, it will be easier for us later down the line whenever we want to know the day, month, or year. Hence, we will convert these string-based dates to datetime objects.
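A sketch of the conversion; pandas stores the result as datetime64, whose .dt accessor exposes year, month, and day:

```python
# Convert the Date strings into proper datetime objects.
df["Date"] = pd.to_datetime(df["Date"])

# Year, month, and day are now directly accessible as fields.
print(df["Date"].dt.year.head())
```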

Issue 6: Missing Data

Arguably the most important of all the issues.

We know that there is bound to be data missing in our data set. Rarely in data science do we get perfectly well-curated data, as shown by the last 5 issues that we corrected. Hence, we need to explore the missing data, how it is represented, and ask critically why the data is missing.

The key questions we need to ask regarding the missing data are: what form does the missing data take, which dates are affected, which columns are affected, and are there any underlying correlations in the missingness?

Querying the Missing Data

Let's answer the first 3 questions:

To answer any of these questions, we need to query this missing data, and to do that we need to know what form the missing data takes. A hint we got in earlier sections is that our missing data might have taken 2 forms: NaN entries across all columns, and 0 entries in the Volume column.

Given these properties, we can check where the data is NaN and where the data is 0.
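A sketch of that query, assuming the merged dataframe df from the previous section:

```python
# Rows where any column is NaN.
print(df[df.isna().any(axis=1)])

# Rows where the reported volume is exactly 0.
print(df[df["Volume"] == 0])
```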

We can see above that the data is NaN for 5 dates across all columns, and the data is 0 in the Volume column for 3 dates. We also see that wherever 0 is the placeholder, only the volume data is missing.

Hence we can answer all of our pertinent questions for this section:

Hypothesizing Why the Data is Missing

Let's answer our last question now: are there any underlying correlations in the missing data? (Classifying our missing data as MAR, MCAR, or MNAR.)

The NaN Missing Data

When speculating why this trading data is missing, the immediate reason we can suspect is that the missing data is linked to specific holidays.

We can observe this via the string of missing trading data, replaced by NaN, between October 9th, 2020 and October 13th, 2020. October 12th, 2020, another missing entry in our data, was Columbus Day, and many markets treated the entire stretch from that Friday onward as Columbus Day weekend. Hence, there seems to be a strong correlation, and in fact a plausible causation: the missing data in that interval is due to Columbus Day weekend leading most traders to take a vacation, hence the lack of trading data.

This reasoning is not uncommon for trading data, and indeed for Time Series data in general. An unrelated example: the New York Stock Exchange has operating hours of 9:30 am - 4:00 pm. Hence, if we were analysing per-minute price data of a specific stock over a full day, we would get quite a few rows with missing data, as the stock market is not actively trading at, for example, 3:00 am. This missingness is unrelated to the actual observed variables; it is driven by unobserved variables.

Going off this hypothesis, it seems as though the missing data that is NaN is uncorrelated with the observed trading data, as price and volume are not causing the missingness. Hence, we would classify this missing data as Missing Completely at Random (MCAR), since the probability that data is missing is uncorrelated with the values in our data set. We therefore cannot model this missingness and can mostly ignore it.

The 0 Missing Data

It is unclear exactly why this data is missing. Cross-referencing other data sources shows that there was trading volume present on the 3 missing days: 08/26/2017, 08/12/2018, and 10/16/2018.

We can hypothesize that this might just be corrupt data, as there are only 3 occurrences of Volume = 0. Although this is a simplistic hypothesis for why this data is missing, it might be impossible to determine whether there is a non-random pattern behind it, as there are very few data points with missing volume values.

Hence, we can classify this volume data as Missing Completely at Random (MCAR) as well.

Dealing with the Missing Data

Now that we know we can classify all the missing data as MCAR, we can determine what to do with it. There are 3 typical approaches to dealing with missing data:

  1. Remove the missing data
  2. Impute the missing data (replace the missing data)
  3. Encode the missing data (tell our model to ignore certain components of the missing data)

Dealing with the NaN Data

Because the NaN rows have every column missing, we can directly remove them: it is effectively impossible to encode or impute them, since there is no observed variable value present in those rows from which we could impute the others.
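A sketch of the removal step:

```python
# Drop the rows whose price and volume values are all NaN.
df = df.dropna().reset_index(drop=True)
```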

Dealing with the 0 Data

We can get the missing volume data from online databases. As such, we will use Yahoo Finance and manually impute the data from a manual online query, rather than downloading yet another file to fill in three values.
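A sketch of that manual imputation; the volume numbers below are placeholders standing in for the values retrieved from the Yahoo Finance lookup:

```python
# Manually imputed volumes for the three zero-volume dates.
# The numbers here are placeholders -- substitute the values from Yahoo Finance.
manual_volumes = {
    "2017-08-26": 1.0e9,   # placeholder
    "2018-08-12": 1.0e9,   # placeholder
    "2018-10-16": 1.0e9,   # placeholder
}

for date_str, vol in manual_volumes.items():
    df.loc[df["Date"] == pd.Timestamp(date_str), "Volume"] = vol
```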

Full Definition of the Data

Since we have cleaned up the data, let's now give an aggregate definition of the data which we can refer back to moving forward.

  1. Date: Represents the date on which this data is recorded
  2. Open: Represents the starting/opening price when trading began on a specific day
  3. High: Represents the highest price recorded during trading on a specific trading day
  4. Low: Represents the lowest price recorded during trading on a specific trading day
  5. Close: Represents the ending/closing price when trading ended on a specific day
  6. Volume: Represents the aggregate value of BTC traded on a specific day, measured in USD

Exploratory Data Analysis

Let's now try to observe the underlying data distribution. This will provide us with insights as to what type of ML system to use, the potential biases, and how to approach overall prediction.

Visualizing Evolution of BTC Prices

Let's first look at the general progression of BTC prices over time. Since there are 4 measurements of price (Open, Low, High, Close), we will plot all 4 as individual lines.
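A sketch of that plot, assuming the cleaned dataframe df from the previous section:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 6))
for col in ["Open", "High", "Low", "Close"]:
    ax.plot(df["Date"], df[col], label=col)

ax.set_xlabel("Date")
ax.set_ylabel("Price (USD)")
ax.set_title("BTC Prices Over Time")
ax.legend()
plt.show()
```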

A key fact we see here, which might seem obvious to the average trader, is that these prices are heavily correlated with each other. This makes sense: the Open price builds on the prior day's Close price, and any High and Low prices achieved that day are dependent upon the Open price, as we are not going to see massive stochasticity directly from the price data.

Although this may seem like an obvious fact, it is always wise to ensure that any expected correlations you might see in your data actually manifest in EDA.

Exploring Both Volume and Price

As we can see above, all 4 price metrics roughly match each other. Hence, we can choose a singular representative price metric for all 4. We will choose the Close price as our representative price metric, although you may choose any.

Hence, let's now look at a graph with both volume and price plotted to get a fuller view of the underlying trends.
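A sketch of such a combined plot, using a secondary axis for volume:

```python
fig, ax_price = plt.subplots(figsize=(12, 6))

ax_price.plot(df["Date"], df["Close"], color="tab:blue", label="Close Price")
ax_price.set_xlabel("Date")
ax_price.set_ylabel("Close Price (USD)", color="tab:blue")

# Second y-axis so the (much larger) volume values don't flatten the price line.
ax_volume = ax_price.twinx()
ax_volume.plot(df["Date"], df["Volume"], color="tab:orange", alpha=0.5, label="Volume")
ax_volume.set_ylabel("Volume (USD)", color="tab:orange")

plt.title("BTC Close Price and Volume Over Time")
plt.show()
```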

The key thing to note here is that although there was a price spike in 2018, a lot of the upward price movement in BTC is very much correlated with the increased trading activity seen in BTC over the last year (since early 2020). As we are aware, part of this is likely due to the COVID-19 pandemic, as the start of this volume trend began around March-May of 2020 per the graph.

Let's explore this relationship between Volume and Price by splitting the data set into 2: one part before March 11th, 2020 (the day the WHO declared COVID-19 to be a pandemic) and one after. We do this because it is clear that trading activity picked up during COVID quarantine, which is an exogenous variable within our data. Thereby, we should analyze the underlying distribution of Volume and Price data by splitting the data set, such that we do not mix the effect of the COVID-19 exogenous variable into our statistical exploration.
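A sketch of the split:

```python
# Split on the WHO pandemic declaration date.
pandemic_start = pd.Timestamp("2020-03-11")

pre_pandemic = df[df["Date"] < pandemic_start]
post_pandemic = df[df["Date"] >= pandemic_start]
```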

Observing Volume vs. Close Price BEFORE WHO Declaration of COVID-19 to be a Pandemic

The easiest way to observe a relationship between variables is to estimate a linear regression between the 2 variables. Hence, we will try that.
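A sketch of that regression using statsmodels OLS, regressing Close on Volume for the pre-pandemic slice:

```python
import statsmodels.api as sm

X = sm.add_constant(pre_pandemic["Volume"])  # add an intercept term
y = pre_pandemic["Close"]

model = sm.OLS(y, X).fit()
print(model.summary())

# Scatter plot with the fitted line overlaid.
plt.scatter(pre_pandemic["Volume"], y, s=5)
plt.plot(pre_pandemic["Volume"], model.predict(X), color="red")
plt.xlabel("Volume (USD)")
plt.ylabel("Close Price (USD)")
plt.show()
```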

NOTE: The linear regression shown above is clearly being pulled away from a good fit by a few outliers.

The common conviction among many statisticians seeing this would be to remove the outliers and re-estimate the linear regression. However, we will not do this, because these outliers are important to the price and volume movement of BTC and warrant further study. Looking at the volume data, we know BTC was relatively illiquid (meaning it was not easily bought and sold) up until the COVID-19 pandemic, and hence these volume spikes have certain causal factors we need to study.

Let's instead try to plot this in a time series graph.

NOTE: the volume spike between 2018-07 and 2019-01

It seems as though that volume spike is correlated with a drop-off in BTC price. Here we can see that there seems to be somewhat of an underlying correlation between Volume and Price.

Generating the Volume vs. Close Price Regression AFTER WHO Declaration of COVID-19 to be a Pandemic

As before, let's look at the relationship between Volume and Price through the lens of a regression first.

NOTE: This regression is likewise being pulled away from a good fit by a few outliers.

As earlier, we will not try to re-estimate the linear regression by removing the outliers since these deviations require study due to their magnitude. As before, we will instead plot a time series graph of volume and price.

NOTE: The large volume spike near 2021-03 timestamp

This graph seems to suggest that the pre-pandemic correlation between volume and price continues into the pandemic world. The large spike in trading activity was immediately followed by a rally in BTC price. This further suggests that the existence of a correlation between volume and price is time-invariant: throughout all time periods, there does exist a correlation between volume and price.

Hypothesis Testing of Price and Volume Relationship

Let's conduct hypothesis testing to cement the relationship between Price and Volume and to measure its statistical significance.

Our null hypothesis will be that volume has no effect on price, and hence our alternative hypothesis will be that volume does have an effect on price.
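Since the OLS fit already reports a t-test on each coefficient, a sketch of reading off the result (reusing the pre-pandemic model fitted above; the post-pandemic fit works identically):

```python
alpha = 0.05

p_value = model.pvalues["Volume"]
print(f"p-value for the Volume coefficient: {p_value:.4g}")

if p_value < alpha:
    print("Reject the null hypothesis: Volume appears to have an effect on Price.")
else:
    print("Fail to reject the null hypothesis.")
```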

As shown above, since the p-value is (effectively) 0 for the Volume parameter, it is nigh impossible for this sample of Volume-Price pairs to exist if Volume had no effect on Price. Hence, we can conclude that in the pre-pandemic phase, Volume did have an apparent effect on Price; the relationship is statistically significant, and we reject the null hypothesis in favour of the alternative hypothesis.

Similarly, since the p-value is (effectively) 0 for the Volume parameter in the pandemic-phase regression, it is nigh impossible for this sample of Volume-Price pairs to exist if Volume had no effect on Price. Hence, we can conclude that in the pandemic phase, Volume also had an apparent effect on Price; the relationship is statistically significant, and we again reject the null hypothesis in favour of the alternative hypothesis.

Observing the Relationship Between Price Change and Volume

Another key variable we want to observe is the daily percentage change in price from open to close. This variable has an important statistical implication: if we observe it to be stationary about a 0 mean, then we know that BTC evolves in a cyclic fashion; in contrast, if we observe it to be non-stationary, then we know that BTC is a stochastic variable that cannot be predicted simply by determining where we are in the cycle.

Hence, let's visualize the daily %Δ in price data.
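A sketch of computing and plotting that daily open-to-close percentage change:

```python
# Daily percentage change from open to close.
df["PriceChange"] = (df["Close"] - df["Open"]) / df["Open"] * 100

plt.figure(figsize=(12, 6))
plt.plot(df["Date"], df["PriceChange"], linewidth=0.7)
plt.axhline(0, color="black", linewidth=0.8)
plt.xlabel("Date")
plt.ylabel("Daily % change (open to close)")
plt.show()
```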

As we can see above, there does seem to be a degree of stationarity present in the percentage change data up until the pandemic, after which there seems to be a large degree of non-stationarity present.

Let's now try to measure the relationship between volume and price change. We will use volume as our explanatory variable for the sake of consistency; however, you could use either. The important facet of this statistical analysis is to observe whether a relationship exists between the 2 variables, not so much the predictive power of the regression or its exact accuracy.

NOTE: Interestingly, to the naked eye there seems to be a lack of any relationship between daily volume and price change. The line is near horizontal, and it seems as though the OLS regression almost defaulted to a mean line, as the fitted regression is likely extremely poor at explaining the variation (R-squared).

Hypothesis Testing the Relationship

Noting from earlier, the graph does seem to suggest that PriceChange and Volume are not correlated. Let's examine this further via hypothesis testing.

We will take our null hypothesis to be that there exists no correlation between PriceChange and Volume, and our alternative hypothesis to be that there exists a correlation between PriceChange and Volume.

As we can see above, the p-value for the Volume parameter is 0.491, implying that there is almost a 50% chance of drawing a sample like this and getting a Volume estimate of 3.047e-12 even if the true Volume parameter is in fact 0 (i.e., there is no correlation between Volume and PriceChange). This p-value is greater than alpha = 10%, so we fail to reject the null hypothesis: the data give us no evidence that Volume and daily price change are correlated.

Machine Learning

Moving forward, we can now conduct a degree of Machine Learning on the underlying data distribution.

Describing the Problem

Using 0 time lags, we saw that there was a correlation between Closing Price and Volume, and there seemed to be no correlation between Daily Price Change and Volume. However, as this is time series data, there might exist underlying time-lagged correlations between variables. In our EDA, we did not explore that, as we are unsure about which lags are actually correlated, and using AIC/BIC is outside the scope of the analysis typically performed in an EDA section.

Typically when building an ML model, we need to figure out what we are trying to optimize. Since the purpose of this notebook is to make a model that can generate profits, we can approach this problem via a greedy algorithmic paradigm: we will use a greedy rule saying that if we optimize daily trading profits, we will optimize profits in the long run. That is to say, we will build an ML model that predicts whether BTC prices are going to close lower than they open or close higher than they open.

When we expect BTC to close lower than it opens, the key trading move is to short BTC, that is, to bet against the instrument by borrowing BTC from a broker, selling it, and then buying it back at a future date at (hopefully) a lower price.

When we expect BTC to close higher than it opens, the key trading move is to go long on BTC, that is, to bet for the instrument by buying BTC, holding it, and then selling it for a profit.

Thereby, if we can predict whether the daily shift in prices is going lower or higher (a binary classification problem), measured by a positive or negative difference between the opening and closing price, then we can use this model to decide whether to buy or short every day at the opening price.
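A sketch of framing the problem this way; the binary target below, and the lagged feature set, are illustrative choices rather than the notebook's final design:

```python
# Binary target: 1 when the day closes above its open, 0 otherwise.
df["Target"] = (df["Close"] > df["Open"]).astype(int)

# Example feature set: previous day's OHLC prices and volume, lagged by one day
# so that only information available at the next day's open is used.
features = df[["Open", "High", "Low", "Close", "Volume"]].shift(1).dropna()
labels = df["Target"].iloc[1:]

print(features.shape, labels.shape)
```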

Pursuing a Full ML Model

Due to time constraints (I did this entire project myself), I was unable to construct a full Neural Network to execute this binary classification. Although I dabbled in other ML models, I realised that ideally we would use Neural Networks for probability precision. I have included my game plan for actually building out a full network below.

  1. If I had a few more days to work on this, I would build out a Fully Connected Neural Network (i.e., no convolution filters, etc.) that would have intermediate Leaky ReLU activations, a final output activation of sigmoid, and a Binary Cross Entropy loss function to train the network (a sketch of this architecture follows this list). This works out very well since we are trying to predict whether price movement will be positive or negative on a given day; using sigmoid as the last activation produces the probability that a given day belongs to the positive class (positive price movement) rather than the negative class (negative price movement). Intermediate Leaky ReLU activations are there to prevent neuron saturation in the intermediate fully connected layers (as plain ReLU can kill neurons through saturation). Binary Cross Entropy loss makes sense as we are classifying between 2 classes, and this loss function is standard for this type of binary classification problem.
  2. If I had a month more to work on this, I would shift to a different type of Neural Network, which might actually function better, after building out the earlier design. I would enhance my features into 2-D spectrogram images of price and volume data, and I would feed those into a Convolutional Neural Network using the standard paradigm of starting with smaller and fewer filters and then moving to larger and more numerous filters. I would naturally use dropout and max pooling to assist with dimensionality reduction, spatial invariance (which in our application can be thought of as temporal invariance), and generalization. I would also use skip-connections, as ResNet does, to ensure no gradient decay occurs, and after training I would compare this network's performance to that of the earlier network.
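A minimal sketch of the fully connected design described in point 1, assuming tabular features like the lagged OHLCV columns above (layer sizes and training settings are illustrative, not the author's final choices):

```python
import tensorflow as tf

def build_fc_model(num_features: int) -> tf.keras.Model:
    """Fully connected binary classifier with Leaky ReLU hidden layers."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(num_features,)),
        tf.keras.layers.Dense(64),
        tf.keras.layers.LeakyReLU(),
        tf.keras.layers.Dense(32),
        tf.keras.layers.LeakyReLU(),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of an up day
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",   # standard loss for binary classification
        metrics=["accuracy"],
    )
    return model

# Example usage with the features/labels sketched earlier:
# model = build_fc_model(num_features=features.shape[1])
# model.fit(features.values, labels.values, epochs=50, validation_split=0.2)
```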

For either timeframe, the main metric by which I would measure success would be validation accuracy on the binary classification, followed by deploying the model into the "wild" for a month to see how it performs against incoming data (which we would similarly treat as a test set). If accuracy in price-movement predictions is above 70%, I would consider either network to have pretty good performance.

Net Insights

BTC aggregate price movements do seem to have a large degree of correlation with Volume. This is apparent between our candidate price metric (Close) and Volume, as we measured their correlation and conducted hypothesis testing on these variables.

In measuring whether daily price movement is related to volume, we found that daily price changes are in fact not very correlated with volume. It also seems as though price changes were somewhat stationary prior to the pandemic, but ever since the pandemic started, daily price changes have entered a non-stationary epoch.

NOTE: However, both of these analyses should have an asterisk next to them, as we did not measure lagged correlation. There is a real possibility that lagged variables in our data set are correlated, which would speak further to our distribution. In building an actual ML model, using lagged variables would be essential, as we are in many respects trying to predict future price movements based on past price movements.

I hope that by reading this notebook you have learned a good bit of detail about the underlying statistical distribution of BTC, and that it helps you in your own projects related to data science applications in finance. I also hope that seeing a full data science pipeline, consisting of the five stages below, gives you greater insight into what a typical data science project actually looks like:

  1. Data Curation
  2. Data Management and Representation
  3. Exploratory Data Analysis (and Hypothesis Testing)
  4. Machine Learning
  5. Generating Insights

I would highly recommend re-reading various portions of this notebook, as I understand it is a lot of code and text. A key insight about data science is that in many ways it is just applied statistics: we are always trying to observe the underlying distribution of our sample and use it to infer what the population distribution might look like. However, unlike pure statistics, our primary goal is almost always to find some way to extract value from those statistical insights; there is always a grander purpose.

As always, thank you so much for reading this!

The used data sets can be found here:

BTC-USD.csv can be retrieved from https://finance.yahoo.com/quote/BTC-USD/history?period1=1410825600&period2=1621123200&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true

BTCUSD_day.csv can be retrieved from https://www.kaggle.com/prasoonkottarathil/btcinusd?select=BTCUSD_day.csv