Stop using Time Bars for Financial Machine Learning
How and why to create Dollar Bars from Time Bars for a predictive model
This is an ongoing beginner’s series about the ideas and practical applications outlined in the book Advances in Financial Machine Learning by Marcos Lopez de Prado. For a lengthier introduction, please click here for part 0. For the next topic on labelling, click here for part 2.
Chapter Goal
To identify and obtain the best form of data input for our financial machine learning model.
By the end of this chapter, you should know:
- what financial bars are
- the difference between time bars, tick bars, volume bars, and dollar bars
- how to construct dollar bars from time bars
- the advantages of using dollar bars for our model
Understanding Bars
Most people are familiar with the traditional stock, price, or line chart that appears when you google a stock or see analysts discussing a company on TV. These charts are very useful for visualizing a clear trend: whether a stock is going up, down, or sideways. The data points usually consist of a stock’s closing price, sampled at a constant time interval, whether that be by the minute, day, or year.
Apart from the standard stock chart, bars are the next most common visual representation for price changes of any traded financial asset, such as a stock, a commodity, or a currency.
In addition to the closing price (represented by the right tick on each bar), we can also see where the stock opened, along with the highest and lowest prices reached during the day.
Candlesticks are effectively OHLC bars with the added benefit of colour-coding. From here on out, we will refer to candles and bars interchangeably.
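As a rough illustration of what a single bar holds, the sketch below collapses an ordered sequence of prices into one OHLC bar. The `OHLCBar` class and `make_bar` helper are hypothetical names used only for this example, and the prices are made up.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class OHLCBar:
    open: float
    high: float
    low: float
    close: float

    @property
    def is_up(self) -> bool:
        # Candlestick colouring: an "up" candle closes at or above its open.
        return self.close >= self.open


def make_bar(prices: Sequence[float]) -> OHLCBar:
    """Collapse an ordered, non-empty sequence of prices into one OHLC bar."""
    if not prices:
        raise ValueError("need at least one price to build a bar")
    return OHLCBar(open=prices[0], high=max(prices), low=min(prices), close=prices[-1])


bar = make_bar([100.0, 102.5, 99.0, 101.0])
print(bar, bar.is_up)
```

However a bar is sampled (by time, ticks, volume, or dollars), it carries these same four numbers; only the rule for deciding when one bar ends and the next begins changes.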
Because bars are useful for both humans and machine learning models, they are the primary input we will use to analyze prices.
Getting our Data
Please feel free to follow along with this Google Colaboratory notebook. To run the code yourself and make changes, make sure to select File -> Save a copy in Drive.
The book itself and most other articles on this topic assume you can access raw trade data from which to compile bars yourself. For most retail traders, obtaining that data is too difficult and expensive. For our project, we’ll replicate the process with the most granular, freely available data we can find: 1-minute Bitcoin (or any other cryptocurrency) bars. Shrimpy offers convenient data access without requiring an account or API keys, so we’ll be using them to get our data. To install Shrimpy, we run the following command:
pip install shrimpy-python
We then import our requisite libraries:
To import and plot a day’s worth of 1-minute candles, we can run the following:
The resulting plot should look something like this:
The Problem with Time Bars
While systematic sampling using regular time intervals is the standard for most traders and academics, it is not the best form of input for machine learning models.
Time bars exhibit two primary drawbacks. First, they disguise the rate of actual activity happening in the market at any given time. The same number of bars is produced during low-activity periods (e.g. noon) as during high-activity periods (e.g. the market open). As a result, time bars capture the information in the market poorly and introduce a lot of extra noise for machine learning models to parse through. Second, time bars have poor statistical properties, such as serial correlation, heteroscedasticity, and non-normally distributed returns. This is a problem when fitting conventional machine learning algorithms, many of which presuppose better-behaved data. In other words, time bars don’t do the best job of representing all the information, and they don’t play well with machine learning models.
There’s a third drawback that de Prado doesn’t explicitly dive into in his book. Everyone uses time bars. Unless you’re an established player that’s paying thousands of dollars for real-time or historical tick data, you’re using time bars. The big firms know this too. They can deliberately create fake volume or manipulate order books in a way that will trigger a signal like RSI or Bollinger Bands, because they know exactly what indicators retail traders are looking for. When you rely on historical price data with some technical indicators thrown in to train a machine learning model, you’re not going to make money. At the end of the day, it’s garbage in, garbage out. If you’re using the same garbage as everyone else, don’t be surprised by your results.
Alternatives to Time Bars
Rather than plotting a bar every time a minute passes, our data can become much more informative and fundamentally robust if we sample by some measure of market activity. Three potential candidates are tick bars, volume bars, and dollar bars.
Tick Bars
Every trade that happens in the market, whether it be a 10-share purchase or a 1-share sale, can be boiled down to one event, one transaction, or one “tick”. If we compile enough ticks, we can sample our bars based on how many transactions have taken place rather than how much time has passed. Tick bars offer a much better way for us to track the actual activity and volatility happening in the market.
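A minimal sketch of the idea, assuming trades arrive as `(price, size)` tuples in time order. That format, the `tick_bars` name, and the sample trades are assumptions made for this illustration, not something taken from the book or from any data provider.

```python
from typing import Dict, List, Tuple

Trade = Tuple[float, float]  # (price, size), a hypothetical trade format


def tick_bars(trades: List[Trade], ticks_per_bar: int) -> List[Dict[str, float]]:
    """Emit one OHLC bar per `ticks_per_bar` trades; leftover trades are dropped."""
    if ticks_per_bar <= 0:
        raise ValueError("ticks_per_bar must be positive")
    bars = []
    # Step through the trades in complete chunks of `ticks_per_bar`.
    for start in range(0, len(trades) - ticks_per_bar + 1, ticks_per_bar):
        chunk = trades[start:start + ticks_per_bar]
        prices = [price for price, _ in chunk]
        bars.append({
            "open": prices[0],
            "high": max(prices),
            "low": min(prices),
            "close": prices[-1],
            "volume": sum(size for _, size in chunk),
        })
    return bars


# Made-up trades for illustration only.
trades = [(100.0, 1), (101.0, 10), (99.5, 2), (100.5, 5), (102.0, 1), (101.5, 3), (103.0, 4)]
bars = tick_bars(trades, ticks_per_bar=3)
print(len(bars))  # 2 complete bars; the seventh trade is left over
```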
Volume Bars
One issue that can arise with tick bars is that each tick is still an arbitrary amount of activity. One tick can contain a buy for 1 share, 10 shares, or 10,000 shares. Multiple trades are sometimes combined on the exchange as a single tick based on timing or any number of other factors.
Volume bars address this by sampling every time a predefined amount of the security’s units (e.g. shares, coins, contracts) is traded.
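The same idea can be sketched for volume bars. As before, the `(price, size)` trade format and the `volume_bars` name are assumptions made for illustration. One simplification to note: a trade that straddles the threshold is not split, so a bar can carry slightly more volume than the threshold.

```python
from typing import Dict, List, Tuple

Trade = Tuple[float, float]  # (price, size), a hypothetical trade format


def volume_bars(trades: List[Trade], volume_per_bar: float) -> List[Dict[str, float]]:
    """Emit an OHLC bar each time cumulative traded size reaches `volume_per_bar`."""
    bars = []
    prices: List[float] = []   # prices seen since the last bar was emitted
    traded = 0.0               # units traded since the last bar was emitted
    for price, size in trades:
        prices.append(price)
        traded += size
        if traded >= volume_per_bar:
            bars.append({
                "open": prices[0],
                "high": max(prices),
                "low": min(prices),
                "close": prices[-1],
                "volume": traded,
            })
            prices, traded = [], 0.0
    return bars  # trades after the last completed bar are discarded


# Made-up trades for illustration only.
trades = [(100.0, 1), (101.0, 10), (99.5, 2), (100.5, 5), (102.0, 1), (101.5, 3), (103.0, 4)]
bars = volume_bars(trades, volume_per_bar=10)
print(len(bars))  # 2
```

Notice that the first bar closes after only two trades because the second trade is large, while the second bar needs four smaller trades to reach the same volume.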
Dollar Bars
Finally, we get to perhaps the most robust yet least often discussed type of bar, the dollar bar.
Dollar bars can be thought of as very similar to the volume bar except sampled for every predefined amount of dollars (e.g. $1,000,000) traded. Dollar bars are superior to volume bars for two main reasons.
First of all, dollar bars help bridge the gap with price volatility. Especially in markets like cryptocurrencies, prices can go up or down by several hundred percent in a matter of days or weeks. Bitcoin (BTC), for example, has appreciated nearly 400% YTD as of the time of writing. After a 400% gain, a coin bought for roughly $4000 is worth about $20,000, so a person who purchased one BTC at that price can now sell just a fifth of a Bitcoin to reclaim their $4000. Sampling by dollars helps preserve the consistency and integrity of the information even more so than sampling by volume, especially for a security that’s highly volatile.
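The arithmetic behind this point can be made concrete with a small sketch. The two prices below are round illustrative numbers, not market data, and both helper names are hypothetical.

```python
def coins_per_dollar_bar(dollar_threshold: float, price: float) -> float:
    """How many coins must trade to fill one dollar bar at a given price."""
    return dollar_threshold / price


def dollars_per_volume_bar(volume_threshold: float, price: float) -> float:
    """How many dollars one fixed-size volume bar represents at a given price."""
    return volume_threshold * price


low_price, high_price = 4_000.0, 20_000.0  # illustrative prices only

# A $1,000,000 dollar bar always represents the same value exchanged,
# even though the number of coins behind it shrinks as the price rises.
print(coins_per_dollar_bar(1_000_000, low_price))   # 250.0
print(coins_per_dollar_bar(1_000_000, high_price))  # 50.0

# A 250-coin volume bar represents five times more value at the higher price.
print(dollars_per_volume_bar(250, low_price))   # 1000000.0
print(dollars_per_volume_bar(250, high_price))  # 5000000.0
```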
Secondly, dollar bars are robust to changes in the outstanding amount of the security. This is useful when adjusting to events that affect the total number of units in circulation, such as corporate buybacks or new share issuances.
To dive into the statistical properties of each of these alternative bars to gather more empirical evidence of their usefulness, I highly recommend the following articles by Gerard Martinez, who does a fantastic job diving deeper into these topics, as well as a type of bar we haven’t looked at — imbalance bars.
- Advanced candlesticks for machine learning (i): tick bars
- Advanced candlesticks for machine learning (ii): volume and dollar bars
- Information-driven bars for financial machine learning: imbalance bars
For our project, and as a matter of personal preference, we will be using dollar bars exclusively from this point onward. However, if you are interested, I encourage you to experiment with all the different types of bars mentioned above.
No single type of bar is necessarily the best choice in every situation. The big idea here is that the type of input for machine learning is absolutely critical, and a lot of time and care should be spent here to ensure that the best type of data is used going forward.
Creating Dollar Bars
To get our time bars ready for processing, we need to apply a little cleaning and formatting to our initial dataset. Please note that this step will be different based on what type of data you start with. The end goal is to get our data into a list of dictionaries with each dictionary corresponding to one candle so that we can use it for our dollar bar creation function.
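A minimal sketch of such a cleaning step, assuming the raw candles arrive as dictionaries of strings with an ISO-style timestamp. The field names, the timestamp format, and the sample values are all assumptions made for illustration; adjust them to whatever your data source actually returns.

```python
from datetime import datetime
from typing import Dict, List

# Hypothetical raw candles with made-up values, deliberately out of order.
raw_candles = [
    {"time": "2020-12-17T00:01:00.000Z", "open": "21355.2", "high": "21370.0",
     "low": "21350.0", "close": "21365.4", "volume": "8.25"},
    {"time": "2020-12-17T00:00:00.000Z", "open": "21350.1", "high": "21360.0",
     "low": "21340.5", "close": "21355.2", "volume": "12.5"},
]


def clean_candle(raw: Dict[str, str]) -> Dict:
    """Parse one raw candle: timestamp to datetime, numeric fields to float."""
    return {
        "time": datetime.strptime(raw["time"], "%Y-%m-%dT%H:%M:%S.%fZ"),
        "open": float(raw["open"]),
        "high": float(raw["high"]),
        "low": float(raw["low"]),
        "close": float(raw["close"]),
        "volume": float(raw["volume"]),
    }


# One dictionary per candle, sorted oldest first.
candles: List[Dict] = sorted((clean_candle(c) for c in raw_candles), key=lambda c: c["time"])
print(candles[0]["time"], candles[0]["close"])
```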
We’ll then define our dollar bars function and use it to create our bars. In essence, we’re calculating a rough dollar volume of every minute candle (average price of the candle x volume traded) and keeping a running tally of how many dollars have changed hands. Once that number exceeds the threshold we specified, we create a dollar bar, reset our running total, and continue.
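The procedure described above can be sketched as follows. This is one possible implementation under stated assumptions, not necessarily the exact function used in the notebook: each candle is a dictionary with `time`, `open`, `high`, `low`, `close`, and `volume` keys, the candle's price is approximated by the midpoint of its high and low, and each dollar bar is stamped with the time of the candle that completes it.

```python
from typing import Dict, List


def dollar_bars(candles: List[Dict], threshold: float) -> List[Dict]:
    """Aggregate time bars into approximate dollar bars.

    Each candle's dollar volume is estimated as midpoint price times volume.
    A bar is emitted once the running total reaches `threshold`.
    """
    bars = []
    current = None   # the dollar bar currently being built
    running = 0.0    # estimated dollars traded since the last bar
    for candle in candles:
        midpoint = (candle["high"] + candle["low"]) / 2
        running += midpoint * candle["volume"]
        if current is None:
            current = {"open": candle["open"], "high": candle["high"], "low": candle["low"]}
        else:
            current["high"] = max(current["high"], candle["high"])
            current["low"] = min(current["low"], candle["low"])
        current["close"] = candle["close"]
        current["time"] = candle["time"]  # stamped with the completing candle
        if running >= threshold:
            current["dollar_volume"] = running
            bars.append(current)
            current, running = None, 0.0
    return bars  # a partially filled bar at the end is discarded


# Made-up minute candles for illustration only.
minute_candles = [
    {"time": "09:00", "open": 100, "high": 110, "low": 90, "close": 105, "volume": 10},
    {"time": "09:01", "open": 105, "high": 115, "low": 105, "close": 110, "volume": 20},
    {"time": "09:02", "open": 110, "high": 112, "low": 100, "close": 102, "volume": 30},
    {"time": "09:03", "open": 102, "high": 104, "low": 100, "close": 103, "volume": 5},
]
result = dollar_bars(minute_candles, threshold=3000)
print(len(result))  # 2
```

With a $3,000 threshold, the first two candles combine into one bar, the busy third candle forms a bar on its own, and the quiet fourth candle is left waiting for more activity.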
Our output should look something like this:
A few notes about making dollar bars this way:
- The timing of each dollar bar can only be approximated to the nearest minute. In the event of flash crashes or rallies (large price movements happening within a minute), we would be too late.
- The amount of dollar volume traded is an estimate based on the midpoint price of the candle. Depending on the actual orders, this number could be very close or far off.
- We no longer need our volume column — the volume information is now built into the bar itself.
Again, creating dollar bars this way is not ideal. In Advances in Financial Machine Learning, Marcos Lopez de Prado assumes you have access to raw trade data from which to build your own accurate dollar bars. For most people starting out, accessing that raw trade data is not feasible. For now, this aggregate estimation of dollar bars will do.
Comparing Dollar Bars to Time Bars
To create a plot of our dollar bars, we can run the following code:
You should get something that looks like this:
Please note that this data is a snapshot from Dec 17, 2020, a day during which Bitcoin set several new all-time highs with extremely high volumes traded. On a regular day, the dollar candles at this threshold will likely be sparser and farther apart.
Right away we can see the differences between the dollar bars and the time bars we started with. Between the hours of 5 to 9 am, we sampled only a few bars per hour. Compare that with the flurry of activity between 9 and 10 am.
Upon first glance, creating bars this way may seem trivial or even detrimental. After all, aren’t we actually losing information by artificially removing data? For our machine learning model, can’t we introduce volume or dollar volume as a feature to capture the essence of what we’ve produced here?
In short, the answer is no. Modern machine learning models are built upon many classical statistical methods that rely on the assumption that observations are IID (or independent and identically distributed random variables).
To understand IID, imagine an experiment that records coin tosses. Each toss is independent of the ones before it, and every toss follows the same distribution (a 50% chance of heads and a 50% chance of tails). Financial time series inherently break the IID assumption because they have memory: the price of the next bar depends largely on the price of the preceding bar.
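One rough way to see this difference is to measure how strongly each observation is correlated with the one before it. The sketch below uses simulated data only (fair coin tosses and a random-walk price series), not market data, and `lag1_autocorr` is a hypothetical helper written for this example.

```python
import random
from statistics import mean
from typing import List, Sequence


def lag1_autocorr(xs: Sequence[float]) -> float:
    """Sample lag-1 autocorrelation: correlation of each value with the next one."""
    m = mean(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs[:-1], xs[1:]))
    den = sum((x - m) ** 2 for x in xs)
    return num / den


random.seed(0)

# IID case: each toss is independent of the previous one.
tosses = [random.choice([0, 1]) for _ in range(5000)]

# Series with memory: each price is the previous price plus a random step.
prices: List[float] = []
price = 100.0
for _ in range(5000):
    price += random.gauss(0, 1)
    prices.append(price)

print(round(lag1_autocorr(tosses), 3))  # close to 0
print(round(lag1_autocorr(prices), 3))  # close to 1
```

Knowing the previous coin toss tells you essentially nothing about the next one, whereas knowing the previous price tells you almost everything about the next price level.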
While dollar bars and other volume-based alternatives cannot entirely get around this IID issue (we will discuss other methods in the future), they perform much better than the arbitrary time markers we started with. Again, if you are interested, I would encourage you to take a look at Gerard Martinez’s excellent breakdown of the statistical differences between the different types of bars.
Another advantage of dollar bars is that they offer a better starting point for any other insight we want to generate. We’ll find that trends, moving averages, and other indicators measured on dollar bars tend to produce much better results than the same indicators measured on time bars.
Next Steps
A good question to think about next is where to set the dollar volume threshold. In the example above, I picked an arbitrary number of $5,000,000. If you pick a smaller number, you’ll get more frequent candles (at the cost of accuracy, as our candle generator can only produce estimates based on time bars). If you pick a larger number, you’ll get more accurate candles, but each one will cover a longer stretch of time, so you’ll be trading on longer intervals. You could determine a threshold by daily volatility or moving averages or something else entirely. There are many potential directions to go here. In the book, de Prado also introduces the concept of imbalance bars, which can be another useful tool to look at.
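As one possible starting point (an assumption of this sketch, not a recommendation from the article), you could choose the threshold so that the data produces roughly a target number of bars per day on average. The helper name `threshold_for_target` and the sample candles are hypothetical, and the dollar volume is estimated from the candle midpoint as before.

```python
from typing import Dict, List


def threshold_for_target(candles: List[Dict], days: float, bars_per_day: float) -> float:
    """Dollar threshold that would yield about `bars_per_day` bars per day on average."""
    if days <= 0 or bars_per_day <= 0:
        raise ValueError("days and bars_per_day must be positive")
    # Estimate total dollars traded as midpoint price times volume per candle.
    total_dollars = sum(((c["high"] + c["low"]) / 2) * c["volume"] for c in candles)
    return total_dollars / (days * bars_per_day)


# Made-up minute candles for illustration only.
sample_candles = [
    {"high": 110, "low": 90, "volume": 10},
    {"high": 115, "low": 105, "volume": 20},
    {"high": 112, "low": 100, "volume": 30},
    {"high": 104, "low": 100, "volume": 5},
]
threshold = threshold_for_target(sample_candles, days=1, bars_per_day=2)
print(threshold)  # 3445.0
```

A threshold chosen this way from one period will drift out of date as prices and activity change, so it may be worth recomputing it over a rolling window.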
Hopefully, this article got you thinking about the variety of volume-based data inputs we can use, and how that might be useful for the model we’ll be creating soon.
Summary
In today’s article, we looked at the various types of bars or price inputs we can use for our machine learning models. We learned about one method of creating dollar bars from time bars and a few reasons why we would choose one over the other. Next time, we’ll start working with the data and implementing a few of Marcos Lopez de Prado’s signature techniques — such as meta labelling and the triple barrier method. You can click here for part 2.
Disclaimer: The views expressed are my own interpretations and opinions. They are not guaranteed to be a faithful representation of the ideas or work of anyone else. Reader discretion is advised.
References and Further Readings
- Advances in Financial Machine Learning by Marcos Lopez de Prado
- Financial Machine Learning Part 0: Bars by Maks Ivanov
- Does Meta Labeling Add to Signal Efficacy? by Ashutosh Singh and Jacques Joubert
- Building a Financial Machine Learning Pipeline with Alpaca (Part 1) by Max Bodoia
- Lessons learned building an ML trading system that turned $5k into $200k by Tradient