Limit Order Books and High-Frequency Trading
Written by Ming Him (Ken) Lau and Matthew Chang, Quantitative Trading Interns
Foundations for data-driven risk management in digital assets.
Aug 12, 2025
In the rapidly evolving landscape of cryptocurrency trading, risk practitioners must prioritize advanced order and execution management systems (OEMS) to mitigate vulnerabilities. A robust OEMS, paired with a working knowledge of Limit Order Books (LOBs) and High-Frequency Trading (HFT), is essential for navigating these complex yet high-potential markets.
This article delves into the intricacies of LOBs and the significance of data integrity in trading strategies. By examining the intersection of tech and finance, we aim to equip traders and risk practitioners with insights that support sound decision-making in an era where speed and accuracy are vital. Through an exploration of data cleaning methodologies and a case study, we focus on the foundational elements of sustainable trading practices and offer a glimpse into the future of algorithmic trading in crypto.
Understanding the Limit Order Book (LOB)
The Limit Order Book (LOB) is a core infrastructure element in modern financial markets, including cryptocurrency exchanges. It serves as a continuously updated ledger that records all outstanding limit orders—buy and sell instructions placed by traders at specified prices—for a particular asset.
LOBs are typically updated through three primary operations: order creation, when a new order is placed into the book; order modification, when an existing order is amended (such as changing the price or quantity); and order cancellation, when a previously placed order is removed from the book. These updates are critical for maintaining an accurate view of supply and demand at various price levels.
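To make these operations concrete, below is a minimal, illustrative Python sketch of a price-level book; the class and field names are ours, not any exchange's schema. At price-level granularity, creation, modification, and cancellation all collapse into "set the quantity resting at this price," with a zero quantity removing the level; order-by-order feeds, described under Level 3 below, would instead track individual order IDs.

```python
# A minimal price-level order book sketch (illustrative, not exchange-specific).
# Each update is (side, price, quantity); quantity == 0 removes the level.

class PriceLevelBook:
    def __init__(self) -> None:
        self.bids: dict[float, float] = {}  # price -> resting quantity
        self.asks: dict[float, float] = {}

    def apply(self, side: str, price: float, qty: float) -> None:
        book = self.bids if side == "bid" else self.asks
        if qty == 0:
            book.pop(price, None)   # cancellation (or level fully consumed)
        else:
            book[price] = qty       # creation or modification

    def best_bid(self) -> float | None:
        return max(self.bids) if self.bids else None

    def best_ask(self) -> float | None:
        return min(self.asks) if self.asks else None


book = PriceLevelBook()
book.apply("bid", 64000.0, 1.5)   # order creation
book.apply("bid", 64000.0, 0.7)   # modification: new resting quantity
book.apply("bid", 64000.0, 0.0)   # cancellation: level removed
```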
LOB data is generally categorized by depth or detail level, commonly divided into three tiers:
Level 1 (L1) data provides only the best bid and ask prices along with their associated quantities. This is the most basic form of market depth and is useful for simple price discovery and execution logic.
Level 2 (L2) data includes multiple price levels (often the top 10 on each side of the book) and shows a more comprehensive view of market depth. It helps traders understand short-term supply and demand dynamics beyond just the best available prices.
Level 3 (L3) data reveals the entire order book, including individual order placements with unique identifiers. This granular view allows for detailed analysis of trader behavior, order flow reconstruction, and market microstructure research.
Understanding the structure and updating mechanics of the LOB is essential for developing effective trading strategies and risk models, particularly in high-frequency and algorithmic trading environments where market conditions change rapidly.
Trade data, which records each executed transaction, can also add granularity to the LOB.
In most cases, traders can fetch the data from the WebSocket API provided by each exchange. For example, in the Coinbase Advanced Trade WebSocket API, level 2 order book data can be retrieved from the “level2” channel and trade data from the “market_trades” channel.
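As a minimal sketch, the subscription below uses the open-source websockets library and follows Coinbase's public Advanced Trade documentation at the time of writing; the endpoint, channel names, and any authentication requirements should be verified against the current docs before use.

```python
# Minimal level2 subscription sketch (pip install websockets).
# Endpoint and message fields follow Coinbase's public Advanced Trade docs
# at the time of writing; verify against the current documentation.
import asyncio
import json

import websockets


async def stream_level2(product_id: str = "BTC-USD") -> None:
    uri = "wss://advanced-trade-ws.coinbase.com"
    async with websockets.connect(uri) as ws:
        await ws.send(json.dumps({
            "type": "subscribe",
            "product_ids": [product_id],
            "channel": "level2",   # "market_trades" for last-trade data
        }))
        async for raw in ws:       # each message is a JSON document
            msg = json.loads(raw)
            print(msg.get("channel"))

asyncio.run(stream_level2())
```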
The Important Role of Data Cleaning
Due to its high-frequency nature and inherent noise, raw LOB data must be cleaned before it is suitable for modeling or analysis. Updates arrive hundreds of times per minute for liquid securities, and the book's irregular structure often has to be transformed into feature vectors for deep learning, which makes handling dynamic changes in order volumes essential. The cleaning process prepares raw data for analysis by addressing errors and inconsistencies. Common techniques include the following, sketched in code after the list:
Remove duplicates: Avoid double-counting of orders.
Handle missing values: Fill or drop incomplete records to ensure consistency.
Enforce chronological order: Ensure event time sequencing for time-series accuracy.
Normalize formats: Standardize units and formatting across datasets.
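A minimal Polars sketch of these four steps might look as follows; the file and column names ("trades_list.csv", "timestamp", "price", "qty") are placeholders for your own schema.

```python
import polars as pl

raw = pl.read_csv("trades_list.csv")

clean = (
    raw
    .unique()                                    # remove exact duplicate records
    .drop_nulls(["timestamp", "price", "qty"])   # drop incomplete records
    .sort("timestamp")                           # enforce chronological order
    .with_columns(pl.col("price").cast(pl.Float64))  # normalize types/formats
)
clean.write_csv("trades_clean.csv")
```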
Case Study: Cleaning Binance’s Limit Order Book
To demonstrate how to manage real-world LOB data, we use Binance, one of the largest centralized exchanges (CEXs) in the cryptocurrency market, as an example. This case study focuses on processing, cleaning, and analyzing high-frequency quote and trade data from Binance’s Level 2 WebSocket streams for BTC trading pairs. The objective is to ensure data accuracy and reliability for statistical analysis and trading model development, particularly in algorithmic and high-frequency trading (HFT). The process involves collecting raw data, reformatting it into .CSV files, cleaning it to address inconsistencies, and producing deliverables that include a cleaned dataset, descriptive statistics, performance logs, and cleaning scripts.
Purpose and Importance
The case study emphasizes the critical role of data integrity in backtesting trading strategies, real-time decision-making, and market microstructure analysis. Raw market data often contains errors like outliers, misaligned timestamps, and erroneous entries, which can skew results if not corrected. By employing advanced cleaning techniques, we can enhance data quality for more accurate statistical insights and a better grasp of market behavior.
Methodology
Data Extraction
Data is downloaded from Binance’s Level 2 WebSocket streams as twenty .GZ files, where each line of a file represents a JSON document. Three streams are utilized: “@depth10” (10 levels of market depth), “@depth” (general bid/ask updates), and “@trade” (individual trade details). The JSON data is parsed and consolidated into three .CSV files: “depth_list.csv”, “depth10_list.csv”, and “trades_list.csv”.
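As an illustration of the reformatting step, the sketch below flattens one gzipped JSON-lines capture into per-stream record lists and writes them to .CSV. The file name and the combined-stream envelope ("stream"/"data" keys) are assumptions about the capture format rather than a fixed specification.

```python
# Flatten one gzipped JSON-lines capture into per-stream CSVs (illustrative).
import csv
import gzip
import json

rows_by_stream: dict[str, list[dict]] = {"depth": [], "depth10": [], "trade": []}

with gzip.open("capture_01.gz", "rt") as fh:
    for line in fh:
        doc = json.loads(line)
        stream = doc.get("stream", "")        # e.g. "btcusdt@trade"
        for key in rows_by_stream:
            if stream.endswith("@" + key):    # route by stream suffix
                rows_by_stream[key].append(doc["data"])
                break

for key, rows in rows_by_stream.items():
    if rows:
        with open(f"{key}_list.csv", "w", newline="") as out:
            writer = csv.DictWriter(out, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
```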
Data Cleaning
Cleaning follows procedures similar to those in R’s highfrequency package. Trades with zero price or quantity are removed, and volume-weighted average prices are calculated for same-timestamp trades. Outliers are also identified and removed to enhance data quality and ensure robustness in subsequent analysis: quotes with zero values, negative bid-ask spreads, or extreme values are excluded. Outliers are defined as observations deviating from the rolling median by more than ten times the rolling Median Absolute Deviation (MAD), a robust, non-parametric threshold that detects extreme deviations while minimizing sensitivity to skewed distributions.
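A sketch of the trade-side rules in Polars is shown below: zero price or quantity rows are dropped, same-timestamp trades are collapsed into a volume-weighted average price, and observations deviating from a rolling median by more than ten rolling MADs are excluded. The two-pass rolling MAD (a rolling median of absolute deviations from a rolling median) and the window length are our assumptions, not parameters taken from the highfrequency package.

```python
import polars as pl

trades = pl.read_csv("trades_list.csv")

# Drop bad rows, then collapse same-timestamp trades into a VWAP.
vwap = (
    trades
    .filter((pl.col("price") > 0) & (pl.col("qty") > 0))
    .sort("timestamp")
    .group_by("timestamp", maintain_order=True)
    .agg(
        ((pl.col("price") * pl.col("qty")).sum() / pl.col("qty").sum()).alias("price"),
        pl.col("qty").sum().alias("qty"),
    )
)

WINDOW = 50  # rolling window length (assumption)
filtered = (
    vwap
    .with_columns(pl.col("price").rolling_median(window_size=WINDOW).alias("med"))
    .with_columns((pl.col("price") - pl.col("med")).abs().alias("absdev"))
    .with_columns(pl.col("absdev").rolling_median(window_size=WINDOW).alias("mad"))
    # Keep rows within 10 rolling MADs; warm-up rows (null MAD) are dropped.
    .filter(pl.col("absdev") <= 10 * pl.col("mad"))
    .drop(["med", "absdev", "mad"])
)
```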
Additionally, quote updates occur at a lower frequency than trade events. To avoid introducing artificial or spurious relationships, each trade was not forcibly aligned with a corresponding bid-ask spread. Skipping this alignment preserves the temporal dynamics between trades and quotes and prevents over-correction and misleading associations that could distort modeling or interpretation.
Efficiency
To manage large datasets effectively, this project employs parallel processing and vectorized operations during data cleaning, optimizing both speed and resource use. The mclapply function in R enables simultaneous processing across multiple cores, while the Polars library in Python leverages high-performance vectorized computations. Together, these tools significantly streamline data preprocessing, ensuring rapid and efficient handling of high-frequency financial data.
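For example, a Python analogue of mclapply is the standard library's multiprocessing.Pool, which can fan independent capture files across cores while Polars keeps the per-file work vectorized; clean_one_file below is a placeholder for the cleaning pipeline above, and the file names are illustrative.

```python
from multiprocessing import Pool

import polars as pl


def clean_one_file(path: str) -> str:
    # Vectorized per-file work: no Python-level row loops.
    df = pl.read_csv(path).unique().sort("timestamp")
    out = path.replace(".csv", "_clean.csv")
    df.write_csv(out)
    return out


if __name__ == "__main__":
    paths = [f"chunk_{i:02d}.csv" for i in range(20)]
    with Pool() as pool:                          # one worker per core by default
        cleaned = pool.map(clean_one_file, paths)  # analogous to R's mclapply
    print(cleaned)
```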
Key Statistics and Insights
The integrity of the data is preserved even when over half of the observations are removed by cleaning: most descriptive statistics, including the critical and representative data points, remain essentially the same, while dropping zero-price trades and outliers improves accuracy.
Descriptive Statistics for BTC/FDUSD Trades
Source: Sylvanus Technologies. Above analysis is based on data from 2024/4/30 16:00 to 16:02.
After cleaning the data, there are some noteworthy observations:
1. Price and volatility jumps: As the graph below shows, price jumps are evident in less liquid markets such as AVAX/BTC and OP/BTC. With lower volumes and thinner order books, these pairs are extremely sensitive to single or clustered trades, which makes them challenging to manage in a market-making or liquidity provision strategy.
Source: Sylvanus Technologies. Analysis is based on data from 2024/4/30 16:00 to 16:02.
2. Deviation of trade and quote counts: A significant gap between trade and quote counts is observed; for example, roughly 6,000 trades versus 300 quotes for BTC/USDT in the given timeframe. The gap may stem from Bitcoin’s market microstructure, where trades outnumber quote updates in liquid, volatile conditions as rapid executions or split large orders occur under unchanged quotes, or from data collection issues such as missed WebSocket events or latency, which would capture trades more reliably than quotes. Investigating whether this reflects market behavior or technical flaws is key to ensuring dataset reliability for analysis.
3. Liquidity differences: A third noteworthy observation is the varying liquidity across BTC/stablecoin pairs. BTC/FDUSD has become one of the most liquid pairs thanks to strong support on Binance, where a market maker consistently provides liquidity. In contrast, BTC/USDC exhibits greater volatility and more erratic trading patterns, reflecting fewer market participants. FDUSD emerged as a substitute for Binance USD (BUSD) after regulatory actions curtailed BUSD's availability, which reinforced the pair's depth, whereas USDC's limited appeal outside the U.S. has resulted in lower trading volumes on major exchanges like Binance. Liquidity and trading behavior can therefore vary significantly across stablecoin pairs, impacting market strategies and decisions.
Next Steps
The procedure above illustrates how to fetch and clean LOB data. The cleaned dataset can then feed a predictive model. For LOB mid-price prediction, common deep learning architectures include CNN+LSTM hybrids as well as transformers and their variants, and some mid-price prediction models can set the benchmark price for a market-making strategy. A risk manager can combine the LOB with predictive models to monitor market depth and anticipate short-term price movements. By analyzing order flow and LOB imbalances, they can identify potential liquidity shortages and areas of elevated volatility. Ultimately, integrating LOB data into risk models helps mitigate slippage, improve execution, and enhance resilience to market shocks.
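As a concrete example of the kind of signal such monitoring rests on, the sketch below computes two standard features from an L2 snapshot: the mid-price and the top-of-book volume imbalance, (bid volume - ask volume) / (bid volume + ask volume), which ranges from -1 (ask-heavy) to +1 (bid-heavy). The input layout, lists of (price, quantity) pairs sorted from the best level outward, is an assumption.

```python
def lob_features(bids: list[tuple[float, float]],
                 asks: list[tuple[float, float]],
                 depth: int = 5) -> tuple[float, float]:
    """Mid-price and volume imbalance over the top `depth` levels."""
    best_bid, best_ask = bids[0][0], asks[0][0]
    mid = (best_bid + best_ask) / 2
    bid_vol = sum(q for _, q in bids[:depth])
    ask_vol = sum(q for _, q in asks[:depth])
    imbalance = (bid_vol - ask_vol) / (bid_vol + ask_vol)
    return mid, imbalance


mid, imb = lob_features(
    bids=[(64000.0, 2.0), (63999.5, 1.0)],
    asks=[(64000.5, 0.5), (64001.0, 3.0)],
)
print(mid, imb)  # 64000.25, ~ -0.08 (slightly ask-heavy book)
```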
It is also important to note that the data cleaning process and choice of model should be consistent with your goal. If you are interested in detecting outliers, for example, you will need to keep the outliers in the dataset rather than removing them. Additionally, while this case study is meant as an introduction to back-testing with LOB data, the choice of programming language matters greatly in live trading. Compiled languages such as Rust or C++ are often preferred for their faster execution, so interpreted languages like R and Python may not be the best choice for latency-sensitive production systems.
Decoding High Frequency Trading (HFT)
With a clear overview of the LOB data established, we can now explore the realm of high-frequency trading (HFT). HFT is a form of algorithmic trading characterized by high speeds, high turnover rates, and high order-to-trade ratios, leveraging advanced technology for rapid execution. It aims to capture small profit margins from numerous trades, often holding positions for milliseconds to seconds.
In general, HFT requires highly efficient software so that all analysis and execution happen in a low-latency, high-frequency environment. The fixed-cost threshold is high, covering Field-Programmable Gate Arrays (FPGAs), multi-core processors, high-speed memory, and GPUs for real-time data handling, and co-locating servers as close to the exchange as possible further reduces latency. HFT is not just about data and prediction accuracy but also a race for computing power and physical resources.
Popular HFT Strategies
Market Making
HFT firms provide liquidity by continuously posting buy and sell orders, profiting from the bid-ask spread. In the crypto world, there are two types of market makers. The first resembles market makers in other asset classes such as equities: they trade on the LOBs of Centralized Exchanges (CEXs). The second operates on Decentralized Exchanges (DEXs) such as Uniswap, where automated market makers (AMMs) provide liquidity through liquidity pools instead of an LOB.
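To illustrate the contrast, here is a toy constant-product pool in the style of Uniswap v2, where the invariant x * y = k prices a swap from pool reserves rather than from resting orders. The 0.3% fee matches Uniswap v2; the reserve numbers are made up, and real pools add many details this sketch omits.

```python
def amm_swap_out(x_reserve: float, y_reserve: float,
                 dx: float, fee: float = 0.003) -> float:
    """Tokens of Y received for dx of X, keeping x * y = k after the fee."""
    dx_after_fee = dx * (1 - fee)
    k = x_reserve * y_reserve
    new_y = k / (x_reserve + dx_after_fee)
    return y_reserve - new_y


# Pool with 100 ETH and 350,000 USDC: marginal price = 3,500 USDC/ETH.
out = amm_swap_out(100.0, 350_000.0, 1.0)
print(out)  # ~3,455 USDC: worse than the marginal price (price impact + fee)
```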
Directional Trading
This involves taking short-term positions based on anticipated price movements, leveraging speed to predict shifts before others. Traders use real-time LOB updates or market signals to predict whether a security’s price will rise or fall. This strategy relies on rapid execution and low latency to act before slower market participants, exploiting fleeting trends or momentum. Its strength lies in its adaptability to volatile conditions, though it carries risk if predictions fail or market conditions shift unexpectedly.
Arbitrage
The backbone of this strategy is exploiting mispricings that arise from market inefficiencies. The cryptocurrency market is still highly segmented across many exchanges, so arbitrage traders can use LOB data to detect mispricing, buying low on one venue and selling high on another.
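A toy version of the detection logic: an edge exists when one venue's best bid exceeds another venue's best ask by more than round-trip fees. The fee levels and quotes below are illustrative, and the sketch ignores latency, transfer times, and inventory risk.

```python
def arb_edge(best_ask_a: float, best_bid_b: float,
             fee_a: float = 0.001, fee_b: float = 0.001) -> float:
    """Net edge per unit from buying at venue A's ask and selling at B's bid."""
    buy_cost = best_ask_a * (1 + fee_a)
    sell_proceeds = best_bid_b * (1 - fee_b)
    return sell_proceeds - buy_cost


edge = arb_edge(best_ask_a=63950.0, best_bid_b=64100.0)
print(edge)  # ~21.95: positive means the mispricing exceeds round-trip fees
```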
Summary
As the cryptocurrency market continues to evolve, understanding LOBs and HFT is vital for risk practitioners aiming to navigate its complexities effectively. The emphasis on data integrity and advanced analytics will be necessary for crafting informed trading strategies. By leveraging these insights and technologies, traders can position themselves for ongoing success in an increasingly competitive environment, ensuring that risk management remains at the forefront of their operations.
Contributors:
Ming Him (Ken) Lau is a quantitative developer with a Master’s degree in Quantitative Finance from Rutgers University. He specializes in quantitative trading, position management, and risk management. With a data-driven approach, Ken focuses on optimizing profitability through robust strategy implementation and prudent oversight. His passion for continuous learning keeps him adaptable and forward-looking in fast-evolving financial markets.
Matthew Chang is a data science and quantitative research practitioner currently pursuing a Bachelor’s degree in Economics at the University of Chicago. He has interned at leading firms including Samara Alpha Management, where he developed tools for processing and analyzing high-frequency trading data. With experience spanning venture capital, strategy, and innovation roles across the U.S. and China, Matthew brings a unique interdisciplinary perspective to quantitative finance, market microstructure, and fintech innovation.