Import Binance public archives into a floxlog tape¶
data.binance.vision publishes daily aggregate-trade zip archives for spot, USDT-margined perpetuals, and coin-margined perpetuals, going back more than two years. The archive layout is stable, but the per-row work is fiddly: column ordering, header autoskip, millisecond-to-nanosecond rescale, side mapping, and dedup on re-runs.
flox_py.archives.binance wraps the converter in two function calls, plus a flox archive binance CLI subcommand. The output is a regular .floxlog tape, which means every aggregator, MergedTapeReader, the live recorder hook, and the engine itself read it without special-casing.
Install¶
No extras are required. --mirror downloads use urllib from the standard library.
Convert a single day¶
The simplest form takes one already-downloaded zip and writes a tape directory. The example below is the same script CI runs on every push: it builds a synthetic Binance-style zip in memory, calls aggtrades_to_floxlog, then reads the produced tape back through DataReader.
"""Binance public aggTrades archive round-trip — build a tiny
synthetic zip in the exact layout published on data.binance.vision,
push it through ``flox_py.archives.binance.aggtrades_to_floxlog``,
then read the resulting ``.floxlog`` tape back via ``DataReader`` and
print a summary.
This example is the CI-runnable companion to
[Import the Binance public archive](../how-to/import-binance-archive.md).
It does not need network — the synthetic zip is built in-memory so
the test runs anywhere.
Usage:
cd /path/to/flox
PYTHONPATH=build/python python3 docs/examples/python_binance_archive.py
"""
from __future__ import annotations
import io
import shutil
import tempfile
import zipfile
from pathlib import Path
import flox_py
from flox_py.archives import binance
_ROWS = [
# agg_id, price, qty, first_id, last_id, ts_ms, is_buyer_maker, is_best
(1001, 42100.50, 0.005, 9001, 9001, 1_700_000_000_000, "True", "True"),
(1002, 42101.00, 0.010, 9002, 9003, 1_700_000_001_000, "False", "True"),
(1003, 42100.75, 0.002, 9004, 9004, 1_700_000_002_500, "True", "True"),
]
def _build_synthetic_zip(dest: Path) -> Path:
buf = io.StringIO()
for r in _ROWS:
buf.write(",".join(str(x) for x in r) + "\n")
with zipfile.ZipFile(dest, "w", zipfile.ZIP_DEFLATED) as zf:
zf.writestr("BTCUSDT-aggTrades-2024-01-15.csv", buf.getvalue())
return dest
def main() -> None:
workdir = Path(tempfile.mkdtemp(prefix="flox-binance-example-"))
try:
zip_path = workdir / "BTCUSDT-aggTrades-2024-01-15.zip"
_build_synthetic_zip(zip_path)
tape_dir = workdir / "tape"
stats = binance.aggtrades_to_floxlog(
zip_path,
tape_dir,
symbol_id=1,
symbol_name="BTCUSDT",
market="um-futures",
)
trades = flox_py.DataReader(str(tape_dir)).read_trades()
print(
f"converted: rows_read={stats.rows_read} "
f"trades_written={stats.trades_written} "
f"tape_trades={int(trades.size)}"
)
# is_buyer_maker=True maps to Side::SELL (1); False → SELL is False → BUY (0).
expected_sides = [1 if r[6] == "True" else 0 for r in _ROWS]
actual_sides = [int(t["side"]) for t in trades]
assert actual_sides == expected_sides, (actual_sides, expected_sides)
assert int(trades.size) == len(_ROWS)
finally:
shutil.rmtree(workdir, ignore_errors=True)
if __name__ == "__main__":
main()
csv_path accepts either the zip published by Binance or the extracted CSV; the reader autoskips a header row when one is present.
Side encoding follows the floxlog convention. is_buyer_maker = true becomes Side::SELL (the buyer was the resting maker, so the active flow is a seller hitting the bid). is_buyer_maker = false becomes Side::BUY.
Convert a date range¶
For multi-day backfills, range_to_floxlog accepts a date range and optionally a local mirror cache. Missing zips are downloaded from data.binance.vision in parallel and reused on follow-up calls. The function signature:
binance.range_to_floxlog(
symbol: str,
market: str, # "spot" | "um-futures" | "cm-futures"
date_from: str | date, # YYYY-MM-DD, inclusive
date_to: str | date, # YYYY-MM-DD, inclusive
out_tape: str | Path,
*,
mirror: str | Path | None = None,
parallel: int = 4,
symbol_id: int = 1,
skip_missing: bool = False,
...
) -> ConvertStats
Without mirror, the converter writes to a tempdir and discards it after the range is done. Set skip_missing=True to keep going if Binance has not published a particular day yet (common at the head of the archive).
CLI¶
flox archive binance is the same surface from the shell. For a date range:
flox archive binance \
--symbol BTCUSDT \
--market um-futures \
--from 2024-01-01 --to 2024-01-31 \
--out ./tapes/binance-um-BTCUSDT \
--mirror ./.cache/binance \
--parallel 4
For one-off conversions of a file already on disk, pass --csv instead of the date range:
flox archive binance \
--csv ./BTCUSDT-aggTrades-2024-01-15.zip \
--out ./tapes/binance-um-BTCUSDT \
--symbol BTCUSDT --market um-futures --symbol-id 1
Append-safe and idempotent¶
The converter dedups on agg_trade_id against any trades already in the target tape. Re-running the same day, or running an overlapping range, is a no-op: every previously-imported row is skipped, and the writer adds zero new records. The reported rows_skipped counter shows how many rows were elided.
metadata.json¶
The tape's metadata.json is created (or merged into) on every successful conversion. MergedTapeReader keys symbols by (metadata.exchange, name), so the Binance archive tapes line up against tapes captured live by the recorder hook. The symbol IDs picked here do not need to match what the live recorder picked; the reader rekeys both into a global ID space at read time.
What is and is not in the archive¶
aggTrades only carries print events. There is no book information. The archive that does carry it (depth20 and T1 / bookTicker) is tracked separately, so the produced tapes can later be combined into a single floxlog directory with both trades and book updates.
Other public archives (Bybit, OKX, Bitget) follow the same pattern but ship as separate subcommands under flox archive; see the multi-exchange reader task in the roadmap.