Train a reinforcement-learning agent on a flox tape

flox_py.rl_env.FloxTradingEnv is a Gymnasium-compatible environment that drives an agent through the trades in a captured .floxlog tape. It speaks the standard Env protocol (reset, step, render, close, action_space, observation_space, metadata) without importing gymnasium itself, so plugging it into stable_baselines3, RLlib, or CleanRL does not pull gymnasium into flox's dependency surface. Install gymnasium and the learner of your choice, and the env works out of the box.

Phase 1 stays narrow: trade-by-trade replay, scalar qty, three discrete actions (hold, long, short). Continuous action spaces, multi-instrument portfolios, and limit-order semantics are Phase 2 follow-ups.

Quick start

from flox_py.rl_env import FloxTradingEnv

env = FloxTradingEnv.from_tape(
    "./tapes/bybit-btc-2026-05-07",
    qty=0.01,
    window_size=16,
)

obs, info = env.reset(seed=42)
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # plug in your policy here
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(total_reward, info["realized_pnl"], info["position"])

from_tape loads the entire tape into memory at construction time. For long captures, slice the tape upstream; the env constructor also accepts a plain list of (ts_ns, price, qty, side) tuples if you want to drive it from a non-tape source.
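
A minimal sketch of the list-driven form; the positional trades argument and the "buy"/"sell" side strings are assumptions about the constructor, not confirmed signatures:

from flox_py.rl_env import FloxTradingEnv

trades = [
    # (ts_ns, price, qty, side)
    (1_715_040_000_000_000_000, 64_000.0, 0.002, "buy"),
    (1_715_040_000_050_000_000, 64_001.5, 0.010, "sell"),
    (1_715_040_000_120_000_000, 63_998.0, 0.004, "buy"),
]
env = FloxTradingEnv(trades, qty=0.01, window_size=16)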

Action and observation spaces

The default action space is Discrete(3):

Action  Meaning
0       Hold (no order).
1       Go long qty (or stay long if already there).
2       Go short qty (or stay short if already there).

Switching position closes the existing one at the current price (realizes PnL) and opens the new one.
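
Concretely, using the info keys from the quick start (and assuming short positions report as a negative quantity):

obs, info = env.reset(seed=0)
env.step(1)  # open a long of size qty at the current trade price
obs, reward, terminated, truncated, info = env.step(2)  # flip: close the long, open a short
print(info["realized_pnl"])  # PnL realized by closing the long
print(info["position"])      # now the short, e.g. -0.01 for qty=0.01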

The default observation is a Box of shape (window_size + 2,):

  • The first window_size entries are the most recent prices, normalized by the first observed price (so the values stay around 1.0).
  • One entry for current position quantity (signed).
  • One entry for unrealized PnL since the last position change.

Override window_size at construction. If you want a different observation, subclass FloxTradingEnv and override _observation.
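
As a sketch, here is a subclass that swaps the normalized price window for one-step log returns while keeping the trailing position and unrealized-PnL entries; it relies only on the default layout described above and assumes _observation returns a NumPy array:

import numpy as np

from flox_py.rl_env import FloxTradingEnv

class LogReturnEnv(FloxTradingEnv):
    def _observation(self):
        base = super()._observation()  # default (window_size + 2,) vector
        prices = base[:-2]             # normalized price window
        # One-step log returns; the first entry pads to zero so the shape is unchanged.
        returns = np.diff(np.log(prices), prepend=np.log(prices[:1]))
        return np.concatenate([returns, base[-2:]]).astype(np.float32)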

Reward

The default reward is the change in total PnL (realized plus unrealized) since the previous step. Pass reward_fn=lambda env, ctx: ... to compute your own; the callback receives the env and a context dict (ts_ns, price, position, realized_pnl, unrealized_pnl, step) and returns a float.

def risk_adjusted_reward(env, ctx):
    # Total PnL now, minus the env's record of it at the previous step
    # (reading the internal _last_total_pnl attribute).
    total_pnl = ctx["realized_pnl"] + ctx["unrealized_pnl"]
    pnl_delta = total_pnl - env._last_total_pnl
    drawdown_penalty = max(0.0, -ctx["unrealized_pnl"]) * 0.1
    return float(pnl_delta - drawdown_penalty)

env = FloxTradingEnv.from_tape(path, reward_fn=risk_adjusted_reward)

The default ignores transaction costs entirely. For honest training, layer fees and slippage into the reward function or compose this env with flox_py.SimulatedExecutor (which already handles fees and queue-aware fills) for the fill side.
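
If you take the reward-function route, here is a minimal sketch of charging a flat taker fee on every position change. The 0.0006 rate, the make_fee_aware_reward helper, and the stateful closure are illustrative assumptions; a real version should also reset the closure state across episodes:

def make_fee_aware_reward(fee_rate=0.0006):
    state = {"prev_total": 0.0, "prev_position": 0.0}

    def reward_fn(env, ctx):
        total = ctx["realized_pnl"] + ctx["unrealized_pnl"]
        pnl_delta = total - state["prev_total"]
        # Charge a taker fee on the notional traded whenever the position changes;
        # a flip from +qty to -qty trades 2 * qty of notional.
        traded = abs(ctx["position"] - state["prev_position"])
        fee = traded * ctx["price"] * fee_rate
        state["prev_total"] = total
        state["prev_position"] = ctx["position"]
        return float(pnl_delta - fee)

    return reward_fn

env = FloxTradingEnv.from_tape(path, reward_fn=make_fee_aware_reward())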

Plugging into stable_baselines3

from stable_baselines3 import PPO
from flox_py.rl_env import FloxTradingEnv

env = FloxTradingEnv.from_tape("./tapes/btc-2026-05-07", qty=0.01)
# stable_baselines3 wraps any Gymnasium env; FloxTradingEnv passes
# the duck-typed Env protocol it expects.
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)

stable_baselines3 does not introspect the env class identity; it calls the documented Env methods. The duck-typed _DiscreteSpace and _BoxSpace mirror the gymnasium API closely enough for both observation and action sampling to work. If your library is stricter, wrap the env in gym.make with a custom registration or substitute the spaces with real gymnasium.spaces.Discrete / gymnasium.spaces.Box objects post-construction.
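
A sketch of the post-construction substitution, assuming the default Discrete(3) action space and the (window_size + 2,) observation layout from above; the unbounded float32 Box bounds are an illustrative choice, not the env's declared ones:

import numpy as np
import gymnasium as gym

from flox_py.rl_env import FloxTradingEnv

env = FloxTradingEnv.from_tape("./tapes/btc-2026-05-07", qty=0.01, window_size=16)
# Swap the duck-typed spaces for real gymnasium objects.
env.action_space = gym.spaces.Discrete(3)
env.observation_space = gym.spaces.Box(
    low=-np.inf, high=np.inf, shape=(16 + 2,), dtype=np.float32
)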

What this is not

  • A backtest in the flox_py.bundle sense. The env replays trades and applies idealized fills (current trade price, instant). Realistic fills (slippage, queue position, latency) need flox_py.SimulatedExecutor and the latency-models module on top.
  • A multi-symbol portfolio simulator. One symbol per env. Multi-instrument is Phase 2.
  • A live agent runtime. The env consumes a captured tape; for live RL you need an outer loop that feeds new trades and re-runs the policy. The same step / reset API applies, but the data plumbing is up to you.

See also