Optimize Performance¶
Tune FLOX for minimum latency.
Build Optimization¶
Release Build¶
FLOX's default release flags are -O3 -march=native -flto (see the checklist at the end of this page).
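A standard CMake Release configure picks these up (a sketch; verify the actual flags in the project's CMakeLists.txt, e.g. with make VERBOSE=1):

```bash
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j
```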
Link-Time Optimization¶
LTO is already enabled by default via -flto. Ensure your compiler and linker support it.
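A quick toolchain check (generic GCC/Clang, not FLOX-specific):

```bash
# If this fails, the compiler or its linker plugin lacks LTO support.
echo 'int main() { return 0; }' | g++ -x c++ -flto -O2 - -o /dev/null && echo "LTO OK"
```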
Event Bus Tuning¶
Capacity¶
Default: 4096 events. For high-frequency feeds, increase it (globally, or per bus) so the ring can absorb your worst burst. Capacity must be a power of 2.
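Where the capacity knob lives depends on your FLOX version (a CMake option or a constant in the bus definition; both are assumptions here), but the sizing arithmetic mirrors the pool math below: cover the worst burst the slowest consumer must absorb, rounded up to a power of two. A self-contained sketch:

```cpp
#include <cstddef>

// Round up to the next power of two, since bus capacity must be a power of 2.
constexpr std::size_t nextPow2(std::size_t n) {
  std::size_t p = 1;
  while (p < n) p <<= 1;
  return p;
}

// events_per_sec: peak feed rate; worst_stall_ms: longest pause of the
// slowest consumer (e.g. a logging flush or page fault).
constexpr std::size_t busCapacity(std::size_t events_per_sec, std::size_t worst_stall_ms) {
  return nextPow2(events_per_sec * worst_stall_ms / 1000 + 1);
}

static_assert(busCapacity(100'000, 50) == 8192,
              "100k events/sec with a 50 ms stall needs 8192 slots");
```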
Consumer Limit¶
Default: 128 consumers. Adjust the limit if you attach more subscribers than that.
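The option name below is hypothetical, shown only to illustrate where such a change would go; check FLOX's CMake options or config headers for the real knob:

```bash
# Hypothetical cache variable -- verify the actual name in FLOX's build files.
cmake .. -DFLOX_EVENT_BUS_MAX_CONSUMERS=256
```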
Memory Optimization¶
Pre-allocate Pools¶
Size pools to handle peak load without exhaustion:
```cpp
// For 3 consumers at 10 ms processing time, 1000 events/sec:
//   in-flight ≈ 3 × 10 ms × 1000/sec = 30 events
// Add headroom: 64-128
pool::Pool<BookUpdateEvent, 128> bookPool;
```
Monitor pool usage:
```cpp
size_t inUse = bookPool.inUse();
if (inUse > threshold) {
  FLOX_LOG("Warning: pool usage high: " << inUse);
}
```
Avoid Allocations in Hot Path¶
Don't:
```cpp
void onTrade(const TradeEvent& ev) {
  auto data = std::make_unique<Data>();                        // BAD: allocation
  std::string s = std::to_string(ev.trade.price.toDouble());   // BAD: allocation
}
```
Do:
```cpp
class MyStrategy : public IStrategy {
  Data _data;         // Pre-allocated member
  char _buffer[128];  // Pre-allocated buffer

  void onTrade(const TradeEvent& ev) {
    _data.process(ev);  // Use pre-allocated state
    snprintf(_buffer, sizeof(_buffer), "%.2f", ev.trade.price.toDouble());
  }
};
```
CPU Optimization¶
CPU Affinity¶
See Configure CPU Affinity for details.
Quick setup:
```cpp
#if FLOX_CPU_AFFINITY_ENABLED
tradeBus.setupOptimalConfiguration(TradeBus::ComponentType::MARKET_DATA, true);
#endif
```
Disable Frequency Scaling¶
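On Linux, switch every core to the performance governor so the clock does not ramp down between bursts (uses the standard cpupower tool; not FLOX-specific):

```bash
sudo cpupower frequency-set -g performance

# Verify: every core should report "performance"
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```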
Isolate CPUs¶
Kernel parameters:
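A typical set for dedicating cores 2 and 3 to the hot path (the CPU list is illustrative; append to the kernel command line, e.g. GRUB_CMDLINE_LINUX, then update GRUB and reboot):

```bash
# isolcpus: keep the scheduler off these cores
# nohz_full: stop the periodic timer tick on them
# rcu_nocbs: move RCU callbacks elsewhere
isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3
```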
Strategy Optimization¶
Filter Early¶
```cpp
void onTrade(const TradeEvent& ev) {
  // Reject irrelevant events on the first lines, before any real work
  if (ev.trade.symbol != _symbol) return;
  if (ev.trade.price < _minPrice) return;

  // Expensive processing only for relevant events
  processSignal(ev);
}
```
Avoid Branches in Hot Path¶
```cpp
// BAD: Branch in tight loop
for (const auto& level : book.bids) {
  if (level.price > threshold) {
    total += level.quantity;
  }
}

// BETTER: Branchless or predictable branch
// (compiler may optimize, but be aware)
```
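A sketch of the branchless form, written against plain doubles rather than FLOX's Price/Quantity types (which may require an explicit toDouble() conversion):

```cpp
#include <vector>

struct Level {
  double price;
  double quantity;
};

// The comparison becomes a 0/1 multiplier, so every iteration executes the
// same instructions regardless of the data -- no branch to mispredict.
double sumAboveThreshold(const std::vector<Level>& bids, double threshold) {
  double total = 0.0;
  for (const Level& level : bids) {
    total += level.quantity * static_cast<double>(level.price > threshold);
  }
  return total;
}
```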
Cache-Friendly Access¶
```cpp
// BAD: Random access
for (int i : randomIndices) {
  process(data[i]);
}

// GOOD: Sequential access
for (const auto& item : data) {
  process(item);
}
```
Logging Optimization¶
Disable in Hot Path¶
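The simplest win is to keep FLOX_LOG out of per-event callbacks entirely and report aggregates from a non-critical thread; a sketch (the counter and the reporting loop are illustrative, not FLOX components):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

std::atomic<std::uint64_t> tradeCount{0};

void onTrade(const TradeEvent& ev) {
  tradeCount.fetch_add(1, std::memory_order_relaxed);  // no formatting, no I/O
  // ... signal logic ...
}

// Runs on a non-critical thread, outside the hot path
void reportLoop() {
  for (;;) {
    std::this_thread::sleep_for(std::chrono::seconds(1));
    FLOX_LOG("trades/sec: " << tradeCount.exchange(0, std::memory_order_relaxed));
  }
}
```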
Use Conditional Logging¶
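If some logging has to stay near the hot path, gate it behind a cheap, predictable flag so formatting and I/O are only paid when explicitly enabled (the _debugLogging member is illustrative):

```cpp
class MyStrategy : public IStrategy {
  bool _debugLogging = false;  // flip on only while diagnosing

  void onTrade(const TradeEvent& ev) {
    if (_debugLogging) {  // predictable branch; near-zero cost when disabled
      FLOX_LOG("trade px=" << ev.trade.price.toDouble());
    }
  }
};
```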
Profiling¶
Enable Tracy¶
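Tracy itself is switched on by building with its TRACY_ENABLE define; whether FLOX exposes a dedicated CMake option for this is version-dependent, so the option name below is an assumption:

```bash
# Hypothetical option name -- check FLOX's CMakeLists.txt for the real one.
cmake .. -DFLOX_ENABLE_PROFILING=ON
# Then attach the Tracy profiler GUI to the running process.
```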
Use profiling macros:
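With Tracy enabled, the upstream macros are ZoneScoped / ZoneScopedN from tracy/Tracy.hpp; if FLOX ships its own wrapper macros, prefer those:

```cpp
#include <tracy/Tracy.hpp>

void onTrade(const TradeEvent& ev) {
  ZoneScopedN("MyStrategy::onTrade");  // named zone shown on the Tracy timeline
  processSignal(ev);
}
```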
Measure Latency¶
```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

class LatencyTracker {
  std::vector<int64_t> _samples;

public:
  LatencyTracker() { _samples.reserve(1 << 20); }  // pre-allocate so record() does not allocate

  void record(int64_t latency_ns) { _samples.push_back(latency_ns); }

  void report() {
    if (_samples.empty()) return;
    std::sort(_samples.begin(), _samples.end());
    const size_t n = _samples.size();
    std::cout << "p50: " << _samples[static_cast<size_t>(n * 0.50)] << " ns\n";
    std::cout << "p99: " << _samples[static_cast<size_t>(n * 0.99)] << " ns\n";
    std::cout << "max: " << _samples.back() << " ns\n";
  }
};
```
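Typical usage inside the strategy's onTrade: timestamp with a steady clock and record the delta, where _latency is a LatencyTracker member (an illustrative name):

```cpp
#include <chrono>

void onTrade(const TradeEvent& ev) {
  const auto t0 = std::chrono::steady_clock::now();
  processSignal(ev);
  const auto dt = std::chrono::steady_clock::now() - t0;
  _latency.record(std::chrono::duration_cast<std::chrono::nanoseconds>(dt).count());
}
```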
Compression Trade-offs¶
For replay:

- No compression: fastest read, largest files
- LZ4: ~3-5x compression, small CPU overhead
For recording, LZ4 is usually worth it.
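A back-of-envelope check using the ~3-5x figure above (the event size and rate are illustrative numbers, not FLOX measurements):

```cpp
#include <cstdio>

int main() {
  const double events_per_sec = 100'000.0;  // illustrative feed rate
  const double bytes_per_event = 64.0;      // illustrative serialized size
  const double raw = events_per_sec * bytes_per_event / 1e6;  // MB/s to disk, uncompressed
  const double lz4 = raw / 4.0;                               // assuming ~4x compression
  std::printf("raw: %.1f MB/s  lz4: %.1f MB/s\n", raw, lz4);  // ~6.4 vs ~1.6
  return 0;
}
```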
Network Optimization¶
Socket Tuning¶
```bash
# Increase receive buffer
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.rmem_default=16777216
```
Busy-poll¶
For the lowest latency on kernel 4.11+, enable socket busy polling.
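The value is in microseconds; 50 is a common starting point (trades CPU for lower wakeup latency):

```bash
sudo sysctl -w net.core.busy_read=50
sudo sysctl -w net.core.busy_poll=50
```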
Checklist¶
- [ ] Release build with `-O3 -march=native -flto`
- [ ] EventBus capacity sized for peak load
- [ ] Object pools pre-allocated
- [ ] No allocations in callbacks
- [ ] CPU affinity configured (if dedicated hardware)
- [ ] CPU frequency scaling disabled
- [ ] Logging disabled during benchmarks
- [ ] Profiling enabled during tuning
- [ ] Socket buffers increased
- [ ] Kernel parameters tuned (isolated CPUs, etc.)
Benchmarking¶
Run included benchmarks:
```bash
cmake .. -DFLOX_ENABLE_BENCHMARKS=ON
make -j
./benchmarks/binary_log_benchmark
./benchmarks/nlevel_order_book_benchmark
```
Run your own latency measurements to establish a baseline for your hardware.
See Also¶
- Configure CPU Affinity — Thread pinning
- The Disruptor Pattern — Understanding latency
- Memory Model — Zero-allocation design