You’ve done it! You’ve created the one trading strategy that will rule them all. You’ve backtested it on your laptop against Snowflake or a flat file, and it’s now ready for some real, streamed data. You hook it up to the options or multi-exchange websocket client in firehose mode (you know, so you get all the things), hit start, and are excited to see some #winning results. Instead, you hear the aircraft engine-like noise of your laptop fans speeding up to maximum, and you singe some leg hair as the bottom of the laptop turns into lava. You catch a message on the screen saying something about being out of memory before it goes dark, and a puff of smoke rising from the keyboard signals its last breath.
“What just happened?” you ask yourself. After going through some checklists in your head, you assure yourself it worked just fine against AAPL and MSFT with static data before hooking up the websocket, so that must be the culprit, right? Time for some sanity checks and napkin math. After prying your melted laptop off the table you flung it onto, you run your app again, but this time hooking it up to only AAPL. You take note that your app eats up only 5% CPU and 1GB of RAM while running, but you get a sinking feeling as you wonder, “How many stocks/options are there?” A quick search confirms the reason your laptop turned into a yellow star – you failed to consider how your app scales.
For reference, there are around 1.4 million unique option contracts and 12,000 equities on any given day of trading. If you’re working ahead, you know where this is going – your app would need roughly 600 CPU cores and 12TB of RAM if you were to try to fit it on one machine without doing any tuning. I’m not aware of any laptops with those specs, and I know that sounds like a challenge to someone. Please let me know if you pursue that. Your goal now is to figure out how to get your app running at full bore, subscribed to all the equities or options.
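(If you want to redo that napkin math yourself, here’s a quick sketch in Python – the language is my assumption, and the per-symbol numbers come straight from the AAPL test run above. The big simplification is assuming per-symbol cost stays roughly flat as you add symbols.)

```python
# Napkin math: scale the single-symbol measurements from the AAPL run.
# Assumes per-symbol cost stays roughly constant, which is good enough
# for a first estimate and nothing more.
cpu_per_symbol = 0.05       # 5% of one core per symbol
ram_per_symbol_gb = 1.0     # 1 GB per symbol
symbols = 12_000            # roughly the number of listed equities

print(f"cores needed: {cpu_per_symbol * symbols:,.0f}")                # ~600
print(f"RAM needed:   {ram_per_symbol_gb * symbols / 1000:,.1f} TB")   # ~12
```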
There are going to be two main efforts in your plan, each important to scalability. The first is the performance of your code, and the second is how you slice up the work. Each has quick gains, and each has diminishing returns, so we’re going to try to find a sweet spot for how much time we spend on each. As you might imagine, slicing up the work will allow you to use more than one machine and more than one core to team up on the work, while the performance will dictate how much work each slice can handle. If you increase the performance of a single slice, you can fit more slices on each machine.
The first thing I’d do is take a broad-strokes first pass over the code to see if there are any glaring, easy-win performance gains to be had. These are things like swapping in an O(n), O(log n), or constant-time algorithm as a drop-in replacement for an O(n²) algorithm. It helps to be familiar with a broad range of these and how they work, both so you can spot them and so you can pick the best tool for the job. Great! You updated your simple moving average calculation to be constant time instead of linear time. Don’t spend too much time getting nit-picky on this first pass, though. We’ll likely be refactoring while slicing up the code and then iterating on both of these steps a few times.
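To make that moving-average example concrete, here’s a hypothetical sketch of the constant-time version: keep a running total and adjust it as prices enter and leave the window, instead of re-summing the whole window on every tick. The RollingSMA name and shape are illustrative, not code from the app above.

```python
from collections import deque

class RollingSMA:
    """Constant-time simple moving average: maintain a running sum and adjust
    it as prices enter and leave the window, rather than re-summing per tick."""

    def __init__(self, window: int):
        self.window = window
        self.prices = deque()
        self.total = 0.0

    def update(self, price: float) -> float:
        self.prices.append(price)
        self.total += price
        if len(self.prices) > self.window:
            self.total -= self.prices.popleft()  # drop the oldest price
        return self.total / len(self.prices)
```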
Slicing up the code involves a careful look at the high-level flow of your app. We’re going to be looking for natural boundaries for splitting up the work, while keeping in mind the expectations and limitations of the different phases. Generally speaking, many apps of this type have the same major parts: ingesting the data, storing and caching the data, and analyzing (doing something useful with) the data. If you imagine a large pipe flowing vertically, those parts make good horizontal slices of the work, while the equity/option itself makes for good vertical slices. Slicing along both will also give you some knobs to turn later.

Slicing vertically along the equity/option means your method signatures and code structure will be focused around a single symbol. This doesn’t mean your code can’t use some data from another symbol; just don’t make one pipeline depend on, or wait for, another as a hard requirement, or you lose the ability to run these pipes in parallel.

Slicing horizontally along the major parts of the app suggests (and I highly recommend) decoupling at those boundaries. This is usually done via a queue mechanism between the major parts of your app, and it keeps the performance of one part from necessarily dragging down another. Ingestion is, and has to be, the fastest part; storing/caching is the second fastest; and analysis is usually the slowest. If your ingestion stage can’t even handle the bandwidth of the data stream while doing no complex processing, the other parts have no hope, either. This is why I especially suggest decoupling the ingestion stage – if that stage can’t keep up, you’re either going to miss ticks or slowly run out of memory as the events pile up.
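Here’s a toy sketch of that decoupling using in-process bounded queues and threads. In a real deployment the queues would more likely be something like Kafka or Redis streams and the stages would be separate processes or machines; the stage functions and fake ticks below are made up purely for illustration.

```python
import queue
import random
import threading
import time

raw_ticks = queue.Queue(maxsize=10_000)     # ingestion -> storage/caching
stored_ticks = queue.Queue(maxsize=10_000)  # storage/caching -> analysis

def ingest():
    # Stand-in for the websocket client: emit fake ticks as fast as possible,
    # doing as little work as we can get away with in this stage.
    while True:
        raw_ticks.put({"symbol": "AAPL", "price": 150 + random.random()})

def store():
    cache = {}
    while True:
        tick = raw_ticks.get()
        cache[tick["symbol"]] = tick    # keep the latest tick per symbol
        stored_ticks.put(tick)

def analyze():
    while True:
        tick = stored_ticks.get()
        time.sleep(0.001)               # pretend the strategy does real work

for stage in (ingest, store, analyze):
    threading.Thread(target=stage, daemon=True).start()

time.sleep(1)
# Bounded queues make a slow stage show up as a growing backlog you can watch
# (and alert on), instead of silently eating memory inside the ingestion code.
print("queue depths:", raw_ticks.qsize(), stored_ticks.qsize())
```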
Slicing it up horizontally and vertically gives you several options for grouping these pipes. For instance, if your ingestion code is fast enough, you may decide to put the ingestion segments for all equities on one machine. Your caching/data storage may need two machines, so you split the storage/caching segments into two groups based on the equity name or identifier and spread the load across both. Finally, your processing segments have enough work for several machines, so you again split on equity name, this time into four groups spread across four machines. Coming up with these numbers and groupings largely depends on how your particular app performs, so you’re going to need to take some measurements to make some estimates.
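One simple (and again hypothetical) way to do that splitting is a stable hash of the symbol modulo the number of machines in a stage, so every process independently agrees on which machine owns which symbol.

```python
import hashlib

def machine_for(symbol: str, machine_count: int) -> int:
    # Use a real hash rather than Python's built-in hash(), which is salted
    # per process and would shuffle assignments on every restart.
    digest = hashlib.md5(symbol.encode()).digest()
    return int.from_bytes(digest[:4], "big") % machine_count

for symbol in ("AAPL", "MSFT", "SPY", "TSLA"):
    print(symbol, "-> analysis machine", machine_for(symbol, 4))
```

The trade-off is that changing the machine count reshuffles most symbols’ assignments; consistent hashing is the usual answer if that churn matters to you.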
After refactoring your code for the easy-win performance changes and decoupling for your units of work, you can start to get a base expectation for what kind of hardware you’re going to need. Run your app again for a single equity/option and make note of the performance at each stage. Then run it for 5, 10, etc. equities/options to average out the single case. Be sure not to saturate the resources of the machine running these or your results will be skewed.
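One way to take those measurements, assuming your stages run as Python processes and you’re willing to pull in the psutil library (both assumptions on my part), is to sample CPU and resident memory while a stage handles a known number of symbols and divide:

```python
import os
import psutil

proc = psutil.Process(os.getpid())

def per_symbol_cost(symbol_count: int) -> dict:
    # Sample this process over a few seconds, then divide by the symbol count
    # to get a rough per-symbol estimate for the stage being measured.
    cpu_pct = proc.cpu_percent(interval=5.0)      # % of one core over 5 s
    rss_gb = proc.memory_info().rss / 1e9
    return {
        "symbols": symbol_count,
        "cpu_per_symbol_pct": cpu_pct / symbol_count,
        "gb_per_symbol": rss_gb / symbol_count,
    }
```

Run it at 1, 5, 10, and so on symbols, and if the per-symbol numbers stay roughly flat, you can extrapolate to the full universe with some confidence.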
Now that you’ve made a few passes at optimization and have a good estimate of your units of work, you figure out you can host the ingestion and caching on three machines, but doing real-time analysis in your last stage at tick speed is cost-prohibitive at 40 machines. Talking with your team, you determine that one minute is an acceptable update period for the analysis stage, but tick-level data is still required on the storage side. (Hey, it’s still faster than what a Business Intelligence person considers real-time!) One thing you can do at this point, because you decoupled the parts of your app, is change the frequency at which the analysis/processing stage runs. Doing so and retesting brings the analysis requirement down to a manageable two machines. Now you have a fairly confident expectation that this app can run on five machines without melting down like your laptop. You fire up some VMs on AWS, deploy, and let it fly with the firehose connected to all the equities or options. Nothing turns into lava – SUCCESS! You’re quick to open chat to let your team know, but notice an alert telling you your allotment of free logging is poly-gone at dump-a-log.com. You know what that means…
The adventure continues, but today you bested a lava monster, so take the win.