Welcome back to part two of our series on Sqreen’s architecture through the ages. Part one covered Sqreen’s history pre-customers up to the point where we first started supporting our early paying customers. Today, we’re going to take a look at how we scaled up our architecture from the early days to supporting hundreds of customers, as well as the feature richness we added along the way.
Episode III: Scaling up for our first customers
With our first customers on board, people started really relying on the product for the day-to-day safety of their web applications. They expected a top-notch service, and so we had to harden our architecture so everything would run smoothly and reliably at the standards they expected.
Up to this point, all data crunching on attack reports and metrics would be done synchronously whenever the agent sent something. This worked nicely and was pretty easy to monitor, but came at the price of having slow endpoints on the backend. At the beginning of Sqreen, this wasn’t really an issue, but as load grew on the backend this led to a few difficulties.
First, we needed quite a few live backend instances to handle all the traffic our agents were generating. We initially solved this with autoscaling, scaling up the number of backend containers available (with a simple heuristic to scale the underlying EC2 capacity based on that number).
Second, this wasn’t particularly efficient performance-wise for our customers’ applications. The agent would constantly be waiting for the Sqreen backend to answer, consuming resources and consistently showing up as an issue in customers’ performance reporting. We solved this by keeping only the bare minimum in the request critical path on the backend and pushing everything else to an asynchronous “later” using SQS. The asynchronous data crunching part became known as “the digestors” in Sqreen slang.
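To make the split concrete, here is a minimal sketch of the pattern, not Sqreen’s actual code. An in-process `queue.Queue` stands in for SQS, and the names `handle_agent_report`, `digest`, and `digestor_loop` are hypothetical. The endpoint only validates and enqueues, while a separate worker does the slow crunching.

```python
import json
import queue
import threading

# Assumption: an in-process queue stands in for SQS in this sketch.
report_queue = queue.Queue()
processed = []  # stands in for whatever datastore the digestors write to


def handle_agent_report(payload: dict) -> dict:
    """Request critical path: minimal validation, enqueue, answer right away."""
    if "app_id" not in payload:
        return {"status": 400}
    report_queue.put(json.dumps(payload))
    return {"status": 202}


def digest(message: str) -> None:
    """The slow data crunching, done off the request path."""
    report = json.loads(message)
    processed.append((report["app_id"], len(report.get("attacks", []))))


def digestor_loop() -> None:
    """Worker loop: consume messages until a None sentinel arrives."""
    while True:
        message = report_queue.get()
        if message is None:
            break
        digest(message)


def run_demo():
    worker = threading.Thread(target=digestor_loop, daemon=True)
    worker.start()
    response = handle_agent_report(
        {"app_id": "app-1", "attacks": [{"type": "sqli"}]}
    )
    report_queue.put(None)  # tell the worker to stop
    worker.join()
    return response, processed
```

The agent gets its answer as soon as the message is queued, so the endpoint’s latency no longer depends on how long the crunching takes.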
Creating beautiful dashboards
A data-centric dashboard wouldn’t really be complete without charts. So to make eye-catching charts, we started aggregating the data reported by the agent. In the very beginning, this was done using a CRON job that ran every minute. That worked very well up until the point where we had so many users that it took us more than a minute to calculate each minute’s data points. When that happened, things did not turn out very well!
The fix here was to convert the batch processing to something much more continuous. By tuning the structures in MongoDB, and adding pieces of code in the digestors to use MongoDB primitives like $inc to atomically update these structures, we got the eye-catching graphs we wanted, always up to date. Processing power would also scale nicely by launching more digestors consuming the SQS queue (these can also be made to autoscale).
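As an illustration of the continuous approach, the sketch below builds a per-minute bucket update using $inc. The document layout (`app_id`, `minute`, `counts.<metric>`) and the function names are assumptions made for this example, not Sqreen’s actual schema. The `collection` argument is expected to behave like a pymongo collection, whose `update_one` accepts a filter, an update document, and `upsert`.

```python
from datetime import datetime, timezone


def minute_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its minute."""
    return ts.replace(second=0, microsecond=0)


def build_increment(app_id: str, metric: str, ts: datetime, amount: int = 1):
    """Build the filter and $inc update for one data point.

    Assumption: one document per application per minute, with counters
    stored under a hypothetical "counts" sub-document.
    """
    flt = {"app_id": app_id, "minute": minute_bucket(ts)}
    update = {"$inc": {f"counts.{metric}": amount}}
    return flt, update


def record_event(collection, app_id: str, metric: str, ts: datetime,
                 amount: int = 1) -> None:
    """Atomically bump the counter; upsert creates the bucket on first write."""
    flt, update = build_increment(app_id, metric, ts, amount)
    collection.update_one(flt, update, upsert=True)
```

Because $inc is applied atomically by the database, several digestors can update the same bucket concurrently without a read-modify-write cycle in application code.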
Improving the agents
On the agent side, we also added more security rules as time went on. The addition of the new rules brought us to the point where calculating the full package of rules (or rules pack) took quite a while. Additionally, each application gets its own rules pack, depending on its configuration and environment. Fortunately, the rules themselves do not change that often (a few times a week at most) and are signed offline, so a full pack of rules can be cached fairly well. To share the cache, we added a small Memcached server (hosted by AWS ElastiCache) in our cluster.
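A rough sketch of how such a cache could be keyed follows. Everything here is an assumption for illustration: the `RulesPackCache` class, the idea of a `rules_version` string that changes whenever the rules change, and the plain dict standing in for Memcached.

```python
import hashlib
import json


class RulesPackCache:
    """Cache rules packs by rules version and application configuration."""

    def __init__(self, build_pack, store=None):
        self._build_pack = build_pack  # the expensive computation
        # Assumption: a dict stands in for the shared Memcached server.
        self._store = {} if store is None else store

    @staticmethod
    def cache_key(app_config: dict, rules_version: str) -> str:
        # Canonical JSON so equal configurations always hash the same way.
        canonical = json.dumps(app_config, sort_keys=True)
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        return f"rulespack:{rules_version}:{digest}"

    def get(self, app_config: dict, rules_version: str):
        key = self.cache_key(app_config, rules_version)
        pack = self._store.get(key)
        if pack is None:
            pack = self._build_pack(app_config)
            self._store[key] = pack
        return pack
```

Putting the rules version in the key means a rules update naturally misses the cache, so stale packs are never served and no explicit invalidation is needed.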
With more rules came new complaints about slowdowns in customer applications. We realized we were pretty blind to this and had almost no way of investigating issues without asking our customers to share their applications with us. So we tried a few solutions: logging everything (big performance impact, hard to analyze), pushing performance metrics to the customer’s APM of choice (nice for customers, but doesn’t get the data to Sqreen), and pushing performance metrics directly to Sqreen. That last approach would work best from the Sqreen perspective, but it is very data-intensive, as remotely sending a payload of data for each function call in an application leads to even bigger performance issues.
To make sending performance data to Sqreen feasible, the data needed to be aggregated first. We tried sending average data, but since we were interested in outliers this was too coarse. In the end, inspired by the work on t-digests (Paper, implementation) we ended up sending a histogram of the captured values to Sqreen. The histogram uses a predefined set of buckets that are geometrically increasing so we have more precision on relatively small values. For each period, the agent accumulates which measurements fall into which buckets and at the end of the period delivers the data. On the backend side, it then becomes trivial to aggregate all these histograms and get a nice timeseries of the evolution. This let us solve the performance and slowdown issues for customers.
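The following sketch shows the fixed-bucket idea described above. Note that it is a plain histogram with geometric bucket bounds, not a t-digest, and that the bucket parameters (`first`, `ratio`, `count`) and all names are illustrative assumptions rather than the values the agent actually uses.

```python
import bisect
from collections import Counter


def geometric_bounds(first: float = 0.1, ratio: float = 2.0, count: int = 16):
    """Upper bounds of the buckets: first, first*ratio, first*ratio^2, ..."""
    return [first * ratio ** i for i in range(count)]


class PeriodHistogram:
    """Accumulates measurements for one reporting period (agent side)."""

    def __init__(self, bounds):
        self.bounds = bounds
        self.counts = Counter()

    def record(self, value_ms: float) -> None:
        # Index of the first bound >= value. Values above the last bound
        # land in an overflow bucket with index len(bounds).
        self.counts[bisect.bisect_left(self.bounds, value_ms)] += 1

    def flush(self) -> dict:
        """Return the period's counts and start a new period."""
        payload = dict(self.counts)
        self.counts = Counter()
        return payload


def merge(histograms) -> dict:
    """Backend side: histograms with the same bounds simply add up."""
    total = Counter()
    for histogram in histograms:
        total.update(histogram)
    return dict(total)
```

Since every agent uses the same predefined bounds, the payload is just a small map of bucket index to count, and merging across agents or periods is a per-bucket sum.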
Episode IV: Adding more features to Sqreen
Alongside our growing customer base, we expanded the feature set in Sqreen. I want to touch on a few of the bigger ones we developed during this time, and how they impacted our architecture.
One of the big features we added was user monitoring. User monitoring works by having the agent use its dynamic instrumentation mechanism to hook onto authentication routines and monitor the successes & failures. This information is sent to Sqreen, which aggregates data across all servers into a single coherent view, enabling detection & alerting. This was Sqreen’s first foray into a more monitoring-focused workload. This time, the backend captured data on everything going smoothly, not just “bad activity” reports like in the past. The volume of data quickly grew, and this data, which was first persisted in MongoDB, had to be moved to a more scalable datastore (AWS DynamoDB).
Content Security Policy
Another feature came with an even bigger potential load attached to it than user monitoring. Amongst the means to defend against XSS attacks, one excellent kind of protection is the Content Security Policy (CSP). This protection is applied at the browser level. When a browser gets a webpage with a CSP header attached, it will only load assets that are deemed OK by the policy. The standard also enables the browser to send a report back to a reporting URL when something is amiss.
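For readers unfamiliar with CSP, here is a small sketch of what such a header can look like. The helper name, the example policy, and the reporting URL are made up for illustration; `report-uri` is the standard CSP directive that tells the browser where to send violation reports.

```python
def build_csp_header(directives: dict, report_url: str):
    """Build a (name, value) pair for a Content-Security-Policy header.

    `directives` maps a directive name to its list of allowed sources.
    """
    parts = [f"{name} {' '.join(sources)}" for name, sources in directives.items()]
    parts.append(f"report-uri {report_url}")
    return "Content-Security-Policy", "; ".join(parts)


# Hypothetical policy: only load assets from the site itself, plus scripts
# from one CDN, and report violations to an example endpoint.
header_name, header_value = build_csp_header(
    {
        "default-src": ["'self'"],
        "script-src": ["'self'", "https://cdn.example.com"],
    },
    "https://reports.example.com/csp",
)
```

With this policy, a script injected from any other origin is blocked by the browser, which then posts a JSON violation report to the reporting URL.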
Sqreen helps users configure their CSP policies more easily by analyzing the reports sent by browsers. Sqreen’s backend was already pretty good at scaling, but this time, instead of dealing with agent traffic, it’s the browsers of our customers’ users that send their policy violation reports. This is a potentially unbounded and very unpredictable number of requests. Fortunately, we don’t really need to analyze every CSP report: what interests us is the aggregation of all this data.
A bit wary of the potential load, we decided to build a system that would isolate this traffic from the backend. The Backend for Browser (or BFB) was born. This is a very simple Flask application with a single POST endpoint that collects and aggregates all of the CSP reports per customer. Then, every so often, a CRON job on the Sqreen backend looks into the aggregated report database and copies the top N reports to its own datastore. The BFB aggressively rate-limits requests per application so it can keep a decent cross-section of the data without getting too overloaded. The rate-limiting information is shared across servers using Memcached. To better isolate the Sqreen backend, the BFB has its own database servers.
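The core of that design (rate-limit per application, aggregate identical reports, expose the top N) can be sketched in plain Python. This is an assumption-laden illustration, not the BFB’s code: the class name, the fixed-window limiter, and the choice of aggregation key are all hypothetical, and in-process dicts stand in for Memcached and the BFB database. The `violated-directive` and `blocked-uri` fields are part of the standard CSP report body.

```python
from collections import Counter, defaultdict


class CspReportAggregator:
    """Rate-limits CSP reports per application and counts identical ones."""

    def __init__(self, max_per_window: int = 100, window_seconds: int = 60):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._windows = {}  # app_id -> (window index, reports accepted)
        self._counts = defaultdict(Counter)  # app_id -> report key -> count

    def _allow(self, app_id: str, now: float) -> bool:
        """Fixed-window limiter: at most max_per_window reports per window."""
        window = int(now // self.window_seconds)
        current, seen = self._windows.get(app_id, (window, 0))
        if current != window:
            current, seen = window, 0  # a new window resets the budget
        if seen >= self.max_per_window:
            self._windows[app_id] = (current, seen)
            return False
        self._windows[app_id] = (current, seen + 1)
        return True

    def collect(self, app_id: str, report: dict, now: float) -> bool:
        """Aggregate one report; returns False when it was rate-limited."""
        if not self._allow(app_id, now):
            return False
        key = (report.get("violated-directive"), report.get("blocked-uri"))
        self._counts[app_id][key] += 1
        return True

    def top_reports(self, app_id: str, n: int):
        """What a periodic job would copy over to the main backend."""
        return self._counts[app_id].most_common(n)
```

Dropping reports beyond the per-window budget loses exact counts but still keeps a representative sample, which is enough when only the aggregate matters.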
Building on all the previous data we collected, we decided to test the waters for a product that would be directly useful to developers: a dedicated API that would give reputation information about an email or IP address. As with CSP, the clients of this API wouldn’t be agents, so the load would probably not have the same profile. Clients of this API are developers and applications that need to query information about a specific actor. As we were only testing the waters, we didn’t want to create something too complex here. As such, we chose to create a small Flask backend with two endpoints connected to our main database, the backend for API (or BFApi). It also had its own dedicated hostname and, building on the tooling we developed for the BFB, was also capable of rate-limiting queries.
At this stage, Sqreen was robust and scalable enough to meet the needs of our growing customer base, both in terms of load and in terms of feature richness. In the final entry to this series, we’ll discuss the transition from product to platform, and some major changes we made to better serve our product vision.