Sqreen’s architecture through the ages: part three

Welcome to part three of the Sqreen architecture through the ages series. In case you missed it, here is part one, and here is part two. In this third and final entry to the series, I’m going to discuss how we leveled up the Sqreen backend to handle the growing scale of users and of the Sqreen team, and the journey we took moving from a self-contained product to a proper platform. That will catch you up to where Sqreen is today from an architecture perspective.

Episode V: The Sqreen backend gets smarter

Up until this stage in Sqreen’s history, the Sqreen architecture had proven pretty scalable, and we didn’t have to change things in a major way. Autoscaling limits and some AWS quotas had to be updated so we could run more machines. We also had to upgrade the size of our Mongo cluster a few times, but otherwise, everything was fairly smooth.

The impetus to change this time didn’t come from ops, but from the product vision. The Sqreen backend is the convergence point of reporting for all agents, so it has a global view of what is happening in our customers’ applications. With more customers, some patterns start to emerge. An attacker IP scanning an application will probably hit most of the computers busy serving it. By detecting a peak of attacks on a subset of computers early, we can often infer that all servers will be attacked later. So if we could detect this fast enough, we could tell all agents for an application to proactively block a nefarious actor (IP address, or logged-in user).

To guarantee the throughput of processing, our old mechanism based on SQS would not cut it. It only guaranteed that things would be digested at some point. Turns out we could use stream processing to get much better results. Once again, AWS has a managed service for this. We chose to use AWS Kinesis for stream processing. This was also discussed at length in previous articles, so I won’t detail more here. Detection is done using multiple streams that are specialized to do one specific task, piped one after another. The description of what is to be done can be exposed to customers on the dashboard. We called this a Playbook. 

Introducing Playbooks

The Sqreen backend was actually already using this central reporting point property to do some detection. For example, we had CRON tasks to detect a peak of attacks (a massive security scan), a peak of login failures (account takeover attempts), and so on. But with more and more applications to handle, these CRON tasks started to misbehave a bit. Fortunately, we had just developed the Playbook capability. Looking back at our CRON detections, they looked very similar to Playbooks. So we converted all these remaining CRON tasks to built-in Playbooks. In doing so, we added the ability for each application to tune the detection threshold, which was previously not easily doable.
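To make the idea concrete, here is a minimal sketch in Python of what a built-in "peak of attacks" detection with a per-application threshold could look like. It is not Sqreen’s actual implementation: the `SpikePlaybook` class, its method names, and the default values are all assumptions made up for illustration, and a real Playbook would run on a stream rather than in memory.

```python
from collections import defaultdict, deque


class SpikePlaybook:
    """Hypothetical playbook: flag an IP that attacks an app too often."""

    def __init__(self, default_threshold=10, window_seconds=60.0):
        self.default_threshold = default_threshold
        self.window_seconds = window_seconds
        self.thresholds = {}                 # per-application overrides
        self._recent = defaultdict(deque)    # (app_id, ip) -> attack timestamps

    def set_threshold(self, app_id, threshold):
        """Let one application tune its own detection threshold."""
        self.thresholds[app_id] = threshold

    def observe(self, app_id, ip, timestamp):
        """Record one attack; return True if the IP exceeds the threshold."""
        events = self._recent[(app_id, ip)]
        events.append(timestamp)
        # Drop attacks that have fallen out of the sliding window.
        while events and events[0] <= timestamp - self.window_seconds:
            events.popleft()
        limit = self.thresholds.get(app_id, self.default_threshold)
        return len(events) > limit
```

A caller would feed every reported attack to `observe` and, when it returns `True`, tell all agents of that application to block the IP. Keeping the threshold in a per-application dictionary is what makes the tuning described above cheap to offer.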

The power of Playbooks thrives on data: the more data, the more interesting detections become possible. However, these are mostly statistical detections (a peak of something, some value above a threshold, etc.). It’s good for alerting, but part of security expertise is also identifying relevant information in a sea of noise. Injecting this expertise is non-trivial, but some of it can actually be distilled into more monitoring security rules, sometimes alongside some backend processing to aggregate information.

This idea led us to create Application Risks, dipping a toe into surfacing not only point-in-time issues but more structural issues in applications. For example, we could warn when an application runs as root or when its clock is desynchronized. Application Risk fit well with our security automation architecture and we were able to add these kinds of capabilities without changing too much there. 

The sprawl of insights

Between user data, package management information, application risks, attack reporting, and more, Sqreen started to detect and monitor many security-related items for our customers’ applications. The only issue was that this data was spread across many different areas of the Sqreen system, which was ok for people with a small number of applications but not great for larger customers. They would often get lost among all the different places in Sqreen where their security information lived. We decided to simplify this and created a system that could be used to query applications based on summaries of the data we had.

To create a query system we added Elasticsearch to the Sqreen toolbox. This enabled us to perform complex queries easily against the accumulated data. Maintaining this search index is actually a pretty complex proposition: most of the information gathered has a freshness with different time-to-live constraints. Once again, a stream proved invaluable to tame the complexity.
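The freshness problem can be illustrated with a small in-memory sketch in Python. This is not how the Elasticsearch index was built; the `Fact` record, the `TTL_SECONDS` table, and the function names are hypothetical and only show why each kind of information needs its own time-to-live when answering a query.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    app_id: str
    kind: str       # e.g. "risk" or "attack" (illustrative kinds)
    value: str
    seen_at: float  # when the fact was last reported, in seconds


# Hypothetical time-to-live per kind of fact, in seconds.
TTL_SECONDS = {"attack": 3600.0, "risk": 86400.0}


def fresh_facts(facts, now):
    """Keep only the facts that have not outlived their kind's TTL."""
    return [f for f in facts if now - f.seen_at < TTL_SECONDS[f.kind]]


def apps_matching(facts, kind, value, now):
    """Return the sorted ids of applications with a fresh matching fact."""
    return sorted(
        {f.app_id for f in fresh_facts(facts, now)
         if f.kind == kind and f.value == value}
    )
```

With these assumed TTLs, a risk reported two hours ago still matches a query while an attack reported at the same time no longer does, which is the kind of bookkeeping the search index has to keep up with continuously.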

Episode VI: Creating a platform

Sqreen’s product complexity didn’t grow in a vacuum. Aligned with the technological growth we’ve been discussing, we also saw tremendous growth of the customer base and revenue team. The pace of growth was increasing too! The first part of the system that started to show strain was all the metric aggregation and storage mechanisms. Digesting metrics using SQS wasn’t efficient enough, nor was storing them in MongoDB. For a while, it was sufficient to store these in DynamoDB. However, DynamoDB is meant to be a hot-data store and is priced as such. We create aggregates from metrics (e.g. for charts) but the metrics themselves are very much cold data. This was economically unsustainable, so we had to invest a bit of time in this.

Raw metrics are never queried through the Sqreen dashboard, but they are sometimes accessed by agent owners through the admin interface (i.e. a few times a month). This means we couldn’t throw this data out, because we might need it. However, given the usage patterns here, we could wait a bit to access it. Enter Athena, an AWS-hosted Presto that is deeply integrated with S3. Athena enables querying cold data files using standard SQL and can even run aggregations against them. To be most efficient though, data has to be stored in a columnar format, like Parquet. Once again, AWS has a dedicated product called AWS Kinesis Firehose that has the ability to read data from a Kinesis Stream and dump it into flat files, potentially converting it to Parquet format first. Now querying our raw metrics was as simple as writing an SQL query and waiting a few seconds for Athena to process the files. This was a great improvement!

Creating Security Signals

Scaling the Sqreen product also meant scaling the product engineering teams. Up to this point, we would hire engineers and onboard them on the totality of Sqreen architecture. Given the increasing overall complexity, it started to take longer and longer for people to get up to speed. Meanwhile, we also wanted to get more data in the product from sources beyond the agent. The problem here is that the backend and the agent backend had been designed in symbiosis. Adding a new kind of object in the agent generally required us to add a new set of endpoints on the backend for agents, new models, and new endpoints in the BFF. We also wanted the option to do more with the data we gather, and even create totally new parts of the product.

In other words, it was time to restructure. Sqreen was not just a product anymore and needed to become a platform on which we could have multiple products running smoothly. This led us to introduce a new vision for how the data would flow in the system: Security Signals. 

At their core, Security Signals are an envelope format that is able to encapsulate all the different payloads we were using. The content of the payload needs to obey schemas that can be validated and upon which other teams of engineers can now build products.
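As a rough illustration of the envelope idea, here is a minimal Python sketch. The field names, the schema identifiers, and the `validate` helper are assumptions invented for this example, not the actual Security Signal format; a real system would more likely rely on a schema language such as JSON Schema.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Hypothetical payload schemas: required field name -> expected type.
PAYLOAD_SCHEMAS = {
    "attack/v1": {"ip": str, "rule": str},
    "app_risk/v1": {"risk": str},
}


@dataclass(frozen=True)
class Signal:
    """Envelope shared by every payload, whatever its source."""
    source: str              # who emitted the signal, e.g. an agent
    payload_schema: str      # names the schema the payload must obey
    payload: Dict[str, Any]


def validate(signal: Signal) -> List[str]:
    """Return a list of problems; an empty list means the signal is valid."""
    schema = PAYLOAD_SCHEMAS.get(signal.payload_schema)
    if schema is None:
        return [f"unknown schema: {signal.payload_schema}"]
    problems = []
    for name, expected in schema.items():
        if name not in signal.payload:
            problems.append(f"missing field: {name}")
        elif not isinstance(signal.payload[name], expected):
            problems.append(f"wrong type for field: {name}")
    return problems
```

The point of the split is that the envelope stays stable while new payload schemas can be added by any team, and every consumer can reject malformed data the same way.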

A larger restructure

Technically, we also used this larger push to restructure to change the way we operate services. Until this point, we were manually changing the production infrastructure using the AWS console. This was a slow and error-prone process. It mostly worked fine because we hardly changed anything at all (e.g. EC2 startup scripts). This would not scale well as the team grew, so we took this opportunity to introduce Terraform into our toolbox.

Now the full infrastructure is described using Terraform files and each team can manage their own infrastructure (limiting the blast radius in case of problems). Infrastructure code can be analyzed much more easily for flaws and shared patterns can be factored out. Each part of the Sqreen product has its own backend, mostly relying on their own datastores. Signal data is captured from a new backend called Ingestion, enriched and archived in S3 (so it can be queried by Athena). Signals are then routed to the part of the product that needs them and are further processed there. Each part of the product shares a public API with the external world so anybody can build upon them. The dashboard is one more client on these APIs. The platform also has the capability to push signals elsewhere, so Sqreen can be deeply integrated into our clients’ security processes.
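The flow described above (capture, enrich, archive, then route) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Ingestion` class here is a toy stand-in, the in-memory list replaces S3, and the `type` and `received_at` keys are hypothetical names, not the real signal fields.

```python
from collections import defaultdict


class Ingestion:
    """Toy stand-in for an ingestion backend: enrich, archive, route."""

    def __init__(self):
        self.archive = []                   # stand-in for the S3 archive
        self._routes = defaultdict(list)    # signal type -> handlers

    def subscribe(self, signal_type, handler):
        """Register a product backend interested in one type of signal."""
        self._routes[signal_type].append(handler)

    def ingest(self, signal, received_at):
        """Enrich and archive a signal, then hand it to each subscriber.

        Returns the number of handlers that received the signal.
        """
        enriched = {**signal, "received_at": received_at}
        self.archive.append(enriched)
        handlers = self._routes.get(enriched.get("type"), [])
        for handler in handlers:
            handler(enriched)
        return len(handlers)
```

Every signal is archived whether or not anything subscribes to it, which mirrors the idea that the archive stays queryable later even for data no product consumes today.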

On the agent side, this Security Signals vision also led us to make some changes. The same story holds true: the agent scope had become quite large. Finding agent owners was difficult given the agents’ complexity and scope. To address this, we restructured the agents to separate the responsibility between deep core tech, security, supporting more environments, and creating interfaces between them. As a result, we can have a team focusing more on ensuring top-notch compatibility with more application frameworks. Security rules can now be defined by application security experts and for all agents at the same time. Finally, the agent owners can now focus on the deep core technology of the agents instead of everything related to them. Also, the agent communication with the backend is now built upon a reusable security signal SDK. This SDK can be published independently so users can publish their own signals to the backend.

Episode VII: Building the future

Over the years, the Sqreen platform evolved quite a bit from 0 rpm to over 180k rpm these days. The story is far from finished. There is so much more to build for Sqreen as we aim to be the best solution for application security for all kinds of organizations, from small startups to Fortune 500 enterprises. If this series has been interesting to you, check out our careers page and join us!
