How we built a queryable Application Inventory

Sqreen is all about application security, and our focus has been on making security transparent and accessible for individual applications. The application was the central actor and everything revolved around it. Maybe you had a single application, or maybe you had a few, but you still reasoned about them separately.

As we grow, our clients are getting larger, and their infrastructures more complex. Instead of a single application, we have customers with hundreds of microservices, and some of the design choices that make Sqreen so accessible for individual applications don’t hold up for large deployments.

The original Sqreen view can become unwieldy for larger environments. When a vulnerability is discovered in a popular third-party dependency, how do you validate that none of the deployed services are affected? What if you, as a security owner, want to verify that all applications are still checking in with Sqreen? How do you know which services are generating the most security incidents, and should be prioritized?

These are easy questions to answer when you have a small handful of applications, but much less so if you have hundreds or thousands. It is both time consuming and error-prone to check every single application separately, assuming you even have a full list of them in the first place.

It is also mind-numbingly tedious, and runs counter to our mission of making security accessible. We want to enable our users to be efficient. And once they’ve done their job, we want to get out of the way.

This scaling-up challenge is nothing new. Managing larger infrastructures raises similar challenges with regard to asset discovery. It is a related topic, and you can read more about it here.

So what can we do to solve this? Glad you asked! Today, we’re announcing a new feature to address this issue, called App Inventory. In this post, I’ll provide a technical deep dive into the feature, and how we implemented the search capabilities. 

What is App Inventory?

App Inventory is a single source of truth that catalogues your applications and their components across your whole organization.

Here – at a glance – you can see all the applications in your organization, and the related information that is relevant to keeping your organization secure.

App Inventory builds upon what we started with the Sqreen Flow Map, which provides a visual representation of your application assets. The App Inventory view allows you to drill through the information that the Flow Map shows using a powerful query language. For example, you can find all applications with a specific dependency version, or all production apps that have Sqreen disabled. You can now extract very specific and granular information from data that was previously available only in unstructured, unsearchable form.

Security is not something you achieve with a single approach. It is a continuous “north star” aspiration, and you need to have a lot of tools available to tackle different situations. You need to be able to protect against common attacks (see OWASP Top 10), you need to be able to identify deviations from best practices in your app design, you need to be able to test your application in a safe environment so you can find errors proactively, and more. Achieving these aims requires a wide range of tools at your disposal.

App Inventory is a new tool in that toolbox. It lets you gain deep knowledge of what’s in your infrastructure, and react quickly to any new developments (for example, if a new vulnerability is disclosed, you can immediately see your exposure).

How do you implement a feature like this?

Getting at-your-fingertips visibility into your application assets is great, but it’s even better if it’s understandable. I wanted to take some time to share how we went about building and implementing this new feature. First, we had to choose the technology to power the actual search functionality. Then we had to design and implement the query language on top of the search infrastructure. Finally, we needed to take care of actually populating the data stores with relevant data. Let’s go over these topics one by one, and see how we tackled them.

Choosing a technology to power the search functionality

When we started out with App Inventory, we needed to settle on how we were going to power the search functionality that was at the core of the feature. The questions we tackled were:

  • Do we want to introduce new technologies to the stack?
  • If so, which technology should we introduce?
  • How can we be sure it will handle our data volume and use cases?

The main data store behind Sqreen is MongoDB. However, with the addition of App Inventory, we decided to introduce Elasticsearch into our infrastructure.

Why did we choose Elasticsearch?

There are actually two questions hiding under this heading. Why did we need a dedicated search engine rather than a general-purpose data store, and why Elasticsearch specifically?

For the first part, we could have gone down the route of using Mongo for everything. Mongo also features full-text search capabilities, and we already use Mongo, so the required data is already there, and we have experience with it. 

However, there are some downsides to going with Mongo here. We would need to put a lot more effort into maintaining Mongo than we do now. Our existing data model would not necessarily map well to the new search cases. We would have to define the required indexes manually and separately from defining the document structure (whereas Elasticsearch will combine these into a single step, and then provide an index on every field).
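To illustrate what "a single step" means here, the sketch below shows an Elasticsearch index body in which declaring the document structure is also what makes each field searchable. It is an illustration only, not Sqreen's actual mapping: the field names are borrowed from the query example later in this post, the field types are assumptions, and the typeless `mappings` layout assumes Elasticsearch 7 or later. The helper `indexed_fields` is hypothetical and simply lists the fields the mapping declares.

```python
import json

# Hypothetical index body. Field names mirror the query example in this post;
# the types chosen here are assumptions made for illustration.
APP_INDEX_BODY = {
    "mappings": {
        "properties": {
            "organization": {"type": "keyword"},
            "name": {"type": "text"},
            "frameworks": {
                "type": "nested",
                "properties": {
                    "name": {"type": "keyword"},
                    "version": {"type": "keyword"},
                },
            },
        }
    }
}


def indexed_fields(properties, prefix=""):
    """List every leaf field path declared in a mapping's properties."""
    paths = []
    for field, spec in properties.items():
        path = prefix + field
        if "properties" in spec:
            # Object or nested field: descend into its sub-fields.
            paths.extend(indexed_fields(spec["properties"], path + "."))
        else:
            paths.append(path)
    return paths


if __name__ == "__main__":
    print(json.dumps(APP_INDEX_BODY, indent=2))
    print(indexed_fields(APP_INDEX_BODY["mappings"]["properties"]))
```

There is no separate list of indexes to maintain next to the document structure, which is the contrast with the Mongo approach described above.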

Most importantly, it is a question about versatility and leaving room for improvement. Down the line, our use cases might become more advanced, to the point where Mongo’s capabilities break down, or require us to manually implement functionality that is available in a dedicated search solution by default.

For example, we might end up having to implement custom functionality, such as advanced stemming, or custom tokenizers, whereas a solution dedicated to search might have these features available by default.

Additionally, we wouldn’t want an issue with the App Inventory to affect the availability of the core systems. So we might want a separate solution anyway for redundancy’s sake.

Why Elasticsearch specifically though?

There aren’t a lot of search engines with a proven track record out there. Elasticsearch has one and, importantly, is available on AWS as a managed solution. Solr, a popular alternative that we might have considered, wasn’t.

As a small company, we do not want to dedicate a lot of time and resources to managing infrastructure, so going with a managed solution helps us rapidly iterate.

How can you validate the use case?

Before committing to building the solution, we wanted to validate that our use case would work. We didn’t want to build out a whole feature, only to realize that it was architecturally unsuited for real-world data volumes.

So we set up an Elasticsearch cluster in a separate environment, with the expected production settings, and ran large amounts of queries against it, to see how it behaves under different loads.

For example, here is a chart of successfully indexed new documents (per minute) over ~6 hours. With the settings we used, we could index 500k documents every minute, but after about 90 minutes the capacity dropped significantly.

This gives us a baseline of ~90k documents per minute under normal conditions, with the capability of handling spikes in traffic for a limited time.

This let us test:

  • the capabilities of different cluster configurations; 
  • whether our access patterns are CPU-bound or memory-bound;
  • the Elasticsearch API (e.g. you want to use “bulk indexing”, not index every document separately; however, what is the optimal size of each batch?);
  • the Python client for Elasticsearch. We ended up using Elasticsearch DSL for defining the models, but used the underlying Elasticsearch Python Client to actually perform many of the data retrieval queries;
  • the monitoring options we would have for this service. We rely on DataDog, Grafana and PagerDuty a lot (you can read more about how we tackle monitoring for high-availability here), and it is important to validate that these tools would work well with a new service. 
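The bulk indexing point above can be sketched as follows. This is a minimal illustration using only the standard library, not the code we ran: the index name `applications`, the `id` field, and the helper names `batched` and `bulk_body` are all hypothetical. What is not an assumption is the shape of the `_bulk` request body, which is newline-delimited JSON with an action line followed by the document source, ending in a newline.

```python
import json
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def bulk_body(index_name, docs):
    """Build a newline-delimited body for the Elasticsearch _bulk endpoint.

    Each document becomes two lines: an action line, then the document source.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    docs = [{"id": str(i), "name": f"app-{i}"} for i in range(5)]
    for batch in batched(docs, size=2):
        print(bulk_body("applications", batch))
```

With a helper like this, the batch size becomes a single parameter that can be varied during load testing to find the point where throughput stops improving.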

Implementing the query language

Having a search backend, populating it with the data and hooking it to a UI is pretty great. But the main value from the App Inventory comes from enabling users to easily find the data they care about. 

To accomplish this, we designed a custom query language, specifically tailored to the use cases that would be relevant to security owners.

For example, if all you have is some data in Elasticsearch, you might query it for applications with a specific vulnerability by assembling a query such as this:

{
  "bool": {
    "filter": [
      {
        "term": {
          "organization": "5d8c7cac42d3095cb0fc500e"
        }
      },
      {
        "nested": {
          "path": "frameworks",
          "query": {
            "bool": {
              "must": [
                { "match": { "frameworks.name": "rails" } },
                { "match": { "frameworks.version": "1.0.2" } }
              ]
            }
          }
        }
      }
    ]
  }
}

Notice that if you just have package information in Elasticsearch, you need to know which specific versions of your dependencies might be vulnerable.
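If you were assembling this kind of query from application code rather than by hand, it might look like the sketch below. The function name `framework_query` is hypothetical, and the sketch uses one `match` clause per field inside a nested `bool` query, since a single `match` clause in Elasticsearch targets one field. The caller still has to supply the exact vulnerable version, which is the limitation described above.

```python
def framework_query(organization_id, name, version):
    """Build an Elasticsearch query for apps using one framework version."""
    return {
        "bool": {
            "filter": [
                # Restrict results to a single organization.
                {"term": {"organization": organization_id}},
                {
                    # Both conditions must hold for the same nested framework entry.
                    "nested": {
                        "path": "frameworks",
                        "query": {
                            "bool": {
                                "must": [
                                    {"match": {"frameworks.name": name}},
                                    {"match": {"frameworks.version": version}},
                                ]
                            }
                        },
                    }
                },
            ]
        }
    }


if __name__ == "__main__":
    import json

    print(json.dumps(framework_query("5d8c7cac42d3095cb0fc500e", "rails", "1.0.2"), indent=2))
```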

Compare this to using a tailored query language with Sqreen (combined with the additional context-aware information that Sqreen provides). All you need to type could be as little as:

packages.filter(name=rails and is_outdated)

Implementing such a query language is a pretty fun topic, so we are preparing a more detailed write-up specifically about this part of the App Inventory. Stay tuned.
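In the meantime, here is a toy sketch of the general idea: turning a short expression into a more verbose Elasticsearch query. To be clear, this is not how the App Inventory query language is implemented. The grammar handled here (a single `collection.filter(...)` call whose conditions are joined by `and`), the function name `translate`, and the assumption that `is_outdated` exists as a boolean field in the index are all made up for illustration.

```python
import re

# Hypothetical grammar: <collection>.filter(<cond> and <cond> ...)
FILTER_RE = re.compile(r"^(\w+)\.filter\((.*)\)$")


def translate(expression):
    """Translate a toy filter expression into a nested Elasticsearch query."""
    match = FILTER_RE.match(expression.strip())
    if match is None:
        raise ValueError(f"unsupported expression: {expression!r}")
    collection, body = match.groups()

    clauses = []
    for condition in body.split(" and "):
        condition = condition.strip()
        if not condition:
            raise ValueError("empty condition")
        if "=" in condition:
            # field=value becomes an exact-match clause.
            field, value = (part.strip() for part in condition.split("=", 1))
        else:
            # A bare name is treated as a boolean flag that must be true.
            field, value = condition, True
        clauses.append({"term": {f"{collection}.{field}": value}})

    return {"nested": {"path": collection, "query": {"bool": {"filter": clauses}}}}


if __name__ == "__main__":
    print(translate("packages.filter(name=rails and is_outdated)"))
```

A real implementation would need a proper tokenizer and parser, error reporting, and support for operators beyond `and`. The sketch only shows why a short expression can stand in for a much longer query body.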

Populating the data

We need to populate the data in Elasticsearch, and keep it from diverging from the ground-truth data in our other datastores.

Internally, we distinguish between:

  • application attributes that don’t change often (e.g. application name, or owner);
  • process attributes. This is one document per running process of the application, and contains information like runtime version and agent;
  • package attributes. Each instance reports a list of dependencies, but most instances usually share the same dependencies, so we can deduplicate them and reduce the amount of processing to be done here;
  • additional related information in separate indexes, such as Security Incidents, Weaknesses, etc.
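The deduplication of package lists mentioned above could be approached by fingerprinting each list, as in this sketch. It is one possible technique under stated assumptions, not a description of our pipeline: the record layout (`packages` as a list of `name`/`version` dictionaries) and the function names are hypothetical.

```python
import hashlib
import json


def package_fingerprint(packages):
    """Return a stable hash of a dependency list, independent of ordering."""
    canonical = sorted((p["name"], p["version"]) for p in packages)
    payload = json.dumps(canonical).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def unique_package_sets(instances):
    """Map each distinct fingerprint to one package list that needs processing."""
    seen = {}
    for instance in instances:
        fingerprint = package_fingerprint(instance["packages"])
        # Keep only the first list seen for each fingerprint.
        seen.setdefault(fingerprint, instance["packages"])
    return seen


if __name__ == "__main__":
    rails = {"name": "rails", "version": "1.0.2"}
    rack = {"name": "rack", "version": "2.0.0"}
    instances = [
        {"packages": [rails, rack]},
        {"packages": [rack, rails]},  # same set, different order
        {"packages": [rails]},
    ]
    print(len(unique_package_sets(instances)))
```

Instances that report identical dependencies collapse to a single entry, so the expensive work only runs once per distinct set.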

The application attributes are processed synchronously, as they change infrequently. The Mongo document has a post-save hook, which compares the changed fields with the fields exposed to the App Inventory. If the fields overlap, we will reindex the application in Elasticsearch immediately.
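The post-save check described above boils down to a set intersection. The sketch below shows the logic in isolation, without any Mongo or Elasticsearch dependency. The exposed field names come from the examples in this post, while `INVENTORY_FIELDS`, `needs_reindex`, `post_save`, and the `reindex` callback are hypothetical names chosen for illustration.

```python
# Hypothetical set of application fields exposed to the App Inventory.
INVENTORY_FIELDS = frozenset({"name", "owner"})


def needs_reindex(changed_fields, exposed_fields=INVENTORY_FIELDS):
    """Return True if any changed field is exposed to the App Inventory."""
    return not exposed_fields.isdisjoint(changed_fields)


def post_save(document, changed_fields, reindex):
    """Post-save hook: reindex synchronously only when a relevant field changed."""
    if needs_reindex(changed_fields):
        reindex(document)
        return True
    return False


if __name__ == "__main__":
    calls = []
    app = {"id": "app-1", "name": "billing", "owner": "team-a"}
    print(post_save(app, {"name"}, calls.append))         # relevant change
    print(post_save(app, {"last_seen_at"}, calls.append))  # irrelevant change
    print(len(calls))
```

Skipping the reindex when no exposed field changed keeps the synchronous path cheap, which matters because it runs inside the save operation.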

Every other type of data is reindexed asynchronously. We make heavy use of Kinesis, and most data gets processed in a Kinesis worker. 

This way, the applications will appear in the App Inventory immediately, and everything else might have a short delay, but will not needlessly tie up the REST API, which is used by our clients (and which needs to be as performant as possible).
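The split between the request path and the asynchronous worker can be sketched with an in-memory queue standing in for the Kinesis stream. This is a simplified model and not our worker code: `queue.Queue` replaces Kinesis, and `enqueue_update`, `drain`, and the `index_batch` callback are hypothetical names.

```python
import queue


def enqueue_update(stream, kind, payload):
    """Request path: record the update and return without indexing anything."""
    stream.put({"kind": kind, "payload": payload})


def drain(stream, index_batch, batch_size=100):
    """Worker side: pull queued records and hand them to the indexer in batches."""
    processed = 0
    batch = []
    while True:
        try:
            batch.append(stream.get_nowait())
        except queue.Empty:
            break
        if len(batch) >= batch_size:
            index_batch(batch)
            processed += len(batch)
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        index_batch(batch)
        processed += len(batch)
    return processed


if __name__ == "__main__":
    stream = queue.Queue()
    for i in range(5):
        enqueue_update(stream, "process", {"pid": i})
    batches = []
    print(drain(stream, batches.append, batch_size=2))
    print([len(b) for b in batches])
```

The request handler only pays for the enqueue, while the cost of indexing is absorbed by the worker, where records can also be grouped into bulk requests.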

In summary

App Inventory is an exciting new feature that will give security teams queryable, real-time visibility into their application components. There’s a lot that goes into getting this feature up and running, but in the end, it should enable security teams to efficiently monitor their infrastructure, and be as responsive as possible to new developments. 

The magic of the App Inventory really comes from the data seamlessly getting indexed into Elasticsearch (without bothering your developer or ops teams) AND having a tailor-built query language available to query it. This would be a pretty big initiative to build on your own. Because Sqreen is a platform provider, it made sense for us to invest in making this a reality.

Even if you had all the data available in Elasticsearch (or a similar store), an infrastructure with a hundred applications, ten thousand instances, and a million packages can be quite unwieldy to query using Kibana. The App Inventory, however, makes it as easy as app.name=foo and packages.name=bar.

We hope it’s going to be a useful new tool in your toolbox, and enable you to be even more efficient. You can see what it looks like in your Sqreen dashboard if you’re an existing user, or sign up for a free trial to check it out yourself. 
