To most people, what you learn during a CS degree doesn’t help much in your day to day job. That being said, you sometimes run in a problem which could have been an example used by a CS professor when introducing some core notions. This is one such instance, in regard to data structures.
Some time ago, one of our customers came under a fairly intense attack. Although we detected and blacklisted the attackers as expected, the performance of the application tanked. During the post-mortem, we realized that during the attack we had blocked around ten thousand IPs. The IP blocking mechanism may be at fault. Specifically, the one embedded in Sqreen’s agents, a software component installed on our customer’s servers.
How Sqreen used to block IPs
Whenever a threat is detected and blocked, our agents insert the blocked IP address in a list. This list was then sequentially checked every time a request was evaluated by an agent, thus increasing the latency of our customer’s application. This was a O(n) operation which became problematic once we hit around five thousands blocked IP addresses, much earlier than we expected.
How did we improve Sqreen?
Computer science is very often a story of compromise. In our case, we had to make lookups super fast (they are in our hot-path) but we were willing to accept a (reasonable) performance penalty when inserting new addresses. The best candidates are binary trees, and specifically, a structure common in network routers called a Radix Tree.
A tree? 🌳
The Radix tree is a data structure based on what is called in computer science a binary tree.
Basically, take a node (
8). This node will be the root of your tree. This node may be linked to up to two other nodes (
10), called “children”. Each of those nodes may have up to two children, etc. All the network, starting from the root is the tree. It is called binary because nodes can only have up to two children.
Binary Search Trees
In order to make our decision clearer, I need to cover a type of binary trees called Binary Search Trees, and their self-balancing variant. The tree above is a binary search tree. Its peculiarity is in the way the nodes are laid out. Specifically, you can find any node by following a very simple algorithm:
- Starting from the root node, at the top of the tree. Check if the value you’re looking for is the one of your current node;
- If it is not, check if your target value is lower than the one in your node. In the case it is, your next node is the left child. Otherwise, your next node is the rightmost child;
- If no such child exists, the node doesn’t exist in the tree. If it does, loop back to the first step with your new node.
The self-balancing part comes in when inserting new nodes. The problem of naively inserting new nodes in a binary tree is that your tree can get very deep. Try inserting
12 in the tree above: follow the lookup logic and insert the new node wherever you expect to find the child. You’ll have to put it under
13, making the tree really deep (and as our access time depends of the depth of the tree, it makes looking up a value slower ☹️).
Self-balanced trees will take a slightly different approach. Once we establish that we want to insert a left child to
13, which is already a left child and have no right child, we can move
13 up and put
14 below, as its rightmost child.
Okay, now, what’s the radix part?
A Radix Tree is designed as a simpler and more space efficient structure, when compared to self-balanced binary search trees.
Radix trees embed their data in the sequence of edges leading to the node. That means that only the leaves (the nodes at the bottom) really represent a coded value. That being said, the coded value isn’t the value of the leaf (
7) but actually the path leading to it: the node
4 actually code the value
8 3 6 4, as you can see in the animation below. This makes the trees very space-efficient as leaves don’t have to store their prefixes. For instance,
8 3 6 is common to the values coded by both
7, but only has to be written once in the tree.
Radix trees also leverage something called “path compression”: in our tree,
10 has no reason to be an independent node: we can save on the node overhead by merging it with
13 and creating a single
10 13 node. This is because only the leaves (
14) specify a value and everything else code the prefix of those leaves.
Okay, so a radix tree is an optimized binary tree. Is it really better though?
Let’s go back to our problem: we have a bunch of IP addresses and we want to check whether a given address (the one making the request) is in our list. This list can either be a blacklist or a whitelist, as Sqreen supports both.
Despite being represented as a sequence of decimal (
127.0.0.1, IPv4) or hexadecimal (
e3e7:682:c209:4cac:629f:6fbe:d82c:7cd, IPv6) numbers, IP addresses are internally a single number (encoded using 32 bits,
2130706433 for IPv4, or 128 bits,
302934307671667531413257853548643485645 for IPv6).
This means that it can easily be represented in binary form. Being able to represent every IPv4 addresses using a neat sequence of 32 bits make building the radix tree fairly simple. Each node contains one or multiple bits and whichever children should be chosen next depend on the next bit in the address we’re looking for.
But what about mixed IPv4 and IPv6?!
Nice catch. Our data structure contains two trees, one for each type of address. Those trees are functionally identical but let us ensure each tree’s data have a fixed length (32 bits for the IPv4 tree, 128 bits for IPv6). This lets us optimize the lookup code.
We, however, have to be a bit careful when picking a tree as there are means of embedding IPv4 addresses in the IPv6 address space.
So, how does the new approach work ?
Whenever a new address comes in and we need to check the blacklist, we take the appropriate tree and start exploring it. We go through the path matching our prefix, looking for a node at which we diverge. A node diverges when the prefix it represents differs from the one of the address we’re comparing. For instance, in the example below, if we were checking for the address
8 10 11 12, we would match
8, but not
10 13 This happens when of all the IPs we added, none share the same prefix as ours (the only values in this tree starting with the
8 10 prefix are
8 10 13 12 and
8 10 13 14). The same would happen in the tree didn’t use path compression.
If we’re trying to insert a new node, we need to find the diverging node. Then, we split it at the point we differed so that the node is cut between a parent node and a child. The parent node matches our prefix while the child contains all the old children of the original node. We then create a new node representing the address we’re adding and insert it to the now free slot of the parent node.
If no diverging node is found, that means there is an exact match. The value already exists in the tree and we can thus ignore the new IP.
Sounds like computer science. What’s the point for the product?
Switching to this data structure made lookups up to 5 000× faster when 100 000 addresses were blacklisted. Specifically, the lookup time shrank from 5 ms to 0.000942 ms (5300 times less).
Moreover, due to the structure’s intrinsic time-complexity, the lookup time will also grow much slower (
log(N) operations vs
N/2 on average where N is the number of blocked IPs) than in the old structure, making this performance increase even larger as the number of blocked addresses grow.
Finally, because the radix tree ignores duplicates, the maximum value of N for a radix tree is 232 for IPv4 and 2128 for IPv6. Those are astronomical values when looking at N or N/2, but log(N) reduces them to… 32 and 128 😄. In the worst possible case, it’ll take 128 iterations to find a match in a radix tree, compared to 3… followed by 38 zeros, in the old system.
On the downside, this made insertion time substantially worse (between 2 and 5 times, although this penalty doesn’t worsen significantly when blocking more addresses). The memory required to store the data structure also increased slightly (depending on the runtime, between a few percents and doubling the space required). However, because slower insertions don’t make our clients’ applications less responsive, switching to a Radix tree-based IP blocking is quite a net positive.
Why didn’t you use X?
We considered a fairly wide set of options but Radix trees ended up being the most appealing. Here is our reasoning for some of the most common suggestions.
Hash tables were not well suited to us, because networks ranges would force us to do 32 lookups (for IPv4, but 128 for IPv6!) and were more expensive memory-wise. Moreover, the most common implementations don’t give strong guaranties of constant access time (especially if you’re unlucky and a lot of IPs ended up in the same bucket).
Bloom filters were a good contenders but would have still required us to lookup the list on a match, opening a door for a DoS attack by attackers through CPU exhaustion. It was a cool mitigation we might consider again in the future but doesn’t really rid us of the problem.
Storing the IPs in an in-memory database was not seriously considered mostly because we thought the overhead would be significant, especially memory wise. They would encounter the same problem as hash tables (up to 128 lookups for a negative match) We also wanted to avoid bloating Sqreen with too many dependencies.
We had a few more esoteric suggestions we didn’t consider because of maintenance cost, as we would have to write and maintain them for each language we support. Moreover, they usually wouldn’t be significantly faster than a Radix tree making them thus even less appealing to us.
Making the Internet more secure is no easy task. Although our advantageous position within our clients’ applications helps, it makes any inefficiencies in our software a liability to theirs. The example I detailed at the beginning of this post is a perfect illustration: the application was already under attack and Sqreen‘s protection started to slow it down.
This responsibility is why we take any shortcoming extremely seriously and make sure to solve the issues once and for all.