Building a dynamic instrumentation agent for Java

Sqreen’s Application Security Management platform relies on microagents to leverage the runtime context of applications for security. Our drive when building these agents is to make our protection transparent and as frictionless as possible. The Sqreen agent applies dynamic instrumentation in order to report and protect the application without code modification. We have agents in many languages (and have shared what goes into building them in different languages). In this post, we’ll focus on how we built our Java agent. Building a Java agent is no easy task, so we wanted to share some of what goes into Java instrumentation and the different areas to consider.

Sqreen protects Java applications by inserting additional control logic dynamically at runtime with the following properties:

  • Dynamic: instrumentation is not statically defined in advance, but is instead defined with a set of rules that can be updated at runtime
  • No code modification: you don’t have to modify your program and redeploy it, just plug in the Sqreen agent and deploy
  • Performant: minimal impact on performance

Instrumenting Java code

Java programs are compiled into Java bytecode (stored in .class files), which is then executed by the Java Virtual Machine (JVM). Thus, any program running on the JVM can be instrumented by modifying its bytecode at runtime. The JVM provides a standard way to do this in the form of a Java agent, a specially packaged program that is either provided as a command line parameter (-javaagent) or attached at runtime.
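To make the -javaagent entry point concrete, here is a minimal sketch built only on the standard java.lang.instrument API. The class names SketchAgent and RecordingTransformer are hypothetical, and the transformer only records class names without changing any bytecode.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical agent class. The JVM calls premain() before the application's
// main() when started with -javaagent:agent.jar, provided the jar manifest
// contains "Premain-Class: SketchAgent".
public class SketchAgent {

    public static void premain(String agentArgs, Instrumentation instrumentation) {
        // Every class loaded from now on is first handed to the transformer.
        instrumentation.addTransformer(new RecordingTransformer());
    }

    // Illustrative transformer: observes classes but leaves them untouched.
    static class RecordingTransformer implements ClassFileTransformer {
        final List<String> seen = new CopyOnWriteArrayList<>();

        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            if (className != null) {
                seen.add(className); // internal form, e.g. "java/lang/String"
            }
            // Returning null tells the JVM that the class bytes are unchanged.
            return null;
        }
    }
}
```

A real agent would return modified class bytes from transform() for the classes it wants to instrument.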

Here is a big picture overview of what running a Java program with an agent means.

Basically, agents are able to modify the Java bytecode of classes that are loaded within the JVM at runtime. Classloaders are in charge of loading classes from binary form into memory, so an agent can be seen as a way to intercept classloader behavior at runtime.

Java bytecode provides the same features as the full Java language, so for Java programs the structure of the bytecode is really close to the original Java program source code. Hence, while we don’t instrument the Java program itself, we use a very close representation of it (and this is the one that is actually executed). Also, this is what makes it possible to decompile Java bytecode into mostly equivalent source code.

One thing to note is that there are non-Java languages that compile into Java bytecode (such as Scala, Clojure, and Kotlin), which means that the structure and shape of the bytecode for programs can be very different. This makes instrumentation language-dependent and brings another set of challenges. For simplicity, we will only focus on Java here.

The Java Development Kit (JDK) provides a tool called javap that allows you to decode and display bytecode instructions (opcodes) from binary to a more software engineer-friendly form.

For example, take this very simple Java program (Hello.java):
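A minimal sketch of such a program, assuming the classic hello world with a single main method (the exact message is an assumption):

```java
public class Hello {
    public static void main(String[] args) {
        // A single statement: javac compiles it to a handful of instructions.
        System.out.println("Hello, world!");
    }
}
```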

Once compiled to Hello.class, we can use the javap -v Hello.class command to provide this verbose output decoded from the binary file:

The structure is well described in the Java Class Format specification if you want to dig into the details. The most notable things to see here are:

  • There is a version (here 55, which means Java 11), and the set of instructions is not exactly the same for all versions
  • We have two methods; one of them is a generated constructor that is implicit in the source code and has been added by the javac compiler.
  • The body of the main method has been compiled into a sequence of four instructions.
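The version field mentioned above sits at the very start of every class file, right after the 0xCAFEBABE magic number. As a small sketch of what decoding the binary format involves, the hypothetical ClassFileHeader helper below reads just those first eight bytes using only the standard library:

```java
import java.nio.ByteBuffer;

// Hypothetical helper that decodes only the fixed-size class file header.
class ClassFileHeader {
    final int minor;
    final int major;

    private ClassFileHeader(int minor, int major) {
        this.minor = minor;
        this.major = major;
    }

    static ClassFileHeader parse(byte[] classBytes) {
        // Class files are big-endian, which is ByteBuffer's default byte order.
        ByteBuffer in = ByteBuffer.wrap(classBytes);
        if (in.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file: bad magic number");
        }
        int minor = in.getShort() & 0xFFFF; // unsigned 16-bit value
        int major = in.getShort() & 0xFFFF;
        return new ClassFileHeader(minor, major);
    }

    // Major version 52 is Java 8 and 55 is Java 11: the offset is 44.
    int javaRelease() {
        return major - 44;
    }
}
```

Everything after the header (constant pool, fields, methods, attributes) is variable-length, which is where the real decoding work starts.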

Unfortunately, the API provided by the JVM for agents does not provide any high-level abstraction of the bytecode: agents have to handle bytecode in binary form.

In other words, agents have to be able to:

  1. Decode binary class format to a sequence of descriptors and instructions
  2. Modify this sequence, detecting where to modify and perform the modifications
  3. Keep the structure consistent and compatible with the JVM bytecode verifier, which, for example, rejects code that would overflow or underflow the operand stack
  4. Write the modified class back to binary form before it’s loaded in memory

Completing all those steps is definitely not a trivial task, and two libraries help to make it more manageable:

  • ASM allows you to decode/encode bytecode instructions and provides bytecode-level modification primitives. Groovy and Kotlin use it in their compilers, and we could easily rewrite javap with it.
  • ByteBuddy allows you to write almost regular Java code and inject it into the modified classes without dealing with the subtleties of bytecode-level details like Java version compatibility. This is currently the go-to library for modern Java agents and is widely used by APM agents.

In order to modify a method behavior, we provide what is called an advice class that allows us to define how we want to alter the original method behavior.

Here is an example of a ByteBuddy advice class that will make any instrumented method print the list of its arguments before and after the method body execution. In that case, the generated bytecode will be inlined into the original method body, just as if the statements were written in the original source code.

ByteBuddy provides a way to define which classes and methods we want to apply this advice to. One of the criteria is class inheritance, which allows us to instrument implementation classes of a known interface, such as JDBC database drivers that implement the JDBC API.

Because the Sqreen agent is dynamic, the list of classes and methods we need to instrument is not statically defined in the agent. Also, the advice used to instrument has to be generic and thus be the same for all instrumented methods. Our advice delegates method execution to a Dispatcher class that is responsible for delegating method execution at different stages of the method execution process:

  • pre: before the original method body
  • post: after the original method body has executed
  • fail: after the original method body threw an exception

For each method that we want to instrument, we define a hook-point that defines the target class, method, and method signature to instrument and also the callbacks that are executed on each of the stages listed above.
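The relationship between hook-points, stages, and callbacks can be sketched in plain Java. This is not Sqreen’s actual Dispatcher: StageCallback, StageDispatcher, and the key format are hypothetical names chosen for illustration, and a Supplier stands in for the original method body that the real advice wraps at the bytecode level.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical callback contract: one optional method per stage.
interface StageCallback {
    default void pre(Object[] args) {}
    default void post(Object[] args, Object result) {}
    default void fail(Object[] args, Throwable error) {}
}

// Hypothetical dispatcher: maps a hook-point (class, method, signature) to its callback.
class StageDispatcher {
    private final Map<String, StageCallback> hookPoints = new ConcurrentHashMap<>();

    static String key(String className, String method, String signature) {
        return className + "#" + method + signature;
    }

    void register(String className, String method, String signature, StageCallback callback) {
        hookPoints.put(key(className, method, signature), callback);
    }

    // 'body' plays the role of the original, uninstrumented method body.
    <T> T invoke(String hookKey, Object[] args, Supplier<T> body) {
        StageCallback callback = hookPoints.get(hookKey);
        if (callback == null) {
            return body.get(); // no hook-point registered: run the method as-is
        }
        callback.pre(args);
        T result;
        try {
            result = body.get();
        } catch (RuntimeException | Error e) {
            callback.fail(args, e);
            throw e; // the application still sees its own exception
        }
        callback.post(args, result);
        return result;
    }
}
```

Because the hook-points live in a map rather than in the advice itself, the same generic advice can serve every instrumented method, and the set of callbacks can change at runtime.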

JavaScript callbacks and execution sandbox

Most (if not all) Java agents are static, in the sense that the selection of methods to instrument and the transformation rules that are applied to them are stored within the agent itself.

While this definitely works, it has one important caveat: any new feature or bugfix in the agent will require an agent update. In other words, it requires:

  • Updating a dependency of the application
  • Likely a JVM restart (unless the agent is able to attach/detach at runtime)
  • QA for proper testing and qualification before being pushed to production

Even with the most mature and smooth deployment pipeline, this will always incur a significant delay, which would prevent us from fixing security issues quickly in a hassle-free way.

In order to solve this, the Sqreen agent is mostly composed of two parts:

  • Agent instrumentation: how to instrument methods in a generic way and delegate to security rules. This is the part that is within the agent itself
  • JavaScript rules: implementation of all security algorithms in JavaScript. Those rules are defined and can be reloaded at runtime. They also execute within a sandbox for extra safety and reliability

In the case of Java, the JavaScript execution sandbox is implemented using either the Rhino or the Nashorn JavaScript engine provided by the JVM itself. Other platforms delegate this to the V8 JavaScript engine.

Recovering from callback errors

While we are strongly committed to code quality, it is always possible that either the Dispatcher or the callbacks misbehave and throw exceptions. Because the agent executes within the application, there are always corner cases or unexpected states at runtime that were impossible to guess beforehand.

In order to prevent those issues from having side effects beyond the walls of our agent, we:

  • Execute all of the callbacks in a try-catch block
  • Send all the exceptions to our backend to provide a feedback loop and proactive monitoring
  • Selectively disable rules that produce too many exceptions
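The three safeguards above can be sketched as a small wrapper. GuardedRules, the error threshold, and the reporter hook are all assumptions made for illustration; the source does not say how many exceptions count as "too many" or how they are reported.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical guard: runs rule callbacks so that failures never reach the application.
class GuardedRules {
    private final int maxErrors;                 // assumed threshold before a rule is disabled
    private final Consumer<Throwable> reporter;  // stands in for "send to the backend"
    private final Map<String, AtomicInteger> errorCounts = new ConcurrentHashMap<>();

    GuardedRules(int maxErrors, Consumer<Throwable> reporter) {
        this.maxErrors = maxErrors;
        this.reporter = reporter;
    }

    void run(String ruleName, Runnable callback) {
        if (isDisabled(ruleName)) {
            return; // rule failed too often: skip it entirely
        }
        try {
            callback.run();
        } catch (Throwable t) {
            errorCounts.computeIfAbsent(ruleName, k -> new AtomicInteger()).incrementAndGet();
            try {
                reporter.accept(t);
            } catch (Throwable ignored) {
                // Reporting must never break the application either.
            }
        }
    }

    boolean isDisabled(String ruleName) {
        AtomicInteger count = errorCounts.get(ruleName);
        return count != null && count.get() >= maxErrors;
    }
}
```

Counting errors per rule keeps one misbehaving rule from disabling the others.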

Also, in extreme cases, it is always possible to remotely disable the agent from the dashboard settings.

De-instrumenting

De-instrumentation is required when the agent is being disabled, or more frequently when rules are being updated.

Out of the box, de-instrumentation can be a tricky problem to solve at the bytecode level, as we need to revert all the changes done at instrumentation time and thus keep track of them. Fortunately, one of the nice features of ByteBuddy is that it provides ResettableClassFileTransformer (extending the standard ClassFileTransformer interface), which is well suited for this purpose.

However, redefining classes has some side effects, like slightly increasing memory usage in the area dedicated to loaded classes (Metaspace or PermGen depending on JVM version and configured Garbage Collector algorithm). While the increment is usually small, it can become quite significant as this area is often limited to a few dozen megabytes by default and re-instrumentation is done frequently.

Thus, de-instrumentation is currently done at runtime by swapping in a Dispatcher whose implementation is a no-op, which leaves the instrumented bytecode in place and avoids redefining classes.
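A minimal sketch of that idea, assuming the instrumented code always goes through a single static entry point. HookDispatcher and DispatcherSwitch are hypothetical names, not the agent’s real classes.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical dispatcher contract, reduced to the "pre" stage for brevity.
interface HookDispatcher {
    void pre(String hookId, Object[] args);
}

// The instrumented bytecode keeps calling onMethodEnter(); only the target changes.
class DispatcherSwitch {
    static final HookDispatcher NO_OP = (hookId, args) -> { };

    private static final AtomicReference<HookDispatcher> current =
            new AtomicReference<>(NO_OP);

    static void enable(HookDispatcher dispatcher) {
        current.set(dispatcher);
    }

    // "De-instrument" without touching any class: route calls to the no-op.
    static void disable() {
        current.set(NO_OP);
    }

    static void onMethodEnter(String hookId, Object[] args) {
        current.get().pre(hookId, args);
    }
}
```

The trade-off is that the instrumented methods still pay for one indirect call, in exchange for not growing the class metadata area on every rule update.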

Going further with Java agents

In this article, we only scratched the surface of what could be done with a Java agent thanks to bytecode instrumentation. In practice, there are many other unexpected challenges that arise when modifying applications at runtime.

Performance

Modifying an application is never without side effects, especially on the performance side.

User expectations on performance are quite difficult to assess and acceptable levels vary from application to application. For example, a user-facing application might easily tolerate 2ms extra overhead per request, whereas on a high-frequency trading platform, it would be the major bottleneck.

Two things help in this area:

  • Observability, which provides actual metrics from agent usage in production. This is first of all a requirement for good customer support, but it can also provide transparency to end users.
  • Extensive tests, both at high level (for example, request response time), and at a low level (for example, with micro-benchmarks of the code instrumentation logic).

Both observability and extensive tests require significant investments, and adding observability might also incur some additional overhead.

Also, there are many dimensions where an agent could have an impact beyond just “response time”, which makes this definitely a very interesting topic:

  • CPU usage, extra computation done by the agent, either in-request or in a separate thread (thus contributing to the global host load)
  • Extra I/O and disk usage, for example if the agent collects metrics to disk and/or writes to a log file
  • Memory usage, allocation rate, and indirect impacts on Garbage Collector behavior. For example, holding references to objects for longer than usual might keep some objects alive and prevent them from being garbage-collected earlier, hence making GC less efficient.
  • Just-In-Time compilation behavior changed due to changing the classes and code layout. For example, making long methods slightly bigger could prevent them from being compiled to native, hence causing a noticeable performance impact

Privacy

Most applications handle sensitive data: user credentials, social security numbers, or even credit card numbers. Any agent will potentially access and collect those, which is definitely not desirable.

Alas, it is impossible for the agent to guess what is sensitive or not, and there is definitely no silver bullet strategy here. What can be done however, is to have a set of simple heuristics that make it easy to cover most use-cases, and that’s exactly what Sqreen agents currently do.

  • Scrub fields and parameters by name, for example all those named secret, token or even password should be scrubbed by default.
  • Scrub values that match a known pattern. For example if it looks like a Social Security Number or a credit card number, then it’s very likely to be one.
  • Provide users with a flexible way to add their own fields and values.

Also, all those parameters need to be configurable at agent level, which makes it obvious to the end-user that data is properly scrubbed before leaving the application.
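As an illustration of these heuristics, here is a minimal sketch of a configurable scrubber. The Scrubber class, the redaction marker, and the two value patterns (a US Social Security Number shape and a 16-digit card number shape) are assumptions for this example, not the rules Sqreen ships.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical scrubber: redacts by field name and by value shape.
class Scrubber {
    static final String REDACTED = "<redacted>";

    private final Set<String> sensitiveNames;
    private final Pattern sensitiveValue;

    // Both criteria are constructor parameters so that users can supply their own.
    Scrubber(Set<String> sensitiveNames, Pattern sensitiveValue) {
        this.sensitiveNames = sensitiveNames;
        this.sensitiveValue = sensitiveValue;
    }

    static Scrubber withDefaults() {
        Set<String> names = new HashSet<>(Arrays.asList("secret", "token", "password"));
        // Illustrative shapes only: 123-45-6789, or 16 digits in groups of four.
        Pattern values = Pattern.compile("\\d{3}-\\d{2}-\\d{4}|\\d{4}([ -]?\\d{4}){3}");
        return new Scrubber(names, values);
    }

    Map<String, String> scrub(Map<String, String> params) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            String name = entry.getKey();
            String value = entry.getValue();
            boolean hide = sensitiveNames.contains(name.toLowerCase(Locale.ROOT))
                    || (value != null && sensitiveValue.matcher(value).matches());
            out.put(name, hide ? REDACTED : value);
        }
        return out;
    }
}
```

Scrubbing returns a copy instead of mutating the input, so the application’s own data is never altered.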

On top of that, we do our best to remove sensitive values from the data that is sent to our backend. For instance, we remove strings and integers from SQL queries or MongoDB commands.
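For SQL, a rough sketch of this literal removal could rely on regular expressions. This is an assumption about one possible approach, not the agent’s implementation: QueryScrubber is a hypothetical name, and a real implementation would more likely tokenize the query per SQL dialect.

```java
import java.util.regex.Pattern;

// Hypothetical helper: replaces literals in a SQL query with placeholders.
class QueryScrubber {
    // Single-quoted strings, where '' is an escaped quote inside the literal.
    private static final Pattern STRING_LITERAL = Pattern.compile("'(?:[^']|'')*'");
    // Standalone integers; digits inside identifiers such as "users2" are kept.
    private static final Pattern NUMBER_LITERAL = Pattern.compile("\\b\\d+\\b");

    static String scrubSql(String sql) {
        // Strings go first so that digits inside them are removed with the string.
        String withoutStrings = STRING_LITERAL.matcher(sql).replaceAll("?");
        return NUMBER_LITERAL.matcher(withoutStrings).replaceAll("?");
    }
}
```

The query shape stays readable for analysis while the user-supplied values never leave the application.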

Compatibility

In theory, installing an agent on a JVM is really straightforward, as you just need to either add an extra JVM startup parameter or attach to a running JVM. In practice, things are way more subtle.

The first issue with compatibility is ensuring that the agent is able to properly instrument and execute on multiple JVM versions, which often means writing the agent in a rather old Java version like 6 or 7, or having internal components that are enabled at runtime depending on which APIs are available. Ensuring binary compatibility with multiple versions of a single library is a similar problem, and it makes the test matrix needed to cover every version grow very fast.

Then, unless you write everything from scratch in the agent, it will likely require you to embed some dependencies, like a logging framework or a popular library like Guava. Those dependencies need to be relocated (moved to a separate package) when the agent is packaged, otherwise they will conflict with the application dependencies that might use them, very probably in a different version.

Also, starting in Java 9, the JDK became more modular, which means most of the things that were available by default (like access to the JDBC API) might not be available anymore unless explicitly used by the application. Other parts like the default HTTP client implementation can behave differently on some application servers (I’m looking at you, WebLogic 10.3), thus it is better not to trust any external dependencies and rely on your own.

Without being exhaustive, here is a short list of all the other fun things you’ll have to cover:

  • OSGi and custom classloaders
  • Java2 security
  • Custom keystores and SSL/TLS certificates for backend communication

In short, a Java agent is definitely the kind of project that looks simple at first sight, and ends up being more than a full-time job. There are a lot of things to consider when you look at Java instrumentation.

With great power comes great responsibility

By their nature, agents have a very privileged access to application internals and they literally run within the host application. Thus, agents are very exposed to triggering side effects on the application like performance overhead, memory leaks, or even runtime errors.

While you can’t predict all the potential issues, you can definitely make a major difference in the way you respond to them. Because agent users grant you their trust, they deserve more than just support tickets. While the topic of customer support is definitely beyond the scope of this article, it is something that should not be taken lightly when running agents in production.

Conclusion

In this blogpost, we’ve covered the high level aspects of writing a Java agent.

We started with understanding what Java bytecode is, then how to properly read and modify it using ASM and ByteBuddy libraries. Also, we introduced some very challenging issues like compatibility and performance that should not be taken lightly.

While the idea of modifying programs at runtime without knowing exactly how they work seems to be very risky and complex, in practice it is not. Runtime instrumentation allows us to do some magic and modify running programs without having the original source code, which opens up a lot of opportunities. For example, it makes extra features like monitoring (APMs) and security (Sqreen) possible, or even the option of updating or fixing programs at runtime. It can also be an opportunity to understand and learn how things work under the hood.

About the author

Sylvain has more than a decade of experience working on Java and the JVM. He has been working on Java agents for several years and loves to understand how things work under the hood with a broad ecosystem knowledge, from vintage application servers with EJB/RMI to the trendy frameworks.
