Inspecting Runtimes

Replay is designed for recording and replaying interpreted language runtimes. In previous posts we’ve talked about how Replay’s recorder works and the ways in which recording is specialized to work well on runtimes. This lets us replay what the runtime is doing, but isn’t enough to allow actually inspecting the runtime’s state for debugging or other analysis. In this post we’ll discuss the architecture and techniques Replay uses for inspecting runtimes, illustrating what is involved in adapting a new runtime to support Replay and using a new client to inspect Replay recordings.
There are two interfaces in play when creating or inspecting Replay recordings:
  • The Recorder API is exposed by the dynamically linked library used for recording. When recording, the runtime mainly uses this API to cooperate with the recorder so that replaying will succeed. When replaying, the runtime will be linked to a different replayer component in Replay’s backend which is used to replay the program. The replayer uses the same API to collect information about the runtime’s behavior as it executes, and its state when paused at various points.
  • The Record Replay Protocol is used by clients to interact with Replay’s backend to create recordings, load them for inspection, and query information about them after they’ve been loaded. Replay’s devtools are based on this protocol, and are fully open source.
Both of these interfaces are designed to support adaptation/extension to new languages and new clients. Together with Replay’s recorder and backend, they form a platform for time travel based debugging and analysis of interpreted languages. Once a runtime has been integrated with the recorder so that it replays reliably and supports the APIs used to inspect its state, any client using the Record Replay Protocol can debug that runtime’s recordings. Likewise, new clients using the protocol will be able to debug recordings made by any runtime that has been integrated with the recorder.

Example

To show how these interfaces work, we’ll use an example of setting a breakpoint in Replay’s devtools. When a breakpoint is added, the devtools console is updated to show every point where that breakpoint is hit. The breakpoint has a message which can be edited to evaluate an expression everywhere the breakpoint is hit; the console is then updated with the results of those evaluations within a second or two.
To understand how this works, let’s start by describing this from the perspective of the devtools client. After loading the recording, the devtools sends a Debugger.findSources request to get the URLs and identifiers for every source (a piece of JavaScript) that was loaded by the recording. It then sends Debugger.getSourceContents and Debugger.getPossibleBreakpoints requests to fetch the text for these sources and the set of places where breakpoints can be added. Evaluating a user-provided expression everywhere the breakpoint is hit is done with a few Analysis requests: Analysis.createAnalysis specifies an analysis that can run when the program is paused somewhere, Analysis.addLocation indicates that the analysis should run everywhere a specific breakpoint is hit, and Analysis.runAnalysis starts the analysis and returns the results of performing it at all the hits for that breakpoint. The analysis specification causes Pause.getTopFrame and Pause.evaluateInFrame requests to run at each of these hits, so that the result of evaluating the expression at each of these hits is included in the analysis results and can be shown to the user.
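The order of requests described above can be sketched from the client's side. This is only an illustration: the method names come from the description above, but the `send` transport, the parameter and result field names (`sourceId`, `analysisId`, `results`, and so on), and the `FakeBackend` class are assumptions made for this sketch, not the protocol's actual schema.

```python
from typing import Any, Callable

# Hypothetical transport: send(method, params) returns the backend's reply.
Send = Callable[[str, dict], Any]


def log_breakpoint_hits(send: Send, source_id: str, line: int, expression: str) -> list:
    """Issue the requests for one breakpoint, in the order described above."""
    # 1. Discover the sources, then fetch text and breakpoint sites for one.
    send("Debugger.findSources", {})
    send("Debugger.getSourceContents", {"sourceId": source_id})
    send("Debugger.getPossibleBreakpoints", {"sourceId": source_id})

    # 2. Describe an analysis that evaluates the expression when paused.
    #    (Field names here are assumptions, not the real schema.)
    analysis = send("Analysis.createAnalysis", {"expression": expression})
    analysis_id = analysis["analysisId"]

    # 3. Attach it to the breakpoint location and run it at every hit.
    location = {"sourceId": source_id, "line": line}
    send("Analysis.addLocation", {"analysisId": analysis_id, "location": location})
    return send("Analysis.runAnalysis", {"analysisId": analysis_id})["results"]


class FakeBackend:
    """In-memory stand-in for the backend: records calls, returns canned replies."""

    def __init__(self) -> None:
        self.calls: list = []

    def __call__(self, method: str, params: dict) -> Any:
        self.calls.append(method)
        if method == "Analysis.createAnalysis":
            return {"analysisId": "a1"}
        if method == "Analysis.runAnalysis":
            return {"results": [{"point": "10", "value": 1},
                                {"point": "25", "value": 2}]}
        return {}
```

Calling `log_breakpoint_hits(FakeBackend(), "src1", 12, "x")` returns the canned results, one entry per hit, which is the shape of data the console would render.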
Replay’s backend responds to all of these requests by replaying the recording and fetching the information it needs from the replayed program through the Recorder API. Several kinds of information are needed:
  • The set of sources loaded in the recording, and their contents and breakpoint locations.
  • For each breakpoint location, the set of points in the recording where that location is hit.
  • For each point where the analysis runs, the result of evaluating the user expression when the program is paused at that point.

Getting Source Contents

Every time the runtime’s virtual machine loads or creates a new source for the language being interpreted, it calls a RecordReplayOnNewSource API in the recorder. This call doesn’t do anything when recording, but the replayer can use this API to enumerate the sources in the recording. When it does so, it will use a callback (which the VM installed at startup) to fetch the source’s contents and breakpoint locations. This callback is based on the same requests used in the Record Replay Protocol: in the same way that a client sends a Debugger.getSourceContents request to asynchronously get a source’s contents from the backend, the backend’s replayer sends a Debugger.getSourceContents request to synchronously get those same contents, which it can store and then send to the client when needed.
When replaying, these request callbacks may or may not be invoked during the RecordReplayOnNewSource call, and either way, replaying needs to continue to work afterwards. Essentially, the replayer’s behavior within these API calls is a source of non-determinism. If it asks for source contents or breakpoint locations, the runtime needs to get that information, and in doing so it can change the VM’s state. For example, some JavaScript engines compile functions lazily: normally, compilation happens when the function is first called, but getting its breakpoint locations requires compiling it earlier than that. The point where a function is first compiled is then non-deterministic. JS compilation involves allocating both malloc’ed and garbage collected objects, but because malloc and the GC are also allowed to behave non-deterministically, replaying can continue without running into problems.
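A toy model can make the lazy compilation point concrete. Everything in it is an assumption for illustration: `ToyVM`, `Replayer`, and the hook signature are invented stand-ins for the VM, the replayer, and RecordReplayOnNewSource, and the "sources" are Python expressions rather than JavaScript. The sketch shows that asking for breakpoint locations moves the compilation point without changing what the program computes.

```python
class ToyVM:
    """Toy interpreter that compiles sources lazily (hypothetical model)."""

    def __init__(self, on_new_source):
        self.on_new_source = on_new_source  # stand-in for RecordReplayOnNewSource
        self.sources = {}
        self.compiled = {}      # source id -> code object, filled lazily
        self.compile_log = []   # records what triggered each compilation

    def handle_request(self, method, source_id):
        """Callback installed at startup; the replayer calls it synchronously."""
        if method == "Debugger.getSourceContents":
            return self.sources[source_id]
        if method == "Debugger.getPossibleBreakpoints":
            # Breakpoint locations need compiled code, so compile early.
            self._compile(source_id, reason="inspection")
            return list(range(1, len(self.sources[source_id].splitlines()) + 1))
        raise ValueError(f"unsupported request: {method}")

    def _compile(self, source_id, reason):
        if source_id not in self.compiled:
            self.compiled[source_id] = compile(self.sources[source_id], source_id, "eval")
            self.compile_log.append((source_id, reason))
        return self.compiled[source_id]

    def load(self, source_id, text):
        self.sources[source_id] = text
        self.on_new_source(self, source_id)

    def run(self, source_id):
        return eval(self._compile(source_id, reason="first call"))


def recording_hook(vm, source_id):
    """While recording, the new-source notification does nothing."""


class Replayer:
    """While replaying, the hook may pull contents and breakpoint locations."""

    def __init__(self):
        self.sources = {}

    def hook(self, vm, source_id):
        self.sources[source_id] = {
            "contents": vm.handle_request("Debugger.getSourceContents", source_id),
            "breakpoints": vm.handle_request("Debugger.getPossibleBreakpoints", source_id),
        }
```

Loading the same source into a recording-mode VM and a replaying-mode VM gives the same result from `run`, while `compile_log` differs: the first compiles on the first call, the second during inspection.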
This illustrates one of the main benefits of recording and replaying with effective determinism rather than complete determinism. Extracting information about sources is one of the simpler analyses which the replayer will do, but if we insisted on replaying with complete determinism then we would have to get this information without using malloc or allocating GC’ed objects, which would require major invasive changes to the VM. Replaying with effective determinism allows this analysis to run without needing specialized VM changes, and takes advantage of the fact that lazy function compilation is an optimization that has already been designed to work without affecting the behavior of the running JavaScript.

Tabulating Breakpoint Hits

To be able to quickly run analyses against all the points where a breakpoint is hit, the replayer runs an analysis that collects the entire set of hits on every breakpoint location in the recording, so those hits can be indexed and queried rapidly. This is done through the RecordReplayOnInstrument API, which the VM calls while replaying to describe what code is running: calls are made whenever a breakpoint site is reached, and in a few other places. This is similar to the RecordReplayOnNewSource API described above, but unlike that API the replayer can also notify the VM about whether instrumentation is enabled (and instrumentation calls need to be made) by using a callback the VM installs at startup.
When replaying, it is non-deterministic whether instrumentation is enabled and what happens within instrumentation calls, but since we replay with effective determinism, replaying will not be affected by any instrumentation-related allocations or other side effects. In fact, because the JITs behave non-deterministically they can optimize away instrumentation logic entirely when instrumentation is disabled, and discard that optimized code if it is enabled later on. This uses the JIT’s existing mechanisms for optimizing and deoptimizing code, and ensures replaying is fast when instrumentation-based analyses aren’t running.
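The tabulation step can be sketched with a toy VM that only makes instrumentation calls when the replayer has switched them on. The names here (`InstrumentedVM`, `set_instrumentation`, `tabulate_hits`) are hypothetical, the trace format is invented, and a simple step counter stands in for real execution points.

```python
from collections import defaultdict


class InstrumentedVM:
    """Toy VM that reports breakpoint sites only while instrumentation is on."""

    def __init__(self, on_instrument):
        self.on_instrument = on_instrument    # stand-in for RecordReplayOnInstrument
        self.instrumentation_enabled = False  # toggled by the replayer's callback

    def set_instrumentation(self, enabled):
        self.instrumentation_enabled = enabled

    def run(self, trace):
        """Execute (location, value) steps and return the program's result."""
        total = 0
        for point, (location, value) in enumerate(trace):
            if self.instrumentation_enabled:
                self.on_instrument(point, location)
            total += value  # the program's own behavior, identical either way
        return total


def tabulate_hits(trace):
    """Replay once with instrumentation on, indexing hit points by location."""
    hits = defaultdict(list)
    vm = InstrumentedVM(lambda point, location: hits[location].append(point))
    vm.set_instrumentation(True)
    result = vm.run(trace)
    return dict(hits), result
```

With instrumentation left off, `run` takes the same steps and returns the same result but never enters the callback, which is the fast path a JIT could specialize for.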

Evaluating Expressions

Evaluating an expression while the replayed program is paused at a specific point in the recording uses the same techniques we’ve described above. Before doing the evaluation, the program needs to be in the right place, and the replayer uses the instrumentation API to do this: it enables instrumentation, then replays until the desired point is reached. Inside the associated RecordReplayOnInstrument call, it sends requests to the runtime to do the evaluation. This is analogous to how it sent a Debugger.getSourceContents request when analyzing the sources, except that it uses the same Pause.getTopFrame and Pause.evaluateInFrame requests which the client originally specified.
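As a rough sketch of this flow, the toy below replays a small loop with an instrumentation hook installed and, once the target point is reached, issues the two Pause requests from inside the hook. `PausableVM`, the integer points, and the `frameId`, `expression`, and `value` fields are assumptions for illustration; Python's `eval` stands in for the runtime's own evaluator.

```python
class PausableVM:
    """Toy VM with a frame stack and a synchronous request handler."""

    def __init__(self):
        self.frames = []           # each frame is a dict of local variables
        self.on_instrument = None  # stand-in for RecordReplayOnInstrument

    def handle_request(self, method, params):
        if method == "Pause.getTopFrame":
            return {"frameId": len(self.frames) - 1}
        if method == "Pause.evaluateInFrame":
            scope = dict(self.frames[params["frameId"]])
            # eval with builtins removed, using the frame's locals as scope.
            return {"value": eval(params["expression"], {"__builtins__": {}}, scope)}
        raise ValueError(f"unsupported request: {method}")

    def run(self, steps):
        frame = {"i": 0, "total": 0}
        self.frames.append(frame)
        for i in range(steps):
            frame["i"] = i
            frame["total"] += i
            if self.on_instrument is not None:
                self.on_instrument(self, i)  # a breakpoint site is reached
        self.frames.pop()


def evaluate_at(target_point, expression, steps=5):
    """Replay until target_point, then evaluate the expression in the top frame."""
    result = {}

    def on_instrument(vm, point):
        if point != target_point:
            return
        top = vm.handle_request("Pause.getTopFrame", {})
        reply = vm.handle_request(
            "Pause.evaluateInFrame",
            {"frameId": top["frameId"], "expression": expression},
        )
        result.update(reply)

    vm = PausableVM()
    vm.on_instrument = on_instrument
    vm.run(steps)
    return result.get("value")
```

The evaluation happens while the VM is still inside the instrumentation call, so the frame contents seen by the expression are exactly those at the target point.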
There are several kinds of protocol requests which a runtime needs to support in order for the replayer to read the paused state and show it to clients. These requests fetch the stack contents, the variables in scope for those stack frames, and the contents of objects referenced by those variables. Other requests are optional, like being able to evaluate expressions or fetch language/environment specific data like the DOM/CSS state of a web page. For runtimes which have a built-in debugging interface for use by conventional debuggers (like the Chrome Devtools Protocol, which the Record Replay Protocol was modeled on), handling these requests only requires a relatively small layer of code to translate between the protocol requests and the built-in interface.
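The translation layer mentioned above might look something like the following adapter. It is a sketch under stated assumptions: `BuiltinDebugger` is an invented stand-in for a runtime's existing debugging interface, the reply fields are made up, and only the two Pause requests named in this post are handled, with evaluation treated as the optional capability.

```python
class BuiltinDebugger:
    """Hypothetical stand-in for a runtime's existing debugging interface."""

    def __init__(self, stack):
        self._stack = stack  # list of (function name, locals), innermost last

    def current_frames(self):
        return [name for name, _ in self._stack]

    def frame_locals(self, index):
        return dict(self._stack[index][1])


class ProtocolAdapter:
    """Thin layer translating protocol requests into built-in debugger calls."""

    def __init__(self, debugger, supports_evaluation=True):
        self._debugger = debugger
        self._handlers = {"Pause.getTopFrame": self._top_frame}
        if supports_evaluation:  # evaluation is an optional capability
            self._handlers["Pause.evaluateInFrame"] = self._evaluate

    def handle(self, method, params):
        handler = self._handlers.get(method)
        if handler is None:
            return {"error": f"unsupported request: {method}"}
        return handler(params)

    def _top_frame(self, params):
        frames = self._debugger.current_frames()
        return {"frameId": len(frames) - 1, "functionName": frames[-1]}

    def _evaluate(self, params):
        scope = self._debugger.frame_locals(params["frameId"])
        return {"value": eval(params["expression"], {"__builtins__": {}}, scope)}
```

Using a dispatch table keeps each request's translation small and makes it straightforward for a runtime to leave optional requests unregistered.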

The Future

The example above illustrates the sorts of analyses that Replay can currently perform, to support new and much more efficient time travel based debugging workflows. We are, however, just getting started. The combination of low overhead recording and in-depth (high overhead) analysis while replaying opens the door to being able to thoroughly understand the behavior of any piece of software, no matter where it is running.
