I build robots for fun. – Rick Sanchez
It’s common knowledge that in 2019 the NSA open-sourced its reverse engineering framework, Ghidra. Thanks to its versatility, it quickly became popular among security researchers. This article is one of several to come dedicated to the technical details of the ghidra_nodejs plugin for Ghidra, developed by our team. The plugin’s main job is to parse, disassemble, and decompile Node.js Bytenode (.jsc) binaries. The focus of this article is the V8 bytecode and the relevant source code entities. A brief description of the plugin is also provided, which will be expanded upon in greater detail in subsequent articles.
Scripts are distributed as plain text, i.e. the source code is available. Therefore, all engines follow one of two schemes. Without a bytecode interpreter:
- Read and parse a script into an abstract syntax tree (AST).
- Compile the AST into machine code.
- Run the machine code.

With a bytecode interpreter, an intermediate step is added:
- Read and parse the script into an abstract syntax tree (AST).
- Compile the AST into bytecode.
- Compile the bytecode into machine code.
- Run the machine code.
No off-the-shelf tools for working with serialized JSC files were found. That is, of course, apart from Node.js itself. Node.js has the --print-bytecode flag, but the results leave a lot to be desired: a bare-bones function disassembler without cross-references or any additional information about objects is not the ideal aid for a fully-fledged reverse engineering effort via static code analysis. Therefore, we had to reinvent the wheel and roll our own tool.
Node.js + bytenode
Stop digging for hidden layers and just be impressed! – Rick Sanchez
One of the core features of Node.js is the ability to add packages, i.e. external libraries from the Node.js ecosystem, to scripts. The node package manager (npm) gives access to more than half a million open-source packages.
Let’s take a look at the V8 compilation pipeline, post 2017.
To quote the article Know your JIT: closer to the machine [RU]:
“Feeble” smartphones began to spend more time analyzing script source code than running it. Then the Ignition interpreter appeared. The introduction of bytecode allowed a decrease in memory consumption and parsing overhead.
The Ignition bytecode interpreter is a register machine with an accumulator, with its own set of instructions and registers. The bytecode, and the objects that interact with it, are stored inside V8 engine structures, so there is no longer any need to constantly reparse the script source code. Serialization is the ability of the V8 engine to store bytecode and the structures linked with it in a binary representation; deserialization (the reverse process) transforms that binary data back into bytecode and the objects linked with it.
The bytenode package uses a native feature of the Node.js platform: the ability to obtain serialized code. In Node.js, compilation takes place using the vm module and its Script class. Starting from version 5.7, vm.Script allows the produced bytecode to be stored in a cache. Example for versions higher than 10:
After running this code, the serialized data will be located in helloBuffer.
The bytenode package stores serialized data in files with the .jsc extension. Using the same package, a .jsc file (from now on we will refer to such files as JSC files) can be launched for execution, as described in the article How to Compile Node.js Code Using Bytenode?.
JSC files have proven to be a good method of concealing script source code from prying eyes.
What can be done?
Wubba lubba dub dub. – Rick Sanchez
Reading various literature on Node.js and the V8 engine, combined with static analysis of the sources, led us to the following conclusions:
- A parser for JSC file headers is required.
- We need to (try to) look into the inner workings of the V8 engine, into the specific objects within which bytecode is created. As a result, we need to write a loader for our disassembler that recreates similar objects and displays the instructions and operands (preferably in a format like the one Node.js produces when launched with the --print-bytecode flag on a JSC file).
- An ambitious idea is to attempt to write a decompiler for the bytecode, that is, to map a register machine with an accumulator working on bytecode onto the output of a C-style decompiler. Therefore, the disassembler chosen was the open-source Ghidra, with its versatility in terms of different architectures and an out-of-the-box decompiler.
Before knocking on the open door with the “JSC file” nameplate, we will try to describe some of the concepts concerning the V8 engine, armed with the Node.js source code.
The V8 universe
The universe is so big, Morty, that nothing in the world matters. – Rick Sanchez
The arrangement of the V8 engine is not straightforward. Here we are only going to look at the entities linked to the bytecode of the Ignition interpreter.
Isolate is an independent copy of the V8 runtime environment, including its own heap. Two different Isolate instances can work in parallel, and they can be regarded as completely separate, isolated instances of the V8 runtime environment.
An important point for us: at the object level in serialized code, a Closure looks like a SharedFunctionInfo object. Analysis showed that all serialized data looks like a Russian matryoshka doll made of SharedFunctionInfo objects that refer to one another.
When serialized, a SharedFunctionInfo object looks like a large structure combining the different entities required for processing bytecode. Some of them are shown in Figure 3.
Hidden map (hidden class). All objects created by the interpreter (for example, the same SharedFunctionInfo) are constructed not from scratch but based on a hidden class (the Map class), which contains a set of standard properties. Hidden classes already exist within the engine; they are created at runtime, and links to them are stored in the roots_ array of the Heap object.
During serialization (and deserialization), the link to the hidden class is stored as an index into the roots_ array. In node.exe (for version 8.16.0) the index for the hidden class kSharedFunctionInfoMapRootIndex equals 0x13 (see Figure 5). The number and contents of elements in roots_ may change from one V8 engine version to another in Node.js. Therefore, it is important to pay attention to the version of the V8 engine when deserializing a JSC file.
LexicalEnvironment is a repository for data in memory and a mechanism for extracting this data when it is addressed. The environment may be internal (the ScopeInfo structure) or external (the OuterScopeInfo structure). Quote (link [RU]):
“Each time that a function is called in the program, within the interpreter the special LexicalEnvironment lexicon is created that is associated with that call. All definitions of the constant, variables, and so on within the function are automatically written to the lexicon. The name of the definition (the identifier, that is, the name of the constant, variable and so forth) becomes a key, and the value of the definition becomes a value in the lexicon. The definitions include arguments, constants, functions, variables, and more.
The external environment in relation to the function is the environment in which the function was declared (and not called!).”
And in the same place: “…the interpreter carries out a search for the value of the identifier not only in the local lexical environment (the one where the identifier is being used), but also in the external environment. The search begins from the local environment, and if the required identifier is not found in it, then the search goes on up to module level, and then up to global level.”
In the example in Figure 6, the constructor function refers to the errorMessages variable. In the interpreter, the constructor function is represented by a SharedFunctionInfo structure having ScopeInfo (local variables) and OuterScopeInfo (external variables). In this case, OuterScopeInfo stores the address of the ScopeInfo structure of the parent SharedFunctionInfo, where the declaration of the errorMessages variable is located.
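In plain JavaScript, the situation in Figure 6 can be sketched roughly like this (the class body is an assumption reconstructed from the names mentioned in this article):

```javascript
// Declared in the enclosing (module) scope: for the constructor below,
// this is the *external* lexical environment (OuterScopeInfo).
const errorMessages = {
  1: 'unknown error',
  2: 'connection lost',
};

class GenericError extends Error {
  constructor(code, err) {
    // `errorMessages` is not among the constructor's own locals
    // (ScopeInfo), so the interpreter resolves it one level up,
    // through OuterScopeInfo.
    const message = err || errorMessages[code];
    super(message);
  }
}

console.log(new GenericError(2).message); // "connection lost"
```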
Built-ins are functions embedded in the V8 engine itself. They can be implemented in different ways (builtinKind) and are placed in a global array (builtins) of the V8 engine.
Deciphering the methods of implementing the builtinKind enumeration:
- CPP is the Builtin embedded function, written in C++.
- API is the Builtin embedded function, written in C++ with the use of API callbacks.
- TFJ is the Builtin embedded function, written in Turbofan (JS linkage).
- TFS is the Builtin embedded function, written in Turbofan (CodeStub linkage).
- TFC is the Builtin embedded function, written in Turbofan (CodeStub Linkage and custom descriptor).
- TFH is the Builtin embedded function, written as a handler for Turbofan (CodeStub linkage).
- ASM is the Builtin embedded function, written in platform-dependent assembler.
Embedded functions may include mathematical functions, string-to-number conversion functions, and others. Embedded functions are called directly from bytecode: StackCheck, Construct, ConsoleLog, CreateGeneratorObject, JsonParse, and others.
FunctionData (bytecode) is a structure that stores the so-called ConstantPool array for constants and the bytecode itself. As was mentioned earlier, the V8 interpreter is a register machine with an accumulator. The interpreter uses the following registers (shown below is the spelling of the registers as displayed in a bytecode listing):
- <this>, <closure>, <context> are internal registers of the interpreter that are required for its work (roughly speaking, <closure> stores a reference to SharedFunctionInfo, <context> stores a reference to ScopeInfo).
- The ACCU accumulator register.
- The argument registers a0, a1, a2… and function registers r0, r1, r2… Each register within bytecode is encoded with up to four bytes. The most significant bit determines the type of register, aX or rX; the remaining bits are the number of the register. Therefore, there may be up to 0x7FFFFFFB function registers and 0x7FFFFFFE argument registers (provided there is sufficient memory for storing register values in the interpreter).
Each bytecode instruction reads or modifies the state of the ACCU accumulator. The operands of an instruction are registers r0, r1, and so on; indexes into ConstantPool (in the disassembler they look like numbers in square brackets); or indexes of variables from the StackLocal/ContextLocal arrays of the lexical environment structures ScopeInfo/OuterScopeInfo.
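The register encoding described above can be sketched as follows (a simplification based on this description, not the exact V8 operand encoding, which stores registers as frame-relative indexes):

```javascript
// Decode a 4-byte register operand: the most significant bit selects
// between argument registers (aX) and function registers (rX), the
// remaining 31 bits hold the register number.
function decodeRegister(raw) {
  const isArgument = (raw & 0x80000000) !== 0;
  const number = raw & 0x7fffffff;
  return (isArgument ? 'a' : 'r') + number;
}

console.log(decodeRegister(0x80000001)); // "a1"
console.log(decodeRegister(0x00000002)); // "r2"
```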
Let’s evaluate the scale of parsing FunctionData using the example of the disassembly of the Constructor function for the GenericError class inherited from the base class Error. Figure 8 on the left shows the source code of the GenericError class constructor, on the right its bytecode in the SharedFunctionInfo object.
The SharedFunctionInfo structure for the constructor function will have the name “GenericError.” The Flags field of its ScopeInfo structure tells us that this function is a constructor of a derived class (Flags.FunctionKind = DerivedConstructor) and that it has an external lexical environment (Flags.HasOuterScopeInfo = True). The function takes two arguments, a0 (“code”) and a1 (“err”). The function has StackLocals internal local variables: r0 (special name “.new.target“), r1 (special name “.this_function“), and r2 (a variable named “message“). Special names always start with a period; the V8 engine needs them when creating machine code from bytecode.

The function has no ContextLocals context variables. If StackLocals are local variables, then ContextLocals are local context variables. This is because V8 regards every SharedFunctionInfo as a context of its own, with its own bytecode and local variables, and variables can be accessed via the context of the current function, the context of another function, or a specially created context (especially when asynchronous async/await calls are being used).
The bytecode instructions then follow. The instructions follow the same sequence as the lines in the script source code. A short description of each instruction can be found in the Node.js source code.
In our version of the V8 engine there were 167 instructions, of which several dozen are used often. The peculiarity of the bytecode instructions stems from the fact that bytecode is not an end in itself for the engine but only a springboard for creating machine code with the TurboFan compiler. Therefore, certain specific operands and unusual instructions are redundant: their task is to represent the script source code in a stricter and more typified form. That is to say, with the disassembly of the bytecode in hand, it is POSSIBLE to restore the script source code. Manually, it is possible. Our task is to try to do it automatically using a decompiler.
Header of the JSC file
Morty, relax, it’s just a bunch of ones and zeros out there. – Rick Sanchez
We should say right away that version 8.16 of Node.js was the one studied (version 6.2.414.77 of the V8 engine). All versions of the software platform are available at the official site.
The JSC file consists of a header and serialized data. The table shows the structure of the header.
What is worth paying attention to? First of all, the kVersionHash field, which stores a hash of the V8 engine version string. For example, for 6.2.414.77 (the engine used in Node.js version 8.16.0), the hash value is 0x35BA122E. The kVersionHash field is one of the mandatory fields that Node.js checks when launching a JSC file. The V8 engine does not stand still: new releases come out, and in due course the format of the serialized data changes. Therefore, tracking the version when deserializing is a mandatory condition.
Second, it is worth taking a look at how objects are written to the stream. During serialization, objects are processed in a special way and written to the binary stream in the same order as they are arranged in the heap. If an object has already been serialized and is encountered again (different objects can share common objects, for example, strings with an identical value), then the object is not duplicated in the binary stream but is written as a link to its first entry. This link is a numeric heap offset (this technique is known as pointer compression: a pointer is written as an offset relative to a base). During deserialization, the V8 engine creates objects from the binary data in the same order and correctly resolves the link offsets.
Conclusion: during deserialization, our loader MUST process the stream of serialized data in the same way as the V8 engine does (emulating the arrangement of objects in the heap).
The stream of serialized binary data itself begins immediately from offset 0x3C. That will be discussed in the next part.
V8 engine serialized data loader
Hate the game and not the player, son. – Rick Sanchez
In telegraph style, we are going to quickly describe how the V8 engine loader works, since it is its logic that we followed when creating the loader in our plugin. A detailed article on the loader will appear later.
1) Where does the reading of the serialized data take place? The function can be found in deserializer.cc in the Node.js source code.
Our loader follows the logic of this function when parsing serialized data.
2) There are several number-coding formats that the loader needs to know about:
- Number storage format: variable-length encoding (the GetInt function in the file snapshot-source-sink.h in the Node.js source code). This format is used in serialized data, for example, to store the sizes of objects.
- Number storage format: zigzag encoding (file source-position-table.cc in the Node.js source code). It combines two methods of encoding. The first is variable-length integer encoding (a popular method of “compressing” integers; the idea is 7-bit encoding: one byte contains 7 bits of information, and the most significant bit indicates whether the following byte contains useful information). The second is the zigzag method of encoding positive and negative numbers (the sign bit is transferred from the most significant bit of the number to the least significant bit). The SourcePositionTable structure (a special match table, where line numbers of the source script code are matched to offsets within the bytecode) is stored in serialized data in this format.
- Number storage format: SMI (small integer). In the 32-bit version of the V8 engine, the least significant bit of an SMI signifies the following: 1 means the value is a pointer, 0 means it is an ordinary number. To obtain the SMI value, the number must be shifted one bit to the right (SMI >> 1). The numeric values of fields within objects (for example, the Length field in a FixedArray object) are stored in SMI format in serialized code.
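The latter two formats can be sketched as follows (a simplified illustration of the schemes described above, not the exact V8 implementation):

```javascript
// Variable-length integers: 7 payload bits per byte, the most
// significant bit flags whether another byte follows.
function decodeVarint(bytes, pos = 0) {
  let value = 0;
  let shift = 0;
  let b;
  do {
    b = bytes[pos++];
    value |= (b & 0x7f) << shift;
    shift += 7;
  } while (b & 0x80);
  return { value, pos };
}

// ZigZag: the sign bit is moved to the least significant bit, so small
// negative numbers also stay small when varint-encoded.
function zigzagDecode(n) {
  return (n >>> 1) ^ -(n & 1);
}

// SMI (32-bit V8): bit 0 clear means "ordinary number", value = smi >> 1.
function smiValue(smi) {
  return smi >> 1;
}

console.log(decodeVarint([0xac, 0x02]).value); // 300
console.log(zigzagDecode(5)); // -3
console.log(smiValue(84)); // 42
```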
3) Format of serialized data:
<opcode> (one byte) <opcode data> (N bytes)
In the V8 engine, each object consists of a set of fields (a direct analogue with the fields of a class in object-oriented programming). Each field can be a “pointer” to another object. This object can be embedded (stored in the roots_ array of the Heap object), or already created in the early stages of deserialization (a so-called back reference must be stored for it), or is a new object (in this case the deserialization of its fields is launched).
4) Which opcodes are implemented in the loader (the opcodes enumerated here are the ones we implemented in our plugin):
- 00h-04h (kNewObject) = creation of a new object. The object is always placed in the Reservation Data (heap fragments managed by the V8 engine). How is a specific fragment chosen? Using a simple operation: opcode & 7. The following fragments exist:
- 0 = NEW_SPACE
- 1 = OLD_SPACE
- 2 = CODE_SPACE
- 3 = MAP_SPACE
- 4 = LO_SPACE (for storing large objects)
- 08h-0Ch (kBackref) = back reference to an object. To reduce the size of serialized data, there is no need to reserialize already serialized objects; so-called back references are used instead. All objects are created in their heap fragments: NEW_SPACE, OLD_SPACE, CODE_SPACE, MAP_SPACE, LO_SPACE. Therefore, there are also five back-reference opcodes, one for each type of heap fragment. All the objects in a heap fragment are located sequentially, one after another. It turns out that the value of a back reference is essentially the offset of the object within the heap fragment.
Our plugin loader acts in a similar way: it keeps a snapshot of the created objects for each heap fragment. The loader also keeps track of the sizes of objects, taking alignment into account. When a back reference appears in the serialized data, the loader finds the object by going through the list of created objects. The object found is also written to the “hot links” array (hot_objects, a small array, only seven elements). The hot_objects array is required for processing the kHotObject opcode.
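This bookkeeping for back references can be sketched like this (a simplified model with made-up names, assuming 4-byte alignment):

```javascript
// One snapshot per heap fragment (NEW_SPACE, OLD_SPACE, ...): objects are
// recorded in creation order, and a back reference is just an offset into
// the fragment.
class SpaceSnapshot {
  constructor() {
    this.entries = [];
    this.size = 0;
  }

  add(obj, byteSize) {
    // Track sizes with alignment so offsets match the engine's heap layout.
    const aligned = (byteSize + 3) & ~3;
    this.entries.push({ offset: this.size, obj });
    this.size += aligned;
    return this.size - aligned; // offset of the object just added
  }

  resolveBackref(offset) {
    const entry = this.entries.find((e) => e.offset === offset);
    return entry ? entry.obj : null;
  }
}

const space = new SpaceSnapshot();
space.add({ name: 'map' }, 10); // occupies 12 bytes after alignment
space.add({ name: 'string' }, 8);
console.log(space.resolveBackref(12).name); // "string"
```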
- 05h (kRootArray) = find a Map-class object in the roots_ array of the Heap object and write a pointer to the object in the current offset of the deserialized object. The index for the roots_ array is located in the opcode data. The object itself is written in parallel to the hot_objects array. The hot_objects array is required for processing the kHotObject opcode.
- 0Dh (kAttachedReference) = find an object in the attached_objects array and write a pointer to the object in the current offset of the deserialized object. For our plugin loader, the attached_objects array contains one element with a zero (it is not used).
- 0Eh (kBuiltin) = find an object in the builtins array and write a pointer to the object in the current offset of the deserialized object. An example of a common embedded object is InterpreterEntryTrampoline.
- 0Fh (kSkip), 15h-17h (kWordAligned, kDoubleAligned, kDoubleUnaligned) = our loader emulates these opcodes for alignment of data on the heap.
- 18h (kSynchronize) = this is the deserialization breakpoint. The deserialization process is completed.
- 19h (kVariableRepeat) = used for repeating the previous deserialized value. The number of repeats is transmitted in the opcode data.
- 1Ah (kVariableRawData) = used when it is necessary to write “raw” data into the object. The size of the data must be aligned (found in the opcode data).
- 38h-3Fh (kHotObject) = find an object in the hot_objects array and write a pointer to the object in the current offset of the deserialized object.
- 4Fh (kNextChunk) = this is an instruction to the V8 engine to “proceed to the next reserved chunk.” The instruction is related to fragments of the “Reservation Data” heap and to placing so-called FILLER objects in them. Fillers reserve memory in the heap fragment. The loader takes these FILLER objects into account for the correct calculation of the values of back references.
- 80h-9Fh (kRootArrayConstants) = find an object in the roots_ array. The index into the roots_ array is calculated using the formula “value – 80h.” The pointer to the object found is written into the current offset of the deserialized object.
- C0h-DFh (kFixedRawData) = used when it is necessary to write “raw” data (a fixed number of bytes) into the object. The number of bytes is calculated using the formula “value – BFh.”
- E0h-EFh (kFixedRepeat) = is used for repeating the previous deserialized value a fixed number of times. The number of repeats is calculated using the formula “count = (value – DFh).”
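Put together, the ranged opcodes above decode as follows (a hypothetical helper that simply restates the ranges and formulas from the list):

```javascript
// Map a serialized-data opcode byte to its meaning, following the
// ranges and formulas listed above.
function decodeOpcode(op) {
  if (op >= 0x00 && op <= 0x04) return { kind: 'kNewObject', space: op & 7 };
  if (op >= 0x08 && op <= 0x0c) return { kind: 'kBackref', space: op & 7 };
  if (op === 0x18) return { kind: 'kSynchronize' };
  if (op >= 0x38 && op <= 0x3f) return { kind: 'kHotObject', index: op - 0x38 };
  if (op >= 0x80 && op <= 0x9f) return { kind: 'kRootArrayConstants', index: op - 0x80 };
  if (op >= 0xc0 && op <= 0xdf) return { kind: 'kFixedRawData', bytes: op - 0xbf };
  if (op >= 0xe0 && op <= 0xef) return { kind: 'kFixedRepeat', count: op - 0xdf };
  return { kind: 'other', op };
}

console.log(decodeOpcode(0x83)); // { kind: 'kRootArrayConstants', index: 3 }
console.log(decodeOpcode(0xc5)); // { kind: 'kFixedRawData', bytes: 6 }
```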
5) What objects the loader in the plugin parses in the course of deserialization:
- SharedFunctionInfo (a closure object)
- ScopeInfo and OuterScopeInfo (lexical environment)
- FunctionData (object with bytecode)
- FixedArray, FixedCOWArray (a “key-value” lexicon; objects themselves can be a key)
- ConsString (a binary tree, where leaves, for example, are ordinary ConsOneByteString string objects)
- HandlerTable (exception handlers)
- SourcePositionTable (special table of matches where string numbers of the source script code match offsets within the bytecode)
The parsing of enumerated objects is necessary in order to correctly form data for the bytecode disassembler and for the subsequent work of the decompiler.
— Why would anyone do this?
— The reason anyone would do this is, if they could, which they can’t, would be because they could, which they can’t. – Rick and Morty
And again, we are going to briefly go through how the plugin disassembler works. Detailed articles on the processor module will appear later.
1) Where bytecode is located: in SharedFunctionInfo objects created by the loader:
- In the object SharedFunctionInfo → FunctionData:
- The kBytecode field = the bytecode itself.
- The kConstantPool field = array made up of constants, strings, lexicons, and other objects.
- The SourcePositionTable field = table of matches, where string numbers of the source script code match offsets within the bytecode.
- kHandlerTable = table with exception handlers.
- In the object SharedFunctionInfo → ScopeInfo, the lexical environment for the bytecode of the function is situated.
2) What a V8 bytecode instruction looks like:
- Width (optional, 1 byte):
- 0 — Wide
- 1 — ExtraWide
- Opcode (1 byte)
- Operands (can be coded with 1, 2, or 4 bytes each)
// TestEqualStrict <src>, <slot_index>
// 00: Wide prefix (operands are widened to 2 bytes)
// 5a: TestEqualStrict, test whether the value in the <src> register strictly equals the accumulator
// 02 00: <src> = register a1
// 45 01: slot_index = 0x145
00 5a 02 00 45 01 TestEqualStrict.Wide a1, [0x145]
Several basic groups can be identified by instruction type: loading into the accumulator (prefix Lda), writing from the accumulator to a different place (prefix Sta), working with registers (Mov/Ldar/Star), arithmetic (prefixes Add/Sub/Mul/Div/Mod), boolean operations (BitwiseOr/BitwiseXor/BitwiseAnd/LogicalNot), shifts (prefix Shift), multipurpose embedded instructions (prefixes Call/Invoke/Construct and, without a prefix, StackCheck/TypeOf/ToName/ToNumber/ToObject/Illegal), comparison instructions (prefix Test), creation of objects (prefix Create), conditional jumps (prefix Jump), loops (prefix For), instructions to generate exceptions (prefix Throw), debugging instructions (prefix Debug), and branching (SwitchOnSmiNoFeedback).
Conclusion. Node.js version 8.16.0 provides 167 instructions.
Numeric operands in an instruction are stored in reverse byte order (little-endian).
3) The processor module for the plugin was written for the Ghidra framework. The bytecode instructions are described in SLEIGH. The analyzer, the loader, and the redefinition of p-code for certain V8 operations are implemented in Java. When a file is loaded into Ghidra, the loader parses the JSC file, breaking it into functions and defining the entities required for analysis and decompilation.
4) It was difficult to implement the function arguments and registers precisely as in Node.js. There was an option to implement them as registers or make a repository like a stack for each function. To better match the output of Node.js and for ease of perception, the decision was made to use the register model, although it did impose limits on the number of registers which a function can work with (a0-a125, r0-r123).
The result of all this work is the ghidra_nodejs plugin for the Ghidra disassembler.
You know the worst part about inventing teleportation? Suddenly, you’re able to travel the whole galaxy, and the first thing you learn is, you’re the last guy to invent teleportation. – Rick Sanchez
A great adventure needs a great ending! – Rick Sanchez
As a result, we completed our job of analyzing the server application. We were able to obtain readable code. We hope that this tool will help our readers when working with serialized bytecode from the V8 engine.
Many thanks to Vladimir Kononovich, Vyacheslav Moskvin, and Natalya Tlyapova for researching Node.js and developing the module.