From opcode to code: how AI chatbots can help with decompilation

Sometimes, when searching for vulnerabilities, you come across PHP code protected by a commercial encoder. These encoders perform a straightforward task: they compile the source code into Zend Engine bytecode and then encode it. The result looks something like this:

Unfortunately, there are no free or open source tools that can decode PHP scripts protected by commercial encoders, nor is there a decompiler for Zend VM opcodes.
Can we do something about that? Absolutely!

Zend Engine 101

Zend Engine plays a key role in executing PHP scripts, acting as both a compiler and runtime environment. The execution of a PHP script starts with the source code being parsed and then transformed into an Abstract Syntax Tree (AST). Next, the AST is converted into opcodes for the Zend virtual machine, which are then executed.
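The AST and opcode stages are internal to the engine and are not visible from plain PHP without extra extensions or debug tooling. The very first step, lexing, can be observed with the bundled tokenizer extension. The sketch below is only an illustration of that first stage; the helper name `list_tokens` is made up for this example.

```php
<?php
// Lists the lexer tokens for a snippet, skipping whitespace.
// list_tokens() is a hypothetical helper, not part of PHP.
function list_tokens(string $source): array
{
    $rows = [];
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            [$id, $text] = $token;
            if ($id === T_WHITESPACE) {
                continue;
            }
            $rows[] = [token_name($id), $text];
        } else {
            // Single-character tokens such as "{" or ";" come back as plain strings.
            $rows[] = ['(char)', $token];
        }
    }
    return $rows;
}

$source = '<?php function get_hello_text($name) { return "Hello, " . $name; }';
foreach (list_tokens($source) as [$name, $text]) {
    printf("%-28s %s\n", $name, $text);
}
```

The parser then turns this token stream into the AST, which the compiler lowers to opcodes.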


As mentioned earlier, commercial encoders handle every step except the last one: they compile the script but do not execute the opcodes. Instead, they serialize the opcodes along with their metadata, encode the resulting blob, and wrap it in a bootstrap script. When the encoded script runs, the reverse happens: the blob is decoded and deserialized, the opcodes and their metadata are loaded into the Zend VM, and finally the opcodes are executed.
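To make this round trip concrete, here is a toy sketch in PHP. It does not reproduce the format of any real encoder: the opcode records, the XOR-plus-Base64 "encoding", and the names `toy_xor`, `toy_encode`, and `toy_decode` are all invented for illustration. Real encoders work on the engine's internal op arrays in native code.

```php
<?php
// Toy model only: real encoders serialize the engine's internal op arrays
// in native code and use their own formats.

// Hypothetical stand-in for compiled opcodes plus their metadata.
$opArray = [
    'function' => 'get_hello_text',
    'literals' => ['Hello, '],
    'opcodes'  => [
        ['op' => 'RECV',   'result' => '$name'],
        ['op' => 'CONCAT', 'op1' => 'Hello, ', 'op2' => '$name', 'result' => 'T1'],
        ['op' => 'RETURN', 'op1' => 'T1'],
    ],
];

// XOR every byte of $data with a repeating key (illustrative "encoding").
function toy_xor(string $data, string $key): string
{
    $out = '';
    $keyLen = strlen($key);
    for ($i = 0, $n = strlen($data); $i < $n; $i++) {
        $out .= chr(ord($data[$i]) ^ ord($key[$i % $keyLen]));
    }
    return $out;
}

// Encoder side: serialize, encode, and return the text blob
// that a bootstrap script would carry.
function toy_encode(array $opArray, string $key): string
{
    return base64_encode(toy_xor(json_encode($opArray), $key));
}

// Loader side: decode and deserialize. A real loader would then
// hand the result to the Zend VM for execution.
function toy_decode(string $blob, string $key): array
{
    return json_decode(toy_xor(base64_decode($blob), $key), true);
}

$blob = toy_encode($opArray, 'secret');
$restored = toy_decode($blob, 'secret');
var_dump($restored === $opArray); // the round trip should be lossless
```

The point of the sketch is the symmetry: whatever the encoder does to the serialized opcodes, the loader has to undo before execution, which is why the loader is the natural place to look.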

Decoding and disassembling

Now that we understand how commercial encoders work, the next step is to figure out how to extract the opcodes and their metadata from an encoded blob. I’ll leave this challenge to you, dear reader, as the goal isn’t to take work away from developers of commercial encoders 🙂

In this study, I analyzed the loader of one of the most widely used commercial PHP encoders. As a result, I created a tool that can extract and disassemble the Zend Engine opcodes from encoded PHP scripts. Below is a glimpse of how this tool works.

Source code:

<?php 

function get_hello_text($name) { 
    return "Hello, " . $name; 
} 

echo get_hello_text("PT SWARM");

The result of disassembling the encoded script:

The disassembly alone is already sufficient for analysis and vulnerability hunting. However, I wanted to make my job easier and end up with something closer to regular PHP source code, so next came…

Decompilation

Developing a decompiler for the Zend virtual machine is a tricky task. There are over 200 opcodes, making the development quite time-consuming. While I could have used Ghidra’s flexible SLEIGH specification language, as my colleagues did for V8, I was looking for a simpler solution… and I found it!

AI

Just for fun, I asked Microsoft’s Copilot chatbot to decompile the Zend Engine opcodes, and it nailed it!

However, that felt too easy, so I decided to try decompiling something more complex.

I took a random WordPress function, encoded it with a protector, then decoded and disassembled it. I sent the result to the chat with a request to convert the Zend Engine opcodes in the comment block to PHP code.

I also took the code of an RC4 function and did the same thing, sending the following request: “convert PHP opcodes in the comment block to PHP code”.
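Assembling such a request is easy to script. The sketch below wraps a listing in a PHP comment block and prepends the request text quoted above. The helper name `build_decompile_prompt` and the sample listing are placeholders, not the output format of the actual disassembler.

```php
<?php
// Hypothetical helper: wraps a disassembly listing in a PHP comment block
// and prepends the request text.
function build_decompile_prompt(string $listing): string
{
    // Neutralize any "*/" inside the listing so it cannot close the comment early.
    $safe = str_replace('*/', '* /', $listing);

    return "convert PHP opcodes in the comment block to PHP code\n\n"
         . "/*\n" . rtrim($safe) . "\n*/\n";
}

// Placeholder listing; the real text would come from the disassembler.
$listing = "RECV \$name\nCONCAT 'Hello, ' \$name -> T1\nRETURN T1\n";
echo build_decompile_prompt($listing);
```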

Next, I ran the decompiled code to verify it.
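One way to do such a check for RC4 is to compare the decompiled function against a widely published test vector (key "Key", plaintext "Plaintext") and against an independent reference implementation. The sketch below is a minimal version of that idea, not the exact check used here. The names `rc4_reference` and `check_rc4`, and the `(key, data)` argument order, are assumptions; the decompiled function may have a different signature.

```php
<?php
// Reference RC4 used as a yardstick for the decompiled function.
// Assumes a non-empty key.
function rc4_reference(string $key, string $data): string
{
    // Key-scheduling algorithm (KSA).
    $s = range(0, 255);
    $j = 0;
    $keyLen = strlen($key);
    for ($i = 0; $i < 256; $i++) {
        $j = ($j + $s[$i] + ord($key[$i % $keyLen])) & 0xFF;
        [$s[$i], $s[$j]] = [$s[$j], $s[$i]];
    }

    // Pseudo-random generation algorithm (PRGA), XORed with the input.
    $i = $j = 0;
    $out = '';
    for ($n = 0, $len = strlen($data); $n < $len; $n++) {
        $i = ($i + 1) & 0xFF;
        $j = ($j + $s[$i]) & 0xFF;
        [$s[$i], $s[$j]] = [$s[$j], $s[$i]];
        $out .= chr(ord($data[$n]) ^ $s[($s[$i] + $s[$j]) & 0xFF]);
    }
    return $out;
}

// Returns true if $candidate (e.g. the decompiled function) reproduces the
// published vector, round-trips, and agrees with the reference.
function check_rc4(callable $candidate): bool
{
    $cipher = $candidate('Key', 'Plaintext');

    return strtoupper(bin2hex($cipher)) === 'BBF316E8D940AF0AD3'
        && $candidate('Key', $cipher) === 'Plaintext'
        && $candidate('Wiki', 'pedia') === rc4_reference('Wiki', 'pedia');
}

// Sanity check of the harness itself; pass the decompiled function's name instead.
var_dump(check_rc4('rc4_reference'));
```

Because RC4 is symmetric, decrypting with the same key must return the original plaintext, which gives a second check that does not depend on the hard-coded vector.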

Impressive, but the quality of decompilation by AI chatbots varies significantly. In some cases, the chatbot restores the source code from opcodes flawlessly. In others, the results are less accurate, with issues such as skipped nested loops. Another challenge is the limit on request length: given a very long listing, a chatbot may fail to complete the task. In such cases, you may need to break the listing into smaller fragments or shorten the request, keeping only the fragments that matter for the analysis. Nevertheless, AI-generated output makes opcode listings much easier to read, which makes this approach worth your time.
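Splitting a long listing can also be automated. The sketch below cuts a listing into fragments that stay under a character budget, breaking only at line boundaries. The helper name `split_listing`, the budget value, and the `OP_n` lines are placeholders. In practice it is probably better to cut at function boundaries, so that a loop is never split across two requests.

```php
<?php
// Hypothetical helper: splits a long opcode listing into fragments that each
// stay under $maxChars, cutting only at line boundaries.
function split_listing(string $listing, int $maxChars): array
{
    $chunks = [];
    $current = '';
    foreach (explode("\n", $listing) as $line) {
        $line .= "\n";
        // Start a new fragment if adding this line would exceed the budget.
        if ($current !== '' && strlen($current) + strlen($line) > $maxChars) {
            $chunks[] = $current;
            $current = '';
        }
        // A single over-long line still becomes its own fragment.
        $current .= $line;
    }
    if (trim($current) !== '') {
        $chunks[] = $current;
    }
    return $chunks;
}

// Placeholder listing of ten one-line "instructions".
$listing = implode("\n", array_map(fn($i) => "OP_$i", range(1, 10)));
foreach (split_listing($listing, 24) as $n => $chunk) {
    echo "--- fragment ", $n + 1, " ---\n", $chunk;
}
```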