Advanced-Hexrays-Decompiler-reverse-engineering

Software obfuscation has increasingly become a complex issue over the years, so it is essential for us to equip ourselves with the necessary tools and skills to effectively counter these obfuscation techniques. Through my research over the past few years, I have found that challenges have grown; for example, stack strings are no longer simply plain-text string bytes moved to the stack. Instead, they are often encrypted. Unfortunately, in many obfuscators, the decryption routine is inlined, which diminishes our chances of cross-referencing those functions.

Adapting to this common obfuscation technique, which has proven effective in many cases, should be a priority. In this gentle introduction, I aim not to cover all possible methods to circumvent this issue but rather to share my experience utilizing HexRays Ctree for pattern matching. I hope this will help us identify these string decryption routines and handle them appropriately.

What is the CTREE API ?

If you have ever used Hexrays IDA, you may have wondered how the output from the Hexrays decompiler is generated ?. The process is quite complex but fascinating, especially if you have experience with compilers. Decompilation is a multi-step pipeline that starts with processing the generated assembly code in order to produce the well-annotated pseudocode that we see in the decompiler view of IDA.

The decompilation process begins by taking the generated assembly code and converting it into IDA's Intermediate Representation(IL), known as Microcode (which we'll explore in more detail in a future post). This Microcode is then subjected to further processing, during which various optimization passes are applied. Ultimately, an Abstract Syntax Tree (AST) is generated from this optimized Microcode, which is later rendered to produce the C-like pseudocode displayed in IDA's decompiler view.

The produced Abstract Syntax Tree (AST), also known as CTREE, is a data structure representation of code that consists of two distinct types of nodes: statement nodes and expression nodes.

A comprehensive list of all the different node types can be found in the Hex-Rays SDK, which clearly categorizes the nodes. Statement nodes are prefixed with cit_..., while expression nodes are prefixed with cot_....

// Ctree item code. At the beginning of this list there are expression
/// codes (cot_...), followed by statement codes (cit_...).
enum ctype_t
{
  cot_empty    = 0,
  cot_comma    = 1,   ///< x, y
  cot_asg      = 2,   ///< x = y
  cot_asgbor   = 3,   ///< x |= y
  cot_asgxor   = 4,   ///< x ^= y
  cot_asgband  = 5,   ///< x &= y
  cot_asgadd   = 6,   ///< x += y
  cot_asgsub   = 7,   ///< x -= y
  cot_asgmul   = 8,   ///< x *= y
  cot_asgsshr  = 9,   ///< x >>= y signed
  cot_asgushr  = 10,  ///< x >>= y unsigned
  cot_asgshl   = 11,  ///< x <<= y
  cot_asgsdiv  = 12,  ///< x /= y signed
  cot_asgudiv  = 13,  ///< x /= y unsigned
  cot_asgsmod  = 14,  ///< x %= y signed
  cot_asgumod  = 15,  ///< x %= y unsigned
  cot_tern     = 16,  ///< x ? y : z
  cot_lor      = 17,  ///< x || y
  cot_land     = 18,  ///< x && y
  cot_bor      = 19,  ///< x | y
  cot_xor      = 20,  ///< x ^ y
  cot_band     = 21,  ///< x & y
  cot_eq       = 22,  ///< x == y int or fpu (see EXFL_FPOP)
  cot_ne       = 23,  ///< x != y int or fpu (see EXFL_FPOP)
  cot_sge      = 24,  ///< x >= y signed or fpu (see EXFL_FPOP)
  cot_uge      = 25,  ///< x >= y unsigned
  cot_sle      = 26,  ///< x <= y signed or fpu (see EXFL_FPOP)
  cot_ule      = 27,  ///< x <= y unsigned
  cot_sgt      = 28,  ///< x >  y signed or fpu (see EXFL_FPOP)
  cot_ugt      = 29,  ///< x >  y unsigned
  cot_slt      = 30,  ///< x <  y signed or fpu (see EXFL_FPOP)
  .........
  cit_empty    = 70,  ///< instruction types start here
  cit_block    = 71,  ///< block-statement: { ... }
  cit_expr     = 72,  ///< expression-statement: expr;
  cit_if       = 73,  ///< if-statement
  cit_for      = 74,  ///< for-statement
  cit_while    = 75,  ///< while-statement
  cit_do       = 76,  ///< do-statement
  cit_switch   = 77,  ///< switch-statement
  cit_break    = 78,  ///< break-statement
  cit_continue = 79,  ///< continue-statement
  cit_return   = 80,  ///< return-statement
  cit_goto     = 81,  ///< goto-statement
  cit_asm      = 82,  ///< asm-statement
  cit_try      = 83,  ///< C++ try-statement
  cit_throw    = 84,  ///< C++ throw-statement
  cit_end
};

Statement nodes are of type cinsn_t, while expression nodes are of type cexpr_t. However, both inherit from the common base class citem_t.

We will not explain the meanings of the different fields or methods for each, but a description of which is which is provided in the SDK documentation, which you can check out as needed.

CTREE and pattern matching

A Trivial function

The main question is how to utilize the generated CTREE structure for handling obfuscation ? Turns out that the CTREE is incredibly useful for pattern matching. For example, consider the following trivial function and the corresponding CTREE represenation:

Examining the CTREE representation of the decompiled function, we can clearly observe nodes that branch into additional nodes, which in turn may branch further. At the top of the generated tree is a block statement (cblock_t) that has a couple of child nodes: one is an expression statement, and the other is an if statement.

A typical if statement has at most three child nodes: one for the condition, another for the true branch, and the last for the false branch.

A decompiled function is represented by a cfunc_t object.

A cfunc_t object represents the entire CTREE structure of the decompiled function. The key components of this object are the mba property (which will be discussed in future posts) and the body property. The body property provides access to the root node of the generated tree, which is always a block statement of type cblock_t.

With access to the root node, we can branch out to sub-nodes. In our simple case, this includes an expression statement for the call to the tail function j___CheckForDebuggerJustMyCode, and an if statement for the test.

Let's begin with the simple expression statement that represents the call to j___CheckForDebuggerJustMyCode. A typical call node is usually a binary node with two children: the first child is a cot_obj node representing the callee, and the second child is a cot_ref node representing the argument list, which, in this simple case, contains only one argument.

We can access the callee node of the call through the x property, while the argument list can be accessed through the a property, allowing us to examine the individual arguments and their types.

A cot_obj node representing the callee is any object in the IDB database that has a known address. This is helpful for checking calls to determine the true identity of the callee.

Now let's take a look at the if statement, a typical if can have up to 3 children: the first child is the condition, which represents the modulus operation, the second child is the true branch, and the last child is the false branch.

Both the false and true branches represent the block of code that executes if the condition evaluates to either true or false.

In this simple case, either the true or false blocks contain a straightforward statement expression", which is a function call. The children of the call can be addressed in a manner similar to our previous approach.

Reading some codegen

Now that we have established a foundational understanding of what a typical CTREE looks like and how to navigate its various nodes, let's revisit the topic of pattern matching and code deobfuscation.

Recently, I encountered a challenge where I needed to perform pattern matching to identify stack string decryption routines in IDA. My goal was to emulate these routines in order to recover and decrypt the stack strings. I discovered that traversing the CTREE nodes and looking for certain heuristic patterns can be very effective in identifying these string decryption routines, especially since they tend to have similar structural patterns.

Let's analyze one of those functions and identify useful heuristics that can be used to locate similar functions later.

A typical string decryption routine begins with an expression statement node of type cot_asg (assignment statement). While there can be multiple assignment statements though we cannot determine the exact number at least one must be present as the first node in the function.

Another notable feature of a typical string decryption routine is that the last node is a cit_return statement, where the return expression specifically calls runtime_slicebytetostring.

The final notable feature is that just before the return (the node before the last), there is a cit_for statement that iterates over the ecnrypted stack string.

By combining these three distinct features, we can develop a pattern that can later be applied to the IDB functions to identify structurally similar routines.

By matching this simple pattern against the IDB, I found at least 1,785 similarly structured functions.

Having successfully identified most, if not all, of the functions in question, we can now move on to the easier part: emulating these functions to recover the strings after they have been decrypted. To achieve this, we will use FLARE's flare_emu to emulate the functions and extract the decrypted strings.

For this, I utilized a very special feature of flare_emu: the ability to hook into function calls and examine their arguments or the state of the registers at the time of the call.

After executing the complete script, I successfully recovered over 1,600 encrypted stack strings, which can also be added as comments for contextual analysis in IDA.

I encountered some issues after function number 1641. I suspect the problems stemmed from reusing the same EmuHelper instance, which may have led to state corruption. Creating a new EmuHelper instance for each emulation attempt might resolve the issue. I was hesitant to wait that long, but please let me know if you try it and if it works for you.

Conclusion

This post may be quite lengthy, but we have only scratched the surface of what can be accomplished with IDA's CTREE or microcode representations. In our brief and straightforward example, we utilized the tree-like structure of the decompiled functions to locate the functions in question, and then further processed the data to achieve our ultimate goal of decrypting the obfuscated stack strings.

My approach may not be entirely accurate, and we might need to examine the binary more closely to determine if other string decryption functions deviate from this pattern. If that's the case, we will refine our approach or develop a new one. I just wanted to share my experience with IDA's CTREE and hope it inspires future discussions.

References

  • https://www.youtube.com/watch?v=TjtoqAf3TYw&list=PL0xCSYnG_iTuwOcStiVbZihFnzJtdksC0&index=6&t=761s

  • https://www.elastic.co/security-labs/introduction-to-hexrays-decompilation-internals

  • https://hex-rays.com/blog/hex-rays-decompiler-primer

Last updated