Source Code Obfuscation: What it is and Techniques to Use

April 10, 2019

Your source code is intellectual property that takes time and money to develop. If it gets into the wrong hands, your company can suffer a number of consequences, including loss of competitive edge, exposure of your innovations, and even serious security problems. For these reasons, protecting your code is a top priority in any organization where source code is created.

One of the ways that code is put at risk is ‘reverse engineering’: applying any of a number of methods to decompile or disassemble software into something close to the original source code. One way to make reverse engineering harder is a technique called ‘code obfuscation’. In this article, we will take a deep dive into what code obfuscation is and how it works.

What is Code Obfuscation?

The basic premise of code obfuscation is to modify code so that the underlying algorithm is opaque, even to someone who has full access to low-level debuggers. How and when the code is modified, and how effective this is, depends on the language used to write the software.

The typical uses of code obfuscation include:

  • Protecting intellectual property (IP) by preventing, to a greater or lesser extent, the exposure of source code or underlying algorithms;
  • Protecting software licensing code;
  • Whitebox cryptography;
  • Hiding secrets; and,
  • Digital rights management.

Computer viruses are also commonly subject to obfuscation to disguise their actions.

The requirement for code obfuscation is generally determined by:

  1. The sensitivity of the application
  2. How unique or valuable it is
  3. Security considerations – e.g. as part of making software less vulnerable to unauthorized modification.

For software protection, code obfuscation should be thought of as just one part of an overall system. Other components include code signing and encryption, as well as data leak prevention (DLP). Software protection is a much larger subject, and we have discussed other ways of securing code in previous articles.

Although code obfuscation is generally thought of as being applied to software applications, it is also important to consider these techniques for firmware code in hardware applications such as IoT, to help protect IP and hide keys, etc.

A Deeper Dive into Source Code Obfuscation

After reading the above, you might conclude that an application written in a compiled language, such as C, C++ or Go, might not require obfuscation, as the code is compiled into executable form (machine code) prior to distribution.

However, although machine code cannot be reversed to give the original source code, use of either a disassembler or run-time examination of the system with a low-level debugger can reveal exactly how the software works. This may be a problem if your software involves unique algorithms or has special security requirements. These tools essentially reveal your ‘secret sauce’.

With other languages, the situation is less clear, and the following is a generalization:

Java and C#

Languages that are compiled to an intermediate language (IL) rather than directly to machine code may be obfuscated to help maintain the intellectual property of the source, as IL is relatively easily reversed into something resembling the original source. (This is particularly easy if symbol tables are accidentally included in a distribution.) These languages include Java and C#.

Other Languages Including JavaScript, PHP, Ruby

Other languages are commonly distributed directly as source code and are interpreted at run-time; these include JavaScript, PHP and Ruby.

NOTE: There are exceptions to some of the above. For example, there are compilers for Ruby and PHP, but these have their own complications and are rarely used.

Semi-compiled and uncompiled languages have gained traction, in spite of the lack of IP protection, because they are portable across different operating systems; languages such as C must be compiled for the OS that the software will run on.

Obfuscation Techniques

In general, code obfuscation may be applied to source code, IL code or final machine code, depending on the language used; normally it would be applied as part of a build process, although for some applications the obfuscated code is incorporated directly into the source code from the start. The advantage of this is that such code is protected from the beginning.

At the extreme end of code obfuscation and IP protection are neural networks: once a network is trained, the relationship between the input variables and the output is encoded in its weights, and it is generally very difficult to recover that relationship in a human-readable form.

Techniques for Languages Distributed as Source

For languages that are distributed as source, the simplest method is minification: the source is run through an application that removes whitespace and comments. Although this makes the code less readable, the formatting is easily restored. Another method, commonly used by malicious software, is to use character encoding to disguise the code text. For example:

Original code:

alert("Hello, World!");

Character encoded:

var _0x72e4=["\x48\x65\x6C\x6C\x6F\x2C\x20\x57\x6F\x72\x6C\x64\x21"];alert(_0x72e4[0])

Again, however, this is easily reversed.
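To illustrate how little protection character encoding offers, here is a minimal sketch in C that turns "\xNN" escapes back into readable text. The names decode_hex_escapes and hex_value are hypothetical and chosen for this illustration only; a real deobfuscation tool would handle many more encodings.

```c
#include <ctype.h>
#include <stddef.h>

/* Convert one hex digit to its value; assumes isxdigit(c) is true. */
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    return tolower((unsigned char)c) - 'a' + 10;
}

/*
 * Decode "\xNN" escapes from src into dst. dst_size is the capacity of dst,
 * including the terminating NUL. Other characters are copied unchanged.
 * Returns the number of bytes written, excluding the NUL.
 */
size_t decode_hex_escapes(const char *src, char *dst, size_t dst_size) {
    size_t n = 0;
    if (dst_size == 0) return 0;
    while (*src != '\0' && n + 1 < dst_size) {
        if (src[0] == '\\' && src[1] == 'x' &&
            isxdigit((unsigned char)src[2]) &&
            isxdigit((unsigned char)src[3])) {
            dst[n++] = (char)(hex_value(src[2]) * 16 + hex_value(src[3]));
            src += 4;
        } else {
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Feeding the escaped text from the example above into this function yields the original string, which is why this kind of encoding only deters casual inspection.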

More advanced tools are available that rename functions and variables, add extra functions and loops, and generally attempt to make analysis difficult.

Other Techniques Applicable to All Languages

The following methods are applicable to all programming languages, but especially compiled ones.

More effective methods for code obfuscation are based on:

  • Hiding strings
  • Altering data structures
  • Code expansion
  • Introducing custom micro interpreters to replace language function calls
  • Replacing functions with lookup tables
  • Obfuscating function calls, especially OS system calls
  • Insertion of non-functional code segments
  • Encryption
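As a small illustration of one item in this list, the insertion of non-functional code segments, the following C sketch guards the real computation with a condition that is always true but not obviously so. The function names and the arithmetic inside them are hypothetical and exist only for this example.

```c
#include <stdint.h>

/* Real work: the value the program actually needs. */
static uint32_t real_step(uint32_t x) {
    return x * 3u + 7u;
}

/* Decoy: never executed, present only to mislead a reader. */
static uint32_t decoy_step(uint32_t x) {
    return (x << 5) ^ 0xA5A5A5A5u;
}

uint32_t guarded_step(uint32_t x) {
    /* x * (x + 1) is a product of two consecutive integers, so it is
       always even, and it stays even after unsigned wraparound.
       volatile is used here to discourage the compiler from folding
       the test away. */
    volatile uint32_t probe = x * (x + 1u);
    if ((probe & 1u) == 0u) {
        return real_step(x);
    }
    return decoy_step(x);
}
```

Someone reading the disassembly sees two plausible code paths and has to work out that only one of them can ever run.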

Examples from The List

1.    Code expansion

Here, code is expanded in size to make analysis and real time debugging more difficult; techniques here include replacing simple instructions (such as addition, for loops, logical operations) with more complex and unfamiliar, but functionally identical methods:

Obfuscated XOR function (C source)

The code below performs an exclusive OR on its operands (only part of the code is shown). In effect, it replaces this single line of unobfuscated code:

A = B^C

which expands to:

// obyte_xor(ob1, ob2, ob3)
// ob3 = ob1 XOR ob2
void obyte_xor (struct obyte *ob1, struct obyte *ob2, struct obyte *ob3) {
    int i;
    for(i = 0; i < 8; i++) {
        obit_xor( &ob1->s[i], &ob2->s[i], &ob3->s[i] );
    }
}

void obit_xor (unsigned char *b1, unsigned char *b2, unsigned char *b3) {
    int i;
    // Decoy call: the result stored in i is never used.
    i = fake_function_3(*b1, *b3);
    if (obit_get(*b1) == obit_get(*b2))
        *b3 = obit_set(0);
    else
        *b3 = obit_set(1);
}

// We encode obits as follows:
//  1 - even
//  0 - odd
// Return 1/0 bit encoded as obit
unsigned char obit_set (int b) {
    unsigned char p;
    do {
        p = ob_random_byte();
        fake_function_1 ((PINT) &p, (int)b);
    } while (ob_evenness(p) != b);
    fake_function_2((int *)&p, (int)ob_key);
    return p^ob_key;
}

// Return 1 if obit is 1, 0 if obit is 0
unsigned char obit_get (unsigned char b) {
    return ob_evenness (b^ob_key);
}

// Evenness check on the byte value (tests the least significant bit).
// Output: 1 if even, 0 if odd.
static unsigned char ob_evenness (unsigned char p) {
    fake_function_3 (ob_key, p);
    return !(p % 2);
}

__inline int fake_function_1 (int *a, int b)
{
    unsigned int d1 = 2011, d2 = 12051, d3 = 3, d4 = 1976;
    d1 = (*a - b);
    if (d1 < b) {
        d2 = (*a + b);
        d3 = (*a * b);
    } else {
        d2 = (b * *a);
        d3 = (*a + b);
    }
    d4 = (d2 + d3) * d1;
    return d4;
}
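The same idea applies to other simple instructions, such as addition. The sketch below is not part of the code above; it is a separate, minimal example that replaces a + b with a loop of XOR, AND and shift operations. The function names are hypothetical.

```c
#include <stdint.h>

/* Plain version: what the programmer means. */
uint32_t add_plain(uint32_t a, uint32_t b) {
    return a + b;
}

/* Expanded version: XOR adds the bits without carries, AND finds the
   positions that generate a carry, and the shift moves each carry one
   place to the left. The loop repeats until no carries remain, which
   takes at most 32 rounds for a 32-bit value. */
uint32_t add_expanded(uint32_t a, uint32_t b) {
    while (b != 0u) {
        uint32_t carry = (a & b) << 1;
        a ^= b;
        b = carry;
    }
    return a;
}
```

Both functions return the same result for every pair of inputs, including the cases that wrap around, but the second gives far less away about the programmer's intent.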

2.   Replacement of a function with a table lookup

This is a particularly good method for obfuscation, as the underlying transformation can often be difficult to work out by anyone examining the code. In essence, the method involves building a table of all possible values that the function produces given the possible inputs. If the transformations are complex, or the input data set large, the problem can often be addressed by breaking it down into several smaller tables.

Here is a simple example of a table lookup as part of an obfuscated bitwise addition:

// obit_add(b1, b2, b3, carry)
// b3 = b1 + b2, carry holds bit carry for the sum
void obit_add (unsigned char *b1, unsigned char *b2, unsigned char *b3, unsigned char *carry) {
    unsigned char e1, e2, c;
    int i;
    unsigned char add_table[8][5]={
//      e1 e2 c   s  c    (first c = carry in, s = sum, last c = carry out)
       {0, 0, 0,  0, 0},
       {0, 1, 0,  1, 0},
       {1, 0, 0,  1, 0},
       {1, 1, 0,  0, 1},
       {0, 0, 1,  1, 0},
       {0, 1, 1,  0, 1},
       {1, 0, 1,  0, 1},
       {1, 1, 1,  1, 1} };
    e1 = obit_get(*b1);
    e2 = obit_get(*b2);
    c = obit_get(*carry);
    for (i = 0; i < 8; i++) {
        unsigned char t1, t2, t3, t4, t5;
        t1 = add_table[i][0];
        t2 = add_table[i][1];
        t3 = add_table[i][2];
        t4 = add_table[i][3];
        t5 = add_table[i][4];
        if (e1 == t1 && e2 == t2 && c == t3) {
            *b3 = obit_set(t4);
            *carry = obit_set(t5);
            return;
        }
    }
    return;
}
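The point about breaking a large table into several smaller ones can be sketched as follows. Bit reversal of a byte is used here purely as a stand-in for the function being hidden; the names are hypothetical. Instead of one 256-entry table indexed by the whole byte, a single 16-entry table is looked up twice, once per 4-bit half.

```c
#include <stdint.h>

/* REV4[n] is the 4-bit value n with its bits in reverse order. */
static const uint8_t REV4[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

/* Table-based version: the reversed low half becomes the new high
   half, and the reversed high half becomes the new low half. */
uint8_t reverse_byte_table(uint8_t b) {
    return (uint8_t)((REV4[b & 0x0Fu] << 4) | REV4[b >> 4]);
}

/* Straightforward version, shown only for comparison. */
uint8_t reverse_byte_plain(uint8_t b) {
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (b & 1u));
        b >>= 1;
    }
    return r;
}
```

In the table-based version the algorithm itself no longer appears in the code; only its results do, which is what makes the transformation hard to recognize.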

3.    Hiding static data

A common method here is to divide static values into several parts, replacing each part with an expression that computes it.

For example, the byte sequence [0xc6, 0x2c, 0xe0] could be replaced by three XOR expressions:

For [0xc6] substitute 0x56 XOR 0x90
For [0x2c] substitute 0x28 XOR 0x04
For [0xe0] substitute 0xa0 XOR 0x40

This technique is particularly useful when data are combined in logical operations. For example, using the substitutions above, the expression:

B = A^0xc6

would be replaced by:

temp = A^0x56
...
...
B = 0x90^temp
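A minimal C sketch of this substitution follows. It reuses the XOR pairs given above; the array and function names are hypothetical. Note that an optimizing compiler may fold the two halves back into a single constant, so this shows the principle rather than a hardened implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Each secret byte is stored only as a pair whose XOR gives the value:
   0x56^0x90 = 0xc6, 0x28^0x04 = 0x2c, 0xa0^0x40 = 0xe0. */
static const uint8_t LEFT[3]  = {0x56, 0x28, 0xa0};
static const uint8_t RIGHT[3] = {0x90, 0x04, 0x40};

/* Rebuild the hidden byte sequence at run time. */
void rebuild_secret(uint8_t out[3]) {
    for (size_t i = 0; i < 3; i++) {
        out[i] = (uint8_t)(LEFT[i] ^ RIGHT[i]);
    }
}

/* Compute B = A ^ 0xc6 in two steps, so that 0xc6 never appears
   as a literal in the source. */
uint8_t xor_with_hidden_constant(uint8_t a) {
    uint8_t temp = (uint8_t)(a ^ LEFT[0]);
    /* Unrelated work would normally separate the two steps. */
    return (uint8_t)(temp ^ RIGHT[0]);
}
```

Searching the binary for the byte 0xc6 or for the full sequence finds nothing, because the values only exist after the XOR operations have run.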

Problems with Code Obfuscation

For uncompiled languages, code obfuscation may lead to noticeably slower execution, particularly if extra steps are introduced that must also be interpreted. However, the most significant problem with code obfuscation is debugging: if errors occur in obfuscated code, it can be difficult to determine exactly what the problem is, because the code that runs is not the code that was written. To minimize this problem, one approach is to obfuscate only the critical functions or classes. Occasionally, anti-virus software will flag obfuscated code as dangerous, because the same protection methods are commonly used in malicious software.

The method you choose to obfuscate your source code should be based on the language used and the code protection requirements. As with many things in software development, code obfuscation is one part of a larger arsenal of techniques used to protect your intellectual property and thus your competitive edge.


