Protecting Python Source Code

November 2, 2018

Securing Python source code is more urgent today than ever before, in large part because popular trends in distributed computing have created new vulnerabilities. Continuous integration and delivery of new versions of an application often lead to source code being exposed in unsecured repositories on platforms like GitHub. The recent Aeroflot story is a striking example, in which the company’s entire Python codebase was publicly exposed on a repo. What are the best measures enterprises and developers can take to protect their Python code?

We will explore answers to this question in depth. Let’s begin with a quick survey to see where most Python source code vulnerabilities arise today:

  • Repositories and versioning platforms 
  • Agile/DevOps developer methods in CI 
  • Unencrypted source code 
  • Scripted authentication credentials 

Git and Subversion

An important and common Python source code vulnerability today exists in the repository and versioning platforms that support the popular workflows of continuous integration and continuous delivery. The vulnerability is inherent in the way CI pipeline builds are generally scripted, and it appears in two forms:

  1. Automated builds containing authentication credentials. 
  2. Sharing source code in plain text form on repositories. 

An unsecured repository recently left the entire Aeroflot Python codebase open to the public.

We have broken out four particular vulnerabilities, but they are in fact interwoven in the workflows of web application development. Let’s take an overview of common software development workflows to understand why these Python source code vulnerabilities arise.

Continuous Integration Pipelines

Continuous integration and delivery of software means that when a developer commits a change to source code the new software version goes from a scripted build-and-test all the way through to release to customers automatically. The CI and CD pipeline, as it is called, is all the rage in enterprise software development now, because it is perceived as the optimally efficient method of rapid development and delivery. But for many apps, this is a risky and complex process to automate. Whether Tesla releases a new module that controls steering in an autonomous vehicle, or Passapp simply updates its taxi client, enterprises now compete to deliver new app versions continuously.  

CI and CD are now extremely popular Agile and DevOps strategies for getting new software updates out to customers as quickly as possible. However, the pressure for speed tempts developers to take security shortcuts. These are the source code security issues that we are focused on today. Let’s briefly review how CI pipelines work.

Vulnerability in CI Pipelines

When a developer commits a Python source code change, an app like Jenkins detects the change and automatically executes a suite of scripts to operate the entire CI pipeline, all the way through to production. Among these scripts is a test suite which verifies that the change did not introduce errors to the software. Test scripts must simulate real users and actually sign into virtual accounts. To accomplish this feat, the authentication credentials for those accounts must be scripted. These scripts often sit unsecured on code repositories, along with the Python source code and other resources needed to run the CI pipeline. Here is an example of Java code which creates a virtual user and logs it in automatically through SSH:

[Figure: An example of Java code which creates a virtual user]

API tokens and secret keys are also scripted for automation. Why is this important to securing Python source code? Attacks usually come through indirect channels or points of entry. These scripted plain text credentials eventually give attackers access to a variety of infrastructure where Python source code is stored, including Git repos and versioning platforms like Apache Subversion.
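One common way to keep such credentials out of the repository is to have the test script read them from the environment at run time, so that the CI server injects them and nothing sensitive is committed. The sketch below is a minimal illustration under that assumption; the variable names CI_TEST_USER and CI_TEST_PASSWORD and the helper functions are hypothetical and do not come from any particular CI product.

```python
import os


class MissingCredentialError(RuntimeError):
    """Raised when a required secret is not present in the environment."""


def load_credential(name, environ=None):
    """Return the secret stored under `name`, or fail loudly if it is absent."""
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        raise MissingCredentialError(f"environment variable {name} is not set")
    return value


def build_login_payload(environ=None):
    """Assemble login data for a virtual test user without hard-coding it.

    CI_TEST_USER and CI_TEST_PASSWORD are hypothetical names; the CI server
    is assumed to inject them from its own secret store.
    """
    return {
        "username": load_credential("CI_TEST_USER", environ),
        "password": load_credential("CI_TEST_PASSWORD", environ),
    }
```

Failing loudly when a variable is missing is a deliberate choice here: a test run that silently falls back to a default password tends to reintroduce the hard-coded credential it was meant to remove.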

Ultimately, the need to share Python source code among multiple developers, along with other build resources, creates a substantial security issue. Developer shortcuts taken for convenience lead to unanticipated risks. Solving this issue requires a combination of new security compliance training and new security technology. We will explore the best methods to secure Python source code and resolve other related vulnerabilities here.

Solutions to Protect Python Code

The methods to secure Python source code exist now in many developer utilities and platforms, both paid and open source. The remaining problems include leadership and developer education on the subject, followed by real initiative to change developer culture. Let’s take a survey of the basic measures needed to secure source code and protect all software resources:

  • Source code encryption 
  • Source code conversion 
  • Automatic source scanning 
  • Developer security compliance training 

Python Code is Particularly Vulnerable

Python source code requires more effort to secure than C++, for example, because Python is an interpreted language. This means that the source code which developers write is usually stored in human-readable form (or as bytecode that is easy to decompile) right up to the point of execution. C++, on the other hand, is a compiled language, which means that it is translated into machine language, a form that is not readable to humans, and stored in that form up to the point of execution. Both ways of storing code have advantages. But an important disadvantage of interpreted languages like Python is that the source code is more easily stolen.

The best solution to this vulnerability is to encrypt Python source code. Encrypting Python source code is a method of “Python obfuscation,” which has the purpose of storing the original source code in a form that is unreadable to humans. No measure is absolute: there are programs available to reverse engineer or decompile even C++ binaries back to a human-readable form. But every measure taken to secure source code decreases the probability of theft.

Changing Developer Habits 

Reliable solutions to protect Python code depend on developer compliance because, as we have seen, a major Python source code vulnerability is created inadvertently by developers in a normal CI workflow. This implies that a first line of defense is security compliance training for software engineers and developers. Of course, developers understand better than anyone how security methods work. The purpose of training is to develop best practices and new habits of compliance with security standards. We need to reinforce awareness of the important risks of typical developer shortcuts. What are some of the methods we can build into developer workflows to secure Python source code?

Source Code Conversion

One very powerful method is to compile Python source code to native machine code by way of C or C++. As we discussed earlier, machine code is not readable to humans. Code that is not easily readable forces a laborious reverse-engineering step on the part of rebuilders.

A rebuilder is a type of hacker whose method is to reverse engineer source code in order to discover the innovative methods originally used by the owner. In other words, the goal is to steal the secret formula in the code. If plain text source code is compiled, then it is not apparent what language the original code was written in, so it is more difficult to reverse engineer and more difficult to steal. Furthermore, there is a bonus in taking this security measure: compiling Python to machine code can also optimize it and accelerate its execution. Let’s look in depth at how this works.

Cython Conceals the Source 

Developers can now use a compiler called Cython, which translates Python source code into optimized C or C++ code that a standard C/C++ compiler then builds into a native extension module. The resulting binary is less vulnerable to theft. In many cases it also runs faster than the interpreted Python original, particularly once static type declarations are added. Otherwise the code remains compatible with other Python modules.

The resulting app still uses the same Python DLLs, depending on how the build is configured. In this method, the proprietary Python code for an app compiles into .pyd modules. A separate tool, Nuitka, can then generate an EXE file and embed the libraries required for execution. Cython and Nuitka both translate Python source code to C or C++ and compile it to native code. Let’s look in more detail at this approach, because it is essential to securing Python source code.

Strong Data Types 

First, Cython lets the dynamically typed parameters and variables of Python be declared with the static types of C and C++. Cython is essentially a Python language compiler with data types borrowed from the C and C++ languages. Nearly all Python code can be compiled with Cython, and the resulting module can make calls to both Python and C/C++ APIs.

The key innovation in Cython is that dynamically typed Python variables can be declared as static C and C++ data types. Used carefully, with Cython’s default bounds checking left enabled, typed declarations can also help guard against memory buffer errors, which is yet another security advantage. The Cython compiler accepts nearly any valid Python source code as input. Let’s have a look at the first steps toward implementation of Cython.

In addition to an ordinary Python source code file, we will add a setup.py build script to prepare the build for Cython. We have a Python source code file with a single line:

print("Hello brave new world!")

Save this file as HelloBraveNewWorld.pyx, alongside a setup.py file that names our source file, like this:

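# setuptools is used here because distutils was removed in Python 3.12
from setuptools import setup
from Cython.Build import cythonize

setup(
    # cythonize() translates the .pyx source into a C extension module
    ext_modules=cythonize("HelloBraveNewWorld.pyx")
)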

Next, run this command in a shell from the project folder to start the Cython build:

$ python setup.py build_ext --inplace 

This will create a compiled module in the project folder called HelloBraveNewWorld.pyd on Windows (HelloBraveNewWorld.so on Unix); recent builds usually insert a platform tag into the filename. The next step is to actually load the compiled code. Since our source does not require any extra C or C++ libraries, we can alternatively use pyximport, which compiles .pyx files automatically when they are imported:

import pyximport; pyximport.install() 
import HelloBraveNewWorld 

And now the output of our single print statement actually comes from the compiled version of the code! It’s that simple to implement Cython to compile basic Python source code. And there is a bonus yet to come. 

An important additional advantage of compiling Python source code for security purposes is that the compiled code may run up to 50% faster. This is especially useful for algorithms which are computationally intensive, such as Python code which calculates Pythagorean triples.

[Figure: Python code example]
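The article’s original listing is not reproduced here, so the following is only a minimal sketch of the kind of loop-heavy code that tends to benefit from compilation. The function name pythagorean_triples and its limit parameter are illustrative assumptions, not the author’s code.

```python
import math


def pythagorean_triples(limit):
    """Return every (a, b, c) with a <= b < c <= limit and a*a + b*b == c*c."""
    triples = []
    for a in range(1, limit + 1):
        for b in range(a, limit + 1):
            c_squared = a * a + b * b
            c = math.isqrt(c_squared)  # integer square root, no float rounding
            if c > limit:
                break  # c only grows as b grows, so stop this inner loop
            if c * c == c_squared:
                triples.append((a, b, c))
    return triples


if __name__ == "__main__":
    print(pythagorean_triples(20))
```

Because the work is dominated by nested integer loops, this is the sort of function where adding C integer types in a Cython build is typically most effective; the actual speedup depends on the build and the machine.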

Automatic Scanning in Versioning Platforms  

Perhaps the single most important and commonplace Python source code vulnerability arises from developers automating build scripts in a continuous integration and delivery pipeline. When developers script test suites to verify a new app version, they usually hard code authentication credentials – usernames and passwords – directly into build scripts. These scripts turn up in unexpected places as security risks which can compromise an entire application! 

When a test suite runs, very often hundreds of virtual users are generated through tools like Jenkins and Gradle to run performance, functional, and load tests of the app. Each of these virtual test users must actually sign into a valid user account to run the tests automatically. When login credentials for virtual testers are scripted, they can show up in server logs as plain text. In other words, if a server generates a log file containing all scripts executed, all server admin staff now have access to the login credentials which were hard coded into the build!
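One way to limit this kind of leak, sketched below with Python’s standard logging module, is to attach a filter that masks sensitive values before a record is written. The key names in the pattern and the class name RedactSecretsFilter are hypothetical choices for illustration; a real deployment would need a pattern tuned to its own log formats.

```python
import logging
import re

# Hypothetical pattern: key=value or key: value pairs whose key looks sensitive.
_SENSITIVE = re.compile(
    r"(?i)\b(password|passwd|token|secret|api_key)\b(\s*[=:]\s*)(\S+)"
)


class RedactSecretsFilter(logging.Filter):
    """Mask the value of sensitive key/value pairs before a record is emitted."""

    def filter(self, record):
        # Format the message first so values passed as arguments are covered.
        message = record.getMessage()
        record.msg = _SENSITIVE.sub(r"\1\2[REDACTED]", message)
        record.args = ()  # the message is already fully formatted
        return True  # keep the record, now with secrets masked


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.addFilter(RedactSecretsFilter())
    logger = logging.getLogger("ci-tests")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("login user=%s password=%s", "tester01", "hunter2")
```

Redaction in the logger is a safety net rather than a fix: the credential should still be kept out of the script in the first place.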

The best way to prevent this problem is for team leaders to use versioning platforms which automatically scan source code and scripts for credentials and other risk-related keywords that would give unauthorized users access to source code and data on a server.

This measure is accomplished by adding a secure repository platform to the pipeline. Such platforms are often enhanced versions of open source platforms like Git and Subversion. When a secure repo replaces plain Git, a code push by a developer triggers automatic source code scanning to detect any security risks which may have been added.

These risks may include the addition of a secret key or API token since the previous build. If a risk is detected, admin is alerted automatically and the build is halted prior to execution of testware.   
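The scanning step described above can be sketched as a small rule-based check. This is a minimal illustration assuming a handful of regular-expression rules; the rule names, patterns, and functions are hypothetical, and real scanning platforms use much larger and more carefully tuned rule sets.

```python
import re

# Illustrative rules only; these patterns are assumptions, not a product's rules.
SECRET_PATTERNS = {
    "hard-coded password": re.compile(
        r"(?i)password\s*=\s*['\"][^'\"]+['\"]"
    ),
    "hard-coded API token": re.compile(
        r"(?i)(api[_-]?key|token|secret)\s*=\s*['\"][^'\"]+['\"]"
    ),
    "private key block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def scan_text(text):
    """Return (line_number, rule_name) for every line that matches a rule."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, rule_name))
    return findings


def should_halt_build(text):
    """Halt the build as soon as any potential secret is found."""
    return bool(scan_text(text))
```

A check like this could run as a pre-commit hook or as the first pipeline stage, so that a flagged push stops before any test scripts execute.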

Security Compliance Training for Developers

If there are too many cooks in the kitchen, there is increased security risk. In a common scenario, three new branches of an app may appear on the versioning control panel at once. These must be rolled out and rolled back for testing purposes. The fact is that manual security scanning by humans alone is no longer viable. Agile team leaders need sophisticated versioning apps with built-in security scanning. Automatic source code scanning ensures that no login credentials or secret keys are exposed. 

When five developers are working on a module, there is the temptation to share resources. Security leaks are more common when developer teams share access to resources. The best measure to reduce this risk is to train or retrain developers about the critical nature of security in staging apps for deployment and in scripting pipeline builds. Progressive leaders will remind developers, especially when under time pressure, to be continually mindful of potential security breaches. This is especially true of Python source code, which often sits unencrypted on staging servers. The best way to protect Python source code is to implement a twofold program: first, introduce training; second, add automatic code scanners.

Best Practices for Protecting Python Source Code 

CI and CD pipelines are poised to continually increase pressure on developers to optimize their workflows. A permanent reality of enterprise software development is the disconnect which occurs when the baton changes hands in a developer team relay race. In developer terms this could mean two coders sharing a set of authentication credentials which wind up in a server script log. The result for an enterprise can be the exposure of millions of sensitive customer records, such as credit card numbers.

We have seen that the best practices for securing Python source code include a combination of new training and sophisticated new security-focused developer tools. Security-based repositories and versioning platforms are the logical evolutionary step in CI and CD pipelines.  

Changing developer habits must go hand in hand with security-based repository and versioning platforms in order to create and enforce a new, security-centric developer culture. The recent dramatic news of security breaches and massive losses of sensitive client data at familiar companies like Uber, Equifax, and Aeroflot demonstrates that we must pay a debt of diligence to the necessity of securing our most valuable resource: our Python source code.


