Encryption vs Compression, Part 2

I’ve recently been examining the feasibility of differentiating compressed data from encrypted data based on variations in the entropy of the data. Initial results showed some promise, but were tested against too small a sample set to draw any hard conclusions. Since then, I’ve been experimenting with larger data sets (more files and more varied types of encryption / compression) with quite satisfactory results.

The TL;DR is that 98% of the compressed files tested were correctly identified as compressed, and 100% of the encrypted files were identified as not compressed (i.e., encrypted).

In general, the entropy of compressed data varies significantly from that of encrypted data, and compressed data can be reliably identified with very few false positives. While identification of certain compression algorithms (namely LZMA) still presents some practical concerns depending on your situation, even those formats were reliably distinguishable from encrypted data during testing due to non-random data in the file’s header structure (see ‘Analysis’ below).

What is particularly exciting, at least for me, is that compression formats which would have been otherwise unknown (e.g., files that weren’t signatured by file or binwalk) were easily identified as compressed through entropy analysis.

Differentiate Encryption From Compression Using Math

When working with binary blobs such as firmware images, you’ll eventually encounter unknown data. Particularly with regard to firmware, unknown data is usually either compressed or encrypted. These two types of data are typically analyzed in very different ways, so it is useful to be able to distinguish one from the other.

The entropy of data can tell us a lot about the data’s contents. The entropy graph of encrypted data is typically a flat line with no variation, while that of compressed data will often have at least some variation:

Entropy graph of an AES encrypted file

Entropy graph of a gzip compressed file
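
Entropy graphs like the ones described here can be approximated by computing the Shannon entropy of each fixed-size block in a file. The following is a minimal sketch, not the tooling used to produce the graphs; the function names and the 1024-byte block size are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(block: bytes) -> float:
    """Return the Shannon entropy of a block in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    total = len(block)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(block).values())


def entropy_profile(data: bytes, block_size: int = 1024) -> list[float]:
    """Return one entropy value per block; plotting these gives an entropy graph.

    block_size is an arbitrary choice for this sketch.
    """
    return [shannon_entropy(data[i:i + block_size])
            for i in range(0, len(data), block_size)]
```

Plotting the values returned by `entropy_profile` against the block offset gives a graph whose flatness (or lack of it) can then be inspected visually.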

But not all compression algorithms are the same, and some compressed data can be very difficult to visually distinguish from encrypted data:

Entropy graph of an LZMA compressed file

However, there are a few tests that can be performed to quantify the randomness of data. The two that I have found most useful are the chi-square test and the Monte Carlo pi approximation. Both measure how closely the data resembles a truly random byte stream, and both are more sensitive to deviations from randomness than a visual entropy analysis.
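
Both tests can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not necessarily how the results in this post were produced: the function names are hypothetical, the chi-square statistic is computed over single-byte frequencies against a uniform distribution, and the Monte Carlo test treats each group of six bytes as a pair of 24-bit coordinates.

```python
from collections import Counter


def chi_square(data: bytes) -> float:
    """Pearson chi-square statistic of the byte counts against a uniform
    distribution over all 256 byte values (255 degrees of freedom)."""
    expected = len(data) / 256
    counts = Counter(data)
    return sum((counts.get(b, 0) - expected) ** 2 / expected
               for b in range(256))


def monte_carlo_pi(data: bytes) -> float:
    """Estimate pi from the fraction of (x, y) points that fall inside a
    quarter circle. Each point is built from 6 consecutive bytes (an
    assumption of this sketch): 3 bytes for x and 3 bytes for y."""
    radius = 2 ** 24 - 1  # largest value a 3-byte coordinate can take
    inside = total = 0
    for i in range(0, len(data) - 5, 6):
        x = int.from_bytes(data[i:i + 3], "big")
        y = int.from_bytes(data[i + 3:i + 6], "big")
        total += 1
        if x * x + y * y <= radius * radius:
            inside += 1
    return 4 * inside / total if total else float("nan")
```

For uniformly random bytes, the chi-square statistic is expected to land near 255 (its degrees of freedom) and the Monte Carlo estimate should approach 3.14159 as the input grows. Larger deviations from either value suggest less random data.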
