Encryption vs Compression, Part 2

I’ve recently been examining the feasibility of differentiating compressed data from encrypted data based on variations in the entropy of the data. Initial results showed some promise, but were tested against too small a sample set to draw any hard conclusions. Since then, I’ve been experimenting with larger data sets (more files and more varied types of encryption / compression) with quite satisfactory results.

The TL;DR is that 98% of the compressed files tested were correctly identified as compressed, and 100% of the encrypted files were identified as not compressed (i.e., encrypted).

In general, the entropy of compressed data varies significantly from that of encrypted data, and compressed data can be reliably identified with very few false positives. While identification of certain compression algorithms (namely LZMA) still presents some practical concerns depending on your situation, even those compressions were reliably distinguishable from encrypted data during testing due to non-random data in the file’s header structure (see ‘Analysis’ below).

What is particularly exciting, at least for me, is that compression formats which would have been otherwise unknown (e.g., files that weren’t identified by file or binwalk signatures) were easily identified as compressed through entropy analysis.


What Has Changed?

Since my initial analysis, some changes have been made in the way that randomness is measured. First, the following data reflects only the results of the chi square distribution test; while pi approximation may be useful in the future, it was ignored during these tests.

Secondly, the size of the data blocks has been changed. Previously, data was analyzed for randomness in 6MB blocks, that is, the randomness of each 6MB chunk of data was examined separately. Dramatically reducing the size of the individual data blocks to 32 bytes prevented small sections of less random data from being lost in the noise (32 bytes may not be the optimal figure; further testing is required here).
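As a concrete sketch, this per-block measurement can be reproduced in a few lines of Python. This is my own illustrative code, not binwalk’s actual implementation; the helper names and the 512 cutoff (introduced in the ‘Results’ section below) are just defaults here:

```python
def chi_square(block):
    """Chi-square statistic of a block's byte histogram against a uniform ideal."""
    expected = len(block) / 256.0
    counts = [0] * 256
    for b in block:
        counts[b] += 1
    return sum((count - expected) ** 2 / expected for count in counts)

def suspect_blocks(data, block_size=32, cutoff=512):
    """Offsets of blocks whose chi-square value exceeds the cutoff."""
    return [offset for offset in range(0, len(data), block_size)
            if chi_square(data[offset:offset + block_size]) > cutoff]
```

Note that at a 32-byte block size the statistic is coarse: even a block of 32 entirely distinct byte values scores 224, so any useful cutoff has to sit well above that floor.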


Sample Files

As previously stated, a much larger sample size was used. Twenty files were used as a starting base. These files were all firmware images (because I happen to have a lot of those sitting around) of various sizes and origins. Most, although not all, contain a mixture of high entropy and low entropy data, including file systems, strings, executable code and binary data.

These twenty files were each compressed using the following compressions (note: some of the following are actually compressed archive formats and/or use the same or similar compression algorithms; they were included in an effort to examine as many common compression formats as possible):

  • gzip
  • deflate
  • bzip2
  • lzma
  • xz
  • 7zip
  • compress
  • lzop
  • lz4
  • arj
  • rar
  • zip
  • jar
  • rzip
  • tornado

The same twenty files were then encrypted using the following algorithms:

  • AES
  • 3DES
  • Blowfish
  • Twofish

In total, this resulted in 300 unique compressed files and 80 unique encrypted files.


Results

As previously described, these files were then subjected to the chi square distribution test, the idea being that any 32-byte block which showed a significant deviation from that of an ideal random 32-byte block would indicate that the data had been compressed and not encrypted. The higher the number of blocks which deviated from the ideal, the more likely it was that the data was in fact not encrypted and therefore compressed.

In the following tests, a block whose chi square value exceeded 512 was considered a suspect data block. A file with 1-2 suspect blocks was given a confidence level of ‘low’; 3-4 suspect blocks a confidence level of ‘medium’; 5 suspect blocks or more a confidence level of ‘high’. Any file with zero suspect blocks was assumed to be encrypted.
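In code, this verdict scheme amounts to a simple mapping from the suspect-block count. This is my own paraphrase of the scheme described above, not binwalk’s code; the ‘high’ confidence on the encrypted verdict follows the labeling used in the real-world examples later in the post:

```python
def classify(suspect_count):
    """Map a file's suspect-block count to a (verdict, confidence) pair."""
    if suspect_count == 0:
        return ("encrypted", "high")
    if suspect_count <= 2:
        return ("compressed", "low")
    if suspect_count <= 4:
        return ("compressed", "medium")
    return ("compressed", "high")
```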

The performance of these tests is reflected in the following confusion matrix (view the raw results here):

                     Predicted Encrypted    Predicted Compressed
Actual Encrypted     80                     0
Actual Compressed    6                      294

From this, we can deduce the following metrics:

Metric                 Value
Accuracy               0.98
True Positive Rate     0.98
False Positive Rate    0.00
True Negative Rate     1.00
False Negative Rate    0.02
Precision              1.00
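For reference, these metrics follow directly from the confusion matrix, treating ‘compressed’ as the positive class:

```python
# Confusion matrix from the tests above, with "compressed" as the positive class.
tp, fn = 294, 6   # compressed files classified correctly / incorrectly
tn, fp = 80, 0    # encrypted files classified correctly / incorrectly

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # 374/380 ≈ 0.98
tpr       = tp / (tp + fn)                   # 294/300 = 0.98
fpr       = fp / (fp + tn)                   # 0/80    = 0.00
tnr       = tn / (tn + fp)                   # 80/80   = 1.00
fnr       = fn / (fn + tp)                   # 6/300   = 0.02
precision = tp / (tp + fp)                   # 294/294 = 1.00
```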

Analysis

The results show that all encrypted files were identified as encrypted (a 100% true negative rate), and only 6 of the 300 compressed files were mis-identified as encrypted (a 98% true positive rate).

All the compressed files incorrectly identified as encrypted came from the rzip and tornado compressions. Bear in mind though that the magic cutoff value of 512 and the block size of 32 bytes were somewhat arbitrarily chosen; it is possible that they can be adjusted to better compensate for these compressions (and even if they can’t, the majority of files compressed with these algorithms were still correctly identified).

The bigger concern is with LZMA compression. Although technically all of these were identified as compressed, notice that all of them had only one suspect data block which was right at the beginning of the file. In other words, the deviation was not in the compressed data itself but rather in the header used for that particular file format.

It is fair to say that most compressed files will include a header of some sort while encrypted files typically will not, and thus it is valid to include the non-random header data as part of the analysis. But consider an otherwise encrypted firmware update file which contains a header with the typical firmware header information (build date, CRC, file size, etc). From an entropy analysis standpoint, it would look the same as an LZMA compressed file.
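To illustrate the concern, here is a small sketch in which the header fields and values are entirely invented: a 32-byte firmware-style header full of zero padding trips the chi-square test exactly once, at offset 0, just as an LZMA header does, even though every following block stays under the cutoff.

```python
import struct

def chi_square(block):
    """Chi-square statistic of a block's byte histogram against a uniform ideal."""
    expected = len(block) / 256.0
    counts = [0] * 256
    for b in block:
        counts[b] += 1
    return sum((count - expected) ** 2 / expected for count in counts)

# A hypothetical firmware header: magic, build date, CRC, size, zero padding.
# The field layout and values are made up for this example.
header = struct.pack(">4sIII", b"FW01", 0x20130615, 0xDEADBEEF, 1024).ljust(32, b"\x00")

# Stand-in for the encrypted payload: every 32-byte block holds 32 distinct
# byte values, so each block scores exactly 224, below the 512 cutoff.
body = bytes(range(256)) * 8

data = header + body
suspect = [offset for offset in range(0, len(data), 32)
           if chi_square(data[offset:offset + 32]) > 512]
print(suspect)   # only the header block, at offset 0, is flagged
```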

What if you can’t determine the exact start of the data in question (perhaps it’s slapped in with a bunch of other data inside a larger file)? You may include unrelated data (or exclude related data) before/after the data in question, also skewing the results. Although these issues can often be resolved by considering the context in which the data is being analyzed, variation in the entropy of the compressed data itself precludes such problems and is therefore a more desirable indicator.

Interestingly, LZMA2 (implemented by the xz utility) was much easier to detect than LZMA.


Real World Application

Let’s take a look at some real-world examples. How about some router configuration files? Some vendors like to encrypt or obfuscate backup config files, which is just plain annoying. But which is it? Encrypted or obfuscated?

Let’s first take a Netgear configuration file:

Number of Blocks of Deviation   Largest Deviation   File Offset of Largest Deviation   Analysis    Confidence
0                               368                 7168                               Encrypted   High

Clearly this result is indicative of an encrypted file, and since we know that Netgear router configurations are in fact encrypted, we can leave it at that.

What about the Verizon FiOS ActionTec router configuration files? It has been claimed that these too are encrypted, but I have never seen any proof for or against these claims. The file’s entropy analysis, however, is quite interesting:

Number of Blocks of Deviation   Largest Deviation   Offset of First Deviation   Offset of Last Deviation   Analysis     Confidence
260                             720                 162944                      354688                     Compressed   High

Looking more closely at where the entropy deviations occur shows that while the first part of the file is very random, over half of the file (> 200KB) is insufficiently random to be considered encrypted. At best this file is only partially encrypted; it may not really be encrypted at all.


TODO

More testing, obviously. Particularly with compressions such as rzip and tornado, where we got acceptable results, but not as accurate as with most other compressions.

LZMA of course warrants more attention, as well as optimizing the block size and magic deviation cut off.

More varied data sources (plain text ASCII, for example) must also be considered.

Additionally, options and flags passed to the compression utilities may play a significant role in the resulting entropy. For example, both the lzma and xz utilities were given the ‘--extreme’ option, which is supposed to improve the compression ratio, but how does this compare to the ‘-9’ option in terms of entropy?
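As a quick way to start poking at that question, Python’s stdlib lzma module exposes the same presets, so the output of a normal versus an extreme preset can be compared directly. The input data and the Shannon-entropy helper here are illustrative only, not part of the tests above:

```python
import collections
import lzma
import math

def shannon_entropy(data):
    """Shannon entropy of a byte string, in bits per byte."""
    counts = collections.Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Illustrative input only: real firmware images will behave differently.
data = bytes(range(256)) * 4096   # 1MB of highly repetitive data

normal  = lzma.compress(data, preset=9)
extreme = lzma.compress(data, preset=9 | lzma.PRESET_EXTREME)

print(len(normal),  round(shannon_entropy(normal), 4))
print(len(extreme), round(shannon_entropy(extreme), 4))
```

For real samples, any difference between the presets would show up both in the compressed size and in how flat the resulting entropy is.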


The Code

The binwalk repo has been updated with the new code changes; although binwalk itself hasn’t been updated to exercise these changes, the maths.py file can be run stand-alone from the command line if you want to play around with this yourself.

Again, this work is still preliminary, so suggestions and criticisms are welcome.

Update: I’ve gotten some questions asking if this method will work equally well against AES ECB (AES CBC was used during testing). Some initial testing indicates that ECB vs CBC makes no appreciable difference in the test results.


14 Responses to Encryption vs Compression, Part 2

  1. Pingback: In the News: 2013-06-15 | Klaus' Korner

  2. Matthew Green says:

    I know it’s a lot of work — but would be very cool to see visualizations of the files in question, perhaps with a color (or ASCII character) assigned to show the entropy measure of each 32-byte block. This would help a lot in seeing how and where the compressed files differ from random (or encrypted) data.

    • Craig says:

      I will if I get the time. In general though, an entropy graph of an encrypted file is very flat and straight, while most compressions (save those with a very good compression ratio, such as LZMA) will have a “drip effect” caused by smaller sections of lower entropy.

  3. Pingback: Ripper firmware-ów – binwalk

  4. hpux735 says:

    Any idea how the analysis would change with files that are encrypted after compression? I would expect that encrypted files wouldn’t compress very well, but I can’t imagine what would happen to the entropy after a compressed file was encrypted. I’m very curious.

    • Craig says:

      The input to a (good) encryption algorithm should have no appreciable effect on the resulting encrypted data’s entropy. In fact, several of the files used in the results above were largely comprised of compressed data. So, a compressed file that is subsequently encrypted should exhibit the same entropy signature as those examined here.

  5. mayuri says:

    Hi,
    I’m currently doing research on cryptography & network security. Which method is better, and how do they differ: compressing the data and then encrypting it, or encrypting the data and then compressing it?

  6. dsh3ph3rd says:

    zip file with crypt-xor:
    maths.py zipped-data.xor
    0x93CE80 -> 576
    Number of deviations: 1
    Largest deviation: 576 at offset 0x93CE80
    Data: Compressed
    Confidence: Low

    • dsh3ph3rd says:

      file_entropy.py zipped-data.xor
      File size in bytes:
      9687146

      Shannon entropy (min bits per byte-character):
      7.9998184012

      Min possible file size assuming max theoretical compression efficiency:
      77495408.8259 in bits
      9686926.10324 in bytes

      • Craig says:

        Without context, I assume you posted this to show that this method doesn’t detect XOR “encryption” (?).

        First, I wouldn’t consider XOR an “encryption”; it’s obfuscation at best, and there are already tools to help brute force XOR obfuscation. Further, XOR encoding a file will do little, if anything, to change the file’s entropy so I would expect an XOR’d compressed file to be identified as compressed/not encrypted.

  7. Pingback: Usage | Binwalk

  8. Seemant says:

    Hey Craig,

    Can you tell me how to decompress Zip encrypted archive data?
    I tried Binwalk but got no results 🙁. I also tried to extract it with dd, but it still gives an error.

    What should I do?

    The file has a .bin extension.

  9. yehuda says:

    Hey Craig,

    Can you tell me if it’s possible to use Binwalk on Windows?
    And if it is, how can I do it?

    Thank you!

    • Craig says:

      If you grab the latest binwalk code from github, it is pure Python. So in theory it will run on Windows, but I haven’t tested it out myself. You’d need to install a Python interpreter on your Windows OS, and from there just run binwalk’s setup.py script as usual.
