I’ve recently been examining the feasibility of differentiating compressed data from encrypted data based on variations in the entropy of the data. Initial results showed some promise, but were tested against too small a sample set to draw any hard conclusions. Since then, I’ve been experimenting with larger data sets (more files and more varied types of encryption / compression) with quite satisfactory results.
The TL;DR is that 98% of the compressed files tested were correctly identified as compressed, and 100% of the encrypted files were identified as not compressed (i.e., encrypted).
In general, the entropy of compressed data shows significant variance from that of encrypted data, and compressed data can be reliably identified with very few false positives. While identification of certain compression algorithms (namely LZMA) still presents some practical concerns depending on your situation, even those compressions were reliably distinguishable from encrypted data during testing due to non-random data in the file’s header structure (see ‘Analysis’ below).
What is particularly exciting, at least for me, is that compression formats which would have been otherwise unknown (e.g., files that had no matching signatures in file or binwalk) were easily identified as compressed through entropy analysis.
What Has Changed?
Some changes have been made since my initial analysis in the way that randomness is measured. First, the following data reflects only the results of the chi square distribution test; while pi approximation may be useful in the future, it was ignored during these tests.
Second, the size of the data blocks has been changed. Previously, data was analyzed for randomness in 6MB blocks; that is, the randomness of each 6MB chunk of data was examined separately. Dramatically reducing the size of the individual data blocks to 32 bytes prevented small sections of less random data from being lost in the noise (32 bytes may not be the optimal figure; further testing is required here).
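For concreteness, the per-block statistic can be sketched as follows. This is a minimal stand-in, not binwalk's actual implementation; the function name is mine, and it just applies the standard chi-square formulation over the 256 possible byte values:

```python
def chi_square(block):
    """Chi-square statistic of a block against an ideal uniform byte distribution."""
    counts = [0] * 256
    for byte in block:
        counts[byte] += 1
    # For a 32-byte block, each of the 256 byte values is expected 32/256 = 0.125 times
    expected = len(block) / 256.0
    return sum((c - expected) ** 2 / expected for c in counts)

# A block of 32 distinct byte values scores low; a degenerate block scores very high
print(chi_square(bytes(range(32))))   # 224.0
print(chi_square(b"\x00" * 32))       # 8160.0, far above the 512 cutoff used below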
As previously stated, a much larger sample size was used. Twenty files were used as a starting base. These files were all firmware images (because I happen to have a lot of those sitting around) of various sizes and origins. Most, although not all, contain a mixture of high entropy and low entropy data, including file systems, strings, executable code and binary data.
These twenty files were each compressed using the following compressions (note: some of the following are actually compressed archive formats and/or use the same or similar compression algorithms; they were included in an effort to examine as many common compression formats as possible):
The same twenty files were then encrypted using the following algorithms:
In total, this resulted in 300 unique compressed files and 80 unique encrypted files.
As previously described, these files were then subjected to the chi square distribution test, the idea being that any 32-byte block which showed a significant deviation from that of an ideal random 32-byte block would indicate that the data had been compressed and not encrypted. The higher the number of blocks that deviated from the ideal, the more likely it was that the data was in fact not encrypted, and therefore compressed.
In the following tests, a block whose chi square value exceeded 512 was considered a suspect data block. A file with 1-2 suspect blocks was given a confidence level of ‘low’; 3-4 suspect blocks, a confidence level of ‘medium’; 5 or more suspect blocks, a confidence level of ‘high’. Any file with zero suspect blocks was assumed to be encrypted.
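Putting the 512 cutoff and the confidence tiers together, a file-level classifier might look like the following sketch. The function name and return values are illustrative, not binwalk's actual API; the thresholds come straight from the description above:

```python
def classify(data, block_size=32, cutoff=512):
    """Label data as encrypted or compressed from per-block chi-square deviations."""
    expected = block_size / 256.0
    suspect = 0
    for i in range(0, len(data) - block_size + 1, block_size):
        counts = [0] * 256
        for byte in data[i:i + block_size]:
            counts[byte] += 1
        if sum((c - expected) ** 2 / expected for c in counts) > cutoff:
            suspect += 1
    if suspect == 0:
        return ("encrypted", None)       # no suspect blocks: assume encrypted
    if suspect <= 2:
        return ("compressed", "low")
    if suspect <= 4:
        return ("compressed", "medium")
    return ("compressed", "high")

print(classify(bytes(range(256))))   # every 32-byte block is distinct -> ("encrypted", None)
print(classify(b"\x00" * 160))       # five degenerate blocks -> ("compressed", "high")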
| | Predicted Encrypted | Predicted Compressed |
|---|---|---|
| Encrypted files | 80 | 0 |
| Compressed files | 6 | 294 |
From this, we can deduce the following metrics:
| Metric | Value |
|---|---|
| True Positive Rate | 0.98 |
| False Positive Rate | 0.00 |
| True Negative Rate | 1.00 |
| False Negative Rate | 0.02 |
The results show that all encrypted files were identified as encrypted (100% accuracy), and only 6 of the 300 compressed files were misidentified as encrypted (98% accuracy).
All the compressed files incorrectly identified as encrypted came from the rzip and tornado compressions. Bear in mind though that the magic cutoff value of 512 and the block size of 32 bytes were somewhat arbitrarily chosen; it is possible that they can be adjusted to better compensate for these compressions (and even if they can’t, the majority of files compressed with these algorithms were still correctly identified).
The bigger concern is with LZMA compression. Although technically all of these were identified as compressed, notice that all of them had only one suspect data block which was right at the beginning of the file. In other words, the deviation was not in the compressed data itself but rather in the header used for that particular file format.
It is fair to say that most compressed files will include a header of some sort while encrypted files typically will not, and thus it is valid to include the non-random header data as part of the analysis. But consider an otherwise encrypted firmware update file which contains a header with the typical firmware header information (build date, CRC, file size, etc). From an entropy analysis standpoint, it would look the same as an LZMA compressed file.
What if you can’t determine the exact start of the data in question (perhaps it’s slapped in with a bunch of other data inside a larger file)? You may include unrelated data (or exclude related data) before/after the data in question, also skewing the results. Although these issues can often be resolved by considering the context in which the data is being analyzed, variations in the entropy of the compressed data itself preclude such problems and are therefore more desirable.
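One way to act on this observation is to record where the suspect blocks actually fall: if the lone deviation sits in the very first block, the signal probably comes from a header rather than from the compressed stream itself. A sketch along those lines (the function names, and the assumption that the header fits in the first 32-byte block, are mine):

```python
def suspect_offsets(data, block_size=32, cutoff=512):
    """File offsets of blocks whose chi-square statistic exceeds the cutoff."""
    expected = block_size / 256.0
    offsets = []
    for i in range(0, len(data) - block_size + 1, block_size):
        counts = [0] * 256
        for byte in data[i:i + block_size]:
            counts[byte] += 1
        if sum((c - expected) ** 2 / expected for c in counts) > cutoff:
            offsets.append(i)
    return offsets

def header_only_deviation(offsets):
    # True when the only suspect block is the first one, as seen with the LZMA files
    return len(offsets) == 1 and offsets[0] == 0

# A low-entropy 32-byte "header" followed by 32 distinct byte values
sample = b"\x00" * 32 + bytes(range(32))
print(suspect_offsets(sample))                          # [0]
print(header_only_deviation(suspect_offsets(sample)))   # True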
Interestingly, LZMA2 (implemented by the xz utility) was much easier to detect than LZMA.
Real World Application
Let’s take a look at some real-world examples. How about some router configuration files? Some vendors like to encrypt or obfuscate backup config files, which is just plain annoying. But which is it? Encrypted or obfuscated?
Let’s first take a Netgear configuration file:
| Number of Blocks of Deviation | Largest Deviation | File Offset of Largest Deviation | Analysis | Confidence |
|---|---|---|---|---|
Clearly this result is indicative of an encrypted file, and since we know that Netgear router configurations are in fact encrypted, we can leave it at that.
What about the Verizon FiOS ActionTec router configuration files? It has been claimed that these too are encrypted, but I have never seen any proof for or against these claims. The file’s entropy analysis, however, is quite interesting:
| Number of Blocks of Deviation | Largest Deviation | Offset of First Deviation | Offset of Last Deviation | Analysis | Confidence |
|---|---|---|---|---|---|
Looking more closely at where the entropy deviations occur shows that while the first part of the file is very random, over half of the file (> 200KB) is insufficiently random to be considered encrypted. At best this file is only partially encrypted, possibly not really encrypted at all.
What’s Next?

More testing, obviously. Particularly with compressions such as rzip and tornado, where the results were acceptable, but not as accurate as with most other compressions.
LZMA, of course, warrants more attention, as does optimizing the block size and the magic deviation cutoff.
More varied data sources (plain text ASCII, for example) must also be considered.
Additionally, options and flags passed to the compression utilities may play a significant role in the resulting entropy. For example, both the lzma and xz utilities were given the ‘--extreme’ option, which is supposed to improve the compression ratio, but how does this compare to the ‘-9’ option in terms of entropy?
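Python's own lzma module offers a quick way to experiment along these lines. The input below is a synthetic, repetitive stand-in for a real firmware image (substitute a real file in practice), and the suspect-block counter mirrors the test described earlier:

```python
import lzma

def suspect_blocks(buf, block_size=32, cutoff=512):
    """Count blocks whose chi-square statistic exceeds the cutoff."""
    expected = block_size / 256.0
    n = 0
    for i in range(0, len(buf) - block_size + 1, block_size):
        counts = [0] * 256
        for byte in buf[i:i + block_size]:
            counts[byte] += 1
        if sum((c - expected) ** 2 / expected for c in counts) > cutoff:
            n += 1
    return n

# Synthetic stand-in for a firmware image; highly repetitive, so it compresses well
data = bytes(range(256)) * 4096

plain   = lzma.compress(data, preset=9)
extreme = lzma.compress(data, preset=9 | lzma.PRESET_EXTREME)
print("preset 9:        ", suspect_blocks(plain))
print("preset 9|extreme:", suspect_blocks(extreme))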
The binwalk repo has been updated with the new code changes; although binwalk itself hasn’t been updated to exercise these changes, the maths.py file can be run standalone from the command line if you want to play around with this yourself.
Again, this work is still preliminary, so suggestions and criticisms are welcome.
Update: I’ve gotten some questions asking if this method will work equally well against AES ECB (AES CBC was used during testing). Some initial testing indicates that ECB vs CBC makes no appreciable difference in the test results.