I’ve recently been examining the feasibility of differentiating compressed data from encrypted data based on variations in the entropy of the data. Initial results showed some promise, but were tested against too small of a sample set to draw any hard conclusions. Since then, I’ve been experimenting with larger data sets (more files and more varied types of encryption / compression) with quite satisfactory results.
The TL;DR is that 98% of the compressed files tested were correctly identified as compressed, and 100% of the encrypted files were identified as not compressed (i.e., encrypted).
In general the entropy of compressed data shows significant variances from that of encrypted data, and can be reliably identified with very few false positives. While identification of certain compression algorithms (namely LZMA) still present some practical concerns depending on your situation, even those compressions were reliably distinguishable from encrypted data during testing due to non-random data in the file’s header structure (see ‘Analysis’ below).
What is particularly exciting, at least for me, is that compression formats which would have been otherwise unknown (e.g., files that weren’t signatured by file or binwalk) were easily identified as compressed through entropy analysis.