Understanding Forensic Copies & Hash Functions

By Kristian Lars Larsen | Digital Forensics, E-Discovery | February 12, 2019

This is the first in a series of articles focused on the technology of e-discovery and digital forensics written specifically for attorneys and legal professionals.

In the course of most digital forensics and e-discovery investigations, it is necessary to capture electronically stored information (ESI) for future discovery and analysis. Before any data examination occurs, sources of potential ESI need to be preserved in a manner that protects their integrity. That is the role of the “forensic copy.”

The question may arise—how do we know that a forensic copy is accurate? Furthermore, how do we assure that the forensic copy remains faithful to the original?

For that, we have a mathematical algorithm called a “hash function” to thank. This article will give you a bit of grounding in the theory and operation of hashes and how they maintain the integrity of digital evidence throughout the discovery process.

Logical vs. Forensic Copies

In the world of digital forensics, it is commonly understood that direct examination of original electronic media should never occur. Instead, investigation should occur only on copies of the data.

When the average person makes a backup of a hard drive, he or she is most likely making a “logical copy”—the duplication of known, visible files in the allocated space of a hard drive. Let’s say you have 200GB of files and folders written to your 1TB hard drive. A logical copy would only include those 200GB of visible files and folders.

“Forensic copies” differ in that they replicate every bit from every sector of the hard drive, whether that space is allocated or not. A forensic copy would capture not only the 200GB of visible files and folders, but would also capture the remaining 800GB of unallocated space. This is important to digital forensic investigators because unallocated space may contain deleted files or other residual data that can be invaluable during discovery. A forensic copy also preserves file metadata and timestamps exactly as they exist on the original, while a logical copy made with ordinary tools may alter them or leave them behind.

While you or your IT department can handle making a “logical copy” of a hard drive, most of us are not equipped to make a forensic copy. Few people working outside the realm of digital forensics will have the specialized hardware, software, and training needed to produce a proper forensic image of a hard drive.

The Role of a Hash

By definition, forensic copies are exact, bit-for-bit duplicates of the original. To verify this, we can use a hash function to produce a type of “checksum” of the source data. As the original media is read and copied, the same data is also fed into a hashing algorithm. When the copying is finished, the algorithm will produce a hash value which will act as a type of digital fingerprint that is, for practical purposes, unique to the dataset.
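To make the idea of hashing while copying concrete, here is a minimal Python sketch. It assumes the source is an ordinary file rather than a raw device behind a write blocker, and the function name, chunk size, and paths are hypothetical choices for illustration, not the workings of any particular forensic tool.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time; an arbitrary choice for this sketch


def copy_and_hash(source_path, copy_path, chunk_size=CHUNK_SIZE):
    """Copy source_path to copy_path and return the MD5 hex digest of the bytes read."""
    hasher = hashlib.md5()
    with open(source_path, "rb") as source, open(copy_path, "wb") as copy:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:  # an empty read means the end of the source was reached
                break
            copy.write(chunk)      # write the chunk to the copy...
            hasher.update(chunk)   # ...and feed the same bytes to the hash
    return hasher.hexdigest()
```

Because every chunk is both written and hashed in a single pass, the source only has to be read once.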

Hash functions have four defining properties that make them useful. Hash functions are:

Deterministic – For any given input, a hash function must return the same value each and every time that input is processed.
Pre-Image Resistant – All hash functions must be “pre-image resistant.” By this, we mean that, given only a hash value, it should be computationally infeasible to work backward to an input that produces it. The hash value should not provide any clue about the size or the content of the input.
Collision Resistant – A collision is where multiple inputs are found to produce a common output (or common hash value). Since our potential inputs are infinite and our output is a fixed length, collisions are bound to occur. Collision resistance does not mean that collisions don’t exist, but rather, they are very difficult to find. In practical use, the odds of finding two distinct inputs that produce a common output are astronomically low.
Computationally Efficient – Finally, we expect that a hash function will be computationally efficient, or, in other words, speedy.
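To put a rough number on how unlikely an accidental collision is, the sketch below uses the standard "birthday" approximation. It assumes hash outputs behave like uniformly random values, so it describes chance collisions only; it says nothing about deliberately constructed collisions, which are a known weakness of MD5. The function name is hypothetical.

```python
import math


def accidental_collision_probability(num_inputs, hash_bits):
    """Approximate chance that any two of num_inputs random inputs share a hash.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / 2^(bits+1)),
    which assumes the hash output is uniformly random.
    """
    exponent = num_inputs * (num_inputs - 1) / (2 * 2 ** hash_bits)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-exponent)


one_billion = 10 ** 9
print(accidental_collision_probability(one_billion, 128))  # 128-bit output
print(accidental_collision_probability(one_billion, 160))  # 160-bit output
```

Under these assumptions, even a billion distinct inputs hashed to 128 bits give a chance collision probability of roughly 1.5e-21.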

Understanding these properties, we can expect that our original media and our forensic copy will produce the same hash value if they are identical and, for all practical purposes, only if they are identical. After we have calculated our initial hash value, we will run our forensic copy through the hash function. If the two hashes are equal, we can be assured the underlying data is identical.
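The comparison described here can be sketched in a few lines of Python. This is an illustration that assumes both the original and the copy are readable as ordinary files; the function names are hypothetical.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in fixed-size chunks."""
    hasher = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file)
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def copies_match(original_path, copy_path):
    """Return True when both files produce the same MD5 hash value."""
    return file_md5(original_path) == file_md5(copy_path)
```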

In the future, if we ever need to verify the integrity of our forensic copy (to make sure it has not been altered), we can re-run the hash function to make sure we obtain the expected hash value.
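A later integrity check might look like the following sketch, which assumes the hash value was written down as a hexadecimal string at the time the copy was made. The function names and the `algorithm` parameter are hypothetical choices for illustration.

```python
import hashlib


def hash_file(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of the file at path, read in fixed-size chunks."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_image(path, recorded_hash, algorithm="md5"):
    """Return True if the file still hashes to the value recorded earlier."""
    # Normalize the recorded value so stray whitespace or uppercase hex digits
    # do not cause a false mismatch.
    expected = recorded_hash.strip().lower()
    return hash_file(path, algorithm) == expected
```

A result of `False` means the file no longer matches the recorded value; it does not say what changed or when.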

It’s important to understand that while hashing is considered a cryptographic function, a hash does not encode the actual data into the output. It is virtually impossible to reverse-engineer a hash value to arrive at the input. The goal of using a hash function is to provide an immutable fingerprint of a dataset that can be used to determine the integrity of that dataset in the future.

Popular Hash Functions

In digital forensics, a few different hash functions are in common use. The most widely used is MD5 (Message Digest 5), an algorithm that produces a 128-bit hash (represented by a 32-character hexadecimal value). While MD5 has been found to suffer from cryptographic vulnerabilities, it is more than sufficient to function as a checksum to verify data integrity. SHA-1 is another popular hash function that produces a longer, 160-bit hash (represented by a 40-character hexadecimal value).
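The output sizes of these two functions can be checked with Python's standard `hashlib` module. This is a small sketch; the helper name `digest_summary` is hypothetical.

```python
import hashlib


def digest_summary(algorithm, data):
    """Return (digest size in bits, length of the hexadecimal representation)."""
    hasher = hashlib.new(algorithm, data)
    return hasher.digest_size * 8, len(hasher.hexdigest())


for name in ("md5", "sha1"):
    bits, hex_length = digest_summary(name, b"The")
    print(f"{name}: {bits}-bit digest, {hex_length} hexadecimal characters")
```

Each hexadecimal character represents four bits, which is why a 128-bit hash is written as 32 characters and a 160-bit hash as 40.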

To better understand what a hash value is, let’s take a look at some sample MD5 hashes. You are welcome to follow along by hashing your own text with any MD5 tool.

First, let’s take a simple word (“The”) and run it through the MD5 hash function:

Input: “The”

Hash Value: a4704fd35f0308287f2937ba3eccf5fe

The hash function took the input (“The”) and generated a 32-character hexadecimal hash value. Let’s try a slightly more complex phrase.

Input: “The quick brown fox jumped over the lazy dog’s back. ”

Hash Value: c63fb23376e70ce986f3ec87cd0334e4

You’ll see that the added length of the input does not affect the length of the output, which is fixed at 32 characters. Now, what if we change a single character, like changing “dog” to “hog”? Might we expect a similarly slight change in the hash value?

Input: “The quick brown fox jumped over the lazy hog’s back. ”

Hash Value: 660d4ca68c9e87706721b651f8d5f9c2

No! The single character change in the input creates a completely different output. Let’s try another slight change–we’ll change “hog” to “frog.”

Input: “The quick brown fox jumped over the lazy frog’s back. ”

Hash Value: 9337d651fb83b087e26f86a844a76b6d

Again, we find a completely different hash value despite the minor change to the input. This behavior is often called the “avalanche effect”: a small change to the input alters the output so thoroughly that the new hash value bears no apparent resemblance to the old one.
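The sensitivity to small changes seen in these examples can be measured by counting how many of the 128 output bits differ between two hashes. The sketch below uses plain ASCII versions of the sample sentences, so the hash values it prints will not necessarily match the ones shown above, which depend on the exact bytes that were hashed (quotation marks, trailing spaces, and so on). The helper names are hypothetical.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 hex digest of text encoded as UTF-8."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def differing_bits(hex_a, hex_b):
    """Count the bit positions at which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


dog = "The quick brown fox jumped over the lazy dog's back."
hog = "The quick brown fox jumped over the lazy hog's back."

dog_hash = md5_hex(dog)
hog_hash = md5_hex(hog)
print(dog_hash)
print(hog_hash)
print(differing_bits(dog_hash, hog_hash), "of 128 bits differ")
```

For a well-behaved hash function, changing one input character should flip about half of the output bits on average.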

Wrapping Things Up

Although our examples were short, length of input is immaterial. When we pass a 1TB forensic image through the MD5 hashing algorithm, we’ll still end up with a 32-character hash value output.

If there is ever a question about the authenticity of the data, the hash can be recalculated. Should a single folder, file, or bit in that dataset be altered, recalculating the hash would produce an entirely different value. If the hash values match, the integrity of the forensic image is confirmed.

And there you have it—an explanation about how hash values are used to protect the integrity of ESI that is preserved during the course of an e-discovery effort. Feel free to reach out to Data Narro if you have any technical or procedural questions concerning digital forensics or e-discovery!

Photo Illustration by Data Narro. Photograph by Lorenzo Cafaro from Pexels.