What Is Hashing, and Why Does It Matter for eDiscovery?

Hash: it’s what’s for breakfast (or lunch or dinner). And, more importantly, though less tasty, it’s also a critical element of a defensible eDiscovery process.

A hash value is a computer file’s digital fingerprint: an identifier computed from the file’s contents that is, for practical purposes, unique to those contents. If the content changes, for example, if you add so much as a space to a Word document, the hash value changes too, signaling that the file no longer matches the original.
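To make the fingerprint idea concrete, here is a minimal sketch using Python's standard `hashlib` module. The sample text and the helper name `md5_hex` are illustrative assumptions, not part of any eDiscovery product.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 hash of some bytes as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical document contents, before and after adding one trailing space.
original = b"The parties agree to the terms below."
edited = b"The parties agree to the terms below. "

print(md5_hex(original))
print(md5_hex(edited))

# Identical content gives an identical hash; a one-character edit does not.
print(md5_hex(original) == md5_hex(b"The parties agree to the terms below."))
print(md5_hex(original) == md5_hex(edited))
```

The two printed hash values look unrelated even though the inputs differ by a single space, which is what makes a hash useful for spotting changes.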

The standard algorithm used for hashing in eDiscovery today is MD5. The MD5 algorithm (MD stands for “message digest”) produces a 128-bit hash, which is represented as a 32-character hexadecimal text string. An MD5 hash value might look like this: 79054025255fb1a26e4bc422aef54eb4. Another popular hashing algorithm is SHA-1, which produces a 160-bit hash. That results in a longer, 40-character string that might look something like this: d13ba1375bfdbce03a18fabc21b3a7e83faff69e.
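The relationship between bit length and string length can be checked with a short sketch, again assuming Python's standard `hashlib` module. Each hexadecimal character encodes 4 bits, so 128 bits become 32 characters and 160 bits become 40. The sample input is made up for illustration.

```python
import hashlib

# Hypothetical input; any bytes would do.
data = b"sample document contents"

md5_value = hashlib.md5(data).hexdigest()
sha1_value = hashlib.sha1(data).hexdigest()

# digest_size is reported in bytes, so multiply by 8 to get bits.
md5_bits = hashlib.md5().digest_size * 8
sha1_bits = hashlib.sha1().digest_size * 8

print("MD5:  ", md5_bits, "bits,", len(md5_value), "hex characters:", md5_value)
print("SHA-1:", sha1_bits, "bits,", len(sha1_value), "hex characters:", sha1_value)
```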

No matter which algorithm you use, hash values will appear to be totally random: similar files do not receive similar hash values, so the values will not arrange the files in a data collection in any sequential or meaningful order.

The Purpose of Hashing for eDiscovery

Hashing serves two main purposes in discovery.

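The first is to ensure data integrity. In eDiscovery, hashes are useful because they allow us to verify that a file’s contents, including any metadata embedded in the file, did not change during collection or processing. File-system attributes such as the file name and timestamps are stored outside the file, so a hash of its contents does not cover them. Hashing can also help expose whether anyone has attempted to tamper with electronic evidence.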
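An integrity check of this kind might be sketched as follows: record a hash at collection time, then recompute it later and compare. This is an illustration using Python's standard library, not a description of any particular tool; the function names `hash_file` and `verify_file` and the file `memo.txt` are hypothetical.

```python
import hashlib
import os
import tempfile


def hash_file(path: str, algorithm: str = "md5", chunk_size: int = 65536) -> str:
    """Hash a file's contents in chunks so large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: str, expected_hash: str, algorithm: str = "md5") -> bool:
    """Return True if the file's current hash matches the recorded one."""
    return hash_file(path, algorithm) == expected_hash.lower()


# Demonstration with a throwaway file standing in for collected evidence.
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "memo.txt")
    with open(path, "wb") as handle:
        handle.write(b"Quarterly results attached.")

    collected_hash = hash_file(path)  # recorded at collection time
    unchanged_ok = verify_file(path, collected_hash)

    with open(path, "ab") as handle:
        handle.write(b" ")  # simulate a one-character alteration
    altered_ok = verify_file(path, collected_hash)

print("Unchanged file verifies:", unchanged_ok)
print("Altered file verifies:", altered_ok)
```

Note that this sketch hashes only the bytes inside the file. Renaming the file or changing its timestamps would not change the result.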

Second, hashing is useful for winnowing data collections to a manageable size and lowering the overall cost of eDiscovery by eliminating documents before the document review stage. This is done using hash values in one of three ways: deduplication, near deduplication, and deNISTing.

  • Deduplication: Duplicate files in a collection are identified by comparing their hash values, and the extra copies are removed to avoid the cost of reviewing the same document more than once.
  • Near deduplication: Some eDiscovery vendors also offer software that can hash individual fields or segments of a document, such as just the text of an email rather than its headers, and so match messages with identical content even if they have different senders and recipients.
  • DeNISTing: Hash values permit the identification and removal of meaningless system files (such as those ending in .exe or .dll) from a data collection. Here, the eDiscovery software compares the hash values in the data collection against a list, the Reference Data Set, which is part of the National Software Reference Library and is compiled by the National Institute of Standards and Technology (NIST). The Reference Data Set contains more than 28 million file signatures.
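The three techniques above can be sketched in a few lines of Python. Everything here is a simplified assumption for illustration: the file names and contents are invented, `known_hashes` is a tiny stand-in for the NIST Reference Data Set (which this sketch does not load), and hashing a normalized email body is only one way a vendor might approach near deduplication.

```python
import hashlib
from typing import Dict, Set


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def deduplicate(files: Dict[str, bytes]) -> Dict[str, bytes]:
    """Keep only the first file seen for each distinct hash value."""
    seen: Set[str] = set()
    unique: Dict[str, bytes] = {}
    for name, content in files.items():
        value = md5_hex(content)
        if value not in seen:
            seen.add(value)
            unique[name] = content
    return unique


def denist(files: Dict[str, bytes], known_hashes: Set[str]) -> Dict[str, bytes]:
    """Drop files whose hash appears in a list of known system files."""
    return {
        name: content
        for name, content in files.items()
        if md5_hex(content) not in known_hashes
    }


def body_hash(email: Dict[str, str]) -> str:
    """Hash only the email body, ignoring case and whitespace differences."""
    normalized = " ".join(email["body"].split()).lower()
    return md5_hex(normalized.encode("utf-8"))


# Hypothetical collection: one exact duplicate and one "system" file.
files = {
    "report.docx": b"Q3 report",
    "report_copy.docx": b"Q3 report",
    "notes.txt": b"meeting notes",
    "kernel32.dll": b"pretend system file bytes",
}
# Stand-in for the Reference Data Set, built from the fake system file above.
known_hashes = {md5_hex(b"pretend system file bytes")}

deduped = deduplicate(files)
reviewable = denist(deduped, known_hashes)
print(sorted(reviewable))

# Two emails with the same text but different senders and recipients.
email_a = {"from": "ann@example.com", "to": "bob@example.com", "body": "See attached."}
email_b = {"from": "bob@example.com", "to": "cat@example.com", "body": "See  attached.\n"}
print(body_hash(email_a) == body_hash(email_b))
```

In this toy collection, four files shrink to two reviewable documents, and the two emails are treated as the same content despite their different headers.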

To learn more about how you can use hash values in eDiscovery to maintain the integrity of your data collections and reduce the cost of data processing and review, get in touch.
