Even with all the technological advancements of the last 20 years, document review remains the slowest, costliest phase of eDiscovery. For most review projects, technology only takes you so far: attorneys still have to manually look at pages upon pages of potentially discoverable data to determine what’s actually relevant or privileged. The upshot is that the fewer pages there are in the pile for document review, the faster and cheaper that review can be completed.
What you definitely don’t want to find, at the end of an eDiscovery project, is that you reviewed the same documents—or nearly the same documents—over and over again. To avoid that situation, you want an eDiscovery solution that incorporates both deduplication and near-deduplication.
How Does Near-Deduplication Work?
You’d think near-deduplication would be nearly the same thing as deduplication, but you’d be wrong.
Deduplication detects exact copies of documents and emails and removes all but a single copy of any given document. For email, it typically does so using the descriptive metadata accompanying the message, calculating a hash value from fields such as the sent date, sender or author, email header, and so on; for standalone files, the hash is usually calculated from the file's contents. The software then compares hash values, identifies duplicate documents, and eliminates extraneous copies.
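To make the hash-and-compare step concrete, here is a minimal Python sketch. It is an illustration only, not how any particular eDiscovery product works: the field names (created, author, subject), the choice of SHA-256, and the sample records are all assumptions made for the example.

```python
import hashlib

# Hypothetical metadata fields used to build the hash; real tools choose their own.
FIELDS = ("created", "author", "subject")

def metadata_hash(doc, fields=FIELDS):
    """Return a SHA-256 hex digest built from the selected metadata fields."""
    # Normalize each value and join with a separator unlikely to appear in the data.
    key = "\x1f".join(str(doc.get(f, "")).strip().lower() for f in fields)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def deduplicate(docs):
    """Keep the first document seen for each hash value and drop the rest."""
    seen = {}
    for doc in docs:
        seen.setdefault(metadata_hash(doc), doc)
    return list(seen.values())

# Made-up sample records: 1 and 2 are exact copies, 3 is a forwarded version.
docs = [
    {"id": 1, "created": "2020-01-05", "author": "alice@example.com", "subject": "Draft contract"},
    {"id": 2, "created": "2020-01-05", "author": "alice@example.com", "subject": "Draft contract"},
    {"id": 3, "created": "2020-01-06", "author": "bob@example.com", "subject": "Fwd: Draft contract"},
]

unique = deduplicate(docs)
print([d["id"] for d in unique])
```

In this sketch, records 1 and 2 produce the same hash, so only one of them survives. Record 3, the forwarded message, has different metadata and a different hash, so it is kept even though its substance is the same.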
But what about all the nearly identical versions of documents that have been reused multiple times? This might occur with contracts that are only slightly updated for new parties or with emails that have been forwarded, creating different metadata without changing the substance of the message. Deduping won’t catch these documents, and rightly so: they’re not, in fact, identical copies. They might be very similar, but they could also have important variations.
That’s what near-deduplication, also known as near-duplicate detection, is for. Near-deduplication is a method of clustering like documents together. It doesn’t work its magic by analyzing metadata or generating hash values; instead, it compares the actual text in documents, creating “piles” of documents with similar text. And it doesn’t discard anything from the pile; it simply groups related documents together so that a reviewer can consider all of them at once.
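One common way to compare text and build piles like these is to break each document into overlapping word sequences ("shingles") and measure how much two documents' shingle sets overlap (Jaccard similarity). The Python sketch below takes that approach as an assumption; the article does not say which algorithm any given product uses. The three-word shingle size, the 0.5 threshold, the greedy grouping strategy, and the sample texts are all illustrative choices.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word sequences that appear in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Shared shingles divided by all distinct shingles across both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def group_near_duplicates(texts, threshold=0.5, k=3):
    """Greedily place each document in the first pile whose first member is similar enough."""
    sets = [shingles(t, k) for t in texts]
    piles = []  # each pile is a list of document indexes; nothing is discarded
    for i, s in enumerate(sets):
        for pile in piles:
            if jaccard(s, sets[pile[0]]) >= threshold:
                pile.append(i)
                break
        else:
            piles.append([i])  # no similar pile found, so start a new one
    return piles

# Made-up sample texts: the first two differ by a single party name.
texts = [
    "This agreement is made between Acme Corp and Beta LLC on the first of March.",
    "This agreement is made between Acme Corp and Gamma LLC on the first of March.",
    "Lunch is at noon on Friday in the main conference room.",
]

print(group_near_duplicates(texts))
```

Here the two contract variants land in the same pile and the unrelated lunch message gets a pile of its own. Unlike the deduplication step, every document is still present in the output; the grouping only changes the order in which a reviewer would see them. Comparing every document against every pile is fine for a sketch but would be too slow for a large collection, where techniques such as MinHash are typically used to approximate the same comparison.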
Benefits of Near-Deduplication
Near-deduplication delivers what any processing method worth its salt should: less work for the review team. Nothing is removed from the data set, but by grouping like documents together, it reduces the effort needed to get through it. The review team can quickly assess an entire “pile” of related documents and determine whether the near-duplicates need to be individually examined or whether they can all be coded the same way.
That saves time and money, of course, but it also improves the consistency of results. Instead of having multiple reviewers encountering multiple nearly identical documents in a scattershot fashion, you can have one reviewer consider all of those documents at once, ensuring consistent, efficient coding.
Don’t blow your eDiscovery timeline—or your budget—reviewing unnecessary duplicates and near-duplicates of documents. Instead, minimize the amount of potentially discoverable data before the review stage using deduplication and organize the near-duplicates into neat piles to expedite and simplify review of what’s left.
If your existing eDiscovery solution isn’t helping you identify both duplicates and near-duplicates, iDiscover can help you do better. Contact us to learn more.