Text Extraction and Indexing: You Can’t Do eDiscovery Without Them

You know the parable: two young fish are swimming along when they meet an older fish. The older fish says, “Morning, boys, how’s the water?” As the young fish swim away, they turn to each other and ask, “What the heck is water?”

If you’ve done eDiscovery for any length of time, you might have forgotten about the water. That’s how essential text extraction and indexing are: without them, there is no text search, no email threading, no concept clustering, and no predictive coding—no eDiscovery as we know it.

What Are Text Extraction and Indexing?

During processing, electronically stored information (ESI) is converted into a form that eDiscovery software can recognize and work with. That begins with text extraction—converting the data in each file type into a uniform structure that software can recognize as text. There are two ways to go about text extraction.

With the first method, native text extraction, processing software extracts all of the code and data contained within a document and converts it into readable text. This captures not just the visible data but also any hidden comments and metadata. Native text extraction is 100 percent accurate, but it doesn’t work everywhere. If a file is not recognized as a text file, native text extraction will simply skip over it.

This points out one of our hidden strengths as humans: we are adept readers, which sometimes blinds us to the difficulty that computers have with different types of ESI. We can read a picture of a document—assuming it’s of high quality—just as easily as we can read a Word document or a printed document, but these are not all evident to a computer as “text.” That’s why we need the second type of extraction, optical character recognition (OCR).

OCR recognizes the way that individual letters and numbers look and converts those graphical patterns into readable text. It’s commonly used with paper discovery or discovery that has been printed and scanned. It’s also useful when discovery includes images, such as a photograph that includes text.

Note that OCR, unlike native text extraction, is not 100 percent accurate; in fact, per character, it’s been measured at only 71 to 98 percent accurate. While 98 percent accuracy may sound good, that still means that two out of every 100 characters, not words, are wrong. In a document the length of this blog post—around 4,250 non-space characters—that would mean 85 errors. We wouldn’t be satisfied with that degree of accuracy, and you shouldn’t be, either.
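To make that arithmetic concrete, here is a minimal Python sketch. The function names are our own, and the word-level estimate rests on a simplifying assumption that character errors are independent of one another, which real OCR output may not honor.

```python
def expected_char_errors(char_count: int, accuracy_pct: int) -> float:
    """Expected number of wrong characters at a given per-character accuracy."""
    return char_count * (100 - accuracy_pct) / 100


def word_error_rate(word_length: int, accuracy_pct: int) -> float:
    """Chance that a word contains at least one wrong character.

    Assumes each character is misread independently (a simplification).
    """
    return 1 - (accuracy_pct / 100) ** word_length


print(expected_char_errors(4250, 98))    # 85.0
print(expected_char_errors(4250, 71))    # 1232.5
print(round(word_error_rate(5, 98), 3))  # 0.096
```

Under that independence assumption, even at 98 percent character accuracy roughly one in ten five-letter words would contain an error, and a keyword search will miss a word that was misread.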

The solution comes from combining both native text extraction and OCR to identify all types of text, along with quality control checks to correct character errors. Your QC should also include an analysis of any documents that weren’t recognized as containing text. Those documents are identified during the next critical step: indexing.
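One way to picture that combination is a simple fallback: try native extraction first, fall back to OCR, and flag anything that needs a human look. The sketch below is illustrative only. `native_extract` and `ocr_extract` are hypothetical stand-ins for whatever your processing software provides, and flagging at extraction time is a choice made for this example rather than a description of any particular product.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical extractor signature: returns text, or None if nothing was found.
Extractor = Callable[[bytes], Optional[str]]


@dataclass
class ExtractionResult:
    doc_id: str
    text: str
    method: str     # "native", "ocr", or "none"
    needs_qc: bool  # True when a person should review the result


def extract_text(doc_id: str, data: bytes,
                 native_extract: Extractor,
                 ocr_extract: Extractor) -> ExtractionResult:
    # Prefer native extraction, since it reads the text directly from the file.
    text = native_extract(data)
    if text and text.strip():
        return ExtractionResult(doc_id, text, "native", needs_qc=False)

    # Fall back to OCR, and flag the result because OCR can misread characters.
    text = ocr_extract(data)
    if text and text.strip():
        return ExtractionResult(doc_id, text, "ocr", needs_qc=True)

    # No text recovered by either method: flag the document for analysis.
    return ExtractionResult(doc_id, "", "none", needs_qc=True)


# Toy extractors so the sketch runs on its own; real ones would parse files.
def toy_native(data: bytes) -> Optional[str]:
    return data[4:].decode("utf-8") if data.startswith(b"TXT:") else None


def toy_ocr(data: bytes) -> Optional[str]:
    return "text read from image" if data.startswith(b"IMG:") else None


for doc_id, data in [("DOC-001", b"TXT:Quarterly report"),
                     ("DOC-002", b"IMG:..."),
                     ("DOC-003", b"BIN:...")]:
    result = extract_text(doc_id, data, toy_native, toy_ocr)
    print(result.doc_id, result.method, result.needs_qc)
```

In this toy run, the first document comes back as native text, the second as OCR text flagged for QC, and the third as having no text at all.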

An eDiscovery index operates much like an index in a book: after text has been extracted or rendered readable to a computer, indexing software identifies individual words from that text. (But in the case of eDiscovery, unlike most books, an index may include practically all words found in the corpus.) Those words are then sorted into an index, which tags the locations where each word can be found. Instead of searching through the entire text of an eDiscovery production for a keyword, with an indexed production your software can flip to the index, find everywhere that a word is located, and almost instantly return those results.
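The structure described here is usually called an inverted index. A minimal Python sketch, assuming a tiny made-up corpus and a deliberately naive tokenizer, might look like the following. Real eDiscovery indexes handle far more (stemming, languages, metadata fields), so treat this as an illustration of the idea rather than how any given platform works.

```python
import re
from collections import defaultdict


def build_index(docs: dict[str, str]) -> dict[str, dict[str, list[int]]]:
    """Map each word to the documents, and word positions, where it appears."""
    index: dict[str, dict[str, list[int]]] = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Naive tokenizer: lowercase runs of letters, digits, and apostrophes.
        words = re.findall(r"[a-z0-9']+", text.lower())
        for position, word in enumerate(words):
            index[word][doc_id].append(position)
    # Convert to plain dicts so later lookups cannot add empty entries.
    return {word: dict(postings) for word, postings in index.items()}


def search(index: dict[str, dict[str, list[int]]], keyword: str) -> dict[str, list[int]]:
    """Look the keyword up in the index instead of scanning every document."""
    return index.get(keyword.lower(), {})


# Hypothetical three-document corpus; DOC-003 has no extracted text.
docs = {
    "DOC-001": "The merger closes Friday.",
    "DOC-002": "Lunch on Friday?",
    "DOC-003": "",
}
index = build_index(docs)

print(search(index, "Friday"))  # {'DOC-001': [3], 'DOC-002': [2]}

# Documents that contributed no words to the index are candidates for QC.
indexed_ids = {doc_id for postings in index.values() for doc_id in postings}
print([doc_id for doc_id in docs if doc_id not in indexed_ids])  # ['DOC-003']
```

The last two lines show why indexing doubles as a quality check: a document that never appears in the index had no recognizable text, so it deserves a closer look.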

Benefits of Text Extraction and Indexing

As mentioned above, it’s a bit difficult to isolate the benefits of text extraction and indexing. It’s akin to discussing the benefits of brick, wood, concrete, and glass to a building project: without these essential ingredients, there would be no structure to provide benefits.

Most fundamentally, text extraction and indexing enable software to read text and execute search functions. Put simply, you cannot run an eDiscovery search without first extracting the text in your production and indexing that text. That means no keyword searching, no searching for custodian names, and no rapid way to identify or locate known relevant documents.

Text extraction and indexing also allow more sophisticated classification technologies. Recognizable text is a necessary prerequisite for techniques like entity extraction, email threading, concept clustering, and predictive coding. These classification tasks, unlike searching, help reviewers find documents that they didn’t know they were looking for.

In short, text extraction and indexing allow computers to take over much of the load of eDiscovery, saving human review teams from having to read every document in a production themselves. They’re the building blocks that make the rest of eDiscovery technology work.

You may not think about the water very often, but remember that it’s what’s enabling the rest of your eDiscovery technology to work. If you’re not sure you’re building on a strong foundation of text extraction and indexing, please contact us to learn how we can help you do better.
