Can Keywords Still Be Used as a Culling Tool in eDiscovery?


“What important truth do very few people agree with you on?”

This is what Peter Thiel, co-founder of PayPal and Palantir and the first external investor in Facebook, used to ask startup founders in order to separate those worth investing in from those who were not.

Since I read his book Zero to One, I have always wondered what important truths very few people agree on in eDiscovery. I believe one of them is that keywords are neither a good nor a defensible culling tool.

I am not the first to raise this problem, but the industry consensus is that the methodology, while imperfect, remains useful and, more importantly, necessary for proportionality reasons. I hope to show that the degree of imperfection is vastly underestimated, and that proportionality is less of a concern given the current state of the technology.

Different States of Information

The first argument against the use of keywords as a culling tool is epistemological and not technological.

What an investigator knows at the beginning of a case is radically different from what he knows at the end of the matter, or even after just a few weeks of going through the documents. When a case starts, the legal team can only formulate assumptions based on the information provided by the client and on knowledge gained from similar matters. It is not possible at that stage to correctly guess how people communicated (e.g. formally or informally), how information was exchanged (e.g. emails, chats) or how things were named (e.g. acronyms, code words). This is obvious: if the factual landscape were completely clear and uncontroversial, the discovery of the evidence would not be necessary. And yet, we generally accept that the documents that will be pushed to review – and ultimately disclosed – will be determined by a list of search terms created at the very beginning of the case and based, at best, on a limited understanding of the matter.

From a methodological point of view, it is intuitive to see how this critical step already lacks defensibility. Unlike other mistakes that can happen during the course of a review, however, this one cannot be easily corrected, and it will prevent relevant documents from being disclosed simply because they don’t hit on an arbitrary list of keywords. Not a good start.

Language, OCR and Documents Without Text

The methodological problem is compounded by the increasingly transnational nature of many of today’s litigations and investigations. It is difficult enough to imagine how people we don’t know would communicate in our native language; it is almost impossible to make similar assumptions in a different language. In addition, even if language recognition and classification software is improving, the legal team may not even be aware of the existence of foreign language documents whose content should be searched and reviewed.

Technical challenges get in the way of keywords too. A very frequent one is that not every electronic document has text (e.g. most image files). Such documents will never be responsive to any keyword, and only rarely is a specific review strategy implemented to mitigate the risk of relevant information being missed as a result.

The most common way of dealing with documents without text is to run OCR (Optical Character Recognition) software on them in order to generate text from a digital image of the document. While this is a widely adopted and effective practice, accuracy can be mixed, especially for low-quality scanned documents. In addition, OCR adds extra time on top of standard indexing and processing, and it needs to be completed across all the documents without text (or a selected subset) before an accurate keyword hits report can be produced for the entire dataset. In the time-sensitive scenarios typical of eDiscovery investigations, and with ever-increasing datasets, running keywords will not only return inaccurate results but is also likely to increase the overall cost and time necessary to identify the documents selected for review.
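To make the pre-search triage concrete, here is a minimal sketch of how image-only documents might be identified and OCRed before any keyword report is run. It assumes a Python environment with Pillow and pytesseract installed; the file paths and the "needs OCR" check are illustrative assumptions, not the API of any review platform.

```python
# Minimal sketch: find image-only documents and OCR them before running keywords.
# Assumes Pillow and pytesseract (plus the Tesseract binary) are installed;
# paths and the "has extracted text" check are illustrative, not a product API.
from pathlib import Path

from PIL import Image
import pytesseract


def needs_ocr(extracted_text: str) -> bool:
    """Flag documents whose processing-stage text extraction came back empty."""
    return not extracted_text.strip()


def ocr_image(path: Path) -> str:
    """Generate searchable text from a scanned page; accuracy varies with scan quality."""
    return pytesseract.image_to_string(Image.open(path))


documents = {p: "" for p in Path("collection/").glob("*.tif")}  # placeholder corpus
for path, text in documents.items():
    if needs_ocr(text):
        documents[path] = ocr_image(path)  # only now can keywords hit these files
```

Even in this simplified form, the extra pass over every text-less file is what adds the processing time and cost described above.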

The Hidden Costs

Everyone in the eDiscovery world is familiar with “the funnel”. A popular slide in many PowerPoint presentations directed at prospective customers, it shows how a large amount of data originally collected for the matter can be reduced to a more manageable size through the application of certain filters in the ECA (Early Case Assessment) module of a review platform. Keywords play the most important role in drastically reducing the dataset that will eventually be reviewed.

[Image: the eDiscovery data reduction funnel. Credit: Salixdata]

Running keywords on the dataset, however, is not normally cost-free. The original list of proposed keywords needs to be sent to the project manager/litigation support specialist for the syntax to be checked. The list of search terms is then run over the whole dataset (and often over certain custodians and/or within specific date ranges too). As a result, one or more keyword hits reports are generated and sent to the legal team for evaluation.

The process is normally iterative: search terms are tweaked and tested until an optimal number of documents is returned. It can take days or even weeks because, depending on the results, review budgets may need to be adjusted and discussed with the end client and, in the context of a formal litigation, every amendment to the search terms must be formally agreed with the opposing party. At every step of this multi-party conversation, legal and discovery costs accrue and the start of the review is delayed. In some extreme instances, the billable hours accrued in the process surpass the savings of not promoting the entire dataset to review.

 Can I Run a Review Without Culling My Dataset Using Keywords?

 Yes, absolutely. Technology Assisted Review (TAR) will do the job in a faster and more defensible way.

Review platforms today operate in scalable cloud environments. Databases of five, ten or twenty million documents no longer suffer the performance issues they once did, as compute, memory and storage can be added almost instantaneously.

When documents are processed and indexed, an enormous amount of information is captured and made available without the need for human interaction. Documents are clustered into concept groups, emails are linked into threads, and duplicates and near-duplicates are identified for every record in the database. In addition, the extracted text and hundreds of metadata fields are made available.
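As a rough illustration of one of these automated steps, the sketch below flags exact duplicates via a hash of normalized text and near-duplicates via Jaccard similarity over word shingles. It is a simplified stand-in for what processing engines do at scale; the 0.8 threshold and helper names are assumptions, not any vendor's implementation.

```python
# Minimal sketch: exact- and near-duplicate detection on extracted text.
# The 0.8 similarity threshold and function names are illustrative assumptions.
import hashlib


def exact_hash(text: str) -> str:
    """Hash of whitespace-normalized text; identical hashes mean exact duplicates."""
    normalized = " ".join(text.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 5) -> set:
    """Set of k-word shingles used to compare documents for near-duplication."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a: set, b: set) -> float:
    """Overlap between two shingle sets; 1.0 means identical content."""
    return len(a & b) / len(a | b) if a | b else 0.0


def near_duplicates(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold
```

Relationships like these are computed once at processing time and then reused throughout the review, which is why so much is already known about the documents before anyone opens them.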

The review has not yet started but we already know a lot about the documents in our database. We can now use statistics to set the targets of our review and machine learning to ensure it runs efficiently.

The starting point to understand a large collection of records is to draw a representative random sample (i.e. an estimation sample) and then project its results across the entire set (e.g. the % of relevant documents in the random sample will be the same in the entire database, within the applied margin of error).

Let’s assume that our entire de-duplicated set is 1,000,000 documents. If we apply keywords and cull 90% of the dataset (promoting only 100,000 documents to review), the number of documents to review in a representative random sample (using a 95% confidence level and a 2% margin of error) is 2,345. If instead we promote the entire dataset of 1,000,000 documents to review, the random sample that will have to be reviewed is 2,396 documents.
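For readers who want to verify these figures, here is a minimal Python sketch of the textbook sample-size formula with a finite population correction, assuming a 95% confidence level, a 2% margin of error and a worst-case 50% prevalence. It is a standard calculation, not the exact method of any specific review platform.

```python
# Minimal sketch: sample size at 95% confidence / 2% margin of error,
# with finite population correction (worst-case 50% prevalence assumed).
import math


def sample_size(population: int, confidence_z: float = 1.96,
                margin_of_error: float = 0.02, prevalence: float = 0.5) -> int:
    n0 = (confidence_z ** 2) * prevalence * (1 - prevalence) / margin_of_error ** 2
    n = n0 / (1 + (n0 - 1) / population)  # finite population correction
    return math.ceil(n)


print(sample_size(100_000))    # 2,345 documents (culled set)
print(sample_size(1_000_000))  # 2,396 documents (full set)
```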

It is easy to see that the two samples are almost identical in size even though one dataset is 10x larger than the other. This is a very important point, because one common misconception is that the number of documents to review will increase in proportion to the size of the review population. That would be true in a traditional linear review, but it is not an accurate statement when machine learning and other analytical tools are deployed.

The machine learning algorithm learns from the decisions taken by the review team, suggesting documents similar to the ones already deemed relevant. The ability to dynamically change the order in which documents are reviewed, continuously bringing the most relevant ones to the top and pushing all the others to the bottom, renders the overall size of the database practically irrelevant. If only the likely relevant material is reviewed, the amount of non-relevant material will not impact the review workflow. In addition, taking advantage of the relationships built across documents (families, threads, duplicates and near-duplicates) in many cases allows an even more streamlined review of the potentially relevant documents.
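As a rough illustration of this prioritization loop, the sketch below re-ranks the unreviewed documents after each batch of coding decisions using scikit-learn's TfidfVectorizer and LogisticRegression. It is a simplified stand-in for a platform's proprietary TAR engine, and the variable names and batch size are assumptions.

```python
# Minimal sketch of a prioritized (continuous active learning) review loop.
# A simplified stand-in for a platform's TAR engine, not any vendor's actual model.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

documents = ["..."]  # extracted text for the full review population (placeholder)
labels = {}          # doc index -> True/False coding decisions from the review team

vectorizer = TfidfVectorizer(max_features=50_000)
features = vectorizer.fit_transform(documents)


def next_batch(batch_size: int = 50) -> list:
    """Return the unreviewed documents most likely to be relevant, given current coding."""
    reviewed = list(labels)  # assumes both relevant and non-relevant examples exist
    model = LogisticRegression(max_iter=1000)
    model.fit(features[reviewed], [labels[i] for i in reviewed])
    unreviewed = [i for i in range(len(documents)) if i not in labels]
    scores = model.predict_proba(features[unreviewed])[:, 1]  # P(relevant)
    ranked = np.argsort(scores)[::-1]                         # most relevant first
    return [unreviewed[i] for i in ranked[:batch_size]]

# Reviewers code each suggested batch, labels are updated and the ranking is refreshed,
# so review effort tracks the number of relevant documents rather than the dataset size.
```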

This iterative process continues until the designated target has been reached (e.g. the number of relevant documents projected from the estimation sample has been found within the larger dataset). It is then time to use statistics once again to defend the accuracy of our review. At this point, only a subset of the documents will have been reviewed and the vast majority will not have been touched by the legal team. Another random sample (i.e. the validation sample) is drawn from the unreviewed population and, if the results are satisfactory, we will be able to defensibly state that our document review exercise is complete.
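A minimal sketch of this validation step is below: a random sample is drawn from the unreviewed population and the proportion of relevant documents found in it (the elusion rate) is checked against a pre-agreed ceiling. The 1% ceiling and the use of a Wilson score upper bound are illustrative choices for the sketch, not a prescribed standard.

```python
# Minimal sketch: validating the unreviewed population via an elusion sample.
# The acceptable elusion ceiling (1%) is an illustrative, pre-agreed assumption.
import math
import random


def draw_validation_sample(unreviewed_ids: list, size: int) -> list:
    """Random sample of unreviewed documents to be coded by the review team."""
    return random.sample(unreviewed_ids, size)


def elusion_upper_bound(relevant_found: int, sample_size: int, z: float = 1.96) -> float:
    """Wilson score upper bound on the proportion of relevant documents left behind."""
    if sample_size == 0:
        return 1.0
    p = relevant_found / sample_size
    denom = 1 + z ** 2 / sample_size
    centre = p + z ** 2 / (2 * sample_size)
    spread = z * math.sqrt(p * (1 - p) / sample_size + z ** 2 / (4 * sample_size ** 2))
    return (centre + spread) / denom


# Example: 3 relevant documents found in a 2,396-document validation sample.
print(elusion_upper_bound(3, 2396) <= 0.01)  # True -> review defensibly complete
```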

It is important to note that a similar technology assisted review workflow is certainly possible when search terms are applied. In that scenario, however, the validation sample’s results will be limited to the culled dataset and will provide no information about the documents that were not promoted to review, potentially exposing us to further requests from the opposing party. When all documents are promoted to review, by contrast, a successful validation sample prevents such claims from being made.

If a Solution is Readily Available, Why are Keywords Still Used as a Culling Tool?

A possible answer to this question is two-fold. At its core, it is a combination of asymmetry of information and incorrect economic incentives.

For the asymmetry of information part, it is important to remember that eDiscovery is a highly technical and niche domain whose end customers may not share the same level of knowledge as its practitioners.

In this scenario, a customer may not be aware of the implications of certain decisions or of their possible alternatives. Search terms culling is a consolidated industry practice that only recent technological developments have decisively put into question. It would be unrealistic to expect a customer to suggest how to overcome dated and inaccurate technical methodologies, just as it should not fall to the patient to inform his doctor about the most recent and advanced medical therapies. It is up to eDiscovery professionals, as trusted advisors, to highlight the pros and cons of every choice and, in a constructive way, to challenge certain assumptions or practices when they are not in the customer’s best interest.

For the incentive part of the answer, the pricing model applied by several vendors is to blame, even though its application is driven by good intentions.

Pricing an eDiscovery exercise is an extremely difficult task. Information about the specific case is often patchy and subject to swings in both directions. In addition, the price needs to cover the costs of maintaining a secure and performant datacenter, a share of the software development costs (or of the license fee, for a reseller) and the expected profit margin. Moreover, the overall price structure needs to a) take into account the opportunity of future business from that customer and b) be competitive in the market.

Traditionally, this complex structure is simplified in the service contract under two main line items: 1) a per-GB processing and hosting fee and 2) a professional services hourly rate.

The idea of a per-GB rate is attractive because it is simple and makes the cost estimate very accurate once the amount of data to process and host is known. However, it provides the customer with the wrong economic incentive.

If the bill grows in proportion to the amount of data pushed to review, the customer will always be incentivized to cull the dataset, even when this damages the accuracy and defensibility of the review. In a domain filled with asymmetry of information, the contractual relationship needs to be designed to mitigate the inevitable issues of a principal-agent scenario. Can the customer trust his agent when the latter suggests promoting all documents to review, or is the agent acting this way to maximize his own gains at the customer’s expense?

In the short term, the remedy for this misalignment is to remove the per-GB rate and move to fixed, tiered pricing for processing and hosting charges. In addition, archival or nearline solutions should be implemented to give customers more efficient and granular control over their hosting fees. In the medium and long term, as explained in this article, it will become more difficult for eDiscovery companies to charge any processing and hosting fees at all, as the data will remain hosted in the customer’s private cloud. At that point, the only remaining applicable charges will be the eDiscovery software license fees and the professional service hours.

Based on the above, it is relatively easy to imagine a future where keywords will continue to play a very important role in finding relevant information, but they will never be used to arbitrarily determine what will eventually land in a court of law or in an investigative report.
