The difference between eDiscovery and Data Governance. Why your litigation tool won't cut it behind the firewall.

QUESTION: Is it better to solve a problem using a general-purpose tool, or one that's designed and optimized to best solve the problem?

That old saw "Jack of all trades, master of none" comes to mind. I always assumed it was an argument in favor of specialization, until I read recently that a second phrase is usually omitted: "but oftentimes better than a master of one". Kind of reverses its essential meaning, doesn't it? Proof, I think, that you shouldn't look for solutions to important problems in proverbs. Look to the problem itself.

In the Data Governance world where I live, using general-purpose, generally available, open-source software whenever possible has an understandable economic appeal. You leverage the development efforts of a sizable community, and your only costs are those of adapting the software to your particular needs. Why would anyone ever build something from scratch when there is an open source alternative? Only when the gap between the open source tool's architecture and the problem at hand imposes costs far in excess of the benefits of using free stuff.

When it comes to indexing unstructured data, the most popular free stuff is the open source Lucene indexing software, the Elasticsearch clustering layer built on top of it, and the related set of tools known as the Elastic Stack (from now on, when I say "Lucene" I'll mean this entire software suite). In the eDiscovery world, it's hard to find any indexing product on the market (save my company's) that doesn't leverage this collection of software, and for good reason: it's a comprehensive, well thought out, and powerful platform. Its complexity can be intimidating to the newcomer, but most commercial-quality products hide that complexity by wrapping it in a friendly user interface and a purpose-defined workflow.

So why shouldn't every indexing product use Lucene at its core?

The answer is hidden in a fascinating paper I recently read: How we reindexed 36 billion documents in 5 days within the same Elasticsearch cluster. You may not share my enthusiasm for the paper's detail. Read it if you do; trust my very high-level summary if you don't. If you choose the latter, know that I believe it to be a fine piece of work by author Fred de Villami, who until recently was the VP of Engineering at Synthesio, the company where the work described in the paper was done.

For the record, I have no special knowledge of Synthesio's market, and I have no reason to believe that Lucene isn't the perfect tool to solve their problems. In fact, I assume it is. My focus is on eDiscovery tools based on Lucene that are asking for your data governance business. My thesis is that if you give that business to them, and take them behind your firewall to solve your data privacy and data security problems, your project will fail, for the reasons hidden in the detail of de Villami's paper. To understand why, you need to first understand the differences between eDiscovery and Data Governance.

There are three key differences:

  1. Scale. eDiscovery is a gigabyte problem. Data Governance is a terabyte-to-petabyte problem. That's three to six orders of magnitude of difference. Technically solvable by clustering into a federation, but at what cost?
  2. Time Frame. eDiscovery is about well-known time periods in the past, so the data set is by definition static. Data Governance is about the past, present, and future, a far more complicated undertaking.
  3. Accuracy. Let's be honest here: eDiscovery has two parties, and at least one has an incentive to hide data from the other. I'm not alleging explicit misdeeds, but there is always relentless arguing about limiting the scope of discoverable data, which of course goes to #1 above. There is always at least one party who would just as soon not find what the other is looking for. The victim of inaccuracy in Data Governance is the customer themselves; it is in their interest to find data responsive to governance initiatives and remediate it as soon as possible, thereby eliminating as many risks as possible.

On all three counts, Lucene is an adequate platform for eDiscovery. On all three, it falls far short as a platform for Data Governance.

Scale

Indexing unstructured data using the techniques implemented in Lucene is a random-I/O-intensive process. That means the bottleneck is going to be disk seek time, the slowest moving part of a traditional computer. The only way to truly overcome this bottleneck is to design around it in your indexing process. I detailed how to do that in The Secret To Building an Enterprise Index over two years ago. You can't retrofit an existing design with this approach. It has to be built in from day one.
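To make the I/O point concrete, here is a minimal, purely illustrative sketch (it is not the design from my paper, and every name in it is hypothetical). The contrast is between a seek-bound pattern that rewrites posting lists on disk as documents arrive and a batch-oriented pattern that accumulates postings in memory and writes each segment in a single sequential pass.

```python
# Minimal sketch, for illustration only: the point is the I/O pattern,
# not the data structures or the file format.
from collections import defaultdict

def index_batch_sequentially(documents, segment_path):
    """Accumulate postings in memory, then write one sorted segment in a
    single sequential pass -- no per-term disk seeks during indexing."""
    postings = defaultdict(list)              # term -> [doc_id, ...]
    for doc_id, text in documents:
        for term in set(text.lower().split()):
            postings[term].append(doc_id)
    with open(segment_path, "w") as segment:  # one long sequential write
        for term in sorted(postings):
            segment.write(term + "\t" + ",".join(map(str, postings[term])) + "\n")

# A seek-bound design does the opposite: it locates and rewrites each term's
# posting list on disk as documents arrive, paying one or more random I/Os
# per term per document -- exactly the bottleneck described above.
```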

You'd think that all the SSDs configured in de Villami's project would have eliminated this defect in Lucene. I certainly would have. But consider this: de Villami used 75 nodes to process 138TB of data, and completed indexing in 5 days. Using the process I outlined in my paper, my company's software could have done that indexing with one node in ten days. If you really had to do it in 5 days, I could have done it with two nodes. Assume the cost of each of those nodes to be $15,000. Then we are talking a total hardware purchase price of $1,125,000 versus either $15,000 or $30,000. And I'm not counting the maintenance costs for that hardware, the backup costs, the replacement costs, or the racking, cooling, power, and personnel costs.
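If you want to check the arithmetic, it reduces to this (the $15,000 per node is my assumption, as stated above):

```python
# Back-of-the-envelope hardware comparison using the figures above.
node_cost = 15_000
cluster_cost = 75 * node_cost   # de Villami's 75 nodes -> $1,125,000
one_node = 1 * node_cost        # $15,000, indexing in ~10 days
two_nodes = 2 * node_cost       # $30,000, indexing in ~5 days
print(cluster_cost, one_node, two_nodes)
```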

So much for free software, no? The cost of my company's software would be a fraction of the cost of that hardware. Keep in mind that 138TB may be a large amount of data for de Villami, but it is a small amount in the data governance world, by an order of magnitude or more, which only amplifies the cost discrepancy.

Of course, indexing is just one aspect of scale. Queries must scale as well. De Villami demonstrates that the secret to query scalability is avoiding disk latency bottlenecks. How does he do that? By allocating enough SSDs to his infrastructure to make the problem go away. How do you do that cost effectively? By making the index as small as possible, so you can minimize the amount of SSD storage required. A purpose-built full content index can be as small as 5% of the data indexed. Even if you overprovision that by 50%, you are talking about roughly 10TB of SSD space to store an index for 138TB of data.
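The sizing works out like this, using the 5% index ratio and the 50% overprovisioning I assumed above:

```python
# SSD budget for a purpose-built index at ~5% of the source data,
# overprovisioned by 50%.
data_tb = 138
index_ratio = 0.05
overprovision = 1.5
ssd_tb = data_tb * index_ratio * overprovision
print(ssd_tb)   # about 10 TB of SSD to index 138 TB of data
```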

Time Frame

There is a very long list of requirements for managing live data behind the firewall that are not necessary when you are indexing static data with litigation tools. For now, I'll focus on one - the data must be managed in place. There is no time to make copies of it first, and the copies themselves are likely to lose information in the process, or at least make the preservation of that information another challenging requirement. Yet ever since the dawn of the EDRM, eDiscovery workflows have relied upon a separate collection process to copy, unpack, and prepare the data as a mandatory prelude to indexing it. In the de Villami paper, more time is spent copying the data than is spent indexing it.

In the petabyte class data governance world, it's just not practical to copy. Plus, you lose an understanding of how your data is changing over time.

For the record, the process I proposed in the above citation eliminates the need to do the data copy at all. The indexing is all done in place without making any additional copies. The performance numbers I quoted above assume that. So maybe taking 10 days with one node is acceptable, when you consider the total time it would take to copy and index the data with the de Villami workflow.
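For the curious, here is a bare-bones sketch of what "in place" means in practice. It is not our product's actual interface; the indexer object is a hypothetical stand-in. The point is simply that content is streamed straight from where it lives, and nothing is staged or copied first.

```python
# Illustrative only: walk a live file share and feed content directly to
# an indexer, keeping the original path as the document's identity.
# "indexer" is a hypothetical object with an add() method, not a real API.
import os

def index_in_place(root, indexer):
    for dirpath, _subdirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    indexer.add(doc_id=path, content=fh.read())
            except OSError:
                continue  # files we cannot read are skipped, never copied
```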

Accuracy

When you can't scale to manage the petabytes of data involved in a governance project, you are starting at a disadvantage. Of course, on occasion, eDiscovery projects do reach that size. The knee-jerk reaction from eDiscovery vendors faced with a project like that is to fall back on random sampling: indexing only a randomly selected fraction of the data on the assumption that the sample will faithfully represent the whole. The quick arithmetic below shows just how badly that assumption can fail.
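Assuming a simple uniform random sample, sampling a fraction p of the corpus finds any single key document with probability p, and finds all k key documents with probability p to the power k. The numbers are not kind:

```python
# Odds of capturing the documents that matter under uniform random sampling.
def chance_of_finding_all(sample_fraction, key_documents):
    return sample_fraction ** key_documents

for p in (0.10, 0.25, 0.50):
    print(f"index a {p:.0%} sample -> one key document found {p:.0%} of the time, "
          f"all three found {chance_of_finding_all(p, 3):.1%} of the time")
```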

In my opinion, that's professional malpractice. Many matters rest on finding exactly one piece of evidence to prove or disprove your case. We once had a customer that had been looking for a single signed contract for over a year. The existence of that contract was vital to the competitive future of the company. Because of the size of the search space, they had been forced to limit their searches to data from suspected custodians. Vendor after vendor couldn't find it, until they deployed our solution, which not only could search everything, but searched it all directly from a backup tape made shortly after the contract was executed - without having to restore the tape. They found it in less than a day, and saved the case.

Accuracy means getting access to data even when it's difficult to access, because it's stored in obsolete formats, or on difficult media like backup tapes, or has some corruption that makes it difficult to read (but not impossible), or has been deleted but the space has not been reclaimed so it's easy to recover. These are almost impossible problems to solve with Lucene - but all very possible when using the process detailed in the posting I cited above.

Because in the Data Governance world, Architecture Matters

If you are thinking, "Okay, fine, the open source community will just have to fix these problems in Lucene," sorry to break it to you, but it's not gonna happen. These problems go to the core of Lucene's architecture, and for that matter, to the core of the Internet search architecture upon which Lucene was based. That architecture works well for the Internet, where data is published, not managed in place, and where your investment in an indexing infrastructure can be amortized over an enormous addressable market: the free world. Neither is the case in the Enterprise.

Data Governance cannot be solved with a one-size-fits-all technology. It's one of those rare problems worth solving with purpose-built technology. It can only be solved by studying the problem in great detail, understanding where the bottlenecks are, and building the solutions to those bottlenecks into the product at a deep, architectural level.

Still not convinced? Feel free to respond with your rationale in a comment to this post.

Want to learn more? You can read about the performance you should expect when doing GDPR data governance projects in Last Minute Tips To Get You Ready For The GDPR. If you like, we can help you do an assessment. Or answer your questions - just contact us, and let us know how we can help.

