Electronic Discovery & Information Governance – Tip of the Month: AI for an Eye – The Limits of Using Technology in a Data Review
Scenario
A company is the victim of a data breach by a malicious actor and is now required to identify and notify an unknown number of individuals impacted by the breach. To determine which individuals to notify, the company must identify which documents contain protected information, extract data on impacted individuals from those documents, and use that data to determine whom to notify and by what means. This process requires a large and complex data review of documents from sources with varying degrees of uniformity and accessibility—ranging from scanned hard copy files to spreadsheets containing data for thousands of individuals.
As with any data review project, the most significant driver of costs is the sheer number of man-hours required for individual document reviewers to set eyes on the documents. Technology can increase efficiency and reduce the overall cost of the data review, but it has limitations. The types of technology available and the limitations of these technologies are key considerations when deciding how much to rely on technology in structuring a data review stemming from a cybersecurity incident.
What types of technology are out there?
There are several types of technology that could be used in the data review project described above. Text recognition1 and search technology are familiar, time-tested tools useful in every data or document review project. In addition to these technologies, new form and pattern recognition software2 driven by artificial intelligence (AI) can be useful in extracting data from uniformly structured and predictable document types. These technologies show some promise in identifying and extracting data from large document pools but, as noted below, suffer from significant deficiencies that can limit their utility.
Where is technology most useful?
In data review projects where extraction of data is a requirement of the project, traditional text recognition and the use of search terms are efficient and effective tools to exclude documents unlikely to contain information that needs to be extracted. This threshold step is used in nearly every data review or e-discovery project and is essential to establishing the scope of the data review.
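As a simplified illustration, the threshold search-term step described above can be sketched in a few lines of code; the search terms and document texts below are hypothetical examples, not terms from any actual review protocol:

```python
# Minimal sketch of a search-term culling step.
# The terms and sample documents are hypothetical illustrations.
SEARCH_TERMS = ["ssn", "social security", "date of birth", "account number"]

def likely_contains_protected_info(text: str) -> bool:
    """Return True if any search term appears in the document text."""
    lowered = text.lower()
    return any(term in lowered for term in SEARCH_TERMS)

documents = [
    "Invoice #123 - payment due on receipt",
    "Patient record: Name, Date of Birth, SSN on file",
]

# Only documents that hit a search term move on to review/extraction.
in_scope = [doc for doc in documents if likely_contains_protected_info(doc)]
```

In practice this culling is performed inside an e-discovery platform against OCR'd text rather than in a standalone script, but the logic is the same: documents with no hits are excluded from the review scope.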
In addition to the tried-and-true use of text recognition and search terms, data engineers can write Python scripts3 to extract data from documents exhibiting a high degree of uniformity. For example, certain types of Excel spreadsheets or .csv files may be structured in precisely the same manner, even if the spreadsheets contain enormous quantities of data. These types of files are often amenable to automated extraction and do not require manual review.
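For uniformly structured files, such an extraction script can be quite short. The sketch below pulls name and email fields from a .csv file; the column headers and sample records are assumptions for illustration, and a real script would also handle encoding issues, missing fields and de-duplication:

```python
import csv
import io

# Sketch: extract notification data from a uniformly structured CSV.
# The column names ("name", "email") and records are hypothetical.
sample_csv = io.StringIO(
    "name,email,account\n"
    "Alice Smith,alice@example.com,1001\n"
    "Bob Jones,bob@example.com,1002\n"
)

impacted = []
for row in csv.DictReader(sample_csv):
    # Keep only the fields needed to notify the individual.
    impacted.append({"name": row["name"], "email": row["email"]})
```

Because every row has the same columns in the same order, the script processes two rows or two million rows identically, which is why this class of file rarely needs manual review.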
Emerging form and pattern recognition technologies can also be used in data extraction projects, but these technologies have significant limitations and their utility is highly dependent on the nature of the documents at issue.
Where do you need a set of eyes?
AI-driven technology abhors irregularity. The automated extraction technologies described above struggle with document sets that are not uniform. PDF files where pages are oriented in different directions or where pages are missing or out of order can stump even advanced form and pattern recognition technologies. It is possible to carefully prepare documents for automated extraction, but this is a labor-intensive and time-consuming process that yields inconsistent and sometimes unsatisfactory results. For complex or irregular files—such as those scanned from hard copies or compiled originally by hand—it is better to rely on human reviewers.
Consider the source
Because the utility of technology depends on the format and characteristics of the documents in the data review, it is important to consider how your data is stored and what types of documentation might be leaked in a cybersecurity incident. Files generated at a local office or franchise may be scanned hard copies or otherwise compiled by individual office workers. This often contributes to a lack of uniformity in these documents—particularly when data is collected from individual facilities nationwide or worldwide. Even small variations in page order, orientation or scan quality can cause problems for form and pattern recognition technology. On the other hand, if the project consists of review and extraction of automatically generated or structured data in spreadsheets or .csv files, then technology can significantly cut down the number of man-hours required to review and extract data.
What are the timing considerations?
Technology can significantly increase the rate at which data is extracted from a document set. However, it is not an instantaneous process. Generating code and preparing documents for automated extraction can take weeks of dedicated work. If the source documents are mostly amenable to automated extraction, this time is well worth it. A manual review of a large document population can take many weeks to complete and will be significantly more expensive than the full-time labor of a small team of data engineers.
If you are not sure whether documents will be amenable to review using technology, it can be wise to begin a manual review at the outset and evaluate the suitability of documents for automated extraction on a rolling basis. E-discovery vendors can put a manual review team in place and begin reviewing documents in a matter of days. It is faster to begin a manual review and divert suitable documents into a separate, technology-driven workflow than to exhaust the capabilities of AI first and only then stand up a manual review. This is particularly true if the data review relates to a cybersecurity incident, because timely notification of impacted individuals is a requirement in many jurisdictions.
Summary
Using technology in a data review undoubtedly has its benefits, but the technology, as it currently exists, has limitations. The types of documents at issue and the time in which the review must be completed are key considerations in evaluating whether technology will benefit a particular data review project. Technology continues to improve, but, for the time being, it is important to evaluate carefully whether it will actually benefit a given project.
1 Text recognition technology, often referred to as OCR (“optical character recognition”), is a commonly used technology to make source documents searchable.
2 Form and pattern recognition technologies build on OCR technologies to evaluate whether documents are forms, such as order forms, patient records or invoices. These types of documents typically have the same categories of data in the same location in the form, which facilitates automated extraction.
3 Python is a popular general-purpose programming language used by data engineers to create scripts that perform automated processes, including data extraction, de-duplication and data cleanup.