E-Discovery Terminology for Every Litigator
Reading Time: 12 minutes
On September 29, 2016, the Florida Supreme Court amended Rules 4-1.1 and 6-10.3 of the Rules Regulating The Florida Bar. As a result, Florida attorneys will now be required to obtain three credit hours of CLE in approved technology programs. Further, language was added to the Comment to Rule 4-1.1 (Competence), which reads as follows:
Competent representation may also involve the association or retention of a non-lawyer advisor of established technological competence in the field in question. Competent representation also involves safeguarding confidential information relating to the representation, including, but not limited to, electronic transmissions and communications.
What does this mean for litigators dealing with electronic evidence? I like to say that the realm of eDiscovery is a melding of legal and IT, two groups of professionals who speak different languages, and who under typical circumstances do not care to speak the language of the other. We have all heard that attorneys speak legalese, and we know of the “IT speak” that flies over the heads of IT industry outsiders. How can we manage the additional eDiscovery industry terminology that is a necessary part of the conversations between IT and legal that surround litigation or a government investigation? At the time of my entry into the field of eDiscovery in 2013, I had not even heard of the term metadata, much less jargon like DeNIST, TIFFing, load file, and so on.
Fortunately for me, I didn’t have to perform a Google™ search of every other word I came across, because I had a friendly (and well-versed) Project Manager as my guide, and he took the time to answer my questions whenever they arose. Enter Parke McManis, Esq., RCA, CEDS, the co-author of this article, who is Managing Director, Mid-Atlantic, for Complete Discovery Source. I made a list of my favorite eDiscovery vocabulary words and Parke has given his plain English definition and explanation for each of the terms below.
Parke and I will be co-authoring a series of articles which will join the perspectives of attorney and eDiscovery Project Manager. We hope to bring a uniquely eDiscovery point of view to issues that arise in handling electronic evidence that must be looked at from both legal and IT angles.
Without further ado, here’s an eDiscovery Vocabulary List with Parke’s definitions. Keep in mind that these definitions are intentionally broad, intended to give a basic understanding that may just barely scratch the surface for some of these terms. We hope it helps any newbies out there ease into their next eDiscovery project.
Parke McManis:
Document: a specific file, such as an email or Word document. Sometimes a specific file (“Parent”) can also have files within it (“Children”), and that entire group of documents is called a “Family.” So if we have an email that has two attachments, the email is the Parent document, the two attachments are Children documents, and all three are considered a Family. A document can contain many pages, but generally it must be converted or printed to an image format (see “TIFFing” below) to determine how many pages it contains.
Native: a document in its native format, that is, the format in which it would naturally appear on your computer, such as a Microsoft Word or Excel document. This is opposed to an image format, such as the TIFF format explained below.
Basic Linear Review: reviewing documents one after another as they naturally appear in a collection. This is the most basic way of reviewing documents and is the electronic equivalent to opening up a banker’s box and starting to go through documents one by one.
TIFF/TIFFing: TIFF (Tagged Image File Format) is simply an image format, like JPEG. What is generally referred to as “TIFFing” is the act of converting, or printing, a native file to this image format, much like when you take a Word document and “Print to PDF.” Documents are TIFFed for production for much the same reason you would print a document to PDF before sending it to someone: it memorializes the document. TIFFing also allows additional features to be added to the document without altering the original version, such as Bates numbers, confidentiality designations, and redactions.
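For readers curious about the mechanics, here is a minimal sketch of saving a single page image in TIFF format using Python’s Pillow library. The file names are hypothetical, and in practice eDiscovery processing software performs native-to-TIFF conversion in bulk rather than one page at a time.

```python
# A minimal sketch, assuming the Pillow library is installed (pip install Pillow).
# File names are hypothetical; real processing tools convert native files in bulk.
from PIL import Image

page = Image.open("page_001.png")         # a rendered page of a native document
page.save("page_001.tif", format="TIFF")  # "TIFFing": saving the page as a TIFF image
```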
Custodian: the person or entity that is responsible for, has administrative control over, or has access to electronically stored information (“ESI”) – basically it is who “owns” a document. For example, you are the custodian of your emails, your cell phone, and the files on your computer.
Culling: a broad term for the act of removing documents from a collection in order to reduce its size, and therefore the size of the population that has to be reviewed. Some standard ways to cull are DeNISTing, deduplication, applying date ranges, running search terms, and some forms of analytics.
DeNIST: The National Institute of Standards and Technology (“NIST”) maintains a running list of non-user-generated file signatures that it has established as having little or no value for litigation purposes; in other words, it is an industry-accepted list of “junk” files (mostly program and system files that do not contain user-generated data). When you “DeNIST” a collection of ESI, you are simply removing these industry-accepted junk files from the collection. Examples of files removed during this process are the files created on your computer when you install a new program, such as Adobe Acrobat Reader™.
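Conceptually, DeNISTing is just a filter: compute a signature (a hash value, explained next) for each file and drop any file whose signature appears on the NIST list. Here is a minimal sketch in Python; the nist_hashes set and the “collection” folder are hypothetical stand-ins for the real NIST list and a real collection.

```python
# A minimal sketch of DeNISTing. The nist_hashes set and the "collection"
# folder are hypothetical stand-ins for the real NIST list and a real collection.
import hashlib
from pathlib import Path

def md5_of(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()

nist_hashes = {"d41d8cd98f00b204e9800998ecf8427e"}  # placeholder entry only

kept = [p for p in Path("collection").rglob("*")
        if p.is_file() and md5_of(p) not in nist_hashes]
```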
Hash Value: a value computed from a piece of data by an algorithm that takes into account the entire contents of that data. These algorithms are accepted by the industry as reliable enough that if two pieces of data have the same hash value, they are considered identical. A hash value can be assigned to an individual file to identify duplicate files in a collection, or an entire collection of files can be assigned a hash value to authenticate that the data set has not been altered, by showing that the hash value of the data set has not changed.
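As a simple illustration, here is a minimal Python sketch showing that even a one-character change to the data produces a completely different hash value (SHA-1 is used here purely as an example algorithm):

```python
# A minimal sketch: any change to the data changes its hash value.
import hashlib

original = b"Meeting moved to 3 PM"
altered  = b"Meeting moved to 4 PM"

print(hashlib.sha1(original).hexdigest())  # one hash value
print(hashlib.sha1(altered).hexdigest())   # a completely different hash value
```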
Deduplication: the process of removing duplicate files from a collection of ESI based on their hash values. If two documents, or a family of documents, in a collection have the same hash value, one of them is removed.
Global vs. Custodial Deduplication: two different ways of running deduplication. Custodial Deduplication removes all duplicate files within a single custodian’s collection. Global Deduplication removes all duplicates across all custodians in a matter. It is a judgment call as to which method you use: global deduplication can result in fewer documents to review, but custodial deduplication ensures that a custodian’s full collection is kept intact. (A short sketch of both approaches follows the example below.)
- Example: In the regular course of business, Custodian A emails Custodian B, and then Custodian B saves a copy of the email to his Archive folder. This same email now exists in three places: 1) Custodian A’s Outbox, 2) Custodian B’s Inbox, and 3) Custodian B’s Archive folder. Now let’s say that both Custodian A and Custodian B are people of interest in a litigation and both of their emails are collected, but Custodian A is a higher priority custodian.
- Custodial Deduplication: If we dedupe this collection by custodian, Custodian A will end up with one copy of the email, and Custodian B will also end up with only one copy, because the second copy in Custodian B’s collection will be counted as a duplicate and removed.
- Global Deduplication: If the collection is globally deduped, Custodian A will end up with one copy of the email, but both copies will be removed from Custodian B’s collection. The document was removed from Custodian B’s collection because Custodian A was considered the higher-priority custodian, and only one version of a document can exist when globally deduping. Usually a record will be kept in the database that Custodian B also had this document.
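Here is a minimal sketch of the two approaches applied to the example above. The records and hash values are hypothetical, and real review platforms handle this through their own deduplication settings.

```python
# A minimal sketch of the two deduplication approaches.
# Each record pairs a custodian with a document's hash value; data is hypothetical.
docs = [
    ("Custodian A", "hash-123"),  # A's Outbox copy (higher-priority custodian, listed first)
    ("Custodian B", "hash-123"),  # B's Inbox copy
    ("Custodian B", "hash-123"),  # B's Archive copy
]

def deduplicate(records, key):
    """Keep the first record seen for each key; drop later duplicates."""
    seen, kept = set(), []
    for record in records:
        if key(record) not in seen:
            seen.add(key(record))
            kept.append(record)
    return kept

# Custodial: duplicates are judged within each custodian (custodian + hash).
custodial = deduplicate(docs, key=lambda r: (r[0], r[1]))

# Global: duplicates are judged across all custodians (hash only);
# Custodian A's copy is kept because it appears first (higher priority).
global_dedup = deduplicate(docs, key=lambda r: r[1])

print(custodial)     # A keeps one copy, B keeps one copy
print(global_dedup)  # only Custodian A's copy remains
```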
Technology Assisted Review (TAR): the broad concept of using technology to organize or expedite review. The term broadly refers to many methods of technology assistance, including analytics, but it is most frequently associated with Predictive Coding, a specific type of analytics that will be discussed in a future article.
Image (Forensic Image): a bit-by-bit copy of a computer’s hard drive, which essentially equates to a full and exact copy of the entire computer. Once an image is taken, you can open it up, or “mount” it, to look at the image exactly the same way you would have looked at it on the original computer the moment the image was collected. This is the most inclusive, broad, and complete form of collection. One of its biggest advantages is that it also captures the “unallocated space” on the drive, which is where deleted fragments of files can still be recovered if that is relevant to your investigation.
Forensically Sound Collection: In most cases, a full forensic image is not necessary and a more “targeted” collection method is sufficient. If that is the case, any number of other collection methods may be used as long as they are “forensically sound.” This term means that the collection happens in a manner that ensures the collected documents, including their metadata, are not altered in any way and that the resulting collected documents are identical to the documents as they originally existed. One way to prove that the documents are unaltered is by assigning hash values to the collection of documents, as mentioned above.
Metadata: data about a file that the file records within itself, often automatically. It can record various elements of the file, such as the name of the document or when it was created. In an email you would find metadata such as the time the email was sent, who sent it, who received it, and so on. Different files store different types of metadata. For example, some basic image files have almost no metadata, but a Microsoft Word document contains hundreds of pieces of metadata, including when the document was created, last printed, and last modified. During processing, as mentioned below, metadata is extracted from a file to make it searchable.
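As an illustration, here is a minimal sketch of pulling a few pieces of email metadata out of a message using Python’s built-in email module; the message itself is a made-up example.

```python
# A minimal sketch using Python's built-in email module; the message is hypothetical.
from email import message_from_string

raw = """From: custodian.a@example.com
To: custodian.b@example.com
Date: Mon, 03 Apr 2017 09:15:00 -0500
Subject: Q2 forecast

Please see the attached spreadsheet."""

msg = message_from_string(raw)
print(msg["From"], msg["To"], msg["Date"], msg["Subject"])  # a few metadata fields
```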
Searchable Text: the body of a document is made searchable when the text of the document is indexed. Just like an index in a book records pages on which various words appear, this index records where various words are located within a collection of documents. Thereafter, when you run a Google-like search for a particular term across a body of documents, the search is performed by referencing the index to find which documents contain that search term. For that index to be created, the text of the document must first be obtained and recorded. This text can be obtained by extracting it from the document, if the document stores the text, or through OCR, which is explained below.
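The book-index analogy translates almost directly into code. Here is a minimal sketch of building such an index; the document IDs and text are hypothetical.

```python
# A minimal sketch of an index of searchable text: for each word, record
# which documents contain it. Document IDs and contents are hypothetical.
documents = {
    "DOC-001": "oil prices rose sharply in march",
    "DOC-002": "the board discussed oil futures",
    "DOC-003": "minutes of the march board meeting",
}

index = {}
for doc_id, text in documents.items():
    for word in set(text.split()):
        index.setdefault(word, set()).add(doc_id)

# A search for "oil" is answered from the index, not by rereading every document.
print(sorted(index["oil"]))  # ['DOC-001', 'DOC-002']
```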
Processing: the process by which metadata and searchable text are extracted from a Native file and put into a usable (i.e., searchable) format. Once this data is put into a database, or review platform, you may search for documents based on the metadata or searchable text mentioned above. This metadata and text are also what analytics examine when they are run on documents, which will be covered in a future article. Deduplication and DeNISTing of the collection often also occur during processing. Some software will process documents directly into a review platform, whereas other software will extract the metadata and searchable text and put them in a load file.
Load File: Similar to an Excel spreadsheet, a database is neatly organized into rows (referred to as “documents”) and columns (referred to as “fields”). One way to populate a database is through a load file that contains metadata and other data related to the documents. It may look like a jumble of text if you open it up, but it is actually organized in a way the database can understand, indicating which values go into which fields for which documents.
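Here is a minimal, simplified sketch of what a load file is doing; the field names, delimiter, and values are hypothetical, and real load files come in several industry formats.

```python
# A minimal sketch of a load file: delimited text telling the database which
# values go into which fields for which documents. Format and values are hypothetical.
import csv, io

load_file = """BEGBATES|CUSTODIAN|DATESENT|SUBJECT
ABC000001|Custodian A|2017-04-03|Q2 forecast
ABC000004|Custodian B|2017-04-05|Re: Q2 forecast"""

rows = list(csv.DictReader(io.StringIO(load_file), delimiter="|"))
print(rows[0]["CUSTODIAN"])  # Custodian A
```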
OCR: Optical Character Recognition (“OCR”) is a way to get searchable text from a document that does not have any. Essentially, this technology converts a picture or image of text into usable text by scanning the image to identify each character and recording it as text. Since it works based solely on the appearance of the letters, lower-quality images can yield bad OCR results, and similar-looking words or letters frequently get mixed up. For example, the word “Oil” may be recorded as “Oll” or “Oii” if the image quality is poor and the OCR software cannot distinguish between an “i” and an “l.”
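For the curious, here is a minimal sketch of running OCR from Python using the pytesseract library; it assumes the Tesseract OCR engine and Pillow are installed, and the file name is hypothetical.

```python
# A minimal sketch, assuming pytesseract, Pillow, and the Tesseract engine are installed.
# The file name is hypothetical.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("scanned_contract_page1.tif"))
print(text)  # low-quality scans may yield errors such as "Oll" for "Oil"
```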
Suzanne Clark:
I am a big believer in working in one’s core competency and then pulling together a team of experts to achieve the goals and produce the best end results for clients. Apparently, the Supreme Court of Florida agrees. Florida attorneys are bound either to be technologically competent themselves or to seek out the expertise of technologically competent professionals. What I have found is that by associating with professionals like Parke McManis, studying the rules and case law, and participating in continuing legal education and product demonstrations, becoming an attorney competent in technology is an absolutely attainable goal.
Stay tuned for our future article explaining terminology relating to Predictive Coding and Analytics.
By: Suzanne Clark, Esq., CEDS and Parke McManis, Esq., RCA, CEDS