By using their databases for “knowledge discovery,” healthcare organizations can transform the islands of information in which they are drowning into assets.
In the last few years, the term “data mining” has become common parlance in business. For a while, it was a buzzword for any kind of information processing that involved building databases and extracting data for use by computer decision-support systems. However, with the rise of gigantic databases and the challenges they bring, the term has taken on a more precise meaning.
The analysis phase of the knowledge discovery in databases (KDD) process spans a spectrum that begins at the point of data collection and ends with the production of statistics and reports. Unlike the legal profession, in which “discovery” means finding critical information to present at trial, in the data mining domain the term refers to using data analysis to detect new associations or patterns that can help enterprises address vexing problems or find new customers as part of their customer relationship management programs.
Additionally, good data mining should produce results readily understood by business managers. Performance analytics can be developed to improve the efficiency of operational processes and reduce costs, provide actionable insights based on key performance indicators such as customer satisfaction rates, and determine the best leverage points for financial investments.
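To make the idea concrete, the short sketch below counts which pairs of services co-occur across a handful of invented encounters, a minimal stand-in for the association and pattern detection described above; the item names and the support threshold are purely illustrative.

```python
from collections import Counter
from itertools import combinations

# Illustrative transactions (e.g., services used per patient encounter);
# the item names and the threshold below are hypothetical.
encounters = [
    {"lab_panel", "imaging", "cardiology_consult"},
    {"lab_panel", "cardiology_consult"},
    {"imaging", "physical_therapy"},
    {"lab_panel", "cardiology_consult", "physical_therapy"},
]

pair_counts = Counter()
for items in encounters:
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

# Report item pairs that co-occur in at least half of the encounters --
# a crude stand-in for the "new associations or patterns" KDD aims to surface.
min_support = len(encounters) / 2
for pair, count in pair_counts.items():
    if count >= min_support:
        print(f"{pair[0]} + {pair[1]}: seen in {count} of {len(encounters)} encounters")
```

Real mining systems work over millions of records and far richer features, but the underlying question, which events keep appearing together, is the same.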
The KDD process appears straightforward, but as data generation accelerates, processing and storing data have become both more challenging and more necessary.
Size of the Problem
Due to the rapid proliferation, ubiquity, and increasing power of computer technology, the need for high data integrity and intelligent indexing of information storage has never been greater. In the paper “From Data Mining to Knowledge Discovery in Databases,” Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth discuss the “urgent need for a new generation of computational theories and tools to assist humans in extracting useful information (knowledge) from the rapidly growing volumes of digital data. These theories and tools are the subject of the emerging field of knowledge discovery in databases.”
The authors say KDD boils down to learning to use tools to make sense of data for human consumption. “The traditional method of turning data into knowledge relies on manual analysis and interpretation,” they wrote, but manual analysis can no longer handle the mind-boggling amounts of data requiring review.
Such concerns may seem far removed from the interests of everyday Americans. What a surprise it was, then, when The Information: A History, a Theory, a Flood, James Gleick’s account of the genesis of our current information age, made the nonfiction best seller lists in 2011. Believe it or not, consumers were eager to purchase a book discussing the history of bits (the units of measure for computer information). By speaking to the overwhelmed mental state of today’s information recipients, Gleick became a best-selling author.
The New York Times notes the magnitude of available information in its book review, stating that “James Gleick has such a perspective, and signals it in the first word of the title of his new book … using the definite article we usually reserve for totalities like the universe, the ether—and the Internet. Information, he argues, is more than just the contents of our overflowing libraries and Web servers. It is ‘the blood and the fuel, the vital principle’ of the world. Human consciousness, society, life on earth, the cosmos—it’s bits all the way down.”
Spanning centuries of communication, Gleick’s narrative reaches its key moment with Claude Shannon, a young mathematician with a background in cryptography and telephony, and the 1948 publication of his paper “A Mathematical Theory of Communication.” For Shannon, communication was purely a matter of sending a message over a noisy channel so that someone else could recover it. Whether the message was meaningful, he said, was “irrelevant to the engineering problem.” In other words, he invented what we today call information theory: the idea that information can be quantified separately from its meaning.
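Shannon’s separation of information from meaning can be made concrete with his entropy measure, which depends only on how often symbols occur, never on what they say. The snippet below is a minimal sketch using the standard definition of entropy in bits; the sample strings are invented.

```python
import math
from collections import Counter

def entropy_bits(message: str) -> float:
    """Shannon entropy of a message's character distribution, in bits per symbol."""
    counts = Counter(message)
    total = len(message)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# The measure ignores what the text means; it depends only on symbol frequencies.
print(entropy_bits("aaaa"))                # 0.0 -- perfectly predictable
print(entropy_bits("abcd"))                # 2.0 -- four equally likely symbols
print(entropy_bits("to be or not to be"))  # somewhere in between
```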
The logical outcome of all this is that enormous volumes of information are generated, collected, and stored without thought to the meaning inherent in the stored data. It is through KDD, armed with the right tools, that the meaning lying within these infinite mansions of data can be recognized. A further challenge involves where data are stored and how safe and accessible they remain.
Cloud Conundrum
Over the past decade, cloud vendors have assured everyone—even private individuals—that the answer to their prayers for keeping their data forever lies in the cloud. However, cracks in the cloud promise have become increasingly evident.
In a November 14, 2011, New York Times article, “Internet Architects Warn of Risks as Ultrafast Networks Mushroom,” Quentin Hardy notes that “if nothing else, Arista Networks [a company providing cloud networking solutions] proves that two people can make more than $1 billion each building the Internet and still be worried about its reliability.” Billionaires David Cheriton, a computer scientist at Stanford, and Andreas Bechtolsheim, a cofounder of Sun Microsystems, go on to express concerns about “the promise of having access to mammoth amounts of data instantly, anywhere” that is “matched by the threat of catastrophe.”
Perhaps most telling, Arista Networks’ founders caution that it would be a mistake to assume the vast amounts of data “stored” on the Internet are permanent. “We think of the Internet as always there,” they say. “Just because we’ve become dependent on it, that doesn’t mean it’s true.”
Although they were among the first to invest in Google, Cheriton and Bechtolsheim recognize the challenge of actually archiving and mining virtually infinite amounts of data.
Big Industry Mines for Gold
Oscar Wilde once lamented, “It is a very sad thing that nowadays there is so little useless information.” Today’s knowledge seekers face the opposite problem: thanks to a lack of tools to sort through growing mountains of data, they have virtually nothing but useless information.
In a 2010 special report for The Economist on managing information, Kenneth Cukier explores the many challenges of building satisfactory data repositories. Cukier goes beyond “data exhaust” (ie, the trail of clicks left behind by Web users from which value can be extracted) to detail how data mining is becoming the foundation of the Internet economy. He targets healthcare as one of the key markets likely to benefit from the ability to aggregate and analyze data.
According to the article, heavy hitters such as Craig Mundie, chief research and strategy officer at Microsoft, and Google Executive Chairman Eric Schmidt sat on a presidential task force to reform American healthcare. “If you really want to transform healthcare, you basically build a sort of healthcare economy around the data that relate to people,” Mundie explained. “You would not just think of data as the ‘exhaust’ of providing health services, but rather they become a central asset in trying to figure out how you would improve every aspect of healthcare.”
Digital medical records are supposed to make life easier for physicians, bring down costs for providers and patients, and improve the quality of care. Some of that promise is being realized. Since 1998, the Uppsala Monitoring Centre has used data mining methods to routinely screen for reporting patterns indicative of emerging drug safety issues in the World Health Organization global database of 4.6 million suspected adverse drug reaction incidents. Being able to spot unwanted drug interactions or adverse reactions can lead to more effective treatments.
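The Uppsala Monitoring Centre’s actual screening relies on more sophisticated statistical and Bayesian measures, but the underlying idea of disproportionality can be illustrated simply: compare how often a reaction is reported with a given drug against how often it is reported with everything else. The sketch below computes a proportional reporting ratio from invented counts.

```python
def proportional_reporting_ratio(a: int, b: int, c: int, d: int) -> float:
    """
    PRR for a drug-reaction pair in a spontaneous-report database.
      a: reports mentioning both the drug and the reaction
      b: reports mentioning the drug but not the reaction
      c: reports mentioning the reaction with other drugs
      d: reports mentioning neither
    """
    return (a / (a + b)) / (c / (c + d))

# Hypothetical counts, not real WHO database figures.
prr = proportional_reporting_ratio(a=40, b=960, c=600, d=98400)
print(f"PRR = {prr:.1f}")  # values well above 1 flag a possible signal worth expert review
```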
The Economist article highlights the work of Carolyn McGregor, PhD, of the University of Ontario Institute of Technology, who, in conjunction with IBM, has managed to spot potentially fatal infections in premature babies. The IBM system is able to monitor “subtle changes in seven streams of real-time data, such as respiration, heart rate, and blood pressure. The electrocardiogram alone generates 1,000 readings per second.”
It is easy to surmise that such a staggering amount of data could not be printed on paper and then analyzed in time to save a baby. However, because McGregor can feed the data through her computer, she is able to detect the onset of an infection before obvious symptoms emerge.
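The article does not spell out the system’s algorithms, but the flavor of real-time stream monitoring can be conveyed with a toy example: flag any reading that drifts far from a rolling baseline. The window size, threshold, and simulated heart rates below are all invented.

```python
from collections import deque
from statistics import mean, stdev

def flag_anomalies(stream, window=60, threshold=3.0):
    """Yield (index, value) when a reading deviates from the rolling baseline
    by more than `threshold` standard deviations. Purely illustrative."""
    recent = deque(maxlen=window)
    for i, value in enumerate(stream):
        if len(recent) >= 10:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
        recent.append(value)

# Simulated heart-rate stream: a stable baseline followed by an abrupt shift.
readings = [150 + (i % 3) for i in range(120)] + [190 + (i % 3) for i in range(10)]
for index, value in flag_anomalies(readings):
    print(f"reading {index}: {value} bpm deviates from the recent baseline")
```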
Biting Into Unstructured Health Information
Although many wonders can be achieved through the aggregation and analysis of structured information, it is in the unstructured documents characteristic of EHRs that the real treasure lies. Estimates vary, but it is often surmised that 50% or more of health information resides in narrative reports such as history and physicals.
However, extracting and aggregating information from these documents has remained elusive. Indeed, this is one of the primary goals of today’s healthcare entities, which recognize they must be able to reach and gather all patient information to ensure patient safety and provide cost-effective care.
The Health Story Project, a national initiative formerly known as CDA4CDT, is an alliance of healthcare vendors, providers, and associations that has pooled resources over the last three years in a fast-moving mission to produce data standards for the flow of information between common types of healthcare documents and EHRs.
One of the group’s subprojects involving unstructured documents, the HL7 Implementation Guide for CDA (clinical document architecture) R2 (published in August 2010), focuses on the accessibility of the information contained in these documents. The project’s Structured Documents Work Group states that “extensive and advanced interoperability and effective information systems for patient care must consist of CDA Level 2 and Level 3 documents, yet the full understanding, use, and compliance of CDA Level 2 and 3 is likely years away by all major participants, including EHR system vendors, providers, payers, carriers, etc. Much of the patient record still exists and is being captured in an unstructured format that is encapsulated within an image file or as unstructured text in an electronic file such as a Word or PDF document. There is a need to raise the level of interoperability for these documents to provide full access to the longitudinal patient record across a continuum of care. Until this gap is addressed, image and multimedia files will continue to be a portion of the patient record that remains difficult to access and share with all participants in a patient’s care. This project addresses this gap by providing consistent guidance on use of CDA for unstructured documents.”
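In practice, the guide’s approach amounts to wrapping an existing file, such as a PDF of a history and physical, in a thin layer of CDA metadata so it can travel with the rest of the record. The snippet below is a rough, nonconformant sketch of that pattern (a real document requires a complete CDA header); the file name is hypothetical.

```python
import base64
from xml.etree import ElementTree as ET

def wrap_pdf_as_cda_body(pdf_path: str) -> str:
    """Embed a PDF in a minimal CDA-style nonXMLBody wrapper.
    Illustrative only -- a conformant document needs a full CDA header."""
    with open(pdf_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")

    doc = ET.Element("ClinicalDocument", xmlns="urn:hl7-org:v3")
    component = ET.SubElement(doc, "component")
    body = ET.SubElement(component, "nonXMLBody")
    text = ET.SubElement(body, "text", mediaType="application/pdf", representation="B64")
    text.text = encoded
    return ET.tostring(doc, encoding="unicode")

# Usage (hypothetical file name):
# print(wrap_pdf_as_cda_body("discharge_summary.pdf")[:200])
```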
While there are national consortiums such as the Health Story Project trying to make unstructured information accessible and exchangeable through formatting standards, private vendors such as Clinithink are also working to make the content of unstructured documents searchable and “mineable.” Although rich information exists in the unstructured portions of health records, such as discharge summaries, history and physicals, operative reports, and progress notes, historically this information and the knowledge that might be acquired from it have remained isolated and inaccessible.
With technologies in place that can convert unstructured narrative into structured data (which can be mined) and natural language processing tools that permit advanced searching of information previously locked inside text, Clinithink may be on track to make all EHR information available.
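As a toy illustration of what natural language processing makes possible (commercial engines such as Clinithink’s are far more sophisticated), the sketch below pulls a few concepts out of a narrative note with simple pattern matching; the term list and the sample note are invented.

```python
import re

# Hypothetical term-to-concept lookup; real systems map to full terminologies.
CONCEPTS = {
    r"\bshortness of breath\b": "dyspnea",
    r"\bchest pain\b": "chest pain",
    r"\bhigh blood pressure\b|\bhypertension\b": "hypertension",
}

def extract_concepts(note: str) -> set:
    """Return the set of concepts whose surface patterns appear in the note."""
    found = set()
    for pattern, concept in CONCEPTS.items():
        if re.search(pattern, note, flags=re.IGNORECASE):
            found.add(concept)
    return found

note = "Patient reports chest pain and shortness of breath; history of hypertension."
print(sorted(extract_concepts(note)))  # ['chest pain', 'dyspnea', 'hypertension']
```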
Tielman Van Vleck, PhD, a graduate of Columbia University’s biomedical informatics program and director of language processing at Clinithink, analyzes health information across clinical notes, looking for patterns that may be relevant to existing clinical scenarios. In essence, he’s mining previously inaccessible information in clinical narrative text and applying it to real clinical problems.
In addition, Van Vleck is working on a parser for Clinithink that will be able to extract data at a granular level, allowing for tight integration with SNOMED codes that will feed a text analytics engine, build a rich database, and crosswalk to ICD-10 for nearly automatic coding. Of course, the volume of such information will be staggering in terms of governance, archiving, indexing, and storage, but that is altogether another information frontier.
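In its simplest form, a crosswalk of the kind described could look like the sketch below, which maps extracted SNOMED CT concept identifiers to candidate ICD-10 codes through a lookup table. The pairs shown are illustrative placeholders rather than an authoritative map; a production system would rely on the published SNOMED CT to ICD-10 map and human review.

```python
# Hypothetical crosswalk from extracted SNOMED CT concept IDs to ICD-10 codes.
# The pairs below are placeholders for illustration, not an authoritative map.
SNOMED_TO_ICD10 = {
    "38341003": "I10",      # hypertensive disorder -> essential hypertension
    "267036007": "R06.02",  # dyspnea -> shortness of breath
}

def suggest_icd10(snomed_ids):
    """Return candidate ICD-10 codes for coder review; unmapped concepts are skipped."""
    return sorted({SNOMED_TO_ICD10[s] for s in snomed_ids if s in SNOMED_TO_ICD10})

print(suggest_icd10(["38341003", "267036007", "0000000"]))  # ['I10', 'R06.02']
```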
Author: Sandra Nunn