For The Record

In a time of resource shortages within most healthcare organizations, electronic data are assets that continue to grow and accrue. Electronic clinical data, in particular, are expanding rapidly – some call it an explosion. Among the factors contributing to the proliferating data supply are the conversion of patient health records from paper to electronic formats and the data generation related to outcomes management, computer-supported disease management, and other e-health innovations and initiatives. Still, as in so many other industries, most of the data remain, at best, underused.

Yet, much of the uncultivated value of these data lies in the wealth of descriptive and predictive information that skillful manipulation and analysis can reveal. While such rich information could enable healthcare professionals to provide more timely, evidence-based, and cost-efficient care, one of the primary challenges has been figuring out how to convert the data to actionable information.

The growing sophistication of database technology and tools, coupled with the decreasing cost of more powerful computer hardware, is transforming healthcare data into a more flexible medium from which useful information can be extracted.

Data mining, the principal knowledge extraction tool for large stores of data, holds the potential to produce information that can facilitate improved, evidence-based decision making, whether financial, administrative, or clinical. The power inherent in data mining’s methodology gives it the capability to help healthcare organizations realize a significant return on their data investments. Consequently, this return can translate into safer, higher-quality care delivered more effectively and in a more cost- and resource-efficient manner.

Rush-Presbyterian-St. Luke’s Medical Center in Chicago, Ill., experienced such an improvement when a cross-functional group (clinical, technical, and administrative) used data mining to analyze the care delivered to its Medicare-Medicaid patients as a way to identify opportunities for improvement. The group made surprising discoveries about the care provided in the diagnosis-related group (DRG) category Medicare-Exempt rehab/psych/SNF and has begun making changes to improve cost efficiency, while continuing to give a high level of care to patients in this category.

Data mining offers a structure and a process for information and knowledge discovery. This problem-solving tool’s effectiveness depends not only on the specificity of the identified problems it is asked to address, but also on workable data – that is, data that are accurate, complete, and consistent.


The data mining process capitalizes on advances in computer hardware, software development, statistics, and machine learning techniques borrowed from research in artificial intelligence. It becomes most useful when the volume of data exceeds the capacity of unaided human analysis and understanding. Part of the emerging technology referred to as knowledge discovery in databases (KDD), data mining applies advanced statistical models and algorithms – a set of rules or a series of steps for solving a problem – to discover hidden and often unexpected patterns and relationships in data.

Pattern and relationship identification differentiates data mining from online analytic processing (OLAP) tools, which can be used to test a hypothesis once patterns have been identified. Data mining accomplishes this by reducing or summarizing large amounts of data so they can be manageably viewed, understood, and analyzed. “Data mining is essentially a data reduction tool for summarizing data,” explains Usama Fayyad, PhD, president and CEO of digiMine, a data warehousing and data mining outsourcing company based in Seattle, Wash. Widely regarded as a leading data mining expert, Fayyad helped establish this technical discipline in his work as a graduate engineering student at the University of Michigan and later for both NASA’s Jet Propulsion Laboratory at Caltech in Pasadena, Calif., and Microsoft. Data mining’s summarizing capability enables huge amounts of data to be visualized – amounts much greater than can be handled unaided by a human being. As a result, says Fayyad, “Deep patterns can be recognized – deeper than what a bar chart can show.”

Data mining uses two types of models: descriptive and predictive. The descriptive model identifies patterns and facilitates discovery within and an understanding of a huge mass of data, such as health characteristics of a specific population. The predictive model identifies relationships among variables and helps forecast or anticipate certain trends, conditions, or outcomes.
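The distinction can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the records are invented, the field layout is hypothetical, and the one-variable cutoff rule stands in for the far richer models a real data mining tool would build. The descriptive function summarizes what the data look like; the predictive function learns a rule from historical outcomes and applies it to a new case.

```python
from collections import defaultdict
from statistics import mean

# Invented records for illustration only:
# (age_group, length_of_stay_days, readmitted_within_30_days)
records = [
    ("65-74", 4, False),
    ("65-74", 6, False),
    ("75-84", 7, True),
    ("75-84", 9, True),
    ("85+", 10, True),
    ("85+", 12, True),
    ("65-74", 3, False),
    ("75-84", 5, False),
]


def describe(rows):
    """Descriptive model: average length of stay for each age group."""
    stays_by_group = defaultdict(list)
    for group, stay, _ in rows:
        stays_by_group[group].append(stay)
    return {group: mean(stays) for group, stays in stays_by_group.items()}


def fit_cutoff(rows):
    """Predictive model: find the length-of-stay cutoff that best
    separates readmitted from non-readmitted patients in past data."""
    best_cutoff, best_correct = None, -1
    for cutoff in sorted({stay for _, stay, _ in rows}):
        # Count how many historical cases the rule "stay >= cutoff" gets right.
        correct = sum((stay >= cutoff) == readmitted
                      for _, stay, readmitted in rows)
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff


def predict(cutoff, stay):
    """Apply the learned rule to a new patient's length of stay."""
    return stay >= cutoff


summary = describe(records)   # what the population looks like
cutoff = fit_cutoff(records)  # rule learned from past outcomes
print(summary)
print("Flag for follow-up:", predict(cutoff, 8))
```

Even in this toy form, the predictive rule is only as good as the historical outcomes it was fitted to, which is why the quality of the underlying data matters so much.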

“Where [the predictive model of] data mining could be very useful [in healthcare] is in trying to determine the variables that produce a certain outcome,” explains Herb Edelstein, president of Two Crows Corporation, a data mining consulting firm based in Potomac, Md. For example, the results of a specific procedure could be analyzed using selected data mining approaches to help discover why outcomes from the procedure varied among patients. The information derived from data mining could then be built into a tool that could help proactively identify – by individual patient history and characteristics – those patients who might respond well to the procedure.

A group of researchers at Williams College in Williamstown, Mass., the University of Vermont College of Medicine, and the Memory Disorders Clinic at Southwestern Vermont Medical Center developed this type of predictive tool to help better distinguish patients experiencing cognitive defects related to Alzheimer’s disease from those patients experiencing cognitive changes due to the normal aging process. The tool’s accuracy and reliability not only help direct the right patients to the right treatment, but they also do so in an effective, timely, and cost-efficient manner, reducing the time and resources needed to administer tests from three to four hours to approximately seven minutes.

Data mining’s ability to sift through large, complex data stores to answer “why” questions makes it a powerful, evidence-based decision support tool for clinicians and business analysts. “It has phenomenal potential in healthcare,” says Edelstein, referring to clinical as well as business applications. “Unfortunately, it is not widely used.”

He believes this is partially due to a lack of knowledge about data mining technology.

The larger issue as he sees it, though, is the industry’s cautious approach toward high-tech information technology (IT). “[Healthcare] information services departments tend to be very conservative,” observes Edelstein. “Too many organizations are not aggressive enough in asking how [information] technology can help them solve problems.”

Nonetheless, viable data mining solutions for those organizations that have been more aggressive in implementing high-tech IT have been in relatively short supply. According to Suresh Gunasekaran, healthcare industry principal analyst for Gartner Dataquest, organizations interested in keeping pace with or staying ahead of the IT curve to improve their operations have only recently been provided with usable solutions to help them control the cost of care, quality, and performance improvement issues. The number of organizations in this category has remained small and mostly within the hospital sector.

Gartner Dataquest is an IT market research service provided by Stamford, Conn.-based Gartner, Inc. As an analyst, Gunasekaran sees organizations employing data mining primarily on the billing and insurance end of operations in areas such as authorization and referral management.

When used for clinical purposes, data mining tends to be broadly focused and high level, performing functions such as analyzing overall DRG patterns, case mix patterns, or overall severity of illness. These analyses have allowed organizations to benchmark costs, determine how sick patients have been, and deduce how well care is being delivered – but only at a high level. To drill down from this point is more difficult, maintains Gunasekaran, and will take time.


The difficulty in drilling down lies less with data mining techniques, which are relatively straightforward, and more with acquiring the necessary process knowledge and business or clinical analytics that enable data mining to be used effectively and to produce accurate results. “Process-based expertise [in healthcare] has only grown recently,” explains Gunasekaran, pointing out that identifying organizational business logic – the business analytics – has also recently become a concern for most healthcare organizations.

Analytics are a central component of the data mining process. Contrary to many myths and misconceptions, data mining is not a one-size-fits-all process, nor does it operate blindly on data and produce information by magic. “You need the analytics to help solve the kinds of problems that data mining is designed to address,” says Gunasekaran. The analytics are used to build the data mining models, which are then employed to mine selected data. Developing analytics requires both historical data and people who know how to analyze them. Again, contrary to some data mining misconceptions, the human element is a vital component.

Successful data mining relies on the knowledge and skill sets of both technical and “content” people. In addition to IT expertise, staff who have detailed knowledge of the area or areas to which data mining is being applied, as well as the data themselves, must be intimately involved. Moreover, they should have the necessary analytic skills to understand, interpret, and work with the mined results. “Data mining can never replace human knowledge and experience,” maintains Fayyad. “In fact, [with data mining] they [people] take on greater importance.”

While a close, collaborative partnership between technical and content staff always facilitates smoother and more effective IT implementations, such a partnership becomes a critical success factor in a data mining operation. Another factor, of course, is the data quality itself.


Obviously, data mining requires data, but not just any data. The data must be relevant, available, accurate, and, not to mention, “clean” – that is, complete and consistent. Data mining techniques are highly sensitive to missing or inconsistent data, and the garbage-in, garbage-out adage applies as much to data mining as to any other data-dependent computer operation. Among the several preliminary steps in a data mining initiative, data preparation remains the most time-consuming. When data preparation involves consolidating data from multiple sources, this tedious and difficult task becomes that much more fatiguing and time-consuming.
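A minimal Python sketch shows what checking for “clean” data can involve when records come from more than one source. The field names, the sample rows, and the code table are all hypothetical, chosen only to illustrate the two problems named above: missing values (completeness) and the same fact coded differently by different systems (consistency).

```python
# Hypothetical required fields for a record to be considered complete.
REQUIRED_FIELDS = ("patient_id", "sex", "admit_date", "drg")

# Hypothetical mapping from source-specific codes to one standard code.
SEX_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

# Invented rows standing in for extracts from two different systems.
rows = [
    {"patient_id": "A1", "sex": "M", "admit_date": "2003-01-04", "drg": "462"},
    {"patient_id": "A2", "sex": "female", "admit_date": "", "drg": "430"},
    {"patient_id": "A3", "sex": " f ", "admit_date": "2003-02-11", "drg": None},
    {"patient_id": "A4", "sex": "unknown", "admit_date": "2003-02-15", "drg": "462"},
]


def count_missing(records):
    """Completeness check: count empty or absent values per required field."""
    missing = {field: 0 for field in REQUIRED_FIELDS}
    for record in records:
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if value is None or str(value).strip() == "":
                missing[field] += 1
    return missing


def normalize_sex(value):
    """Consistency check: map varied codes to one standard code.
    Returns None for values the code table does not recognize."""
    if value is None:
        return None
    return SEX_CODES.get(str(value).strip().lower())


print(count_missing(rows))
print([normalize_sex(row["sex"]) for row in rows])
```

Unrecognized values are returned as None rather than guessed at, so a person who knows the data can decide how to handle them.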

“Getting the data [in healthcare] is such a challenge,” says Gunasekaran. “Data integration is still a profound problem,” he adds, referring to a host of obstacles in most healthcare organizations that must be hurdled to obtain quality, mineable data. These include contending with data stored in systems that use vastly different platforms or that run arcane or “homegrown” software; lack of commitment to putting data currently stored on paper into electronic format; missing data; and poorly archived data. Substandard data can be an issue from the outset of a data mining initiative because retrospective data are needed to design the kinds of questions data mining can address. Gunasekaran believes it will take most healthcare organizations several years to “ramp up” their data to an acceptable, mineable level. This is one of the major reasons why many organizations elect not to take on the data mining challenge.

Although Fayyad doesn’t see data quality as a problem unique to healthcare, he stresses that pulling together the data to be mined is a tricky operation. The data have to be collected thoroughly and systematically and organized correctly, and that discipline has to be maintained over time. The data also must be stored in a way that allows flexibility and enables accurate, effective, expeditious mining. Most organizations store their data in vertically oriented, transactional databases, which have a more rigid structure. According to Fayyad, the advantage of a data warehouse lies in its design, which uses a different architecture and approach than other databases. The result is a more flexible structure for collecting, storing, manipulating, and maintaining data, and thus a link in effective data mining.

Fayyad also believes that data preparation needs to be done by the right people in the right teams, underscoring again the necessary knowledge, skill sets, and cooperative nature that team members must possess. Moreover, he strongly recommends that those involved in a data warehousing initiative be extremely disciplined about staying focused on identified objectives. It’s easy for data warehousing initiatives to run away from the participants, and a project can quickly become complex as more is added to it, whether data or functionality. As a result, many organizations never finish building a data warehouse, go over budget, or fail to get the desired results.

Edelstein estimates that approximately 60% to 95% of the time dedicated to data mining will be spent working with the data, regardless of how clean they are or how they’re stored. While initial data preparation can absorb a lengthy period of time, data work doesn’t end there. The data must be maintained and kept up to date; otherwise, says Fayyad, “the data start becoming like last year’s newspaper.” However, pulling data from existing systems, not to mention keeping them maintained properly, raises a longstanding healthcare IT issue: lack of computer systems integration. Data mining is dependent upon systems that can talk to one another, as well as use and update data in a consistent way, Gunasekaran stresses.

Fayyad concurs, advising organizations to put their systems in order so they can leverage the power of data mining for discovery or prediction and, hence, improvement. According to Fayyad, large retailers and many Fortune 100 companies have been wrestling with getting their data and data systems in order for 15 to 20 years. “The sad truth is that there has been little success,” he relates. He believes the only way to get data mining to work is to treat it systematically and make it part of daily operations. To do so requires a shift in mindset about how information systems and technology – this technology in particular – can be employed to improve clinical, administrative, and financial operations today, rather than tomorrow.


Obtaining the greatest value from any data mining investment necessitates thinking of it as a continuous, iterative process rather than a project, stresses Gunasekaran. Paradoxically, developing a successful data mining program requires staying grounded in the present and building a system to address today’s operational problems and issues that will not be an obstacle to addressing tomorrow’s problems and issues. A tall, near-impossible order? Not necessarily.

As a first, very practical step, take the time to become knowledgeable about what data mining can and can’t do, and become familiar with the boundaries of the technology. Keep in mind that it is a complementary tool, and, as Fayyad cautions, “don’t attach magical intelligence to it.” There are a growing number of excellent, readily accessible print and online resources to help in this effort, many of which are cited in this article’s references.

Well before investigating any technology, identify a specific, concrete clinical and/or business problem that data mining can help solve, as well as the people who will be needed to address it. Fayyad underscores how tempting it can be to build a data mining effort with a futurist vision. The risk of never completing it, however, is not worth taking, he stresses. “Meet today’s need and forget the grandiose visions,” he advises, “then build from there. The stuff you’re uncertain about, forget it for now. If the technology weren’t so complex [to implement], then I’d say take the long view, but it’s not. It’s easy to build a bad data mining initiative, and much harder to build a good one.”

Gunasekaran also advises organizations to concentrate on clinical or business problems from which they can derive the most benefit today, recommending that they invest in technology with an open architecture that can be extended tomorrow. He believes this approach will accrue the kind of return on investment HIM and HIT leaders must demonstrate, particularly when applying any new technology or upgrades to existing technology.

Edelstein urges organizations that are novices to data mining to start small. Large, complex initiatives can run into hundreds of thousands of dollars and go awry quickly and easily without the know-how to manage them well. Edelstein contends that small and manageable, though no less useful, efforts can be accomplished for as little as $5,000 to $15,000. “Starting with a smaller, simpler [goal] provides a less expensive opportunity not only to learn about the technology, but also how the organization can achieve success with it,” he maintains. Rush-Presbyterian-St. Luke’s Medical Center did just this in its early data mining excursions with the explicit intention of learning how the technology and associated tools worked.

Fayyad also recommends developing a clear understanding of the data operations commitment needed. “Building [a data mining operation] is one thing,” he says. “Maintaining it is quite another.” Fayyad and digiMine put a lot of stock in taking a disciplined approach to data mining and being realistic and committed to staying within well-defined, here-and-now limits.

The increasing conversion of health data to electronic formats and storage media makes this the perfect time to think seriously about how applying data mining in an organization can help target opportunities to improve clinical, administrative, and financial performance. Although currently small, the number of organizations and research groups achieving success and experience with data mining is growing. Each one demonstrates that data mining’s technology and tools offer not only a sound approach to turning a wealth of data into actionable information, but, consequently, accumulative returns for patients, providers, and organizations.

