The capacity of digital data storage worldwide has doubled every nine months for at least a decade, at twice the rate predicted by Moore’s Law for the growth of computing power during the same period. This less familiar but noteworthy phenomenon, which we call Storage Law, is among the reasons for the increasing importance and rapid growth of the field of data mining.
The aggressive growth rate of disk storage, and the widening gap between the Moore's Law and Storage Law trends, represent a very interesting pattern in the evolution of technology. Our ability to capture and store data has far outpaced our ability to process and utilize it. This growing challenge has produced a phenomenon we call data tombs, or data stores that are effectively write-only; data is deposited merely to rest in peace, since in all likelihood it will never be accessed again.
Data tombs also represent missed opportunities. Whether the data might support exploration in a scientific activity or commercial exploitation by a business organization, it is potentially valuable information. Without next-generation data mining tools, most of it will stay unused; hence most of the opportunity to discover, profit, improve service, or optimize operations will be lost. Data mining – one of the most general approaches to reducing data in order to explore, analyze, and understand it – is the focus of this special section.
Data mining is defined as the identification of interesting structure in data. Structure designates patterns, statistical or predictive models of the data, and relationships among parts of the data. Each of these terms – patterns, models, and relationships – has a concrete definition in the context of data mining. A pattern is a parsimonious summary of a subset of the data (such as people who own minivans have children). A model of the data can be a model of the entire data set and can be predictive; it can be used to, say, anticipate future customer behavior (such as the likelihood a customer is or is not happy, based on historical data of interaction with a particular company). It can also be a general model (such as a joint probability distribution on the set of variables in the data). However, the concept of interesting is much more difficult to define.
What structure within a particular data set is likely to be interesting to a user or task? An algorithm could easily enumerate a vast number of patterns from a finite database. Identifying interesting structure and useful patterns among this plethora of possibilities is what a data mining algorithm must do, and it must do so quickly over very large databases.
For example, frequent item sets (variable values that occur together frequently in a database of transactions) could be used to answer, say, which items are most frequently bought together in the same supermarket. Such an algorithm could also discover, in a demographics database, a pattern with exceptionally high confidence that, say, all husbands are male. While true, this particular association is unlikely to be interesting. The same method did uncover, in the set of transactions representing physicians billing the Australian Government's medical insurance agency, a correlation deemed extremely interesting by the agency's auditors. Two billing codes were highly correlated; they represented the same medical procedure and hence had created the potential for double-billing fraud. This nugget of information represented millions of dollars in overpayments.
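The frequent-item-set idea can be sketched in a few lines of code. The minimal example below counts co-occurring item pairs against a support threshold; the baskets and threshold are invented for illustration, not drawn from the supermarket or insurance cases above.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support):
    """Count co-occurring item pairs; keep those meeting min_support."""
    pair_counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            pair_counts[pair] += 1
    n = len(transactions)
    return {pair: count / n
            for pair, count in pair_counts.items()
            if count / n >= min_support}

# Hypothetical supermarket baskets (illustrative only).
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk"},
]
print(frequent_pairs(baskets, min_support=0.5))
```

Real frequent-item-set miners (Apriori and its successors) prune the exponential space of candidate sets rather than enumerating pairs exhaustively, but the support computation is the same.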
The quest for patterns in data has been studied for a long time in many fields, including statistics, pattern recognition, and exploratory data analysis. Data mining is primarily concerned with making it easy, convenient, and practical to explore very large databases for organizations and users with lots of data but without years of training as data analysts. The goals uniquely addressed by data mining fall into certain categories:
Scaling analysis to large databases. What can be done with large data sets that cannot be loaded and manipulated in main memory? Can abstract data-access primitives embedded in database systems provide mining algorithms with the information needed to drive a search for patterns? How might we avoid having to scan an entire very large database while reliably searching for patterns?
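One standard way to avoid a full scan – a generic statistical technique, not a method prescribed by this article – is to estimate a pattern's support from a random sample and bound the estimation error. The database and itemset below are synthetic.

```python
import random
import math

def estimate_support(database, itemset, sample_size, seed=0):
    """Estimate itemset support from a random sample instead of a full scan.

    Returns the estimate and an approximate 95% margin of error
    (normal approximation to the binomial proportion)."""
    random.seed(seed)
    sample = random.sample(database, sample_size)
    hits = sum(1 for txn in sample if itemset <= txn)
    p = hits / sample_size
    margin = 1.96 * math.sqrt(p * (1 - p) / sample_size)
    return p, margin

# Synthetic database: roughly 30% of transactions contain {"a", "b"}.
random.seed(42)
db = [{"a", "b"} if random.random() < 0.3 else {"a"} for _ in range(100_000)]
estimate, err = estimate_support(db, {"a", "b"}, sample_size=1_000)
print(f"support ~= {estimate:.3f} +/- {err:.3f}")
```

Reading 1,000 rows instead of 100,000 trades a small, quantifiable error for a hundredfold reduction in data touched, which is the essence of scaling pattern search to databases that cannot fit in main memory.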
Scaling to high-dimensional data and models. Classical statistical data analysis relies on humans to formulate a model, then uses the data to assess the model's fit. But humans are ineffective at formulating hypotheses when data sets have large numbers of variables (possibly thousands in cases involving demographics and hundreds of thousands in cases involving retail transactions, Web browsing, or text document analysis). A model derived from this automated discovery and search process can be used to find lower-dimensional subspaces where people find it easier to understand the aspects of the problem that are interesting.
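As a minimal illustration of projecting high-dimensional data onto a smaller subspace automatically, the sketch below ranks variables by variance and keeps the most variable ones. Variance ranking is only one naive criterion (real systems use richer model-driven searches), and the data is invented.

```python
def top_variance_features(rows, k):
    """Rank columns by variance and keep the k most variable ones --
    a naive way to pick a lower-dimensional subspace of the data."""
    n = len(rows)
    dims = len(rows[0])
    variances = []
    for j in range(dims):
        col = [row[j] for row in rows]
        mean = sum(col) / n
        variances.append((sum((x - mean) ** 2 for x in col) / n, j))
    # Keep the k highest-variance column indices, in column order.
    return sorted(j for _, j in sorted(variances, reverse=True)[:k])

# Hypothetical data: column 1 varies widely, columns 0 and 2 barely move.
data = [
    (1.0, 10.0, 5.0),
    (1.1, -3.0, 5.0),
    (0.9, 7.0, 5.1),
    (1.0, -8.0, 4.9),
]
print(top_variance_features(data, k=1))
```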
Automating search. Instead of relying solely on a human analyst to enumerate and create hypotheses, the algorithms perform much of this tedious and data-intensive work automatically.
Finding patterns and models understandable and interesting to users. Classical methodologies for scoring models focus on notions of accuracy (how well the model predicts data) and utility (how to measure the benefit of the derived pattern, such as money saved). While these measures are well understood in decision analysis, the data mining community is also concerned with new measures, such as the understandability of a model or the novelty of a pattern, and with how to simplify a model for interpretability. It is particularly important that algorithms help end users gain insight from data by focusing on the extraction of patterns that are easily understood or can be turned into meaningful reports and summaries, trading off complexity for understandability.
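The "all husbands are male" example can be made concrete with one common interestingness measure, lift (chosen here for illustration; it is not the only such measure): a rule with near-perfect confidence but lift close to 1 merely restates the base rate and tells the user nothing new. The records below are invented.

```python
def confidence_and_lift(transactions, antecedent, consequent):
    """Confidence = P(consequent | antecedent); lift = confidence / P(consequent).
    Lift near 1 means the rule is no better than the base rate: true but dull."""
    n = len(transactions)
    has_a = sum(1 for t in transactions if antecedent <= t)
    has_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    has_c = sum(1 for t in transactions if consequent <= t)
    confidence = has_both / has_a
    lift = confidence / (has_c / n)
    return confidence, lift

# Invented demographic records: every husband is male, but nearly every
# record is male anyway, so the rule carries almost no information.
records = [{"husband", "male"}] * 40 + [{"male"}] * 55 + [{"female"}] * 5
conf, lift = confidence_and_lift(records, {"husband"}, {"male"})
print(f"confidence={conf:.2f}, lift={lift:.2f}")
```

Here confidence is perfect (1.0) yet lift is barely above 1, which is exactly why a scoring scheme based on confidence alone would flood users with true-but-uninteresting rules.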
Trends and Challenges
Among the most important trends in data mining is the rise of “verticalized”, or highly specialized, solutions, rather than the earlier emphasis on building new data mining tools. Web analytics, customer behavior analysis, and customer relationship management all reflect the new trend; solutions to business problems increasingly embed data mining technology, often in a hidden fashion, into the application. Hence, data mining applications are increasingly targeted and designed specifically for end users. This is an important and positive departure from most of the field’s earlier work, which tended to focus on building mining tools for data mining experts.
Transparency and data fusion represent two major challenges for the growth of the data mining market and technology development. Transparency concerns the need for an end-user-friendly interface, whereby the data mining is transparent as far as the user is concerned. Embedding vertical applications is a positive step toward addressing this problem, since it is easier to generate explanations from models built in a specific context. Data fusion concerns a more pervasive infrastructure problem: where is the data that has to be mined? Unfortunately, most efforts at building decision-support infrastructure, including data warehouses, have proved big, complicated, and expensive. Industry analysts report the failure of a majority of enterprise data warehousing efforts. Hence, even though the data accumulates in stores, it is not being organized in a format that is easy to access for mining or even for general decision support.
Much of the problem involves data fusion. How can a data miner consistently reconcile a variety of data sources? Often labeled as data integration, warehousing, or IT initiatives, the problem is also often the unsolved prerequisite to data mining. The problem of building and maintaining useful data warehouses remains one of the great obstacles to successful data mining. The sad reality today is that before users get around to applying a mining algorithm, they must spend months or years bringing together the data sources. Fortunately, new disciplined approaches to data warehousing and mining are emerging as part of the vertical solutions approach.
Emphasizing Targeted Applications
The six articles in this special section reflect the recent emphasis on targeted applications, as well as data characterization and standards.
Padhraic Smyth et al. explore the development of new algorithms and techniques in response to changing data forms and streams, covering the influence of the data form on the evolution of mining algorithms.
Paul Bradley et al. sample the effort to make data mining algorithms scale to very large databases, especially those in which one cannot assume the data is easily manipulated outside the database system or even scanned more than a few times.
Ron Kohavi et al. look into emerging trends in the vertical solutions arena, focusing on business analytics, which is driven by business value measured as progress toward bridging the gap between the needs of business users and the accessibility and usability of analytic tools.
Specific applications have always been an important aspect of data mining practice. Two overview articles cover mature and emerging applications. Chidanand Apte et al. examine industrial applications where these techniques supplement, and sometimes supplant, existing human-expert-intensive analytical techniques for significantly improving the quality of business decision making. Jiawei Han et al. outline a number of data analysis and discovery challenges posed by emerging applications in the areas of bioinformatics, telecommunications, geospatial modeling, and climate and Earth ecosystem modeling.
Data mining also represents a step in the process of knowledge discovery in databases (KDD). The recent rapid increase in KDD tools and techniques for a growing variety of applications needs to follow a consistent process. The business requirement that any KDD solution must be seamlessly integrated into an existing environment makes it imperative that vendors, researchers, and practitioners all adhere to the technical standards that make their solutions interoperable, efficient, and effective. Robert Grossman et al. outline the various standards efforts under way today for dealing with the numerous steps in data mining and the KDD process.
Providing a realistic view of this still young field, these articles should help identify the opportunities for applying data mining tools and techniques in any area of research or practice, now and in the future. They also reflect the beginning of a still new science and the foundation for what will become a theory of effective inference from and exploitation of all those massive (and growing) databases.
By: Usama Fayyad and Ramasamy Uthurusamy
Source: Communications of the ACM