Published by MSR on February 03, 1999
A mine is a dark, uninviting, and sometimes dangerous place. Not a bad description for territory that humans encounter today when they attempt to analyze, understand, or navigate large data stores.
As its name implies, “data mining” is about unearthing nuggets of valuable information in the mountains of data stored in corporate, public, scientific, and even personal databases. As these databases grow in size, dimensionality, and complexity, the challenge increasingly becomes how to extract useful patterns that can reliably support decision making, or that can predict or model interesting events.
These are the issues Usama Fayyad grapples with in his role as senior researcher in the Decision Theory & Adaptive Systems Group of Microsoft Research. “Now that databases are everywhere, the biggest problem facing people is how to access the contents. The traditional answer has been: ‘Here’s a query language. Write down a query and we’ll give you back a precise answer.’ The difficulty is that many interesting queries people are interested in are very hard to express,” he said.
“For example, if you’re a credit card company you might want to say, ‘what transactions in my database are likely to be fraudulent?’ The problem is that you could easily have hundreds, sometimes thousands of fields for each customer and their usage patterns, so it’s very hard to write a query to extract the right set of records. Traditional query languages (SQL) require that one be able to describe the target records exactly. What humans typically do in these situations is look at a few fields – three or four at a time – and simply ignore the rest. This is an artifact of humans only being comfortable spotting patterns in low-dimensional spaces, and it leads to missing better solutions involving many more dimensions.”
Usama views data mining as a way to provide a new interface between humans and databases: a database front-end, if you will, that lets people explore, analyze, visualize, and summarize the contents by communicating at a much higher level. For example, instead of specifying an exact query, a user can say: “find records that are similar to this set of records but different than that other set.” Another example is to say: “look for records that are similar to each other (clusters) embedded in high dimensions and show me how they differ from the rest of my data.”
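To make the first kind of request concrete, here is a minimal sketch of a "similar to this set, different from that set" query. It is an illustration only, not the method used in Usama's group: it assumes records are numeric tuples of equal length, and the function names `centroid` and `similar_but_different` are hypothetical. Each candidate is scored by how much closer it lies to the average of the "like" examples than to the average of the "unlike" examples.

```python
from math import dist  # Euclidean distance, Python 3.8+

def centroid(records):
    """Component-wise mean of a list of equal-length numeric records."""
    n = len(records)
    return [sum(column) / n for column in zip(*records)]

def similar_but_different(candidates, like_these, unlike_these):
    """Return candidates closer to the 'like' centroid than to the 'unlike'
    centroid, ordered from strongest to weakest match (hypothetical helper)."""
    pos = centroid(like_these)
    neg = centroid(unlike_these)
    # Positive score: the record sits nearer the examples the user liked.
    scored = [(dist(r, neg) - dist(r, pos), r) for r in candidates]
    return [r for score, r in sorted(scored, reverse=True) if score > 0]

# Made-up two-field records, purely for illustration.
like = [(1.0, 1.0), (1.2, 0.8)]
unlike = [(5.0, 5.0), (4.8, 5.2)]
candidates = [(0.9, 1.1), (5.1, 4.9), (2.0, 2.0)]
matches = similar_but_different(candidates, like, unlike)
```

The user supplies examples rather than a predicate, and the same code works unchanged whether a record has two fields or two thousand, which is the point of moving away from hand-written queries over three or four fields.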
In his research, Usama says there are two categories of challenges facing the development of effective data mining tools: scalability and automation. Scalability requires addressing problems in the development of data mining engines so they won’t “croak because the inquiry sucks up all of the computer’s memory.” The first thing a typical (statistical) data analysis system tries to do is “load data” into main memory: an unfortunate event if data is larger than core memory.
“However, if we are careful about how we design the mining algorithms, we can exploit native database capabilities to efficiently extract models from disk-resident data,” says Usama. “Scalability also involves addressing a host of issues in parallel and distributed environments, client-server architectures, scaling to computation on clusters of networked computers, and mining multimedia/multi-modal data.” Automation challenges include making analysis tools usable by end-users. Traditional statistics did not succeed in this respect because statisticians wrote tools that were only usable by other statisticians. “It is important to allow users to easily formulate and solve analysis and exploration problems over their own databases,” says Usama.
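One way to picture "exploiting native database capabilities" is to let the database compute summary counts and hand back only those, rather than loading every row into memory. The sketch below is an assumption-laden illustration, not the group's actual engine: the table `transactions`, its columns `channel` and `fraud`, and the helpers `class_counts` and `fraud_rate` are all hypothetical, and an in-memory SQLite database stands in for a large disk-resident one.

```python
import sqlite3

def class_counts(conn, table, attribute, label):
    """Ask the database for co-occurrence counts instead of fetching rows.
    Identifiers are interpolated directly, so this sketch assumes they are
    trusted, hypothetical names rather than user input."""
    sql = (f"SELECT {attribute}, {label}, COUNT(*) "
           f"FROM {table} GROUP BY {attribute}, {label}")
    return {(value, y): n for value, y, n in conn.execute(sql)}

def fraud_rate(counts, value):
    """Fraction of records with the given attribute value labeled fraud (1)."""
    fraud = counts.get((value, 1), 0)
    total = fraud + counts.get((value, 0), 0)
    return fraud / total if total else 0.0

# Stand-in for a large disk-resident database; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (channel TEXT, fraud INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("web", 1), ("web", 0), ("web", 1),
     ("store", 0), ("store", 0), ("store", 1)],
)
counts = class_counts(conn, "transactions", "channel", "fraud")
web_rate = fraud_rate(counts, "web")
```

The client only ever holds one small count per attribute value and label, so its memory use depends on the number of distinct values, not on the number of rows in the table.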
Realizing that data mining has a big role to play beyond the analysis of science data sets, Usama, 33, joined Microsoft in early 1996. He hopes to push the research front of this new and growing field as well as help develop data mining capabilities to make computers easier to use and more effective tools for dealing with the data glut they helped create in the first place. In addition to data mining, Usama’s research interests include knowledge discovery in large databases, machine learning, statistical pattern recognition and clustering.
Usama was born in the historic town of Carthage (the birthplace of Hannibal, and the center of the Punic Wars). He received his Ph.D. in computer science and engineering from the University of Michigan, Ann Arbor in 1991. Before receiving his Ph.D., Usama thought he would be a student forever, collecting two B.Sc.’s in Engineering, an M.Sc. in CSE and another M.Sc. in Mathematics. This had the side benefit of allowing him to spend summers at interesting research labs. When he finally decided it was time to graduate, Usama joined the NASA-funded Jet Propulsion Laboratory at the California Institute of Technology. There he formed and headed the Machine Learning Systems Group, and focused on developing systems to help astronomers and geologists analyze large data sets.
His work was recognized with the JPL Lew Allen Award for Excellence in Research in 1993, and NASA’s Exceptional Achievement Medal in 1994.
In 1994 and 1995, Usama was program co-chair of the International Conference on Knowledge Discovery and Data Mining (KDD). He served as general chair of the KDD-96 conference, is an editor-in-chief of the new technical journal Data Mining and Knowledge Discovery (http://www.research.microsoft.com/datamine), and co-edited the recent MIT Press book: “Advances in Knowledge Discovery and Data Mining.”
In the winter, Usama loves to ski. In summer, he likes to cruise on Lake Washington, pursuing his new passion for boating. A long-distance varsity swimmer in college, he still enjoys swimming and a good game of chess. Other hobbies include photography and a good deep sleep once in a while.