Collaboration to Revolutionize Database Processing
It may sound like a strange comparison, but to Usama Fayyad in Microsoft Research (MSR), most large databases are like the pyramids of ancient Egypt. While they are great feats of engineering, they’re also huge, tomblike structures that have locked up vast and valuable data treasures.
The information revolution has created large database “stores”, which can hold many terabytes of data. (A terabyte represents 10 to the power of twelve bytes.) And as the size and complexity of these databases grow, it’s becoming harder to access their contents and extract patterns of information that can be used to reliably support decision-making.
Typically, the answer has been to use traditional query languages that require users to describe the target records exactly. But it’s nearly impossible for human beings to visualize and understand the raw contents of a database that can hold millions of records, each one of which can contain hundreds or thousands of fields. As a result, many queries that people are interested in are very difficult to express.
For example, a simple question posed by a marketing manager such as “What items in my company are being purchased together each month?” would require an analyst to evaluate millions of data points – an impossible task.
That problem is now being addressed at Microsoft, thanks to a collaborative effort between database mining expert Fayyad and David Marshall, a longtime SQL Server contributor who now heads up the product’s Aurum team (http://aurum/). Its goal: to develop SQL Server algorithms that evaluate data and derive critical patterns a human analyst is likely to miss or is unable to unearth.
AURUM MINERS SEE GOLD IN THEM THAR DATABASES
Fayyad and Marshall are proponents of an emerging discipline known as KDD – knowledge discovery in databases – a field that combines elements of database theory, statistics and pattern recognition, artificial intelligence, and high-performance computing. Marshall’s Aurum team (aurum is Latin for gold) and Fayyad’s Socrates team (part of MSR’s Decision Theory & Adaptive Systems group) are harnessing the power of KDD and its related discipline, data mining.
“What data mining allows you to do is to use the power of the computer to find new patterns, new correlations within data,” said Marshall, whose SQL Server development contributions date back more than 10 years to the days of the Microsoft/Sybase collaboration. “A human being who is looking for these kinds of patterns can conceptualize only a few factors at a time. Beyond that, it’s hard to have an intuition about the relationships between the different data attributes.”
Marshall’s Aurum team is now staffing up with Fayyad’s group to build component-object models that implement data-mining algorithms for future versions of Microsoft products. These include SQL Server, Excel, Access, and Commerce Server.
Among other tasks, the algorithms can be used to uncover natural groupings of data segments. They can also find unusual relations among items – for example, that buyers of Access are also likely to buy Visual Basic.
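To make the “bought together” idea concrete, here is a minimal sketch in Python of one common way to surface such relations: count how often pairs of items appear in the same purchase, then report how often a buyer of one item also bought the other. This is an illustration only, not the Aurum team’s algorithm; the function name pair_rules, the min_support threshold, and the purchase data are all invented for the example.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=2):
    """Return (a, b, support, confidence) tuples for rules "a -> b".

    support    = number of baskets containing both a and b
    confidence = support / number of baskets containing a
    """
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), n in pair_counts.items():
        if n < min_support:
            continue                         # too rare to be interesting
        rules.append((a, b, n, n / item_counts[a]))
        rules.append((b, a, n, n / item_counts[b]))

    # Strongest rules first; ties broken by support, then by name.
    rules.sort(key=lambda r: (-r[3], -r[2], r[0], r[1]))
    return rules


# Hypothetical purchase records, invented purely for illustration.
baskets = [
    ["Access", "Visual Basic"],
    ["Access", "Visual Basic", "Excel"],
    ["Access", "Excel"],
    ["Excel"],
]

rules = pair_rules(baskets)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}: support={support}, confidence={confidence:.2f}")
```

On real data the interesting part is scale: counting every pair across millions of purchases is exactly the kind of work an analyst cannot do by hand, and production systems typically prune the candidate pairs rather than counting all of them as this sketch does.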
Rather than specifying an exact query, a user would use an algorithm to find records that are similar to one set of records but different from a second set. Another example would be to look for records that are similar to each other and show how they differ from the rest of the data.
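A query “by example” of this kind can be sketched in a few lines of Python. The version below assumes each record is a tuple of numbers, summarizes each example set by its average (centroid), and keeps the records that sit closer to the “like” set than to the “unlike” set. The nearest-centroid approach, the function names, and the sample values are assumptions made for illustration; the article does not say which technique the teams use.

```python
import math


def centroid(rows):
    """Column-wise mean of a list of equal-length numeric records."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def distance(a, b):
    """Euclidean distance between two numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_example(records, like, unlike):
    """Return records closer to the `like` examples than to the `unlike`
    examples, ordered from strongest match to weakest."""
    c_like, c_unlike = centroid(like), centroid(unlike)
    scored = [(distance(r, c_unlike) - distance(r, c_like), r) for r in records]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [r for score, r in scored if score > 0]


# Hypothetical two-field records, invented purely for illustration.
like = [(1.0, 1.0), (2.0, 1.0)]
unlike = [(8.0, 9.0), (9.0, 8.0)]
records = [(1.0, 2.0), (9.0, 9.0), (2.0, 2.0), (5.0, 5.0)]

matches = rank_by_example(records, like, unlike)
print(matches)
```

The user supplies examples instead of an exact logical expression; the system works out what separates the two sets. Real records with hundreds of mixed-type fields would need scaling and a more careful similarity measure than plain Euclidean distance.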
“A credit card company might want the database to provide an exact logical expression to help identify a fraudulent transaction,” Fayyad said. “Among millions of transactions, that’s currently a difficult, if not impossible, task. But what if you could build a system that could automatically recognize what is and isn’t fraud? All of a sudden the end user would see the database becoming smarter, easier to use, and much more flexible.”
VISION IS KEY TO DATA MINING
Fayyad credits Bill Baker, head of the Decision Support Product Unit, and Paul Flessner, general manager of SQL Server, for recognizing the strategic value of data mining to SQL Server.
The trio’s basic approach is to establish and publicize a new, standard, and open application programming interface. They also hope to line up independent software vendor partners in a data mining alliance that will make SQL Server the platform of choice for rapidly building integrated, efficient solutions for many different data-mining programs.
This vision is now starting to come to fruition, and Fayyad said it will mean great things for Microsoft. “This is an area that we can come in way ahead of Oracle, IBM, and other competitors and deliver technology nobody else has.”
Author: Aaron Halabe