Panning for Data Gold

 

Nowadays nearly every organization from supermarkets to the police can boast a vast mine of electronic data. Separating the gold from the dross is the real challenge, as Robert Matthews reports.

SOMEWHERE among the two billion blobs of light captured in the Palomar Observatory’s Digital Sky Survey are quasars – distant galaxies that are among the brightest objects in the Universe. Astronomers would dearly like to know more about them and their mysterious power source, but which of those myriad blobs of light should they be looking at? It is like finding the proverbial needle in a haystack, and this haystack is the entire cosmos.

The astronomers’ predicament is shared by some unlikely bedfellows: supermarket executives, stock market analysts and detectives. All are faced with a surfeit of data, but a dearth of information.

Now a growing band of computer scientists say they can dig out nuggets of 24-carat knowledge from huge mountains of database dross. They call themselves “data miners”, and they are wielding some pretty impressive tools, drawn from esoteric fields such as artificial intelligence and statistical inference theory. But the impact of their efforts is anything but esoteric.

By identifying potential new customers – or ways of hanging on to existing ones – data mining can be worth millions in extra revenue. And this is just the start, according to Usama Fayyad of Microsoft Research, co-editor of a new book on data mining.

“Big corporations can obviously benefit most, as a small improvement in, say, prediction or modeling can easily add up to millions of dollars through sheer numbers,” he says. “But data mining can be quite as powerful for small businesses too – like a restaurant owner who serves 500 meals a week and wants to know what new dishes to recommend to customers, based on their past choices.”

At first sight, data mining sounds like little more than an exercise in graph-plotting: just rummage through your customer data and find out who chose prawn cocktail and steak and chips, but eschewed the Black Forest gateau in favour of apricot tart. But what if there are 5 starters, 10 main courses and 8 desserts? That’s 400 combinations for a start. Then there are the different permutations of age groups, social classes and income levels. It’s called the “curse of dimensionality”: the way in which just a handful of variables can produce a colossal number of permutations. Multiply it by the size of the customer base – which can easily be hundreds of thousands, even millions – and finding a trend starts to look impossible.
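To get a feel for how quickly the numbers mount up, here is a rough back-of-the-envelope sketch in Python. The menu sizes are the ones above; the demographic splits and the customer count are purely illustrative assumptions, not figures from any real database.

```python
# Back-of-the-envelope count of how quickly menu choices multiply.
# Menu sizes come from the example above; the demographic splits and
# the customer count are illustrative assumptions only.
starters, mains, desserts = 5, 10, 8
menu_combos = starters * mains * desserts            # 400 possible meals

age_groups, social_classes, income_bands = 6, 5, 5   # assumed splits
profiles = age_groups * social_classes * income_bands

print(menu_combos)                        # 400 meal combinations
print(menu_combos * profiles)             # 60,000 meal-by-profile cells
print(menu_combos * profiles * 500_000)   # ...across a 500,000-customer base
```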

Yet data miners are now happy to tackle such daunting tasks. Dealing with such large volumes of data might seem to demand heavy-duty computing power, and certainly some data mining companies, such as Bracknell-based White Cross Systems, wield big-hitting parallel-processing computers to blast through huge corporate databases in seconds.

Other data miners try to take a leaf out of the old prospector’s book, and look for “promising ground” before they begin major excavations. “It’s not unusual for the first part of a data mining project to be concerned with how to get a suitable sample of the database to use,” says Dave Shuttleworth, senior consultant with White Cross. “I’ve personally experienced projects where people take months to decide which 98 per cent of their data to ignore.”

Data miners certainly have to be prepared for a fair bit of drudgery before they can get to work. The raw data often has to be “cleaned up” so that all the information is in a uniform state for mining. Shuttleworth recalls a case where a retailer’s database included 500 different ways of describing which American state the information came from.

Once the ground is prepared, data mining can begin in earnest. The challenge is not simply to dig out valuable information from huge amounts of data; it is also to pull out trends, groupings and connections that depend on many variables, linked in complex ways. Such patterns cannot be found using simple textbook methods such as linear regression, which finds the best straight line linking one variable to another.

Instead, data miners are turning to more powerful methods such as rule tree induction. This uses ideas taken from information theory and the laws of probability to extract – “induce” – rules that can best account for the data. For example, by looking at the numbers of customers who choose different starters, main courses and desserts, tree induction can reveal the most likely rules linking customers together: “IF prawn cocktail AND cheap plonk THEN steak and chips.”
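As a rough illustration of the idea – not the software used by any of the companies mentioned here – the sketch below induces a small decision tree from invented restaurant orders using scikit-learn’s entropy-based classifier and prints it as the kind of IF/THEN rules described above.

```python
# Minimal sketch of rule/tree induction on invented restaurant data,
# using scikit-learn's entropy-based decision tree (an information-
# theoretic method in the same spirit as the article's description).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

orders = pd.DataFrame({
    "prawn_cocktail":  [1, 1, 0, 1, 0, 0, 1, 0],
    "cheap_plonk":     [1, 1, 0, 0, 1, 0, 1, 0],
    "steak_and_chips": [1, 1, 0, 0, 0, 0, 1, 1],   # the choice we try to predict
})

X = orders[["prawn_cocktail", "cheap_plonk"]]
y = orders["steak_and_chips"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)

# The fitted tree reads off as rules such as
# "IF prawn cocktail AND cheap plonk THEN steak and chips".
print(export_text(tree, feature_names=list(X.columns)))
```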

Pattern matching

Patterns can also be dug out using neural networks, computers that crudely mimic the brain’s ability to find relationships in data by being shown many examples. Such networks are first trained on data samples showing, say, the relative proportions of customers who order particular starters, main courses, drinks and desserts. The network then tries to classify each type of customer according to their preferences. At first, the classification is inaccurate, but the neural network’s algorithms allow it to learn from its mistakes, revealing relationships between, say, orders for liqueurs after roast beef dinners.
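A minimal sketch of that approach, again on invented data: a small multilayer perceptron from scikit-learn is trained on customers’ order proportions and then asked to classify a new customer. The feature columns and class labels are assumptions for illustration, not anyone’s real scheme.

```python
# Hedged sketch of a neural-network classifier for customer preferences.
# The order proportions and the two customer classes are invented.
import numpy as np
from sklearn.neural_network import MLPClassifier

# Columns: proportion of orders containing
# [roast beef, fish, liqueur, dessert wine]
X = np.array([
    [0.8, 0.1, 0.7, 0.1],
    [0.7, 0.2, 0.6, 0.2],
    [0.1, 0.8, 0.1, 0.7],
    [0.2, 0.7, 0.2, 0.6],
])
y = ["liqueur-after-roast-beef", "liqueur-after-roast-beef",
     "dessert-wine-with-fish", "dessert-wine-with-fish"]

net = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=2000, random_state=0)
net.fit(X, y)

# A new customer whose ordering pattern resembles the first group.
print(net.predict([[0.75, 0.15, 0.65, 0.15]]))
```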

Techniques such as tree induction and neural computing have been around for years. But data miners are discovering that they must do more than merely apply these old methods to huge databases. “Decisions based on data mining results may involve very large amounts of money,” says Beatriz de la Iglesia of the University of East Anglia. “And management is not enthusiastic about embracing ideas they cannot understand or analyze for themselves.”

This demand for lucidity is providing a challenge for data miners. For example, tree induction has a nasty habit of throwing up appallingly complex rules even with relatively simple databases – logical nightmares such as “IF chips AND (NOT steak AND peas AND (NOT ice-cream AND fruit cocktail)) AND…” on and on. In a recent analysis of customer behavior for a British financial institution, de la Iglesia and her colleagues Justin Debuse and Vic Rayward-Smith found that some induction programs produced huge decision trees with dozens of branches.

The situation with neural networks is even worse. Famed for their ability to find useful rules from complex and messy data, they are also notoriously opaque in their reasoning. Konrad Feldman of the London-based data analysis consultancy SearchSpace recalls developing a neural network for an Italian credit reference company that predicted which companies were most likely to file for bankruptcy with around 75 percent accuracy – a much higher score than traditional methods. “The problem was that the company then had to justify its predictions to clients, and they kept wanting to know exactly why a particular company was a bad risk – and what was a neural network anyway?”

Feldman and his colleagues started again, this time using “genetic algorithms”. They began with a set of guesses about which rules might apply, in the form of combinations of traditional financial measures such as share price to earnings ratios. Each guessed rule was then tested to see how well it performed at predicting bankruptcy. By weeding out the less successful ones and letting the better ones combine with each other, a Darwinian process of “survival of the fittest” led to increasingly reliable prediction rules.
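The toy sketch below captures the flavour of that process, though not SearchSpace’s actual system: candidate rules are thresholds on two invented financial ratios, fitness is prediction accuracy on made-up firms, and the better rules survive, combine and mutate over successive generations.

```python
# Toy genetic algorithm for evolving bankruptcy-prediction rules.
# All firms, ratios and thresholds are invented for illustration.
import random

random.seed(0)

# (price/earnings ratio, debt ratio, went_bankrupt)
firms = [(4.0, 0.9, 1), (5.5, 0.8, 1), (6.0, 0.85, 1),
         (12.0, 0.3, 0), (15.0, 0.4, 0), (9.0, 0.5, 0)]

def fitness(rule):
    """Fraction of firms correctly classified by the rule
    'bankrupt IF p/e < pe_max AND debt > debt_min'."""
    pe_max, debt_min = rule
    return sum((pe < pe_max and debt > debt_min) == bool(label)
               for pe, debt, label in firms) / len(firms)

def mutate(rule):
    pe_max, debt_min = rule
    return (pe_max + random.uniform(-1, 1),
            debt_min + random.uniform(-0.1, 0.1))

def crossover(a, b):
    return (a[0], b[1])   # combine halves of two parent rules

# Start from random guesses and evolve by survival of the fittest.
population = [(random.uniform(2, 20), random.uniform(0.1, 1.0))
              for _ in range(20)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                  # keep the better half
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(10)]
    population = parents + children

best = max(population, key=fitness)
print(best, fitness(best))
```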

Overall, the accuracy of the predictions was slightly worse than using the neural network. But Feldman’s customers still preferred the data mining process that used genetic algorithms, mainly because they could understand the rules it was using to make predictions. Understanding the origin of mined nuggets is about more than making clients feel comfortable, however. Rob Milne of Intelligent Applications, an artificial intelligence applications company in Livingston, West Lothian, points out that it can also help to protect data miners from unearthing fool’s gold.

He cites his own experiences analyzing the database of a leading financial services company offering pensions, insurance and investment policies. The company wanted to identify which customers were most likely to be poached by aggressive rivals. Milne and his colleagues began by using rule induction methods, but the results being produced did not match reality. “Suddenly, from one combination of inputs, the accuracy jumped to over 95 per cent in predicting both the customers most likely to stay and those with a propensity to leave,” recalls Milne.

Had the data mining unearthed some amazing seam of marketing gold? “Our experience made us very suspicious,” says Milne, and he and his colleagues set about looking for an explanation. It turned out that the remarkable accuracy stemmed from a quirk of market behavior that appeared during one short period. The bad news was that the rules applied only to that part of the dataset, and had no predictive power at all.

The moral of the story is clear, says Milne: the use of clear, rule-based methods let them trace the source of the spurious accuracy, and spot the fool’s gold. “If we had a black box approach to data mining – like neural networks – we would have no way to check on what basis the decisions were being made.”

Best bet

At the Thomas J. Watson Research Center in New York State, Chidanand Apte and Se Jung Hong have attacked the problem of intelligibility by using logic methods to find the simplest rules capable of spotting trends in data.

Their target was a familiar one: predicting which companies will do best on the American Securities Market. Apte and Hong used the same 40 financial indicators as those monitored by stock-brokers – such as average monthly earnings and investment opportunities. They then tried to find the simplest rules for spotting the best investment bet each month, using a technique known as disjunctive normal form logic, a way of connecting descriptions of data together so that any contradictions can be rapidly found.
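To show what such a rule looks like – with invented indicators and thresholds, not Apte and Hong’s actual ones – a rule in disjunctive normal form is simply an OR of ANDs of conditions, which makes it both easy to read and quick to test against a stock’s monthly figures.

```python
# Sketch of a rule in disjunctive normal form (an OR of ANDs).
# Indicator names and thresholds are invented for illustration.

# Each inner list is a conjunction (AND); the outer list is a disjunction (OR).
dnf_rule = [
    [("monthly_earnings", ">", 2.0), ("pe_ratio", "<", 15.0)],
    [("momentum", ">", 0.05)],
]

def satisfies(stock, rule):
    def check(value, op, threshold):
        return value > threshold if op == ">" else value < threshold
    return any(all(check(stock[name], op, t) for name, op, t in clause)
               for clause in rule)

stock = {"monthly_earnings": 2.4, "pe_ratio": 12.0, "momentum": 0.01}
print(satisfies(stock, dnf_rule))   # True: the first conjunction holds
```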

The resulting simple investment rules worked very well, turning in a 270 per cent return over five years, compared with a market average of just 110 per cent. Not surprisingly, Apte and Hong would like their technique to be taken up by an investment house in the business of buying and selling stock.

One of the biggest delights of the data-miners’ work is finding a technique that uncovers information no one would have expected to find. Shuttleworth and his colleagues at White Cross, for example, unearthed a surprise for one customer in the telecommunications business that was looking for a good way to identify users who were unlikely to pay their bills on time.

Before carrying out the exercise, Shuttleworth and the telecommunications company expected that the people most likely to have trouble paying their bills would be those on low incomes. “We discovered that the ‘urban achievers’ – white collar, good salary, college educated – turned out to be among the worst offenders.”

In another project for the same telecommunications company, the White Cross team expected to show that the company could improve profits most by trying to encourage low-usage customers to make better use of the services. “In fact, the data mining showed that the highest growth sector was high-usage customers moving to even higher usage.”

While such successes look set to trigger demand for data mining far beyond the financial sector, there is a problem holding up its progress. Most of the world’s data is still stored on paper, microfiche or word-processed documents in some obscure format. Simply reading such data at all is a major challenge facing data miners.

Epsom-based Software Scientific have recently made a major breakthrough in this problem using natural-language processing – techniques for controlling computers with normal words rather than a programming language. The result is a data-mining software package that hunts for information in ordinary text. For example, given text files of hundreds of statements taken from criminal suspects, the program uses set theory and linguistic analysis to find key facts and relationships in the data. Detectives can simply ask the computer “Who is the most likely culprit?”, and the relevant extracts from the statements appear on the screen in a few moments.

High interest

With so many organizations having the bulk of their data locked up in ordinary text, such systems are attracting interest from many quarters, including police forces, says Lea. “Although it is obviously no replacement for police officers, it can be used to make better use of resources.” While impressive, such techniques raise the spectre of data mining falling prey to hype by those who seize on every new technology like a child with a new toy. “It is true that some simple data mining work can often result in great successes, but this by no means justifies people thinking the fundamental problems are solved,” warns Fayyad.

One of the most pressing, he believes, is the fool’s gold problem. “Many patterns and trends are extractable from data – and most of these are likely to be junk, simply because the data is finite and computation is limited,” says Fayyad.

Statistical inference – techniques for drawing reliable conclusions from complex data – can help. One key idea is that the reliability of a finding is proportional to its plausibility. For example, Feldman and colleagues at SearchSpace found a connection between sales of dog food and fizzy drinks lurking in a supermarket’s database. The sheer implausibility of this connection led them to write it off as a quirk. But other connections may not be so easily dismissed.
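One simple check of that kind is “lift” – the ratio of how often two items actually appear together to how often they would by chance. The sketch below uses invented basket counts; a lift close to 1 suggests the dog food and fizzy drinks link is nothing more than coincidence.

```python
# Sanity-checking a mined association with "lift".
# All basket counts are invented for illustration.
baskets = 100_000
with_dog_food = 8_000
with_fizzy_drinks = 25_000
with_both = 2_100

support_both = with_both / baskets
p_dog_food = with_dog_food / baskets
p_fizzy = with_fizzy_drinks / baskets

lift = support_both / (p_dog_food * p_fizzy)
print(f"lift = {lift:.2f}")   # about 1.05: barely above chance co-occurrence
```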

Data mining is under threat from another bugbear of new technology: the reluctance of commercial users to trumpet any success they have using the technique. Instead, the best adverts for data mining are likely to come from those working in more open fields.

Last December, Fayyad and colleagues at the Jet Propulsion Laboratory in Pasadena announced the discovery of a slew of new quasars among those myriad blobs of light on the Palomar sky survey – courtesy of data mining. Using decision tree and rule-based methods, the team trained algorithms to classify light sources as stars, galaxies or distant quasars. They were then able to tell astronomers searching for quasars which objects might reward closer study. The result: 16 previously undiscovered ancient quasars bagged in a fraction of the telescope time usually needed.

With funding for science under so much pressure, making discoveries by mining existing data could well prove to be an idea whose time has come. As Fayyad says: “It is a true new way of doing science.”

By: Robert Matthews

Source: New Scientist 
