Datamining poised to go mainstream – Usama M. Fayyad, Ph.D.

Published by https://www.datamation.com/ on October 1, 1999

It used to be that datamining was limited to high-end database marketing firms and Global 100 firms–the kind whose online transaction processing (OLTP) systems generated millions of rows of data daily. There’s always been an aura of mystery, even magic, associated with datamining. It was a science practiced on powerful UNIX systems overseen by unsmiling statisticians and brilliant mathematicians.

Today that’s changing. Many Web sites are generating log files and e-commerce transaction files that are eminently mineable. Last month, for instance, online retail giant Amazon.com made headlines with its “purchase circles,” based on the fundamental datamining technique of affinity grouping (clustering). When retail sites suggest specific items to customers based on their past purchases, the sites are using a combination of customer relationship management (CRM) and datamining to increase their revenues.

Datamining is part of a process called knowledge discovery, where the goal is to better understand the organization’s data in order to resolve business problems or capitalize on opportunities.

Sizing things up

Consider retail shoe vendor Just for Feet Inc. (www.feet.com) of Birmingham, Ala. The company has approximately 160 superstores, in addition to 170 Athletic Attic, Athletic Lady, and Imperial Sports stores. Each store carries from 3,000 to 6,000 different shoe styles. Multiply the styles by all the different sizes, and you’ll start to appreciate what the shoe industry refers to as the “size explosion.” And what better way to take advantage of all that data than with a data warehouse/datamining initiative?

Each Just for Feet store functions as its own distribution center. With the “in” styles changing so fast, and with regions–even neighborhoods–having different hot styles, it’s not hard to realize how important it is for Just for Feet to have the right kind of shoes in stock at the right location. As a result, it made sense for the company to focus its initial datamining efforts on product rather than customer data. “You can be item-centric or customer-centric,” says David Meany, CIO, referring to alternative approaches to designing and mining Just for Feet’s terabyte-scale data warehouse. But you can’t do both at once.

Datamining purists might say that when Just for Feet generates exception reports for its buyers, that’s not genuine datamining. But the company’s buyers are thrilled with these weekly and monthly reports on sales that allow them to spend more time on the more creative aspects of their jobs–predicting fashion trends and future demand. Meany explains that Just for Feet also does “real” datamining to find answers to issues. For example, the company analyzes distribution practices to see how they impact product sell-through.

The first two phases of the company’s multiphase data warehousing/datamining initiative are now in production, built with the help of ICL Plc (www.icl.com), a global IT services company based in London. Just for Feet used ICL’s Fast Track Development Toolkit to generate the schema for an Informix Corp. Dynamic Server release 8.0 database and perform the initial data population. Currently, Meany keeps only about a year’s worth of transaction-level data in Just for Feet’s data warehouse, which is hosted on a Sun Microsystems Inc. Enterprise E6500 server. The system maintains aggregate data for 1997 and 1998.

Although the first stages of Just for Feet’s implementation have been inventory-focused, plans are already underway to expand the company’s analysis capabilities and better leverage the customer component of the data warehouse. Keeping up with the “in” styles is only part of the lure of customer data. Consumers can join the Just for Feet club, with the enticement of special savings. Membership is easy: all you have to do is enter a telephone number, and the system does a reverse lookup to determine the address. Is Meany looking forward to mining all of this customer data? You’d better believe it.

And then there are companies like Fingerhut Companies Inc. (fingerhut.com), the $2 billion firm known for its catalog, direct marketing, and telemarketing ventures, that have spent years honing the process of datamining. The Minnetonka, Minn.-based company’s marketing analytics group maintains several hundred generic models that are used to build targeted segmentation models that generate mailing lists for catalogs.

Typically, the datamining team combines four models: a response model (will the customer respond?), a purchase model (how much will the customer buy?), a return model (is the customer likely to return merchandise?), and a payment model (is the customer a credit risk?). The company maintains data (almost 1,400 variables per customer) on more than 30 million customer households in a data warehouse that tops 7 terabytes.
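
One way to picture how four such models might be combined is to fold their outputs into a single expected value per household and mail only where that value exceeds the cost of a catalog. The sketch below is illustrative only: the article does not describe Fingerhut’s actual formula, so the `CustomerScores` fields, the `expected_value` calculation, and the `mailing_cost` threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class CustomerScores:
    """Hypothetical per-household outputs of the four models."""
    household_id: str
    p_response: float      # response model: probability the customer responds
    expected_order: float  # purchase model: expected order value in dollars
    p_return: float        # return model: probability merchandise comes back
    p_default: float       # payment model: probability the customer does not pay


def expected_value(s: CustomerScores) -> float:
    """Expected collected revenue from mailing one catalog (assumed formula)."""
    kept = s.expected_order * (1.0 - s.p_return)   # revenue not lost to returns
    collected = kept * (1.0 - s.p_default)         # revenue actually paid
    return s.p_response * collected


def build_mailing_list(scores, mailing_cost):
    """Return household ids worth mailing, best prospects first."""
    ranked = sorted(scores, key=expected_value, reverse=True)
    return [s.household_id for s in ranked if expected_value(s) > mailing_cost]


if __name__ == "__main__":
    households = [
        CustomerScores("A", 0.10, 100.0, 0.10, 0.05),
        CustomerScores("B", 0.01, 50.0, 0.20, 0.50),
        CustomerScores("C", 0.20, 80.0, 0.00, 0.00),
    ]
    print(build_mailing_list(households, mailing_cost=1.0))
```

Multiplying the scores treats the four models as independent, which is a simplification; a production segmentation model would likely account for interactions among them.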

The players, new and old

Although datamining isn’t new technology, it has only recently emerged from academia, research labs, and several dozen vendors. The availability of data warehouses and cheap storage has certainly contributed to the trend, but today’s keen interest in datamining is largely driven by the explosive growth of e-commerce. Sales and marketing departments want to leverage the data gleaned from Web traffic patterns to do one-to-one marketing.

If the prospect of mining customer data to increase revenues, reduce risk, or detect fraud isn’t enough to propel datamining into the mainstream, there’s always the Microsoft factor. Microsoft Corp. ventured into datamining when the Redmond, Wash., software maker announced work on the OLE DB Extensions for Data Mining specification in May 1999. The project is a joint effort between the Microsoft SQL Server group and Microsoft Research’s Data Mining & Exploration group led by Usama Fayyad in consultation with a select group of vendors (see “Who’s who in datamining”). OLE DB is a specification for a set of data access interfaces designed to enable access to heterogeneous data sources. It’s considered the successor of open database connectivity (ODBC) and has already been “extended” for online analytic processing (OLAP) and a variety of vertical markets.

The Microsoft OLE DB for DM endeavor will likely spawn compliant datamining products sometime in 2000. But that doesn’t mean you can’t do datamining against SQL Server (or any other database) today. In fact, Microsoft’s Site Server 3.0 already includes features such as an intelligent “cross-sell” based on historical sales baskets in stores, the contents of the current shopper basket, and the browsing behavior of the shopper. Site Server ranks products that are likely to be most interesting to the shopper.
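
A basket-driven cross-sell of this kind can be approximated by counting how often pairs of products appear together in past baskets and then ranking the products that most often accompany what the shopper already holds. The following is a minimal sketch of that idea, not Site Server’s actual algorithm: the function names `count_pairs` and `cross_sell` are hypothetical, and browsing behavior is left out entirely.

```python
from collections import Counter
from itertools import combinations


def count_pairs(historical_baskets):
    """Count how often each unordered pair of products was bought together."""
    pair_counts = Counter()
    for basket in historical_baskets:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order
        for a, b in combinations(sorted(set(basket)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def cross_sell(current_basket, pair_counts, top_n=3):
    """Rank products not yet in the basket by co-purchase count with basket items."""
    in_basket = set(current_basket)
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a in in_basket and b not in in_basket:
            scores[b] += n
        elif b in in_basket and a not in in_basket:
            scores[a] += n
    # Highest score first; ties broken alphabetically so results are repeatable
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


if __name__ == "__main__":
    history = [
        ["shoes", "socks"],
        ["shoes", "socks", "laces"],
        ["shoes", "laces"],
        ["hat", "socks"],
    ]
    counts = count_pairs(history)
    print(cross_sell(["shoes", "socks"], counts))
```

Raw co-occurrence counts favor popular items; a fuller treatment would normalize by how often each product sells on its own.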

Microsoft isn’t the only firm with interdependent products. IBM Corp.’s SurfAid Analytics (surfaid.dfw.ibm.com) relies on the company’s own Intelligent Miner for Data to deliver sophisticated Web site analytics for a fixed monthly fee that ranges from under $1,000 to about $30,000. SurfAid is a small, entrepreneurial e-business within IBM Global Services, which is based in Somers, N.Y. Clients upload daily Web log files to the SurfAid FTP site. RS/6000 AIX scripts handle preprocessing, which includes “stitching back together” navigation paths of individual Web visitors. Then, one of SurfAid’s RS/6000s runs the IBM Intelligent Miner datamining tool kit against the customer file, which may contain over 150 million hits per day. The result is a daily report that customers can access at a private URL. Because IBM DB2 for OLAP is running behind the scenes, users can “slice and dice” the data starting with almost a dozen different reports.
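
The “stitching” step can be illustrated by grouping raw hits by visitor, ordering them by time, and starting a new path whenever the visitor has been idle too long. This is a rough sketch under stated assumptions, not SurfAid’s scripts: it assumes each hit already carries a visitor identifier (such as a cookie value), and the 30-minute `SESSION_GAP_SECONDS` timeout and the `stitch_paths` name are hypothetical.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout between visits


def stitch_paths(hits, gap=SESSION_GAP_SECONDS):
    """Rebuild navigation paths from unordered log hits.

    hits: iterable of (visitor_id, timestamp_in_seconds, url) tuples.
    Returns a list of (visitor_id, [url, ...]) tuples, one per visit.
    """
    by_visitor = defaultdict(list)
    for visitor, ts, url in hits:
        by_visitor[visitor].append((ts, url))

    paths = []
    for visitor in sorted(by_visitor):
        events = sorted(by_visitor[visitor])  # chronological order
        current = [events[0][1]]
        last_ts = events[0][0]
        for ts, url in events[1:]:
            if ts - last_ts > gap:
                # Idle too long: close the current path and start a new one
                paths.append((visitor, current))
                current = []
            current.append(url)
            last_ts = ts
        paths.append((visitor, current))
    return paths


if __name__ == "__main__":
    log = [
        ("v2", 100, "/home"),
        ("v1", 0, "/home"),
        ("v1", 60, "/shoes"),
        ("v1", 4000, "/home"),
        ("v2", 200, "/cart"),
    ]
    for visitor, path in stitch_paths(log):
        print(visitor, " > ".join(path))
```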

IBM, by the way, shipped its first datamining tool kit in 1995. Today, the company’s Intelligent Miner for Data and Intelligent Miner for Text are used by customers with large DB2 databases. IBM has also developed a graphical query language, query by image content (QBIC), which lets users make queries of large image databases based on visual image content–properties such as color percentages, color layout, and textures occurring in the images. It is used with Digital Library to do graphical datamining.

Shortly after Microsoft parted the curtains on its datamining spec, Oracle Corp. announced its purchase of leading datamining vendor Thinking Machines Corp. and its Darwin product family. The Redwood City, Calif.-based company hasn’t made any announcements about how Darwin will be integrated into its product line. Although Oracle already has its own text mining product called Oracle ConText, it’s likely that the company will weave Darwin into its marketing campaign and Oracle Applications product line. In another significant move toward consolidation, SPSS Inc. (www.spss.com) acquired Integral Solutions Ltd. (ISL) and its popular Clementine product.

Darwin and Clementine are two of six datamining tool suites that Stamford, Conn.-based Gartner Group, in an August 1999 report on datamining, identified as key players in the generic datamining market. The other four are Angoss’ Knowledge Suite, IBM’s Intelligent Miner for Data, SAS’s Enterprise Miner, and SGI’s MineSet.

In the audio mining field, speech vendors such as Dragon Systems (http://dragonsystems.com) and Virage Inc. (http://www.virage.com) are working with all the major database vendors–including IBM–to support the technique, which is scheduled to be available later this year. Audio mining might be used to monitor call center traffic, customer service calls, or company voice mail (privacy issues aside) looking for anything from profanity to recurring customer service complaints to suspected industrial espionage.

E-commerce, CRM, and data warehousing will all help propel the datamining market forward. Standards such as extensible markup language (XML), the predictive modeling markup language (PMML), the cross-industry standard process for datamining (CRISP-DM), as well as Microsoft’s OLE DB for DM, will help, too. The evolving technology combined with such success stories as Just for Feet and Fingerhut will certainly drive the market into the mainstream.
