Data Mining is a term that has become a buzzword in the last few years. However, it’s only part of something much more interesting: Knowledge Discovery.
The need to extract knowledge automatically from large databases is becoming ever more pressing, given the volume of data that accumulates continuously and whose processing consumes an increasing amount of resources.
Data mining is one of the answers to this problem. Usama Fayyad, in the article “From Data Mining to Knowledge Discovery in Databases” (available in PDF format), defines Knowledge Discovery in Databases (KDD) and Data Mining as:
- KDD: “The process of discovering useful knowledge from data”
- Data Mining: “The application of specific algorithms for extracting patterns from data”
Fayyad and his colleagues set out a series of important concepts that lead to an operational definition of Knowledge, in a way that can be formalised mathematically. They are worth reviewing (in abridged form; see the book “Information Visualisation in Data Mining and Knowledge Discovery”, chap. 21).
- Data: A set of facts F.
- Pattern: An expression E in a language L that describes a subset d of the data, provided it is simpler than just enumerating all the facts in d.
- Validity: The certainty that the pattern still holds when applied to new data. It’s defined as a function C(E, F) that assigns a score (a numerical value) to the pattern.
- Novelty: A function N(E, F) that returns true if the pattern is not just a recombination of already detected patterns, and false otherwise.
- Utility: This definition is more slippery and subjective. A pattern is useful if it allows us to act or decide upon it. Again, it can be represented by a function U(E, F) that quantifies the utility, for example the money saved or earned by discovering a purchase pattern in a supermarket.
- Understandability: Patterns have to be easily understood by human beings. Again, this is a subjective concept that is difficult to evaluate. Fayyad suggests the simplicity of the pattern as a quantitative measure, again a function S(E, F) that returns a value.
All these concepts finally lead to the important concept of the “interestingness” of a pattern. It is defined as a combination of Validity, Novelty, Utility and Understandability that allows us to assess and classify patterns.
i = I(E, F, C, N, U, S)
Needless to say, some aspects of this concept need human intervention, since they admit no objective quantification. Interestingness is fundamental to the definition of Knowledge.
- Knowledge: A pattern E is called knowledge if its interestingness i is above a certain threshold “t” defined by the user.
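The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say how the function I combines its inputs, so the weighted sum, the weights, the [0, 1] scaling and the rule that a non-novel pattern scores zero are all hypothetical choices, as are the names PatternScores, interestingness and is_knowledge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternScores:
    """Scores of one pattern E over the facts F (illustrative values only)."""
    validity: float    # C(E, F), assumed here to be scaled to [0, 1]
    novelty: bool      # N(E, F), true if the pattern is genuinely new
    utility: float     # U(E, F), assumed here to be scaled to [0, 1]
    simplicity: float  # S(E, F), assumed here to be scaled to [0, 1]


def interestingness(s: PatternScores, weights=(0.4, 0.3, 0.3)) -> float:
    """One hypothetical choice of I: a weighted sum, zero for non-novel patterns."""
    if not s.novelty:
        return 0.0
    w_c, w_u, w_s = weights
    return w_c * s.validity + w_u * s.utility + w_s * s.simplicity


def is_knowledge(s: PatternScores, t: float) -> bool:
    """A pattern is knowledge if its interestingness i is above the user threshold t."""
    return interestingness(s) > t


# Example: a valid, novel, moderately useful and fairly simple pattern.
scores = PatternScores(validity=0.9, novelty=True, utility=0.5, simplicity=0.8)
print(is_knowledge(scores, t=0.7))
```

In practice the utility and simplicity scores would have to come from a person or from a domain-specific model, which is exactly the human intervention the text mentions.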
Although this definition may seem far removed from our experience of what knowledge is, in reality it isn’t. Knowledge is made up of those patterns that we have learnt to detect and have stored because we can apply them to new data and, hence, predict the behaviour of phenomena or of the people around us.
This is where the utility of knowledge comes from. A clear example is medical diagnosis. Every illness has a set of symptoms, a pattern, that differentiates it from other illnesses, allowing the physician to diagnose it and prescribe the appropriate treatment. It takes years to build up the stock of clinical patterns that allows him or her to become a good diagnostician.
Fraud follows patterns that deviate from the common behaviour of legitimate transactions in financial databases. In marketing, it is important to discover groupings of users and their behaviour in order to define specific products and/or services with predictable results. For example, users who buy item A and item B will probably also buy item C.
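The purchase example can be made concrete with a small sketch. It estimates how often baskets containing A and B also contain C, using the support and confidence measures common in association-rule mining. Those two terms, the function names and the toy baskets are assumptions introduced here for illustration; none of them comes from the article.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the observed transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule cannot be assessed
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Made-up shopping baskets, purely for illustration.
baskets = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B"},
]

# How often do buyers of A and B also buy C?
rule_confidence = confidence(baskets, {"A", "B"}, {"C"})
print(f"confidence(A, B -> C) = {rule_confidence:.2f}")
```

A confidence of this kind is one possible ingredient of the validity score C(E, F) of the rule; whether the rule is useful or novel still has to be judged separately.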
In the end, Knowledge is not as magical as it sometimes appears: we have the means to approach it and to find interesting patterns in many fields.
Author: Juan C. Dürsteler