
A variety of supervised classification algorithms have been developed and applied successfully to problems in machine learning and data mining. In general, these methods require a large amount of labelled data to achieve high performance. That is, supervised algorithms must be supplied with a large number of training examples, each pre-labelled with the correct class, in order to build a classifier that can accurately predict the class of future unlabelled examples. This labelling is performed manually by a human annotator, and is thus slow and error-prone; it is also expensive, as it requires an expert knowledgeable in the application domain. For example, spam filtering requires a large number of emails, each of which must be manually classified as either spam or useful.
Semi-supervised algorithms attempt to circumvent this limitation by requiring only a small amount of labelled data and supplementing it with unlabelled data, which is cheaply available. However, this approach has focused on unlabelled data in the same format as the labelled data.
A great deal of valuable information usually exists in other forms, not necessarily in the same format as the instances in the original problem space. For example, company web pages and web pages that interest a particular user may help in spam filtering: words typical of spam emails may be found in the sales pitches of commercial sites, while information found in useful emails is likely to occur on websites the user considers interesting. This alternative data is called 'background knowledge', and semi-supervised learning can be extended to exploit it.
In this thesis, we show that existing machine learning methods can be improved through the incorporation of background knowledge. We also demonstrate the value of background knowledge that has already been categorized by the data source, even if its categories do not correspond exactly to the classes of the given problem task. We introduce a new data quality measure, 'ranked attribute information gain', which can be used to reliably predict the relative usefulness of different types of background knowledge, thereby helping to select the most appropriate data set for a given classification task. Finally, we introduce a new approach, applicable to most semi-supervised learning algorithms, that uses the unlabelled data to assist in the labelling process, and we show empirically that it can improve the classification of short text strings.