Max Bramer

Professor Max Bramer

Emeritus Professor of Information Technology, University of Portsmouth, UK
Honorary Secretary, International Federation for Information Processing
Chair, British Computer Society Specialist Group on Artificial Intelligence

Research Students Supervised: Past and Present

Victoria Uren

Combining Text Categorisers

In over thirty years of study, a great number of text categorization algorithms have been developed using a standard paradigm of normalization, feature selection, learning and prediction. Progress in improving categorizer performance by using different machine learning algorithms has, however, been quite slow.

This investigation started with the observation that, although some categorizers perform better than others, when the performance of a number of categorizers is ranked on a set of individual classes, the rankings resemble each other. This implies that class-dependent factors influence performance as well as algorithmic factors. This study addresses aspects of the data which might affect all types of categorizer.

A model of noise was developed using an experimental method based on standard machine learning techniques. This makes it clear that only features with a semantic link to the subject act strongly as noise. The model was incorporated into a combined system, JANE, with some success. Combined systems have not produced substantial improvements in performance for text categorization. Majority voting systems were selected as the vehicle of study, and existing models for safety-critical software were used to predict performance. Experiments indicated that combined systems do not improve significantly on the performance of single categorizers because individual records exist which have a very high probability of producing errors.
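The effect of such records on a majority vote can be illustrated with a minimal sketch. It assumes three hypothetical categorizers whose errors are independent once the record is fixed, and it compares two made-up record mixes that give each single categorizer the same 10% average error rate. The function names and all numbers are illustrative assumptions, not taken from the thesis or from JANE.

```python
from itertools import product

def majority_error(error_probs):
    """Probability that a strict majority of voters is wrong on one record,
    assuming the voters err independently on that record."""
    n = len(error_probs)
    total = 0.0
    for outcome in product([0, 1], repeat=n):  # 1 means "this voter is wrong"
        if sum(outcome) > n / 2:
            p = 1.0
            for wrong, e in zip(outcome, error_probs):
                p *= e if wrong else 1.0 - e
            total += p
    return total

def expected_error(record_mix, n_voters=3):
    """record_mix: list of (fraction of records, per-voter error probability
    on those records). Returns the error rate of the majority vote."""
    return sum(frac * majority_error([e] * n_voters) for frac, e in record_mix)

# Hypothetical mixes; each gives a single categorizer a 10% average error rate.
uniform = [(1.0, 0.10)]                    # every record equally hard
hard_records = [(0.9, 0.01), (0.1, 0.91)]  # 0.9*0.01 + 0.1*0.91 = 0.10

uniform_error = expected_error(uniform)
hard_error = expected_error(hard_records)
```

With these illustrative numbers the vote reduces the error rate from 10% to about 2.8% when difficulty is uniform, but only to about 9.8% when errors concentrate on a small set of hard records.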

Think-aloud studies with experienced human indexers were used to investigate the causes of coincident errors. The findings point to records which contain insufficient information for a quantitative method to reach the correct conclusion.

A heuristic approach to selecting categorizers for combined systems is proposed and tested. The approach combines algorithms from the standard paradigm with semantically driven methods. Statistically significant improvements to performance were achieved.

University of Portsmouth
School of Computer Science and Mathematics
PhD Awarded
