In conjunction with IIiX 2008
Date: 18th October 2008
Venue: BCS London Office


We aim to bring together people from different research communities interested in exploring how corpus characteristics affect the behaviour of techniques in information retrieval and natural language processing, and to set out a roadmap for a shared research agenda.

It is well known in NLP and IR that the effectiveness of a technique depends on both the data on which it is deployed and its match with the task at hand. In 1973, Spärck Jones attributed differing degrees of success at automatic classification to differences in dataset characteristics. Since Croft and Harper (1979), IR performance has repeatedly been related to collection size and other features, though no upper bound has been found.

The importance of data and task dependencies has been highlighted in IR, anaphora resolution, automatic summarization and, more recently, word sense disambiguation. Many web and enterprise retrieval systems rely on URL properties, link graph properties, click streams, and so on, and their performance depends on the degree to which this evidence is present and meaningful in a particular corpus.

A systematic exploration of features that can effectively characterise corpora has been missing from IR/NLP research. This gap creates problems for the replicability of experimental results and for the development of applications.

The time is right to pursue this dependence systematically and to track the effect of dataset profile on technique performance. Over the past 15 years, the approaches of several subject areas have converged with IR, as large corpora and test collections have assumed central importance in research methodologies. These areas have highlighted issues surrounding the role of data. A BCS-IRSG workshop in London is an ideal opportunity to start articulating a cross-disciplinary research agenda, building on the high concentration in Western Europe of IR, NLP and, potentially, semantic web groups.