Tech Reports
Tech Report kmi-11-02 Abstract
Unsupervised data linking using a genetic algorithm
Techreport ID: kmi-11-02
Date: 2011
Author(s): Andriy Nikolov, Mathieu d'Aquin, Enrico Motta
Because commonly accepted identifiers for data instances in semantic datasets (such as ISBN codes or DOI identifiers) are often not available, links between overlapping datasets on the Web are generally discovered using fuzzy similarity measures. Configuring such measures, i.e. deciding which similarity function to apply to which data properties and with which parameters, is often a non-trivial task that depends on the domain, the ontological schemas, and the formatting conventions in the data. Existing solutions rely either on the user's knowledge of the data and the domain or on machine learning that discovers these parameters from training data. In this report, we present a novel approach to data linking which relies on the unsupervised discovery of the required similarity parameters. Instead of using labeled training data, the method takes into account several desired properties which the distribution of output similarity values should satisfy. The method incorporates these properties into a fitness criterion used in a genetic algorithm to establish similarity parameters that maximise the quality of the resulting linkset according to the considered properties. Experiments on benchmarks as well as real-world datasets show that such an unsupervised method can reach the same levels of performance as manually engineered methods. We also examine how the different parameters of the genetic algorithm and the fitness criterion affect the results for different datasets.
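The general idea can be illustrated with a small Python sketch. It is not the implementation from the report: the abstract does not state the fitness criterion, so the `fitness` function below is an assumed stand-in that rewards linksets in which no instance is linked twice and most source instances are covered. The toy records, the property names, the genome layout (one weight per property plus a threshold), and all function names are hypothetical.

```python
import random
from difflib import SequenceMatcher

# Two tiny hypothetical datasets describing the same books; no labeled links.
SOURCE = [
    {"title": "Dune", "author": "Frank Herbert"},
    {"title": "Neuromancer", "author": "William Gibson"},
    {"title": "Emma", "author": "Jane Austen"},
]
TARGET = [
    {"title": "Dune (novel)", "author": "F. Herbert"},
    {"title": "Neuromancer", "author": "Gibson, William"},
    {"title": "Emma: A Novel", "author": "J. Austen"},
]
PROPERTIES = ["title", "author"]


def string_sim(a, b):
    """Fuzzy string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def linkset(genome):
    """Genome = one weight per property, then an acceptance threshold."""
    *weights, threshold = genome
    total = sum(weights) or 1.0  # avoid division by zero for all-zero weights
    links = []
    for i, s in enumerate(SOURCE):
        for j, t in enumerate(TARGET):
            score = sum(
                w * string_sim(s[p], t[p]) for w, p in zip(weights, PROPERTIES)
            ) / total
            if score >= threshold:
                links.append((i, j, score))
    return links


def fitness(genome):
    """Assumed unsupervised criterion (not the report's): uses no labels,
    only the shape of the output linkset."""
    links = linkset(genome)
    if not links:
        return 0.0
    sources = {i for i, _, _ in links}
    targets = {j for _, j, _ in links}
    one_to_one = min(len(sources), len(targets)) / len(links)  # 1.0 if no instance is linked twice
    coverage = len(sources) / len(SOURCE)  # share of source instances that got a link
    return one_to_one * coverage


def crossover(a, b, rng):
    """Uniform crossover: each gene comes from either parent."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def mutate(genome, rng, rate=0.3):
    """Gaussian perturbation of some genes, clamped to [0, 1]."""
    return [
        min(1.0, max(0.0, g + rng.gauss(0, 0.1))) if rng.random() < rate else g
        for g in genome
    ]


def evolve(generations=20, pop_size=20, seed=0):
    rng = random.Random(seed)
    n_genes = len(PROPERTIES) + 1
    population = [[rng.random() for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 4]  # keep the best quarter unchanged
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    return max(population, key=fitness)
```

Calling `evolve()` returns the best genome found, i.e. the property weights followed by the threshold, and `linkset(evolve())` gives the corresponding candidate links. Population size, mutation rate, and number of generations are arbitrary choices here; the report studies how such parameters affect the results.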



