Musings on semantic enrichment – 1

6 11 2014

Semantic enrichment. Such a grand phrase. By the end of this post, I will hopefully have described what we mean by it in this context. I have had several conversations with colleagues in our Social Science Research Unit (SSRU) in which we considered whether we could join forces and devise a service that would let us enrich our metadata with our specialist vocabulary while requiring less human intervention than manual indexing usually demands.

For some years, the IOE has used an in-house thesaurus, the London Education Thesaurus (LET), whose purpose and history are explained here, to subject index mainly printed works. It’s available here under a Creative Commons Licence. This was a relatively expensive service which was becoming increasingly hard to justify as the move towards digital content progressed. The fact that the indexed records covered an ever smaller subset of the content to which our library had access was also working against the case for continuing with it.

However, we were aware that commercial abstracting and indexing services still exist, and we could see that there was still a potentially valuable query expansion service to provide: one which could not adequately be met by freestyle tagging and where controlled vocabulary was still valued, particularly by postgraduate or doctoral researchers working in a particular sector. Could we reduce the indexing effort by creating value-added tools which might help address this? If so, it could be an attractive proposition to a variety of knowledge organisations, particularly those who would like to retrospectively index large corpora of digital content.

SSRU have experienced information scientists who already work in the area of creating systematic reviews and have an understanding of the computational challenges involved in collating data from myriad systems and presenting it in an ordered format for a specific purpose. Our thoughts centred around the following notion: could we create a model in which a machine was trained using an existing vocabulary (in our case LET) which had already been applied by humans to a data set (i.e. the IOE library catalogue)? Would it then be possible to apply this to full text documents whose metadata would benefit from such enrichment? It was envisaged that we might create a semi-automated process whereby potential terms were identified and presented in a meaningful visual format, one which the human brain understands more intuitively than a machine does, so that a person could correct the machine and thus allow it to learn from its mistakes. The final iteration would perhaps be a list of terms suggested for a document, together with an intuitive interface through which a subject indexer or specialist in the field could either accept or reject the terms proposed. Ideally, the machine would continue to learn until one day it would simply accept a document and issue accurate terms, and these would be used to enrich the metadata.
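To make the idea a little more concrete, here is a minimal sketch in Python of the kind of loop we had in mind. It is purely illustrative and is not something we have built: the class name `TermSuggester`, the simple word-and-term co-occurrence scoring, the sample catalogue records and the subject terms in them are all assumptions made up for this example, not actual LET terms or IOE catalogue data.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class TermSuggester:
    """Toy co-occurrence model: words vote for terms they were indexed with."""

    def __init__(self):
        # word -> Counter mapping each thesaurus term to a vote count
        self.votes = defaultdict(Counter)

    def learn(self, text, terms, weight=1):
        """Record that a human indexer applied `terms` to `text`."""
        for word in set(tokenize(text)):
            for term in terms:
                self.votes[word][term] += weight

    def suggest(self, text, top_n=3):
        """Return up to `top_n` terms with a positive score, best first."""
        scores = Counter()
        for word in set(tokenize(text)):
            for term, count in self.votes.get(word, {}).items():
                scores[term] += count
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        return [term for term, score in ranked if score > 0][:top_n]

    def feedback(self, text, term, accepted):
        """Fold an indexer's accept/reject decision back into the model."""
        self.learn(text, [term], weight=1 if accepted else -1)


# Hypothetical catalogue records: (title, terms applied by a human indexer).
catalogue = [
    ("Assessment practices in primary schools", ["Assessment", "Primary education"]),
    ("Teacher training for primary classrooms", ["Teacher education", "Primary education"]),
    ("Formative assessment and feedback", ["Assessment"]),
]

suggester = TermSuggester()
for title, terms in catalogue:
    suggester.learn(title, terms)

doc = "Primary school assessment policy"
before = suggester.suggest(doc)

# The indexer rejects one of the proposed terms, and the model adjusts.
suggester.feedback(doc, "Teacher education", accepted=False)
after = suggester.suggest(doc)

print(before)
print(after)
```

In this toy run the rejected term drops out of the second list of suggestions, which is the "learning from its mistakes" step in miniature. A real service would need a far richer model and would work on full text rather than titles, but the accept or reject loop would have the same shape.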

Now, we were not naïve enough to think that achieving this utopian dream would be anything but challenging; nevertheless, we did feel that it would benefit from further investigation.

In my next post, I will discuss our progress and findings…