Musings on semantic enrichment – 1

6 11 2014

Semantic enrichment. Such a grand phrase. By the end of this post, I will hopefully have described what we mean by it in this context. I have had several conversations with colleagues in our Social Science Research Unit (SSRU) in which we considered whether we could join forces and devise a service which would allow us to enrich our metadata with our specialist vocabulary, in a way which required less human intervention than manual indexing demands.

For some years, the IOE has used an in-house thesaurus called the London Education Thesaurus (LET), whose purpose and history is explained here, to subject-index mainly printed works. It’s available here under a Creative Commons Licence. This was a relatively expensive service which was becoming increasingly hard to justify as the move towards digital content progressed. The fact that the indexed records applied to an ever smaller subset of the content to which our library had access also worked against the case for continuing.

However, we were aware that commercial abstracting and indexing services still exist, and could see that there was still a potentially valuable query expansion service to provide which could not adequately be met by freestyle tagging, and where controlled vocabulary was still valued, particularly by postgraduate or doctoral researchers working in a particular sector. Could we reduce the indexing effort by creating value-added tools which might help address this? If so, it could be an attractive proposition to a variety of knowledge organisations, particularly those who would like to retrospectively index large corpora of digital content.

SSRU have experienced information scientists who already work in the area of creating systematic reviews and understand the computational challenges involved in collating data from myriad systems and presenting it in an ordered format for a specific purpose. Our thoughts centred around the following notion: could we create a model in which a machine was trained using an existing vocabulary (in our case LET) which had already been applied by humans to a data set (i.e. the IOE library catalogue)? Would it then be possible to apply this to full text documents whose metadata would benefit from such enrichment? We envisaged a semi-automated process whereby potential terms were identified and presented in a meaningful visual format, which the human brain understands more intuitively than a machine, in order to train the machine and thus allow it to learn from its mistakes. The final iteration would perhaps be a list of terms suggested for a document and an intuitive interface by which a subject indexer or specialist in the field could accept or reject the proposed terms. Ideally, the machine would continue to learn until one day it would simply accept a document, issue accurate terms, and these would be used to enrich the metadata.
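The general shape of the idea can be sketched very crudely: learn, for each thesaurus term, which words tend to co-occur with it in records a human has already indexed, then score the terms against a new document and let a human accept or reject the suggestions. A toy sketch (the catalogue records, terms and threshold below are all invented for illustration, and bear no relation to the real LET or our actual approach):

```python
from collections import Counter, defaultdict

def train(records):
    """records: list of (text, assigned_terms) pairs from an indexed catalogue.
    Builds, per thesaurus term, a bag of words seen alongside that term."""
    profiles = defaultdict(Counter)
    for text, terms in records:
        words = text.lower().split()
        for term in terms:
            profiles[term].update(words)
    return profiles

def suggest(profiles, text, threshold=2):
    """Score each known term by word overlap with the new document and
    propose those above the threshold for a human to accept or reject."""
    words = set(text.lower().split())
    scores = {t: sum(c for w, c in prof.items() if w in words)
              for t, prof in profiles.items()}
    return sorted(t for t, s in scores.items() if s >= threshold)

# Toy catalogue: texts with terms a human indexer already applied.
catalogue = [
    ("primary school literacy teaching methods", ["Literacy", "Primary education"]),
    ("secondary school science curriculum", ["Curriculum", "Secondary education"]),
    ("literacy intervention in primary classrooms", ["Literacy", "Primary education"]),
]
profiles = train(catalogue)
print(suggest(profiles, "a study of literacy in primary school settings"))
```

A real system would of course need proper feature weighting and a feedback loop, but even this crude overlap scoring illustrates the accept/reject workflow we had in mind.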

Now it would be inaccurate to say that we were naïve enough to think that it would be anything but challenging to actually achieve this utopian dream, but nevertheless, we did feel that it would benefit from further investigation.

In my next post, I will discuss our progress and findings…

That iffy EPrints full text search – getting help from Primo

17 07 2014

We run three EPrints instances here and one thing EPrints is not strong on is phrase searching. That sounds like a minor issue, but when you have a corpus of born digital material which has been ingested with only limited metadata, it becomes rather important. The default EPrints simply treats a phrase search as a simple boolean AND search. We decided we needed something better – more like what a Google searcher might expect. At the same time we were working on our Discovery project, and EPrints metadata was being ingested into our chosen discovery system Primo, from Ex Libris, branded here as IOE Librarysearch. This is a relatively trivial operation involving use of OAI-PMH. We had always intended to ingest our full text documents into Primo too – a discovery layer without some of your core content is something of a misnomer. We understood that this was achievable as long as we could expose the urls of the full text documents in the OAI metadata. That itself was no problem. However, it turned out that getting this content into Primo and indexed was rather more tricky. The initial approach was to grant Primo admin access to EPrints and allow it to use the EPrints API to populate a special Oracle table set up for the purpose. Unfortunately, the indexing software being used to extract the content from the pdfs proved unreliable. We were unable to ascertain why certain files indexed fine whilst others produced unintelligible errors and were skipped. We finally decided that this approach was going to be neither sustainable nor complete and therefore had to think of something different.
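For reference, the OAI-PMH side really is trivial: a harvester just issues ListRecords requests against the repository's OAI endpoint and follows resumption tokens until none remains. A minimal sketch of building such a request (the endpoint URL is a made-up placeholder, not our live repository):

```python
from urllib.parse import urlencode

def oai_list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL. The first request names a
    metadataPrefix; follow-up requests carry only the resumptionToken
    returned in the previous response."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urlencode(params)

# Hypothetical EPrints OAI endpoint, for illustration only
print(oai_list_records_url("https://eprints.example.ac.uk/cgi/oai2"))
```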

My next thought was: what about simply including the full text in the metadata by extracting it ourselves, using open source tools which we knew to work reliably? I started by creating a new EPrints field called fulltext and linking it to our single in-use content type. Here are the field settings:

Multiple values: Yes
Type: longtext
Required: No
Volatile field: Yes
Include in XML export: No
Index: No
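For the record, those settings correspond to a field definition in the repository's configuration. A sketch of what that might look like (assuming the EPrints 3.3-style $c->add_dataset_field call; the file name and exact property names may differ in your version, so treat this as illustrative only):

```perl
# cfg.d/fulltext_field.pl -- sketch, not verified against a live install
$c->add_dataset_field( 'eprint', {
    name          => 'fulltext',
    type          => 'longtext',
    multiple      => 1,     # Multiple values: Yes
    required      => 0,     # Required: No
    volatile      => 1,     # Volatile field: Yes
    export_as_xml => 0,     # Include in XML export: No
    text_index    => 0,     # Index: No
});
```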

Next, we needed a special perl script (to be run as a one-off) which would extract text from all relevant existing documents in the repository. I considered creating an EPrints plugin, but that didn’t seem to fit the requirement, being more suited to actions performed on a single record than a batch. A prerequisite for the perl script was the existence of the open source pdf toolkit (pdftk). On the test EPrints server I was using, this was already installed.

The basics of the script were to loop through the relevant records, call pdftk to extract the text, and then chunk it into portions for adding to the metadata records. The chunking is needed because the maximum length of the longtext field is 65000 bytes, and many of our pdf documents exceed that. I successfully ran this on my small test set of records and the result is a very horrible looking long record. That itself is easily resolved by configuring EPrints not to display the field; we wouldn’t want to display it anyway, as it would show all the nasty OCR errors that are often present. There is presumably a loading implication for larger records which may affect performance, but I have not had time to test that yet.
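The chunking step itself is simple enough to sketch (in Python here rather than the perl of the actual script, with the extraction step left out). Splitting on whitespace keeps each chunk within the byte limit without cutting words in half:

```python
MAX_FIELD_BYTES = 65000  # the longtext field limit mentioned above

def chunk_text(text, limit=MAX_FIELD_BYTES):
    """Split extracted full text into portions small enough for one value
    of the multiple-valued fulltext field. Splits on whitespace, so runs
    of spaces/newlines are normalised; a single word longer than the
    limit would still produce one oversized chunk (a tolerable edge case
    for OCR text)."""
    chunks, current, size = [], [], 0
    for word in text.split():
        extra = len(word.encode("utf-8")) + (1 if current else 0)  # +1 for the joining space
        if current and size + extra > limit:
            chunks.append(" ".join(current))
            current, size = [word], len(word.encode("utf-8"))
        else:
            current.append(word)
            size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks

# demonstration with a tiny 5-byte limit
print(chunk_text("a a a a a a a a a a", limit=5))
```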

The next thing is to ensure that newly added pdfs similarly get text extracted and added to the metadata record. In this case, an EPrints plugin is going to be the answer. You could either have a button near the file upload screen in the document editing screen, or possibly automate the running of this following a save action. I have not had time to develop this yet.

Next, we need to adjust the Primo Pipe for this resource in order to ingest the fulltext field and ensure it is searchable by Primo. At the time of writing, this is not resolved. It looks like an enrichment plugin needs to be set up, as described on this Ex Libris Developers page, but we are awaiting advice from Ex Libris as to how to achieve this.

Finally, the pièce de résistance: we want to improve searchability. We already run Primo, which has the sort of retrieval we are looking for (phrase searching supported, for example), so the idea was to use the Primo API to replace the search in EPrints. Could this be done?

I started off by using the Ex Libris Developer Network to study the workings of the Primo API. Helpfully, they had an example application which I was able to download and play with in order to begin to understand it. I decided to try this on our test DERA repository. The mechanics were reasonably straightforward. I cloned the index.html file from /usr/share/eprints3/archives/<repoid>/html/en, called it primoindex.html, and placed it in the same folder. I placed the search part of the demo Primo API app in this page and limited it to search on the Primo scope “DERA”. I also placed the custom css and js files from the API app there.
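The core of such an app is just a search call restricted to one scope. A hedged sketch of building that request (the host, API key and parameter shapes here are illustrative placeholders based on the Developer Network examples, not our live configuration):

```python
from urllib.parse import urlencode

def primo_search_url(base, query, scope="DERA", api_key="HYPOTHETICAL_KEY"):
    """Build a Primo search request limited to one scope. The JSON
    response carries the eprints id in the metadata, so result links
    can be rewritten to point at the native repository record."""
    params = {
        "q": f"any,contains,{query}",  # Primo's field,operator,value query triple
        "scope": scope,
        "apikey": api_key,
    }
    return base + "?" + urlencode(params)

# hypothetical gateway host, for illustration only
print(primo_search_url("https://api.example.com/primo/v1/search", "teacher training"))
```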

I was able to get a working prototype running fairly quickly. It was trivial to change the links to results to those within the native EPrints repository as the eprints id was also in the metadata. This still has some way to go before it can be said to be a fully viable solution but does show what can be achieved by using the best parts of different systems in a suitable collaboration.

A librarian’s view of an archive conundrum – GTCE

8 04 2014

In 2012, the General Teaching Council for England (GTCE) was disbanded and we were given a disk drive full of documents together with a spreadsheet of associated metadata. Archivists are (for not very surprising reasons) being deluged with born digital material, much of it inaccessible because it simply can’t be appraised at such a volume by a limited number of staff. For this reason, much of it simply has to be made closed access. We wondered whether there was any way in which EPrints could help. The archive system in use (CALM) is not itself a digital repository, but is more suited to describing print collections, in the same way that many legacy library management systems are. Was it possible to use EPrints to allow some level of discovery to take place from CALM and then (where permitted) to allow for appraisal, and ultimately to direct the end user to the full text document, possibly via the request functionality, so that copyright consent could be obtained and the document released?

One of the assumptions we made at the start was that both we and the donor organisation had a shared understanding of the terminology in use here. For example, the metadata supplied by GTCE assigned each of the documents a category of “open” or “closed” access. It became apparent that the traditional archival definition of closed access (not available for access) was not what had been meant by GTCE. For example, it sometimes referred to a password protected pdf for which no password had been supplied and it was therefore inaccessible in that form, even though not necessarily unreleasable. As an aside, this refers to a digital preservation issue which was outside the scope of our project.

From the technical side, it wasn’t too hard to agree on a field mapping and ingest the metadata and documents. We used a basic Dublin Core scheme, largely because there was no time to devise anything more complex. There was an issue over reconciliation of filenames and paths which caused some problems, but once resolved it was all eminently doable.
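The mapping itself amounted to little more than a lookup from the spreadsheet's column headings to Dublin Core elements. A toy sketch (the column names here are invented, not the actual GTCE spreadsheet headings):

```python
# Hypothetical spreadsheet-column -> Dublin Core element mapping
DC_MAP = {
    "Document title": "dc.title",
    "Date created":   "dc.date",
    "Author/creator": "dc.creator",
    "Description":    "dc.description",
    "Access status":  "dc.rights",
}

def to_dublin_core(row):
    """Translate one spreadsheet row into a flat Dublin Core record,
    silently skipping columns with no mapping or no value."""
    return {DC_MAP[col]: val for col, val in row.items()
            if col in DC_MAP and val}

record = to_dublin_core({"Document title": "Annual report 2010",
                         "Access status": "open",
                         "Internal ref": "GTCE/123"})
print(record)
```

In practice the unmapped columns would want logging rather than silent dropping, but the principle is the same.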

The trouble started when we began to review the documents uploaded prior to release. It was then that the access issues described above became clear. In one case, it became apparent that EPrints’ full text search indexing was designed in a way which meant that a document would (even if not released) have been open to searching for personal names, potentially picking out those documents in which data protection issues existed. Making the document itself unavailable was therefore not sufficient protection. This highlights a limitation of using EPrints for digital archives, though to be fair that was not what EPrints was originally designed for. In the event, the project taught us some valuable lessons and was a useful and practical way of introducing ourselves to digital archives. It is a field which is still in its infancy, but there are some direct correlations with some of the work that we have been doing in libraries. For example, the creation of DERA was designed to preserve an at-risk digital collection of published official documents. The areas of overlap are ingestion – we concluded that a more controlled form of this, perhaps with some filename verification, would have made things more straightforward – and a requirement to use more digital preservation techniques. Watch this space!

New POPE has been appointed

6 03 2014

Preservation of Official Publications in Education (POPE) is our JISC-funded project which started on 1st November. This four-month project will visit all those nasty dead links in MARC21 856 fields which relate to Official Publications, and attempt to trace the digital copies and ingest them into our DERA EPrints system. We are using the Open Government Licence to do so. Now this is an example of common sense in IPR which leads to real benefit to the authors (those for whom copyright is said to exist in the first place). Whoever the authors are that contributed to educational policy development under the Labour Administration, they can rest assured that their work will be saved for posterity, allowing historical research to be conducted into that Government’s views on education in the UK.

Contrast this with the general position on IPR that UK (and sometimes EU) legislation imposes upon us. I went to a fascinating meeting at JISC the other week in which it was explained that text mining in order to extract metadata which gives alternate access points to content is probably illegal unless explicit permission is granted by the rightsholder. This sits in direct opposition to the great concepts being developed by organisations such as the Resource Discovery Taskforce, which encourage us to repurpose metadata as linked data and allow new access applications to be built by other communities. More critically for us at the moment, it stifles innovation and may harm our digital economy while other jurisdictions take a more measured view.

My vote would be that we librarians should support the British Library’s attempts to have text mining added to the list of exceptions in copyright law. Otherwise, we will remain chained to our one-point-of-view, silo-based, fragmented search systems with islands of open data here and there.

location location location?

27 09 2013

As a library which boasts (?) an in-house classification schema, we have a unique problem when it comes to users finding books. The users have little if any reference to help them understand its workings. By contrast, if you’ve come across Dewey in your public library, you have a head start. Having said that, any schema takes some getting used to, particularly if you’ve never come across one before. If your schema has quirks which are difficult to explain, the problem is compounded.

Today’s users are largely digital natives. They consume visual representations as comfortably as my generation (of librarians at any rate) consumes text. And so was born the idea of creating a floorplan system which would visually represent whereabouts in the library a book is located, in order to help the user find it. The end result looks like this:



I pursued this at home in my spare time and came across a rather fabulous piece of software called Sweet Home 3D. It is really designed for helping DIY users design a new kitchen, for example, but turned out to be very suitable for devising library floorplans. We were lucky enough to have some to-scale plans of a former incarnation of the library, and importing these as a graphic allowed building features to be drawn without recourse to measuring everything. The interface is essentially drag and drop using objects (e.g. walls, windows, doors, shelves). When complete, you delete the imported plan image and can manipulate the 2D image to provide a 3D perspective in any orientation you wish.

The next stage was working out how to link these plans to things like (multiple) classification schemas, call number ranges and collections. In order to do this, I wrote some php code. A MySQL database sits behind the system, with a table linking shelving rows to co-ordinates (of which more later).

The system depends upon an LMS with a web service API which can be called and which returns the call number and the location. Our SirsiDynix Symphony LMS has this facility, and by feeding a title id to it we can retrieve the information we need.
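Given the call number from the LMS, matching it to a shelf stack is essentially a range lookup against that MySQL table. A toy sketch with an in-memory table (the ranges and co-ordinates are invented, and a real schema's comparison rules would be more subtle than plain string ordering):

```python
# Hypothetical shelf table: (range_start, range_end, x, y) per shelving row
SHELVES = [
    ("Aab", "Bzz", 120, 80),
    ("Caa", "Fzz", 240, 80),
    ("Gaa", "Mzz", 120, 210),
]

def arrow_position(call_number):
    """Return the (x, y) map co-ordinates of the shelf stack whose
    call-number range contains this call number, or None if no
    range matches (e.g. an unmapped collection)."""
    for start, end, x, y in SHELVES:
        if start <= call_number <= end:
            return (x, y)
    return None

print(arrow_position("Dab 123"))
```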

We now return to Sweet Home 3D. When you have drawn your plans and settled upon the 3D perspective you like, you generate a png file. This sits on the floorplan server, is retrieved by the php code, and sits in a Skeleton css template – a responsive, low-maintenance framework which works on a wide range of devices. Remember those co-ordinates? They are important because they tell the system where to place the arrow. To set them, you use a graphics package to find the point on the image file where you wish the “arrow” to sit. The x and y co-ordinates need to be added to the record for that shelf stack in the database. When the user views the map, the arrow is superimposed over it:


Finally, it’s not going to be much use if you can’t display the link in the catalogue. Depending on what system you use for your front end and how your web service works, the procedure to achieve this will differ. We did a very small customisation at the item details level page in our SirsiDynix system to achieve this.

So how has it been received? As a frequent member of staff on the Enquiry Desk, I often receive favourable comments about how much easier this facility makes it to find books, so anecdotally it has been very successful. We shall have to see whether the statistical evidence bears this out at some point in the future. I suspect it will require a little judicious marketing to help us along.

Store forms

6 06 2013

We’ve been using bits of paper to record requests for material in the stores for years. Each one has to be checked by someone (who has to look up the item in the catalogue again) to ensure the user has completed it correctly. So I have done some work on our SirsiDynix Symphony system to automate this process and allow staff to do more interesting things.

The basic idea was to expose a link on categories of material which are located in store and provide a request button at item level. We also wanted it to be as user-friendly as possible. So if you click the link when in public mode, it will prompt for login and then take you back to the request form.

Amongst other things, it means requests can be created from outside (via the web) and we can start to offer a pre-ordering service.

When a request is created, it appears in the user’s account area so they can monitor progress. The requests are printed in a batch (one page per request) each day, and the “fetch” can happen as normal, with the slips being inserted into the items.

A fair amount of customisation was needed both at the public interface and within the reports. Some of it ain’t pretty but it seems to work.

The request form is rendered entirely from javascript, as that was the only way to pass the parameters from page to page. Data codes on one page are not necessarily available on another, and finding which apply where is often a case of trial and error.

For the reports, we wanted the user’s name to appear at the top of the slip (bold was not possible, so we resorted to lots of stars). To do this, we actually modify the default finished output file with a custom “reformatting” report. Again, crude, but it seems to do the job.

Next step is to load this all onto the live server!