Monday 22 November 2010

Hot Topics in Scholarly Systems

Since I last wrote a blog post the world has been going through some harsh times in which cutbacks and simplifications have been essential. The phrase "throw money at it" no longer applies to anything, and all of a sudden organisations as well as people seem far more keen to share than before (although we are still not fully open and sharing; mostly it's organisations wanting stuff without sharing themselves, but we'll get there).

Anyway, enough of that: what is actually happening?

Well, I am very proud to be at the forefront of an international effort to hold a series of scholarly technology meetings focussed on solving institutional problems. These meetings, known as the Scholarly Information Technical Summit (SITS) meetings, are being held alongside many international conferences over the next two years and are being backed by all the major international funding bodies. See http://bit.ly/Scholarly_Infrastructure_Technical_Summit for more info.

There have now been two meetings, although SITS only came about because the first one was so successful. Each meeting follows the Open Agenda principle (see Wikipedia) and is chaired likewise. This makes the agenda very pertinent to the people in the room and often creates conversation critical to the forward momentum of some of the technologies discussed. In the next few paragraphs I'm going to try to summarise the hot topics from the first meeting:

SWORD - Put stuff in a repository

SWORD has undoubtedly been a huge success: it's simple and well supported by many publishers and publishing software (including, most notably, the Microsoft Office suite via the author add-in tool http://research.microsoft.com/authoring). There are however some problems which the community wants to address without making it more complex:
  • Packaging Formats - What exactly do you submit in your SWORD bundle, and how should it be formed? There was no clear consensus, other than that endpoints should try to support a multitude of formats depending on their users.
  • Endpoints are hard to find, for both users and software. This could be addressed either via negotiation or via meta tags of some sort.
  • URIs in the returned package are not well specified: it is unclear what they mean or what they should mean.
  • SWORD is not a complete CRUD (create, retrieve, update, delete) model.
  • There are no levels of compliance any more.
  • SWORD uses basic auth (too basic?)
The general call was that these points need addressing without making SWORD, which is meant to be SIMPLE (that's what the S stands for), too complex. CRUD looks interesting.
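
To make the "put stuff in a repository" step concrete, here is a minimal sketch of what a deposit request might look like. It only builds the HTTP POST and never sends it. The header names reflect my understanding of the SWORD 1.3 profile and should be checked against it; the collection URL, packaging identifier and credentials are all made up for illustration.

```python
import base64
import urllib.request


def build_sword_deposit(collection_url, package_bytes, filename,
                        packaging, username, password):
    """Build (but do not send) an HTTP POST that deposits one package."""
    # Basic auth is just "user:password" base64-encoded, which is why the
    # meeting asked whether it is "too basic".
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    request = urllib.request.Request(
        collection_url, data=package_bytes, method="POST")
    request.add_header("Content-Type", "application/zip")
    request.add_header("Content-Disposition", f"filename={filename}")
    # Tells the endpoint how the bundle is formed (assumed header name).
    request.add_header("X-Packaging", packaging)
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request


# Hypothetical endpoint and packaging identifier, for illustration only.
example = build_sword_deposit(
    "http://repository.example.org/sword/deposit/articles",
    b"PK...zip bytes would go here...",
    "paper.zip",
    "http://example.org/packaging/simple-zip",
    "alice",
    "secret",
)
for name, value in example.header_items():
    print(f"{name}: {value}")
```

Note how much the client has to know in advance: the endpoint URL and a packaging format the endpoint accepts. Those are exactly the first two problems in the list above.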

OUTCOME: A follow-on SWORD project has been funded by JISC (UK) along with a number of complementary (but separate) projects including DepositMO (http://blogs.ecs.soton.ac.uk/depositmo) and SONEX (http://sonexworkgroup.blogspot.com/).

Personally I'm involved in DepositMO, which intends to use SWORD (+CRUD) at its core and extend this even further (outside of SWORD) to be fully interactive with users. More on the levels of conformance can be found via the DepositMO blog (http://blogs.ecs.soton.ac.uk/depositmo).

Package guidelines are to be set out by the new SWORD project along with tight definitions on what URIs mean and what it means to CRUD those URIs.

Being written into both projects, I hope to bring not only technical knowledge to the table but also real-world use cases.

There was also a call to look into technologies like OAuth and its uses in SWORD; however, this was a minor part of what became a major conversation at the second meeting.

Inverse SWORD

This conversation started on workflows, with a discussion of the opportunities for common workflows and their impact. The problem is that workflows tend to be very specific and quite heavyweight in their approach to a problem, often constrained by the domain. This is the advantage of SWORD: it doesn't specify a workflow, just a technique for transferring stuff. So what about a reverse SWORD, where you request a URI and state the packaging format you want?

This basically reinforced the earlier conversation on what it would mean for SWORD endpoints to support full CRUD, using content negotiation to agree on packaging formats. Clearly something to take forward... as it was!
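
As a rough illustration of content negotiation over packaging formats, the sketch below picks the best packaging format that a client asks for and an endpoint supports. It borrows the weighted "q=" syntax of the HTTP Accept header; the format names are hypothetical short tokens, and nothing here is taken from a SWORD specification.

```python
def parse_preferences(header_value):
    """Parse 'mets-zip;q=0.5, simple-zip' into (format, weight) pairs."""
    preferences = []
    for part in header_value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, params = part.partition(";")
        weight = 1.0  # no q parameter means full preference
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key == "q":
                try:
                    weight = float(value)
                except ValueError:
                    weight = 0.0  # unreadable weight: treat as unwanted
        preferences.append((name.strip(), weight))
    return preferences


def choose_packaging(header_value, supported):
    """Return the supported format the client wants most, or None."""
    best, best_weight = None, 0.0
    for name, weight in parse_preferences(header_value):
        if name in supported and weight > best_weight:
            best, best_weight = name, weight
    return best


# Hypothetical format names, for illustration only.
endpoint_formats = {"mets-zip", "simple-zip"}
print(choose_packaging("bagit;q=0.9, mets-zip;q=0.5", endpoint_formats))
```

When nothing matches, the function returns None, which an endpoint could turn into an error response rather than guessing a format.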

Storage for Digital Repositories

The question was (not from me): what is there beyond Akubra (now DuraCloud) and my two projects (one of which has been finished)?

It is clear that there is now a whole range of storage options and technologies, with a seemingly endless number of APIs. Luckily, many of the cloud providers use the S3 API (which is good!). So what rule languages are there for expressing where things should be stored?

I briefly explained the EPrints implementation (labelled as mine but it isn't, it's EPrints property), which uses lightweight plug-ins to communicate with each service. These plug-ins implement 4 API calls (Store, Retrieve, Delete and one other necessary call that I won't bother explaining here). There is then an XML/XSLT-based policy file which dictates which plug-ins are used to store what. Each file is then stored, and its metadata adjusted to state where it is stored in case the policy changes. Upon a policy change, the files can be re-arranged to their correct locations again. This can also handle changes in storage architecture and whole services being taken offline. The advantage of this approach, which the community likes, is that you can use any number of storage solutions simultaneously and store as many copies of files on different ones as you like. For more see http://eprints.ecs.soton.ac.uk/17084/.

The actions from this were that others were going to look at this implementation to see if this rule-based language could apply on other repository platforms. Further, it would be nice to have some good reference architectures available from vendors.
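
The idea of plug-ins plus a policy can be sketched in a few lines. This is not the EPrints code: the policy here is a plain Python function rather than an XML/XSLT file, the plug-ins only keep data in memory, only the three named calls are shown, and every class, function and plug-in name is invented for illustration.

```python
class MemoryStore:
    """Stand-in plug-in; a real one would talk to a disk or cloud service."""

    def __init__(self, name):
        self.name = name
        self.files = {}

    def store(self, file_id, data):
        self.files[file_id] = data

    def retrieve(self, file_id):
        return self.files[file_id]

    def delete(self, file_id):
        self.files.pop(file_id, None)


class StorageController:
    """Stores files wherever the policy says and remembers where they went."""

    def __init__(self, plugins, policy):
        self.plugins = {plugin.name: plugin for plugin in plugins}
        self.policy = policy    # function: metadata -> list of plug-in names
        self.metadata = {}      # file_id -> metadata used by the policy
        self.locations = {}     # file_id -> set of plug-in names holding it

    def store(self, file_id, data, metadata):
        targets = set(self.policy(metadata))
        for name in targets:
            self.plugins[name].store(file_id, data)
        self.metadata[file_id] = metadata
        self.locations[file_id] = targets

    def retrieve(self, file_id):
        name = sorted(self.locations[file_id])[0]  # any recorded copy will do
        return self.plugins[name].retrieve(file_id)

    def apply_policy(self, new_policy):
        """Re-arrange every stored file to match a changed policy."""
        self.policy = new_policy
        for file_id, current in list(self.locations.items()):
            wanted = set(new_policy(self.metadata[file_id]))
            data = self.retrieve(file_id)  # read before removing any copy
            for name in wanted - current:
                self.plugins[name].store(file_id, data)
            for name in current - wanted:
                self.plugins[name].delete(file_id)
            self.locations[file_id] = wanted


def two_copies(metadata):
    return ["local", "cloud"]


def archive_only(metadata):
    return ["archive"]


controller = StorageController(
    [MemoryStore("local"), MemoryStore("cloud"), MemoryStore("archive")],
    two_copies)
controller.store("doc1", b"thesis", {"mime_type": "application/pdf"})
controller.apply_policy(archive_only)
print(sorted(controller.locations["doc1"]))
```

Because the controller records where each copy lives, taking a whole service offline is just another policy change: write a policy that no longer names it and re-apply.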

Services and Configuration Languages (was Common Platforms/Tools on the day)

This was an interesting conversation which started around the idea of being able to re-use technologies by re-using/calling code libraries directly. The problem here (as I see it) is the number of coding environments, and versions of these environments, available.

The solution is REST (not SOAP) APIs on the web and abstraction APIs in the code (e.g. SOLR) which enable you to call functions from (say) the command line, without having to understand the code.

David Flanders perhaps summed it up best. There are levels of interaction, some easier than others:
  • Core System (hard)
  • Exposing structured data
  • End user interfaces (including APIs)
XML for configuration is a bit of a sticking point with users, but you need a machine-readable language to configure the machine. Perhaps the point here is to use XML only if you need it; otherwise simple config files with "=" signs in them are fine.
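
For what it's worth, reading a simple "=" style config file takes very little code. The sketch below is one way to do it; treating lines that start with "#" as comments is my own assumption, and the setting names in the example are invented.

```python
def parse_simple_config(text):
    """Turn lines of 'key = value' into a dictionary of strings."""
    settings = {}
    for line_number, raw_line in enumerate(text.splitlines(), start=1):
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, separator, value = line.partition("=")
        if not separator or not key.strip():
            raise ValueError(
                f"line {line_number}: expected 'key = value', got {raw_line!r}")
        settings[key.strip()] = value.strip()
    return settings


# Hypothetical settings, for illustration only.
example_config = """
# where the repository lives
base_url = http://repository.example.org
max_upload_mb = 50
"""
print(parse_simple_config(example_config))
```

Only the first "=" on a line splits key from value, so values may themselves contain "=" signs. Anything nested or repeated is where a format like XML starts to earn its keep.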

There is no real answer to this question other than try and keep it simple... stupid.

Author IDs (URIs)

Yes it's our favourite topic raising its ugly head again!

It is clear that there are many efforts in this area, none of which has fully succeeded yet. There is still much interest in this area, however, and it is clear that we should be prepared to handle multiple IDs for a single author and be able to align them (if allowed) at a later stage.

Currently the project to watch is ORCID, which is a continuation of a previous project by Thomson Reuters (one that did not succeed commercially).

The consensus, however, was that we are not wrong to mint URIs for our authors in our repositories.
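
Handling multiple IDs now and aligning them later can be sketched with a small union-find structure, in which each identifier starts on its own and aligned identifiers end up in one group. The class name and the example identifiers are hypothetical, and this is only one of several ways to model it.

```python
class AuthorIdentifiers:
    """Tracks which identifiers are known to refer to the same author."""

    def __init__(self):
        self._parent = {}  # identifier -> another identifier in its group

    def add(self, identifier):
        self._parent.setdefault(identifier, identifier)

    def _find(self, identifier):
        """Return the representative identifier of the group."""
        self.add(identifier)
        root = identifier
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point everything on the way straight at the root.
        while self._parent[identifier] != root:
            next_identifier = self._parent[identifier]
            self._parent[identifier] = root
            identifier = next_identifier
        return root

    def align(self, first, second):
        """Record that two identifiers refer to the same author."""
        first_root, second_root = self._find(first), self._find(second)
        if first_root != second_root:
            self._parent[second_root] = first_root

    def same_author(self, first, second):
        return self._find(first) == self._find(second)

    def aliases(self, identifier):
        """Return every identifier aligned with the given one, sorted."""
        root = self._find(identifier)
        return sorted(i for i in list(self._parent) if self._find(i) == root)


# Hypothetical identifiers, for illustration only.
ids = AuthorIdentifiers()
ids.align("http://repository.example.org/id/person/42",
          "http://ids.example.net/author/abc")
print(ids.aliases("http://ids.example.net/author/abc"))
```

A URI minted locally stays valid throughout; aligning it with an external ID later only adds a link between the two.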

Conclusions

Identification/Authorisation is a problem. Can technologies like OAuth help not only with authorisation but also with identification? This could be a very interesting area.

SWORD being taken forward is a very positive outcome of the first SITS meeting.

Simple services with simple APIs are so much more effective than "project centric" solutions and bloatware.

Simple services are usable by lots of people!