Wednesday 8 April 2009

Less talk, less code, more data - The Preserv2 Data Registry

Yes, less talk more code (oxfordrepo.blogspot.com) is a good saying but i'm going to argue in this post that in fact we need more data! Having a ton of available services and a load of highly complex and well considered data models is all well and good but without data all of these services are useless; A repository is not a repository until it has something in it (Harnad).

If we look outside of the repository community for a minute we find the web community we are accumulating a whole ton of data, wikipedia being the main point of reference here. Yet in the repository community we are not harnessing this open linked data model to enhance our data.

I have been working in the area of digital preservation for a while now and the PRONOM file format registry (TNA UK) has been my friend for many years now and contains some valuable data. However I am concerned with the way I see it progressing. The main thing I use the PRONOM registry for is as a complement to DROID for file format information, and the data here is not even that complete. I am concerned however at the size of the new data model and the sheer effort which is going to be required to fill it with the data which it specifies.

Why not looked to the linked data web to see how to tie a series of smaller systems together to make a much more powerful and easier to maintain one!

This is where I have started with the preserv2 registry available at http://p2-registry.ecs.soton.ac.uk/.

The preserv2 registry is a semantic knowledge base (RDF triples based) with an SPARQL endpoint, RESTful services and a basic browser. Currently the data is focussed on file formats and is basically made up of the PRONOM database ported from a complex XML schema into simple RDF triples. On top of this i'm beginning to add data from dbpedia (wikipedia RDF'd) and making links between the PRONOM data and the dbpedia data!

Already this is helping is ascertain a greater knowledge base and the cost of gathering and compiling this data is very low. Other than that the registry took me less than a week to construct!

So "Go forth and make links" (Wendy Hall) is exactly what I'm now doing. With enough data you will be able to make complex OWL-S rules that can be used to deduce accurately facts such as formats which are at risk.