Daves thoughts on stuff: 2009

Monday 5 October 2009

#iPres09: Are You Ready? Assessing Whether Organizations are Prepared for Digital Preservation.

Planets (mainly Tessella) ran an online survey targeted at libraries and archives in Europe and then opened it up to 2000 others. They got 206 responses (70% European): 1/3 libraries, 1/4 archives. Roles: 15% DP, 16% preservation in general, 22% curation, 16% IT, the rest other.

Being a DP survey, the majority said they were aware of DP issues. 1/6 haven't really thought that much about it and 1/2 don't have a DP policy. Interesting stat: if you have a DP policy in place you are 3 times more likely to have a budget in place. However, the majority of those budgets are project budgets and not institutional budgets to actively perform DP in the organisation.

Also in the survey people are having to preserve all sorts of resources including databases.

Do people have control over the formats they can accept? National archives - Yes, National Archives (No). Others are pretty balanced between yes and no.

Amount of content: most say that they have less than 100TB of content to preserve, but people see it expanding to much larger quantities [by which time it won't be any more scary to preserve than it is now, we'll soon have 1TB memory keys - heard it here first].

85% of people are trying to solve the problem and are looking to use plug-and-play components as people don't want to replace their current systems.

The most important thing in preservation is being able to maintain authenticity [possibly a European stance]. Emulation is less important. In the middle of the importance graph was the use of metadata standards, but no one can decide what standards to use.

[Good survey, probably worth looking up and answering it yourself, especially if you are involved in a DP project! How do the aims of your project align with the view of the wider community?]

#iPres09: An Emergent Micro-Services Approach to Digital Curation Infrastructure.

More stuff, smaller budget, more technology [good start]. We can gain, though, with digital tech through redundancy, meaning (through context), utility (through service) and value (through use) [totally agree with the last point esp. wrt OA publishing].

Project imperatives: do more with less, and focus on the content, not the systems in which the content is managed [agree, focus on the data, not the software which people care less about]. This leads to the goal of the presentation, which is micro-services: small services which provide maximum gain when they can be looped together, replaced, re-written etc.
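A rough sketch of what "small services that can be looped together or replaced" might look like in code. This is only an illustration: the service names (`store`, `characterise`, `index`) and the record layout are assumptions made up for the example, not the project's actual services or API.

```python
from typing import Callable, Dict, List

# A record is just a dict; each micro-service takes one and returns an enriched copy.
Record = Dict[str, object]
Service = Callable[[Record], Record]


def store(record: Record) -> Record:
    """Hypothetical storage service: notes where the bytes would live."""
    return {**record, "stored_at": f"store/{record['name']}"}


def characterise(record: Record) -> Record:
    """Hypothetical characterisation service: guesses a format from the extension."""
    name = str(record["name"])
    extension = name.rsplit(".", 1)[-1].lower() if "." in name else "unknown"
    return {**record, "format": extension}


def index(record: Record) -> Record:
    """Hypothetical index service: builds simple search terms."""
    terms = {str(record["name"]).lower(), str(record.get("format", ""))}
    return {**record, "terms": sorted(terms)}


def run_pipeline(record: Record, services: List[Service]) -> Record:
    """Chain services in order; any one can be swapped without touching the rest."""
    for service in services:
        record = service(record)
    return record


result = run_pipeline({"name": "Report.PDF"}, [store, characterise, index])
print(result)
```

Because every service has the same shape (record in, record out), replacing `characterise` with a better tool only means passing a different function in the list.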

There are currently 12 in 4 layers - Storage, Characterisation/Inventory, Service (index, search, migration), Value (Usage) - [Good slide this] "LOCKSS, Lots of description keeps stuff meaningful, Lots of services keeps stuff useful, Lots of uses keeps stuff valuable".

Now describing some of the technologies which are being used. Simple RESTful APIs to storage [however he hasn't mentioned the high-level software they are using, even though the first slide in the presentation said to use proven technology]. Emphasizing that storage has moved on (as it always will) and you can gain a lot of benefit through specialist products.

Storage is the main phase they have done and they have a few other micro-services done. [Like the aim of it all; however there are already lots of tools to provide each one, it is just keeping them apart which is hard]

#iPres09: e-Infrastructure and digital preservation: challenges and outlook

e-infrastructure: Starts by defining infrastructure (see Wikipedia) and e-infrastructure specific to a collection of European digital repositories. So basically we are looking at opportunities to build and supply services which are applicable to these repositories.

Background: the EU is supplying lots of support for this and in Germany they are researching national approaches, identifying activities and assigning tasks to "expert" institutions. By introducing the current project fields he is outlining that there is still a significant mismatch between the scale of the problem and the amount of effort being expended. From this he outlines that there is a significant lack of common approaches to solving problems. [I don't think this will ever go away, unless there is a mandate, and even then not everyone will want to sign up].

[Lots of argument] Funding is focused on many individual projects and thus doubles up the argument that there are no commons. This leads to a slide about interoperability and standards and the lack of them. [Which again, I don't think will ever go away, and I think that we should be appreciative that people tend to pick XML to encode their data in; this makes it interoperable, right?].

[This is a start of project presentation, I don't seem to see that much output. They have some simple models as diagrams, again though at this stage it is hard to see how they are not just another project which will come up with (another) set of standards which no one will then want to adopt.]

Giving a set of examples now where they are going to re-use and extend existing software/projects. The goals are good, in terms of concrete steps for global infrastructure for registries, data formats, software deposits and risk management. [Just not sure how achievable all this is based upon the fact it has been the aim of many projects already]

Tuesday 8 September 2009

Thoughts on digitization, data deluge and linking

It's been a while since I've put a post up and this is probably due to being busy and also trying to tidy up a lot of stuff before starting on new projects.

In this post then: Digitisation

I never really gathered how big the area of digitisation is and how many non-repository people are actively involved in it. There are a great many projects (more than 50) digitising resources, and these include national libraries. Items being digitised include everything from postcards and newspapers to full books and old journals.

So what's the problem here ... simple ... how many people are digitising the same things? Yes, I know that there is so much out there that this is unlikely to be the case; however, it brings me nicely to the problem of information overload. There is already more valuable information on the internet than we can possibly handle effectively, so how do you ensure that any resources you digitise for open access usage on the web can be found and used?

I don't normally say this but perhaps we should look at physical libraries for the answer. Libraries are a very good central point where you can find publications related to all subject areas, and if your local library does not have a copy then it will try and find a copy somewhere else.

How then does this map onto the web? Web sites become the library and links become the references to additional items or items this site does not contain, simple right? Unfortunately, with the 50+ projects I can count already, this leads to 50+ different web sites all with differing information presented in different ways. Because the presentation of each web site is totally different, they are not in fact a library (libraries pride themselves on a standard way of organising resources); instead, web sites become books. So to find resources we have to rely on search engines and federation, and we are back where we started with a problem of information overload.

Unfortunately I don't have an answer to this problem; however, I do know that links hold the key to the solution. Each website at the moment is simply an island of information, and what is desperately required is for the authors and community to establish links to these resources. If digitisation houses are curating refereed resources then the simplest way to link to these would be to put information about them on Wikipedia.

This would be my final point then: Wikipedia is actually a good thing, simply because of the community aspect. However it also provides many other huge benefits:

External resources such as photos have to have a licence

In annotating a page/item you create links and establish facts which are available via semantic Wikipedia (DBpedia)

Wikipedia is an easy way to establish your presence on the linked data web (linkeddata.org)

So if you are digitising books by an author, add this link to their Wikipedia page. If you are digitising a collection of World War images, add links to some of these to Wikipedia and Flickr.

Establish links and help yourself to help everyone else.

Thursday 7 May 2009

File Format Risk Analysis! How hard can it be?

The trouble with this is the sheer number of different formats out there (PDF, DOC, DOCX, ...) and the number of versions of each, every one of which has its own problems.

Enter the P2 data registry, which now contains over 40,000 facts relating solely to format types. I have been working on this for a while now and a full list of services can be found on the registry homepage. Although the SPARQL endpoint is a great place to start, it doesn't help you much with risk analysis.

The first thing is to determine which properties of a format or format super-type (PDF 1.6 has the super-type PDF, for example) are important when analyzing risk. Do you consider age, number of software tools, quality of documentation etc? And how do you add weightings to all of these properties as a factor of overall risk?

Using the registry as a starting point I have drawn up a profile involving these properties, as well as many others, each of which returns a high, medium or low risk level. These are then summed and normalised to give a single risk score.
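To make the scoring step concrete, here is a minimal sketch of how high/medium/low levels could be weighted, summed and normalised into one score. The property names, the weights and the 0-to-1 scale are all assumptions for illustration; they are not the values used by the registry's risk profile.

```python
# Numeric value for each risk level (an assumed mapping).
LEVEL_VALUES = {"low": 0, "medium": 1, "high": 2}

# Hypothetical weights: how much each property counts towards overall risk.
WEIGHTS = {
    "age": 1.0,
    "software_tools": 2.0,
    "documentation_quality": 1.5,
}


def risk_score(levels: dict) -> float:
    """Return a weighted risk score normalised to the range 0.0 (low) to 1.0 (high)."""
    max_level = max(LEVEL_VALUES.values())
    # Weighted sum of the levels actually reported for this format.
    total = sum(WEIGHTS[prop] * LEVEL_VALUES[level] for prop, level in levels.items())
    # The same sum if every property were at the highest level.
    worst_case = sum(WEIGHTS[prop] * max_level for prop in levels)
    return total / worst_case if worst_case else 0.0


example = {"age": "high", "software_tools": "low", "documentation_quality": "medium"}
print(round(risk_score(example), 3))
```

Dividing by the worst case keeps scores comparable between formats even when some properties are missing for one of them.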

Enough of me talking: The live service (PDF 1.3) can be found here and here for PDF 1.4.

Wednesday 8 April 2009

Less talk, less code, more data - The Preserv2 Data Registry

Yes, less talk more code (oxfordrepo.blogspot.com) is a good saying, but I'm going to argue in this post that in fact we need more data! Having a ton of available services and a load of highly complex and well considered data models is all well and good, but without data all of these services are useless: a repository is not a repository until it has something in it (Harnad).

If we look outside of the repository community for a minute we find that the web community is accumulating a whole ton of data, Wikipedia being the main point of reference here. Yet in the repository community we are not harnessing this open linked data model to enhance our data.

I have been working in the area of digital preservation for a while, and the PRONOM file format registry (TNA UK) has been my friend for many years and contains some valuable data. However, I am concerned with the way I see it progressing. The main thing I use the PRONOM registry for is as a complement to DROID for file format information, and the data here is not even that complete. I am also concerned at the size of the new data model and the sheer effort which is going to be required to fill it with the data which it specifies.

Why not look to the linked data web to see how to tie a series of smaller systems together to make a much more powerful and easier to maintain one?

This is where I have started with the Preserv2 registry, available at http://p2-registry.ecs.soton.ac.uk/.

The Preserv2 registry is a semantic knowledge base (RDF triples based) with a SPARQL endpoint, RESTful services and a basic browser. Currently the data is focussed on file formats and is basically made up of the PRONOM database ported from a complex XML schema into simple RDF triples. On top of this I'm beginning to add data from DBpedia (Wikipedia RDF'd) and making links between the PRONOM data and the DBpedia data!
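For anyone new to triples, here is a small plain-Python sketch of the idea of linking a format record to a DBpedia resource and then pulling in facts from both sides. Everything specific is an assumption for illustration: the `example.org` identifiers and the `developer` predicate are made up, the DBpedia URI is only indicative, and none of it is taken from the real registry data. Only the `owl:sameAs` and `rdfs:label` URIs are the standard W3C ones.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical identifiers: neither is taken from the real registry data.
FORMAT_PDF = "http://example.org/format/pdf"
DBPEDIA_PDF = "http://dbpedia.org/resource/PDF"

# Each fact is a (subject, predicate, object) triple.
triples = {
    (FORMAT_PDF, RDFS_LABEL, "Portable Document Format"),
    (FORMAT_PDF, OWL_SAME_AS, DBPEDIA_PDF),
    (DBPEDIA_PDF, "http://example.org/vocab/developer", "Adobe Systems"),
}


def match(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


def facts_including_links(subject):
    """Collect facts about a subject plus facts about anything it links to via owl:sameAs."""
    facts = match(subject=subject)
    for _, _, linked in match(subject=subject, predicate=OWL_SAME_AS):
        facts.extend(match(subject=linked))
    return facts


for fact in facts_including_links(FORMAT_PDF):
    print(fact)
```

The single `owl:sameAs` link is what lets the third fact, which lives on the DBpedia side, show up when asking about the registry's own format record.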

Already this is helping us establish a greater knowledge base, and the cost of gathering and compiling this data is very low. Other than that, the registry took me less than a week to construct!

So "Go forth and make links" (Wendy Hall) is exactly what I'm now doing. With enough data you will be able to make complex OWL-S rules that can be used to accurately deduce facts, such as which formats are at risk.

Wednesday 21 January 2009

EPrints 3.2 - Amazon S3/Cloudfront Plug-in

A quick post to say that we have just successfully tested an EPrints 3.2 (svn) install with the new Storage Controller plugged into Amazon S3!

This has quite a lot of implications for both EPrints and other projects wanting to provide external services which operate on objects in a repository. We hope to bring people more news on this at the upcoming Open Repositories 2009 conference in Atlanta.

For more information on all this, check out the storage section on the Preserv2 website @ www.preserv.org.uk.
