Monday, 14 March 2011

Preservation Tools - Moving Forward

Over the last few years, JISC and other bodies have funded a number of digital preservation projects which have resulted in some really valuable contributions to the area... now is the time to realise the benefits of this work and bring a digital preservation experience to everyday users.

To achieve this, a not insignificant amount of work needs to be undertaken: key applications need to be identified and separated from the complex systems into which they have been built. Alternatively, many applications now need re-thinking, with their best bits built into the systems which have superseded them.

File Format Identification Tools

File format identification now has a number of tools available, each with its own advantages and disadvantages. In no particular order they are:

DROID:
  • Started out as a tool to identify file types and versions of those types. :)
  • Each file version was assigned an identifier which could be referenced and re-used. :)
  • Identification of files was done via "signature" matching, not extension matching. :)
  • Became complex as it was adjusted to suit workflows and to provide much more detailed information which few people understand or want :(
  • The added complexity increased the time required to classify each file; it is no longer a simple tool :(
FIDO:
  • A new cut down client which takes the DROID signature files and does the simple stuff again :)
FILE:
  • A built in Unix tool installed on every Unix based system in the world already! :)
  • Does not do version type identification :(
  • Does not provide a mime-type URI :(
  • Very quick to run :)
  • Has the capacity to add version type identification and there is a TODO in the code for it! :)

With the PRONOM registry now looking at providing URIs for file versions, why can't we stop coding new tools and change the FILE library instead? That way it could handle the version information and feed back the URIs if people want them. I've looked briefly into this, and the PRONOM signatures should be easy to transport and use with the file tool.
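
To make the idea concrete, here is a rough sketch (in Python, purely illustrative) of wrapping the file tool and mapping its MIME-type output onto a PRONOM-style URI. The MIME-to-PUID table is a made-up stand-in for a real signature/registry lookup, the PUID values are examples only, and the URI pattern is an assumption about how PRONOM identifiers might be exposed; file itself already does the signature (magic byte) matching rather than extension matching.

# Sketch only: wrap the Unix `file` tool and map its MIME type onto a
# PRONOM-style URI. The MIME-to-PUID table is a hypothetical stand-in for a
# real registry lookup, and the PUID values are examples, not authoritative.
import subprocess
import sys

MIME_TO_PUID = {
    "application/pdf": "fmt/276",   # example mapping only
    "image/png": "fmt/13",          # example mapping only
}

def identify(path):
    # `file --mime-type -b` prints just the MIME type, e.g. "application/pdf"
    mime = subprocess.check_output(
        ["file", "--mime-type", "-b", path], text=True
    ).strip()
    puid = MIME_TO_PUID.get(mime)
    uri = ("http://www.nationalarchives.gov.uk/PRONOM/" + puid) if puid else None
    return mime, uri

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(path, *identify(path))

Doing the same inside the FILE library itself, so it reports version-specific identifiers natively, is the real change I'm suggesting above; the wrapper just shows the shape of the mapping.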

If I get time I might well have a go at this and feed it back to the community.


Friday, 4 March 2011

Installing Kinect on Ubuntu (A full guide)

1) sudo apt-get install libglut3-dev build-essential libusb-1.0-0-dev git-core

2) mkdir ~/kinect && cd ~/kinect

3) git clone https://github.com/OpenNI/OpenNI.git

4) cd OpenNI/Platform/Linux-x86/Build

5) make && sudo make install

6) cd ~/kinect/

7) git clone https://github.com/boilerbots/Sensor.git

8) cd Sensor

9) git checkout kinect

10) cd Platform/Linux-x86/Build

11) make && sudo make install

12) Go to the NITE download page on the OpenNI site and download the latest NITE release for your platform (32-bit or 64-bit).

13) Save the NITE tarball to ~/kinect and untar it

14) cd ~/kinect/NITE/Nite-1.3.0.17/Data

15) Open Sample-User.xml, Sample-Scene.xml and Sample-Tracking.xml and replace the existing License line with the line below:
NOTE: this is case sensitive!

<License vendor="PrimeSense" key="0KOIk2JeIBYClPWVnMoRKn5cdY4="/>

16) Repeat step 15, replacing the existing MapOutputMode line with the line below in all 3 files.

<MapOutputMode xres="640" yres="480" fps="30"/>

19) sudo niLicense PrimeSense 0KOIk2JeIBYClPWVnMoRKn5cdY4=

20) cd ~/kinect/NITE/Nite-1.3.0.17/

21) sudo ./install.bash

22) make && sudo make install

23) cd ~/kinect/NITE/Nite-1.3.0.17/Samples/Bin

24) sudo adduser YOURNAME video

25) sudo nano /usr/etc/primesense/XnVHandGenerator/Nite.ini and uncomment the two config parameters it contains

26) sudo nano /etc/udev/rules.d/51-kinect.rules

# ATTR{product}=="Xbox NUI Motor"
SUBSYSTEM=="usb", ATTR{idVendor}=="045e", ATTR{idProduct}=="02b0", MODE="0666"
# ATTR{product}=="Xbox NUI Audio"
SUBSYSTEM=="usb", ATTR{idVendor}=="045e", ATTR{idProduct}=="02ad", MODE="0666"
# ATTR{product}=="Xbox NUI Camera"
SUBSYSTEM=="usb", ATTR{idVendor}=="045e", ATTR{idProduct}=="02ae", MODE="0666"

27) sudo /etc/init.d/udev restart

28) cd ~/kinect/NITE/Nite-1.3.0.17/Samples/Bin/

29) ./Sample-PointViewer and PLAY

Monday, 22 November 2010

Hot Topics in Scholarly Systems

Since I last wrote a blog post, the world has been going through some harsh times in which cutbacks and simplifications have been essential. The phrase "throw money at it" no longer applies to anything, and all of a sudden organisations as well as people seem far keener to share than before (although we are still not fully open and sharing; mostly it's organisations wanting stuff without sharing themselves, but we'll get there).

Anyway enough of that, what is actually happening?

Well, I am very proud to be at the forefront of an international effort to hold a series of scholarly technology meetings focussed on solving institutional problems. These meetings, known as the Scholarly Infrastructure Technical Summit (SITS) meetings, are being held alongside many international conferences over the next 2 years and are being backed by all the major international funding bodies. See http://bit.ly/Scholarly_Infrastructure_Technical_Summit for more info.

There have now been 2 meetings, although SITS only came about because the first one was so successful. Each meeting conforms to the Open Agenda principle (see Wikipedia) and is chaired likewise. This leads to the agenda being very pertinent to the people in the room and often creates conversation critical to the forward momentum of some of the technologies discussed. In the next few paragraphs I'm going to try and summarise the hot topics from the first meeting:

SWORD - Put stuff in a repository

SWORD has undoubtedly been a huge success: it's simple and well supported by many publishers and publishing software (most notably the Microsoft Office suite via the authoring add-in tool, http://research.microsoft.com/authoring). There are, however, some problems which the community wants to address without making it more complex:
  • Packaging formats - what exactly do you submit in your SWORD bundle, and how should it be formed? There was no clear consensus, other than that endpoints should try to support a multitude of formats depending on their users.
  • Endpoints are hard to find, for both users and software; this could be addressed either via negotiation or via meta tags of some sort.
  • URIs in the returned package are not well specified; it is unclear what they mean or what they should mean.
  • Not a complete CRUD model
  • No levels of compliance any more
  • SWORD uses basic auth (too basic?)
The general call was that these points need addressing without making something SIMPLE (that's what the S in SWORD stands for) too complex. Full CRUD support looks interesting.
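
For anyone who hasn't seen a SWORD deposit, here is a rough sketch (Python; the endpoint URL, credentials and package file are invented) of how simple the happy path is today: a single HTTP POST of a package with basic auth, with the X-Packaging header naming the package format - which is exactly where the packaging and basic-auth questions above come from.

# Sketch of a SWORD 1.3-style deposit. The collection URL, credentials and
# package file are invented for illustration; headers follow the SWORD 1.3
# convention of naming the packaging format with X-Packaging.
import requests

COLLECTION = "https://repository.example.org/sword/deposit/archive"  # hypothetical

with open("article-package.zip", "rb") as pkg:           # your package file
    resp = requests.post(
        COLLECTION,
        data=pkg,
        headers={
            "Content-Type": "application/zip",
            "Content-Disposition": "filename=article-package.zip",
            "X-Packaging": "http://purl.org/net/sword-types/METSDSpaceSIP",
        },
        auth=("depositor", "secret"),                     # basic auth, as noted above
    )

print(resp.status_code)   # expect 201 Created
print(resp.text)          # Atom entry describing the deposit, including its URIs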

OUTCOME: A follow-on SWORD project has been funded by JISC (UK), along with a number of complementary (but separate) projects including DepositMO (http://blogs.ecs.soton.ac.uk/depositmo) and SONEX (http://sonexworkgroup.blogspot.com/).

Personally I'm involved in DepositMO, which intends to use SWORD (+CRUD) at its core and extend this even further (outside of SWORD) to be fully interactive with users. More can be found on the levels of conformance via the DepositMO blog (http://blogs.ecs.soton.ac.uk/depositmo).

Package guidelines are to be set out by the new SWORD project along with tight definitions on what URIs mean and what it means to CRUD those URIs.

Being written into both projects, I hope to bring not only technical knowledge to the table but also real-world usage.

There was also a call to look into technologies like OAuth and their use in SWORD; however, this was a minor part of what became a major conversation at the second meeting.

Inverse SWORD

This conversation started on workflows and a discussion of the opportunities for common workflows and their impact. The problem is that workflows tend to be very specific and quite heavyweight in their approach to a problem, often constrained by the domain. This is the advantage of SWORD: it doesn't specify a workflow, just a technique for transferring stuff. So what about a reverse SWORD, where you request a URI along with the packaging format you want?

This basically reinforced the conversation about what it would mean to have SWORD endpoints supporting full CRUD and using content negotiation to agree on packaging formats. Clearly something to take forward... as it was!
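
Nothing like this is specified anywhere yet, but as a thought experiment the reverse direction might look something like the sketch below: GET the item's URI and use content negotiation to ask for the whole object as a package. The item URI and the packaging parameter on the Accept header are entirely made up.

# Thought-experiment only: "reverse SWORD" as a GET with content negotiation.
# The item URI and the packaging parameter on the Accept header are invented.
import requests

ITEM = "https://repository.example.org/id/eprint/1234"   # hypothetical item URI

resp = requests.get(
    ITEM,
    headers={"Accept": "application/zip; packaging=http://purl.org/net/sword-types/METSDSpaceSIP"},
)

if resp.status_code == 200:
    with open("eprint-1234.zip", "wb") as out:
        out.write(resp.content)
else:
    print("Endpoint could not negotiate that packaging:", resp.status_code)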

Storage for Digital Repositories

The question (not from me) was: what is there beyond Akubra (now DuraCloud) and my two projects (one of which has finished)?

It is clear that there is now a whole range of storage options and technologies with seemingly infinite numbers of APIs; luckily many of the cloud providers use the S3 API (which is good!). So what rule languages are there for expressing where things should be stored?

I briefly explained the EPrints implementation (labelled as mine, but it isn't - it's EPrints property), which uses lightweight plug-ins to communicate with each service. These plug-ins implement 4 API calls (Store, Retrieve, Delete and one other necessary call I won't bother explaining here). There is then an XML/XSLT-based policy file which dictates which plug-ins are used to store what. Each file is then stored and its metadata adjusted to state where it is stored, in case the policy changes. Upon a policy change, the files can be re-arranged to their correct locations again. This can also handle changes in storage architecture and whole services being taken offline. The advantage of this approach, which the community likes, is that you can use any number of storage solutions simultaneously and store as many copies of files on different ones as you like. For more see http://eprints.ecs.soton.ac.uk/17084/.
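
To give a feel for the shape of it (the real EPrints code is Perl and its API names differ), here is a cut-down Python sketch: each plug-in exposes a tiny store/retrieve/delete interface, and a policy, standing in for the XML/XSLT policy file, decides which plug-ins hold copies of which files. Everything here is illustrative, not the actual EPrints API.

# Simplified, illustrative sketch of the plug-in + policy idea; not the actual
# EPrints API. Each plug-in implements store/retrieve/delete, and the policy
# maps a MIME type to the set of plug-ins that should hold a copy.
import shutil
import tempfile
from pathlib import Path

class LocalDiskStore:
    """Hypothetical plug-in keeping copies under a local directory."""
    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def store(self, file_id, source_path):
        target = self.root / file_id
        shutil.copy(source_path, target)
        return str(target)                      # location to record in the metadata

    def retrieve(self, file_id, target_path):
        shutil.copy(self.root / file_id, target_path)

    def delete(self, file_id):
        (self.root / file_id).unlink()

def plugins_for(mime_type, policy):
    """Stand-in for the XML/XSLT policy file: choose plug-ins by MIME type."""
    return policy.get(mime_type, policy["default"])

# Demo: keep two copies of a PDF on two back-ends at once, per policy.
workdir = Path(tempfile.mkdtemp())
sample = workdir / "paper.pdf"
sample.write_bytes(b"%PDF-1.4 placeholder content")

policy = {
    "application/pdf": [LocalDiskStore(workdir / "archival"), LocalDiskStore(workdir / "access")],
    "default": [LocalDiskStore(workdir / "access")],
}
locations = [p.store("paper.pdf", sample) for p in plugins_for("application/pdf", policy)]
print(locations)    # recorded so files can be re-arranged if the policy changes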

The actions from this were that others would look at this implementation to see if the rule-based language could apply to other repository platforms. Further, it would be nice to have some good reference architectures available from vendors.

Services and Configuration Languages (was Common Platforms/Tools on the day)

This was an interesting conversation which started around the idea of being able to re-use technologies by re-using/calling code libraries directly. The problem here (as I see it) is the number of coding environments, and versions of those environments, available.

The solution is REST (not SOAP) APIs on the web and abstraction APIs in the code (e.g. Solr), which enable you to call functions from (say) the command line without having to understand the code.
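
Solr is a good example of this: a REST-style HTTP API means you never have to link against the code base at all. The sketch below queries the standard /select handler and reads the JSON response; the host, port and core name are assumptions. The same call works as a one-liner from the command line, which is really the point.

# Query a Solr core over its REST-style API; no Solr code or client library
# needed. The host, port and core name ("repository") are assumptions.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_SELECT = "http://localhost:8983/solr/repository/select"

params = urlencode({"q": "title:preservation", "rows": 5, "wt": "json"})
with urlopen(SOLR_SELECT + "?" + params) as resp:
    results = json.load(resp)

for doc in results["response"]["docs"]:
    print(doc.get("id"), doc.get("title"))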

David Flanders perhaps summed it up best: there are levels of interaction, some easier than others:
  • Core System (hard)
  • Exposing structured data
  • End user interfaces (including APIs)
XML for configuration is a bit of a sticking point with users, but you need a machine-readable language to configure the machine. Perhaps the point here is: only use XML if you need it; otherwise a simple config file with "=" signs in it is fine.
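
As a trivial illustration of how little machinery the "=" style needs (the file contents and keys below are made up), a loader is a handful of lines; XML only earns its keep once the configuration is genuinely nested.

# Minimal "key = value" config loader; blank lines and # comments are skipped.
# The example settings are invented purely for illustration.
import io

def load_config(fh):
    config = {}
    for line in fh:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config

example = io.StringIO("""
# hypothetical repository settings
storage_root = /data/repository
max_upload_mb = 512
""")
print(load_config(example))   # {'storage_root': '/data/repository', 'max_upload_mb': '512'}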

There is no real answer to this question other than try and keep it simple... stupid.

Author IDs (URIs)

Yes, it's our favourite topic rearing its ugly head again!

It is clear that there are many efforts in this area, none of which has fully succeeded yet. There is still much interest, however, and it is clear that we should be prepared to handle multiple IDs for a single author and be able to align them (if allowed) at a later stage.

Currently the project to watch is ORCID, which is a continuation of a previous project by Thomson (one which did not succeed commercially).

The consensus, however, was that we are not wrong to mint URIs for our authors in our repositories.
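
A minimal sketch of what "mint locally, align later" could look like in practice (the URIs and the ORCID value are examples only): each repository author gets a local URI plus a growing set of external identifiers that are recorded as alignments are confirmed.

# Illustrative only: a locally minted author URI with external IDs attached as
# alignments are confirmed. All identifiers below are invented examples.
from dataclasses import dataclass, field

@dataclass
class Author:
    local_uri: str                                  # the URI we mint in our repository
    name: str
    same_as: set = field(default_factory=set)       # aligned external identifiers

    def align(self, external_id):
        """Record an alignment once we're confident both IDs are the same person."""
        self.same_as.add(external_id)

author = Author("https://repository.example.org/id/person/42", "A. N. Author")
author.align("https://orcid.org/0000-0002-1825-0097")   # example ORCID-style identifier
print(author)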

Conclusions

Identification/authorisation is a problem: can technologies like OAuth help not only with authorisation but also with identification? This could be a very interesting area.

SWORD being taken forward is a very positive outcome of the first SITS meeting.

Simple services with simple APIs are so much more effective than "project centric" solutions and bloatware.

Simple services are usable by lots of people!

Monday, 5 October 2009

#ipres09: Are You Ready? Assessing Whether Organizations are Prepared for Digital Preservation.

Planets (mainly Tessella) ran an online survey targeted at libraries and archives in Europe and then opened it up to 2,000 others. They got 206 responses (70% European); 1/3 were libraries and 1/4 archives. Roles: 15% DP, 16% preservation in general, 22% curation, 16% IT, the rest other.

Being a DP survey, the majority said they were aware of DP issues. 1/6 haven't really thought that much about it and 1/2 don't have a DP policy. Interesting stat: if you have a DP policy in place you are 3 times more likely to have a budget in place. However, the majority of those budgets are project budgets, not institutional budgets to actively perform DP in the organisation.

The survey also shows that people are having to preserve all sorts of resources, including databases.

Do people have control over the formats they can accept? The national archives were split (some yes, some no), and the others are pretty balanced between yes and no.

Amount of content: most say that they have less than 100TB of content to preserve, but people see it expanding to much larger quantities [when it won't be any more scary to preserve than it is now; we'll soon have 1TB memory keys - heard it here first].

85% of people are trying to solve the problem and are looking to use plug-and-play components as people don't want to replace their current systems.

The most important thing in preservation is being able to maintain authenticity [possibly a European stance]. Emulation is less important. In the middle of the importance graph was the use of metadata standards, but no one can decide which standards to use.

[Good survey, probably worth looking up and answering it yourself, especially if you are involved in a DP project! How do the aims of your project align with the view of the wider community?]

#iPres09: An Emergent Micro-Services Approach to Digital Curation Infrastructure.

More stuff, smaller budget, more technology [good start]. We can still gain with digital technology, though: through redundancy, meaning (through context), utility (through service) and value (through use) [totally agree with the last point, especially w.r.t. OA publishing].

Project imperatives: do more with less; focus on the content, not the systems in which the content is managed [agree - focus on the data, not the software, which people care less about]. This leads to the goal of the presentation, which is micro-services: small services which provide maximum gain when they can be looped together, replaced, re-written, etc.

There are currently 12, in 4 layers - Storage, Characterisation/Inventory, Service (index, search, migration) and Value (usage). [Good slide, this:] "LOCKSS (lots of copies keeps stuff safe); lots of description keeps stuff meaningful; lots of services keeps stuff useful; lots of uses keeps stuff valuable".

Now describing some of the technologies being used: simple RESTful APIs to storage [however he hasn't mentioned the high-level software they are using, which the first slide in the presentation said should be proven technology]. He emphasises that storage has moved on (as it always will) and that you can gain a lot of benefit through specialist products.

Storage is the main phase they have completed, and they have a few other micro-services done. [I like the aim of it all; however, there are already lots of tools to provide each piece - it is keeping them apart which is hard.]

#iPres09: e-Infrastructure and digital preservation: challenges and outlook

e-infrastructure: He starts by defining infrastructure (see Wikipedia) and e-infrastructure specific to a collection of European digital repositories. So basically we are looking at opportunities to build and supply services which are applicable to these repositories.

Background: the EU is supplying lots of support for this, and in Germany they are researching national approaches, identifying activities and assigning tasks to "expert" institutions. By introducing the current field of projects he outlines that there is still a significant mismatch between the scale of the problem and the amount of effort being expended, and from this that there is a significant lack of common approaches to solving problems. [I don't think this will ever go away unless there is a mandate, and even then not everyone will want to sign up.]

[Lots of argument.] Funding is focused on many individual projects and thus doubles up the argument that there are no commons. This leads to a slide about interoperability and standards and the lack of them. [Which, again, I don't think will ever go away; and I think we should be appreciative that people tend to pick XML to encode their data in - that makes it interoperable, right?]

[This is a start-of-project presentation, so I don't see that much output yet. They have some simple models as diagrams; again, though, at this stage it is hard to see how they are not just another project which will come up with (another) set of standards which no one will then want to adopt.]

Now giving a set of examples where they are going to re-use and extend existing software/projects. The goals are good, in terms of concrete steps towards a global infrastructure for registries, data formats, software deposits and risk management. [I'm just not sure how achievable all this is, given that it has been the aim of many projects already.]

Tuesday, 8 September 2009

Thoughts on digitization, data deluge and linking

It's been a while since I've put a post up and this is probably due to being busy and also trying to tidy up a lot of stuff before starting on new projects.

In this post then: Digitisation

I never really appreciated how big the area of digitisation is, or how many non-repository people are actively involved in it. There are a great many projects (more than 50) digitising resources, and these include national libraries. Items being digitised include everything from postcards and newspapers to full books and old journals.

So what's the problem here... simple... how many people are digitising the same things? Yes, I know there is so much out there that this is unlikely to be the case, but it brings me nicely to the problem of information overload. There is already more valuable information on the internet than we can possibly handle effectively, so how do you ensure that any resources you digitise for open access usage on the web can be found and used?

I don't normally say this but perhaps we should look at physical libraries for the answer. Libraries are a very good central point where you can find publications related to all subject areas, and if your local library does not have a copy then it will try and find a copy somewhere else.

How then does this map onto the web? Websites become the library and links become the references to additional items, or to items a site does not contain - simple, right? Unfortunately, with the 50+ projects I can count already, this leads to 50+ different websites, all with differing information presented in different ways. Because the presentation of each website is totally different, they are not in fact a library - libraries pride themselves on a standard way of organising resources - and so websites become books. Thus to find resources we have to rely on search engines and federation, and we are back where we started, with a problem of information overload.

Unfortunately I don't have an answer to this problem; however, I do know that links hold the key to the solution. Each website at the moment is simply an island of information; what is desperately required is for authors and the community to establish links to these resources. If digitisation houses are curating refereed resources, then the simplest way to link to them would be to put information about them on Wikipedia.

This would be my final point then: Wikipedia is actually a good thing, simply because of the community aspect. However, it also provides many other huge benefits:

  • External resources such as photos have to have a licence

  • In annotating a page/item you create links and establish facts which become available via the semantic version of Wikipedia (DBpedia)

  • Wikipedia is an easy way to establish your presence on the linked data web (linkeddata.org)



So if you are digitising books by an author, add a link to their Wikipedia page. If you are digitising a collection of World War images, add links to some of them on Wikipedia and Flickr.

Establish links and help yourself to help everyone else.