Thursday 5 July 2012

Jason Scott and the Archive Team

If you haven't heard of the Archive Team, I suggest you look them up and perhaps even become an archiving activist, or even a hero!

I had heard of the Archive Team before and, like many, I believed that this rogue group were overly "energetic hackers" with an agenda to preserve the parts of people's lives that companies destroy.

This is an agenda I fully support, but I had not been inspired to take part until recently, when I had the absolute pleasure of listening to a keynote speech by Jason Scott, one of the founding members of the Archive Team. Jason sums himself up as an "energetic hacker" and he is a man who shares many of my beliefs, which I sum up in this post.

1) Publishing is moving too fast for archives to keep up.

Jason compares archiving people's lives on the web, in systems like GeoCities, MobileMe and others, to trying to catch fireflies. They fly up and, while you have time to say "How pretty", you have no time to realise what the firefly does and what its value is to the community.

Once you do realise the value, they are already gone.

This is true of nature, buildings and even people themselves! In many cases there is nothing we can do, but what about digital information?

Cathy Marshall presents a survey of digital lives, carried out with some children at the Library of Congress, and one quote stuck out:

"Why don't we just save facebook as this is our diary, yearbook, guinness book of records..."

This is absolutely something that can be done, and something that the Archive Team are doing... great... but why aren't others?

One word... Policy

2) Just get on with it, ask for forgiveness later.

It is much easier to ask for forgiveness than it is to reconstruct a building that has been knocked down. Once it is gone, nothing can be done. Something that has been saved can still be removed...

Jason and the Archive Team apply this policy, archiving whole websites and people's public content, indexing it and making it available on torrent sites. This has saved a lot of people's content from the trash can.

Again, POLICY is a blocker to people just doing this. The other blocker is the belief that people will not want you to archive and share their data... let's address that...

3) People know what they want, so ask them!

Question for you to consider: Should your medical record be shared?

What if it was shared with researchers worldwide so that a cure might be found quicker?

What if your medical record became the top hit for your name on Google?

It is this last point that seems to be key for most people. Many people don't mind historical information being made available; they just don't want such information to be any more prominent than it was.

Finally, the hardest thing to achieve is to allow users to remain in control of their data and to have it deleted if they want it to be. A big problem with all of these systems is that companies and archives assume ownership of people's data. It is not their data; it belongs to the curators of the data, and control should remain with those people.

If you lend a dinner set to a friend or neighbour for a big party they are planning, you expect to be able to get it back, not for them to take control of it and then throw it away instead of giving it back to you.


It is time that policy was changed. It definitely should not get in the way of progress!

Wednesday 22 February 2012

Drupal 7 - From Blank to Working

I have never set up a site from scratch in Drupal before, but I am impressed by how easy it is to use for users who don't want to delve into a terminal and try to understand templating that way. The problem is that getting from a fresh install to a basic site is actually quite hard! This post lists the steps I went through to make it work; it assumes you have already set up an admin account. The following steps look at building a simple site in the Bartik theme.

1) Set up your development environment - Make sure you have two browsers open, one in Incognito/Private browsing mode so that you can see how changes appear to users who are not logged in.

2) Log in and create a main page. This is a basic page and will be called something like node/1. Via Configuration -> Site Information you will need to set it as the home page (set the other options while you are here).

3) Change the theme to Bartik in Appearance and then click settings to customise your colour scheme and default icon (note that icons need to be the correct size).

4) Go to People and then click the Permissions tab. Make sure that all users can search your site. You can then add the search block (via Structure -> Blocks) to the header (or other) section of your page.

5) Go to Structure -> Menus and click "list links" against "user menu". Then click add link and add a Login link with the url user/login and enable this. You can do the same with the url user/register to get a registration link. Refreshing the page in the browser that is not logged in should now show these links at the top right. Turning off registration can be done via Configuration -> Account Settings.

6) From Structure -> Blocks create a new block with the following content and add it to the header section of your page:

<style type="text/css">
.breadcrumb {
  display: none; /* hide the breadcrumb trail */
}
</style>

This should give you a basic and usable site.

Friday 20 January 2012

Making Debian Changelogs from Github repositories

One of the many things that irks me is the gap between good developers who put all their code on platforms such as GitHub, and those who then actually bother to put some effort into packaging up their code for easy platform installation.

I have come to the realisation that this is mainly due to the pedantic nature of packaging formats and platform lock-in. One such example is the exacting format of the Debian changelog...

GitHub2Changelog is a bit of code that I knocked together to help in this situation. It takes a GitHub repository URL and builds a Debian changelog from the repository commits and tags.

By looking at the tags and commits it works out which commits are related to which tags (something GitHub APIv3 doesn't do) and then outputs this directly to you already formatted.
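The grouping step can be sketched in a few lines of Python. This is only an illustration and not the code of the PHP service: the function names, the tuple layouts and the assumption of a simple linear history (each commit belongs to the first tag made at or after it) are all hypothetical, and fetching the commits and tags from GitHub is left out entirely.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def group_commits_by_tag(commits, tags):
    """Assign each commit to the first tag created at or after it.

    commits: list of (sha, datetime, message), in any order.
    tags: list of (version, sha_of_tagged_commit).
    Returns a list of (version, release_datetime, [messages]), newest first.
    Commits made after the last tag are left out of this sketch.
    """
    commits = sorted(commits, key=lambda c: c[1])  # oldest first
    tagged = {sha: version for version, sha in tags}
    releases, pending = [], []
    for sha, when, message in commits:
        pending.append(message)
        if sha in tagged:
            # A tagged commit closes the release; list newest message first.
            releases.append((tagged[sha], when, pending[::-1]))
            pending = []
    return releases[::-1]


def debian_changelog(package, maintainer, releases,
                     distribution="unstable", urgency="low"):
    """Render the releases in the Debian changelog layout."""
    entries = []
    for version, when, messages in releases:
        lines = [f"{package} ({version}) {distribution}; urgency={urgency}", ""]
        lines += [f"  * {message}" for message in messages]
        # Two spaces separate the maintainer from the RFC 2822 style date.
        lines += ["", f" -- {maintainer}  {format_datetime(when)}", ""]
        entries.append("\n".join(lines))
    return "\n".join(entries)
```

The date is produced with the standard library's `email.utils.format_datetime`, which needs timezone-aware datetimes to emit an offset such as `+0000`.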

The service is built in PHP, and is web based with both a pretty front end and API access.

Ironically, since I've now committed the code to GitHub, I now need to use the service on itself and build the easy to install packages. More on that soon...

Thursday 19 January 2012

DepositMOre - The Prototype

Building on the success of DepositMO and SWORDv2, I thought it would be a good idea to put a quick HTML5 client together to save myself some pain.

The basic premise of this web-based client is to automatically search for "your stuff" in a number of ways and then allow it all to be submitted to a repository in one click.

First target for me was easychair. This service is used as an online conference submission and review system. In a nutshell, if an author wants to get accepted into a conference, easychair is one system which they WILL have to battle with in order to submit their content. As a result there is a strong potential that easychair knows about many publications which should also be present in other systems.

From the main screen in easychair it is possible to navigate and find the many conference publications which you have submitted. Each publication is tied to a conference and it can take a substantial number of clicks to navigate between each publication.


DepositMOre is a modular system which is intended to be a home for many services which locate your publications. The first module to be developed is for easychair.

By simply providing your login credentials to the DepositMOre system, it will not only list all your authored items from easychair but also check whether these are present in your locally detected repository. If they are not deposited and they should be, then one click will do this for you.

A combination of HTML5 and SWORD2 makes this process quick and seamless! Multiple items can be submitted at once and, as each is submitted, you can instantly click a link to your item and view it in the repository.

The following video gives a demo of the prototype in action. We hope to continue development with the support of a funded project.

Technologies Used

  • HTML/Javascript/JQuery/PHP

  • SWORD2 PHP Library - Stuart Lewis

Monday 31 October 2011

A little preservation watch tool for DROID users

Ever wondered what has changed in each new signature file released for The National Archives' DROID tool?

Want a way to find out what objects a new signature file might affect or reclassify?

I have collected together all the available DROID signature files (still want more) and produced a little preservation watch tool that summarises changes between signature file versions. A summary is produced which outlines the signatures and file formats added to each new signature file. Additionally, by selecting any two of the signature files, a user is able to compare two specific versions.
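A comparison between two versions can be sketched as a set difference over the formats each file lists. The Python below is not the tool's own code: it assumes that each format appears as a `FileFormat` element carrying `PUID` and `Name` attributes, which is an assumption about the signature file layout, and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def formats_in(signature_xml):
    """Map PUID -> format name for every FileFormat element in the XML text.

    The element and attribute names are assumptions about the file layout.
    """
    root = ET.fromstring(signature_xml)
    formats = {}
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]  # drop any XML namespace
        if local_name == "FileFormat" and "PUID" in element.attrib:
            formats[element.attrib["PUID"]] = element.attrib.get("Name", "")
    return formats


def compare_versions(old_xml, new_xml):
    """Summarise which PUIDs were added, removed or renamed between versions."""
    old, new = formats_in(old_xml), formats_in(new_xml)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "renamed": sorted(p for p in set(old) & set(new) if old[p] != new[p]),
    }
```

Running `compare_versions` over each consecutive pair of signature files would give the per-release summary, while passing any two files gives the ad hoc comparison.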

Soon to be added will be the ability to subscribe to an email or RSS feed which alerts users to new signature files and changes, allowing active preservation watch.

In order to tailor this more to a user's needs, I'm contemplating allowing selection of specific extensions/formats which users care a lot about and producing an alerting service which focuses on changes to only these types.

Thoughts welcome...

Monday 18 July 2011

More on File Identification Tools

Since I last wrote a post on this back in March I have started some work for the Open Planets Foundation. As I said in my previous post, I see no reason to have too many unmaintainable tools when we could just pick the best one... the problem is making this choice (for some).

Which tool?

Simple - The one which is currently the most widely adopted... file.

Opinions may vary on this; however, ALL of these arguments talk about the feature set of a particular tool or the slowness of the tool when scanning billions of files.

Feature Sets and Ease of Use

File is a very simple tool which offers a mime-type and limited metadata for the file types it knows about. It only accepts single file execution; however, you can wildcard its input in the Linux shell and it executes extremely quickly. In my testing file took 2.375s to identify 1000 files, that's 421 files a second (see the comparison to DROID and FIDO @
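For anyone wanting to repeat this kind of measurement, a rough Python sketch follows. It is not the harness used for the figures above: it assumes the `file` command is on the PATH, the helper names are hypothetical, and the timing includes process start-up.

```python
import subprocess
import time


def files_per_second(file_count, elapsed_seconds):
    """Throughput as a simple ratio, e.g. 1000 files in 2.375 seconds."""
    return file_count / elapsed_seconds


def time_file_command(paths):
    """Run `file --brief --mime-type` once over all paths.

    Returns (list of mime-type strings, elapsed seconds).
    Assumes the `file` utility is installed and on the PATH.
    """
    start = time.perf_counter()
    result = subprocess.run(
        ["file", "--brief", "--mime-type", *paths],
        capture_output=True, text=True, check=True,
    )
    elapsed = time.perf_counter() - start
    return result.stdout.splitlines(), elapsed
```

With the numbers quoted above, `files_per_second(1000, 2.375)` comes out at just over 421.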

Other tools offer more power in other ways, so DROID fits well with The National Archives (UK) Digital Continuity project, providing a PRONOM identifier and mapping back to tools which can perform many operations on these files.

DROID is an ever-improving tool as the underlying data (number of files it can identify) expands. However, certain decisions have made it difficult in recent times to simply profile a single file. It has become a tool which is now quite heavily integrated with other systems rather than loosely coupled.

FIDO resulted from the realisation that DROID was becoming too slow and painful for profiling a single file. Originally written in Python, FIDO performs the same classification as DROID (using the same signatures) in a much shorter time and provides the PRONOM PUID as output. FIDO provided a great proof of concept that this operation could be quick; however, it suffers from the problem that someone has wrapped the current release (0.9.5) of FIDO in Java. This slows it down significantly due to having to launch the Java VM!

FITS - The File Information Tool Set.
This final tool pretty much wraps everything. Some really useful and detailed output can be gained as a result of running not only all the identification tools, but also classification tools like exiftool and ffmpeg. Lots of detail, really slow! Also, more so than any of the others, FITS suffers from the problem of keeping up to date.

The problem with FITS is that it wraps the other tools to provide one output. FITS still wraps DROID v4 with a very old signature file. They chose DROID v4 as this was the last version with decent command line execution, which is (I'm guessing) the way they call it. FITS wraps a great number of other tools in its distribution as well, such as exiftool, which already have package managed versions. Thus all the tools FITS uses are constantly being made out of date by new versions, which then require effort to wrap. FITS is a great attempt at a very valuable tool; however, the constant updating of the tools it bundles is likely to cause many maintainability problems.

Along with all the other tools, FITS suffers from the fact that it doesn't update the DROID signature file (a format which hasn't changed) automatically. This is a simple way for a tool to keep up to date; we should not have to rely on users doing this when the tool should just be doing it for them! People are lazy these days and expect the package to just work, or there to be an available update of the whole package (akin to App Store approaches). The users are right in this respect BTW!

File, the tool I haven't mentioned for a while, is package managed by every widely used platform now, so if there is an update, people are alerted to it. From this point it only takes one click to download the latest, greatest and fastest version.


Like Weird Al's song "Albuquerque" I have finally got to the point (but not the conclusion) ... Packaging.

In order to keep users happy we MUST learn to allow them to download and use tools which suit them.

Personally I'm fed up with Java integration, where to install a package you first have to install Maven, then install something else, then do this..... blah blah bored....

We need to start packaging cross-platform tools inside one-click-install MSIs (Windows), RPMs (Red Hat) and DEB packages (Ubuntu/Debian and the rest of Linux).

I don't care if these packages install dependencies but the user shouldn't have to take more than one step to install a tool.

Further, the tool should either self-update, or the user should be prompted to update it via the package management option which their operating system already contains.

Using packages, FITS could simply be a very small meta package which depends on other packages, thus keeping the whole suite of tools up to date... independently.


* Yes a tool should be fast
* Yes a tool should be feature rich
* Most of all, you should be able to install it and keep it up to date easily!

What's Next

The billion dollar question. I've done some performance testing of the various tools and decided that speed is tied to features. As a package gets bloated and feature rich, it becomes slower! The simpler it is, the faster it is.

What I'd like to do is make the fastest one (file) feature rich without bloating it and slowing it down. Also, file is already package managed, which saves me what appears (according to the other tools) to be a very hard job.

Some investigation on this aim is likely to follow, along with some requirements gathering and classification on critical features before things move forward.

Monday 14 March 2011

Preservation Tools - Moving Forward

Over the last number of years, JISC and other bodies have funded a number of digital preservation projects which have resulted in some really valuable contributions to the area... now is the time to realise the benefits of this work and provide a digital preservation experience to everyday users.

To achieve this a not insignificant amount of work needs to be undertaken, namely to identify key applications and separate these from the complex systems into which they have been built. Alternatively, many applications now need re-thinking and the best bits built into the systems which have superseded these applications.

File Format Identification Tools

File format identification now has a number of tools available, each with its own advantages and disadvantages. In no particular order they are:

DROID
  • Started out as a tool to identify file types and versions of those types. :)
  • Each file version was assigned an identifier which could be referenced and re-used. :)
  • Identification of files was done via "signature", not extension matching. :)
  • Became complex as it was adjusted to suit workflows and provide much more complex information which few people understand or want :(
  • Added complexity increased the time required for each file classification, no longer a simple tool :(

FIDO
  • A new cut down client which takes the DROID signature files and does the simple stuff again :)

file
  • A built in Unix tool installed on every Unix based system in the world already! :)
  • Does not do version type identification :(
  • Does not provide a mime-type URI :(
  • Very quick to run :)
  • Has the capacity to add version type identification and there is a TODO in the code for it! :)
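To make the difference between signature and extension matching concrete, here is a minimal Python sketch. It is not how any of the tools above are implemented: the table is a tiny hand-picked set of well-known leading byte sequences, real PRONOM signatures are far richer (offsets, end-of-file sequences, wildcards), and the function name is hypothetical.

```python
# (leading bytes, mime-type, version) for a handful of well-known formats.
SIGNATURES = [
    (b"GIF87a", "image/gif", "87a"),
    (b"GIF89a", "image/gif", "89a"),
    (b"\x89PNG\r\n\x1a\n", "image/png", None),
    (b"%PDF-", "application/pdf", None),
]


def identify(data):
    """Return (mime_type, version) by looking at the bytes, not the file name."""
    for magic, mime, version in SIGNATURES:
        if data.startswith(magic):
            if mime == "application/pdf":
                # A PDF header looks like "%PDF-1.4": the version follows the magic.
                version = data[5:8].decode("ascii", "replace")
            return mime, version
    return None, None
```

Because the bytes are inspected directly, a GIF saved as `photo.txt` is still reported as `image/gif`, and the header also yields the kind of version information discussed in the list above.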

With the PRONOM registry now looking at providing URIs for file versions, why can't we stop coding new tools and change the FILE library instead? This way it could handle the version information and feed back the URIs if people want them. I've looked briefly into this and the PRONOM signatures should be easy to transport and use with the file tool.

If I get time I might well have a go at this and feed it back to the community.