Monday, 18 July 2011

More on File Identification Tools

Since I last wrote a post on this back in March, I have started some work for the Open Planets Foundation. As I said in my previous post, I see no reason to have so many unmaintainable tools when we could just pick the best one... the problem (for some) is making this choice.

Which tool?

Simple - The one which is currently the most widely adopted... file.

Opinions may vary on this. However, ALL of the arguments come down to either the feature set of a particular tool or the slowness of a tool when scanning billions of files.

Feature Sets and Ease of Use

File is a very simple tool which reports a mime-type and exposes limited metadata for the file types it knows about. It only accepts single file execution; however, you can wildcard its input in the Linux shell and it executes extremely quickly. In my testing file took 2.375s to identify 1000 files, that's 421 files a second (see the comparison to DROID and FIDO @
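
As a rough illustration of how a figure like this can be measured, here is a minimal Python sketch. It assumes the file command is on the PATH; the function names (files_per_second, time_file_run) are made up for this example and are not part of any of the tools discussed.

```python
import subprocess
import time
from pathlib import Path


def files_per_second(file_count: int, elapsed_seconds: float) -> float:
    """Return identification throughput in files per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return file_count / elapsed_seconds


def time_file_run(paths: list[Path]) -> float:
    """Run `file --mime-type` once over all paths; return elapsed seconds.

    Assumes the `file` command is installed and on the PATH.
    """
    start = time.perf_counter()
    subprocess.run(
        ["file", "--mime-type", *map(str, paths)],
        check=True,
        capture_output=True,
    )
    return time.perf_counter() - start


if __name__ == "__main__":
    # The figures quoted above: 1000 files identified in 2.375 seconds.
    print(f"{files_per_second(1000, 2.375):.0f} files/second")
```

To time a real directory, something like time_file_run(list(Path("samples").iterdir())) would do, where "samples" is a placeholder for wherever your test files live.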

Other tools offer more power in other ways. DROID fits well with The National Archives (UK) Digital Continuity project, providing a PRONOM identifier and a mapping back to tools which can perform many operations on these files.

DROID is an ever-improving tool as the underlying data (the number of file formats it can identify) expands. However, certain decisions have made it difficult in recent times to simply profile a single file. It has become a tool which is now quite heavily integrated with other systems rather than loosely coupled.

FIDO resulted from the realisation that DROID was becoming too slow and painful for profiling a single file. Originally written in Python, FIDO performs the same classification as DROID (using the same signatures) in a much shorter time and provides the PRONOM PUID as output. FIDO provided a great proof of concept that this operation could be quick; however, it suffers from the problem that someone has wrapped the current release (0.9.5) of FIDO in Java. This slows it down significantly due to having to launch the Java VM!

FITS - The File Information Tool Set.
This final tool pretty much wraps everything. Some really useful and detailed output can be gained as a result of running not only all the identification tools, but also classification tools like exiftool and ffmpeg. Lots of detail, really slow! Also, more so than any of the others, FITS suffers from the problem of keeping up to date.

The problem with FITS is that it wraps the other tools to provide one output. FITS still wraps DROID v4 with a very old signature file. They chose DROID v4 as this was the last version with decent command line execution, which is (I'm guessing) the way they call it. FITS wraps a great number of other tools in its distribution as well, such as exiftool, which already have package managed versions. Thus all the tools FITS uses are constantly being outdated by new versions, each of which then requires effort to wrap. FITS is a great attempt at a very valuable tool; however, the constant updating of the tools it bundles is likely to cause many maintainability problems.

Along with all the other tools, FITS suffers from the fact that it doesn't update the DROID signature file (a format which hasn't changed) automatically. This is a simple way for a tool to keep up to date. We should not have to rely on the users doing this when the tool should just be doing it for them! People are lazy these days and expect the package to just work, or there to be an available update of the whole package (akin to App Store approaches). The users are right in this respect BTW!
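
To show how little is needed for the "is my signature file out of date?" check, here is a minimal Python sketch. It assumes signature files are named in the form DROID_SignatureFile_V<number>.xml, and it deliberately leaves out how the tool learns the name of the latest file (that lookup is not shown here and would depend on wherever the signature files are published). The function names are my own, not part of DROID or FITS.

```python
import re
from typing import Optional

# Assumed naming convention: DROID_SignatureFile_V<number>.xml
_VERSION_PATTERN = re.compile(r"DROID_SignatureFile_V(\d+)\.xml$")


def signature_version(filename: str) -> Optional[int]:
    """Extract the version number from a signature file name, if present."""
    match = _VERSION_PATTERN.search(filename)
    return int(match.group(1)) if match else None


def needs_update(installed: str, latest: str) -> bool:
    """Return True when the latest signature file is newer than the installed one."""
    installed_version = signature_version(installed)
    latest_version = signature_version(latest)
    if latest_version is None:
        return False  # cannot tell what the latest is, so do nothing
    if installed_version is None:
        return True  # nothing recognisable installed, so fetch the latest
    return latest_version > installed_version
```

A tool could run a check like this at start-up and fetch the newer file itself, rather than leaving it to the user.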

File, the tool I haven't mentioned for a while, is package managed by every widely used platform now, so if there is an update, people are alerted to it. From this point it only takes one click to download the latest, greatest and fastest version.


Like Weird Al's song "Albuquerque" I have finally got to the point (but not the conclusion) ... Packaging.

In order to keep users happy we MUST learn to allow them to download and use tools which suit them.

Personally I'm fed up with Java integration, where to install a package you have to first install Maven, then install something else, then do this..... blah blah bored....

We need to start packaging cross platform tools inside one click install MSIs (Windows), RPM (Red Hat) and DEB packages (Ubuntu/Debian and the rest of Linux).

I don't care if these packages install dependencies but the user shouldn't have to take more than one step to install a tool.

Further, the tool should either self update, or the user should be prompted to update it via the package manager which their operating system already contains.

Using packages, FITS could simply be a very small meta package which depends on other packages, thus keeping the whole suite of tools up to date... independently.
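
To make the meta package idea concrete, here is a minimal Python sketch that renders the control stanza for a Debian-style meta package. The package name "fits-meta", the version and the maintainer address are hypothetical; "file" and "libimage-exiftool-perl" are the Debian package names for file and exiftool. This is only an illustration of the idea, not how FITS is actually packaged.

```python
def render_control(package: str, version: str, maintainer: str,
                   depends: list[str]) -> str:
    """Render a minimal Debian control stanza for a meta package.

    The meta package ships no files of its own; it only declares
    dependencies, so each tool is updated by its own package.
    """
    lines = [
        f"Package: {package}",
        f"Version: {version}",
        "Architecture: all",
        f"Maintainer: {maintainer}",
        f"Depends: {', '.join(depends)}",
        "Description: meta package pulling in file identification tools",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "fits-meta" and the maintainer address are hypothetical placeholders.
    print(render_control(
        package="fits-meta",
        version="0.1",
        maintainer="Example Maintainer <maintainer@example.org>",
        depends=["file", "libimage-exiftool-perl"],
    ))
```

When exiftool or file gets a new release, the package manager updates that one package and the meta package does not need to change at all.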


* Yes a tool should be fast
* Yes a tool should be feature rich
* Most of all, you should be able to install it and keep it up to date easily!

What's Next

The billion dollar question. For my part, I've done some performance testing of the various tools and concluded that speed is determined by features. As a package gets bloated and feature rich, it becomes slower! The simpler it is, the faster it is.

What I'd like to do is make the fastest one (file) feature rich without bloating it and slowing it down. Also, file is already package managed, which saves me what appears (judging by the other tools) to be a very hard job.

Some investigation on this aim is likely to follow, along with some requirements gathering and classification on critical features before things move forward.