Thursday 7 May 2009

File Format Risk Analysis! How hard can it be?


The trouble with this is the sheer number of different formats out there (PDF, DOC, DOCX, ...) and the number of versions each of these has, each of which has its own problems.

Enter the P2 data registry, which now contains over 40,000 facts relating solely to format types. I have been working on this for a while now, and a full list of services can be found on the registry homepage. Although the SPARQL endpoint is a great place to start, it doesn't help you much with risk analysis.

The first thing is to determine which properties of a format or format super-type (PDF 1.6 has the super-type PDF, for example) are important when analyzing risk. Do you consider age, number of software tools, quality of documentation, etc.? And how do you weight each of these properties as a factor of overall risk?

Using the registry as a starting point, I have drawn up a profile involving these properties, as well as many others, each of which returns a high, medium or low risk level. These are then summed and normalised to give a single risk score.
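To make the summing and normalising step concrete, here is a minimal sketch in Python. It is not the code behind the live service: the numeric values given to low, medium and high, the property names, the weights and the `risk_score` function are all assumptions made up for illustration.

```python
# Hypothetical mapping from a risk level to a number (an assumption,
# not the values used by the registry service).
LEVEL_SCORES = {"low": 0, "medium": 1, "high": 2}


def risk_score(levels, weights=None):
    """Combine per-property risk levels into one score between 0 and 1.

    levels:  dict of property name -> "low" | "medium" | "high"
    weights: optional dict of property name -> weight (default 1.0 each)
    """
    if not levels:
        raise ValueError("at least one property is needed")
    weights = weights or {}
    highest = max(LEVEL_SCORES.values())

    total = 0.0      # weighted sum of the levels actually reported
    max_total = 0.0  # weighted sum if every property were high risk
    for prop, level in levels.items():
        weight = weights.get(prop, 1.0)
        total += weight * LEVEL_SCORES[level]
        max_total += weight * highest

    if max_total == 0:
        raise ValueError("weights must not all be zero")
    # Normalise: 0.0 means every property is low risk, 1.0 means all high.
    return total / max_total


# Made-up profile for a single format version.
example_levels = {
    "age": "low",
    "software_tools": "medium",
    "documentation_quality": "high",
}
example_score = risk_score(example_levels)
```

With equal weights the made-up profile above scores 0.5. Passing something like `weights={"documentation_quality": 2.0}` makes that property count double, which is one simple way of expressing that some properties matter more than others.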

Enough of me talking: the live service can be found here for PDF 1.3 and here for PDF 1.4.