Thursday, 4 September 2008

The challenge of grey information in a connected world

The media would love the world to be black and white, but we all know in reality that everything is shades of grey. The same is true for the authority of geospatial data.

Some data-sets are authoritative in the sense that they are the master copy, curated by a reputable organisation with a mandate to maintain a particular geospatial data-set. One might say that anybody using a different instance from the authoritative one had better have a good reason. But what if the organisation only provides Internet 1 style access, so a user has to take a copy (e.g. an ftp download) for their own use and then reformat it to suit the needs of their analysis software? The copy they are using is no longer the same as the original. And what if a colleague needs to use the same data-set a week, a month or a year later, and needs it in the same format - when should they regard the local, most convenient copy as inappropriate for their use? That depends on a whole range of things - not least the effort required to update the local copy, the expected rate of change of the original, and the relevance of the anticipated changes to the analysis. So there may be valid reasons for using grey versions of data-sets that have well defined formal authority.

What is the citation for this usage? When a paper is published about results derived from the data-set, do we cite the authoritative source and the date at which the original copy was taken and leave it at that, or do we fully describe the process(es) that were used to reformat the data-set before the analysis? Do we actually know, in a fully reproducible way, what those processes were - or do we trust the skills of the person who did the work? To cover ourselves, do we take a copy of the copy and archive it on some off-line media to ensure that we can return to the analysis - and then would we cite the copy we used or the copy we archived? And so on. After all, beyond sharing knowledge, the point of formal scientific publication and citation is the reproducibility of results. This is the challenge of grey data.
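To make the problem concrete, here is a minimal sketch of the kind of provenance record one could keep beside a local, reformatted copy - the source, the retrieval date, a checksum and the reformatting steps applied. The file names, URL and field names are purely illustrative assumptions, not any prescribed standard.

```python
# A minimal sketch (not any particular standard) of a provenance record kept
# alongside a local, reformatted copy of an authoritative data-set.
# The URL, file names and field names below are hypothetical.
import hashlib
import json
from datetime import date

def checksum(path):
    """Return a SHA-256 digest so the local copy can be verified later."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

provenance = {
    "authoritative_source": "ftp://example.org/coastline/nz_coast.shp",  # hypothetical
    "retrieved_on": str(date(2008, 9, 1)),
    "local_copy": "nz_coast_wgs84.csv",                                  # hypothetical file
    "local_checksum": checksum("nz_coast_wgs84.csv"),
    "processing_steps": [
        "reprojected from NZMG to WGS84",
        "exported attribute table to CSV",
    ],
}

# Write the record next to the data so the 'grey' copy can at least be traced.
with open("nz_coast_wgs84.provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```

Even something this simple answers two of the questions above: which original the copy came from, and what was done to it on the way.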

But the world of science is full of data-sets that are authoritative in the sense that nobody holds a better version, yet their authority is informal: known and respected by specialists in the particular field, but not maintained with the same formal rigour or necessarily updated to a regular published schedule. This is reality, not a criticism of those involved. Such data-sets may be used only infrequently, and the money - it always comes down to money - may not be there for full descriptive documentation. So how do we cite such data-sets? By proxy, through the first or most recent occasion the data-set was mentioned in a published document, or as pers. comm. with the name of the owner - and both approaches assume that you are using the original version and not an evolved copy, as explored in the previous paragraph.

Despite the shortcomings, the solutions I have described for citation have been deemed just sufficient for traditional published material, but what happens in a digitally connected Internet 2 world? This is the domain of Digital Repositories for Scientific Data and Persistent Identifiers, or in a nutshell, a collaborative space in which to put and use data, and a means to reference or cite data in a repository that won't change over time. These are core subjects for projects such as ANDS (the Australian National Data Service).
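The essential idea behind a persistent identifier is that the citation carries a stable name, and a resolver maps that name to wherever the data currently lives. The sketch below is illustrative only, with a made-up identifier scheme and URLs; real systems such as DOIs and Handles work along these general lines.

```python
# A minimal sketch of persistent identification: the identifier in a citation
# never changes, while the resolver's mapping to the current storage location can.
# Identifier scheme and URLs are invented for illustration.
RESOLVER = {
    "nz-data:10378/coastline-2008": "http://repository.example.org/datasets/42",
}

def resolve(identifier):
    """Turn a persistent identifier into the data-set's current location."""
    return RESOLVER[identifier]

citation = "Smith (2008), coastline analysis, data-set nz-data:10378/coastline-2008"
print(citation)
print(resolve("nz-data:10378/coastline-2008"))
```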

But we need to go at least one step further, and of course from a NZ perspective we haven't collectively taken the first step yet. Data is useful for its own sake, but its real value in a scientific sense arises when it can be used for further analysis. As mentioned above, information is processed and analysed, so we need a means to reference the processing steps. With traditional published papers, this has been the role of the method section. But in a digitally connected world, we should be able to go one step further. Imagine having a reference - in a paper, say, or on a webpage - that looks like any other link, but when you click on it allows you to actually execute all or part of the analysis that the original researcher performed. Well, people are working on that too - enter the world of Workflows, Files and Packs at myExperiment, recently augmented by WHIP and Bundles, which have emerged from a collaboration with the Triana project team - a real acronym soup of progress! So what does all this mean, and how does it relate to grey information?

For a start, myExperiment is a repository for a wide range of Files that scientists can upload and share, but it has two key features relevant to this discussion: Workflows and Packs. I'll explain Packs first because they are simpler - a Pack is a persistent description of a set of digital objects, some of which might be stored in myExperiment as Files while others may be external to myExperiment. It is like the shopping list you create before you go shopping, rather than the car full of stuff you bring home after the shopping expedition. But the items in the list are fully described, so that anybody can take it on a shopping expedition and come back with the same stuff. So a Pack reference (or URI) in myExperiment has many of the characteristics needed for a citation.
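To illustrate the shopping-list idea, here is a minimal sketch of what such a description might contain: a persistent reference for the Pack itself, plus fully described items, some internal to the repository and some external. The structure, field names and URIs are assumptions made for illustration, not the actual myExperiment Pack format.

```python
# A minimal sketch of the "shopping list" behind a Pack: a persistent description
# of a set of digital objects. Field names and URIs are illustrative only.
pack = {
    "uri": "http://www.myexperiment.org/packs/999",            # hypothetical identifier
    "title": "Sea-surface temperature anomaly analysis",
    "items": [
        {"type": "internal file",
         "uri": "http://www.myexperiment.org/files/123",        # held as a File
         "description": "Gridded SST observations, reformatted to NetCDF"},
        {"type": "external resource",
         "uri": "ftp://example.org/coastline/nz_coast.shp",     # lives elsewhere
         "description": "Authoritative coastline data-set"},
        {"type": "internal workflow",
         "uri": "http://www.myexperiment.org/workflows/456",
         "description": "Anomaly calculation workflow"},
    ],
}

# Anyone holding the Pack's URI can "go shopping": resolve each item and
# assemble the same set of objects the original author described.
for item in pack["items"]:
    print(item["type"], "->", item["uri"])
```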

A Workflow is the digital equivalent of the method section of a published paper, with one vital difference: if all the data is digital, and the processing steps are available as web-services, then the Workflow can be executed - i.e. the method can be repeated - by other myExperiment colleagues. Even better, these colleagues can substitute their own data, or an alternative to one of the method steps, and run the method again - so now myExperiment is a shared digital laboratory. This is where WHIP and Bundles come in. Bundles are the result of going shopping with a Pack that contains a Workflow and all the Files it uses. It is not just the shopping list, but the car full of stuff, and WHIP is a myExperiment add-on that knows how to unpack the shopping and make it all work for you with a single mouse-click.
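The difference between a prose method section and an executable workflow is easiest to see in code. The sketch below treats each step as a callable that can be re-run or swapped out; the step functions and data values are stand-ins of my own, not real Taverna, Triana or myExperiment services.

```python
# A minimal sketch of an executable "method": each step is a function that a
# colleague can re-run, or replace with their own data or alternative step.
# All functions and values here are illustrative stand-ins.
def fetch_observations(source_uri):
    """Stand-in for a web-service call that retrieves the input data."""
    return [14.2, 14.8, 15.1, 13.9]              # placeholder values

def compute_anomalies(values, baseline=14.5):
    """Stand-in for the analysis step."""
    return [round(v - baseline, 2) for v in values]

def run_workflow(fetch=fetch_observations, analyse=compute_anomalies,
                 source_uri="http://example.org/sst"):
    """Execute the whole method end to end."""
    return analyse(fetch(source_uri))

print(run_workflow())                                         # repeat the original method
print(run_workflow(source_uri="http://example.org/my-data"))  # substitute your own data
```

A Bundle, in these terms, would be the workflow together with the files it needs, shipped as one package so that the single mouse-click is actually possible.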

So now we have Packs that can be cited, and when a Pack contains a Workflow and its Files, we have a means for other scientists to repeat or extend the original method. So in a web-connected world we are close to solving the problem of grey data and analytical processing, a problem that is very difficult to solve for ordinary desktop processing.

Where does geospatial fit into all this? Well, as yet it doesn't - the Workflow tools that are supported or about to be supported by myExperiment, Bundles and WHIP (i.e. Taverna and Triana) don't yet deal with geospatial processing. That is what we need to do next.
