Thursday, 11 September 2008

AHM08/RS5: Regular Session 5

Jeremy Cohen: An e-Science Platform for on-demand management and processing of environmental data

based on MESSAGE - the Mobile Environmental Sensing System Across Grid Environments

sensors connected by Bluetooth, WiFi & 3G

100,000s of sensors imply highly variable rates of data acquisition .. using Amazon's elastic cloud (EC2) commodity computing to cope with the varying load.

on-demand modelling ..

.. requires on-demand computing .. OGSA-DAI, OGSA-DQP and EC2

.. on-demand being driven by both back end data availability and front end user request

www.message-project.org

www.imperial.ac.uk/

Nathan Cunningham: Optimising Antarctic research cruises


"Big Data" Nature

"Growing your data" Lynch 2008, Nature 455:7209

planning is influenced by real-time ice information, real-time penguin tracks - showing where the polar front is, real-time chlorophyll imaging, etc.

delivery is over a 128k satellite link to the research ship.

www.edp.ac.uk/ .. environmental data portal.

Liangxiu Han: FireGrid: an eInfrastructure for next generation Emergency Response Report

http://firegrid.org/

Jim Myers: A Digital Synthesis Framework for Virtual Observatories


Context
.. Ocean Observatories Initiative
.. National Ecological Obs Net
.. WATERS Net

want a 3D immersive equivalent to going there, but augmented by data richness

even though there is general initial user agreement on the concept, in fact there is a great deal of variation in the specifics

as soon as you read about it you should be able to action it via a workflow.

Digital Synthesis Framework .. data playground

Semantic Content Management
Streaming Data Management
CyberCollaboration Portal
Cyberintegrator.
Content Repositories
.. all stored in RDF

RESTful, using Restlets

front-end widgets use Google Web Toolkit
inter-widget interactions

Dynamic Analysis Environment

eg Corpus Christi Bay Dashboard

Community Annotation.

code at:

svn.ncsa.uiuc.edu/svn/cyberintegrator &dse

AHM08: Sharing and Collaboration

Jeremy Cohen: Sharing and Collaboration (in the laboratory) Blogs, Logs and Pods

Laboratory e-Notebook

leverage off things we already do
.. COSHH
.. Process: to-do vs plan vs record - all integrated diagrammatically on a PDA
.. Integration of lab records with the building management system
. so that PDAs etc can subscribe to the building message broker
. after all, the building mgmt system knows such things as lab room temp
.. results can be made available WITH the data in databases, to provide remote sources for validation
.. record the units
. eg a bridge from Germany to Switzerland didn't meet by 10cm, because elevations were measured against different sea levels, which differed by 5cm, but the sign was got wrong

BioBlog .. http://chemtools.chem.soton.ac.uk/projects/blog/blogs.php/blog_id/15

bioblog templates are essential

barcode to URL conversion
or 2D matrix barcode plus phone conversion to URL and retrieval
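A minimal sketch of the barcode-to-URL idea: normalise whatever the scanner returns and build a retrievable URL from it. The base URL and ID scheme below are invented for illustration, not the Southampton chemtools scheme.

```python
# Minimal sketch: map a scanned sample barcode to a lab-notebook URL.
# The base URL and ID scheme are hypothetical, not the chemtools API.

BASE_URL = "http://chemtools.example.org/samples"  # hypothetical endpoint

def barcode_to_url(barcode: str) -> str:
    """Normalise a scanned barcode string and build a retrievable URL."""
    sample_id = barcode.strip().upper().replace(" ", "")
    if not sample_id.isalnum():
        raise ValueError(f"unexpected characters in barcode: {barcode!r}")
    return f"{BASE_URL}/{sample_id}"

if __name__ == "__main__":
    print(barcode_to_url("ns 04217 b"))  # -> http://chemtools.example.org/samples/NS04217B
```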

http://simile.mit.edu/welkin .. Welkin is a graph-based RDF visualizer

discovered that the use of a blog actually improves the quality of what is recorded.

.. comment by sketch .. chemists are scribblers

also cf http://wikispaces.com/, which is wiki-based

finally still need to link all the component parts from publication to conversations to lab notebook.

LaBlog/wiki .. myExperiment vs ourExperiment.org

semantic web .. data deluge .. maintaining & communicating context
. major problem of communicating meaning
. eg PowerPoint arrows on a Hebrew system, which is a right-to-left language . which arrow is forward?
ie people and their backgrounds are needed ..

maybe call this the 'semiotics of the semantic web' or the 'semiotic web'?

AHM08/W9-3: The Global Datacentric View

Ian Atkinson: ARCHER Data Services

HERMES - generic datagrid tool
PLONE tools

cf archer.edu.au

hermes ..
http://commonsvfsgrid.sf.net

Plone, SRB & ICAT
http://eresearch.jcu.edu.au/wiki

PJ Kirsch: Developing a common data discovery, browsing and access framework

BAS multi-disciplinary requirements.

used to be scientist-(mis)managed

need to have a framework for the data and its documentation

tech drivers - 24/7 link & bandwidth even to remote data sensors
free client tools - eg google maps etc

must have
- efficient discovery
- appropriate visualisation .. what does appropriate mean - user perspective & data dependence
- access to data
- access to ancillary/auxiliary data

- sometimes a reference to an accession number for non-digital holdings

initial response to a query is a timeline showing availability and a quality indicator, plus a list of other associated docs in a Subversion DB, eg s/w, code, reports etc

as you 'zoom' in on the data the timelines may show additional variants such as region, raw vs processed, instrument variants etc.

provider-nominated visualisation, as a scrollable, zoomable time or space display .. linked to dataset download.

ISO 690 - reference for citing data.


Andrew Treloar: ANDS - what are we doing, why are we different, whether we are courageous.

Platforms for Collaboration ..

follow-on from ARROW, DART and ARCHER

blueprint: 'Towards the Australian Data Commons'
.. why data - because of the data deluge we need to spend more and more
.. need for standardisation
. s/w & h/w gets cheaper, wetware more expensive
.. role of data federations
. cross disciplinary opportunity opens door to new research
. but it is difficult

cf australian code for the responsible conduct of research
.. institutional and researcher obligations.
.. signed up to by all chancellors etc . so serious
.. funding will become tied to compliance

ANDS programmes
.. developing frameworks
.. providing national utilities
. discovery
. persistent identifiers - PILIN
. collections registry

discovery - ISO 2146 - high-level architecture for the registry
collection, party/people, activity, service
expose this to Google and web-service harvesting
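For illustration, the four ISO 2146 object classes could be held as simple records and exposed for harvest; the field names and identifiers below are assumptions, not the ANDS registry schema.

```python
# Illustrative sketch of the four ISO 2146 registry object classes
# (collection, party, activity, service) as plain records exposed for harvest.
# Field names are assumptions, not the ANDS registry schema.
from dataclasses import dataclass, asdict
import json

@dataclass
class RegistryObject:
    key: str           # persistent identifier for the object
    object_class: str  # "collection" | "party" | "activity" | "service"
    title: str
    related_keys: list

records = [
    RegistryObject("coll-001", "collection", "Reef temperature time series", ["party-007", "svc-003"]),
    RegistryObject("party-007", "party", "Example Marine Research Group", []),
    RegistryObject("act-042", "activity", "2008 reef monitoring project", ["coll-001"]),
    RegistryObject("svc-003", "service", "Harvest endpoint for the collection", ["coll-001"]),
]

# Expose as JSON, e.g. for a web harvester to pick up.
print(json.dumps([asdict(r) for r in records], indent=2))
```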

.. seeding the commons, ie work with lead exemplars
.. building (human) capabilities
. train the trainers

1st review .. 'strategic roadmap aug 2008.pdf'
cf p21, p22, p23

http://ands.org.au

AHM08: Cloud data mining

Robert Grossman: The Emergence of the Data Centre as a scientific instrument

Differences between Google and e-Science

.. scale to a datacentre . Google, e-science, health
.. scale over datacentres . e-science only
.. support large data flows . e-science only
.. user and file security . Google, health

For Sector

implies transport and routing services are needed in addition to Google's stack -
. so developed UDT, 'UDP-based Data Transfer'

UDF map-reduce applied across this stack

Sector/Sphere is fast, easy to program and customisable; 2-3x to 4-6x faster than Hadoop

Sphere is the compute cloud, Sector is the data cloud

Sector's security is based on SSL, plus the audit tracking that is needed.
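A conceptual sketch (not the Sector/Sphere API) of what applying a UDF across data segments, map-reduce style, means: run the user-defined function next to each segment, then reduce the partial results.

```python
# Conceptual sketch only: apply a user-defined function (UDF) to data
# segments where they live, then reduce the partial results, in the
# map-reduce style Sphere uses. This is not the Sector/Sphere API.
from concurrent.futures import ThreadPoolExecutor

segments = [
    [3, 1, 4, 1, 5],   # stand-ins for data segments held on different nodes
    [9, 2, 6, 5, 3],
    [5, 8, 9, 7, 9],
]

def udf(segment):
    """User-defined function run next to each segment (here: a local sum)."""
    return sum(segment)

with ThreadPoolExecutor() as pool:   # stands in for the compute cloud
    partials = list(pool.map(udf, segments))

total = sum(partials)                # reduce step
print(partials, "->", total)
```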

AHM08: Visualising the Future

Chris Johnson: Scientific Computing & Imaging Institute, Utah .. pronounced 'ski'

Not retrospective visualisation of the results, but integrated visualisation in the problem solving process

GPU - massively parallel architecture
.. scaling many times faster than multi-core CPUs
.. now have high-precision floating-point GPUs from NVIDIA

using GPUs to process petabytes of neuro slice data.

volume rendering ...
traditional 'maximum intensity projection' (MIP) to 'full volume rendering'

the new approach was too computationally expensive, but with GPUs it becomes tractable +
multi-dimensional transfer function - mapping derivatives & integrals across multiple slices to RGB
... s/w called Seg3D .. BioImage .. hardest part was making it useful!
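A minimal sketch of the 2-D transfer-function idea: classify each voxel by its scalar value and gradient magnitude and map the pair to a colour and opacity. The thresholds and colours are illustrative, not values from Seg3D.

```python
# Minimal sketch of a 2-D transfer function: classify each voxel by its
# scalar value and gradient magnitude and map the pair to an RGBA colour.
# The thresholds and colours are illustrative only.

def transfer_function(value: float, grad_mag: float) -> tuple:
    """Return (r, g, b, a) for a voxel, all in [0, 1]."""
    if grad_mag > 0.5:            # strong boundary: render opaque red
        return (1.0, 0.2, 0.2, 0.9)
    if value > 0.7:               # dense homogeneous material: grey, semi-opaque
        return (0.8, 0.8, 0.8, 0.4)
    return (0.2, 0.4, 1.0, 0.05)  # background/soft tissue: faint blue

for v, g in [(0.9, 0.1), (0.3, 0.8), (0.2, 0.1)]:
    print((v, g), "->", transfer_function(v, g))
```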

time-dependent visualisation

isosurface extraction
.. marching cubes Lorensen & Cline 1987
.. but now PISA, RTRT, NOISE, octree - up to 10^4 faster algorithms, but not available .. ie not open source

PISA .. Livnat & Tricoche '04 .. if the triangle is too small to see, don't calculate it.
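A sketch of that view-dependent culling test, assuming a simplified pinhole projection; the focal length and one-pixel threshold are illustrative.

```python
# Sketch of the view-dependent culling idea: skip any cell whose projected
# size on screen would be smaller than a pixel. The pinhole projection here
# is deliberately simplified.

def projected_size_pixels(cell_size: float, distance: float,
                          focal_px: float = 1000.0) -> float:
    """Approximate on-screen size (in pixels) of a cell of given world size."""
    return focal_px * cell_size / max(distance, 1e-9)

def worth_extracting(cell_size: float, distance: float) -> bool:
    return projected_size_pixels(cell_size, distance) >= 1.0  # at least one pixel

print(worth_extracting(0.01, 2.0))    # nearby small cell: extract
print(worth_extracting(0.01, 50.0))   # same cell far away: skip
```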

ray-tracing
as the # of objects goes up, ray tracing becomes more efficient than the raster (traditional GPU) algorithm
DOE ASCI C-SAFE .. simulate an explosion from first principles & visualise it.
.. Manta - real-time ray tracer
.. how to simulate the right colours of flames correctly, rather than mapping temperature to a colour ramp
.. perception of shadows .. currently based on Phong and Gouraud shading from the 1970s, but today's hardware is faster
.. so if we solve Maxwell's equations for realism .. need to artificially introduce an appropriate light source into, say, a CAT scan .. not always obvious how to do it.

3d vis of error and uncertainty ..
.. working on it . no one way to do it
.. what about mapping RGB to fuzziness or sensitivity or confidence
.. uncertainty animation
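One possible mapping, sketched below: pull low-confidence values towards grey and reduce their opacity. The exact mapping is an assumption; as noted, there is no single agreed way to do this.

```python
# Sketch: one possible mapping of per-point confidence onto colour, fading
# uncertain values towards grey and lowering their opacity. Illustrative only.

def confidence_to_rgba(value: float, confidence: float) -> tuple:
    """value in [0,1] picks a blue->red ramp; confidence in [0,1] sets alpha
    and pulls low-confidence colours towards neutral grey."""
    r, g, b = value, 0.1, 1.0 - value      # simple blue-to-red ramp
    grey = 0.5
    mix = confidence                        # 1 = pure colour, 0 = grey
    rgba = (r * mix + grey * (1 - mix),
            g * mix + grey * (1 - mix),
            b * mix + grey * (1 - mix),
            0.2 + 0.8 * confidence)         # never fully invisible
    return tuple(round(c, 3) for c in rgba)

print(confidence_to_rgba(0.9, 1.0))   # confident hot value: saturated red
print(confidence_to_rgba(0.9, 0.2))   # same value, low confidence: washed out
```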

in 2003 as much info was generated as was published in all preceding human history, and we have repeated that every year since.

cf www.vistrails.org with Taverna & myExperiment .. visualisation of differences due to technique variation.

http://www.sci.utah.edu/vaw2007/ .. book from Visualisation and Analytics Workshop

Wednesday, 10 September 2008

AHM08/RS1: Regular Session

Jeremy Cohen: ICENI II

Coordination forms:

declarative workflow language
.. describe what not how
.. much easier to logically analyze the flow

use of coordination forms for matching

workflow execution .. bpel, scufl etc

declarative workflow generation tuned to users' normal activities

.. automated workflow generation
.. extract from a user's real-time use of their natural software - MATLAB etc

workflow execution with performance .. a performance repository .. used to drive planning of an optimal execution plan

ICENI II plan

Daniel Goodman: Decentralised Middleware and Workflow Enactment for the Martlet Workflow Language


Middleware comprises:
.. Process Coordinator
.. Data Store
.. Data Processor

Essentially introduces an efficient protocol for P2P communication between PCs and DPs such that each node becomes aware of changes in the state and availability of the network as a whole, in a decentralised, robust and efficient way.

Ahmed Algaoud: Workflow Interoperability


API for workflow interoperability providing direct interaction
.. based on WS-Eventing .. asynchronous
.. look to implement in eg Triana, Taverna, Kepler

WS-Eventing set up with four types
.. subscriber, sink service, subscription manager, source service

also use WSPeer & working with NAT issues.

Asif Akram: Dynamic Workflow in GRID Environment


Imperial College

part of ICENI project
GridCC incl QoS, BPEL, ActiveBPEL

introduce QoS language

QoS criteria incl security, performance (from performance criteria)

Used a WS-Addressing (WSA) engine to achieve dynamic redefinition of the BPEL partner link within the BPEL process.
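A sketch of the kind of WS-Addressing EndpointReference a BPEL engine can assign to a partner link at runtime; the candidate service URLs are hypothetical.

```python
# Sketch: building a WS-Addressing EndpointReference at runtime, the kind of
# document a BPEL engine can assign to a partner link to re-point it at a
# service chosen for QoS reasons. The target URLs are hypothetical.
import xml.etree.ElementTree as ET

WSA_NS = "http://www.w3.org/2005/08/addressing"

def make_epr(service_url: str) -> str:
    epr = ET.Element(f"{{{WSA_NS}}}EndpointReference")
    addr = ET.SubElement(epr, f"{{{WSA_NS}}}Address")
    addr.text = service_url
    return ET.tostring(epr, encoding="unicode")

# Pick the endpoint that currently satisfies the QoS criteria (hypothetical URLs).
candidates = {"fast-but-busy": "http://nodeA.example.org/svc",
              "slower-but-idle": "http://nodeB.example.org/svc"}
print(make_epr(candidates["slower-but-idle"]))
```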

BPEL Editor / Monitor

Conclusion .. QoS can be injected into BPEL, which makes dynamic workflow much easier to achieve, and this can be done within the existing standard specification.

Jos Koetsier: A RAPID approach to enabling domain specific applications

Users prefer domain-specific portlets, but there is quite a lot of work in creating them, so ..
OK so the approach is to build a custom portlet generator ..

have written one based on JSDL and a JSDL XML file (GridSAM)

Uses OMII-UK s/w

obtain at http://research.nesc.ac.uk/rapid

Martin Dove: MaterialsGrid: An end-to-end approach for computational projects
3yr 5fte project www.materialsgrid.org

based on CASTEP, to simulate the behaviour of materials and predict their properties.

results are contributed to a database .. which may also hold measured properties.

so database content is computed on demand for groups of users that don't want to know the computational under-the-hood stuff.

workflow using SciTegic Pipeline Pilot instead of BPEL, partly because the BPEL standard wasn't uniformly implemented.

cml.sourceforge.net .. Chemical Markup Language (CML) from cmlcomp.org

cml2sql & www.lexical.org Golem to construct CML

.. jQuery allows a mix of pulldown and autocompletion & constrains input to allowed values ..

AHM08/W9-2: The Global Data Centric View

Jon Blower: A Framework to Enable Harmonisation of Globally-Distributed Environmental Data holdings using Climate Science Modeling Language


How we use the climate science modeling language.

data from many instruments .. need to combine them all to:
.. validate numerical models
.. calibrate instruments
.. data assimilation - a formal method for combining data and model ..
.. making predictions - eg floods, climate, drift at sea, and search and rescue

The need for harmonisation means scientists spend lots of time (up to 80% for some postdocs) dealing with low-level technical issues .. need a common view onto all appropriate datasets

OGC standards aim to describe all geographic data . mandated by INSPIRE .. but fiendishly complex, having evolved from maps

Need to bridge the gap: CSML
both an abstract data model & an XML encoding

provides a new view of existing data, doesn't actually change it.

14 feature types ..
classified by geometry not their content

Harmonising two datasets with CSML: plugs into GeoServer (like GeoSciML)

Second way via Java-CSML
.. aim to reduce the cost of doing analysis
.. high-level analysis/vis routines completely decoupled from the data

Java-CSML design attempts
.. transform the CSML XML schema to Java code using an automated tool
.. leads to very complex code
.. OGC GeoAPI, but incomprehensible & GeoAPI is a moving target

.. based on well-known Java concepts
.. reduce the user's code
.. you can always wrap something
.. wrappers for WFS, NetCDF, OPeNDAP etc to make them all look the same
.. also have plotting routines

Problem is that the more you abstract the more info you lose, so you need some more specific profiles that inherit the parent profile and add the extra knowledge for a specific instance.
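The real implementation is Java-CSML; the Python sketch below just illustrates the wrapper idea of one small common interface with an adapter per backend. The class and method names are invented.

```python
# Sketch of the wrapper idea only (the real work is Java-CSML): expose one
# small interface and adapt each backend (WFS, NetCDF, OPeNDAP, ...) behind
# it. Class and method names are invented.
from abc import ABC, abstractmethod

class GridSeriesFeature(ABC):
    """Common view of a gridded dataset, however it is actually stored."""
    @abstractmethod
    def read_values(self, variable: str) -> list: ...

class NetCDFFeature(GridSeriesFeature):
    def __init__(self, path): self.path = path
    def read_values(self, variable):
        # would delegate to a NetCDF reader; stubbed here
        return [f"{variable} from NetCDF file {self.path}"]

class OpendapFeature(GridSeriesFeature):
    def __init__(self, url): self.url = url
    def read_values(self, variable):
        # would issue an OPeNDAP request; stubbed here
        return [f"{variable} from OPeNDAP endpoint {self.url}"]

def plot(feature: GridSeriesFeature, variable: str):
    """Analysis/vis code sees only the common interface."""
    for v in feature.read_values(variable):
        print("plotting", v)

plot(NetCDFFeature("sst_2008.nc"), "sea_surface_temperature")
plot(OpendapFeature("http://example.org/dods/sst"), "sea_surface_temperature")
```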

Wider lessons ..

.. intolerable data formats not necessarily suitable for storage
.. trade-offs between scope and complexity
.. symbiotic relationship between stds, tools & applications.


Aside: more OPeNDAP services than WCS services for raster data.

Alistair Grant: Bio-Surveillance: Towards a Grid Enabled Health Monitoring System

Problem .. SQL: SELECT blah, count FROM databases WHERE diagnosis = 'X'

where 'databases' is a set of databases with non-standard schemas

OGSA-DAI used to solve this.

RODSA-DAI was one solution ..
Views can be implemented in a database, but Views can also be hosted at an OGSA-DAI service layer

.. this allows security to be implemented remotely from the database, and also allows remote organisations to see a view without requiring hosts to support a particular view or set of views

.. output transformed as required to Google Maps/Earth

.. OGSA-DAI views are slower, but not so much slower as to outweigh the advantages.
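A sketch of the view-based federation idea (not the OGSA-DAI API), using in-memory SQLite databases: each source keeps its own schema, a standardising view exposes common column names, and the federating layer runs one query over every source and merges the counts.

```python
# Sketch of view-based federation (not the OGSA-DAI API): each source keeps
# its own schema, a view exposes standard column names, and the federating
# layer runs one query over every source and merges the counts.
import sqlite3

def make_source(create_sql, insert_sql, rows, view_sql):
    db = sqlite3.connect(":memory:")
    db.execute(create_sql)
    db.executemany(insert_sql, rows)
    db.execute(view_sql)   # standardising view over the local schema
    return db

hospital_a = make_source(
    "CREATE TABLE visits (dx TEXT, n INTEGER)",
    "INSERT INTO visits VALUES (?, ?)",
    [("influenza", 12), ("asthma", 4)],
    "CREATE VIEW std AS SELECT dx AS diagnosis, n AS cases FROM visits")

hospital_b = make_source(
    "CREATE TABLE records (diagnosis_code TEXT, patient_count INTEGER)",
    "INSERT INTO records VALUES (?, ?)",
    [("influenza", 7), ("measles", 2)],
    "CREATE VIEW std AS SELECT diagnosis_code AS diagnosis, patient_count AS cases FROM records")

def federated_count(diagnosis, sources):
    query = "SELECT COALESCE(SUM(cases), 0) FROM std WHERE diagnosis = ?"
    return sum(db.execute(query, (diagnosis,)).fetchone()[0] for db in sources)

print("influenza cases:", federated_count("influenza", [hospital_a, hospital_b]))  # 19
```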
cf www.ogsadai.org.uk
www.phgrid.net
www.omii.ac.uk

Chris Higgins reports that SEE-GEO has implemented an OGSA-DAI wrapper for WFS.

Lourens E Veen: Virtual Lab ECOGrid: Turning Field Observations into Ecological Understanding

ECOGrid
also www.science.uva.nl/ibed-cge

Species Behaviour
Biotic and abiotic data, incl human behavior
Field Data
Statistical analyses

Organisations incl govt, infrastructure & conservation, & private volunteers

Different data models:
the approach incorporates a hierarchy of
.. core data
.. extended attributes
.. set-specific extensions to preserve the original data (see the sketch below)
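A sketch of that layered record idea: a small agreed core, a dictionary of extended attributes, and the untouched original record preserved alongside. The field names are assumptions, not the ECOGrid schema.

```python
# Sketch of the layered record idea: a small agreed core, a dictionary of
# extended attributes, and the original source record kept verbatim.
# Field names are assumptions, not the ECOGrid schema.
from dataclasses import dataclass, field

@dataclass
class Observation:
    species: str                 # core data shared by every contributing set
    latitude: float
    longitude: float
    date: str
    extended: dict = field(default_factory=dict)   # optional extra attributes
    original: dict = field(default_factory=dict)   # set-specific source record, preserved

obs = Observation(
    species="Haematopus ostralegus", latitude=52.95, longitude=4.78, date="1957-06-02",
    extended={"observer": "volunteer", "count": 3},
    original={"srt": "scholekster", "aant.": "3", "dat": "02-06-57"})
print(obs.species, obs.extended["count"], "original keys:", list(obs.original))
```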

info goes back at least to the 50s, but also earlier data if available.

Tamas Kukla, Tamas Kiss, Gabor Terstyanszky: Integrating OGSA-DAI into Computational Grid Workflows

University of Westminster

want to expand workflows in two ways ...

Major problem: data handling in all the common systems is limited - mainly files or very limited database support
eg Triana, Taverna, Kepler, P-GRADE Portal

Workflow level interoperation of grid data resources

OGSA-DAI is sufficiently generic for it to be a good candidate.

Data staging
Static vs semi dynamic vs dynamic

static staging - inputs specified and accessed before, outputs specified and accessed after, but not during
semi-dynamic - inputs and outputs specified before, and accessed during execution
dynamic - all access during the workflow

OGSA-DAI integration: tool, workflow editor vs workflow engine

only integration into the engine provides fully dynamic access

either implemented at the port or within the node - chose within the node, which provides better integration

required functionality .. everything is too complex.

more specific support tool &/or totally generic - chose to support both styles of access.

Chose the P-GRADE Portal workflow engine, based on GridSphere with an extended DAG workflow engine
in P-GRADE, nodes are jobs, ports represent files and links represent file transfers

direct submission is not possible .. need an application repository, so

chose the GEMLCA application repository, which also acts as a job submitter via Globus.

This approach has the advantage that GEMLCA is sufficiently generic that it can be used in a range of other workflow systems.

cf http://ngs-portal.cpc.wmin.ac.uk/

Tuesday, 9 September 2008

AHM08/W5: Frontiers of High Performance and Distributed Computing in Computational Science

Chris Higgins: Spatial Data e-Infrastructure SEE-GEO


What can Grid offer for the scalability of EDINA's services?

Grid was, right from the outset, interested in security and trans-organisational issues, so what does Grid offer that contributes to SDI and its scalability?

Registries for publish, find and bind are fundamental

Demonstrators produced were:

e-Social Science exemplars built:

don't hold the data; instead link & bind to it and use it from source.
OGC services wrapped into OGSA-DAI

focusing on adding security - using SPAM-GP .. Security Portlets simplifying Access to and Managing Grid Portlets

. but not planning to give security control to the portal provider, therefore need finer-grained security

as a result of the project there is an agreement with the OGC (www.opengeospatial.org) in the form of an on-going memorandum of understanding.

Owain Kenway: Distributed Computing using HARC & MPIg

HARC
Highly-Available Resource Co-allocator - HARC proved to be very reliable

MPIg
a Globus-based implementation of MPI that allows topology discovery, so that it knows what protocols are available for communication between any two nodes in a multi-site distributed cluster architecture.

The approach was used for three different applications, two of which benefited very well from distributed sites vs an expanded resource at a single site, and the third also benefited, though not as significantly.

AHM08/W9-1: The Global Datacentric View

Laurent Lerusse - from Grenouille to Polar Bears

Managing metadata and data capture for the Astra-Gemini 0.5 PW laser

Astra-Gemini is part of CLF
STFC - Science & Technology Facilities Council

a Grid-enabled information resource that follows a project from proposal to experiment to analysis, results and publication .. driven by a central metadata store.

CLF data flow - ELk + DAQ + PolarBear(metadata) -> NeXus Writer

PolarBear needs to know the whole laser light path for the experiment and all the detectors that will be generating data.

Learnt:

- defining complex systems is not easy with XML - but it can be done
- scientists are not used to editing raw XML - tools need to be provided!
- recording metadata is time-consuming but pays dividends
- evolution not revolution - continuous beta

Q? why not a semantic language? .. A: unfamiliarity
Q? how do you have surety that the physical world is as described by the metadata, given that all the equipment isn't fully tagged, eg barcodes etc? .. A: that is difficult.
Q? how do you capture experiential knowledge? .. this is what ELk is there for, but it still has to be used, which is optional. A: provision of the capability is essential

AHM08 - Crossing Boundaries - Opening

Peter Coveney - Welcome


Heyday of attendance was 2004 - but then it was compulsory to attend if you had funding.

But this year the largest number of papers was submitted.

Paper flyers from sponsors were kept to a minimum by distributing them all on a 1GB USB flash drive

Gregory Crane et al - "Cyberinfrastructure for Global Cultural Heritage"


et al - 10 co-authors, 6 Organisations, UK, EU, US

Qualitatively new instruments eg treebanks .. database of language / word relationships

"Greatest Classicist of 20th Century" is probably / reputably an Islamic leader of Teheran .. but that hypothesis is untestable in a classical studies sense!

How man scholars could work on the question - what is the influence of Plato and the Classicists on Islamic thought in Teheran? - no tools available today - too much data, too many languages

Text mining came be used within a language .. but v difficult for Plato's quotations present in modern Arabic or Farsi!

ePhilology -- production of objectified knowledge from textual sources - eg a million books, including historic texts in there many historic editions through multiple languages.

eg 25k days in a lifetime, book a day reading = 40 lifetimes, harvard has 10m books = 400 lifetimes to read.
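The arithmetic spelled out:

```python
# The reading arithmetic behind the claim, spelled out.
days_per_lifetime = 25_000               # roughly 68 years
books_per_lifetime = days_per_lifetime   # at one book a day
print(1_000_000 / books_per_lifetime)    # a million-book library: 40 lifetimes
print(10_000_000 / books_per_lifetime)   # Harvard's ~10m books: 400 lifetimes
```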

but what about 10 thin poetry books in 10 languages - just understanding them requires not only the languages, but also the background social history of each of the 10 authors.

Classics Goals 5-10 yrs

Memes .. cultural analogue of gene.

.. million book library of memes .. facts and fantasy and religion and texts and organisations and words and their evolution in meaning over history and place

.. Memographs / Memologies .. but creating these will require processes that are automatable and uncheckable - by a human - eg do we have OCR of Syriac?

.. so technically one could now create a Plato memography across all languages and time .. it would take time and $$s but we believe we have the tools.

.. for the first time we can confront Plato's challenge .. written words are inert, like a statue; it may be lifelike but if you ask it a question it is silent .. for the first time we can start to pose questions of text and have a machine extract answers from the text, the written word.

.. PDF is a true incunabular form .. it is digital but essentially the same as its printed predecessors.

.. what does a post-incunabular digital document look like? .. 'books talking to each other' in a way equivalent to how the authors of a set of books talked and discussed, which led to their writing. ie 4th-generation digital collections know the difference between Washington UK vs Washington US, place vs person, from context, and automatically link to look-up & explain if the user wants it. They include 3D models of inscriptions .. scanned .. OCR .. XML, all engineered together as a unit.

library vs archive

the library concept changes with time: originally written, then printed, now digital actionable objects with open computation fundamental

archive is static

google books is a large archive

open content alliance is a digital library - with a lousy front end, but it is actionable.

minimum features of publication - peer review, sustainable format (eg TEI XML), open licensing (Creative Commons), sustainable storage - persistence.

"Scaife digital library" does the above.

AHM08/BoF: e-Infrastructure: Tool for the elite or tool for everybody

Dr Jean-Claude Bradley:


Open Notebook Science in this case chemistry, suitable for anything where IP issues are Open rather than closed:
http://usefulchem.wikispaces.com/All+Reactions?f=print

Using video and photos published through YouTube & Flickr, & Google Docs for results, & a wiki for notes, & ChemSpider & JoVE for publishing results .. all of which are free and hosted elsewhere, so there is no overhead in hosting or software maintenance etc.

Anticipate that in future (10 yrs?) many of these experiments will be able to be done with far greater replication, so longevity of data availability isn't an issue but immediacy of availability is. In those circumstances this type of distribution is suitable.

Shantenu Jha


http://wiki.esi.ac.uk/Distributed_Programming_Abstractions


Distributed Appl. programming still hard!

May actually get harder in future because of changing infrastructure - XD, PetaCloud, PRACE

No simple mapping from the application class and its staging to the application type - grid-aware vs grid-unaware approaches.

In fact, for dynamic distributed systems such as Kalman-filter solutions, you need to embed the scheduler inside the program.

Break-out discussion follows:


What is e-Infrastructure?

Participants representative of Arts, Medical, Geospatial - researchers, providers, developers

Getting beyond usefulness for early adopters to usefulness for mainstream science is fundamentally about trust ..

Trust that what is learnt will be able to be reused in future as a skill
Trust that a service that is provided will be available in future
Trust that data storage provision will at least match the longevity that research funders require for data maintenance.
The issue that any digital executable object will have dependencies, and the longevity and persistence of those dependencies
Trust in terms of availability of redundant storage sources
Security in terms of knowledge that the service provider is disinterested .. eg not Google.

Evidence of this Trust is driven by perceptions of continuing $$$s

Other questions addressed were:
What do you think e-Infrastructure is and what should it be? For example, is it a tool of use only for tackling the 'grand challenges' in research or could it (& should it) be useful for all kinds of research problem?

Do Researchers need a clearly defined ICT environment and tool suite, or can usage be opportunistic, picking up on functionality that becomes available using light-weight "glue" and pragmatic organisational arrangements? ie Cathedral vs Bazaar

What would be needed to truly embrace the use of e-Infrastructure in your work across the whole research life-cycle?

Saturday, 6 September 2008

Workflows dissected

In New Zealand the concept of web-service or grid Workflow is very new, with a morass of new nomenclature that I have found difficult to grasp all at once. So I have attempted to relate objects, names and concepts in the workflow world to their functional equivalents in traditional programming development and execution environments, which are more widely known. This is not to pretend that a web service and a file, for example, are the same, but instead to recognise that within the two different domains they fulfill functionally equivalent roles. By seeing things in this way, it becomes easier to understand how all the new nomenclature fits together. Of course sometimes the functional fit is very loose and at other times the equivalence is very close. So this is the conclusion that I have come to; if it helps you as well, then that is useful, and if I have missed something fundamental, then I'm happy to be corrected and to adjust the table - so if you are an expert feel free to comment, but bear in mind that this is a table to emphasize functional similarities from the perspective of newbies to the workflow space. Following blogs will hopefully expand on key differences.

OK first attempt at the table - as yet incomplete:
Functional Role | Traditional Environment | Web-service based Workflow - Taverna | Grid based Workflow - Triana | Web-service based Workflow - Sedna
Scripting tools | AML, shell script | SCUFL | ? | Domain PEL & Scientific PEL
Programming Language | C++, Fortran, Java | n/a | ? | BPEL
Integrated Development Environment | MS Visual Studio | Taverna | Triana | Sedna plugin to Eclipse IDE
Callable object | DLL file | Web Service | Java Unit | Web Service
Executable Object | EXE file | Taverna workflow | Triana workflow | BPEL bpr archives
Process launch & control, or enactment | Windows, Linux | Freefluo workflow enactor | GAP | ActiveBPEL engine
File/data objects | File, database | Web service | Grid service protocol GridFTP | Web service

table v0.1, Sep 5th, 2008

Thursday, 4 September 2008

The challenge of grey information in a connected world

The media would love the world to be black and white, but we all know in reality that everything is shades of grey. The same is true for the authority of geospatial data.

Some data-sets are authoritative in the sense that they are the master copy, curated by a reputable organisation with a mandate to maintain a particular geospatial data-set. One might say that anybody using a different instance from the authoritative one had better have a good reason. But what if the organisation only provides Internet 1 style access, so a user has to take a copy (e.g. ftp download) for their own use and then reformat it to suit the needs of their analysis software. The copy they are using is no longer the same as the original. And what if a colleague needs to use the same data set a week, a month or a year later and needs it in the same format - when should they regard the local, most convenient copy as inappropriate for their use? That depends on a whole range of things - not least the effort required to update the local copy, the expected rate of change of the original, and the relevance of the anticipated changes to the analysis. So there may be valid reasons for using grey versions of data-sets with well defined formal authority. What is the citation for this usage? When a paper is published about some results derived from the data-set, do we cite the authoritative source and the date at which the original copy was taken and leave it at that, or do we fully describe the process(es) that were used to reformat the data-sets before we got to them? Do we actually know in a fully reproducible way what those processes were - or do we trust the skills of the person who did it? To cover ourselves do we take a copy of the copy and archive it on some off-line media to ensure that we can return to the analysis - and then would we cite the copy we used or the copy we archived? etc etc. After all, beyond sharing knowledge, the point of formal scientific publication and citation is reproducibility of results. The challenge of grey data.

But the world of science is full of data-sets that are authoritative in the sense that nobody holds a better version, but their authority is informal, known and respected by specialists in the particular field of science, but not maintained with the same formal rigour or necessarily updated to a regular published schedule. This is reality; it isn't a criticism of those involved. In these circumstances, such data-sets may be used only infrequently and the money - it always comes down to money - may not be there for full descriptive documentation. So how do we cite such data-sets? By proxy, through the first or most recent occasion that the data-set was mentioned in published documentation, or as pers. comm. with the name of the owner - and these assume that you are using the original version and not an evolved copy as explored in the previous paragraph.

Despite the shortcomings, the solutions I have described for citation have been deemed just sufficient for traditional published material, but what happens in a digitally connected Internet 2 world? This is the domain of Digital Repositories for Scientific Data and Persistent Identifiers, or in a nutshell, a collaborative space to put and use data and a means to reference or cite data in a repository that won't change over time. These are core subjects for projects such as ANDS (Australian National Data Service).

But we need to go at least one step further, and of course from a NZ perspective we haven't collectively taken the first step yet. Data is useful for its own sake, but its real value in a scientific sense arises when it can be used for further analysis. As mentioned above, information is processed and analysed, so we need a means to reference the processing steps. With traditional published papers, this has been the reason for the method section. But in a digitally connected world, we should be able to go one further. Imagine having a reference, in a paper say or on a webpage - it might look like any other link - that when you click on it allows you to actually execute all or part of the analysis that the original researcher performed. Well people are working on that too - enter the world of Workflows, Files and Packs at myExperiment, recently augmented by WHIP and Bundles, which have emerged from a collaboration with the Triana project team - a real acronym soup of progress! So what does all this mean and how does it relate to grey information?

For a start myExperiment is a repository for a wide range of Files that scientists can upload and share, but it has two key features relevant to this discussion: Workflows and Packs. I'll explain Packs first because they are simpler - a Pack is a persistent description of a set of digital objects, some of which might be stored in myExperiment as Files while others may be external to myExperiment. It is like the shopping list you create before you go shopping rather than the car full of stuff you bring home after the shopping expedition. But the items in the list are fully described, so that anybody can take it on a shopping expedition and come back with the same stuff. So a Pack reference (or URI) in myExperiment has many of the characteristics needed for a citation.
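Sketched below, a Pack as a persistent list of item descriptors, some pointing at Files inside myExperiment and some at external resources; the fields and URI are invented for illustration, not the myExperiment data model.

```python
# Sketch only: a Pack as a persistent list of item descriptors, some pointing
# at Files inside myExperiment and some at external resources. The fields are
# invented for illustration and are not the myExperiment data model.
pack = {
    "uri": "http://www.myexperiment.org/packs/EXAMPLE",   # citable identifier (example)
    "title": "Inputs and workflow for the 2008 analysis",
    "items": [
        {"kind": "workflow", "location": "internal", "ref": "workflows/123"},
        {"kind": "file",     "location": "internal", "ref": "files/456"},
        {"kind": "dataset",  "location": "external",
         "ref": "http://data.example.org/coastline-2008.zip"},
    ],
}

# "Going shopping" with the Pack: resolve every item into a local bundle.
for item in pack["items"]:
    print(f"fetch {item['kind']:8s} from {item['location']:8s} source: {item['ref']}")
```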

A Workflow is the digital equivalent of the method section of a published paper. With one vital difference: if all the data is digital, and the processing steps are available as web-services, then the Workflow can be executed, ie the method can be repeated, by other myExperiment colleagues. Even better, these colleagues can substitute their own data or an alternative to one of the method steps and run the method again - so now myExperiment is a shared digital laboratory. This is where WHIP and Bundles come in. Bundles are the result of going shopping with a Pack that contains a Workflow and all the Files it uses. It is not just the shopping list, but the car full of stuff, and WHIP is a myExperiment add-on that knows how to unpack the shopping basket and make it all work for you with a single mouse-click.

So now we have Packs that can be cited, and when a Pack contains a Workflow and its Files, we have a means for other scientists to repeat or extend the original method. So in a web-connected world we are close to solving the problem of grey data and analytical processing, a problem that is very difficult to solve for ordinary desktop processing.

Where does geospatial fit into this - well, as yet it doesn't - the Workflow tools that are supported or about to be supported (ie Taverna and Triana) by myExperiment, Bundles and WHIP don't yet deal with geospatial processing. That is what we need to do next.

Wednesday, 3 September 2008

What is Geospatial

In its broadest sense Geospatial can be applied to:
  • data or information that describe terrestrial features, eg any data you might find in Google maps, or
  • software that works with geospatial data, eg Google maps, any software found at OSGeo or OGC, or
  • analysis using geospatial software or terrestrial data,
  • standards, and specifications for any of the above.
In my case the data is typically New Zealand, Pacific or Antarctic data describing the terrain, soils, climate, vegetation or other living species. Some of this data is served up through Landcare Research's GISPortal, and we have lots more that we are thinking of providing.

For software I have been using ArcInfo / ArcGIS for over 20 yrs and also MapInfo, Genamap and various other commercial products, and am now making much more use of Open Source software such as PostgreSQL, PostGIS, GeoServer etc

But for me the really interesting stuff is in analysis, doing things collaboratively over the net, geospatial mashups, workflows using web-services, the implications for standards and protocols and changing the geospatial paradigm from one person and their desktop to teams working together using dispersed Internet 2 style geospatial resources.

The SCENZ-Grid project


This post is to provide the briefest intro to the SCENZ-Grid project, which I lead and which will be the context for many future blog posts. Core SCENZ-Grid team members include Niels Hoffmann, Stephen Campbell and Chris McDowall, all from Landcare Research, and Ben Morrison and Paul Grimwood from GNS Science. There is also a growing number of colleagues at other institutes round the world who I will refer to as the blog evolves. More SCENZ-Grid info can be found at:
  • SCENZ-Grid home on the SEEGrid twiki: check it out for some project background and explore some of the Australian SEEGrid & Auscope related work that SCENZ-Grid depends on elsewhere on the twiki - many thanks Robert Woodcock and others in your team,
  • pilot SCENZ-Grid demo hosted on Bestgrid: have a look at our first geospatial web-service workflow and check out the BestGrid community who we are collaborating with - many thanks to Mark Gahegan, Nick Jones and the BestGrid team,
  • SCENZ-Grid on the KAREN wiki: KAREN is the NZ 10GB/s research network and KAREN's operators REANNZ provided the seed funding for SCENZ-Grid, so check the rest of the KAREN wiki for other NZ projects using KAREN.
SCENZ-Grid and its context have been described in a number of presentations recently:
  • at GOVIS: 'Reusing Digital Information - Landcare Research': live video of 50min presentation using Google Earth as presentation tool from May 2nd,
  • at eFramework: 'GIS in NZ' ppt that I presented at the workshop on July 24th,
  • at APAN26: Niels presented in the Natural Resources session on Aug 6th and I presented in the Middleware session on Aug 7th.
SCENZ-Grid's own hardware was delivered in late July and is being assembled while I blog. Of course it arrived just as different key members set off on various conference and annual leave trips, so it is likely to be late October before SCENZ-Grid is operating on its own hardware and in its own web-space. For those interested in specs, the hardware is an SGI cluster comprising one XE250 2U head node and six XE320 1U compute nodes, delivering 104 cores, 0.4TB RAM and 1.6TB of local /tmp. The cluster will be connected to KAREN and have dual fibre access to our SUN StorageTek SAN for persistent storage, comprising approx 20TB in the first instance.

What's this blog?

The dawn of Robert's Geospatial Gibberish. I am a Geospatial Scientist from Landcare Research in New Zealand, my initials are RGG, and my full name can probably be guessed from the first few letters of the words in my blog name. I am prone to leaps of lateral thinking and this blog is intended to be a vehicle for expressing them, allowing me to keep track of my own thoughts and others to react and maybe to extract what might be useful. This is my first foray into blogs and the world of Internet 2. One of my key interests is the intersection of traditional GIS, GRID and Internet 2. I'm also interested in music, singing and recording, so there may be the odd posting about that as well, and I take photographs wherever I go and some of those end up in picasaweb as private or public albums.