Friday, 14 November 2008

India Day 3 – Udaipur to Jodhpur

Sunday, and we spent most of the day travelling from Udaipur to Jodhpur. Went for an early morning walk, round..
We had decided not to book a bus (there is no direct rail link), but instead booked a car and driver, who duly arrived at our hotel at ten o'clock as arranged. It took about twenty minutes to clear the traffic through the outskirts of Udaipur and then out into the country: green fields with corn and sugar-cane. It was hot (up to 30°C), but we decided to travel with the windows down, without a/c, so that we could smell and see everything and take photos as
we went. We went through many villages, dusty, noisy, bike repair and scooter outlets on the roadside, plus cows and dogs wandering across our bows – following the squassums tradition – tally for the day: two dogs, no cows. In the country we passed camel trains, with children perched on top and baby goats squeezed into saddle bags. We also met flocks of sheep and goats being herded along the road, grazing the hillside, led by fierce-looking tall shepherds with bright crimson turbans and stout staves, some particularly long with small scythes on the end. The road passed over a mountain range, through a national park. We stopped for half an hour at a very ornately carved marble Jain temple at Ranakpur, on the edge of the national park, where we each received a bright yellow spot on our forehead. Then we went on to a lunch spot where we were offered a smörgåsbord lunch of Indian vegetable and chicken curries, including an ice-cold bottle of Kingfisher beer – yum. Still in the national park we stopped at a bend where a tree that overshadowed the road was full of black-faced monkeys. We managed to take a photo or two, but had to be quick to close the window as they approached rapidly for food – apparently this is a spot where they have learnt to wait for titbits – but we didn't have any.

Coming down to the plains the countryside became less green and
noticeably more barren – Jodhpur is on the edge of the Thar Desert. As
we joined a main road, it got much busier and our driver demonstrated
the art of weaving between trucks and the many other types, sizes and
speeds of road users. One advantage over NZ is that everyone expects to
encounter you on the wrong side of the road, so there is no hint of
road-rage at having to take evasive action such as slowing right down
and/or moving off the road altogether because there are one or more too
few lanes of tarmac. Light relief as we approach Jodhpur on a toll-road,
with much less traffic and smooth tarmac. It has been a long hot day and
we are glad to arrive at our hotel in the middle of Jodhpur on Old Loco
Road, now a quiet distance from the railway, to pay off our driver and
relax in the evening in a courtyard at the centre of our family-run hotel, built in the 20s, decorated in style, and refurbished with modern plumbing and bathroom facilities and an 8'6" wide double bed with a prominent art deco wooden bed-head – Gay reports that she can't reach me
in the night to prod me out of my mild snoring! The home cooking is
superb, we are going to have to skip lunch and just keep the water going
during the day in order to properly enjoy the evening meal. Shortly
after our arrival we were rung by Baiji – who had known my grandparents
when they were in India at the end of WWII, had gone back to London with
them, and whose family runs the Palace and Mehrangarh Fort that
dominates the hill above the city. At 90, she still shows people round the palace; she arranged tickets for us to visit the fort the following day and suggested we meet at the palace on our second day in Jodhpur.
Our driver’s sister also lives in Jodhpur, and his brother-in-law is a
tuk-tuk driver, so arrangements were made for him to be our driver the
following day.

Friday, 17 October 2008

NEDF Overview

Report to NZ Geospatial Office.

National Elevation Data Framework (NEDF) workshop, 18th March 2008, Shine Dome, Canberra, sponsored by the Australian Academy of Science (AAS) and ANZLIC.


Since this is rather a long report compared to typical blog entries, I've split it into 4 posts to make it more manageable.

Contents

NEDF Part 1: The Australian National Elevation Data Framework
NEDF Part 2: Implications for New Zealand
NEDF Part 3: Strawman NZ Elevation Data Framework
NEDF Part 4: Recommendations for a Plan of Action

Acknowledgements

I would like to thank the NZGO staff for inviting me and the other NZ participants to attend the NEDF workshop, and to thank the other participants for sporadic discussions since then. Where available, I have attached their individual reports as appendices. I have also attached an independent, otherwise unpublished proposal for KiwiDEM from Paul Hughes at DoC. Having said that, any failings, shortcomings or omissions in the report are mine and not theirs. Finally, I acknowledge that the PGSF-funded SpInfo II research project has funded my time to write this report, which is an additional output beyond the original terms of the contract.

NEDF Part 4: Recommendations for a Plan of Action

The following recommendations should be seen as a checklist of actions that collectively will move NZ's elevation infrastructure fully into a digitally wired Web 2.0 world. For this to be achieved, every contributor needs to move their elevation assets and knowledge into a digital, web-enabled form. Standards need to shift from official published documents describing the circumstances for the standard and containing formulae and data references, to authoritative web-services that actively support and embody best practice of the standard in use. While this might traditionally be approached in a grand-design, top-down organised way, it can also be approached as a bottom-up grass-roots movement where each contributor progressively establishes a suite of web-services associated with their own elevation assets and knowledge. Such an approach is anathema to traditionalists who need to organise, but the beauty of Web 2.0 is that, provided each participant approaches the solution to their part of the problem using appropriate standards (eg OGC WFS and WCS standards etc), with an expectation that everything will be in a state of continuous evolution as they incrementally respond to market needs – ie the development principle of continuous just-in-time beta releases rather than occasional massive version changes – then collectively we will converge on a working solution with minimal grand-design overhead effort and significantly reduced risk of failure. Success in a Web 2.0 environment is directly related to shortened time to market. Don't 'talk and plan', just 'do it and do it again'.

There are three types of actions in the following recommendations: those associated with new improved data, those associated with licensing and pricing, and those associated with web-service enabling existing and new digital elevation assets. All are important in the long run, but provision of web-services is actually the easiest to achieve quickly and will drive the imperative for the other two, by generating demand and, equally importantly, making the need more transparent. So wherever there are digital assets and process knowledge that are already in the public domain – eg central govt data and standards – there is the opportunity to make a very significant start. Each agency with elevation assets will know their digital assets better than I do and will be able to take the principles outlined here and below and convert them into an appropriate implementation plan that will almost certainly deviate from the details that I have suggested and outlined below. The most important thing is that each agency takes on board the principles above and considers their assets in the light of Web 2.0 thinking.

Access to Existing Data

With most high resolution data owned by local government, under a range of different licensing arrangements for access to data for other than the original purpose, work is needed to:

1. establish a web-service based on-line catalogue of all elevation data primary sources, their ownership and licensing. This includes LiDAR data and previous data sources such as contours and spot height measurements.

2. negotiate licensing arrangements for access to existing data where possible

3. establish web-services for on-demand data access and delivery (see the sketch after this list)

4. establish protocols for ensuring that future high resolution elevation data is licensed for widest possible access

5. encourage all owners of elevation data assets to participate in making their data available. This includes non-traditional contributors such as Transit NZ and road engineering contractors who have very detailed before and after data associated with highway construction, road realignment etc. or such as architects and construction companies who build buildings whose outside dimensions (footprint and height) are needed to convert Digital Surface Models (DSM) to bare-earth Digital Elevation Models (DEM).
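To make item 3 concrete, the following is a minimal consumer-side sketch of what on-demand access could look like, assuming a hypothetical OGC WCS endpoint and coverage name (neither exists today); only the OWSLib calls themselves are standard. The same pattern applies to vector assets (eg contours and spot heights) via WFS.

```python
# A minimal sketch of on-demand delivery from a WCS; endpoint and coverage are invented.
from owslib.wcs import WebCoverageService

WCS_URL = "https://elevation.example.govt.nz/wcs"   # hypothetical endpoint
COVERAGE = "otago-lidar-dem-1m"                     # hypothetical coverage identifier

wcs = WebCoverageService(WCS_URL, version="1.0.0")
print(list(wcs.contents))                           # what the service offers

# Request a subset for a user-nominated extent in NZTM (EPSG:2193) as GeoTIFF.
response = wcs.getCoverage(
    identifier=COVERAGE,
    bbox=(1300000, 4900000, 1310000, 4910000),      # xmin, ymin, xmax, ymax (metres)
    crs="EPSG:2193",
    format="GeoTIFF",
    resx=1.0, resy=1.0,
)
with open("dem_subset.tif", "wb") as f:
    f.write(response.read())
```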

Reference Frame Solutions

Precise conversion between existing reference frames is limited by the state of our current knowledge of the reference frames, so a programme of work is required to resolve at least the uncertainties in existing knowledge and establish protocols for continuing refinement of our reference frame knowledge.

  • geoid reference: current knowledge of the geoid reference is based on a set of disconnected historic high-precision level surveys that followed the roads of the day, predating, for instance, the Haast Pass road. Two possible solutions present themselves (a toy adjustment illustrating the first is sketched after this list):

    1. extend the high-precision surveys, using modern equipment, to close the loops that are currently open and to link neighbouring surveys. This will allow the existing survey data to be recomputed, reducing the uncertainty in the existing data.

    2. investigate the option of adding a levelling payload to the existing road (& rail) condition surveying equipment. This equipment regularly traverses all major roads, recording road pavement condition as a function of location. If the survey vehicle had level recording gear added to its payload and all data from successive surveys were accumulated, the frequency of the measurements would probably mean that even lesser-precision individual measurements could result in greater overall precision.

    3. establish web-services for on-demand data access and delivery of all the historic and real-time raw data gathered

    4. establish web-processing services to provide on-demand standard reference analysis of this data.
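As a toy illustration of why closing loops matters (solution 1), the sketch below adjusts a tiny invented levelling network by least squares: every loop-closing or cross-linking observation adds a row to the design matrix and so tightens the recomputed heights. All benchmarks and observed height differences are fabricated for illustration only.

```python
# Toy least-squares adjustment of a levelling network (invented data).
import numpy as np

benchmarks = ["BM_A", "BM_B", "BM_C", "BM_D"]        # BM_A is held fixed at 0.000 m
observations = [                                     # (from, to, observed dH in metres)
    ("BM_A", "BM_B", 12.342),
    ("BM_B", "BM_C", -3.118),
    ("BM_C", "BM_D", 7.551),
    ("BM_D", "BM_A", -16.770),                       # loop-closing observation: 5 mm misclosure
    ("BM_B", "BM_D", 4.437),                         # cross-link between neighbouring surveys
]

idx = {name: i for i, name in enumerate(benchmarks)}
A = np.zeros((len(observations), len(benchmarks)))
L = np.zeros(len(observations))
for row, (frm, to, dh) in enumerate(observations):
    A[row, idx[to]] = 1.0
    A[row, idx[frm]] = -1.0
    L[row] = dh

# Fix the datum by dropping BM_A's column (its height is defined as zero).
heights, *_ = np.linalg.lstsq(A[:, 1:], L, rcond=None)
for name, h in zip(benchmarks[1:], heights):
    print(f"{name}: {h:8.4f} m")
```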

  • sea-level reference: the key to precision in sea-level based reference frames is the time-span of the measured baseline, coupled with the quality of the reference to the associated land-based bench-mark(s). A number of the existing sea-level stations are based on relatively short baselines of under a year, whereas two years of intensive measurement is normally considered the minimum to properly model the tidal pattern (a toy calculation after this list illustrates why). Modelling for sea-level change requires continuous, but less frequent, monitoring. The suggested solution is to:

    1. determine the configuration of an optimal network of port and open-coast monitoring stations

    2. establish permanent sea-level monitoring stations with data-loggers

    3. establish web-services for on-demand data access and delivery of all the historic and real-time raw data gathered

    4. establish web-processing services to provide on-demand standard reference analysis of this data.
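The baseline-length point can be illustrated with a toy least-squares tidal fit: two diurnal constituents with nearly identical periods (K1 and P1) only become separable once the record is long enough for their phases to drift apart, so the formal uncertainty of the fitted amplitudes drops sharply between a one-month and a one-to-two-year record. Everything below is synthetic and is no substitute for a proper tidal analysis package.

```python
# Formal amplitude uncertainty of two near-identical tidal constituents (K1, P1)
# fitted by least squares to an hourly record, as a function of record length.
import numpy as np

K1_PERIOD_H = 23.9345      # lunisolar diurnal constituent (hours)
P1_PERIOD_H = 24.0659      # principal solar diurnal constituent (hours)
NOISE_M = 0.05             # assumed standard deviation of an hourly gauge reading (metres)

def amplitude_std_errors(days):
    t = np.arange(days * 24.0)                  # hourly samples
    cols = []
    for period in (K1_PERIOD_H, P1_PERIOD_H):
        w = 2.0 * np.pi / period
        cols += [np.cos(w * t), np.sin(w * t)]
    cols.append(np.ones_like(t))                # mean sea-level term
    A = np.column_stack(cols)
    cov = NOISE_M ** 2 * np.linalg.inv(A.T @ A) # covariance of the fitted coefficients
    return np.sqrt(np.diag(cov))

for days in (30, 180, 365, 730):
    se = amplitude_std_errors(days)
    print(f"{days:4d} days: K1 s.e. ~{se[0] * 1000:5.1f} mm, P1 s.e. ~{se[2] * 1000:5.1f} mm")
```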

  • ellipsoidal reference: New Zealand uses many 'standard ellipsoids', some unique to NZ and others that are also used widely internationally. Unlike the geoid and sea-level references, ellipsoids are generally mathematically defined and not subject to ongoing refinement through measurement. The one exception is the family of ellipsoids based on NZGD2000, which is designed to allow for differential tectonic movement and the resulting distortions to the NZ landmass. NZ has a network of permanent, highest-precision differential GPS stations established to monitor and define these distortions.

    1. establish web-services for on-demand data access and delivery of all the historic and real-time raw data gathered

    2. establish web-processing services to provide on-demand standard reference conversions between the ellipsoids used in NZ (a small conversion sketch follows this list)

    3. establish web-processing services to provide the standard reference reduction of the data from the GPS stations, so that people can obtain the difference between the standard ellipsoid and the distortion of the NZ terrain at any date within the range of the observations.
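As a small indication of what a conversion web-service might wrap, the sketch below uses pyproj/PROJ (an assumption on my part, not a prescribed tool) to move a point from the old NZGD49 datum to NZGD2000 and on to NZTM2000 grid coordinates; the accuracy of the datum step depends on which PROJ transformation grids are installed.

```python
# NZGD49 geographic (EPSG:4272) -> NZGD2000 (EPSG:4167) -> NZTM2000 (EPSG:2193).
from pyproj import Transformer

nzgd49_to_nzgd2000 = Transformer.from_crs("EPSG:4272", "EPSG:4167", always_xy=True)
nzgd2000_to_nztm = Transformer.from_crs("EPSG:4167", "EPSG:2193", always_xy=True)

lon49, lat49 = 174.7633, -36.8485          # a point near Auckland (NZGD49 values assumed)
lon2000, lat2000 = nzgd49_to_nzgd2000.transform(lon49, lat49)
east, north = nzgd2000_to_nztm.transform(lon2000, lat2000)

print(f"NZGD2000 : {lat2000:.6f}, {lon2000:.6f}")
print(f"NZTM2000 : {east:.1f} E, {north:.1f} N")
```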

Elevation Surface Interpolation Solutions

There are many of these, some geared to particular source data types – eg contour to DEM, and stereo image to DSM – others geared to production of elevation models with particular characteristics – eg drainage enforcement, optimising height and/or slope accuracy, or removal of certain subtle artefacts. Ultimately, the wider the selection the better. Some are available as open-source codes, others are licensed – obviously the open-source ones are more amenable to being published as a web-service; the important thing is to get the codes in use.

1. establish web-services using open-source codes for interpolation of raw elevation data into a raster elevation model for a user-nominated extent and resolution (a minimal sketch follows this list).

2. stand up existing 'best of breed' derived elevation datasets as web-services, eg as OGC WCS-compliant services, so users can extract subsets as needed. Initially these datasets will be disconnected from their source data and codes, but in the longer term, as the full processing workflow becomes available, they will be pre-computed elevation datasets constantly updated from all the available web-based primary data sources and software codes.
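A minimal open-source sketch of item 1: gridding scattered elevation points onto a raster for a user-nominated extent and resolution using scipy's griddata. A production service would use more capable interpolators (eg splines with drainage enforcement); the survey points here are synthetic.

```python
# Grid scattered (x, y, z) points onto a raster for a nominated extent and resolution.
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(42)
# Synthetic "raw" survey points: x, y in metres, z a smooth surface plus noise.
xy = rng.uniform(0, 1000, size=(500, 2))
z = 100 + 0.02 * xy[:, 0] + 10 * np.sin(xy[:, 1] / 150) + rng.normal(0, 0.2, 500)

# User-nominated extent and resolution.
xmin, ymin, xmax, ymax, res = 0, 0, 1000, 1000, 10.0
gx, gy = np.meshgrid(np.arange(xmin, xmax, res), np.arange(ymin, ymax, res))

dem = griddata(xy, z, (gx, gy), method="linear")   # NaN outside the convex hull

print(dem.shape, np.nanmin(dem), np.nanmax(dem))
```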

Reduction from surface model to bare-earth model

As has been noted earlier, this is a particular issue with processing LiDAR datasets and can account for up to 30% of the total cost of producing a bare-earth elevation model. It is also often the most contentious part of the data delivery contract, and therefore where most gain can potentially be made and where there is least precedent for how to approach an optimal solution. In other words, this is likely to be the hardest part to achieve.

1. establish web-services for known surface objects. With LiDAR, it is usually thought that surface objects (eg buildings, bridges) can be automatically identified from the raw LiDAR data and then removed. To a certain extent this is true, but if a city council, for instance, already has 3D models of downtown buildings at a dimensional precision that exceeds the precision of the LiDAR, then it makes sense to use that data source. Also, if a city utility already has data about assets in its drainage network – eg pipes and culverts under roads that can't be directly observed in the LiDAR – then that can be very useful input to a drainage enforcement algorithm when attempting to create a surface elevation model for drainage or flood modelling. So data describing all of these known objects should also be available as web-services (a minimal sketch of using known building footprints in the reduction follows).
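The following is a minimal sketch of that idea: cells covered by an authoritative building footprint (assumed to have been rasterised already from, say, a council 3D building dataset) are blanked out of a synthetic DSM and re-interpolated from the surrounding ground cells. Real reductions are far more involved, but the principle of preferring known objects over inference from the point cloud is the same.

```python
# DSM -> DEM reduction using a known building footprint mask (all data synthetic).
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(1)
ny, nx = 200, 200
dsm = 50 + 0.05 * np.arange(nx) + rng.normal(0, 0.1, (ny, nx))   # gently sloping ground

building = np.zeros((ny, nx), dtype=bool)
building[80:120, 60:110] = True            # footprint of one known building
dsm[building] += 12.0                      # roof sits ~12 m above the ground

# Reduce to bare earth: drop building cells, interpolate across the hole.
rows, cols = np.indices(dsm.shape)
ground = ~building
dem = griddata((rows[ground], cols[ground]), dsm[ground], (rows, cols), method="linear")

print("max DSM-DEM difference inside footprint:",
      float((dsm - dem)[building].max()))
```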

Contents

NEDF Part 1: The Australian National Elevation Data Framework
NEDF Part 2: Implications for New Zealand
NEDF Part 3: Strawman NZ Elevation Data Framework
NEDF Part 4: Recommendations for a Plan of Action

NEDF Part 3: Strawman NZ Elevation Data Framework

First we need to acknowledge that elevation data is the first cab off the rank in terms of fundamental priority geospatial data to be addressed within the context of the NZ Geospatial Strategy (Jan 2007), which establishes most, but not all, of the basis for the elevation data framework. Consequently we need only revisit those aspects where the strategy falls short. So, to reiterate, the Strategy states that:

The NZ Geospatial Strategy (Jan 2007)

4.1 Vision

Trusted geospatial information that is available, accessible, able to be shared and used to support the: safety and security of New Zealand; growth of an inclusive, innovative economy; and preservation and enhancement of our society, culture and environment.

To achieve this vision, government needs to lead the development of appropriate ongoing interventions and incentives for consistent creation, exchange and maintenance of geospatial information.

4.2 Purpose

This Strategy provides the principles, goals and governance structure required to achieve the vision. It aims to: define the approach needed to ensure New Zealand’s geospatial information infrastructure meets the ongoing business needs of government; provide the framework for the leadership and direction needed for managing geospatial information; optimise the collective benefit from public investment in geospatial infrastructure; ensure quality fundamental (i.e. priority) geospatial data is available to all.

4.3 Key principles

The key principles that have been identified to guide decision-making for achieving the vision are: Geospatial information is collected once to agreed standards to enable use by many; Discovery and access of geospatial information is easy; Within the appropriate context, geospatial information is easy to understand, integrate, interpret, and use; Geospatial information that government needs is readily available, and its use is not unduly restricted; Geospatial content is appropriately preserved and protected.

5.1 Four Strategic Goals

I Governance – establish the governance structure required to optimize the benefits from government’s geospatial resources.

II Data – ensure the capture, preservation and maintenance of fundamental (priority) geospatial datasets, and set guidelines for nonfundamental geospatial data.

III Access – ensure that government geospatial information and services can be readily discovered, appraised and accessed.

IV Interoperability – ensure that geospatial datasets, services and systems owned by different government agencies can be combined and reused for multiple purposes.

The contexts that are likely to trigger deviations or extensions with respect to this description are likely to be driven by:

  • differences in perspective – to what extent does elevation data fit the model for a priority fundamental dataset;

  • differences in funding models – to what extent is the primary funding coming from central government vs local government vs research vs business;

  • differences in technology – to what extent has the technological context moved on from that envisioned by those who wrote the strategy;

  • societal expectations – to what extent have society's expectations of spatial data moved on from the timeframe of the strategy.

The very fact that I am raising these possible issues so soon after the strategy was written is an indication of the rate of societal and technological change that we need to acknowledge in establishing a framework that is going to withstand future shocks.

Elevation data as a priority fundamental data set:

One of the key issues raised by the workshop was the diversity of expectation and definition that can be used for elevation. There is no doubt that elevation in its broadest sense is a priority fundamental issue for Government; what isn't clear is whether a dataset can, or even should, be identified to match these expectations. Elevation implies a vertical measure of a set of locations above a reference surface. But as the Australian workshop identified, the selection and definition of both the set of locations and the reference surface is open to wide discussion and there is no single correct answer, rather a family of possible sets of locations and of possible reference surfaces. Further, the locations are dynamic – due to building changes, coastal and hill erosion, vegetation and land use changes. The reference surfaces are both dynamic (tectonic movement and sea level change) and poorly defined (sea surface specifications). Finally, what is being defined has no intrinsic infrastructural value, so there is no basis on which to define any particular location or reference surface as of paramount importance to Government as a whole. Even setting aside the choice of elevation specification, the data formats that are usually used for elevation are themselves suspect, being based on a two (strictly 2.5) dimensional view of a three or even four (if time is included) dimensional space. This view is adequate for many purposes and has served the geospatial community well for the last few decades, but with the increasing use of dynamic three dimensional viewers (eg Google Earth) it is worth questioning whether the historic approach remains the best equipped to cope with modern demands (eg full spectrum LiDAR), and therefore whether a NZ EDF should be more forward than backward looking.

Existing investment patterns

Historically, elevation was only available as contours and survey spot heights, all sourced from within Government, supported by registered surveyors in the business community and aerial photography suppliers. Over the last decade, however, many Regional and City Councils have made significant investments in elevation data by purchasing LiDAR from business suppliers; even commodity GPS units can provide elevation as an integral part of location, and boats routinely have depth sounding and logging capability. So the variety of potential sources of elevation data in the broadest sense is greatly increased, and the Government is a relative newcomer in using these newer technologies as a source of primary elevation data. Elevation data is also available from stereo satellite imagery. Consequently it is currently the local government and business sectors that have the greatest equity in elevation data, not central government. The principles of the geospatial strategy should still apply for public access, since the citizens within each region have paid, but the right of central government to bulk access is a matter for debate, and a funding mechanism needs to be established to achieve a partnership between Central Government and Local Government for elevation data investment and ownership.

Technological changes.

When the strategy was written, even widespread access to broadband (<10 Mb/s) could not be taken for granted, let alone the delivery of on-demand processed data as an integral part of that infrastructure. This shifts the goal post from having to think of elevation in terms of a single formally defined / managed dataset describing the elevation surface of one set of locations (eg bare earth) against one reference surface (eg WGS84 ellipsoid), albeit at multiple nested resolutions, to a system that matches an evolving set of raw data sources with their computational codes (processing workflows) to create data in a variety of formats, with a choice of reference surface, and where possible conversion between different location specifications (surface vs bare-earth etc). This is the on-demand managed workflow approach being pioneered by the SCENZ-Grid project, supported by a significant resource of GRID-based computational capability.

Societal expectations.

New Zealanders are quick to pick up new technologies such as Google Maps (2D) and Google Earth (2.5D) with their ability to integrate with on-line photo albums, videos, blogs, geoRSS feeds, uploaded GPS tracks etc. There is a general expectation that 'of course' local and central government agencies and researchers have access to far better data than they see and use at home for free. Do they? Should they? What needs to happen to ensure that we all have access to the best information around and where widespread use generates demand for continuous improvement?

Contents

NEDF Part 1: The Australian National Elevation Data Framework
NEDF Part 2: Implications for New Zealand
NEDF Part 3: Strawman NZ Elevation Data Framework
NEDF Part 4: Recommendations for a Plan of Action

NEDF Part 2: Implications for New Zealand

Australian NEDF implications for New Zealand

Essentially New Zealand and Australia face a similar set of issues. Elevation data was identified in the NZ Geospatial Strategy as one of the fundamental datasets that NZ needs and the Geospatial Office has a work programme initially around a review of LiDAR data in New Zealand. Australia is further ahead in assessment of its needs while New Zealand, primarily because of its size, starts with higher resolution national elevation datasets. However both countries broadly face the same set of issues in moving forward from the present suite of diverse marine and land based datasets and reference frames, a mixed history of digital and analogue source material and the same suite of modern technologies for acquiring new elevation data.

Where the countries differ significantly is in the pattern of national and local government agencies, the respective roles of potential research and industry partners, the interests of non-governmental organisations in the spatial sector and the government funding models that the respective governments are comfortable with. The differences in size, population and economies are also significant because they lift the opportunity for industry players to build a significant sustainable business model around spatial data acquisition, processing, services and added value. Despite the differences there is enormous potential for collaboration building on synergies between the two countries. Many New Zealand and Australian geospatial companies are significant players in both countries, there are strong ties between the research agencies in the two countries and also significant informal and formal dialogue between national and local government agencies.

Towards a NZ Elevation Data Framework

User Need Analysis

The NEDF User Need Analysis (ref 3) provides an excellent indication of the likely range of NZ users' needs and issues. The top five Australian issues reported by participants at the series of workshops were: standards; a one-stop elevation data portal; closing the data gap between land and sea; a common vertical datum for land and sea; and leadership – ie a strategy, not just projects. If used as the starting point of a NZ study, it would short-circuit preparatory work and allow a study team to quickly identify types of users across all sectors and focus directly on differences between NZ and Australia.

Business Case

A number of government ministries have already established elevation data requirements (eg MfE, NZDF, DoC, Transit NZ & MCDEM) that are beyond what is readily available, and MfE and many Regional and City Councils have invested significantly in LiDAR surveys – leading to the recent establishment of gLiDAR (Government LiDAR User Group). Since the LINZ Topo50K dataset became readily available, many industry players have created NZ DEMs of various resolutions (50m-15m) that are available in the market. There is no 'free' national elevation dataset, and few of the commercial DEMs that are available have good documentation from the perspective of being an authoritative source of elevation data.

The NEDF Business Plan (ref 2) focuses on direct cost benefit to all layers of Government of a coordinated approach to elevation data, indirect benefit to the wider community of a free elevation dataset and the economic benefit of a vibrant business community providing location services augmented by elevation information.

That this is likely to be true for NZ as well is probably best illustrated by the dramatic change in use of elevation data that followed the reduction in cost of the LINZ Topo50K elevation data in the mid-90s. Now – over ten years later – it is time for a next-generation elevation data framework to trigger another explosion in use, and the wide availability of products such as Google Earth is whetting the public's appetite for such services.

Science Case

At the time of the workshop, the NEDF Science Case (ref 5) was the weakest part of the NEDF justification. These weaknesses were considered to be readily addressed and this was expected to be done as part of the NEDF strategy review process following the workshop. New Zealand’s science system is relatively compact compared to Australia’s and so building a science case should be relatively easy. NIWA, GNS Science and Landcare Research have been the most significant participants historically but others in the science system have also generated and used elevation data for a wide range of studies such as: river flow, catchment delineation and processes, soil formation and description, automated satellite image interpretation, climate modelling, ecosystem modelling, biosecurity threat modelling, coastal processes, lahar modelling, wetland modelling.

Some current NZ elevation data activities

NZ Geospatial Office – is undertaking a LiDAR study with the expectation of producing an on-line metadata catalogue of NZ LiDAR datasets.

Local Government NZ – has formed gLiDAR a local government LiDAR users group (Ref 10).

Ministry for Environment – is purchasing extensive LiDAR for all of its Kyoto forest plots. This data will be available under a Whole of Government licence. (Ref 9)

Ministry for Civil Defence and Emergency Management – needs high quality coastal elevation (LiDAR) data for tsunami inundation threat study. (Ref 8 )

Dept of Conservation – has proposed the creation of KiwiDEM – a low resolution public, IP free, elevation dataset for use by the environmental sector. (Appendix 2)

Land Information NZ – has undertaken a detailed study to determine the feasibility of creating a unified land and bathymetric DEM with heights referenced to the GRS80 ellipsoid used by NZGD2000. (Ref 7)

Land Information NZ – has developed a draft standard for a New Zealand Vertical Datum 2008, an equipotential surface equivalent to mean sea level, with reference to NZGD2000 / GRS80. (Ref 6)

KiwiImage Consortium – is using the 30m military SRTM DEM to ortho-rectify its QuickBird imagery, but the DEM itself isn’t available for use.

Landcare Research – has a PGSF research contract (SpInfo II) to produce an algorithm for deriving 5m or better DEM surfaces from ALOS PRISM stereo satellite imagery.

Landcare Research – has done a study concluding that the 30m SRTM elevation dataset consistently under-estimates high elevations. (Ref McNiell)

Landcare Research & Regional Councils – are negotiating to commence a study on the application, use and development of a managed on-line LiDAR workflow based on work in the US (GEON LiDAR Workflow) in New Zealand initially using Landcare Research’s new SCENZ-Grid cluster (104 cores, 400GB RAM, 20TB storage) and the 10Gb/s KAREN network.

Google – originally had very low quality elevation data (maybe 250m) for New Zealand in its Google Earth and Terrain shaded Google Maps products, but now has data of order 25m–30m. They don’t publish the specification or origin of their data, but this resolution is comparable to either the 30m SRTM dataset or DEM derived from the LINZ Topo50K data. The elevation data is an integral part of their products but isn’t explicitly available as elevation data for use other than as visualisation in their products.

Contents

NEDF Part 1: The Australian National Elevation Data Framework
NEDF Part 2: Implications for New Zealand
NEDF Part 3: Strawman NZ Elevation Data Framework
NEDF Part 4: Recommendations for a Plan of Action

NEDF Part 1: The Australian National Elevation Data Framework

Background

The Australian NEDF national workshop follows a process of wide consultation, regional needs-assessment workshops and report preparation to support a case for significant investment in an enduring high resolution elevation data framework, encompassing both the marine and land environments of the Australian continent. Very broadly, Australia's current national elevation dataset is the 2nd edition 250m (or 9″) resolution DEM produced by Michael Hutchinson at ANU. As in New Zealand, the national DEM is augmented by many sub-metre resolution LiDAR surveys, typically in built-up/coastal areas, acquired and funded on the basis of local need rather than national strategy and not widely accessible outside the (local government) agency that acquired them.

Throughout this document, the term NEDF will refer to the Australian NEDF, references to a possible New Zealand equivalent will explicitly refer to New Zealand.

The Proposed NEDF

The Australians have consulted widely and produced very creditable draft business plan and user needs analysis documents, and by their own assessment a not-so-creditable draft science case – which are being reviewed by a four-person panel of AAS, ANZLIC, CSIRO and University senior experts who recognise the shortcomings and will recommend approaches to resolve them. The shortcomings in the science case are considered to be superficial and easily addressed rather than fundamental. The existing documents are available for download (refs 1-5); revised documents will be circulated when ready.

The key characteristics of a successful NEDF vision are:

  • formal governance structure,
  • a national nested multi-resolution ‘bare earth’ land and marine elevation dataset,
  • nationally consistent specifications relating user-need to required elevation precision and formalised best practice,
  • processes to ensure that needs are assessed and prioritised and resources and systems are in place to ensure the data is collected to meet needs as they evolve in the long term,
  • robust authoritative metadata providing fitness for purpose,
  • central searchable data catalogue,
  • essentially free availability of elevation data,
  • nationally accessible federated distributed data storage facilities
  • vibrant elevation research and industry communities that
    • contribute to GDP significantly beyond the level of Government investment
    • provide feed-back contributing to advancing both needs and solutions

The Proposed NEDF Dataset Structure

There was a very strong desire that the NEDF should be enduring and forward looking, pushing the existing Australian Spatial Data Infrastructure (SDI) to the next level. However, the solution as discussed is a traditional SDI solution augmented by a national strategy and governance for a suite of nested 'product' datasets that would satisfy the 80/20 needs of users and be made 'freely available' through a web portal. Elevation products would be made available at resolutions of 9″, 3″, 1″, 1/3″, 1/9″ ... a horizontal resolution hierarchy corresponding roughly to 250m, 90m, 30m, 10m, 3m, 1m ...; as a rule of thumb, vertical resolutions are typically 1/3rd of the horizontal resolution. Discussion focused on relationships between user needs and elevation data requirements, the diversity of special uses and how they would be addressed, prospects for technical breakthroughs, the possibility of solving all needs with a single national LiDAR or similar high resolution survey, and the contrast between expectations and what exists now: a national DEM (bare earth) at 250m resolution, augmented by 90m and 30m SRTM DSM (surface) products, with restricted access to the 30m product due to 'counter terrorism' concerns.
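As a quick sanity check on the quoted hierarchy: the ground size of an arc-second cell depends on latitude, so the metre figures are necessarily approximate. The short calculation below (latitude -35° chosen arbitrarily) also applies the 1/3rd vertical rule of thumb.

```python
# Approximate ground size of arc-second cells at a nominal latitude of -35 degrees.
import math

def arcsec_to_metres(arcsec, lat_deg=-35.0):
    metres_per_deg_lat = 111_320.0              # approximate metres per degree of latitude
    ns = arcsec / 3600.0 * metres_per_deg_lat   # north-south cell size
    ew = ns * math.cos(math.radians(lat_deg))   # east-west size shrinks with latitude
    return ns, ew

for arcsec in (9, 3, 1, 1 / 3, 1 / 9):
    ns, ew = arcsec_to_metres(arcsec)
    print(f'{arcsec:6.3f}": ~{ns:6.1f} m N-S x {ew:6.1f} m E-W, '
          f'vertical rule-of-thumb ~{ns / 3:5.1f} m')
```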

The fundamental issue with any proposal based on product datasets, is the effort required to produce a solution (ie dataset) other than one of the core free datasets. Fitness for purpose is never a binary function, it is always a matter of degree, with inherent uncertainty and error. So the 80/20 rule is misleading since it superficially implies that 80% of needs are fully satisfied, whereas it is more likely to mean that hypothetically 40% of needs are fully satisfied, 45% are partially satisfied and 15% completely unsatisfied. Further the costs of exploring even a subtly different solution are very high – because the knowledge, raw data, processing capability and processing capacity aren’t readily available.

The Proposed NEDF Elevation Surface

Participants recognised that while the majority might be happy with one solution, there is significant need for a variety of surfaces – including bare earth (DEM), surface (DSM) and terrain features (DTM) – and a choice of data formats including rectangular prisms, sloped tops, point heights … These differences are fundamental and will persist into the future – there is no one data product to satisfy all needs. Conversions between DEM, DSM and DTM are non-trivial and often require very significant processing and/or additional data. For LiDAR it was reported that DSM to DEM conversion can represent 30% of the costs. Information such as building footprints and elevations, urban trees, open drainage channels etc may be most appropriately sourced from city or council infrastructure datasets and used to inform the DSM to DEM conversion, rather than being inferred from the raw LiDAR-DSM data. So raw elevation data should include height information from many ancillary datasets as well as the raw LiDAR point-cloud heights.

The Proposed NEDF Reference Frame

Much was said about the complications as one goes from 10m vertical accuracy to sub-metre accuracy, especially from a national perspective. Differences in the specification for zero elevation become critical at these resolutions. These include – ellipsoid shape (GPS reference frame), geoid shape (gravitational reference frame), mean sea level (topographic zero contour), mean high water mark (topographic coastline), mean high water springs (cadastral coastline), lowest astronomical tide (bathymetric zero) – and variations between state and national approaches to providing solutions. There is wide variation in the precision to which these reference frames are known, the extent to which they are available in digital form and even the extent to which the differences can be reconciled by applying current technology. Some current best-available data is based on historic, essentially local, arbitrary reference frames that cannot be recovered at precisions that would satisfy modern usage. There was also the recognition that existing technologies are least effective in the coastal/surf zone, which impacts on the ability to reconcile differences between bathymetric and land-based reference systems. Further, climate change will result in a continually changing sea-level model. Collectively these differences will be the subject of significant refinement from both theoretical and observational perspectives over the next few decades, with the consequence that any dataset that is part of a data-product-centric NEDF, generated at a fixed point in time, will be out of date shortly after its publication – resulting in a significant proportion of the user community being forced to use solutions that are outside the NEDF solution.
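The scale of the problem is easiest to see through the familiar relation h = H + N between ellipsoidal height h (what GPS delivers), orthometric height H (height above the geoid, loosely 'above sea level') and the geoid undulation N. The worked numbers below are invented purely to show how quickly datum choices matter at sub-metre accuracy.

```python
# Invented numbers illustrating h = H + N and a tidal-datum offset.
h_gps = 127.842        # ellipsoidal height from GNSS (metres)
N_geoid = 24.310       # geoid undulation at this point, geoid above ellipsoid (metres)
msl_minus_lat = 0.35   # local lowest astronomical tide sits ~0.35 m below MSL (invented)

H_orthometric = h_gps - N_geoid                   # height above the geoid ("above sea level")
height_above_lat = H_orthometric + msl_minus_lat  # same point against a LAT (bathymetric) zero

print(f"Ellipsoidal height     : {h_gps:8.3f} m")
print(f"Orthometric height     : {H_orthometric:8.3f} m")
print(f"Height above LAT datum : {height_above_lat:8.3f} m")
```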

Beyond Data to Automated Workflow

A radical realisation started to emerge at the workshop, that the issues of continual change and refinement and of a diversity of need might be resolved by taking a managed source data & processing workflow approach, with both components of the solution being available for web-portal users to mix and match at their whim to suit their needs and $$ constraints. There wasn’t time at the workshop to thoroughly explore the full implications of such a shift, but 'workflow' issues were discussed by many participants during the afternoon breakout sessions and all three breakout session chairs mentioned workflow as part of their 5min summary reports at the concluding session.

To use the hypothetical example introduced earlier, a managed on-line workflow solution would allow all the partially or completely unsatisfied users to obtain variants of the solution that would more closely satisfy their requirements.

In Australia the computational and storage infrastructure is to a large degree already in place for such a solution – each state has a High Performance Computing facility and NCRIS (the National Collaborative Research Infrastructure Strategy) is designed to coordinate development and delivery of the required software systems. However, representatives from the HPC community weren't present at the workshop. New Zealand probably has appropriate HPC resources, but there is no overarching top-down strategy equivalent to NCRIS. None of New Zealand's Digital Strategy, Digital Content Strategy and Geospatial Strategy is as forward-thinking as NCRIS.

NEDF Funding Model Options

The other major theme to emerge was the impact and desirability of a whole-of-government public/private partnership approach as opposed to a purely government-led solution – and that such a solution could still result in apparently free data use, in that a government-led approach would be funded from tax, whereas a public/private solution might be funded by tax (the govt paying for 'early adopter bulk access') substantially augmented by in-line advertisements (à la Google AdWords). In such a scenario, the cost to the government might be substantially reduced – likely by in excess of 1/10th – though costings had not yet been done by the private sector, since the structure of the partnership would have a very great influence on the revenue flows and therefore investment strategies. It was stressed that the key features of any successful private contribution would be: a predictable market unfettered by government intervention, other than as a 'guaranteed early adopter purchaser', and full industry involvement in the user needs phase so that everyone understood what the deliverable was. With such a proviso, there were considered to be no capacity constraints in the private sector to deliver whatever was required – even radical solutions such as those involving national very high resolution products.

Contents

NEDF Part 1: The Australian National Elevation Data Framework
NEDF Part 2: Implications for New Zealand
NEDF Part 3: Strawman NZ Elevation Data Framework
NEDF Part 4: Recommendations for a Plan of Action

Thursday, 11 September 2008

AHM08/RS5: Regular Session 5

Jeremy Cohen: An e-Science Platform for on-demand management and processing of environmental data

based on MESSAGE – Mobile Environmental Sensing System Across Grid Environments

sensors connected by bluetooth, wifi & 3g

100,000s of sensors imply highly variable rates of data acquisition .. using Amazon elastic cloud commodity computing to cope with the varying load.

on-demand modelling ..

.. requires on-demand computing .. ogsa-dai, ogsa-dqp and ec2

.. on-demand being driven by both back end data availability and front end user request

www.message-project.org

www.imperial.ac.uk/

Nathan Cummngham: Optimising Antarctic research cruises


"Big Data" Nature

"Growing your data" Lynch 2008, Nature 455:7209

planning is influenced by real-time ice information, real-time penguin tracks – showing where the polar front is – real-time chlorophyll imaging etc

delivery is over 128k sat link to research ship.

www.edp.ac.uk/ .. environmental data portal.

Liangxiu Han: FireGrid: an eInfrastructure for next generation Emergency Response Report

http://firegrid.org/

Jim Myers: A Digital Synthesis Framework for Virtual Observatories


Context
.. Ocean Observatories Initiative
.. National Ecological Obs Net
.. WATERS Net

want 3d immersive equivalent to going there but augmented by data richness

even though there is general initial user agreement on the concept, in fact there is a great deal of variation in specifics

as soon as you read about it you should be able to action it via workflow.

Digital Synthesis Framework .. data playground

Semantic Content Management
Streaming Data Management
CyberCollaboration Portal
Cyberintegrator.
Content Repositories
.. all stored in RDF

Restful, using restlets

front end widgets use google toolkit
interwidget interactions

Dynamic Analysis Environment

eg Corpus Christi Bay Dashboard

Community Annotation.

code at:

svn.ncsa.uiuc.edu/svn/cyberintegrator &dse

AHM08: Sharing and Collaboration

Jeremy Cohen: Sharing and Collaboration (in the laboratory) Blogs, Logs and Pods

Laboratory e-Notebook

leverage off things we already do
.. COSHH
.. Process todo vs plan vs record - all integrated diagrammatically in PDA
.. Integration of lab records with building management system
. so that PDAs etc can subscribe to building message broker
. after all the building mgmt system knows such things as lab room temp
.. results can be made available WITH data in databases to provide remote sources for validation
.. record the units
. eg bridge from Germany to Switzerland didn't meet by 10cm, because the elevations were against different sea levels, which differed by 5cm but the sign was got wrong

BioBlog .. http://chemtools.chem.soton.ac.uk/projects/blog/blogs.php/blog_id/15

bioblog templates essential

barcode to URL conversion
or 2d-array barcode plus phone conversion to URL and retrieval

http://simile.mit.edu/welkin .. Welkin is a graph-based RDF visualizer

discovered that the use of the blog actually improves the quality of what is recorded.

.. comment by sketch .. chemists are scribblers

also cf. http://wikispaces.com/, which is wiki based

finally still need to link all the component parts from publication to conversations to lab notebook.

LaBlog/wiki .. myExperiment vs ourExperiment.org

semantic web .. data deluge .. maintaining & communicating context
. major problem of communicating meaning
. eg ppt arrows on a Hebrew system, which is a right-to-left language . which arrow is forward?
ie people and their backgrounds are needed..

maybe call this 'semiotics of semantic web' or the 'semiotic web'?

AHM08/W9-3: The Global Datacentric View

Ian Atkinson: ARCHER Data Services

HERMES - generic datagrid tool
PLONE tools

cf archer.edu.au

hermes ..
http://commonsvfsgrid.sf.net

plone srb & ICAT
http://eresearch.jcu.edu.au/wiki

PJ Kirsch: Developing a common data discovery, browsing and access framework

BAS multi-disciplinary requirements.

used to be scientist (mis)managed

need to have a framework for the data and its documentation

tech drivers - 24/7 link & bandwidth even to remote data sensors
free client tools - eg google maps etc

must have
- efficient discovery
- appropriate visualisation .. what does appropriate mean - user perspective & data dependence
- access to data
- access to ancillary/auxiliary data

- sometimes a reference to an accession number for non-digital holdings

initial response to query is a timeline showing availability and quality indicator and list of associated other docs in a subversion db eg s/w, code, reports etc

as you 'zoom' in on the data the timelines may show additional variants such as region, raw vs processed, instrument variants etc.

provider nominated visualisation, as scrollable zoomable time or space display .. linked to dataset download.

iso 690 - ref for citing data.


Andrew Treloar: ANDS - what are we doing, why are we different, whether we are courageous.

Platforms for Collaboration ..

follow-on from ARROW, DART and ARCHER

blueprint: 'towards the australian data commons'
.. why data - because data deluge need to spend more and more
.. need for standardisation
. s/w & h/w gets cheaper, wetware more expensive
.. role of data federations
. cross disciplinary opportunity opens door to new research
. but it is difficult

cf australian code for the responsible conduct of research
.. institutional and researcher obligations.
.. signed up to by all chancellors etc . so serious
.. funding will become tied to compliance

ANDS programmes
.. developing frameworks
.. providing national utilities
. discovery
. persistent identifier - pilin
. collections registry

discovery - iso2146 - high level architecture for registry
collection, party/people, activity, service
expose this to google web service harvest

.. seeding the commons, ie work with lead exemplars
.. building (human) capabilities
. train the trainers

1st review .. 'strategic roadmap aug 2008.pdf'
cf p21, p22, p23

http://ands.org.au

AHM08: Cloud data mining

Robert Grossman: The Emergence of the Data Centre as a scientific instrument

Diff between google and escience

.. scale to datacentre . google, esci, health
.. scale over datacentre . esci only
.. support large data flows . esci only
.. user and file security . google, health

For Sector

implies transport and routing services needed in addition to google's stack -
. so developed UDT 'UDP based Data Transport'

UDF map reduce applied across this stack

sector/sphere is fast, easy to program, customisable, and 2-3x to 4-6x faster than hadoop

sphere is the compute cloud, sector is the data cloud

sector's security based on SSL and also the audit tracking that is needed.

AHM08: Visualising the Future

Chris Johnson: Scientific Computing & Imaging Institute, Utah .. pronounced 'ski'

Not retrospective visualisation of the results, but integrated visualisation in the problem solving process

GPU - massively parallel architecture
.. scaling many times faster than multi-core cpu
.. now have high precision floating pt gpu from nvidia

using GPUs to process petabytes of neuro slice data.

volume rendering ...
traditional 'maximum intensity projection' (MIP) to 'full volume rendering'

the new approach was too computationally expensive, but with GPU becomes tractable +
multi-dimensional transfer function - mapping derivatives & integrals across multi slices to rgb
... s/w called seg3d .. bioimage .. hardest part was making it useful!

time-dependent visualisation

isosurface extraction
.. marching cubes Lorensen & Cline 1987
.. but now pisa, rtrt, noise, octree – up to 10^4 faster algorithms, but not available .. ie not open source,

pisa .. livnat & tricoche '04 .. if the triangle is too small to see, dont calc it.

ray-tracing
as # objects goes up ray tracing becomes more efficient than raster (traditional gpu) algorithm
DOE asci c-safe .. simulate explosion from first principles & vis it.
.. manta - real time ray tracer
.. how to simulate the right colours of flames correctly, rather than map temp to colour ramp
.. perception of shadows .. currently based on Phong and Gouraud from the 1970s, but today's hardware is faster
.. so if solve maxwell for realism .. need to artificially introduce an appropriate light source into say cat scan .. not always obvious how to do it.

3d vis of error and uncertainty ..
.. working on it . no one way to do it
.. what about mapping rgb to fuzziness or sensitivity or confidence
.. uncertainty animation

in 2003 as much info was generated as was published in all preceding human history, and that has been repeated every year since.

cf www.vistrails.org with taverna & myexperiment .. visualisation of differences due to technique variation.

http://www.sci.utah.edu/vaw2007/ .. book from Visualisation and Analytics Workshop

Wednesday, 10 September 2008

AHM08/RS1: Regular Session

Jeremy Cohen: ICENI II

Coordinate forms:

declarative workflow language
.. describe what not how
.. much easier to logically analyze the flow

use of coordination forms for matching

workflow execution .. bpel, scufl etc

declarative workflow generation tuned to users' normal activities

.. automated workflow generation
.. extract from a user's real-time use of their natural software - matlab etc

workflow execution with performance .. performance repository .. used to drive planning of optimal execution plan

ICENI II plan

Daniel Goodman: Decentralised Middleware and Workflow Enactment for the Martlet Workflow Language


Middleware comprises:
.. Process Coordinator
.. Data Store
.. Data Processor

Essentially introduces an efficient protocol for P2P communication between PCs and DPs such that each node becomes aware of changes in the state and availability of the network as a whole in a decentralised, robust, efficient way.

Ahmed Algaoud: Workflow Interoperability


API for workflow interoperability providing direct interaction
.. based on WS-eventing .. asynchronous
.. look to implement in eg Triana Taverna Kepler

WS-Eventer set up with four types
.. subscriber, sink service, subscribe manager, source service

also use WSPeer & working with NAT issues.

Asif Akram: Dynamic Workflow in GRID Environment


Imperial College

part of ICENI project
GridCC incl QoS, BPEL, ActiveBPEL

introduce QoS language

QoS criteria incl security, performance (from performance criteria)

Used WS Addressing engine (WSA) to achieve dynamic redefinition of the BPEL partner link within the BPEL.

BPEL Editor / Monitor

Conclusion .. QoS can be injected into BPEL which makes dynamic workflow much easier to achieve, and this can be achieved within existing standard specification.

Jos Koetsier: A RAPID approach to enabling domain specific applications

User prefers domain specific portlet, but there is quite a lot of work creating domain specific portlets so ..
OK so approach is to build a custom portlet generator ..

have written one based on jsdl and jsdl xml file (GridSAM)

Uses OMII.uk s/w

obtain at http://research.nesc.ac.uk/rapid

Martin Dove: MaterialsGrid: An end-to-end approach for computational projects
3yr 5fte project www.materialsgrid.org

based on CASTEP to simulate the behaviour of materials to predict the properties of materials.

results are contributed to a database .. which may also hold measured properties.

so database content is computed on demand for groups of users that don't want to know the computational, under-the-hood stuff.

workflow using scitegic pipeline pilot instead of bpel, partly because the bpel std wasn't uniformly implemented.

cml.sourceforge.net .. chemical ml from cmlcomp.org

cml2sql & www.lexical.org golem to construct cml

.. jquery allows mix of pulldown and autocompletion & constrains to allowed values ..

AHM08/W9-2 : The Global Data Centric View

Jon Blower: A Framework to Enable Harmonisation of Globally-Distributed Environmental Data holdings using Climate Science Modeling Language


How we use the climate science modeling language.

data from many instruments .. need to combine them all to:
.. validate numerical models
.. calibrate instruments
.. data assimilation - formal method for combing data and model ..
.. making predictions - eg floods, climate, drift at sea and search and rescue

The need for harmonisation means that scientists spend lots of time (up to 80% for some post docs) dealing with low-level technical issues .. need a common view onto all appropriate datasets

OGC aims to describe all geographic data . mandated by INSPIRE .. but fiendishly complex, having evolved from maps

Need to bridge the gap: CSML
both abstract data model & xml encoding

provides a new view of existing data, doesn't actually change it.

14 feature types ..
classified by geometry not their content

Harmonise two datasets with CSML plugs into GeoServer (like GeoSciML)

Second way via Java-CSML
.. aim to reduce the cost of doing analysis
.. high-level analysis/vis routines completely decoupled from the data

Java-CSML Design attempts
.. transform CSML xml schema to java code using automated tool
.. leads to v complex code
.. OGC geoapi but incomprehensible & geoapi is a moving target

.. based on well-known java concepts
.. reduce the users code
.. you can always wrap something
.. wrappers for wfs, netcdf, opendap etc to make them all look the same
.. also have plotting routines

Problem is that the more you abstract the more info you lose, so need some more specific profiles that inherit the parent profile and add the extra knowledge for a specific instance.

Wider lessons ..

.. intolerable data formats not necessarily suitable for storage
.. trade-offs between scope and complexity
.. symbiotic relationship between stds, tools & applications.


Aside more opendap services than wcs services for raster data.

Alistair Grant: Bio-Surveillance: Towards a Grid Enabled Health Monitoring System

Problem .. SQL SELECT blah, count from databases where diagnosis = 'X'

databases is a set of databases with non-std schemas

OGSA-DAI used to solve this.

RODSA-DAI was one solution ..
Views can be implemented in a database, but Views can also be hosted at an ogsa-dai service layer

.. this allows both security to be implemented remote from the database, also allows remote organisations to see a view without requiring hosts to support a particular view or set of views

.. output transformed as required to google maps/earth

.. ogsa-dai views are slower, but not so much slower as to outweigh the advantages.
cf www.ogsadai.org.uk
www.phgrid.net
www.omii.ac.uk

Chris Higgins reports that SEE-GEO has implemented an OGSA-DAI wrapper for WFS.

Lourens E Veen: Virtual Lab ECOGrid: Turning Field Observations into Ecological Understanding

ECOGrid
also www.science.uva.nl/ibed-cge

Species Behaviour
Biotic and abiotic data, incl human behavior
Field Data
Statistical analyses

Organisations incl govt, infrastructure & conservation, & private volunteers

Different datamodels:
Approach incorporated a hierarchical approach of
.. Core data
.. Extended attribute
.. Set Specific extensions to preserve original data

info goes back at least to the 50s, but also earlier data if available.

Tamas Kukla, Tamas Kiss, Gabor Terstyanszky: Integrating OGSA-DAI into Computational Grid Workflows

University of Westminster

want to expand workflows in two ways ...

Major problem: data access in all the common systems is limited - mainly files or a very limited database capability
eg Triana, Taverna, Kepler, P-GRADE Portal

Workflow level interoperation of grid data resources

OGSA-DAI is sufficiently generic for it to be a good candidate.

Data staging
Static vs semi dynamic vs dynamic

static staging - inputs specified and accessed before the run, outputs specified and accessed after, but no access during it
semi-dynamic - inputs and outputs specified before the run, but the reads/writes executed during it
dynamic - all access decided and executed during the workflow (see the sketch below)
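As I understand them, the three staging styles boil down to when database access is decided and executed relative to the workflow run. A tiny illustrative Java sketch of my own (not anything from the talk's implementation):

// Illustrative sketch only: my reading of the three data-staging styles,
// in terms of when database access happens relative to workflow execution.
enum DataStaging {
    /** Inputs fetched before the workflow runs, outputs written after;
        no database access while it runs. */
    STATIC,
    /** Inputs and outputs are specified before the run, but the actual
        reads/writes are executed during it. */
    SEMI_DYNAMIC,
    /** Queries are both decided and executed while the workflow runs,
        e.g. a node asking for data based on an earlier node's result. */
    DYNAMIC
}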

OGSA-DAI integration options: into a separate tool, the workflow editor, or the workflow engine

only integration into the engine provides fully dynamic access

either implemented at the port or within the node - chose within the node, which provides better integration

required functionality .. supporting everything is too complex.

more specific support tool &/or totally generic - chose to support both styles of access.

Chose the P-GRADE Portal workflow engine, based on GridSphere with an extended DAG workflow engine;
in P-GRADE, nodes are jobs, ports represent files and links represent file transfers

direct submission not possible .. need an application repository, so:

Chose the GEMLCA application repository, which also acts as a job submitter built on Globus.

The advantage of this approach is that GEMLCA is sufficiently generic that it can be used with a range of other workflow systems.

cf http://ngs-portal.cpc.wmin.ac.uk/

Tuesday, 9 September 2008

AHM08/W5: Frontiers of High Performance and Distributed Computing in Computational Science

Chris Higgins: Spatial Data e-Infrastructure SEE-GEO


What can Grid offer for the scalability of EDINA's services?

Grid was, right from the outset, interested in security and trans-organisational issues. So what does Grid offer that contributes to SDI and its scalability?

Registries for publish, find and bind are fundamental

Demonstrators produced were:

e-Social Science exemplars built:

don't hold the data, instead link & bind to it and use it from source.
OGC services wrapped into OGSA-DAI

focusing on adding security - using SPAM-GP .. Security Portlets simplifying Access to and Managing Grid Portlets

.. but not planning to give security control to the portal provider, therefore need finer-grained security

as a result of the project, an agreement with the OGC (www.opengeospatial.org/) in the form of an on-going memorandum of understanding.

Owain Kenway: Distributed Computing using HARC & MPIg

HARC
Highly Available Resource Co-allocator - HARC proved to be very reliable

MPIg
a Globus-enabled implementation of MPI that allows topology discovery, so that it knows what protocols are available for communication between any two nodes in a multi-site distributed cluster architecture.

The approach was used for three different applications, two of which scaled very well across distributed sites compared with expanded resources at a single site; the third also benefited, though not as significantly.

AHM08/W9-1: The Global Data Centric View

Laurent Lerusse - from Grenouille to Polar Bears

Managing metadata and data capture for the Astra-Gemini 0.5 PW laser

Astra-Gemini is part of CLF
STFC - Science & Technology Facilities Council

Grid-enabling information resource that follows a project from proposal to experiment to analysis, results and publication .. driven by central metadata store.

CLF data flow - ELk + DAQ + PolarBear(metadata) -> NeXus Writer

PolarBear needs to know the whole laser light path for the experiment and all the detectors that will be generating data.

Learnt:

- defining complex systems is not easy with XML - but it can be done
- scientists are not used to editing raw XML - tools need to be provided! (see the sketch below)
- recording metadata is time consuming but pays dividends
- evolution not revolution - continuous beta
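On the second lesson, the point is to put a tool or form between the scientist and the raw XML. Here is a generic Java DOM sketch of my own - not the CLF/PolarBear tooling, and the element names and values are invented - of generating metadata programmatically rather than hand-editing it:

// Generic sketch (not the CLF/PolarBear tooling): build metadata XML from
// code or a form, rather than asking scientists to hand-edit it.
// Element names and values here are made up for illustration.

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class MetadataWriter {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element experiment = doc.createElement("experiment");
        doc.appendChild(experiment);

        Element detector = doc.createElement("detector");
        detector.setAttribute("id", "cam-01");                // illustrative values
        detector.setAttribute("position", "after-compressor");
        experiment.appendChild(detector);

        // Serialise with indentation so the result is still human-readable.
        Transformer tf = TransformerFactory.newInstance().newTransformer();
        tf.setOutputProperty(OutputKeys.INDENT, "yes");
        tf.transform(new DOMSource(doc), new StreamResult(System.out));
    }
}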

Q: why not a semantic language? A: unfamiliarity
Q: how do you have surety that the physical world is as described by the metadata, given that not all equipment is fully tagged, eg barcodes etc? A: that is difficult.
Q: how do you capture experiential knowledge? .. this is what ELk is there for, but it still has to be used, which is optional. A: provision of the capability is essential

AHM08 - Crossing Boundaries - Opening

Peter Coveney - Welcome


Heyday of attendance was 2004 - but then it was compulsory to attend if you had funding.

But this year the maximum number of papers were submitted.

Paper flyers from sponsors kept to a minimum by distributing them all on a 1GB USB flash drive

Gregory Crane et al - "Cyberinfrastructure for Global Cultural Heritage"


et al - 10 co-authors, 6 Organisations, UK, EU, US

Qualitatively new instruments eg treebanks .. database of language / word relationships

"Greatest Classicist of 20th Century" is probably / reputably an Islamic leader of Teheran .. but that hypothesis is untestable in a classical studies sense!

How many scholars could work on the question - what is the influence of Plato and the Classicists on Islamic thought in Teheran? - no tools available today - too much data, too many languages

Text mining can be used within a language .. but it is very difficult for Plato's quotations present in modern Arabic or Farsi!

ePhilology -- production of objectified knowledge from textual sources - eg a million books, including historic texts in their many historic editions across multiple languages.

eg ~25k days in a lifetime, so reading a book a day = 25k books per lifetime; a million books = 40 lifetimes to read; Harvard has 10m books = 400 lifetimes to read.

but what about 10 thin poetry books in 10 languages - just understanding them requires not only the languages, but also the background social history of each of the 10 authors.

Classics Goals 5-10 yrs

Memes .. cultural analogue of gene.

.. million book library of memes .. facts and fantasy and religion and texts and organisations and words and their evolution in meaning over history and place

.. Memographs / Memologies .. but creating these will require processing that is automatable and uncheckable - by humans - eg do we even have OCR for Syriac?

.. so technically one could now create a Plato memography across all languages and time .. it would take time and $$s but we believe we have the tools.

.. for the first time we can confront Plato's challenge .. written words are inert, like a statue: it may be lifelike, but if you ask it a question it is silent .. for the first time we can start to pose questions of a text and have a machine extract answers from it, the written word.

.. PDF is a true incunabular form .. it is digital but essentially the same as its printed predecessors.

.. what does a post-incunabular digital document look like? .. 'books talking to each other', in an equivalent way to how the authors of a set of books talked and discussed, which led to their writing. ie a 4th generation digital collection knows the difference between Washington UK vs Washington US, place and person, from context, and automatically links to look-up & explain if the user wants it. They include 3D models of inscriptions .. scanned .. OCR .. XML, all engineered together as a unit.

library vs archive

the library concept changes with time: originally written works, then printed, now digital actionable objects with open computation fundamental

archive is static

google books is a large archive

the Open Content Alliance is a digital library - with a lousy front end, but it is actionable.

minimum features of a publication -- peer review, sustainable format (eg TEI XML), open licensing (Creative Commons), sustainable storage - persistence.

"Scaife digital library" does the above.

AHM08/BoF: e-Infrastructure: Tool for the elite or tool for everybody

Dr Jean-Claude Bradley:


Open Notebook Science, in this case chemistry, suitable for anything where the IP stance is open rather than closed:
http://usefulchem.wikispaces.com/All+Reactions?f=print

Using video and photos published through YouTube & Flickr & Google Docs for results & a wiki for notes & ChemSpider & JoVE for publishing results .. all of which are free and hosted elsewhere, so no overhead in hosting or software maintenance etc.

Anticipate that in future (10 yrs?) many of these experiments will be able to be done with far greater replication, so longevity of data availability isn't an issue but immediacy of availability is. In those circumstances this type of distribution is suitable.

Shantenu Jha


http://wiki.esi.ac.uk/Distributed_Programming_Abstractions


Distributed Appl. programming still hard!

May actually get harder in future because of changing infrastructure - XD, PetaCloud, PRACE

No simple mapping from the application class and its staging to the application type - grid aware vs grid unaware approaches.

In fact, for dynamic distributed systems such as Kalman-filter solutions, you need to embed the scheduler inside the program.

Break-out discussion follows:


What is e-Infrastructure?

Participants representative of Arts, Medical, Geospatial - researchers, providers, developers

Getting beyond usefulness for early adopters to usefulness for mainstream science is fundamentally about trust ..

Trust that what is learnt will be reusable in future as a skill
Trust that a service that is provided will be available in future
Trust that data storage provision will at least match the longevity required by research funders for data maintenance
The issue that any digital executable object will have dependencies, and the longevity and persistence of those dependencies
Trust in terms of availability of redundant storage sources
Security in terms of knowing that the service provider is disinterested .. eg not Google.

Evidence of this Trust is driven by perceptions of continuing $$$s

Other questions addressed were:
What do you think e-Infrastructure is and what should it be? For example, is it a tool of use only for tackling the 'grand challenges' in research or could it (& should it) be useful for all kinds of research problem?

Do Researchers need a clearly defined ICT environment and tool suite, or can usage be opportunistic, picking up on functionality that becomes available using light-weight "glue" and pragmatic organisational arrangements? ie Cathedral vs Bazaar

What would be needed to truly embrace the use of e-Infrastructure in your work across the whole research life-cycle?

Saturday, 6 September 2008

Workflows dissected

In New Zealand the concept of web-service or grid Workflow is very new, with a morass of new nomenclature that I have found difficult to grasp all at once. So I have attempted to relate objects, names and concepts in the workflow world to their functional equivalents in traditional programming development and execution environments, which are more widely known. This is not to pretend that a web service and a file, for example, are the same, but instead to recognise that within the two different domains they fulfil functionally equivalent roles. By seeing things in this way, it becomes easier to understand how all the new nomenclature fits together. Of course sometimes the functional fit is very loose and at other times the equivalence is very close. So this is the conclusion that I have come to; if it helps you as well, then that is useful, and if I have missed something fundamental, then I'm happy to be corrected and to adjust the table - so if you are an expert feel free to comment, but bear in mind that this is a table to emphasise functional similarities from the perspective of newbies to the workflow space. Following blogs will hopefully expand on key differences.

OK first attempt at the table - as yet incomplete:
Functional Role | Traditional Environment | Web-service based Workflow - Taverna | Grid based Workflow - Triana | Web-service based Workflow - Sedna
Scripting tools | AML, shell script | SCUFL | ? | Domain PEL & Scientific PEL
Programming Language | C++, Fortran, Java | n/a | ? | BPEL
Integrated Development Environment | MS Visual Studio | Taverna | Triana | Sedna plugin to Eclipse IDE
Callable object | DLL file | Web Service | Java Unit | Web Service
Executable Object | EXE file | Taverna workflow | Triana workflow | BPEL bpr archives
Process launch & control, or enactment | Windows, Linux | Freefluo workflow enactor | GAP | ActiveBPEL engine
File/data objects | File, database | Web service | Grid service protocol GridFTP | Web service

table v0.1, Sep 5th, 2008

Thursday, 4 September 2008

The challenge of grey information in a connected world

The media would love the world to be black and white, but we all know in reality that everything is shades of grey. The same is true for the authority of geospatial data.

Some data-sets are authoritative in the sense that they are the master copy, curated by a reputable organisation with a mandate to maintain a particular geospatial data-set. One might say that anybody using a different instance from the authoritative one had better have a good reason. But what if the organisation only provides Internet 1 style access, so a user has to take a copy (e.g. ftp download) for their own use and then reformat it to suit the needs of their analysis software? The copy they are using is no longer the same as the original. And what if a colleague needs to use the same data-set a week, a month or a year later and needs it in the same format - when should they regard the local, most convenient copy as inappropriate for their use? That depends on a whole range of things - not least the effort required to update the local copy, the expected rate of change of the original, and the relevance of the anticipated changes to the analysis. So there may be valid reasons for using grey versions of data-sets with well defined formal authority. What is the citation for this usage? When a paper is published about some results derived from the data-set, do we cite the authoritative source and the date at which the original copy was taken and leave it at that, or do we fully describe the process(es) that were used to reformat the data-sets before we got to them? Do we actually know in a fully reproducible way what those processes were - or do we trust the skills of the person who did it? To cover ourselves do we take a copy of the copy and archive it on some off-line media to ensure that we can return to the analysis - and then would we cite the copy we used or the copy we archived? etc etc. After all, beyond sharing knowledge, the point of formal scientific publication and citation is reproducibility of results. The challenge of grey data.

But the world of science is full of data-sets that are authoritative in the sense that nobody holds a better version, but whose authority is informal, known and respected by specialists in the particular field of science, yet not maintained with the same formal rigour or necessarily updated to a regular published schedule. This is reality, it isn't a criticism of those involved. In these circumstances such data-sets may be used only infrequently and the money - it always comes down to money - may not be there for full descriptive documentation. So how do we cite such data-sets? By proxy, through the first or most recent occasion that the data-set was mentioned in published documentation, or as pers. comm. with the name of the owner - and these assume that you are using the original version and not an evolved copy as explored in the previous paragraph.

Despite the shortcomings, the solutions I have described for citation have been deemed just sufficient for traditional published material, but what happens in a digitally connected Internet 2 world? This is the domain of Digital Repositories for Scientific Data and Persistent Identifiers, or in a nutshell, a collaborative space to put and use data and a means to reference or cite data in a repository that won't change over time. These are core subjects for projects such as ANDS (Australian National Data Service).

But we need to go at least one step further, and of course from a NZ perspective we haven't collectively taken the first step yet. Data is useful for its own sake, but its real value in a scientific sense arises when it can be used for further analysis. As mentioned above, information is processed and analysed, so we need a means to reference the processing steps. With traditional published papers, this has been the reason for the method section. But in a digitally connected world, we should be able to go one better. Imagine having a reference, in a paper say or on a webpage - it might look like any other link - that when you click on it allows you to actually execute all or part of the analysis that the original researcher performed. Well, people are working on that too - enter the world of Workflows, Files and Packs at myExperiment, recently augmented by WHIP and Bundles, which have emerged from a collaboration with the Triana project team - a real acronym soup of progress! So what does all this mean and how does it relate to grey information?

For a start, myExperiment is a repository for a wide range of Files that scientists can upload and share, but it has two key features relevant to this discussion: Workflows and Packs. I'll explain Packs first because they are simpler - a Pack is a persistent description of a set of digital objects, some of which might be stored in myExperiment as Files, while others may be external to myExperiment. It is like the shopping list you create before you go shopping, rather than the car full of stuff you bring home after the shopping expedition. But the items in the list are fully described, so that anybody can take it on a shopping expedition and come back with the same stuff. So a Pack reference (or URI) in myExperiment has many of the characteristics needed for a citation.
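To put the shopping-list analogy in concrete terms, here is a toy Java sketch of my own of what a Pack boils down to: a persistent, citable list of references, some to Files held in the repository, some to external objects. This is not the real myExperiment data model; the names and URIs are illustrative only.

// Toy sketch of the shopping-list idea (not the real myExperiment model):
// a Pack is a persistent, citable list of references to digital objects,
// some held inside the repository as Files, some pointing elsewhere.

import java.net.URI;
import java.util.List;

record PackItem(URI location, boolean heldInRepository, String description) {}

record Pack(URI packUri, String title, List<PackItem> items) {}

class PackDemo {
    public static void main(String[] args) {
        Pack pack = new Pack(
            URI.create("http://www.myexperiment.org/packs/123"),   // illustrative URI only
            "Example analysis pack",
            List.of(
                new PackItem(URI.create("http://www.myexperiment.org/files/456"),
                             true,  "input data file held in myExperiment"),
                new PackItem(URI.create("http://example.org/external/dataset.nc"),
                             false, "external dataset referenced, not copied")));
        System.out.println(pack.title() + " has " + pack.items().size() + " items");
    }
}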

A Workflow is the digital equivalent of the method section of a published paper, with one vital difference: if all the data is digital, and the processing steps are available as web-services, then the Workflow can be executed, ie the method can be repeated, by other myExperiment colleagues. Even better, these colleagues can substitute their own data, or an alternative to one of the method steps, and run the method again - so now myExperiment is a shared digital laboratory. This is where WHIP and Bundles come in. Bundles are the result of going shopping with a Pack that contains a Workflow and all the Files it uses. It is not just the shopping list, but the car full of stuff, and WHIP is a myExperiment add-on that knows how to unpack the shopping basket and make it all work for you with a single mouse-click.

So now we have Packs that can be cited, and when a Pack contains a Workflow and its Files, we have a means for other scientists to repeat or extend the original method. So in a web connected world we are close to solving the problem of grey data and analytical processing, a problem that is very difficult to solve for ordinary desktop processing.

Where does geospatial fit into this? Well, as yet it doesn't - the Workflow tools that are supported or about to be supported by myExperiment, Bundles and WHIP (ie Taverna and Triana) don't yet deal with geospatial processing. That is what we need to do next.