NSF Proposal, Grant Award DBI-03454000
Developing a New Infrastructure for Dynamic Access
to Multi-institutional Biodiversity Data
This document is also available as a PDF.
Project Summary
This project
will improve online access to the biodiversity data resources held by
NatureServe and its network of natural heritage data centers, which
collectively are the nation’s leading source for detailed data on rare and
endangered species and threatened ecosystems. Electronically confederating
this distributed network of state biodiversity databases will promote open
access to these data sets for the research and education communities and
help inform conservation and environmental management. The project builds on
work already carried out in developing and implementing a robust data model,
creating technological tools for data management and data publishing, and
addressing key institutional and intellectual property rights issues. The
technology framework uses XML Web Services as a programmatic interface among
four distinct technology layers: a gateway site, an enterprise geodatabase,
an authentication/access control subsystem, and the distributed local
databases. Project implementation is phased to allow initial access to
network-wide geospatial data directly from the enterprise geodatabase, with
a subsequent version allowing distributed queries of web-enabled local
databases.
Intellectual Merit
This project
will create enabling technology to promote open access of data from the
NatureServe network’s 75 participating institutions. By making large amounts
of previously inaccessible data on the status and distribution of species
populations and ecological community stands available to the research
community, the project will create new opportunities for data exploration,
analysis, and synthesis. New applications of these data will help advance
scientific understanding of fundamental patterns and processes of the
nation’s biodiversity. This new infrastructure is also designed to promote
interoperability between NatureServe’s distributed database network and
other major informatics initiatives, such as the Global Biodiversity
Information Facility (GBIF). At present, the spatial representation of
biodiversity in most distributed database initiatives is fairly basic. The
rich geospatial attributes inherent in the NatureServe network’s data
provide an opportunity for this project to help advance general efforts to
integrate standards and best practices from the geographic information
systems (GIS) community with those of the taxonomy/collections community.
Broader Impacts
Improving
online access to these unique biological data resources will have
significant broader impacts by: 1) promoting conservation and sustainable
environmental management; 2) advancing national goals for electronic
government (E-Gov); and 3) enriching educational opportunities at K-12 and
college levels. Natural heritage data are widely used for conservation
planning and for environmental regulation and management. The improved
access to these data provided by this project will substantially enhance
their application in meeting societal goals for biodiversity protection and
sustainable development. Most natural heritage programs are operated by
state government agencies, and enabling online access to their data will
improve government service to businesses, citizens, and other agencies,
representing a major contribution to the nation’s E-Gov infrastructure. The
educational community at both K-12 and collegiate levels is already a major
user of NatureServe’s web resources, and the enhanced data and functionality
available through this project will open up new opportunities for
biodiversity information to serve as the basis for innovative lesson plans
and geographically oriented student research. The project also seeks to
build capacity in underrepresented communities by providing training
opportunities at institutions with significant minority enrollment.
I. Introduction
The
deteriorating condition of the nation’s biological diversity is widely
recognized as a major societal issue (Raven and Williams 2000). Stabilizing
and reversing these declines while contributing to the sustainable use of
biological resources will necessarily involve many sectors of society. The
scientific community has a central role to play in documenting and
understanding the species and ecosystems that represent the basic units of
diversity, and in curating and disseminating the resulting knowledge to
achieve broad societal benefit.
NatureServe
is a non-profit research organization dedicated to providing scientific and
technological support for conservation and environmental management. At
NatureServe’s core is a distributed database network consisting of natural
heritage programs and conservation data centers that operate in all 50 U.S.
states, the Navajo Nation, Puerto Rico, U.S. Virgin Islands, all Canadian
provinces, and eight Latin American countries. Most U.S. participants in
this public-private partnership are operated by state government agencies
with natural resource responsibilities. NatureServe and its member programs
focus on assessing the conservation status of species and ecosystems,
documenting their localities through targeted field inventories, and
managing spatially explicit databases that emphasize those species and
ecosystems at greatest risk (Stein and Davis 2000). These data are widely
used for conservation, land management, and regulatory decisions (Groves et
al. 1995).
The creation
of networks for integrating and disseminating biological data is a major
scientific priority (PCAST 1998, Edwards et al. 2000), and a number of
initiatives are underway in both the United States and globally that are
designed to establish interoperable networks of biodiversity information
(e.g., GBIF, NABIN, NBII). Similarly, using the Internet to provide improved
government service—known as electronic or digital government, or E-Gov—is a
priority at federal, state, and local levels (NRC 2002).
In operation
for nearly 30 years, the NatureServe network is widely regarded as one of
the most well-established and fully operational examples of a
multi-institutional, geographically distributed biodiversity information
network. NatureServe has made virtually all its centrally developed and
managed data publicly available online (www.natureserve.org/explorer).
The most spatially explicit data—precise localities for rare and endangered
species and unusual ecological communities—are managed primarily at the
state-agency level, and with a few exceptions these data are not available
online. This lack of online accessibility greatly limits the broader
potential of this resource for meeting the needs of the research and
education communities, and for informing conservation and environmental
management.
II. Project Goal
This
proposal seeks to improve access to the proven biodiversity data resources
held by NatureServe’s network in order to increase their utility to
researchers, educators, and environmental managers. We propose to accomplish
this by developing and implementing a new infrastructure for data access
that will electronically confederate this existing community of biodiversity
databases. Electronically linking these largely state-agency databases
through a Web Services architecture will provide for enhanced online data
exploration, extraction, and analysis, and constitute an important
contribution to the nation’s digital government infrastructure. This project
will build on work already carried out by NatureServe and its partners on
developing and implementing a robust data model, creating technological
tools for data management and data publishing, and addressing key
institutional and intellectual property rights issues.
This proposal is a re-submittal of NSF proposal #0237697, submitted in July
2002, and addresses the issues raised by reviewers of that proposal
(see Section IX for summary of comments).
III. Background
The creation and maintenance of biological data networks involves an interplay
among data, technology, and institutional issues. In 2000 NatureServe and
Colorado State University held an NSF-funded planning workshop to explore the
priority issues and constraints involved in establishing a distributed data
access system for the network of natural heritage programs, and to design a
conceptual framework for such a system. The workshop was attended by 79
technologists, scientists, and administrators from 45 private organizations,
government agencies, and universities. Participants were enthusiastic about
the prospect of creating a system for improving online access to natural
heritage data, and agreed that such a system would greatly benefit many users.
The workshop participants highlighted the importance of creating
technological solutions to promote participation through incorporating
appropriate data access and security measures, and the importance of
developing institutional agreements designed to establish trust and ensure
mutually beneficial relationships.
Following up
on the results of this workshop, NatureServe and its partners have invested
considerable effort in clearly defining roles and responsibilities of
network participants to ensure effective collaboration and data sharing.
These roles are embodied in a new Data Sharing Agreement (DSA) negotiated
between NatureServe and its member programs that builds on lessons learned
over the past five years under an earlier agreement. Consistent with norms
established by the Global Biodiversity Information Facility (GBIF), this new
agreement encourages open access to the program’s spatially explicit data,
but affirms the rights of member programs as data custodians. The technology
infrastructure proposed here is a critical enabler for meeting the goals of
this new data sharing and access agreement, and for promoting open access to
these data sets for researchers, educational users, and environmental
managers.
Data, Data Models, and Metadata
NatureServe’s
primary goal is to improve the availability and use of biological and
ecological information for informing conservation and land use decisions.
Fundamentally, information is required that answers three questions: what
exists, how is it faring, and where is it found. The information required to
address these questions is encapsulated in a data model that has been the
focus of refinement and evolution over more than 25 years, and is supported by
a set of inventory and data management standards and protocols adhered to by
network participants.
Element-referenced objects incorporated in the data model include
information that relates to a species’ (or community’s) identity (including
name and classification), status, general distribution, and life history
characteristics. Spatial entities in the data model include the location and
bounds of a species population or community stand, sites of ecological,
scientific, or conservation interest, and areas under protective
management.
Of particular significance is the concept of the element occurrence (EO),
the spatial representation of a species or ecological community at a
specific location, and the primary unit of record in NatureServe’s data
model. An EO generally delineates a species population or ecological
community stand, and represents the geospatial feature of biological
interest. EOs are not synonymous with collection records. Rather, EOs
are spatial units documented in both space and time by voucher
specimens and other forms of ecological observations. For example, a single
plant population EO may be documented by multiple voucher specimens, each
taken from different parts of the population, or from the same place over
multiple years. While mapped objects can be polygons, linear features, or
points, all EOs are represented as polygons to incorporate estimates of
locational uncertainty. More than 500,000 spatially explicit EO records are
managed across the NatureServe network representing several million
observations or specimens.
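The idea of representing a point observation as a polygon that carries its locational uncertainty can be illustrated with a minimal sketch. It assumes planar coordinates in meters and a simple circular buffer; the function name `buffer_point` and its parameters are hypothetical and are not taken from NatureServe's mapping methodology.

```python
import math


def buffer_point(x, y, uncertainty_m, vertices=16):
    """Approximate a circle of radius uncertainty_m around (x, y).

    Returns a closed ring of (x, y) tuples: the first vertex is repeated
    at the end, as is conventional for polygon rings.
    Assumption: x, y, and uncertainty_m share a planar unit (meters).
    """
    if uncertainty_m <= 0:
        raise ValueError("uncertainty_m must be positive")
    if vertices < 3:
        raise ValueError("a polygon needs at least 3 vertices")
    ring = [
        (
            x + uncertainty_m * math.cos(2 * math.pi * i / vertices),
            y + uncertainty_m * math.sin(2 * math.pi * i / vertices),
        )
        for i in range(vertices)
    ]
    ring.append(ring[0])  # close the ring
    return ring
```

A larger uncertainty distance yields a larger polygon, so the mapped feature itself conveys how precisely the location is known.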
Because of
the widespread use of EO data in land and natural resource management and
regulatory decisions, inaccuracies can be costly. NatureServe’s data
development and management approach therefore is designed to minimize Type 1
errors. A detailed and rigorous set of scientific methods has been developed
for documenting and mapping element occurrences, which incorporates
estimates of uncertainty and accuracy and includes three distinct quality
assurance steps. Metadata for each record allows users to select records
that meet specific requirements for precision, currency, or type of
documentation.
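Selecting records by metadata can be sketched as a simple filter. The record fields below (`precision_m`, `last_observed`, `documentation`) are hypothetical stand-ins for the precision, currency, and documentation metadata described above; they are not the actual Biotics field names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EORecord:
    """Illustrative element occurrence record (field names are assumptions)."""
    eo_id: str
    element_name: str
    precision_m: float    # estimated locational uncertainty, in meters
    last_observed: date   # currency of the record
    documentation: str    # e.g. "voucher specimen" or "field observation"


def select_records(records, max_precision_m=None, observed_since=None,
                   documentation=None):
    """Return the records that satisfy every criterion that was supplied."""
    def keep(record):
        if max_precision_m is not None and record.precision_m > max_precision_m:
            return False
        if observed_since is not None and record.last_observed < observed_since:
            return False
        if documentation is not None and record.documentation != documentation:
            return False
        return True

    return [record for record in records if keep(record)]
```

A user needing only recent, precisely mapped occurrences might call `select_records(records, max_precision_m=100, observed_since=date(1990, 1, 1))`.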
Use of
common inventory, data management, and taxonomic standards across
NatureServe’s network ensures that data gathered and managed by one network
node is thematically and semantically comparable and electronically
compatible with data gathered elsewhere in the network. This internal
consistency enables the exchange and aggregation of data from multiple
institutions across political boundaries, allowing regional and
national-scale analyses and applications (e.g., Stein et al. 2000).
Technology and Software Tools
NatureServe’s
technology strategy focuses on developing tools that enable network
participants to capture, manage, and disseminate biodiversity data, and to
transform these data into conservation-relevant information. Encapsulating
network-wide standards for inventory, mapping, and data management within
these software products has been key to maintaining a coherent
multi-institutional data set.
Biotics 4 represents
the eighth generation of data management software developed for use by the
NatureServe network, and was released in November 2002. Implemented in an
Oracle database, the system integrates applications for spatial data
management, tabular data management, data import/export and reconciliation,
and reporting. The spatial component of the system is a custom
geographic information system (GIS) application built on ESRI software, and
supports digital mapping, spatial analysis, and data visualization. Biotics 4
has been designed to provide a strong foundation for the Web Services
architecture proposed here. Its Oracle platform supports Web service
standards, and the data management interface, import/export, and
reconciliation tools are all built on an XML-based data format. Installation
of Biotics 4 and conversion of legacy data to its new geospatial standards is
an important step in web-enabling local databases. As of July 2003, 18
programs had installed Biotics 4 and
converted their legacy data, and most U.S. programs are scheduled for
conversion by the end of 2004.
NatureServe Explorer (www.natureserve.org/explorer)
represents our second-generation application for Web-based data publishing
and exploration. The tool enables users to query approximately 50,000
species (including infraspecific taxa) and ecological communities by any
combination of scientific or vernacular name, taxonomic group, conservation
or legal status, and geography. NatureServe’s public web offerings receive
more than 60,000 visits (and 3 million hits) per month. Drawing from an
Oracle back-end database, the NatureServe Explorer interface application is
based on open source technologies, including Java servlets, Apache web
server, and WebMacro servlet framework. Planned enhancements include the
incorporation of an Internet mapping functionality to better serve the
spatial data already offered on the site, and to provide a delivery vehicle
for the network-wide EO data that is the focus of the current proposal.
Institutional Arrangements for Data Access
Data access
policies vary across the network, with some programs making available
complete GIS coverages of their most precise data and others bound by legal
constraints on the provision of precise data. Many network programs consider
at least some precise locational data to be sensitive, releasing them only
on a need-to-know basis. The primary data sensitivity concern involves the
risk that publishing precise locations of some rare species will expose them
to poaching or deliberate destruction. A black market exists for many rare
wild species, and nests of certain raptors, or populations of desirable
orchids, cacti, or reptiles can be threatened by collectors. In other
instances, private landowners may have motives to engage in legal or illegal
activities designed to destroy endangered species populations or alter
habitat. Additional concerns, especially in states with large rural areas,
involve the rights of private property owners to privacy regarding the
endangered resources on their land. State governments take the obligation to
respect these rights very seriously.
Because NatureServe is a membership organization, its members have a
governance stake in
the organization, and committee and work group structures are in place to
collectively address issues ranging from institutional priorities, to
development of data standards and technology, to data access. Over the past
several years NatureServe and its member programs have made considerable
progress in creating the institutional framework for aggregating EO data
into a national-level database and for making these data more broadly
accessible. A new Data Sharing Agreement (DSA) was negotiated over the past
year between NatureServe and each of the network participants, which has the
goal of encouraging open access to fine-scale spatial data. Whereas the
first generation of data sharing agreements negotiated five years ago strove
to create a nationally consistent “least common denominator” approach to
data sharing, the current agreement seeks to maximize the level of access
provided by each program. For each data custodian the DSA specifies the
terms under which access to such precise data may be provided. Important
issues already addressed in the data sharing agreements relate to commercial
versus non-commercial uses of data, appropriate acknowledgements and
attributions, re-dissemination to third parties, and the appropriate uses of
sensitive information (NRC 2000).
Letters of
participation from more than 35 natural heritage data centers are attached,
demonstrating the widespread enthusiasm among network participants for
developing online data services designed to provide more open access to
their data, and to meeting the needs of the broader research
community.
Current Data Flow
Figure 1. Current Data Flow Framework
Figure 1 diagrams the network’s current data flow, where range-wide element
data are
made available for public dissemination via NatureServe Explorer, while more
precisely geo-referenced multi-state EO data are available only through
custom data processing and delivery. The network realizes economies of scale
by developing range-wide data centrally and distributing them to local nodes
through regular data exchanges. While NatureServe collaborates with ITIS and
others in maintaining standard taxonomies for use across the network, state
programs are free to adopt alternative taxonomies that meet local
preferences or requirements (e.g., taxonomic concepts embodied in state
regulations) (Morse 1993). The coupling of needed taxonomic reconciliation
with routine data exchanges, however, limits the frequency of such exchanges
to once a year per node.
Local programs
are the primary data custodians for EO data and serve as the principal
distribution agents for that data within their states. Although individual
centers largely meet the demand for data within their own
jurisdictions—answering more than 75,000 information requests annually—until
recently there has been no mechanism for providing access to EO data across
multiple jurisdictions deriving from multiple institutions. The realization
of formal Data Sharing Agreements with all network participants has enabled
the creation of a national EO data set, managed at NatureServe’s central
office and available for meeting the needs of multi-state users. Access to
this important resource, however, is currently available only through
custom processing and delivery (Figure 1), a situation this proposal is
designed to address.
Interoperability with Other Biodiversity Networks
NatureServe
collaborates with a number of other international initiatives focused on
improving dynamic access to biodiversity data. We have joined the Global
Biodiversity Information Facility (GBIF) as an associate participant—the
category available for non-governmental members—and have been designated a
thematic node of the Convention on Biological Diversity’s Clearinghouse
Mechanism (CHM). We also participate in, and serve on the steering boards
for, both the North American Biodiversity Information Network (NABIN)
and the Inter-American Biodiversity Information Network (IABIN), and are an
institutional member of the Taxonomic Database Working Group (TDWG). At the
U.S. level, NatureServe is a partner of the National Biological Information
Infrastructure (NBII), and the only non-governmental member of the
Integrated Taxonomic Information System (ITIS) partnership.
A major goal
of the current proposal is to improve the ability to make the NatureServe
network’s data resources interoperable with these initiatives. Several
standards and tools are emerging from the various initiatives that are
relevant to the work proposed here. The ecological informatics community is
involved in developing a domain-specific metadata schema, known as EML
(ecological metadata language), which has relevance to types of information
managed by NatureServe. The Access to Biological Collections Data (ABCD)
task group is in the process of developing a collections-oriented database
schema as well as a protocol (DiGIR) for retrieving structured data from
multiple heterogeneous databases. Of particular significance is the recently
created TDWG Spatial Data Standards subgroup, which is focusing on the
problem of integrating existing standards and best practices developed by
the geographic information systems (GIS) community with the practices and
needs of the taxonomy/collections community, and evaluating uses of XML for
access to distributed spatial databases. The intersection between geospatial
and taxonomic/collections standards and practices lies at the heart of
NatureServe’s efforts to provide access to geospatial biodiversity data, and
we will be actively engaged in this subgroup to help advance that effort,
and to accomplish our project objectives in a way that maximizes
interoperability with other initiatives.
Training & Technical Support
A serious
commitment to training is central to maintaining the network’s
effectiveness. The core element of this training program has been a formal
weeklong course focusing on inventory, mapping, and data management
standards and protocols. Offered on a regular basis, the most recent course
represented the 106th such training session. Formal training opportunities
are also offered at
regular regional conferences for network participants and at NatureServe’s
annual meeting. Technical support for data management applications occurs at
many levels. Assistance in deploying new software (e.g., Biotics 4) includes
data conversion support, application installation and configuration, and
on-site training. The software itself includes substantial on-line help,
while NatureServe support services include a Web-based knowledge base,
telephone support, and email discussion lists. Application training is also
conducted via the Internet utilizing technology that allows remote users to
share a common desktop.
IV. Prior NSF Support
Award: NSF ITR/IM(BIO) #0113058
Amount: $497,0370.
Period: 9/01/01-08/31/04.
Title: Biodiversity Data Discovery and Integration.
PI: Robert Morris; Co-PI: Robert Stevenson.
This project builds a component architecture by which
integration of biodiversity web information can take place seamlessly. It
introduces building blocks to locate the origin and authority of a species
name, to extract maps from servers that can compute species distribution,
and provides “nuts and bolts” by which unrelated databases and web sites can
be made to cooperate. This is accomplished through use of XML, and Web
Services, especially framework tools such as Apache Axis, Apache Web
Services Invocation Framework, and Java Databinding tools such as Exolab’s
Castor framework. We have built Web Services layers around ITIS, around our
own electronic field guides, and are producing a wrapper around Robert
Colwell’s BIOTA software that will permit BIOTA users to participate in GBIF
and other architectures using the TDWG DiGIR protocols. We have also
produced an invasive species ontology in DAML+OIL (the DARPA Agent Markup
Language plus the Ontology Inference Layer).
Morris, R. A. and R. D. Stevenson. In prep. Integrating heterogeneous
biodiversity applications. Proceedings of the Workshop on Ecoinformatics,
Bangalore, India, June 2003.
Morris, R. A., R. D. Stevenson, and W. Haber. Submitted. Architecture of
Electronic Field Guides.
Stevenson, R. D., W. A. Haber, and R. A. Morris. 2003. Electronic field guides
and user communities in the eco-informatics revolution. Conservation Ecology
7(1): 3. [online] URL: http://www.consecol.org/vol7/iss1/art3.
Morris, R. A., M. Passell, and R. D. Stevenson. 2001. A Software Engineering
Perspective on Developing Electronic Field Guides: Lessons Learned For
Bioinformatics. European Environmental Agency Technical Report Series.
V. Project Approach
Building on
progress already made in establishing a robust data model and addressing key
data sharing issues (e.g., negotiation of a new generation of DSAs), our
project approach focuses on creating the enabling technology that will
promote open access of data from network participants. Our proposed
technology framework uses XML Web Services as a programmatic interface among
four layers: a gateway site, an enterprise geodatabase, an
authentication/access control subsystem, and the distributed local
databases.
- Gateway Site. The gateway site will provide the primary user interface,
  allowing data exploration, visualization (mapping), and extraction. This
  gateway will build on the existing NatureServe Explorer application and
  serve as a thematic index site, allowing advanced queries based on taxonomy,
  geography, and conservation/legal status, as well as other attributes
  important to users (e.g., habitat preferences).
- Enterprise Geodatabase. An enterprise geodatabase will provide the
  framework for managing the large volumes of network-wide geospatial EO data,
  and will support queries against the gateway. This database will enhance
  multi-jurisdictional data presentation and analysis and provide crosswalks
  between local and network-wide taxonomies.
- Authentication/Access Control Subsystem. This subsystem will enable us to
  apply rights management transformations to data served in XML by the
  enterprise geodatabase and distributed local databases. Separating the
  rights enforcement engine from the data will provide for greater scalability
  and separation of data retrieval and rights enforcement.
- Local Databases. More than 75 local databases support local-level data
  input, quality control, management, and dissemination, and are the primary
  repository for EO geospatial data. Web-enabling these nodes will ultimately
  allow for direct responses to queries received from the gateway site, as
  well as allow interoperability with other sources through support of third
  party applications.
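The rights management transformation described for the access control subsystem can be sketched as a filter applied to XML after retrieval, independent of the database that produced it. This is only an illustration under stated assumptions: the `min_role` attribute, the three role names, and their ordering are hypothetical, and the role is passed as a plain argument rather than read from a signed certificate.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical roles, ordered from least to most privileged.
ROLE_RANK = {"public": 0, "researcher": 1, "custodian": 2}


def enforce_rights(doc, role):
    """Return a copy of doc without elements the given role may not see.

    Assumption: an element carries an optional min_role attribute naming
    the least privileged role allowed to view it (default: "public").
    The input document is left unmodified.
    """
    rank = ROLE_RANK[role]
    result = copy.deepcopy(doc)
    # Snapshot the traversal so removals do not disturb iteration.
    for parent in list(result.iter()):
        for child in list(parent):
            required = child.get("min_role", "public")
            if ROLE_RANK[required] > rank:
                parent.remove(child)
    return result


SAMPLE = """
<eo_records>
  <eo id="EO-1">
    <element>Example species</element>
    <county>Example County</county>
    <polygon min_role="researcher">0,0 1,0 1,1 0,0</polygon>
  </eo>
</eo_records>
"""

if __name__ == "__main__":
    public_view = enforce_rights(ET.fromstring(SAMPLE), "public")
    print(ET.tostring(public_view, encoding="unicode"))
```

Because the filter operates only on the XML it receives, the same function could be applied to output from the enterprise geodatabase or from a local database.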
Variation in
technological and financial capacities among network programs requires a
solution that is flexible at the local level, and does not immediately
require web-enabled local databases. Our project approach (Figure 2) is
phased in such a way as to first support queries through an enterprise
geodatabase (Version 1), and later implement a fully distributed
architecture for querying local databases (Version 2).
The current project will focus on delivering Version 1, since available
project funding does not allow for implementation of Version 2.
Version 1
will establish a gateway site expanding on the existing NatureServe Explorer
application, create an enterprise geodatabase containing a replicated set of
all network EO records, and develop data access/authentication components
based on codification of the DSA. During the development of Version 1 we
will increase the installation rate of Biotics 4 software in network data
centers, since moving onto this Oracle-based system will greatly facilitate
making their spatial data more efficiently available over the Web.
Implementation of Version 2 would deploy XML Web Services technology to
enable direct access to those local databases
that are Web-enabled. Data from network participants not yet Web-enabled
will continue to be accessible through the enterprise geodatabase. This
hybrid solution is designed to ensure that where possible users access
up-to-date information directly from the data custodian, yet are still able
to access data from across the entire network. The enterprise geodatabase
therefore plays an important role in providing short-term access to
network-wide data and longer-term redundancy to ensure robust operation and
adequate system performance. Over time we foresee transforming the
enterprise geodatabase from a replicate EO database containing network-wide
spatial data to a collective mirror site that receives refreshed local data
on a continuous basis, and which provides a back-up when one or more
distributed sites are unavailable.
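The hybrid routing described for Version 2 can be sketched as follows: a query goes to the data custodian when its database is Web-enabled and reachable, and otherwise to the replicated copy in the enterprise geodatabase. The function name, the `web_enabled` flag, the program codes, and the two callables are hypothetical; real Web Service calls would replace the callables.

```python
def query_network(programs, query, query_local, query_geodatabase):
    """Answer a query for every program, preferring the local custodian.

    programs: list of dicts with "code" and "web_enabled" keys (assumed).
    query_local / query_geodatabase: callables taking (program_code, query)
    and returning a list of rows; query_local may raise ConnectionError.
    Returns {program_code: (source, rows)}.
    """
    results = {}
    for program in programs:
        code = program["code"]
        source, rows = "geodatabase", None
        if program["web_enabled"]:
            try:
                rows = query_local(code, query)
                source = "local"
            except ConnectionError:
                rows = None  # local node unavailable; use the replica
        if rows is None:
            rows = query_geodatabase(code, query)
        results[code] = (source, rows)
    return results
```

Recording the source alongside each result lets the gateway tell users whether data came directly from the custodian or from the replicated set, which may be less current.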
We will
leverage technology in several areas: 1) distributed database query and
access using a loosely coupled, n-tier architecture; 2) Internet mapping
services and advances in data streaming that allow the dynamic transfer and
assembly of very large geospatial data files; and 3) established XML Web
Services standards that provide protocols for description (WSDL, Weerawarana
et al. 2002), discovery (UDDI, Bellwood et al. 2002), and communication
(SOAP, Box et al. 2000). Wherever possible we will integrate open source or
commercial off-the-shelf (COTS) software to minimize development costs and
maximize availability of training and support across our institutionally
heterogeneous and geographically dispersed network.
Our proposed
technology framework (Figure 2) will create layers responsible for distinct
tasks as data moves through the system to final delivery. Under this project
(Version 1) we will focus on creating the enterprise geodatabase and the
data access/authentication components. These components will include
authentication, data abstraction, rights enforcement, and data translation.
Data will move between each component layer in XML documents. In addition,
we will create tools for automating the data exchange process between local
databases and the enterprise geodatabase by separating it from the taxonomic
reconciliation process, thus allowing for more frequent exchanges. A future
enhancement (Version 2) will involve delivering XML Web Services to enable a
network node to provide direct access to its data from the Gateway site or
from other compatible third party applications. This phasing will allow us
to create the critical pieces of data aggregation, translation and access
control as part of this project funding, even as we seek other funding to
subsequently deploy these via Web Services.
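One way to picture separating data exchange from taxonomic reconciliation is a loader that accepts every record immediately and merely queues unmatched local names for later review. This is a sketch under assumptions, not the project's design: the record keys, the dictionary-based crosswalk, and the function name `exchange_batch` are all hypothetical.

```python
def exchange_batch(local_records, crosswalk):
    """Load every record now; defer unmatched names to reconciliation.

    local_records: list of dicts with at least a "local_name" key (assumed).
    crosswalk: dict mapping local taxon names to network-wide names.
    Returns (loaded, pending): all records annotated with "network_name"
    (None when no crosswalk entry exists), and the sorted unique local
    names still awaiting taxonomic reconciliation.
    """
    loaded = []
    pending = set()
    for record in local_records:
        network_name = crosswalk.get(record["local_name"])
        loaded.append({**record, "network_name": network_name})
        if network_name is None:
            pending.add(record["local_name"])
    return loaded, sorted(pending)
```

Under this arrangement an unreconciled name delays only its own crosswalk entry, not the whole exchange, which is what would allow exchanges to run more often than once a year.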
The project
also requires seamless access to the geodatabases utilizing Web Services and
access control technologies. This is a period of rapid change in the
geospatial technology community, and we are closely following efforts to
utilize open standards for Internet-based geoprocessing, and the technology
developments from both the Open GIS Consortium and ESRI. We will be looking
particularly closely at ArcGIS 9.0, which is expected to provide lossless
XML serialization of spatial data, replication, and a standard XML schema for
GIS data (GML).
Web Services for Providing Interoperability
XML-based
Web Services will be used to make local data accessible to the broader
community through the gateway site, using emerging standards to maximize
interoperability with other applications. The services will be a part of the
Access/Authorization Components. These components include a data abstraction
layer that will support spatial and tabular queries. Data abstraction will
encapsulate the enterprise geodatabase data model and make it available in a
standard XML schema through an object model. The abstraction layer will use
Web Services deployed in network locations to retrieve data from multiple
sources. Full implementation of this distributed database approach will
require that each of the local databases be Web-enabled and capable of
responding to gateway queries. The Web Services will be based on the Apache
open source tools
(http://ws.apache.org/),
principally the Axis Web Services framework. The services will deliver both XML,
valid against an XML Schema designed jointly by the University of
Massachusetts-Boston team (UMASS) and NatureServe, and HTML consistent with
current interfaces but generated from that XML using XSLT (Extensible
Stylesheet Language Transformations). HTML is therefore always produced from
XML that has already passed through the rights enforcement layer.
Client side support will be based on the Apache Web Services Invocation
Framework.
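The role of the data abstraction layer can be illustrated with a minimal Python sketch. It is not the proposed Axis-based implementation: the in-memory sources stand in for web-enabled local databases, and the element names, field names, node names, and placeholder records are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

# Hypothetical record shape: each source returns a list of flat dicts.
Record = Dict[str, str]
Source = Callable[[str], List[Record]]


def query_sources(sources: Dict[str, Source], species: str) -> ET.Element:
    """Ask every registered source for a species and merge the answers into one XML tree."""
    root = ET.Element("occurrences", {"species": species})
    for node_name in sorted(sources):
        for record in sources[node_name](species):
            occurrence = ET.SubElement(root, "occurrence", {"source": node_name})
            for field in sorted(record):
                ET.SubElement(occurrence, field).text = record[field]
    return root


def make_source(records: List[Record]) -> Source:
    """Build an in-memory stand-in for a web-enabled local database."""
    def fetch(species: str) -> List[Record]:
        return [r for r in records if r["species"] == species]
    return fetch


# Placeholder data only; node names and field names are invented for illustration.
SOURCES = {
    "node_a": make_source([{"species": "Species A", "county": "County 1"}]),
    "node_b": make_source([{"species": "Species A", "county": "County 2"},
                           {"species": "Species B", "county": "County 3"}]),
}

if __name__ == "__main__":
    merged = query_sources(SOURCES, "Species A")
    print(ET.tostring(merged, encoding="unicode"))
```

Because callers see only the merged XML document, a source could later be swapped from the central geodatabase to a remote service without changing the code that consumes the result.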
Authentication and Access Control
The
authentication and access control subsystem will provide a means by which
the most sensitive data can be made available to permitted classes of users.
Under the direction of Dr. Robert Morris, the UMASS team will develop the
access control (AC) layer. The team will design and implement an XML access
control system interposed between the backend and the Web Services layer. It
will be a role-based system
(http://csrc.nist.gov/rbac/).
Specifically, we propose a fine-grained XML AC system based on roles carried in an
optional signed digital certificate accompanying each query. A data provider offering
XML will embed in every sensitive XML element the permitted roles and the policies
for those roles, expressed in the XML Access Control Language
(XACL) (www.trl.ibm.co.jp/projects/xml/xacl/index.htm).
This enhanced XML is passed to an Access Control Agent (ACA), which may be a
separate web service or may be invoked directly by the provider, as we
propose for the Version 1 implementation. Rights enforcement will apply security
transformations to the XML data documents based on stored roles, as codified
in the DSA. If the roles certified by the certificate accompanying the query
match those in the sensitive element, the policy (including policy for the
absence of a certificate) will be enforced by the ACA, which returns the
resulting XML to the requesting client, perhaps with further processing
(e.g. conversion to HTML if the client is a browser-based application).
Examples of some roles and policies that might be supported:
SeniorScientist: pass precise geospatial data; RegionalPlanner: convert
precise geospatial data to appropriate place name. We note that the
architecture is fully dynamic, permitting the data provider to change the
policy over time, consistent with obligations made in their DSA. The
implementation will provide middleware for inserting XACL at the source, as
well as for implementing policies in the ACA. We will adopt the design
expressed in the Trust Establishment (TE) architecture described by IBM
(http://www.alphaworks.ibm.com/tech/TrustEstablishment).
This system permits digital certificate issuers to authenticate a holder to
a role, and obviates the need for password maintenance or individual
identity authentication.
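The enforcement step performed by the ACA can be sketched in Python under several assumptions. The `policy` child elements, their `role` and `action` attributes, the `placename` attribute, and the three action names are hypothetical stand-ins and are not XACL syntax. Certificate verification is omitted: the sketch takes the set of already-certified roles as input.

```python
import copy
import xml.etree.ElementTree as ET
from typing import Iterable

# Hypothetical markup (not XACL): each sensitive element carries <policy> children.
SAMPLE = ET.fromstring(
    "<occurrence>"
    "<species>Species A</species>"
    '<location placename="Example County">'
    '<policy role="SeniorScientist" action="pass"/>'
    '<policy role="RegionalPlanner" action="placename"/>'
    '<policy role="*" action="deny"/>'
    "<coordinates>0.0000,0.0000</coordinates>"
    "</location>"
    "</occurrence>"
)


def choose_action(element: ET.Element, roles: set) -> str:
    """First policy matching a certified role wins, else the '*' policy, else deny."""
    fallback = "deny"
    for policy in element.findall("policy"):
        if policy.get("role") in roles:
            return policy.get("action", "deny")
        if policy.get("role") == "*":
            fallback = policy.get("action", "deny")
    return fallback


def enforce(doc: ET.Element, certified_roles: Iterable[str]) -> ET.Element:
    """Return a transformed copy of doc; the provider's original is left untouched."""
    roles = set(certified_roles)
    result = copy.deepcopy(doc)
    # Snapshot parent/child pairs first so the tree can be edited while looping.
    pairs = [(parent, child) for parent in result.iter() for child in list(parent)]
    for parent, child in pairs:
        if child.find("policy") is None:
            continue  # not a sensitive element
        action = choose_action(child, roles)
        for policy in child.findall("policy"):
            child.remove(policy)  # never leak policy markup to the client
        if action == "pass":
            continue
        if action == "placename":
            for coords in child.findall("coordinates"):
                child.remove(coords)
            ET.SubElement(child, "placename").text = child.attrib.pop("placename", "")
        else:
            parent.remove(child)  # deny: drop the whole sensitive element
    return result
```

With this sample, a query certified as SeniorScientist keeps the coordinates, one certified as RegionalPlanner receives only the place name, and a query without a certificate falls through to the `*` policy and receives no location at all.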
To be
successful this approach must provide easy and transparent data access from
the user perspective, but must also respect and support the data access
policies of the individual data providers. Technologically, the system will
be designed to provide the finest level of access that an individual data
provider allows, employing role-based user authentication and XML access control. At
the same time, we believe this project will create the institutional buy-in
and peer pressure required to increase the amount of fine-scale geospatial
data available for scientific, educational, and other use by working with
network participants and others to identify and address key institutional
and intellectual property rights issues. A workshop in year one will focus
on codifying user role and policy descriptions to be embedded in the
authorization and access control subsystem.
VI. Project Team and Collaborations
Two teams,
working under the direction of the principal investigator, will implement
the project by focusing on specific themes and deliverables. We are also
committed to tapping into innovative technology work taking place across the
network of natural heritage programs, and have established a network-wide
Technology Working Group (TWG) to help empower local technologists and
harvest these innovations for broader application. Because a key objective
of this project is to promote interoperability with other key initiatives,
we have established a Project Advisory Committee that reflects multiple
external perspectives, and will provide guidance to the PI and Co-PIs on
project direction and implementation.
Advisory Committee (AC)
Lead Investigator: Bruce Stein
A Project
Advisory Committee has been assembled to provide external perspectives from
individuals involved in various aspects of biodiversity and ecological
informatics. The following individuals have agreed to serve on this Advisory
Committee (see attached letters): Dr. James Beach (University of Kansas);
Dr. James Brunt (Long Term Ecological Research Network, University of New
Mexico); Dr. Frank Davis (University of California, Santa Barbara); Dr.
Steven Kelling (Cornell Laboratory of Ornithology); Dr. Meredith Lane (GBIF
Secretariat); Dr. Ronald Pulliam (University of Georgia); and Dr. Barbara
Stein (MaNIS project, University of California, Berkeley).
Institutional Relationships Team (IR)
Lead Investigator: Mary Klein
Purpose: Work across institutions to involve data custodians in the design of system
components and control processes that embody agreed-upon data sharing and
access guidelines. Identify workshop participants and host a data access
workshop. Document participant and user requirements for the authentication and
access control system.
Technology Design and Implementation Team (TDI)
Lead Investigators: Larry Sugarbaker and Robert Morris
Purpose: Design and implement the technical architecture and software components.
- Gateway Site Team – design user interface and program backend changes to support
  spatial searches from the geodatabase.
- Web Services Team – design and develop the Web Services and access control
  components.
- Data Model Team – design the enterprise geodatabase.
- Technology Working Group – consists of a team of technologists from data centers
  across the NatureServe network that have expertise or interest in online
  data access technologies. Will serve as a network-wide collaboratory
  fostering innovations and providing input into design and
  implementation.
VII. Workplan
Months
Adopt Data Access Rules
Existing
DSAs provide detailed information about custodian data release policies,
especially regarding data access levels and user types. This project will
begin by clarifying the design requirements to embody these policies in
automated technologies and ensure that requirements are understood for all
cases. As needed, we will also update the DSAs to reflect the emerging needs
of an Internet data delivery environment. Colorado State University’s
Colorado Natural Heritage Program will engage students in hosting a workshop
to review data provider expectations with respect to the automation of
existing user-authentication agreements, and to document data user
expectations regarding appropriate uses and ease of access. The workshop
will include network participants as well as partners from the broader
scientific community. The goal of the workshop will be to refine the
requirements and online implementation of the data sharing policies embodied
in the DSAs. Guidelines will document specific classes of users that can be
used to form the basis for the data access/authorization subsystem.
Tasks
- Workshop to clarify and refine data sharing and delivery rules
Deliverables
- Data access guidelines and user authentication rules published on the Internet
- Design requirements
- Refined data sharing agreements
Technology Requirements and Design
We will
review infrastructure options and design the most appropriate architecture
for meeting the needs of the user community. An appropriate XML schema will
need to be derived for EO data. Technical design and interface specs will be
worked out, with input from the CSU workshop guiding the final stages of
development. We will also evaluate the feasibility of connectivity solutions
for the Geodatabase based on an ESRI ArcGIS 9.0 platform.
Tasks
- TDI teams review existing data delivery approaches, alternative data
  exchange work flows, and requirements
- Evaluate suitability of potential software tools and recommend specific
  infrastructure requirements
Deliverables
- System Design documents for all aspects of technology
- Data abstraction layer
- Access control agent specifications
- Translation libraries
- Modifications to Gateway interface and Biotics 4
- XML schema
- Incremental data exchange protocols
Automate Data Exchange and Increase Frequency of Data Reconciliation
Regular data
exchanges between NatureServe and network programs are required to ensure
up-to-date range-wide and taxonomic data at the local nodes, and to
aggregate local data into the enterprise geodatabase. These exchanges are
complicated by the need to resolve taxonomic differences between local and
central databases. Data exchanges currently are linked directly to the
taxonomic reconciliation process, a labor-intensive scientific review
process that limits the frequency of updates to once a year per local node.
Automating the data exchange process and separating it from the taxonomic
reconciliation process will allow information to be exchanged on a more
frequent basis, greatly improving the amount of up-to-date range-wide data
available to the local nodes. One possible approach is to create an
incremental update process in which each update includes only the data that
have changed since the previous update. Reducing each exchange to only the
changes should allow for more frequent refreshes. We will also be reviewing
taxonomic reconciliation
software currently under development elsewhere to determine its
applicability to our needs.
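The incremental update idea can be sketched as follows. This is an illustration only, not the Biotics 4 exchange format: the snapshot structure (record id mapped to field values), the `Delta` type, and the sample records are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Record = Dict[str, str]
Snapshot = Dict[str, Record]  # record id -> field values


@dataclass
class Delta:
    upserts: Snapshot = field(default_factory=dict)   # new or modified records
    deletes: List[str] = field(default_factory=list)  # ids no longer present


def compute_delta(previous: Snapshot, current: Snapshot) -> Delta:
    """Keep only what changed between two exports of the same local database."""
    upserts = {rid: rec for rid, rec in current.items() if previous.get(rid) != rec}
    deletes = sorted(rid for rid in previous if rid not in current)
    return Delta(upserts, deletes)


def apply_delta(snapshot: Snapshot, delta: Delta) -> Snapshot:
    """Bring a copy of the receiving side's snapshot up to date."""
    updated = {rid: dict(rec) for rid, rec in snapshot.items() if rid not in delta.deletes}
    for rid, rec in delta.upserts.items():
        updated[rid] = dict(rec)
    return updated


# Placeholder records, invented for illustration.
PREVIOUS = {
    "eo-1": {"status": "extant"},
    "eo-2": {"status": "extant"},
    "eo-3": {"status": "historical"},
}
CURRENT = {
    "eo-1": {"status": "extant"},
    "eo-2": {"status": "historical"},
    "eo-4": {"status": "extant"},
}

if __name__ == "__main__":
    delta = compute_delta(PREVIOUS, CURRENT)
    print(sorted(delta.upserts), delta.deletes)
```

Only the delta travels between a local node and the enterprise geodatabase. Unchanged records such as `eo-1` are never retransmitted, which is what makes more frequent exchanges practical.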
Tasks
- Build on Biotics 4 data exchange tools to automate synchronization of
  local and range-wide data, and identify records requiring review and
  reconciliation
- Design a new work flow system for taxonomic reconciliation
- Train network participants in new work flow procedures
Deliverables
- Automated, incremental data exchange and synchronization tools
- Revised taxonomic reconciliation process, including workflow and labor
  impacts
- Methodology training for network programs
Expand Enterprise Database to Include Network-Wide Geospatial Data
The new
enterprise geodatabase will integrate EO layers from all network nodes into
a common database environment that will support data exploration,
visualization, and extraction through the proposed gateway site. This
geodatabase will manage the spatial representation of all EO data as
polygons rather than as geo-referenced points. With Biotics 4 installed,
each local database will have a single ArcView GIS layer containing its
entire set of EO records, represented as polygons. These ArcView shape files
will be converted to a more robust geodatabase format based on the guidance
developed in the Technology Requirements and Design activity, providing the
capability to effectively manage the large volume of network-wide geospatial
data. Implementation will include tools and processes to integrate the
exchange of geospatial data into the regular data exchange and
reconciliation schedule.
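To illustrate what a spatial search over polygon-based EO records involves, the sketch below applies a bounding-box pre-filter in plain Python. It is an assumption-laden illustration rather than the geodatabase design: the record ids and coordinates are placeholders, and a production system would rely on the spatial indexing and exact geometry tests of the chosen GIS platform.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]  # min_x, min_y, max_x, max_y


def bounding_box(polygon: List[Point]) -> BBox:
    """Smallest axis-aligned rectangle that contains every vertex of the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_intersect(a: BBox, b: BBox) -> bool:
    """True when two rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidate_occurrences(polygons: Dict[str, List[Point]], query: BBox) -> List[str]:
    """Ids of EO polygons whose bounding box meets the query rectangle.

    This is a coarse pre-filter: a candidate may still miss the query
    rectangle once its exact polygon outline is tested.
    """
    return sorted(
        eo_id
        for eo_id, polygon in polygons.items()
        if boxes_intersect(bounding_box(polygon), query)
    )


# Placeholder geometry in arbitrary units, invented for illustration.
EO_POLYGONS = {
    "eo-1": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "eo-2": [(10.0, 10.0), (12.0, 10.0), (11.0, 12.0)],
}

if __name__ == "__main__":
    print(candidate_occurrences(EO_POLYGONS, (1.0, 1.0, 3.0, 3.0)))
```

Storing EO records as polygons means a search returns every occurrence whose extent reaches into the area of interest, not only those whose single reference point happens to fall inside it.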
Tasks
- Build enterprise geodatabase infrastructure including data dictionary,
  data model and file management protocols
- Develop tools to support refresh of the enterprise geodatabase from local
  nodes
- Implement a regular refresh schedule for geospatial data
Deliverables
- Enterprise geodatabase integrating network-wide spatial element occurrence data
  and range-wide element data
- Test plan
Create Web Services and Data Access/Authorization Subsystem
Data access
control policies and requirements will be incorporated into the system using
an Access Control Agent (ACA) that enforces policies written in XACL to support
XML-based management of the rights of specified user classes. We will then
develop Web Services components, described in Web Services Description Language
(WSDL), using the Apache Axis framework and Web Services Invocation Framework, in
conjunction with server-side programs that will respond to queries from
GBIF/TDWG DiGIR-compliant users. Students from the University of
Massachusetts-Boston will develop the ACA and WSDL code, playing a lead role
in deployment, testing and documentation. The Advisory Committee will review
proposed solutions for transparency, ease of use, and suitability of
pre-defined roles for the research community.
Tasks
- Develop server-side facility for XML access control
- Implement role certificate management subsystem with Public Key
  Infrastructure
- Build Access Control Agent to transform controlled elements for delivery
  to users
- Develop Web Services
Deliverables
- User authentication and Access Control Agent
- Web Services software components
- WSDL compliant services using Apache Axis
- Client-side invocation of Web Services also using Apache
- GBIF/TDWG DiGIR compliant protocols
- Metadata in EML registered where suitable (e.g., Metacat)
- Test plan
- Documentation for installing and maintaining the system
Create Gateway/User Interface
Version 1
will be delivered via an enhanced NatureServe Explorer user interface to
support spatial searches, data downloads and presentation of the enterprise
geodatabase content. Central node components from all preceding project
activities will be tested and deployed via this interface.
Tasks
- Modify NatureServe Explorer interface to create a gateway to the enterprise
  geodatabase
- Develop user-oriented documentation to describe appropriate uses and
  guidelines for interpretation of the data sets
Deliverables
- Gateway site providing self-service, online data discovery, visualization,
  and XML-based delivery
- Test plan
Pilot Web Services Direct Access to Local Nodes
Implementation of Web Services for direct access to local nodes (Version 2)
is not a part of the current project funding.
However, to ensure that Version 1 of the software architecture will meet
that ultimate goal, we will work with two local pilot sites during the
current project. Pilot sites will be selected based on their ability to
provide training opportunities for underrepresented populations, or to help
build institutional capacity.
Tasks
- Engage two network nodes to serve as pilot test sites
- Test and deploy local node Web Services components in collaboration with
  pilot test sites
- Train network nodes on the enterprise geodatabase data models and Web
  Services tools
Deliverables
- Test plan
- Two operational local Web Services nodes
- Rollout plan for network-wide implementation over longer time frame
- Technical architecture, system implementation and maintenance
  documentation
- Long-term, network-wide training plan
VIII. Broader Impact
Creating online access to these unique biological data resources will have broader
impacts in promoting conservation and sustainable environmental management,
in advancing electronic government (E-Gov), and in enriching education at
K-12 and college levels. Additionally, the project will seek to involve
graduate students from underrepresented communities in local pilots and
build capacity in historically underserved jurisdictions.
Conservation and Environmental Management
Biodiversity
conservation is a major societal concern, and sound scientific information
is needed to set targets for protection and improve management of natural
resources. Data from NatureServe’s network of natural heritage programs are
widely used for conservation and environmental management; indeed, The New York
Times has referred to the network’s databases as “the country’s leading source of
biological information for conservation planners, government agencies and
land managers” (Stevens 2000). While these programs collectively fill more
than 75,000 data requests annually—informing decisions ranging from
nationwide oil-spill contingency planning to the evaluation of site-specific
impacts from housing developments—most of these requests are serviced
manually. As a result, many users and potential users are not incorporating
these data into their routine decision-making processes to the degree that
would be desirable. Creating online access to these data should greatly
expand the number of prospective users for this information, producing
correspondingly broad environmental benefit.
Electronic Government
Electronic
government, or E-Gov, is a national priority designed to provide improved
government service to citizens and businesses by, among other things,
sharing and integrating federal, state, and local data. Because most natural
heritage programs are operated by state government, enabling dynamic online
access to this information represents a major contribution to meeting E-Gov
goals. Implementation of this distributed database access system will
improve the efficiency and effectiveness of these state government programs
by allowing them to spend less time on routine data requests and
environmental reviews and focus more attention on consultations requiring
in-depth biological expertise. Many users of the data are involved in
environmental or development projects where delays in planning or permitting
can have significant financial costs. By reducing the response time for data
requests, this project should also represent a cost savings to the
government agencies, businesses, and citizens that rely on these data.
Developing an enterprise solution to improving online access to state
natural heritage data—the focus of this proposal—will also create
efficiencies and cost savings by avoiding the need for each state to create
its own technological solution. Finally, such a network-wide solution will
allow data to be queried and integrated across state boundaries, better
meeting the E-Gov needs of regional bodies and federal agencies.
Education
The
biodiversity data held by NatureServe and its member programs also have
broad application in the educational community, including K-12. The
NatureServe Explorer web site, which will be expanded to serve as the
gateway for the new distributed architecture, already has been identified as
a leading web-based educational resource. The American Library Association’s
review of NatureServe Explorer describes it as “a tremendous new resource
that deserves to be bookmarked at every library, whether public, school, or
academic” (ALA 2001). Similarly, NatureServe Explorer is highlighted as a
“great site for middle and high school research” by Education World
(www.education-world.com/a_sites/sites080.shtml),
and is included in numerous local and national “homework help” sites, such
as that maintained by the Carnegie Library of Pittsburgh
(www.carnegielibrary.org/subject/homework/biology.html).
This project is intended to greatly enhance the level of geographic
information available to students through NatureServe Explorer, opening up
new opportunities for this information to serve as the basis for innovative
lesson plans and geographically oriented student research. In particular,
these new data and this functionality will open up opportunities for students
both to explore their local environs and to compare and contrast them with
ecosystems elsewhere.
Underrepresented Groups
The project
will also strive to benefit underrepresented populations through their
involvement in training and institutional capacity building. In particular,
we intend to pilot local implementation of the Web Services infrastructure
at university-based natural heritage programs with significant minority
enrollments, such as University of New Mexico (44% minority) or University
of Alaska (28%).
IX. Comments from Previous Submission
The
panel reviewing the previous submittal of this proposal agreed that the
natural heritage databases are an invaluable resource and that confederating
them and making them more accessible is an extremely worthwhile and
important goal. The panel raised several issues regarding: 1) the level of
detail presented in the technology plan; 2) development of access and
authentication systems; 3) resolution of institutional issues related to
data access; and 4) level of external participation to promote compatibility
with other initiatives.
Technology Plan
Considerable
detail has been added to the technology plan to lay out the conceptual basis
for our approach and outline the technology solutions contemplated. While
some of the technologies to be employed are quite mature and their
implementation will be straightforward (e.g., Web Services), other standards
and technologies are evolving rapidly, especially in the field of
Internet-based geoprocessing. Thus, a number of specific details must be
deferred until the technology requirements and design phase of the project to
ensure that we take advantage of the most appropriate emerging standards and
technologies.
Access Control and Authentication
A detailed
plan for developing access control and authentication is presented in the
project approach and work plan sections of the proposal. This work will be
undertaken by the UMASS-Boston team led by Co-PI Robert Morris.
Institutional Issues
The
formalization of institutional agreements among network participants has
been essential for meeting the objectives of this project. Work on
developing consensus among institutional participants on data sharing issues
over the past three years has yielded significant advances, especially with
respect to the documentation of data custodian expectations regarding
control and release of information to authorized users. The creation of
NatureServe in 2000 as an independent membership organization representing
the confederated databases, including governance mechanisms involving
network members, has greatly facilitated inter-institutional data sharing
arrangements. As discussed earlier, over the past year NatureServe has
worked with its member programs to implement a next generation of formal
data sharing agreements (DSAs) that are directly supportive of this
proposal’s approach to improving access to detailed geospatial element
occurrence data. Letters from more than 35 natural heritage programs are
attached, indicating their commitment to participate in this project.
External Participation
Dr. Robert
Morris of the University of Massachusetts-Boston has joined the project team
as a Co-PI, and we will benefit enormously from his understanding of and
involvement with many of the other leading biodiversity informatics
initiatives. A specific goal of this project is to improve interoperability
of the NatureServe network with other initiatives, and Dr. Morris will be
working with NatureServe technology staff to develop a technical
architecture designed to maximize such interoperability. We have also
continued to include in the project an external Advisory Committee, with
representation from individuals involved in museum database confederation
projects (MaNIS, University of Kansas), international initiatives (GBIF),
ecological informatics networks (LTER), and citizen-science web projects
(eBird). These individuals will bring a varied external perspective to the
design and implementation of the system. Letters from these seven external
advisors are attached indicating their commitment to serve on this
committee.
SECTION D: References Cited
American Library Association (ALA). 2001. Review of NatureServe Web site. Choice:
Current Reviews for Academic Libraries 38(9): 3889.
Bellwood, T., L. Clément, D. Ehnebuske, A. Hately, M. Hondo, Y. L. Husband, K.
Januszewski, S. Lee, B. McKee, J. Munter and C. von Riegen. 2002. Universal
Description, Discovery and Integration (UDDI) Version 3.0.
http://uddi.org/pubs/UDDI-V3.00-Open-Draft-20020703.htm. Web page accessed July 17, 2002.
Box, D., D. Ehnebuske, G. Kakivaya, A. Layman, N. Mendelsohn, H. F. Nielsen, S. Thatte
and D. Winer. 2000. Simple Object Access Protocol (SOAP) 1.1. W3C Note 08 May 2000.
http://www.w3.org/TR/SOAP/. Web page accessed July 17, 2002.
Edwards, J. L., M. A. Lane, and E. S. Nielsen. 2000. Interoperability of biodiversity
databases: Biodiversity information on every desktop. Science 289:
2312-2314.
Groves, C. R., M. L. Klein, and T. F. Breden. 1995. Natural heritage programs:
Public-private partnerships for biodiversity conservation. Wildlife Society
Bulletin 23: 784-790.
Morse, L. 1993. Standard and alternative taxonomic data in the multi-institutional
Natural Heritage Data Center Network. Pp. 69-79 in F.A. Bisby, G.F. Russell,
and R.J. Pankhurst eds. Designs for a Global Plant Species Information
System. Oxford: Oxford University Press.
National Research Council (NRC). 2000. The Digital Dilemma: Intellectual Property
in the Information Age. Washington, DC: National Academy Press.
National Research Council (NRC). 2002. Information Technology Research, Innovation,
and E-Government. Washington, DC: National Academy Press.
President’s Committee of Advisors on Science and Technology (PCAST). 1998. Teaming with
Life: Investing in Science to Understand and Use America’s Living Capital.
Washington, DC: White House Office of Science and Technology Policy.
Raven, P. H. and T. Williams. eds. 2000. Nature and Human Society: The Quest for
A Sustainable World. Washington, DC: National Academy Press.
Stein, B. A., L. S. Kutner, and J. S. Adams. eds. 2000. Precious Heritage: The
Status of Biodiversity in the United States. New York: Oxford University Press.
Stein, B. A. and F. W. Davis. 2000. Discovering life in America: Tools and techniques of
biodiversity inventory. Pp. 19-53 in Stein, B. A., L. S. Kutner, and J. S.
Adams eds., Precious Heritage: The Status of Biodiversity in the United
States. New York: Oxford University Press.
Stevens, W. K. 2000. U.S. found to be a leader in its diversity of wildlife. New York
Times, March 16, 2000, p. A18.
Weerawarana, S., R. Chinnici, M. Gudgin, and J. Moreau. 2002. Web Services
Description Language (WSDL) Version 1.2. W3C Working Draft 9.
http://www.w3.org/TR/wsdl12/. Web page accessed July 17, 2002.
|