Goals and Objectives for TAMU SCOOP participation
- FY 2005 task writeups
Texas A&M University
Statements of Work for SCOOP 2005/2006 Activities
TAMU's work will cover the following tasks:
1. SCOOP archive data center
2. OGC-compliant visualization support
3. SCOOP grid security efforts
4. SCOOP data transport
5. Data translation (in concert with UAH)
6. Web Services development (in concert with UAH)
7. Data Standards
All of these tasks are specifically identified in the SCOOP architecture document. In the SCOOP project, TAMU has demonstrated its capability in terms of technical expertise, reliable operations management, and teamwork. We are confident that, together with the SCOOP partners, we will accomplish the above tasks in support of the SCOOP mission.
TASK 1: SCOOP Archive
Overview: This is a continuing operational task. TAMU will continue to provide archive services to the SCOOP partners. TAMU has established a data center for environmental data with 15 terabytes of spinning-disk capacity. TAMU began collecting data for SCOOP on 1 May 2005 and has collected data continuously since then. In this effort, TAMU has been investing over $150K of internal funds in support of the SCOOP archive. See Appendix A for details of TAMU's effort in developing the TAMU Data Center that supports the SCOOP data archives.
The decision to support the archive on limited funding was based on TAMU's expertise with large array storage systems, and our recognition that there was both a need for this facility and no other partners with the in-house capability to support the volume of data generated by SCOOP in an active hurricane season.
We believe that this task is critical to the success of SCOOP. SCOOP's original concept was to support several regional data centers across the SCOOP footprint, with no specific site responsible for archiving all data in a central repository. However, with the introduction of the TAMU data center, the SCOOP community now has a central repository in addition to the regional archives. This provides a better organizational data model for the following reasons:
1. The central repository has a larger archive capacity than any regional partner.
2. No overriding architecture for a distributed archive has been developed for the SCOOP project.
3. The central repository represents a concentration of resources that leverages large capacity, high bandwidth and computational resources. This eliminates the need to transfer large file sets (multiple gigabytes or terabytes), which would cause a significant delay in modeling and simulation.
4. TAMU possesses sufficient bandwidth to allow the collection of all SCOOP data without creating an undue burden either on the TAMU campus networks or on the SCOOP archive facilities.
5. TAMU's provision of archive services has been demonstrated to be the fail-safe position. The TAMU archive, with its philosophy of capturing and archiving all data made available to or produced by the SCOOP partners (including incomplete datasets), has proven itself repeatedly when data had to be recovered for retrospective analyses.
The Task: TAMU proposes to continue providing an operational archive to the SCOOP community. TAMU's specific work related to this task will comprise the following:
- TAMU will provide sufficient space to maintain at least two years of SCOOP and related data in online or near-line storage. Data aged beyond the two-year mark will be transferred to offline storage but will remain available upon request using agreed-upon SCOOP standards and tools for data discovery and access.
- TAMU's portal will provide an independent web-access method for obtaining inventory metadata as well as the data files.
- Data will be archived and inventoried as it is received. Catalog information will be automatically relayed to the SCOOP catalog based on data collected for the inventory. Web services will be used to interact with catalog services.
- Data discovery will be supported via catalog services and the THREDDS (Thematic Real-time Environmental Distributed Data Services) project implementation.
- The TAMU archive will be monitored automatically, and anomalies will be reported to a system administrator, who will respond within 2 hours of notification and initiate corrective action.
- A web-based status page will be maintained to inform the SCOOP partners of status, outages, causes and resolution.
- TAMU's goal for availability of the archive is 24x7x365 operation with 99.9% uptime.
- Scheduled maintenance and outages will be announced at least 7 days prior to the downtime. Where possible, backup systems will be employed to maintain archive capabilities during scheduled downtime.
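To make the 99.9% availability goal concrete, the short calculation below converts an availability percentage into a downtime budget. It is an illustrative sketch only: the function name and the choice of a 365-day year and a 30-day month are assumptions made for this example, and the proposal itself does not say whether scheduled maintenance counts against the goal.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Return the downtime budget, in hours, implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_hours * (1 - availability_percent / 100)


if __name__ == "__main__":
    yearly = allowed_downtime_hours(99.9)
    monthly = allowed_downtime_hours(99.9, period_hours=30 * 24)
    print(f"Yearly budget:  {yearly:.2f} hours")
    print(f"Monthly budget: {monthly * 60:.1f} minutes")
```

Under these assumptions, a 99.9% target allows roughly 8.76 hours of downtime per year, or about 43 minutes in a 30-day month, which is a useful yardstick next to the 2-hour administrator response time.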
The estimated cost for this task is $158K, and will go toward support of technical and staff salaries, benefits, relevant travel, space and other indirect expenses.
TASK 2: OGC-Compliant Data Visualization
Overview: Visualization of model output data and in situ observations is specifically called for in the SCOOP Architecture document. The OpenIOOS.org site is the defined site for the display and dissemination of this data. The specifications developed by the Open Geospatial Consortium (OGC) are commonly used by OpenIOOS.org to obtain and provide data on that site. TAMU and LSU have been providing data via OGC-compliant services. TAMU has several years of experience with this method of data provision and presentation, while LSU has mastered the capability using the ESRI, Inc. ArcGIS tool. TAMU proposes to streamline the visualization of routinely received model output data by performing the initial OGC rendering at the archive site, using tools already in use. TAMU's use of recognized tools, services and procedures should allow data to be presented in a geospatial format within a relatively short time after the basic model output data is received.
The Task: TAMU's specific work related to this task will comprise the following:
- To extract data from point and polygon files into a spatially aware database (PostGIS) for presentation to the Minnesota MapServer software, which will make the data widely available via the Web Map Service (WMS) and/or Web Feature Service (WFS).
- To present, as Web Coverage Service (WCS) products, certain files defined in netCDF format and consistent with well-known netCDF formats.
- To maintain the original model outputs in the archive, without modification.
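As an illustration of how a client such as OpenIOOS.org could request a rendered layer from such a service, the sketch below assembles an OGC WMS 1.1.1 GetMap request URL. The query parameters are the standard GetMap parameters; the server URL, the layer name `surge_height`, and the bounding box are hypothetical values chosen for this example and do not describe the actual TAMU deployment.

```python
from urllib.parse import urlencode


def build_getmap_url(base_url, layer, bbox, width=800, height=600,
                     srs="EPSG:4326", image_format="image/png"):
    """Build a WMS 1.1.1 GetMap URL for a single layer.

    bbox is (min_x, min_y, max_x, max_y) in the units of the given SRS.
    """
    min_x, min_y, max_x, max_y = bbox
    if min_x >= max_x or min_y >= max_y:
        raise ValueError("bbox must be (min_x, min_y, max_x, max_y)")
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty string requests the server's default style
        "SRS": srs,
        "BBOX": ",".join(str(value) for value in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    # Some servers already carry a query string (for example a map file path).
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode(params)


if __name__ == "__main__":
    # Hypothetical endpoint and layer; the box roughly covers the Gulf of Mexico.
    print(build_getmap_url("http://example.org/cgi-bin/mapserv",
                           "surge_height", (-98.0, 18.0, -80.0, 31.0)))
```

Because the request is a plain URL, any partner can retrieve the rendered image without needing access to the underlying PostGIS tables or model output files.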
The estimated cost for this task is $101K, in order to support technical and staff salaries, benefits, relevant travel, space and other indirect expenses.
TASK 3: SCOOP Grid Security Efforts
Overview: The SCOOP grid and its data are mission critical and hence need enhanced protection. Security is one of the highlighted elements of the SCOOP architecture document. We will continue our effort to pursue best practices and undertake heightened security procedures.
The Task: TAMU's specific work related to this task will be as follows:
- To develop and deploy a certificate system recognized by VeriSign or another top-level certificate signing authority, to provide certificates based on the National Middleware Initiative/Shibboleth activities and Public Key exchange for use across the SCOOP grid/network.
- To organize and publish best practices for internet-connected security, in concert with the SCOOP partners and their designated network security agents.
- To establish and maintain a mailing list to notify SCOOP partners of newly identified vulnerabilities, their potential impacts, and possible solutions.
The estimated cost for this task is $53K, in order to support technical and staff salaries, benefits, relevant travel, space and other indirect expenses.
TASK 4: SCOOP Data Transport
Overview: TAMU's proposal to use the Local Data Manager (LDM) for near-real-time data transport has been adopted by five of the modeling partners, which use LDM to present their output to the archives and other users. TAMU is confident in its experience in transporting large and extensive datasets via LDM. Similarly, TAMU has significant expertise in network operations and research, and is currently participating in the development and deployment of a statewide high-speed internet protocol fiber-optic network. Thus, TAMU has the background, experience and capabilities to evaluate and recommend specific approaches for transport in a given setting. Note that the draft Architecture document identifies a variety of additional transport methodologies that should be considered. Accordingly, TAMU proposes to lead the efforts on data transport for the next year of SCOOP.
The Task: TAMU's work related to this task will be as follows:
- To provide leadership and expertise in LDM operations and optimization.
- To serve as a resource for THREDDS and OPeNDAP installation, configuration, and operation.
- To provide guidance in other transport mechanisms, including FTP, SFTP (over SSH) and web services.
- To extend the top-level LDM relay, originally designed to assist in transporting National Weather Service radar data, to transport virtually all data on the Unidata Internet Data Distribution system.
The estimated cost for this task is $83K, in order to support technical and staff salaries, benefits, relevant travel, space and other indirect expenses.
TASK 5: Data Translation
Overview: This task will be conducted in close cooperation with UA/Huntsville. UA/Huntsville has served as lead on data translation to date, and has made significant strides in this area. TAMU has served as a partner. For example, TAMU has undertaken a project to translate the Mesoscale Model Version 5 (MM5) weather model output from native FORTRAN binary data to several common data formats for more widespread use by the general public (or by those engaged in MM5 operations at some level). TAMU's continued involvement with UAH will allow both groups to benefit from each other's experience and knowledge in data formats, data translation, geospatial referencing and interpolation, and other associated problems. We believe that maintaining this arrangement offers the greatest benefit to the SCOOP project. TAMU therefore proposes to continue the partnership in data translation with UA/Huntsville.
The Task: TAMU's specific work related to this task will be as follows:
- To work with UAH to develop translation tools for use at the archive level.
- To deploy the translation tools, as they are developed, at the archive site and make them available via the scoop.tamu.edu portal to facilitate users' requests for data in a non-native format.
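The translation from native FORTRAN binary output mentioned in the overview can be sketched as follows. Many FORTRAN compilers write unformatted sequential files as records framed by a 4-byte length marker before and after each payload; the example reads such records and emits JSON. This is a minimal sketch under stated assumptions, not the TAMU or UAH translation tool: it assumes big-endian 4-byte markers and payloads made up solely of 4-byte floats, and it does not reproduce the real MM5 header or field layout. All function names are hypothetical.

```python
import io
import json
import struct


def write_fortran_record(values, byte_order=">"):
    """Pack 4-byte floats as one record framed by leading and trailing length markers."""
    payload = struct.pack(f"{byte_order}{len(values)}f", *values)
    marker = struct.pack(f"{byte_order}i", len(payload))
    return marker + payload + marker


def read_fortran_records(stream, byte_order=">"):
    """Yield the raw payload of each length-framed record in a binary stream."""
    marker = struct.Struct(f"{byte_order}i")
    while True:
        head = stream.read(marker.size)
        if not head:
            return  # clean end of file
        if len(head) < marker.size:
            raise ValueError("truncated record marker")
        (length,) = marker.unpack(head)
        if length < 0:
            raise ValueError("negative record length")
        payload = stream.read(length)
        tail = stream.read(marker.size)
        # The trailing marker must repeat the leading one exactly.
        if len(payload) != length or tail != head:
            raise ValueError("record markers do not match")
        yield payload


def record_to_floats(payload, byte_order=">"):
    """Decode a payload that is assumed to contain only 4-byte floats."""
    if len(payload) % 4:
        raise ValueError("payload length is not a multiple of 4 bytes")
    return list(struct.unpack(f"{byte_order}{len(payload) // 4}f", payload))


if __name__ == "__main__":
    # Build a small in-memory sample instead of reading a real model file.
    sample = write_fortran_record([1.0, 2.5]) + write_fortran_record([3.0])
    records = [record_to_floats(p) for p in read_fortran_records(io.BytesIO(sample))]
    print(json.dumps(records))
```

Checking the trailing marker against the leading one gives an inexpensive integrity test, which matters for an archive that deliberately keeps incomplete datasets.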
The estimated cost for this task is $30K, in order to support technical and staff salaries, benefits, relevant travel, space and other indirect expenses.
TASK 6: Web Services Development
Overview: This task will be carried out in close cooperation with UA/Huntsville. The SCOOP architecture document specifies that the majority of inter-system communications for SCOOP will utilize web services. In the course of TAMU's various support activities for SCOOP, we have utilized, developed, modified or created a variety of web services. As with data translation, UA/Huntsville has taken the lead on web service development, and TAMU has served as a partner. The results show that this is an effective arrangement, and hence TAMU proposes to continue the partnership.
The Task: TAMU's specific work related to this task will be as follows:
- To focus directly, over the course of the next year, on web services associated with user interaction with the archives, archive interaction with the catalog services, and visualization.
- To work with UAH to reach agreement and standardization on XML data transmission schemas prior to deployment of new web services systems.
- To promote, to the extent practical, the use of standards-based web services, especially with regard to geospatial (OGC) services. Where this is impractical, to employ an open approach to disseminating data, making the schema and vocabulary of the system readily available to users.
The estimated cost for this task is $36K, in order to support technical and staff salaries, benefits, relevant travel, space and other indirect expenses.
TASK 7: Data Standards
Overview: The need for Data Standards is pervasive throughout SCOOP. The conclusion to the SCOOP architecture document states "... the goal is to achieve interoperability among heterogeneous systems through the adoption of open standards for information exchange that can be implemented with either open source or proprietary solutions." To achieve this, SCOOP has to implement and use a dictionary of base data definitions and data formats. Yet this is the area where the most "assumptions" are made. Historical data translation and transformation formats have generally assumed that such things as datum (geoid and projection), sea level, etc. are known by both parties, and have not formalized the data elements completely enough to allow clear and unambiguous transfer of data. To date, the data translation and metadata work in SCOOP have done the same. This task will extend work begun in year 1, with an intensive focus on bringing SCOOP, MMI, and DMAC work together into a consistent set of recommendations for data and metadata base definitions.
SCOOP work products must be consistent with and support other community activities such as MMI, Ocean.US DMAC, OpenIOOS, OOSTech and others. As initially implemented, "The [SCOOP] Catalog includes high level descriptions of SCOOP data collections..." This is the logical location for the working data definitions and standards required by such tasks as data translation and data transport. The SCOOP data translation architecture is being designed to accommodate coupling, nesting, and general interoperability of numerical models developed and run by different research groups. The architecture has to "chain" data between models, and data-translation and filter services must operate on data as they are transported between the modeling applications. To do this, the translation code must understand precisely the definition and format of both the input and the output data streams. This type of information has to be supplied automatically from the SCOOP catalog and other sources. This task must complete an initial working dictionary that can supply exact definition and format specifications to support the evolving metadata standards.
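The kind of explicit definition described above can be illustrated with a small sketch of a data-dictionary entry in which units and datums are required fields rather than shared assumptions. The field names, the `water_level` entries, and the helper functions are hypothetical and chosen only for this example; they are not the SCOOP catalog schema or an MMI vocabulary.

```python
from dataclasses import dataclass

REQUIRED_FIELDS = ("name", "units", "vertical_datum", "horizontal_datum")


@dataclass(frozen=True)
class DataDefinition:
    """One dictionary entry; nothing about units or datums is left implicit."""
    name: str
    units: str
    vertical_datum: str
    horizontal_datum: str
    description: str = ""


def missing_fields(entry):
    """Return the required fields that are absent or blank in a raw entry."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def can_chain(producer, consumer):
    """True when one model's output can feed another without translation."""
    return (
        producer.units == consumer.units
        and producer.vertical_datum == consumer.vertical_datum
        and producer.horizontal_datum == consumer.horizontal_datum
    )


if __name__ == "__main__":
    surge_output = DataDefinition("water_level", "m", "NAVD88", "WGS84")
    wave_input = DataDefinition("water_level", "m", "MSL", "WGS84")
    # The vertical datums differ, so a translation step is needed between models.
    print(can_chain(surge_output, wave_input))
    print(missing_fields({"name": "water_level", "units": "m"}))
```

A check of this kind lets translation and filter services decide automatically whether a conversion step is needed when data are chained between models.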
The special requirement for "Web Services" will be to investigate and determine whether "domain"-specific terminology or data elements are needed for SCOOP to use these services effectively for machine-to-machine interactions with XML technologies. While this is not a short-term requirement, a sufficiently detailed data standard, including recommended formats and resolutions, will encourage developers to write future applications to the standard from the outset.
The Task: TAMU's specific work related to this task will comprise the following:
- To engage SCOOP partners in identifying current practices and needs for data and metadata standards.
- To continue to add to the glossary of recommended definitive terms, data dictionary, data models, and metadata models drawn from SCOOP and other activities and approved by SCOOP partners.
- To increase interaction with Marine Metadata Initiative (MMI) by exchanging data standards information, regularly updating MMI web pages with SCOOP contributions, and including all appropriate MMI data in SCOOP glossaries.
- To identify, promote, and participate in DMAC activities and subcommittees which can benefit from SCOOP-developed standards.
- To analyze special requirements of web services.
The estimated cost for this task is $XXXK, in order to support technical and staff salaries, benefits, relevant travel, space and other indirect expenses.
Appendix A: TAMU's Overall Plan on Data Center and Related Development
A.1 TAMU Data Center V1.0
In response to the need for SCOOP archive services, TAMU invested internal funds of $50K to establish a data center and archive facility with the objective of providing a scalable, sustainable data archive service. The center started operation in the fall of 2005 and now provides daily weather model output to the SCOOP partners.
Version 1.0 of the TAMU Data Center (TDC) comprises some 15 terabytes (TB) of rotating media (disk). It also provides a 16-node (32-processor) computational cluster, built along classic Beowulf lines, for operational weather model runs (MM5). In terms of processing power, TDC employs 8 servers, distributing tasks to specific systems rather than overloading a given system. All servers and storage are interconnected on a dedicated gigabit network switch; the backbone network for the computational cluster resides on another, physically separate gigabit network switch. The data center resides on its own routed network, with a gigabit connection to the campus backbone. The campus connects to both the commodity internet and the Internet2 network via gigabit links.
TDC archives all data sent to it by the SCOOP partners, and also obtains and archives data from the National Centers for Environmental Prediction as well as in situ data from the National Weather Service and several volunteer network providers. Through its data center, TAMU also makes available via LDM specialty data, including Level II radar information, that the partners might otherwise be unable to obtain or that might be difficult to get through normal channels. The operation of TDC has been reliable and efficient. To date, no data have been lost due to network outages or the unavailability of storage.
A.2 TAMU Data Center V2.0
While TDC has been successful, we have started the design and implementation of Version 2.0 of the TAMU Data Center. The new version will offer significant advances in comparison with Version 1.0:
- TDC V2.0 will significantly expand storage capacity. We expect it to reach 35 terabytes, comprising 15 terabytes of online (directly accessible) and 20 terabytes of near-line (accessible with slight delays in retrieval) storage. It will essentially double the current storage volume in a homogeneous package, as opposed to the heterogeneous approach to acquiring storage devices in the past.
- TDC V2.0 will allow better management and utilization of the hardware, facilitating storage, inventory, access and retrieval of data in a faster and more robust manner than currently employed.
- With TDC V2.0 we will be able to review and redirect storage and retrieval methods and procedures on the existing volumes, allowing TAMU to improve efficiencies on the existing system. Such a reengineering effort could not be undertaken without the infusion of significant hardware to allow the safe and rapid movement of data from volume to volume.
- TDC V2.0 will provide innovative capabilities for workflow creation, editing and verification, allowing the unsophisticated user to define complex workflows and confirm the appropriate configuration prior to their execution.
- TDC V2.0 will provide both a production operation and a development sandbox where new ideas can be safely tested without the potential for adverse effects on production resources. This will allow users to proceed more rapidly with research and development than is currently the case.
TAMU is investing about $100K in hardware for TDC V2.0. A beta release is projected for 16 December 2005.
