Data sharing solutions for publicly funded research

The Data-Sharing Archive at shared-data.com (for providers) and shared-data.org (for free data use) is a new service being developed by Textensor Limited for cataloguing, storing, sharing and disseminating publicly funded research data. It is designed as an efficient and cost-effective way to make such data easily accessible.

The currently proposed features are presented below. The service is still at the definition stage, so if you are interested in data sharing solutions, please get in touch.

Motivation

Making publicly funded research data readily available is widely agreed to be desirable in many cases, and is even mandated by some funding agencies. However, for many types of data there are currently very few options available to the investigator who wishes to share data.

There are already numerous solutions used for commercial data handling that would be functionally adequate, but in general they are prohibitively expensive for research data.

The focus here, therefore, is on economising on those features that are not important in a research context and maximising those that are. In particular, the diversity of research data to be shared calls for flexible, user-friendly tools for creating and presenting metadata for every dataset that is archived. Metadata standards are also required, so that other data sharing services can harvest and aggregate the catalogue and resources can be easily located.

However, economies can be made on the data storage itself. Maintaining high availability, as in a conventional data centre where any item of data can be accessed in a fraction of a second, is not necessary for data sharing. It is probably also not worth the environmental cost: for example, even without backup systems, keeping 250 GB of data on a spinning disk consumes about 10 W, which is equivalent to an energy requirement of about 87 kWh per year (a spun-down disk still takes about 40 kWh per year). This is no doubt justifiable in some cases, but in others the data may be required only a few times a year. It is therefore important that long-term storage hardware can reduce its power requirements to a minimum, and that the storage software can manage demand and relocate data according to its frequency of access.
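The energy figures quoted above follow from a simple conversion between constant power draw and yearly energy use. The short Python sketch below reproduces that arithmetic; the function names are illustrative only, and the calculation assumes a constant draw over a 365-day year.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def annual_energy_kwh(power_watts: float) -> float:
    """Convert a constant power draw in watts to energy in kWh per year."""
    return power_watts * HOURS_PER_YEAR / 1000.0


def implied_average_watts(kwh_per_year: float) -> float:
    """Invert the conversion: average watts implied by a yearly energy figure."""
    return kwh_per_year * 1000.0 / HOURS_PER_YEAR


# A disk drawing 10 W continuously: 87.6 kWh per year.
spinning_kwh = annual_energy_kwh(10.0)

# The 40 kWh per year quoted for a spun-down disk implies roughly 4.6 W on average.
spun_down_watts = implied_average_watts(40.0)

print(f"{spinning_kwh:.1f} kWh/year, {spun_down_watts:.1f} W")
```

Run as written, this gives 87.6 kWh per year for the spinning disk, consistent with the rounded figure in the text.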

Service specification

The proposed service comprises five main components:

1. During the course of the research project, investigators use a web-based system to catalogue and upload the datasets that will be shared. Larger datasets can be sent on CD, DVD or other media. Each dataset or group of datasets is issued with a DOI (Digital Object Identifier) if it does not already have one, and is registered with the International DOI Foundation.
2. The catalogue, containing all the metadata, is indexed and made available in a browsable and searchable form on the web. The catalogue is integrated into the global index on shared-data.org and is published as RDF (Resource Description Framework) and in various other formats so that the contents can also be picked up by external indexing systems and meta-archives.
3. The datasets themselves are hosted on shared-data.org itself, at external data centres or in robotic tape libraries, depending on the volume of data concerned and the level of demand. In all cases, off-line backups are built and tested on a regular basis. Although robotic libraries incur some delay on first access, they offer much reduced energy consumption (a few kWh per TB per year) compared with on-line disk storage. The latter currently takes about 200 kWh per TB per year: an economic and environmental cost that may not be justifiable for large, rarely used datasets.
4. Small datasets are linked from the catalogue and any user can download them directly. For large datasets, the user will be asked to cover the transfer costs: either bandwidth costs for downloading (currently about 20p per GB) or duplication, packaging and postage costs for physical media. The investigator can opt to cover the costs of a certain volume of transfers up front, in which case access is free for the data user until the allocation has been used up.
5. At the end of the research project, when no more data is to be added, the complete catalogue and data (up to certain size limits) are archived to physical media (currently DVD) as a standalone website and deposited with institutions of the investigator's choice, such as university libraries.
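The charging rule for downloads described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the service's actual billing logic: the 20p per GB rate comes from the text, but the size threshold for a "large" dataset, the `Dataset` class and the rule that a download is covered only when the remaining prepaid allocation is at least the dataset size are all hypothetical.

```python
from dataclasses import dataclass

PENCE_PER_GB = 20        # bandwidth cost quoted in the text
LARGE_DATASET_GB = 1.0   # hypothetical threshold; the text does not define "large"


@dataclass
class Dataset:
    size_gb: float
    prepaid_gb: float = 0.0  # transfer volume still covered by the investigator


def download_charge_pence(ds: Dataset) -> float:
    """Return what the data user pays for one download, updating the allocation."""
    if ds.size_gb < LARGE_DATASET_GB:
        return 0.0                     # small datasets are free to download
    if ds.prepaid_gb >= ds.size_gb:
        ds.prepaid_gb -= ds.size_gb    # covered by the investigator's allocation
        return 0.0
    return ds.size_gb * PENCE_PER_GB   # allocation used up: user covers bandwidth


# A 50 GB dataset with 100 GB of prepaid transfers:
# the first two downloads are free, the third costs the user 1000p (10 pounds).
survey = Dataset(size_gb=50.0, prepaid_gb=100.0)
charges = [download_charge_pence(survey) for _ in range(3)]
print(charges)
```

A real scheme would also need to decide how to treat a partly exhausted allocation and how to price physical media, neither of which is specified here.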

The period for which the data are kept immediately available on-line after the end of the research project can be specified by the investigator but is typically ten years. At any time the full datasets and metadata can be extracted and re-hosted by third parties such as institutional, national or subject-specific archives. Indeed, Textensor will encourage and facilitate this process by publishing all metadata formats and access statistics. In this respect, the service helps streamline the use of emerging permanent archives.