
Defining institutional data storage requirements

18 March, 2013
Jonathan Rans

Institutions developing infrastructure in support of research data management are engaging with a whole range of issues, both cultural and technical. One that stands out as a clear priority is research data storage, both for “live” data during the active phase of research and for post-project archiving.

On 25 February, Jisc, Janet and the DCC hosted a workshop that brought together service providers from a variety of HE institutions and commercial suppliers of storage solutions, in an effort to develop a better understanding of institutional requirements and the extent to which current provision can meet them. Where gaps exist, the aim was to identify what needs to be done to close them.

The first part of the day comprised presentations from five institutions at different stages of service provision, giving some context and detail for later breakout discussions with suppliers.

For me, the most positive aspect of the workshop presentations and discussions was just how much consensus there was across the spectrum. Even though, in some cases, the consensus was that we simply don’t have all the answers and must do further work, defining the issues is a vital step.


COSTS OF STORAGE

There was fairly broad consensus across a number of participants on the current cost to an institution of providing active storage. Generally speaking, the price of raw “bits and bytes” storage came to around £500 per terabyte per year for a single copy held on active, spinning-disc storage. In most cases this rose to £1000/TB per year for two disc copies in separate locations, giving the system redundancy, with a further tape back-up. These figures cover hardware, running costs and human infrastructure; some of the institutions and suppliers believed that they could deliver similar provision at a somewhat reduced cost.
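
As a back-of-the-envelope illustration of those figures (the rates are the workshop ballparks quoted above; the 40TB volume and the choice of resilience level are assumptions for the sake of the example):

    # Back-of-the-envelope estimate of annual active-storage costs, using the
    # ballpark rates quoted at the workshop. The volume and resilience choice
    # are illustrative assumptions.
    SINGLE_COPY_RATE = 500    # GBP per TB per year, one spinning-disc copy
    DUAL_COPY_RATE = 1000     # GBP per TB per year, two disc copies plus tape back-up

    def annual_active_cost(volume_tb, resilient=True):
        """Estimated yearly cost (GBP) of holding volume_tb of active data."""
        rate = DUAL_COPY_RATE if resilient else SINGLE_COPY_RATE
        return volume_tb * rate

    # e.g. a department holding 40 TB of active research data
    print(annual_active_cost(40))          # 40000 GBP/year with redundancy
    print(annual_active_cost(40, False))   # 20000 GBP/year for a single copy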

Addressing the issue of long-term, archival storage, there was a good deal more uncertainty over the true costs of preserving data; the resource required for curation is poorly understood but expected to outstrip that required for simply holding the data many times over. With that in mind, discussions of costs assumed a low-curation model of preservation. Allowing for depreciation in the cost of storage media, figures of around £5000/TB were mooted for retaining data in perpetuity; however, this rests on a number of assumptions and couldn’t be considered as robust as the figure for active data storage.
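
The workshop didn’t spell out how the in-perpetuity figure was derived, but one hedged way to see how allowing for depreciation gets to a number of that order is to assume the annual cost of holding a terabyte falls by a fixed fraction each year and sum the resulting geometric series (the 20% decline rate below is an assumption, not a workshop figure):

    # Hypothetical illustration only: if holding 1 TB costs 1000 GBP in the first
    # year (the dual-copy rate above) and that cost falls by 20% each year as
    # media get cheaper, the cost of keeping the data indefinitely converges.
    first_year_cost = 1000.0   # GBP per TB in year one
    annual_decline = 0.20      # assumed yearly fall in storage costs

    # sum of 1000 * (1 - 0.20)**n for n = 0, 1, 2, ...  equals  1000 / 0.20
    in_perpetuity_cost = first_year_cost / annual_decline
    print(in_perpetuity_cost)  # 5000.0 GBP/TB -- the order of magnitude mooted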


DESIGNING THE STORAGE LANDSCAPE

When outlining the shape of ideal future provision, several of the presenters discussed the concept of tiered storage, in recognition of the spectrum of values and requirements of research data. The problem at the moment, said one, is that we cannot provide a rich enough storage landscape.

At present, expensive spinning-disc storage is used to hold everything from active research data to redundant datasets that haven’t been accessed in years. It was agreed that any tiered system would have to present a smooth interface to the researcher, but how this might be implemented is unclear. As a corollary, there may be a need for software systems that flag duplicated or inactive data that could be migrated to cheaper tiers of storage.
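
No specific tool was named for that kind of flagging, but as a rough sketch of the idea (the scan path and the one-year inactivity threshold are purely illustrative assumptions), a script could simply walk a storage area and list files that haven’t been accessed recently as candidates for migration to a cheaper tier:

    # Rough sketch: flag files that haven't been accessed for a given period as
    # candidates for migration to cheaper storage. The path and the 365-day
    # threshold are illustrative assumptions.
    import os
    import time

    def migration_candidates(root, days_inactive=365):
        cutoff = time.time() - days_inactive * 86400
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    if os.stat(path).st_atime < cutoff:   # last access time
                        yield path
                except OSError:
                    pass  # unreadable or vanished file; skip it

    for path in migration_candidates("/data/active-research-storage"):
        print(path)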

System design cannot be done without the input of academic staff, who are needed to assess the value of data and identify acceptable recovery times. In one case, this question had been put to academic staff and the results were somewhat surprising, with the majority indicating that two weeks was an acceptable recovery time for the main bulk of data.

For long-term retention, all institutions envisaged a hybrid system in which external data centres are used wherever appropriate, with the remaining data accommodated by institutional systems. One institution estimated the split between externally and institutionally archived data to be in the region of 25% to 75% respectively.


ENABLING COLLABORATION

A major problem that needs to be addressed when designing active data storage architecture is the difficulty of supporting the collaborative sharing of active data. At present most institutions achieve this through somewhat convoluted workflows for assigning access keys to external researchers. The difficulties of engaging with these systems are driving researchers towards a whole host of third-party solutions such as Microsoft SkyDrive, Google Drive and Dropbox.

For some researchers these systems may be perfectly adequate for their needs but security, back-up and tracking issues make it attractive for institutions to bring the data back within their own jurisdiction; in some cases, there may be institutional policies in place that proscribe their use.


HOW MUCH SPACE DOES A RESEARCHER NEED?

Of course, when it comes to designing future storage provision, the fundamental question that needs to be addressed is ‘How much?’ Estimates from a diverse set of universities of the total amount of storage required, based on what researchers currently hold, ranged from 300TB up to 3.5PB. These figures are likely to be quite inaccurate, as the dispersal of data holdings (one institution’s researchers had, on average, seven different places where data was stored) means that even researchers themselves have difficulty quantifying their data volumes.

There was a general recognition that some kind of control is needed when initially offering managed storage to researchers to prevent the system being swamped by the wholesale migration of low-value data. Even less certain is the volume of data that will need to be retained for the long term and the most appropriate platform and access conditions for universities’ data collections.


OVERCOMING THE ‘PC WORLD EFFECT’

Convincing researchers to use managed storage at a cost of £800-1000 per TB per year can be difficult when any of them can purchase a terabyte hard drive for £60 and run their own ad hoc system. Many institutions directly addressing research data storage infrastructure are seeking to overcome this barrier by providing a certain amount of storage for free and then charging for any usage over and above that provision.

The amounts being offered varied from 100GB to 5TB per researcher, but in each case the expectation was that this quantity would exceed the requirements of the average user.
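
As a sketch of how such a charging model might work in practice (the 1TB free allowance and the £800/TB overage rate are assumptions picked from within the ranges mentioned above):

    # Illustrative charging model: a free allowance per researcher, with usage
    # beyond it charged at the managed-storage rate. Both figures are
    # assumptions drawn from the ranges discussed above.
    FREE_ALLOWANCE_TB = 1.0   # free quota per researcher
    OVERAGE_RATE = 800        # GBP per TB per year above the allowance

    def annual_charge(usage_tb):
        return max(0.0, usage_tb - FREE_ALLOWANCE_TB) * OVERAGE_RATE

    print(annual_charge(0.4))   # 0.0    -- within the free allowance
    print(annual_charge(3.5))   # 2000.0 -- 2.5 TB over, at 800 GBP/TB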


QUANTIFYING GROWTH

Of course, estimates of current holdings aren’t sufficient to build an accurate projection of future requirements; there also needs to be some understanding of the likely scale of growth in data generation.

Again, there seemed to be a fair amount of agreement over the rate of growth, with several institutions quoting a figure of around 25% per year. For some, this was considered far too low; one presenter suggested that using Moore’s law to calculate the rate of growth might still underestimate the scale.
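
To put those two growth assumptions side by side (the 1PB starting volume and the five-year horizon are illustrative):

    # Compare 25%-per-year growth with a Moore's-law-style doubling roughly
    # every two years. Starting volume and horizon are illustrative assumptions.
    start_tb = 1000.0   # assumed current holdings (1 PB)

    for year in range(1, 6):
        steady_growth = start_tb * 1.25 ** year
        doubling = start_tb * 2 ** (year / 2)
        print(f"year {year}: 25%/yr -> {steady_growth:.0f} TB, "
              f"doubling every two years -> {doubling:.0f} TB")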


INTEGRATION OF CLOUD SERVICES INTO INSTITUTIONAL OFFERINGS

Most participants recognised the value of some kind of hybrid storage system, in which appropriate use of the cloud boosts the institution’s capabilities. There is certainly a need for flexible, temporary solutions enabling an institution to accommodate peaks in the use of active, managed storage.

In terms of quantities, there was evidence that sudden requests for storage in the region of 10TB are not unusual but are hard to accommodate. Commercial suppliers indicated that they could easily meet this demand and regularly do so for quantities of up to around 160TB. Per-terabyte costs for cloud-based storage appeared to be competitive with the costs of in-house provision, with figures of around £860/TB/year including redundancy. For some institutions, using cloud storage is seen as a way of side-stepping the difficulties of providing an in-house repository; one presenter described the prospect of providing open access to parts of the institution’s systems as ‘scary’.
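
Set against the in-house figures above, the quoted cloud rate looks competitive for a like-for-like volume, though the comparison below deliberately ignores egress and other service charges (the 160TB volume is simply the upper figure suppliers mentioned):

    # Rough yearly comparison of the quoted redundant-storage rates; egress and
    # other service charges are deliberately ignored here.
    IN_HOUSE_RATE = 1000   # GBP per TB per year, two disc copies plus tape
    CLOUD_RATE = 860       # GBP per TB per year including redundancy

    volume_tb = 160        # the upper end of the demand suppliers described
    print(volume_tb * IN_HOUSE_RATE)   # 160000 GBP/year in-house
    print(volume_tb * CLOUD_RATE)      # 137600 GBP/year in the cloud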

Although there is a clear use-case for integrating cloud storage into institutional systems, there was considerable concern surrounding service costs and the financial liabilities of using commercial providers when future patterns of use are only hazily defined. Discussions in the latter half of the workshop revealed that some of the third-party providers are currently serving quite specific markets whose use-cases don’t fully map onto those of the HE sector. Work will be needed on the part of providers to tailor their services to the market, and on the part of institutions to define their needs more accurately.

One area of particular concern was that of egress charges, the cost attached to downloading data held in commercial cloud storage. Institutions indicated that they couldn’t sign up for cloud services without cost ceilings in place to limit liability, whereas commercial suppliers are understandably wary of offering unlimited data access. One supplier doesn’t charge for egress but does have a fair-use policy in place that limits the amount that can be accessed to 5% of the total amount stored per month. Clearly this is appropriate for standard archive or back-up storage but could be of limited use as part of a cloud-based, open-access repository or active data store.
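
To make that fair-use limit concrete (the stored volume and expected downloads below are assumed figures for illustration only):

    # Illustrative check against a 5%-of-stored-volume monthly egress allowance,
    # as described by one supplier. The volumes are assumptions.
    stored_tb = 200.0                          # total data held with the supplier
    monthly_allowance_tb = stored_tb * 0.05    # 10 TB of included egress per month

    expected_downloads_tb = 25.0               # e.g. demand on an open-access repository
    excess = expected_downloads_tb - monthly_allowance_tb
    if excess > 0:
        print(f"Fair-use allowance exceeded by {excess} TB this month")
    else:
        print("Within the fair-use allowance")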

Cost is not the only consideration to take into account: cloud services also need to meet institutional data governance requirements. There was consensus that the physical location of the data should be within the European Economic Area; although policies aren’t harmonised across the zone, this was felt to be an acceptably small risk.

The integration of cloud services with local systems needs to be a seamless experience for the end user. Providing universal authentication and avoiding multiple logins was seen by many as vital for lowering the barriers to use of alternative storage areas.


KEY MESSAGES:

  • Physical costs and pricing models of data storage are understood but are a fraction of the true cost of preserving data in the long term. Curation costs are not well quantified.
  • Storage design is likely to favour tiered, hybrid models.
  • Systems for sharing live data with collaborators are required.
  • Most data curation must remain an institutional responsibility but storage and some preservation actions could be outsourced.
  • Authentication and access issues need to be addressed for cloud services.
  • Cost and use models for research data in the cloud need to be properly developed.  

More about

JISC, JANET, storage, institutional repositories, Cloud, active data storage, infrastructure