Repositories in the Cloud? Why on earth not?!

by Paul Miller on 21 February, 2010 at 18:05 · 3 Comments

To be honest, I’ve never fully understood Higher Education’s penchant for building ‘institutional repositories.’ These frequently under-populated aggregations of academic papers produced by ‘research active’ employees of a particular university appear aligned almost exclusively to vaguely expressed institutional imperatives, and seem largely unrelated to either the selfish aspirations of the contributing authors or the tangible relationships they painstakingly construct with others across their chosen discipline. The ‘repository’ all too often appears a bureaucratic solution to a problem that the supposed beneficiaries do not recognise; a technological aberration that sits outside the conversational flow of the Web to which it is only tenuously attached.

Furthermore, ‘Open Access’ and ‘Repository’ typically go hand in hand. If you support Open Access you need a repository, and if you question the role of repositories you’re in the pocket of evil publishers who want to lock up everything ever written and lease reading rights back to the employers of those who wrote the stuff in the first place.

Nonsense.

Open Access is an important component of today’s scholarly ecosystem. It’s not the only answer, and it’s not perfect, but it does have a significant part to play. Institutions have a role in preserving, disseminating and exploiting the work of their employees, but these are very different tasks that may benefit from different solutions. In too many cases, the repository is by default seen as a preservation mechanism and a dissemination vehicle, and as such it may fail to cost-effectively achieve either aim.

There are some large, well known, and research-intensive institutions where it might be possible to make a compelling argument for projecting a strong institutional image around a single ‘home’ for all of that research output. Never mind, for a moment, that so much research today is the result of inter-institutional collaboration, or that the eminent researcher might wish to take ‘their’ research publications with them as they move from Oxford to Harvard to York during their glittering career.

Alongside those institutions sit a plethora of others where research of equal quality is also being conducted; there just, maybe, isn’t quite as much of it. Bombarded by ‘advice’ and funding, and desperate to keep up with the Russell Group, ever more institutions blindly join the repository cult and wonder why their new toys do not fill to overflowing with the jewels of scholarly erudition.

As research becomes increasingly data-rich, the whole cycle looks set to repeat. The recently released Panton Principles for Open Data in Science are to be welcomed, but I’ll bet the institutional response will all too often be the commissioning of a ‘data repository’ to sit alongside the ‘publication repository’ they already don’t use.

All of which is a rather long-winded way of introducing the fact that Eduserv’s Andy Powell has asked me to facilitate a breakout afternoon on ‘Policy Issues’ at the Repositories in the Cloud event Eduserv and JISC are holding in London on Tuesday.

“This free event, organised jointly by Eduserv and the JISC, will bring together software developers, repository managers, service providers, funding and advisory bodies to discuss the potential policy and technical issues associated with cloud computing and the delivery of repository services in UK HEIs.”

In a post on 11 February, Andy invited participants to share some of their views ahead of the meeting, and on 19 February he wrote about some of his own thoughts.

Like Andy, I struggled somewhat to nail down a coherent set of thoughts about the issue of pushing today’s repositories into the Cloud. On one level, I wonder whether the vast majority of institutions with small (and relatively low traffic) repositories would see much of a tangible efficiency gain or cost saving by moving off an in-house computer to rent an equivalent Virtual Machine from Amazon, Rackspace, or any of their competitors. If we’re talking about IT systems within a typical university, there are others (email, calendaring, pools of compute resource for research jobs, etc) that appear more immediately compelling for the shift Cloud-ward. Which is not to say that there isn’t a clear opportunity for someone trusted to step into this space and offer a SaaS repository to which institutions might affordably subscribe. Eduserv? Mimas? Edina? The British Library? The National Archives? Duraspace? Any could, and if we’re not ready for something more then at least one probably should.

However, a bolder reconsideration of what repositories are and what they’re for might very well lead to something interesting, sustainable, and perfectly suited for benefitting from Cloud Computing’s strengths.

Why does a paper have to be ‘deposited’ in a repository? Why does a single paper with three authors from three institutions have to be deposited in three separate institutional repositories? Why does that same paper have to be deposited – separately – in the subject repository favoured by scholars in the relevant discipline? Why does the institution’s very reasonable desire to protect, preserve, promote and disseminate its excellence mean that it has to run systems in perpetuity that preserve and permit access? Why do we address the fundamentally different (perhaps even contradictory) problems of access and preservation in the same system? Why can’t the individual researcher easily assemble a view across their publication history, regardless of the institution within which they happened to reside as they wrote each paper? Why don’t the assemblages of papers reflect personal, professional and disciplinary relationships, alongside (or instead of) the contractual accident of employee-employer relationships? Why isn’t the wealth of metadata implicit to any publication (authors, subjects, dates, citations, and more) available and actionable, both inside the repository and far beyond it across the Web? Why isn’t there a tight and active association between the paper and the data from which its findings were derived (something for which Internet Archaeology was demonstrating utility a very long time ago)?

Scholarly papers principally comprise text, augmented by the occasional static image. They’re not big, and they don’t tend to change very fast. In many ways, they represent a fairly easy problem set with which to work. As more and more data becomes key to research in a growing number of subject areas, the problems are set to become far larger and far more difficult. For individual universities to even consider replicating the process by which they all ended up with their repositories of text surely seems madness in this data-rich environment. Even with levels of uptake as low as those seen in too many text repositories, the issues of data management, curation, access and dissemination are too great to be sensibly solved in the institutional machine room. Services like InfoChimps and Amazon’s own Public Data Sets offering show some of the ways that we might begin to work with data at scale. Might we, for example, come to recognise, as Amazon has, that it’s actually cheaper and quicker to entrust large data sets to FedEx rather than transmit them over the Internet?
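The FedEx point is easy to check with back-of-envelope arithmetic. As a rough illustration (the data-set size and link speed below are my own assumed figures, not from any specific institution), moving ten terabytes over a fully saturated 100 Mbit/s connection takes more than a week:

```python
# Back-of-envelope comparison of network transfer time for a large data set.
# The 10 TB size and 100 Mbit/s link speed are illustrative assumptions.

def transfer_days(size_tb: float, link_mbps: float) -> float:
    """Days needed to move size_tb terabytes over a link_mbps Mbit/s link."""
    size_bits = size_tb * 1e12 * 8           # terabytes -> bits (decimal TB)
    seconds = size_bits / (link_mbps * 1e6)  # Mbit/s -> bits per second
    return seconds / 86400                   # seconds -> days

# 10 TB over a fully saturated 100 Mbit/s connection:
print(f"{transfer_days(10, 100):.1f} days")  # roughly nine days, before any protocol overhead
```

A courier can move the same ten terabytes of disks across a country overnight, which is precisely the economics behind Amazon’s own physical import/export option for its Public Data Sets.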

‘The answer’ might be some central service for the community, funded by JISC like the Arts & Humanities Data Service (AHDS) of old. Or it might be something different, something nimbler, more responsive, more flexible to individual, institutional, and disciplinary requirements, and something more scalable to new disciplines; institutional support for and use of existing Cloud infrastructures extending far beyond UK Higher Education, aligned with a clear understanding of the separation between preservation and access.

I certainly don’t have all the answers, but I do believe that simply asking whether or not we should move existing repositories to the Cloud is to miss the point. Rather, we should ask what role the Cloud might play in addressing the business requirements to which the institutional repository was our initial – faltering – response. The answer might very well be ‘None,’ but I doubt it.

I look forward to Tuesday’s discussion. I’m not going there to push my personal view that individual institutions frequently shouldn’t be building, running or populating their own repositories at all. I’m going there to facilitate the discussion those in the room want to have, and to learn from their experiences and their perspectives.

Tags: Academic publishing, Amazon Web Services, Andy Powell, Archives, AWS, Cloud computing, Colleges and Universities, Eduserv, Higher Education, infochimps, Institutional repository, JISC, Open access, Open Data, Panton Principles, repcloud, Research, Software as a service
Categories: Cloud computing, Open Data