R and Open Science

Karthik Ram

karthik@ropensci.org




Most science is not reproducible or repeatable, even within the same lab group over time.


Science

Data Life Cycle


Source: Michener, 2006, Ecoinformatics.



Open Science


Open data + code


Source: Wolkovich et al. GCB 2012.




Source: PLOS, 2007




R packages are increasingly showing up in domain journals.



Source: Molecular Ecology, 2012


R Open Science



Open Science needs open source tools


Source: Revolution Analytics, 2010; Nature editorial, 2012

Why R?

The old way...


Why R?

A better way



glm(y ~ -1 + a + c + z + a:z, data = mydata, maxit = 30)


This is reproducible, repeatable, and can serve as an analytic workflow.
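A minimal sketch of running that call end to end. The slide does not show `mydata`, so the data frame below is simulated and purely hypothetical. Fixing the seed is what makes the run repeatable.

```r
# Simulated stand-in for `mydata` (hypothetical; the slide does not show the data).
set.seed(42)
n <- 200
mydata <- data.frame(a = rnorm(n), c = rnorm(n), z = rnorm(n))
mydata$y <- 2 * mydata$a - mydata$c + 0.5 * mydata$z +
  1.5 * mydata$a * mydata$z + rnorm(n, sd = 0.1)

# Same call as on the slide: no intercept, three main effects, one a:z interaction.
fit <- glm(y ~ -1 + a + c + z + a:z, data = mydata, maxit = 30)

# Re-running the script yields the same coefficients because the seed is fixed.
print(round(coef(fit), 2))
```

Keeping the data step and the model call in one script means the whole analysis can be rerun by anyone with the file.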





Wrapping all science APIs




Development team


Carl Boettiger
Karthik Ram
Scott Chamberlain



Advisory team


Duncan Temple Lang
Hadley Wickham
JJ Allaire
Bertram Ludascher
Matt Jones

rOpenSci's Packages

Data repositories



Literature



Metadata




R and APIs

API keys can be stored in a user's .Rprofile

 
	options(MendeleyKey = "uf5daib7wyil7ag5buc")
	options(MendeleyPrivateKey = "faj2os5dyd7jop2fok6")
	options(PlosApiKey = "ef3vip9yak7od3hud4g")
	options(SpringerMetdataKey = "ri9hi7woc6jax4vaf8w")
	





Note: These keys aren't real.
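Once a key is set with `options()`, it can be read back with `getOption()`. A minimal sketch follows. The helper `get_api_key()` is hypothetical and not part of any rOpenSci package, and the key value is a placeholder.

```r
# Placeholder key, set here only so the sketch runs on its own;
# in practice this line would live in ~/.Rprofile.
options(PlosApiKey = "not-a-real-key")

# Hypothetical helper: fetch a key by option name and fail early if it is unset.
get_api_key <- function(name) {
  key <- getOption(name)
  if (is.null(key) || !nzchar(key)) {
    stop("No API key found; set options(", name, " = \"...\") in your .Rprofile")
  }
  key
}

key <- get_api_key("PlosApiKey")
```

Reading keys from options keeps them out of analysis scripts, so the scripts can be shared without exposing credentials.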

Public Library of Science full text - rplos


library(rplos)
plot_throughtime(list("reproducible science"), 500)


Managing bibliography - RMendeley

Manage libraries and measure impact of research

groupDocInfo(mc, 530031, 4344945792)
$abstract
[1] "SUMMARY: Modern biological experiments create vast amounts of data which are geographically distributed. These datasets consist of petabytes of raw data and billions of documents. Yet to the best of our knowledge, a search engine technology that searches and cross-links all different data types in life sciences does not exist.....

$authors
$authors[[1]]
      forename        surname
   "Dominic S" 	"L\xfctjohann" 
# ....
	


Accessing data behind papers - dryad

# Get the URL for a data file
dryaddat <- download_url("10255/dryad.1759")

# Get a file given the URL
file <- dryad_getfile(dryaddat)


Tracking altmetrics - raltmet

Tracks altmetrics across sources such as GitHub, Total Impact, CitedIn, CiteULike, and Stack Overflow.

GitHub(userorg = "ropensci", repo = "rmendeley")
totimp(id = "10.5061/dryad.8671")
stackexchange(ids = 16632)

Mapping biodiversity data - rgbif

distribution <- occurrencelist(sciname = "Danaus plexippus", coordinatestatus = TRUE, maxresults = 1000, latlongdf = TRUE)
Also see CartoDB's powerful mapping capabilities and its R package.


Sharing unpublished data - (figshare)

Using Figshare's new API, it is now possible to share figures, data, and any other object generated in R directly to one's figshare account.


> figshare(data)
# code isn't ready yet but once it is, it will return a persistent identifier






A multi-institution consortium to build infrastructure for open science



DataONE

DataONE creates all the necessary components to support persistent and secure access to earth observation data.




DataONE's upcoming R package will allow users to submit and access data to/from member nodes directly from the console.



Provenance is important for reproducibility




Source: Modified from original version by James Cheney, University of Edinburgh.



Making R provenance aware



DataONE provenance working group and R

Taking an approach similar to knitr where a user can track workflow provenance using hooks.


Using XML to track metadata and maintain provenance traces across runs
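One way to picture the hook idea is a wrapper that records what ran and when. This is a rough base R sketch only, not DataONE's design. The names `prov_log` and `with_provenance()` are hypothetical, and writing the log out as XML is left aside.

```r
# Hypothetical in-memory provenance log (a plain list of records).
prov_log <- list()

# Hypothetical wrapper: run one workflow step and record its name,
# start time, R version, and the function that was called.
with_provenance <- function(step_name, fun, ...) {
  started <- Sys.time()
  result <- fun(...)
  prov_log[[length(prov_log) + 1]] <<- list(
    step = step_name,
    started = format(started, "%Y-%m-%dT%H:%M:%S"),
    r_version = R.version.string,
    call = paste(deparse(substitute(fun)), collapse = " ")
  )
  result
}

x <- with_provenance("simulate", rnorm, 10)
m <- with_provenance("summarise", mean, x)

# Inspect the trace accumulated across both steps.
str(prov_log)
```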


Ideas?





GitHub + Science

Rapid peer-to-peer sharing of code is great for science



R packages early in development can easily be tested, rapidly deployed from GitHub using devtools and revised before submitting to a persistent repository such as CRAN.


library(devtools)
install_github("ropensci/RMendeley")  # current devtools expects "user/repo"



R + collaborative writing


knitr + Markdown


Xie Y (2012). knitr: A general-purpose package for dynamic report generation in R.

knitr + Markdown + GitHub

GitHub automatically renders Markdown and even provides syntax highlighting



knitr + Markdown + GitHub = executable paper



knitr + Markdown + GitHub = pre-publication review



Incorporate citations with R + Markdown


knitcitations

citet(c(Halpern2006 = "10.1111/j.1461-0248.2005.00827.x"))
# then cite in your markdown file
citet("Halpern2006")

# or read citations from a bibtex file which can be automatically generated and updated from services like Mendeley
bib <- read.bibtex("example.bib")
# then cite inline
citet(bib[["knitr"]])

- knitcitations on Carl Boettiger's GitHub
- tutorial


Open notebooks with R

R talks to Dropbox, Amazon S3, WordPress, imgur, and elsewhere in the




Various tools in R can drive data reuse, new collaborations, new tools, novel visualization, and keep the entire research process transparent through open notebooks.

  bit.ly/ORqpuM

Please contact us if you have feedback or ideas for collaborations.

All ropensci projects are on
also on and
