HAVANA

Background

The Sanger Institute has contributed substantially to many vertebrate genome sequences, including all or part of human chromosomes 1, 6, 9, 10, 13, 20, 22 and X, mouse chromosomes 2, 4, 11 and X, and the full Danio rerio (zebrafish) genome sequence. The Institute has also sequenced, or continues to sequence, selected parts of other vertebrate genomes, including candidate diabetes gene regions (in reference and non-obese diabetic (NOD) mouse strains) and MHC regions (in wallaby, Tasmanian devil, gorilla, dog, pig, human haplotypes and mouse strains). The HAVANA group provides the manual annotation for these and other genome sequences.

Collaborations

The HAVANA group collaborates with others in both small and large projects. The largest projects are designed to annotate the entire human genome and the majority of coding genes in mouse. The following are the main HAVANA collaborations relating to these projects:

ENCODE (Encyclopedia of DNA Elements) and GENCODE

The ENCODE and GENCODE projects provide in-depth, coordinated analysis of the entire human genome using experimental, computational and manual techniques. HAVANA manual annotation serves as the reference annotation underlying this global project. Continuous feedback between collaborators working on the three different aspects encourages refinement of all techniques involved.
ENCODE website

CCDS (Consensus Coding Sequence)

CCDS is a collaboration between the Sanger Institute (Ensembl, VEGA, HAVANA), UCSC (Genome Bioinformatics Group) and NCBI (RefSeq). CCDS strives to provide a comprehensive database of high-quality coding regions from the human and mouse genomes on which all collaborators agree. Annotation from the Sanger Institute and from RefSeq, produced using different techniques, is compared, and a CCDS entry is created when the two agree on the coding sequence structure for a given transcript or locus. Conflicts are discussed by all three parties and, where a consensus can be reached, a CCDS entry is created.
CCDS website
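
The comparison step described above can be illustrated with a minimal Python sketch. It is not the CCDS pipeline itself: the data model (a coding structure as chromosome, strand and a set of CDS exon intervals), the function names and all identifiers and coordinates are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Assumed data model: (chromosome, strand, ordered CDS exon intervals).
CodingStructure = Tuple[str, str, Tuple[Tuple[int, int], ...]]


def coding_structure(chrom: str, strand: str,
                     cds_exons: List[Tuple[int, int]]) -> CodingStructure:
    """Normalise a CDS so that the input order of exons does not matter."""
    return (chrom, strand, tuple(sorted(cds_exons)))


def consensus_candidates(
    set_a: Dict[str, CodingStructure],
    set_b: Dict[str, CodingStructure],
) -> List[Tuple[str, str]]:
    """Return (id_a, id_b) pairs whose coding structures are identical."""
    # Index the second annotation set by structure for direct lookup.
    by_structure: Dict[CodingStructure, List[str]] = {}
    for id_b, structure in set_b.items():
        by_structure.setdefault(structure, []).append(id_b)

    pairs = []
    for id_a, structure in sorted(set_a.items()):
        for id_b in sorted(by_structure.get(structure, [])):
            pairs.append((id_a, id_b))
    return pairs


# Hypothetical identifiers and coordinates, for illustration only.
manual = {
    "manual_tx1": coding_structure("chr1", "+", [(100, 200), (300, 450)]),
    "manual_tx2": coding_structure("chr1", "+", [(100, 200), (320, 450)]),
}
refseq_like = {
    "other_tx1": coding_structure("chr1", "+", [(300, 450), (100, 200)]),
    "other_tx2": coding_structure("chr1", "-", [(100, 200), (320, 450)]),
}

matches = consensus_candidates(manual, refseq_like)
print(matches)  # [('manual_tx1', 'other_tx1')]
```

In this sketch only exact agreement yields a candidate; models that differ in any exon boundary or in strand are left out, which corresponds to the cases the text says are discussed between the collaborators.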

IKMC (International Knockout Mouse Consortium)

IKMC is a collaboration between the three main mouse knockout projects: EUCOMM (European Conditional Mouse Mutagenesis), KOMP (Knockout Mouse Project) and NorCOMM (North American Conditional Mouse Mutagenesis). Manual annotation by the HAVANA group and collaborators at Washington University, St Louis, and University of Manitoba, Winnipeg, serves as the foundation for constructing knockout mouse cell lines for every coding gene.
IKMC website

GRC (Genome Reference Consortium)

A collaboration between the Wellcome Trust Sanger Institute, the Genome Center at WashU, the EBI and the NCBI, the GRC aims to provide the best possible genome assemblies for human, mouse and zebrafish. It does so by investigating potential variation, errors, conflicts and sequence gaps with a view to choosing the best representation (or multiple representations) of variant sequence, correcting errors, resolving conflicts and filling in gaps. HAVANA's role is to report and feed back any of these issues affecting genes in the three species.
GRC website

Flow of information between HAVANA (blue and red shapes), collaborators and databases. Thick arrows are direct collaborations; thin arrows show indirect feeding of HAVANA annotation back into the analysis pipeline.

Annotation

HAVANA annotation is publicly available from the following websites:

VEGA
Ensembl
UCSC

The HAVANA group puts special emphasis on splice variants and pseudogenes, two areas still underdeveloped in automated annotation systems, as well as poly-adenylation features. Also, where other systems concentrate on, or are limited to, protein-coding genes, many HAVANA transcripts are annotated without a protein-coding region. These transcripts may function as non-coding RNAs or they may be incomplete gene fragments for which the coding sequence cannot yet be determined.

The HAVANA group requires that all annotated gene structures (transcripts) are supported by transcriptional evidence from cDNA, EST or protein sequences. As a consequence, not all annotated transcripts are necessarily complete. Supporting evidence does not need to be locus-specific; it can also be homologous, paralogous or orthologous.
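
As a rough illustration of this evidence rule, the following Python sketch flags transcript models that lack supporting evidence. It does not reflect the data model of the annotation software; the class names, fields and accessions are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Evidence types named in the text above.
ACCEPTED_EVIDENCE_TYPES = {"cDNA", "EST", "protein"}


@dataclass
class Evidence:
    accession: str                    # hypothetical accession string
    kind: str                         # e.g. "cDNA", "EST" or "protein"
    relation: str = "locus-specific"  # or homologous/paralogous/orthologous


@dataclass
class TranscriptModel:
    name: str
    evidence: List[Evidence] = field(default_factory=list)


def is_supported(model: TranscriptModel) -> bool:
    """True if at least one piece of evidence is of an accepted type.

    The relation is deliberately not checked: support may come from
    another locus or another species.
    """
    return any(e.kind in ACCEPTED_EVIDENCE_TYPES for e in model.evidence)


def unsupported(models: List[TranscriptModel]) -> List[str]:
    """Names of models that lack any accepted supporting evidence."""
    return [m.name for m in models if not is_supported(m)]


# Hypothetical example models.
models = [
    TranscriptModel("tx_a", [Evidence("EST_0001", "EST", "orthologous")]),
    TranscriptModel("tx_b", [Evidence("PRED_0001", "gene prediction")]),
]

print(unsupported(models))  # ['tx_b']
```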

While transcript and protein sequences are the most important pieces of information, HAVANA annotation also takes into account other data, such as CpG islands, gene predictions, repeats and genome signatures. Because the annotation software is DAS (Distributed Annotation System) aware, the HAVANA team can link to external data sources. Ensembl gene models and data from GENCODE collaborators are among the DAS sources the HAVANA group uses. HAVANA sources are under constant review and subject to change. For example, the group recently started to use data from new technologies such as RNA-seq and protein mass spectrometry in its annotation efforts.

Annotation guidelines

Like its data sources, HAVANA's annotation guidelines are under constant review and are routinely updated to take into account feedback from collaborators, incorporate new data sources and reflect new trends in genetics, transcriptomics, proteomics and genomics.

HAVANA Annotation guidelines detail our annotation standards.

Otterlace

We use the in-house developed and maintained Otterlace annotation suite for manual annotation. This suite comprises an automated analysis pipeline based on the Ensembl pipeline, graphical interfaces for viewing the pipeline results and interfaces for creating and modifying transcript models. The figure shows a selection of user interfaces from Otterlace.

Annotation interfaces in Otterlace

The Otterlace user manual gives guidance on how to use the Otterlace interfaces.

Nomenclature

As well as building accurate transcript models, it is important to use the correct gene nomenclature. To maintain consistency in an annotation database, which is especially important when working with syntenic regions across species or haplotypes within a single species, the HAVANA annotation group interacts closely with the nomenclature committees for the human, mouse and zebrafish genomes.

Human genome nomenclature
Mouse genome nomenclature
Zebrafish genome nomenclature

Publications

• Journal papers

citations per annum of HAVANA (co-)authored publications

Fine mapping of type 1 diabetes regions Idd9.1 and Idd9.2 reveals genetic complexity.

Hamilton-Williams EE, Rainbow DB, Cheung J, Christensen M, Lyons PA, Peterson LB, Steward CA, Sherman LA and Wicker LS

Mammalian genome : official journal of the International Mammalian Genome Society 2013;24;9-10;358-75

PUBMED: 23934554; PMC: 3824839; DOI: 10.1007/s00335-013-9466-y

The zebrafish reference genome sequence and its relationship to the human genome.

Howe K, Clark MD, Torroja CF, Torrance J, Berthelot C, Muffato M, Collins JE, Humphray S, McLaren K, Matthews L, McLaren S, Sealy I, Caccamo M, Churcher C, Scott C, Barrett JC, Koch R, Rauch GJ, White S, Chow W, Kilian B, Quintais LT, Guerra-Assunção JA, Zhou Y, Gu Y, Yen J, Vogel JH, Eyre T, Redmond S, Banerjee R, Chi J, Fu B, Langley E, Maguire SF, Laird GK, Lloyd D, Kenyon E, Donaldson S, Sehra H, Almeida-King J, Loveland J, Trevanion S, Jones M, Quail M, Willey D, Hunt A, Burton J, Sims S, McLay K, Plumb B, Davis J, Clee C, Oliver K, Clark R, Riddle C, Elliot D, Eliott D, Threadgold G, Harden G, Ware D, Begum S, Mortimore B, Mortimer B, Kerry G, Heath P, Phillimore B, Tracey A, Corby N, Dunn M, Johnson C, Wood J, Clark S, Pelan S, Griffiths G, Smith M, Glithero R, Howden P, Barker N, Lloyd C, Stevens C, Harley J, Holt K, Panagiotidis G, Lovell J, Beasley H, Henderson C, Gordon D, Auger K, Wright D, Collins J, Raisen C, Dyer L, Leung K, Robertson L, Ambridge K, Leongamornlert D, McGuire S, Gilderthorp R, Griffiths C, Manthravadi D, Nichol S, Barker G, Whitehead S, Kay M, Brown J, Murnane C, Gray E, Humphries M, Sycamore N, Barker D, Saunders D, Wallis J, Babbage A, Hammond S, Mashreghi-Mohammadi M, Barr L, Martin S, Wray P, Ellington A, Matthews N, Ellwood M, Woodmansey R, Clark G, Cooper J, Cooper J, Tromans A, Grafham D, Skuce C, Pandian R, Andrews R, Harrison E, Kimberley A, Garnett J, Fosker N, Hall R, Garner P, Kelly D, Bird C, Palmer S, Gehring I, Berger A, Dooley CM, Ersan-Ürün Z, Eser C, Geiger H, Geisler M, Karotki L, Kirn A, Konantz J, Konantz M, Oberländer M, Rudolph-Geiger S, Teucke M, Lanz C, Raddatz G, Osoegawa K, Zhu B, Rapp A, Widaa S, Langford C, Yang F, Schuster SC, Carter NP, Harrow J, Ning Z, Herrero J, Searle SM, Enright A, Geisler R, Plasterk RH, Lee C, Westerfield M, de Jong PJ, Zon LI, Postlethwait JH, Nüsslein-Volhard C, Hubbard TJ, Roest Crollius H, Rogers J and Stemple DL

Nature 2013;496;7446;498-503

PUBMED: 23594743; PMC: 3703927; DOI: 10.1038/nature12111

Ensembl 2013.

Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S, Gil L, García-Girón C, Gordon L, Hourlier T, Hunt S, Juettemann T, Kähäri AK, Keenan S, Komorowska M, Kulesha E, Longden I, Maurel T, McLaren WM, Muffato M, Nag R, Overduin B, Pignatelli M, Pritchard B, Pritchard E, Riat HS, Ritchie GR, Ruffier M, Schuster M, Sheppard D, Sobral D, Taylor K, Thormann A, Trevanion S, White S, Wilder SP, Aken BL, Birney E, Cunningham F, Dunham I, Harrow J, Herrero J, Hubbard TJ, Johnson N, Kinsella R, Parker A, Spudich G, Yates A, Zadissa A and Searle SM

Nucleic acids research 2013;41;Database issue;D48-55

PUBMED: 23203987; PMC: 3531136; DOI: 10.1093/nar/gks1236

The non-obese diabetic mouse sequence, annotation and variation resource: an aid for investigating type 1 diabetes.

Steward CA, Gonzalez JM, Trevanion S, Sheppard D, Kerry G, Gilbert JG, Wicker LS, Rogers J and Harrow JL

Database : the journal of biological databases and curation 2013;2013;bat032

PUBMED: 23729657; PMC: 3668384; DOI: 10.1093/database/bat032

Sequencing and comparative analysis of the gorilla MHC genomic sequence.

Wilming LG, Hart EA, Coggill PC, Horton R, Gilbert JG, Clee C, Jones M, Lloyd C, Palmer S, Sims S, Whitehead S, Wiley D, Beck S and Harrow JL

Database : the journal of biological databases and curation 2013;2013;bat011

PUBMED: 23589541; PMC: 3626023; DOI: 10.1093/database/bat011

Structural and functional annotation of the porcine immunome.

Dawson HD, Loveland JE, Pascal G, Gilbert JG, Uenishi H, Mann KM, Sang Y, Zhang J, Carvalho-Silva D, Hunt T, Hardy M, Hu Z, Zhao SH, Anselmo A, Shinkai H, Chen C, Badaoui B, Berman D, Amid C, Kay M, Lloyd D, Snow C, Morozumi T, Cheng RP, Bystrom M, Kapetanovic R, Schwartz JC, Kataria R, Astley M, Fritz E, Steward C, Thomas M, Wilming L, Toki D, Archibald AL, Bed'Hom B, Beraldi D, Huang TH, Ait-Ali T, Blecha F, Botti S, Freeman TC, Giuffra E, Hume DA, Lunney JK, Murtaugh MP, Reecy JM, Harrow JL, Rogel-Gaillard C and Tuggle CK

BMC genomics 2013;14;332

PUBMED: 23676093; PMC: 3658956; DOI: 10.1186/1471-2164-14-332

The B10 Idd9.3 locus mediates accumulation of functionally superior CD137(+) regulatory T cells in the nonobese diabetic type 1 diabetes model.

Kachapati K, Adams DE, Wu Y, Steward CA, Rainbow DB, Wicker LS, Mittler RS and Ridgway WM

Journal of immunology (Baltimore, Md. : 1950) 2012;189;10;5001-15

PUBMED: 23066155; PMC: 3505683; DOI: 10.4049/jimmunol.1101013

Analyses of pig genomes provide insight into porcine demography and evolution.

Groenen MA, Archibald AL, Uenishi H, Tuggle CK, Takeuchi Y, Rothschild MF, Rogel-Gaillard C, Park C, Milan D, Megens HJ, Li S, Larkin DM, Kim H, Frantz LA, Caccamo M, Ahn H, Aken BL, Anselmo A, Anthon C, Auvil L, Badaoui B, Beattie CW, Bendixen C, Berman D, Blecha F, Blomberg J, Bolund L, Bosse M, Botti S, Bujie Z, Bystrom M, Capitanu B, Carvalho-Silva D, Chardon P, Chen C, Cheng R, Choi SH, Chow W, Clark RC, Clee C, Crooijmans RP, Dawson HD, Dehais P, De Sapio F, Dibbits B, Drou N, Du ZQ, Eversole K, Fadista J, Fairley S, Faraut T, Faulkner GJ, Fowler KE, Fredholm M, Fritz E, Gilbert JG, Giuffra E, Gorodkin J, Griffin DK, Harrow JL, Hayward A, Howe K, Hu ZL, Humphray SJ, Hunt T, Hornshøj H, Jeon JT, Jern P, Jones M, Jurka J, Kanamori H, Kapetanovic R, Kim J, Kim JH, Kim KW, Kim TH, Larson G, Lee K, Lee KT, Leggett R, Lewin HA, Li Y, Liu W, Loveland JE, Lu Y, Lunney JK, Ma J, Madsen O, Mann K, Matthews L, McLaren S, Morozumi T, Murtaugh MP, Narayan J, Nguyen DT, Ni P, Oh SJ, Onteru S, Panitz F, Park EW, Park HS, Pascal G, Paudel Y, Perez-Enciso M, Ramirez-Gonzalez R, Reecy JM, Rodriguez-Zas S, Rohrer GA, Rund L, Sang Y, Schachtschneider K, Schraiber JG, Schwartz J, Scobie L, Scott C, Searle S, Servin B, Southey BR, Sperber G, Stadler P, Sweedler JV, Tafer H, Thomsen B, Wali R, Wang J, Wang J, White S, Xu X, Yerle M, Zhang G, Zhang J, Zhang J, Zhao S, Rogers J, Churcher C and Schook LB

Nature 2012;491;7424;393-8

PUBMED: 23151582; PMC: 3566564; DOI: 10.1038/nature11622

An integrated encyclopedia of DNA elements in the human genome.

ENCODE Project Consortium

Nature 2012;489;7414;57-74

PUBMED: 22955616; PMC: 3439153; DOI: 10.1038/nature11247

Landscape of transcription in human cells.

Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, Tanzer A, Lagarde J, Lin W, Schlesinger F, Xue C, Marinov GK, Khatun J, Williams BA, Zaleski C, Rozowsky J, Röder M, Kokocinski F, Abdelhamid RF, Alioto T, Antoshechkin I, Baer MT, Bar NS, Batut P, Bell K, Bell I, Chakrabortty S, Chen X, Chrast J, Curado J, Derrien T, Drenkow J, Dumais E, Dumais J, Duttagupta R, Falconnet E, Fastuca M, Fejes-Toth K, Ferreira P, Foissac S, Fullwood MJ, Gao H, Gonzalez D, Gordon A, Gunawardena H, Howald C, Jha S, Johnson R, Kapranov P, King B, Kingswood C, Luo OJ, Park E, Persaud K, Preall JB, Ribeca P, Risk B, Robyr D, Sammeth M, Schaffer L, See LH, Shahab A, Skancke J, Suzuki AM, Takahashi H, Tilgner H, Trout D, Walters N, Wang H, Wrobel J, Yu Y, Ruan X, Hayashizaki Y, Harrow J, Gerstein M, Hubbard T, Reymond A, Antonarakis SE, Hannon G, Giddings MC, Ruan Y, Wold B, Carninci P, Guigó R and Gingeras TR

Nature 2012;489;7414;101-8

PUBMED: 22955620; PMC: 3684276; DOI: 10.1038/nature11233

The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression.

Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, Guernec G, Martin D, Merkel A, Knowles DG, Lagarde J, Veeravalli L, Ruan X, Ruan Y, Lassmann T, Carninci P, Brown JB, Lipovich L, Gonzalez JM, Thomas M, Davis CA, Shiekhattar R, Gingeras TR, Hubbard TJ, Notredame C, Harrow J and Guigó R

Genome research 2012;22;9;1775-89

PUBMED: 22955988; PMC: 3431493; DOI: 10.1101/gr.132159.111

Combining RT-PCR-seq and RNA-seq to catalog all genic elements encoded in the human genome.

Howald C, Tanzer A, Chrast J, Kokocinski F, Derrien T, Walters N, Gonzalez JM, Frankish A, Aken BL, Hourlier T, Vogel JH, White S, Searle S, Harrow J, Hubbard TJ, Guigó R and Reymond A

Genome research 2012;22;9;1698-710

PUBMED: 22955982; PMC: 3431487; DOI: 10.1101/gr.134478.111

Comparative proteomics reveals a significant bias toward alternative protein isoforms with conserved structure and function.

Ezkurdia I, del Pozo A, Frankish A, Rodriguez JM, Harrow J, Ashman K, Valencia A and Tress ML

Molecular biology and evolution 2012;29;9;2265-83

PUBMED: 22446687; PMC: 3424414; DOI: 10.1093/molbev/mss100

GENCODE: the reference human genome annotation for The ENCODE Project.

Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S, Barnes I, Bignell A, Boychenko V, Hunt T, Kay M, Mukherjee G, Rajan J, Despacio-Reyes G, Saunders G, Steward C, Harte R, Lin M, Howald C, Tanzer A, Derrien T, Chrast J, Walters N, Balasubramanian S, Pei B, Tress M, Rodriguez JM, Ezkurdia I, van Baren J, Brent M, Haussler D, Kellis M,
