Background
The Sanger Institute has made major contributions to many vertebrate genome sequences, including all or part of human chromosomes 1, 6, 9, 10, 13, 20, 22 and X and mouse chromosomes 2, 4, 11 and X, and the full Danio rerio (zebrafish) genome sequence. The Institute has also sequenced, or continues to sequence, selected parts of other vertebrate genomes, including candidate diabetes gene regions (in reference and non-obese diabetic (NOD) mouse strains) and MHC regions (in wallaby, Tasmanian devil, gorilla, dog, pig, human haplotypes and mouse strains). The HAVANA group provides the manual annotation for these and other genome sequences.
Collaborations
The HAVANA group collaborates with others on both small and large projects. The largest projects aim to annotate the entire human genome and the majority of coding genes in mouse. The following are the main HAVANA collaborations relating to these projects:
ENCODE (Encyclopedia of DNA Elements) and GENCODE
The ENCODE and GENCODE projects provide in-depth, coordinated analysis of the entire human genome using experimental,
computational and manual techniques. HAVANA manual annotation serves as the reference annotation underlying this
global project. Continuous feedback between collaborators working on the three different aspects encourages
refinement of all techniques involved.
ENCODE website
CCDS (Consensus Coding Sequence)
CCDS is a collaboration between the Sanger Institute (Ensembl, VEGA, HAVANA), UCSC (Genome Bioinformatics Group) and
NCBI (RefSeq). CCDS strives to provide a comprehensive database of high-quality coding regions from the human and
mouse genomes, agreed on by all collaborators. Annotations from the Sanger Institute and RefSeq, which are created using
different techniques, are compared, and a CCDS entry is created when the two agree on the coding sequence structure for
a given transcript or locus. Conflicts are discussed between all three parties and, where a consensus can be reached,
a CCDS entry is created.
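The comparison step described above can be illustrated with a minimal Python sketch. It is not the CCDS pipeline: the data model (chromosome, strand and a list of CDS intervals), the function names and the transcript identifiers are all hypothetical, and "agreement" is assumed here to mean identical CDS coordinates on the same chromosome and strand.

```python
from typing import Dict, List, Tuple

# Hypothetical data model: a coding structure is a chromosome, a strand and
# the CDS intervals of one transcript, sorted so that order does not matter.
Interval = Tuple[int, int]
CodingStructure = Tuple[str, str, Tuple[Interval, ...]]


def coding_structure(chrom: str, strand: str,
                     cds_intervals: List[Interval]) -> CodingStructure:
    """Normalise a transcript's CDS into a hashable, order-independent key."""
    return (chrom, strand, tuple(sorted(cds_intervals)))


def consensus_candidates(set_a: Dict[str, CodingStructure],
                         set_b: Dict[str, CodingStructure]) -> List[Tuple[str, str]]:
    """Return pairs of transcript IDs whose coding structures are identical."""
    # Index the second annotation set by coding structure.
    by_structure: Dict[CodingStructure, List[str]] = {}
    for transcript_id, structure in set_b.items():
        by_structure.setdefault(structure, []).append(transcript_id)

    # Look up every transcript of the first set in that index.
    pairs = []
    for transcript_id, structure in set_a.items():
        for other_id in by_structure.get(structure, []):
            pairs.append((transcript_id, other_id))
    return sorted(pairs)


# Made-up example: A1 and B1 share the same CDS, A2 differs in its last exon.
set_a = {
    "A1": coding_structure("chr1", "+", [(100, 200), (300, 400)]),
    "A2": coding_structure("chr1", "+", [(100, 200), (300, 450)]),
}
set_b = {
    "B1": coding_structure("chr1", "+", [(300, 400), (100, 200)]),
    "B2": coding_structure("chr2", "-", [(50, 90)]),
}
matches = consensus_candidates(set_a, set_b)
print(matches)
```

In this sketch only the A1/B1 pair would qualify as a consensus candidate; transcripts that disagree, such as A2, correspond to the conflicts that the collaborators discuss manually.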
CCDS website
IKMC (International Knockout Mouse Consortium)
IKMC is a collaboration between the three main mouse knockout projects: EUCOMM (European Conditional Mouse
Mutagenesis), KOMP (Knockout Mouse Project) and NorCOMM (North American Conditional Mouse Mutagenesis). Manual
annotation by the HAVANA group and collaborators at Washington University, St Louis, and University of Manitoba,
Winnipeg, serves as the foundation for constructing knockout mouse cell lines for every coding gene.
IKMC website
GRC (Genome Reference Consortium)
A collaboration between the Wellcome Trust Sanger Institute, the Genome Center at WashU, the EBI and the NCBI, the
GRC aims to provide the best possible genome assemblies for human, mouse and zebrafish. It does so by investigating
potential variation, errors, conflicts and sequence gaps with a view to choosing the best or multiple representations
of variant sequence, correcting errors, resolving conflicts and filling in gaps. HAVANA's role is to report and feed
back any of these issues affecting genes in the three species.
GRC website
Flow of information between HAVANA (blue and red shapes), collaborators and databases. Thick arrows show direct collaborations; thin arrows show indirect feeding of HAVANA annotation back into the analysis pipeline.
Annotation
HAVANA annotation is publicly available from the following websites:
- VEGA
- Ensembl
- UCSC
The HAVANA group puts special emphasis on splice variants and pseudogenes, two areas still underdeveloped in automated annotation systems, as well as poly-adenylation features. Also, where other systems concentrate on, or are limited to, protein-coding genes, many HAVANA transcripts are annotated without a protein-coding region. These transcripts may function as non-coding RNAs or they may be incomplete gene fragments for which the coding sequence cannot yet be determined.
The HAVANA group requires that all annotated gene structures (transcripts) are supported by transcriptional evidence from cDNA, EST or protein sequences. As a result, not all annotated transcripts are necessarily complete. Support does not need to come from locus-specific evidence; it can also be homologous, paralogous or orthologous.
While transcript and protein sequences are the most important pieces of information, HAVANA annotation also takes into account other data, such as CpG islands, gene predictions, repeats and genome signatures. Because the annotation software used is DAS (Distributed Annotation System) aware, the HAVANA team can link to external data sources. Ensembl gene models and data from GENCODE collaborators are some of the DAS sources the HAVANA group uses. HAVANA sources are under constant review and subject to change. For example, the group recently started to use data from new technologies such as RNA-seq and protein mass spectrometry in its annotation efforts.
Annotation guidelines
Like its data sources, HAVANA's annotation guidelines are under constant review and are routinely updated to take into account feedback from collaborators, incorporate new data sources and reflect new trends in genetics, transcriptomics, proteomics and genomics.
HAVANA Annotation guidelines detail our annotation standards.
Otterlace
We use the Otterlace annotation suite, developed and maintained in-house, for manual annotation. This suite comprises an automated analysis pipeline based on the Ensembl pipeline, graphical interfaces for viewing the pipeline results and interfaces for creating and modifying transcript models. The figure shows a selection of user interfaces from Otterlace.
Annotation interfaces in Otterlace
The Otterlace user manual gives guidance on how to use the Otterlace interfaces.
Nomenclature
As well as building accurate transcript models, it is important to use the correct gene nomenclature. To maintain consistency in an annotation database, which is especially important when working with syntenic regions across species or haplotypes within a single species, the HAVANA annotation group interacts closely with the nomenclature committees for the human, mouse and zebrafish genomes.
- Human genome nomenclature
- Mouse genome nomenclature
- Zebrafish genome nomenclature
Publications
• Journal papers
citations per annum of HAVANA (co-)authored publications
- Fine mapping of type 1 diabetes regions Idd9.1 and Idd9.2 reveals genetic complexity.
Hamilton-Williams EE, Rainbow DB, Cheung J, Christensen M, Lyons PA, Peterson LB, Steward CA, Sherman LA and Wicker LS
Mammalian genome : official journal of the International Mammalian Genome Society 2013;24;9-10;358-75
PUBMED: 23934554; PMC: 3824839; DOI: 10.1007/s00335-013-9466-y
- The zebrafish reference genome sequence and its relationship to the human genome.
Howe K, Clark MD, Torroja CF, Torrance J, Berthelot C, Muffato M, Collins JE, Humphray S, McLaren K, Matthews L, McLaren S, Sealy I, Caccamo M, Churcher C, Scott C, Barrett JC, Koch R, Rauch GJ, White S, Chow W, Kilian B, Quintais LT, Guerra-Assunção JA, Zhou Y, Gu Y, Yen J, Vogel JH, Eyre T, Redmond S, Banerjee R, Chi J, Fu B, Langley E, Maguire SF, Laird GK, Lloyd D, Kenyon E, Donaldson S, Sehra H, Almeida-King J, Loveland J, Trevanion S, Jones M, Quail M, Willey D, Hunt A, Burton J, Sims S, McLay K, Plumb B, Davis J, Clee C, Oliver K, Clark R, Riddle C, Elliot D, Eliott D, Threadgold G, Harden G, Ware D, Begum S, Mortimore B, Mortimer B, Kerry G, Heath P, Phillimore B, Tracey A, Corby N, Dunn M, Johnson C, Wood J, Clark S, Pelan S, Griffiths G, Smith M, Glithero R, Howden P, Barker N, Lloyd C, Stevens C, Harley J, Holt K, Panagiotidis G, Lovell J, Beasley H, Henderson C, Gordon D, Auger K, Wright D, Collins J, Raisen C, Dyer L, Leung K, Robertson L, Ambridge K, Leongamornlert D, McGuire S, Gilderthorp R, Griffiths C, Manthravadi D, Nichol S, Barker G, Whitehead S, Kay M, Brown J, Murnane C, Gray E, Humphries M, Sycamore N, Barker D, Saunders D, Wallis J, Babbage A, Hammond S, Mashreghi-Mohammadi M, Barr L, Martin S, Wray P, Ellington A, Matthews N, Ellwood M, Woodmansey R, Clark G, Cooper J, Cooper J, Tromans A, Grafham D, Skuce C, Pandian R, Andrews R, Harrison E, Kimberley A, Garnett J, Fosker N, Hall R, Garner P, Kelly D, Bird C, Palmer S, Gehring I, Berger A, Dooley CM, Ersan-Ürün Z, Eser C, Geiger H, Geisler M, Karotki L, Kirn A, Konantz J, Konantz M, Oberländer M, Rudolph-Geiger S, Teucke M, Lanz C, Raddatz G, Osoegawa K, Zhu B, Rapp A, Widaa S, Langford C, Yang F, Schuster SC, Carter NP, Harrow J, Ning Z, Herrero J, Searle SM, Enright A, Geisler R, Plasterk RH, Lee C, Westerfield M, de Jong PJ, Zon LI, Postlethwait JH, Nüsslein-Volhard C, Hubbard TJ, Roest Crollius H, Rogers J and Stemple DL
Nature 2013;496;7446;498-503
PUBMED: 23594743; PMC: 3703927; DOI: 10.1038/nature12111
- Ensembl 2013.
Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S, Gil L, García-Girón C, Gordon L, Hourlier T, Hunt S, Juettemann T, Kähäri AK, Keenan S, Komorowska M, Kulesha E, Longden I, Maurel T, McLaren WM, Muffato M, Nag R, Overduin B, Pignatelli M, Pritchard B, Pritchard E, Riat HS, Ritchie GR, Ruffier M, Schuster M, Sheppard D, Sobral D, Taylor K, Thormann A, Trevanion S, White S, Wilder SP, Aken BL, Birney E, Cunningham F, Dunham I, Harrow J, Herrero J, Hubbard TJ, Johnson N, Kinsella R, Parker A, Spudich G, Yates A, Zadissa A and Searle SM
Nucleic acids research 2013;41;Database issue;D48-55
PUBMED: 23203987; PMC: 3531136; DOI: 10.1093/nar/gks1236
- The non-obese diabetic mouse sequence, annotation and variation resource: an aid for investigating type 1 diabetes.
Steward CA, Gonzalez JM, Trevanion S, Sheppard D, Kerry G, Gilbert JG, Wicker LS, Rogers J and Harrow JL
Database : the journal of biological databases and curation 2013;2013;bat032
PUBMED: 23729657; PMC: 3668384; DOI: 10.1093/database/bat032
- Sequencing and comparative analysis of the gorilla MHC genomic sequence.
Wilming LG, Hart EA, Coggill PC, Horton R, Gilbert JG, Clee C, Jones M, Lloyd C, Palmer S, Sims S, Whitehead S, Wiley D, Beck S and Harrow JL
Database : the journal of biological databases and curation 2013;2013;bat011
PUBMED: 23589541; PMC: 3626023; DOI: 10.1093/database/bat011
- Structural and functional annotation of the porcine immunome.
Dawson HD, Loveland JE, Pascal G, Gilbert JG, Uenishi H, Mann KM, Sang Y, Zhang J, Carvalho-Silva D, Hunt T, Hardy M, Hu Z, Zhao SH, Anselmo A, Shinkai H, Chen C, Badaoui B, Berman D, Amid C, Kay M, Lloyd D, Snow C, Morozumi T, Cheng RP, Bystrom M, Kapetanovic R, Schwartz JC, Kataria R, Astley M, Fritz E, Steward C, Thomas M, Wilming L, Toki D, Archibald AL, Bed'Hom B, Beraldi D, Huang TH, Ait-Ali T, Blecha F, Botti S, Freeman TC, Giuffra E, Hume DA, Lunney JK, Murtaugh MP, Reecy JM, Harrow JL, Rogel-Gaillard C and Tuggle CK
BMC genomics 2013;14;332
PUBMED: 23676093; PMC: 3658956; DOI: 10.1186/1471-2164-14-332
- The B10 Idd9.3 locus mediates accumulation of functionally superior CD137(+) regulatory T cells in the nonobese diabetic type 1 diabetes model.
Kachapati K, Adams DE, Wu Y, Steward CA, Rainbow DB, Wicker LS, Mittler RS and Ridgway WM
Journal of immunology (Baltimore, Md. : 1950) 2012;189;10;5001-15
PUBMED: 23066155; PMC: 3505683; DOI: 10.4049/jimmunol.1101013
- Analyses of pig genomes provide insight into porcine demography and evolution.
Groenen MA, Archibald AL, Uenishi H, Tuggle CK, Takeuchi Y, Rothschild MF, Rogel-Gaillard C, Park C, Milan D, Megens HJ, Li S, Larkin DM, Kim H, Frantz LA, Caccamo M, Ahn H, Aken BL, Anselmo A, Anthon C, Auvil L, Badaoui B, Beattie CW, Bendixen C, Berman D, Blecha F, Blomberg J, Bolund L, Bosse M, Botti S, Bujie Z, Bystrom M, Capitanu B, Carvalho-Silva D, Chardon P, Chen C, Cheng R, Choi SH, Chow W, Clark RC, Clee C, Crooijmans RP, Dawson HD, Dehais P, De Sapio F, Dibbits B, Drou N, Du ZQ, Eversole K, Fadista J, Fairley S, Faraut T, Faulkner GJ, Fowler KE, Fredholm M, Fritz E, Gilbert JG, Giuffra E, Gorodkin J, Griffin DK, Harrow JL, Hayward A, Howe K, Hu ZL, Humphray SJ, Hunt T, Hornshøj H, Jeon JT, Jern P, Jones M, Jurka J, Kanamori H, Kapetanovic R, Kim J, Kim JH, Kim KW, Kim TH, Larson G, Lee K, Lee KT, Leggett R, Lewin HA, Li Y, Liu W, Loveland JE, Lu Y, Lunney JK, Ma J, Madsen O, Mann K, Matthews L, McLaren S, Morozumi T, Murtaugh MP, Narayan J, Nguyen DT, Ni P, Oh SJ, Onteru S, Panitz F, Park EW, Park HS, Pascal G, Paudel Y, Perez-Enciso M, Ramirez-Gonzalez R, Reecy JM, Rodriguez-Zas S, Rohrer GA, Rund L, Sang Y, Schachtschneider K, Schraiber JG, Schwartz J, Scobie L, Scott C, Searle S, Servin B, Southey BR, Sperber G, Stadler P, Sweedler JV, Tafer H, Thomsen B, Wali R, Wang J, Wang J, White S, Xu X, Yerle M, Zhang G, Zhang J, Zhang J, Zhao S, Rogers J, Churcher C and Schook LB
Nature 2012;491;7424;393-8
PUBMED: 23151582; PMC: 3566564; DOI: 10.1038/nature11622
- An integrated encyclopedia of DNA elements in the human genome.
ENCODE Project Consortium
Nature 2012;489;7414;57-74
PUBMED: 22955616; PMC: 3439153; DOI: 10.1038/nature11247
- Landscape of transcription in human cells.
Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, Tanzer A, Lagarde J, Lin W, Schlesinger F, Xue C, Marinov GK, Khatun J, Williams BA, Zaleski C, Rozowsky J, Röder M, Kokocinski F, Abdelhamid RF, Alioto T, Antoshechkin I, Baer MT, Bar NS, Batut P, Bell K, Bell I, Chakrabortty S, Chen X, Chrast J, Curado J, Derrien T, Drenkow J, Dumais E, Dumais J, Duttagupta R, Falconnet E, Fastuca M, Fejes-Toth K, Ferreira P, Foissac S, Fullwood MJ, Gao H, Gonzalez D, Gordon A, Gunawardena H, Howald C, Jha S, Johnson R, Kapranov P, King B, Kingswood C, Luo OJ, Park E, Persaud K, Preall JB, Ribeca P, Risk B, Robyr D, Sammeth M, Schaffer L, See LH, Shahab A, Skancke J, Suzuki AM, Takahashi H, Tilgner H, Trout D, Walters N, Wang H, Wrobel J, Yu Y, Ruan X, Hayashizaki Y, Harrow J, Gerstein M, Hubbard T, Reymond A, Antonarakis SE, Hannon G, Giddings MC, Ruan Y, Wold B, Carninci P, Guigó R and Gingeras TR
Nature 2012;489;7414;101-8
PUBMED: 22955620; PMC: 3684276; DOI: 10.1038/nature11233
- The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression.
Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, Guernec G, Martin D, Merkel A, Knowles DG, Lagarde J, Veeravalli L, Ruan X, Ruan Y, Lassmann T, Carninci P, Brown JB, Lipovich L, Gonzalez JM, Thomas M, Davis CA, Shiekhattar R, Gingeras TR, Hubbard TJ, Notredame C, Harrow J and Guigó R
Genome research 2012;22;9;1775-89
PUBMED: 22955988; PMC: 3431493; DOI: 10.1101/gr.132159.111
- Combining RT-PCR-seq and RNA-seq to catalog all genic elements encoded in the human genome.
Howald C, Tanzer A, Chrast J, Kokocinski F, Derrien T, Walters N, Gonzalez JM, Frankish A, Aken BL, Hourlier T, Vogel JH, White S, Searle S, Harrow J, Hubbard TJ, Guigó R and Reymond A
Genome research 2012;22;9;1698-710
PUBMED: 22955982; PMC: 3431487; DOI: 10.1101/gr.134478.111
- Comparative proteomics reveals a significant bias toward alternative protein isoforms with conserved structure and function.
Ezkurdia I, del Pozo A, Frankish A, Rodriguez JM, Harrow J, Ashman K, Valencia A and Tress ML
Molecular biology and evolution 2012;29;9;2265-83
PUBMED: 22446687; PMC: 3424414; DOI: 10.1093/molbev/mss100
- GENCODE: the reference human genome annotation for The ENCODE Project.
Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S, Barnes I, Bignell A, Boychenko V, Hunt T, Kay M, Mukherjee G, Rajan J, Despacio-Reyes G, Saunders G, Steward C, Harte R, Lin M, Howald C, Tanzer A, Derrien T, Chrast J, Walters N, Balasubramanian S, Pei B, Tress M, Rodriguez JM, Ezkurdia I, van Baren J, Brent M, Haussler D, Kellis M,