It is a collaborative effort among researchers to compare and evaluate methods and strategies for de novo genome assembly (dnGASP) using data from second-generation sequencing platforms, and is being organized by the National Center for Genome Analysis (CNAG) in Barcelona, Spain. A sister project, RGASP3 (the third incarnation of the RNA-Seq Genome Annotation Assessment Project), is focused on evaluating RNA-Seq read alignment algorithms and will be organized separately by the Centre for Genomic Regulation and the Wellcome Trust Sanger Institute. Both projects will culminate in a joint workshop in Barcelona on April 5-7, 2011, organized in partnership with the International Center for Scientific Debate (ISCD), an initiative of Biocat with support from “la Caixa” Welfare Projects, and supported by additional funds from READNA.
This web portal is the primary point of information and data exchange for dnGASP. Groups interested in participating in RGASP3 should retrieve the relevant information from the CRG.
All groups interested in testing their algorithms for assembly of large genomes from second-generation sequencing data are invited to participate in dnGASP. The format of the project will follow the tradition of the previous "GASP" workshops (CASP, GASP, EGASP, NGASP, RGASP, etc.): a dataset will be provided, submissions of the processed dataset will be solicited (deadline: Feb 15, 2011), the submissions will be evaluated by CNAG, and the aforementioned workshop will be held to discuss the results.
The Genome
The reference genome is an unidentified naturally composed eukaryotic genome of known sequence with the following characteristics:
| Ploidy: | diploid (SNP frequency ~1/1000) |
| Genome size: | ~1.8Gb |
| Chromosome number: | 14 |
| GC content: | ~42% (36-50%) |
| Repeat content: | similar to vertebrate repeat content |
| Derivation: | Our "genome" sequence is derived from sequence assembled by a traditional combination of WGS and clone-based approaches using Sanger technology. An undisclosed transformation was then applied to the sequences to mask their identity, and finally alleles were simulated based on a realistic distribution of variants (SNPs and small indels). The genome additionally contains a minimal amount of purely artificial sequence introduced by the evaluation committee. |
Read Data
Illumina GAIIx reads have been simulated from the genome to a total of 64x coverage. Sequencing errors were introduced according to quality values that were determined empirically from real reads. The 64x coverage is subdivided into four libraries with the following characteristics:
| read length | insert size | kind | coverage |
| 114nt x 2 | 500bp | paired ends (PE) | 44x |
| 36nt x 2 | 3kb | mate pairs (MP) | 8x |
| 36nt x 2 | 5kb | mate pairs (MP) | 8x |
| 36nt x 2 | 10kb | mate pairs (MP) | 4x |
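To give a feel for the size of each library, the figures in the table can be turned into approximate read-pair counts. The sketch below is illustrative arithmetic only: it uses the approximate ~1.8Gb genome size quoted above and assumes that the stated coverage counts the bases of both reads in a pair. The names `GENOME_SIZE`, `LIBRARIES` and `read_pairs` are hypothetical and not part of any dnGASP tooling.

```python
# Approximate genome size (~1.8 Gb, as quoted in the genome table).
GENOME_SIZE = 1_800_000_000

# (library name, read length in nt, coverage) taken from the library table.
LIBRARIES = [
    ("PE 500bp", 114, 44),
    ("MP 3kb", 36, 8),
    ("MP 5kb", 36, 8),
    ("MP 10kb", 36, 4),
]


def read_pairs(read_len, coverage, genome_size=GENOME_SIZE):
    """Approximate number of read pairs for one library.

    Assumes coverage = (pairs * 2 * read_len) / genome_size,
    i.e. both reads of a pair contribute to the coverage figure.
    """
    return coverage * genome_size // (2 * read_len)


# The four libraries should add up to the stated total coverage.
total_coverage = sum(cov for _, _, cov in LIBRARIES)

# Estimated number of read pairs per library.
pairs = {name: read_pairs(length, cov) for name, length, cov in LIBRARIES}

if __name__ == "__main__":
    print(f"total coverage: {total_coverage}x")
    for name, count in pairs.items():
        print(f"{name}: ~{count:,} read pairs")
```

Under these assumptions the paired-end library accounts for most of the data, both in coverage and in number of reads.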
The read data can be downloaded here: DOWNLOAD
Submissions
We will ask that one assembly be submitted for each of the following:
- level 1: PE
- level 2: PE + 3kb MP
- level 3: PE + 3kb + 5kb MP
- level 4: PE + 3kb + 5kb + 10kb MP
Each assembly submission will consist of a single multi-FASTA file of scaffolds (including all scaffolded and unscaffolded contigs). Gaps of estimated size are to be represented by strings of N’s of length equal to the most likely gap size. Scaffolds that do not contain at least one contig of length 115 nt (read length plus one) or greater will not be considered for evaluation. We reserve the right to apply a higher length threshold during evaluation if it is deemed necessary to make the analysis fairer or more tractable.
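Participants may want to check their scaffolds against the length rule before uploading. The following is a minimal sketch of such a check, not the evaluation committee's code: it assumes that contigs within a scaffold are separated by runs of `N` (upper or lower case), and the helper names `read_fasta`, `contigs_of` and `is_evaluable` are hypothetical.

```python
import re

MIN_CONTIG_LEN = 115  # read length (114 nt) plus one


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def contigs_of(scaffold_seq):
    """Split a scaffold into contigs at runs of N (gap placeholders)."""
    return [c for c in re.split(r"[Nn]+", scaffold_seq) if c]


def is_evaluable(scaffold_seq, min_len=MIN_CONTIG_LEN):
    """True if the scaffold has at least one contig of min_len nt or more."""
    return any(len(c) >= min_len for c in contigs_of(scaffold_seq))


if __name__ == "__main__":
    example = [
        ">scaffold_1",
        "A" * 120 + "N" * 50 + "C" * 40,
        ">scaffold_2",
        "G" * 114,
    ]
    for name, seq in read_fasta(example):
        print(name, "kept" if is_evaluable(seq) else "dropped")
```

Because the threshold may be raised during evaluation, `min_len` is left as a parameter rather than hard-coded into the check.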
Deadline
March 1st 23:59 GMT. Files are to be uploaded through the project web site.
Evaluation
We will evaluate all submitted assemblies, up to a maximum of one per level per participant, according to several criteria. In addition to using the standard measures of assembly quality (N50/N90 and the largest/mean/median contig/scaffold size), we will leverage the fact that the reference genome sequence is fully known, as is the position from which each read was simulated. Assemblies will be aligned to the reference genome using established alignment methods, and various measures will be calculated to quantify the level of completeness (e.g. coverage, number of gaps closed, N50 of aligned contigs, etc.) and correctness (e.g. synteny, accuracy of gap size estimation). Additionally, we will measure the ability to bridge different types of repeats (both “naturally” occurring and artificially constructed and embedded) varying in repeat unit size, total length, copy number within the genome and amount of variation. Finally, we will analyze other factors that might impact assembly quality, such as variation in GC content or SNP rate.
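For reference, the N50 and N90 statistics mentioned above can be computed as in the sketch below. It uses the common definition, in which Nx is the length of the shortest sequence in the smallest set of longest sequences that together contain at least x% of the assembled bases. Whether the evaluation normalizes by the assembly total (as assumed here) or by the reference genome size is not stated, and the function name `nx` is hypothetical.

```python
def nx(lengths, x=50):
    """Return the Nx statistic for a list of contig or scaffold lengths.

    Sequences are taken from longest to shortest until they cover at
    least x percent of the total assembled length; the length of the
    last sequence taken is the Nx value. Returns 0 for an empty list.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding issues.
        if running * 100 >= total * x:
            return length
    return 0


if __name__ == "__main__":
    scaffold_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print("N50:", nx(scaffold_lengths, 50))
    print("N90:", nx(scaffold_lengths, 90))
```

The same function applies to contigs and to scaffolds; only the input list of lengths changes.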