
Long-read and chromosome-scale assembly of the hexaploid wheat genome achieves higher resolution for research and breeding

bioRxiv 2021
Aury et al

Jean-Marc Aury1,*, Stefan Engelen1, Benjamin Istace1, Cécile Monat2, Pauline Lasserre-Zuber2,
Caroline Belser1, Corinne Cruaud3, Hélène Rimbert2, Philippe Leroy2, Sandrine Arribat4,
Isabelle Dufau4, Arnaud Bellec4, David Grimbichler5, Nathan Papon2, Etienne Paux2,
Marion Ranoux2, Adriana Alberti1,7, Patrick Wincker1, Frédéric Choulet2,*

* corresponding authors
1 Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ
Evry, Université Paris-Saclay, 91057 Evry, France
2 GDEC, Université Clermont Auvergne, INRAE, UMR1095, 63000
Clermont-Ferrand, France
3 Commissariat à l’Energie Atomique (CEA), Institut François Jacob, Genoscope,
F-91057 Evry, France
4 INRAE, CNRGV French Plant Genomic Resource Center, F-31320, Castanet
Tolosan, France
5 Mésocentre Clermont Auvergne, DOSI / Bâtiment Turing, 7 avenue Blaise Pascal,
63178 Aubière CEDEX
6 Current address: Université Paris-Saclay, CEA, CNRS, Institute for Integrative
Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France.
bioRxiv preprint doi: https://doi.org/10.1101/2021.08.24.457458; this version posted August 24, 2021. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
made available under a CC-BY-NC 4.0 International license.
Abstract
The sequencing of the wheat (Triticum aestivum) genome has been a methodological
challenge for many years due to its large size (15.5 Gb), repeat content, and hexaploidy.
Many initiatives aiming to obtain a reference genome of the cultivar Chinese Spring have
been launched in the past years, and this goal was achieved in 2018 through a huge effort to
combine short-read whole genome sequencing with many other resources.
Reference-quality genome assemblies were then produced for other accessions but the
rapid evolution of sequencing technologies offers opportunities to reach high-quality
standards at lower cost. Here, we report an optimized procedure based on long reads
produced on the Oxford Nanopore Technologies (ONT) PromethION device to assemble the
genome of the French bread wheat cultivar Renan. We provide the most contiguous and
complete chromosome-scale assembly of a bread wheat genome to date, a resource that
will be valuable for the crop community and will facilitate the rapid selection of agronomically
important traits. We also provide the methodological standards to generate high-quality
assemblies of complex genomes.
Introduction
Bread wheat (Triticum aestivum) is among the most important cereal crops, and better
knowledge of wheat genomics is needed to face the main challenge of ensuring
food security for a growing population in the context of climate change. Improving productivity
requires both that local producers adapt their practices to increase their climate resilience
and a better understanding of wheat production systems. In this context, better
knowledge of the wheat genome and its gene content, as well as the sequencing of numerous
accessions, is essential.
However, the genome of bread wheat is particularly characterized by its complexity. Indeed,
this hexaploid genome is the result of two interspecific hybridization events. The earliest
cultivated wheat was diploid, but humans have intensified the cultivation of polyploid
species. Recent studies show that these polyploid species appear to be advantaged by their
genomic plasticity1. Indeed, modifications of the gene space and related elements are
buffered by the polyploid nature of wheat and open a wider field to selection. Bread wheat is
composed of three subgenomes A, B and D derived from three ancestral diploid species that
diverged between 2.5 and 6 million years ago2.
The wheat genome is one of the largest among sequenced plant genomes (15.5 Gb), is mainly
composed of repetitive sequences (>85%), and contains many homoeologous regions
between the three subgenomes (A, B and D). Repetitive sequences and polyploidy pose
serious challenges in the generation of genome assemblies. The adventure of sequencing
the hexaploid wheat genome began in 2005 with the creation of the International Wheat
Genome Sequencing Consortium (IWGSC)3. With the advent of new sequencing technologies,
the wheat genome has been competitively sequenced several times4–6. The first
reference-quality genome sequence with a comprehensive annotation was published by the
IWGSC7 in August 2018 for the accession Chinese Spring (CS). This assembly represents a
tremendous resource for the scientific community and offers the promise of facilitating and
accelerating breeding efforts.
More recently, fifteen genomes of hexaploid wheat have been published8 which represents a
new step in the knowledge of the wheat model. Ten of these new wheat genomes have been
assembled at the chromosome level, allowing for comparative analysis on a scale that was
previously impossible. Although a valuable resource, these assemblies have been produced
using short-read technologies and are therefore not up to the quality standard of current
genomes9–13. In 2017, an assembly of the CS genome using long reads was produced5,
although not annotated, highlighting the added value of long reads in such complex
genomes. With the accumulation of long-read assemblies, the scientific community is now aware of
the flaws of short-read strategies: they underestimate the repetitive content of the
genome and, more importantly, can miss tandemly duplicated genes14,15. Several years ago,
Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencing technologies were
commercialized with the promise of sequencing long DNA fragments and revolutionizing
complex genome assemblies.
Here, we report the first hexaploid wheat genome assembly based on ONT long reads. We sequenced
the genome of a French variety (Renan) using the PromethION device and organized the
assembled contigs at the chromosome scale using optical maps (Bionano Genomics, BNG)
and Hi-C libraries (Arima Genomics, AG). This assembly has a contig N50 of 2.2 Mb, which
is a 30-fold improvement over existing chromosome-scale assemblies.
Results
Genome sequencing and optical maps
We sequenced genomic DNA using 20 ONT flow cells (2 MinION and 18 PromethION), which
produced 12M reads representing 1.1 Tb. All reads were originally basecalled using
guppy 2.0 but, given the improvement of the guppy software during our project, we
decided to call bases again using a newer version (version 3.6 with High
Accuracy setting). This dataset represented a 63x coverage of the hexaploid wheat
genome and the read N50 was 24.6 kb. More importantly, we obtained 3.1M reads longer than
50 kb, representing a 14x genome coverage (Table S1). In addition, we generated Illumina
short reads and long-range data for polishing and organizing nanopore contigs, respectively.
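For readers unfamiliar with the metric, the N50 values quoted throughout (reads, contigs, scaffolds) can be computed as in the following minimal Python sketch. It is given for illustration only and is not the code used in this study; the function name n50 is ours.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    The N50 is the length L such that sequences of length >= L
    together contain at least half of the total number of bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one length")


# Toy example: total = 32 bases, so the first two 8-base sequences reach half.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

The same function applies unchanged to read lengths, contig lengths or scaffold lengths.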
We produced an optical map using the Saphyr instrument commercialized by Bionano
Genomics (BNG). High molecular weight DNA was extracted and labeled using the Direct
Label and Stain Chemistry (DLS) with the DLE-1 enzyme. The DLE-1 optical map was
assembled using proprietary tools provided by BNG and had a cumulative size of 14.9 Gb
with an N50 of 37.5 Mb (Table S2). Four Hi-C libraries from two biological replicates were
prepared using the Arima Genomics protocol and sequenced on an Illumina sequencer to
reach 537 Gb i.e., a depth of 35x. We used a sample of 240 million read pairs (72 Gb, 5x) to
build a Hi-C map.
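As a rough check of the Hi-C depth figures above, sequencing depth is the number of sequenced bases divided by the genome size. The short sketch below assumes the 15.5 Gb genome size quoted in the abstract as the denominator and 150-base paired-end reads for the Hi-C subset; both are assumptions made for illustration and are not stated for this calculation in the text.

```python
GENOME_SIZE_BP = 15.5e9  # assumed denominator: genome size quoted in the abstract


def depth(total_bases, genome_size=GENOME_SIZE_BP):
    """Sequencing depth = total sequenced bases / genome size."""
    return total_bases / genome_size


# 240 million read pairs, assuming 2 x 150 bases per pair.
hic_subset_bases = 240e6 * 2 * 150

print(hic_subset_bases / 1e9)           # size of the subset in Gb
print(round(depth(537e9)))              # depth of the full Hi-C dataset
print(round(depth(hic_subset_bases)))   # depth of the subset used for the map
```

Under these assumptions the subset amounts to 72 Gb, and the two depths round to 35x and 5x.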
Genome assembly
Since the dataset was too large for many long-read assemblers, we sampled a 30x coverage
by selecting the longest reads (Table S1). This subset was assembled using multiple
assembly tools dedicated to processing this large amount of data (Redbean16,
SMARTdenovo17 and Flye18). SMARTdenovo is not among the fastest algorithms and has
not been updated for several years, but since it can be easily parallelized, it remains an
interesting choice for assembling large genomes. The overlap and consensus calculations
were split into 60 chunks, each run on a 32-core server, and took about two days
and ten hours respectively. In comparison, Redbean generated an assembly after
seven days on a 64-core server with 3 TB of memory, while Flye needed 43 days on the
same server. Surprisingly, the Redbean assembly had a cumulative size two times
higher than the expected genome size (29.6 Gb vs 14.5 Gb), a low contiguity and contained a
large number of short contigs. The SMARTdenovo and Flye assemblies were highly
comparable, but Flye was the more contiguous (contig N50 of 1.8 Mb vs 1.1 Mb) and
SMARTdenovo had a cumulative size closer to the expected one (14.1 Gb vs 13.0 Gb, Table
S3). Additionally, even though the assemblies were polished later, the raw SMARTdenovo
assembly contained a higher number of complete BUSCO genes (83.0% vs 49.5%), which
indicates that its consensus module is more accurate.
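The subsampling step described at the start of this section (keeping the longest reads up to a 30x coverage) can be sketched as follows. This is a minimal illustration under our own assumptions: the text does not describe how the selection was implemented, and the function name and the in-memory dictionary of read lengths are hypothetical.

```python
def select_longest_reads(read_lengths, genome_size, target_depth):
    """Pick the longest reads until they sum to target_depth x genome_size.

    read_lengths: dict mapping read id -> read length in bases (hypothetical input).
    Returns the selected read ids, longest first.
    """
    target_bases = genome_size * target_depth
    selected, total = [], 0
    by_length = sorted(read_lengths.items(), key=lambda kv: kv[1], reverse=True)
    for read_id, length in by_length:
        if total >= target_bases:
            break
        selected.append(read_id)
        total += length
    return selected


# Toy example: a 40-base "genome" sampled at 2x needs 80 bases.
reads = {"a": 50, "b": 30, "c": 20, "d": 10}
print(select_longest_reads(reads, genome_size=40, target_depth=2))
```

For a dataset of this size, the read lengths would in practice be streamed from an index rather than held in a dictionary; the logic stays the same.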
The SMARTdenovo and Flye assemblies were successively polished using Racon19 and
Medaka20 with long reads and Hapo-G21 with short reads. Polished contigs were validated
and organized into scaffolds using the DLE-1 optical map and proprietary tools provided by
BNG. As expected, due to its lower cumulative size, Flye scaffolds contained a larger
proportion of unknown bases (851 Mb vs 262 Mb). Based on this result, the assembly
produced by SMARTdenovo17 was selected (Table S4). Local contig duplications (negative
gaps) were resolved using BiSCoT22, which improved the contig N50 from 1.2 Mb to 2.1
Mb. Finally, the resulting assembly was polished one last time using Hapo-G21 with short
reads. This led to 2,904 scaffolds (larger than 30 kb) representing 14.26 Gb with an N50 of 48
Mb (79 scaffolds) and a maximum scaffold size of 254 Mb. Thus, the genome size is in the
same range as all other available reference-quality assemblies of T. aestivum: e.g. 14.29 Gb
for cv. LongReach Lancer, 14.55 Gb for cv. Chinese Spring, and 14.96 Gb for cv. SY Mattis.
Construction and validation of pseudomolecules
We then guided the construction of the 21 chromosome sequences (i.e. pseudomolecules)
based on collinearity with the Chinese Spring (CS) RefSeq v2.1 assembly23. Given the
complexity of this hexaploid genome, we established a dedicated approach to
anchor each Renan scaffold based on similarity search against CS. To avoid problems due
to multiple mappings, we selected a dataset of uniquely mappable sequences. Genes are
not uniquely mappable since most of them are repeated as three homoeologous copies
sharing on average 97% nucleotide identity. In addition, the gene density (1 gene every 130
kb on average) is too low to anchor small Renan scaffolds that do not carry genes. Thus, we
used 150 bp tags corresponding to the 5′ and 3′ junctions between a transposable element
(TE) and its insertion site (75 bp on each side), which are called ISBP (Insertion Site-Based
Polymorphism) markers and are highly abundant and uniquely mappable in the wheat
genome24. We designed a dataset of 5.76 million ISBPs from the CS assembly, which represents
1 ISBP every 2.5 kb. Their mapping enabled the anchoring of 2,566 scaffolds on 21
pseudomolecules representing 14.20 Gb (99% of the assembly). We then used Hi-C data to
validate the assembly and to correct the mis-ordered and mis-oriented scaffolds. The Hi-C
map revealed only a few inconsistencies, demonstrating that the collinearity between CS
and Renan was strong enough to guide the anchoring in a very accurate manner. The Hi-C
map-based curation led to the detection of 18 chimeric scaffolds that were split into 2 or 3
pieces and to the correction of the location and/or orientation of 198 scaffolds. The final
assembly was composed of 21 pseudomolecules (Figure 1) with 338 unanchored scaffolds
representing 61 Mb only.
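To make the ISBP definition used above concrete, the sketch below extracts the two 150 bp junction tags (75 bp on each side of each TE boundary) from a sequence. It is an illustration only: the function name and the 0-based, end-exclusive coordinate convention are our assumptions, and the selection of uniquely mappable tags, which is essential in the real marker design, is not shown.

```python
def isbp_tags(sequence, te_start, te_end, flank=75):
    """Return the 5' and 3' junction tags of a TE located at [te_start, te_end).

    Each tag spans `flank` bases on either side of a TE boundary, i.e. 150 bp
    with the default value. Junctions too close to the sequence ends are skipped.
    """
    tags = []
    for junction in (te_start, te_end):
        left, right = junction - flank, junction + flank
        if left >= 0 and right <= len(sequence):
            tags.append(sequence[left:right])
    return tags


# Toy example: a 200-base "TE" (C) inserted between two 100-base flanks (A and G).
toy = "A" * 100 + "C" * 200 + "G" * 100
five_prime, three_prime = isbp_tags(toy, 100, 300)
print(len(five_prime), len(three_prime))
```

Each tag therefore combines 75 bases of the insertion site with 75 bases of the element, which is what makes such a junction specific to one insertion.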
Quality assessment of the assembly
We first estimated the completeness and quality of the assembly by searching for the
presence of known genes, i.e. the 107,891 High Confidence (HC) genes predicted in CS
RefSeq v1.1. We used BLAST25 to search for the presence of each of the 491,456 exons
longer than 30 bp in the Renan scaffolds, and we considered only matches showing at least
90% identity over at least 90% of the query length. We found hits for 97.6% of the query exons with
on average 99.3% identity, suggesting that the gene space is assembled at a high-quality
level. The missing genes/exons would correspond, in most cases, to real
presence/absence variations between CS and Renan, while the nucleotide divergence
between exons is only 0.7%. It also demonstrated that homoeologous gene copies, sharing
on average 97% identity7, were not collapsed in the delivered Renan assembly. Indeed, 62%
of the CS exons are strictly identical in Renan, showing that, even between a French and an
Asian accession, SNPs in coding sequences are rare. It also confirmed that the Renan
sequence quality is high even in homoeologous repeated sequences. We then assessed the
assembly quality of the TE space by aligning the complete dataset of ISBP markers of CS
onto the Renan assembly. We found that 94% of the markers were conserved (at least 90%
identity over 90% of the query length), i.e. present in the assembly, revealing that the TE space is
extremely close to completeness. Indeed, the 6% of missing markers is similar to the proportion
of expected presence/absence variations (PAVs) affecting TEs26.
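The hit filter used for both exons and ISBP markers (at least 90% identity over at least 90% of the query length) can be expressed as in the sketch below. It is a simplified illustration under our own assumptions: the tuple layout of a hit is hypothetical, and query coverage is computed from a single alignment rather than from several merged alignments.

```python
def passes_filter(pident, qstart, qend, qlen, min_identity=90.0, min_query_cov=0.90):
    """True if a hit shows >= min_identity % identity over >= min_query_cov of the query."""
    query_cov = (abs(qend - qstart) + 1) / qlen
    return pident >= min_identity and query_cov >= min_query_cov


def fraction_found(hits, n_queries):
    """Fraction of queries with at least one hit passing the filter.

    hits: iterable of (query_id, pident, qstart, qend, qlen) tuples (hypothetical layout).
    """
    found = {query_id for query_id, pident, qstart, qend, qlen in hits
             if passes_filter(pident, qstart, qend, qlen)}
    return len(found) / n_queries


# Toy example with four query exons: only e1 passes both thresholds.
toy_hits = [
    ("e1", 99.3, 1, 100, 100),   # passes
    ("e2", 95.0, 1, 80, 100),    # fails: covers 80% of the query
    ("e3", 85.0, 1, 100, 100),   # fails: identity below 90%
]
print(fraction_found(toy_hits, n_queries=4))
```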
Additionally, we aligned both short and long reads on the final assembly and examined the
coverage in 100 kb windows. Interestingly, using short reads we found lower coverage of the
D subgenome compared to the A and B subgenomes (Figures S1 and S2), which may
indicate mapping issues with the short reads. We extracted the highly covered regions, which
may be candidate regions for homoeologous exchanges or may represent regions collapsed
during the assembly. We found only 186 highly covered regions, representing 0.13% of the
whole genome sequence, that were mainly localized on the A and B subgenomes (Figure 1)
with few differences between short and long reads.
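The window-based coverage scan described above can be sketched as follows. The criterion used to call a window "highly covered" is not given in the text, so the threshold below (mean depth above twice the median window depth) is purely an assumption for illustration, as are the function names.

```python
import statistics


def window_means(depths, window):
    """Mean depth of consecutive, non-overlapping windows (the last one may be shorter)."""
    means = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


def high_coverage_windows(depths, window, fold=2.0):
    """Return (start, end) of windows whose mean depth exceeds fold x the median window depth.

    The fold threshold is a hypothetical choice, not the one used in the study.
    """
    means = window_means(depths, window)
    if not means:
        return []
    cutoff = fold * statistics.median(means)
    return [(i * window, min((i + 1) * window, len(depths)))
            for i, mean in enumerate(means) if mean > cutoff]


# Toy example with 100-base windows: one window at 40x among windows at 10x.
toy_depths = [10] * 300 + [40] * 100 + [10] * 100
print(high_coverage_windows(toy_depths, window=100))
```

With real data the window would be 100 kb and the per-base depths would come from the read alignments.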
Impact of the polishing
Based on BUSCO and the alignment of the ISBP markers from the CS assembly, we
monitored the evolution of the consensus quality through successive polishing iterations. As
previously described, the SMARTdenovo consensus allowed the recovery of a greater
number of complete BUSCO genes compared to that of Flye, which may be an indicator of
its greater accuracy. However, the BUSCO score was still low (83%), especially for a
hexaploid genome, underlining the importance of polishing raw assemblies. Likewise, we
were able to find 80.4% of the ISBP markers but only 7% were aligned without mismatch
between the two genotypes (Table S5). When polished with long reads, the BUSCO score
reached 96.7% and 92.9% of the ISBP markers were retrieved (including 28.0% with perfect
matches). The subsequent polishing step with short reads slightly decreased the BUSCO
score (from 96.7% to 96.6%), but the proportion of duplicated genes increased from 83.1%
to 87.0%, which is desirable here because in a hexaploid genome most
genes are present in three copies. Moreover, the proportion of perfectly aligned ISBP markers
drastically increased from 28.0% to 58.9%. Although the polishing with short reads
has little impact on the BUSCO conserved genes, the ISBP markers underline its importance in
the case of long-read assemblies. Since ISBPs are unique tags sampling the whole
genome, this analysis revealed that nucleotide errors were frequent before polishing,
affecting half of the sampled loci. Thus, we showed that the polishing steps were successful,
even in this large and polyploid genome, and drastically improved the quality of the
consensus.
Recent improvement of the ONT technology
Oxford Nanopore technology is evolving rapidly, and improvements to the basecalling
software are frequent, allowing old data to be reanalyzed with the aim of improving read
accuracy and subsequent analysis. To measure the gain brought by each new version during
this project, we analyzed a subset of ultra-long reads (longer than 100 kb) with different
basecallers or versions of the same basecaller: guppy 2.0, guppy 3.0.3 (High Accuracy
mode), guppy 3.6 (High Accuracy mode) and the recent bonito v0.3.1. We observed a strong
difference in accuracy, of around 7%, between guppy 2.0 and the newest basecaller (bonito
v0.3.1), representing the gain over the last two years (Figure 2A). This significant
improvement could lead nanopore users to reanalyze their old sequencing data to improve
the quality of their assemblies. Surprisingly, the identity percentage obtained on wheat is
lower than what was obtained on yeast and human samples (Figure 2B). This difference can
be explained by the fact that, first, the consensus of the wheat genome is not perfect and,
second, that basecallers are trained on a mixture which contains yeast and human data.
Indeed, DNA modification patterns can differ between taxa, and read accuracy seems better
when the model was trained on native DNA from the same species27. This large difference
between the read accuracy of yeast and wheat samples should motivate nanopore users to
train basecaller models on their target species.
Additionally, we evaluated the improvement of a given assembly after a reanalysis of the
sequencing data, and launched SMARTdenovo twice using ONT reads basecalled with
guppy 3.3 and guppy 3.6 (Table S6). The accuracy of raw nanopore reads gained about 2%
on average using guppy 3.6. We observed a reduction of the number of contigs of 19%, and
an improvement of the contig N50 of 26%, which represents a substantial gain. Likewise, the
cumulative size is slightly higher in the guppy 3.6 assembly, which may underline a smaller
amount of collapsed repetitive regions (Table S7).
Annotation of transposable elements and protein-coding genes
We annotated TEs based on similarity search against our wheat-specific TE library
ClariTeRep28, and raw results were then refined using CLARITE, an in-house program able
to resolve prediction conflicts, merge adjacent features into a single complete element, and
identify nested insertion patterns. We detected 3.9 million copies of TEs in the Renan
genome assembly, representing 12.0 Gb i.e. 84% of the assembly size. The proportions of
each superfamily were extremely similar to what has been described for CS29 (Table 2).
Gene annotation was achieved by, first, transferring genes predicted in CS RefSeq v2.1 by
homology using the MAGATT pipeline23. This allowed us to accurately transfer 105,243 (out
of 106,801; 98%) HC genes and 155,021 (out of 159,846; 97%) Low Confidence genes.
Such a transfer of genes predicted in another genotype (here CS) avoided genome-wide de
novo gene prediction that may artificially lead to many differences between the annotations.
We thus focused de novo predictions using TriAnnot30 only on the unannotated part of the
genome, representing 8.5% of the 14.2 Gb, after having masked transferred genes and
predicted TEs. This method allowed us to predict 4,440 genes specific to Renan compared
to CS i.e., 4% of the gene complement. This is consistent with the extent of structural
variations affecting genomes of Triticeae26. Transfer of known genes, novel predictions, and
manual curation (limited to storage protein encoding genes), led us to annotate 109,552
protein-coding genes on the Renan pseudomolecules.
Comparison with existing hexaploid genome assemblies
We compared our long-read assembly with 10 other available chromosome-scale
assemblies of wheat genomes. Although the gene content was similar between the different
assemblies, as expected, the assemblies based on short reads had a lower contiguity (contig
N50 values lower than 100 kb compared to the 2 Mb of the assembly of the Renan genome,
Figure 3A-B). Logically, they also contained more gaps (around 40 times more, Figure 3C).
Interestingly, we found in general more gaps per Mb in the D subgenome compared to the A
and B subgenomes in Renan. This tendency is more pronounced in long-read assemblies
(Figure S3). Chromosomes from the different assemblies had similar lengths except for the
ArinaLrFor and SY_Mattis varieties, in which a translocation between chromosomes 5B and
7B has been previously described8 (Figure 3D).
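The gap density compared above (gaps per Mb) can be computed from a chromosome sequence by counting runs of N, as in the minimal sketch below. It assumes that every gap is represented by one uninterrupted run of N or n characters; the function name is ours and this is not the code used to produce Figure 3C or Figure S3.

```python
import re


def gaps_per_mb(sequence):
    """Number of gaps (runs of N/n) per megabase of sequence."""
    if not sequence:
        return 0.0
    n_gaps = len(re.findall(r"[Nn]+", sequence))
    return n_gaps / (len(sequence) / 1e6)


# Toy example: a 1 Mb sequence containing two gaps.
toy = "A" * 400_000 + "N" * 100_000 + "C" * 300_000 + "n" * 100_000 + "G" * 100_000
print(gaps_per_mb(toy))
```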
In addition, we generated dotplots between CS and Renan homeologous chromosomes and
confirmed the strong collinearity between the two genomes (Figure 4). Whole chromosome
alignments highlighted 16 large-scale inversions (>5 Mb; up to 118 Mb) on 10 chromosomes
and 1 translocation of a ca. 45 Mb segment on chromosome 4A. We performed the same
comparisons with the 10 other available genomes of related varieties assembled at the
pseudomolecule level (Supplementary Data 1). It showed that only 2 of these inversions are
specific to Renan while the others are shared between several accessions. They correspond
to regions of 23 Mb on chr6B (position 398-421 Mb) and 10 Mb on chr7B (position 267-277
Mb).
Comparative analysis of a storage protein coding gene cluster in T. aestivum
Tandem duplications are an important mechanism in plant genome evolution and
adaptation31,32 but the assembly of tandemly duplicated gene clusters is difficult, especially
with short-reads. In order to illustrate the gain brought by this optimized assembly process,
we focused on an important locus on chromosome 1B known to carry multiple copies of
storage protein and disease resistance genes33,34. Among them, the genes encoding
omega-gliadins are not only duplicated in tandem, but are also composed of microsatellite
DNA in their coding part, making them particularly hard to assemble properly from short
reads. We compared orthologous regions harboring these genes between CS and Renan,
spanning 1.58 Mb and 2.32 Mb, respectively. The CS region was more fragmented with 81
gaps versus only 2 in Renan (Figure 5). The number of copies of omega-gliadin encoding
genes was quite similar: 9 in CS and 10 in Renan. The most striking difference came from
the completeness of the microsatellite motifs: 8 copies out of 9 contain N stretches in CS
RefSeq v2.1, revealing that the microsatellite is usually too large to be fully assembled with
short reads. In contrast, all 10 copies predicted in Renan were assembled completely. More
generally, we mapped the corresponding proteins back to the locus and showed that it was
better reconstructed in the Renan assembly, with a mean protein alignment length of 99%
compared to 58% in CS.
Discussion
In this study, we showed that the recent improvement of the Oxford Nanopore technology, in
terms of error rate and throughput, has opened up new perspectives in the age of long-read
technologies. Indeed, the sequencing and assembly of complex genomes is now accessible
to sequencing facilities. Additionally, the ability to sequence ultra-long reads using ONT
devices is a real advantage over other long-read technologies, and the error rate that was
previously a thorn in their side has been drastically reduced over the last year. By following
basecaller evolution, we noticed that the gain from using a recent basecaller is high, and we
expect this observation will encourage users to reprocess older data. However, this is not
trivial and it requires sufficient computing resources. With all these recent improvements, it is
clear that ONT now offers the ability to generate high-quality assemblies even in the case of
complex genomes, like that of hexaploid wheat.
Interestingly, we observed that the error rate of ONT data is highly organism-dependent and
that the training of the basecaller has a significant impact on the overall quality of the reads. This
is, in our opinion, an important fact because a large proportion of de novo assemblies now
concern non-model organisms and users will have to address this limitation of current
software. There are existing methods to train the basecaller on non-model species, but this
can still be a big barrier, depending on the size of the dataset, for many end users. However,
as highlighted in this study, the combination of long-read and short-read sequencing with
polishing methods greatly improves the consensus sequence of a given genome assembly
and these algorithms seem sufficient at least in coding regions.
Even though there are now several chromosome-scale assemblies of the hexaploid wheat
genome, this assembly of the Renan variety based on long reads will benefit biologists and
geneticists as it offers a higher resolution. We demonstrated by examining an important locus
containing prolamin and resistance genes that such regions are truly enhanced and contain
very few gaps compared to assemblies based on short reads. Additionally, unlike recent
chromosome-scale assemblies, Renan’s gene prediction is not only a projection of Chinese
Spring gene models, but also includes de novo annotation, which is of real benefit for the
construction of a pan-genome (or pan-annotation) or when cultivar-specific genes are
examined. For all of these reasons, we believe this higher resolution assembly will benefit
the wheat community and help breeding programs dedicated to the bread wheat genome.
Methods
Plant material and DNA extraction
Triticum aestivum cv. Renan seeds were provided by the INRAE Biological Resource Center
on small grain cereals and grown for two weeks. A dark treatment was applied to the
seedlings for two days before collecting leaf tissues.
For the sequencing experiments, DNA was isolated from frozen leaves using the QIAGEN
Genomic-tips 100/G kit (Cat No./ID: 10243), following the tissue extraction protocol.
Briefly, 1 g of leaves was ground in liquid nitrogen with mortar and pestle. After 3 h of lysis
and one centrifugation step, the DNA was immobilized on the column. After several washing
steps, DNA was eluted from the column, then desalted and concentrated by alcohol
precipitation. The DNA was resuspended in TE buffer.
To generate the optical map, uHMW DNA was purified from 0.5 g of very young fresh
leaves according to the Bionano Prep Plant tissue DNA Isolation Base Protocol (30068 –
Bionano Genomics) with the following specifications and modifications. Briefly, the leaves
were fixed using a fixing solution (Bionano Genomics) containing formaldehyde
(Sigma-Aldrich) and then ground in a homogenization buffer (Bionano Genomics) using a
Tissue Ruptor grinder (Qiagen). Nuclei were washed and embedded in agarose plugs. After
overnight proteinase K digestion in Lysis Buffer (Bionano Genomics) and one hour treatment
with RNAse A (Qiagen), plugs were washed four times in 1x Wash Buffer (Bionano
Genomics) and five times in 1x TE Buffer (ThermoFisher Scientific). Then, plugs were
melted two minutes at 70°C and solubilized with 2 µL of 0.5 U/µL AGARase enzyme
(ThermoFisher Scientific) for 45 minutes at 43°C. A dialysis step was performed in 1x TE
Buffer (ThermoFisher Scientific) for 45 minutes to purify DNA from any residues. The DNA
samples were quantified by using the Qubit dsDNA BR Assay (Invitrogen). Quality of
megabase size DNA was validated by pulsed field gel electrophoresis (PFGE).
Illumina Sequencing
DNA (1.5 µg) was sonicated using a Covaris E220 sonicator (Covaris, Woburn, MA, USA).
Fragments (1 µg) were end-repaired, 3′-adenylated and Illumina adapters (Bioo Scientific,
Austin, TX, USA) were then added using the Kapa Hyper Prep Kit (KapaBiosystems,
Wilmington, MA, USA). Ligation products were purified with AMPure XP beads (Beckman
Coulter Genomics, Danvers, MA, USA). Libraries were then quantified by qPCR using the
KAPA Library Quantification Kit for Illumina Libraries (KapaBiosystems), and library profiles
were assessed using a DNA High Sensitivity LabChip kit on an Agilent Bioanalyzer (Agilent
Technologies, Santa Clara, CA, USA). The library was sequenced on an Illumina NovaSeq
instrument (Illumina, San Diego, CA, USA) using 150 base-length read chemistry in a
paired-end mode. After the Illumina sequencing, an in-house quality control process was
applied to the reads that passed the Illumina quality filters35. These trimming and removal
steps were achieved using Fastxtend tools36.
Nanopore Sequencing
Libraries were prepared according to the “Genomic DNA by Ligation” protocol (SQK-LSK109
kit). Genomic DNA fragments (1.5 µg) were repaired and 3′-adenylated with the NEBNext
FFPE DNA Repair Mix and the NEBNext® Ultra™ II End Repair/dA-Tailing Module (New
England Biolabs, Ipswich, MA, USA). Sequencing adapters provided by Oxford Nanopore
Technologies (Oxford Nanopore Technologies Ltd, Oxford, UK) were then ligated using the
NEBNext Quick Ligation Module (NEB). After purification with AMPure XP beads (Beckman
Coulter, Brea, CA, USA), the library was mixed with the Sequencing Buffer (ONT) and the
Loading Beads (ONT) and loaded on MinION or PromethION R9.4.1 flow cells. One
PromethION run was performed with genomic DNA purified with the Short Read Eliminator kit
(Circulomics, Baltimore, MD, USA) before library preparation.
Optical Maps
Labeling and staining of the uHMW DNA were performed according to the Bionano Prep
Direct Label and Stain (DLS) protocol (30206 – Bionano Genomics). Briefly, labeling was
performed by incubating 750 ng genomic DNA with 1× DLE-1 Enzyme (Bionano Genomics)
for 2 hours in the presence of 1× DL-Green (Bionano Genomics) and 1× DLE-1 Buffer
(Bionano Genomics). Following proteinase K digestion and DL-Green cleanup, the DNA
backbone was stained by mixing the labeled DNA with DNA Stain solution (Bionano
Genomics) in the presence of 1× Flow Buffer (Bionano Genomics) and 1× DTT (Bionano
Genomics), and incubating overnight at room temperature. The DLS DNA concentration was
measured with the Qubit dsDNA HS Assay (Invitrogen).
Labelled and stained DNA was loaded on Saphyr chips. Loading of the chips and running of
the Bionano Genomics Saphyr System were all performed according to the Saphyr System
User Guide (30247 – Bionano Genomics). Data processing was performed using the
Bionano Genomics Access software.
A total of 4,541 Gb of data were generated. From these data, molecules larger than
150 kb were selected, yielding 1,931 Gb of data. These filtered data, corresponding to 128x
coverage of the Triticum aestivum cv. Renan genome, consist of 7,810,298 molecules with an
N50 of 237.5 kb and an average label density of 14.3/100 kbp. The filtered molecules were
aligned using RefAligner with default parameters, producing 1,053 genome maps with an N50
of 37.5 Mbp for a total genome map length of 14,946.8 Mbp.
Long reads genome assembly
The 20 ONT runs were basecalled using two versions of guppy: 3.3 HAC and 3.6 HAC
(Table S6). We monitored the gain of each guppy basecaller release and evaluated three
different assemblers in the context of large genomes: Redbean16 v2.5 (git commit 3d51d7e),
SMARTdenovo17 (git commit 5cc1356) and Flye18 v2.7 (git commit 5c12b69). All assemblers
were launched using a subset of reads consisting of 30X of the longest reads (Table S3).
Then, we selected one of the assemblies based not only on contiguity metrics such as N50
but also on cumulative size and proportion of unknown bases. The Flye (longest reads) and
SMARTdenovo (all reads) assemblies were very similar in terms of contiguity, but we decided
to keep the SMARTdenovo assembly as its cumulative size was higher. The SMARTdenovo
assembler using the longest reads resulted in a contig N50 of 1.1 Mb and a cumulative size
of 14.07 Gb. As nanopore reads contain systematic errors in homopolymeric regions, we
polished the consensus of the selected assembly with nanopore reads as input to the Racon
(v1.3.2, git commit 5e2ecb7) and Medaka software. In addition, we polished the assembly
two additional times using Illumina reads as input to the Hapo-G tool (v1.0, git commit ).
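The selection of a 30X subset of the longest reads can be illustrated with a minimal sketch. It is not the script used in this study: the function name, the in-memory list of read lengths and the genome size passed as an argument are assumptions made for illustration only.

```python
from typing import Iterable, List


def select_longest_reads(read_lengths: Iterable[int],
                         genome_size: int,
                         depth: float = 30.0) -> List[int]:
    """Return the lengths of the longest reads whose cumulative size
    first reaches depth x genome_size (hypothetical helper)."""
    target = depth * genome_size
    selected: List[int] = []
    total = 0
    # Walk through the reads from longest to shortest.
    for length in sorted(read_lengths, reverse=True):
        if total >= target:
            break
        selected.append(length)
        total += length
    return selected


# Toy example: a 3-base "genome" at 30X needs 90 bases of reads.
subset = select_longest_reads([10, 50, 20, 40, 30], genome_size=3)
print(subset)
```

In practice the same logic would be applied to read identifiers rather than bare lengths, so that the selected reads can be extracted from the FASTQ files.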
Long range genome assembly
The Bionano Genomics scaffolding workflow (Bionano Solve version 3.5.1) was launched
with the nanopore contigs and the Bionano map. We found in several cases that the
nanopore contigs were overlapping (based on the optical map) and these overlaps were
corrected using the BiSCoT software22 with default parameters. Finally, the consensus
sequence was polished once more using Hapo-G and short reads, to ensure correction of
duplicate regions that were collapsed (Table S4).
Validation of the Triticum aestivum cv Renan assembly
We used BLAST25 to search for the presence of 107,891 HC genes from CS RefSeq v1.1 in
the Renan genome sequence. We extracted the 491,456 individual exons larger than 30 bp
from this dataset and computed exon-by-exon BLAST in order to avoid spurious spliced
alignments. An exon was considered present if it matched the Renan scaffolds with at least
90% identity over at least 90% of its length. We extracted all available ISBPs (150 bp each)
from the CS RefSeq v1.1 and filtered out ISBPs containing Ns and those that do not map
uniquely on the CS genome. This led to a dataset of 5,394,172 ISBPs,
which were aligned to the Renan scaffolds using BLAST. We considered an ISBP to be
conserved in Renan if it matched with at least 90% identity over 90% of its length. We used
the same ISBP dataset to study the impact of polishing on the error rate in the assembly,
using BLAST and requiring at least 90% identity over at least 145 aligned nucleotides.
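The presence criterion (at least 90% identity over at least 90% of the query length) can be sketched as a filter on tabular BLAST output. The sketch below assumes, for illustration only, that BLAST was run with a custom tabular format whose first five columns are `qseqid sseqid pident length qlen`; the actual output format used in this study is not stated. The alignment length is used as a proxy for the covered query length, which slightly overestimates coverage when the alignment contains gaps.

```python
def present_queries(blast_lines, min_identity=90.0, min_cov_pct=90.0):
    """Return the set of query IDs with at least one hit passing both
    thresholds (hypothetical helper; column layout is an assumption)."""
    present = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        qseqid, _sseqid, pident, length, qlen = line.rstrip("\n").split("\t")[:5]
        # Compare 100 * aligned length with min_cov_pct * query length
        # to avoid floating-point rounding on the ratio.
        covered = 100 * int(length) >= min_cov_pct * int(qlen)
        if float(pident) >= min_identity and covered:
            present.add(qseqid)
    return present


hits = [
    "exon1\tscf1\t95.0\t95\t100",   # passes both thresholds
    "exon2\tscf1\t89.9\t100\t100",  # identity too low
    "exon3\tscf2\t99.0\t80\t100",   # coverage too low
    "exon3\tscf3\t92.0\t90\t100",   # second hit for exon3 passes
]
print(sorted(present_queries(hits)))
```

The stricter criterion used for the polishing evaluation (at least 145 aligned nucleotides out of 150) corresponds to replacing the coverage test with a fixed minimum alignment length.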
Anchoring of the Triticum aestivum cv Renan assembly
We guided the construction of 21 Renan pseudomolecules based on collinearity with the CS
RefSeq Assembly v2.1. For this, we used the positions of conserved ISBPs as anchors
(5,087,711 ISBPs matching with >=80% identity over >=90% query overlap). This
represented 357 ISBPs/Mb, meaning that even the smallest scaffolds (30 kb) generally
carried more than 10 potential anchors. However, some ISBPs match at non-orthologous
positions, which creates noise when determining the precise order and orientation of some
scaffolds. To overcome this issue, we considered ISBPs by pairs. Only pairs of adjacent
ISBPs (i.e. separated by less than 50 kb on both CS and Renan genomes) were kept as
valid anchors, allowing the filtering out of isolated mis-mapped ISBPs. Only scaffolds
harboring at least 50% of valid ISBP pairs on a single chromosome were kept. The others
were considered unanchored and they comprised the “chrUn”. We calculated the median
position of matching ISBP pairs along each CS chromosome for defining the order of the
Renan scaffolds relative to each other. Their orientation was retrieved from the orientation of
all matching ISBP pairs in CS following the majority rule. We thus built 21 pseudomolecules
that were then corrected according to the Hi-C map as explained hereafter.
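The pair-based anchoring rules described above can be summarized in a short sketch for a single scaffold. This is an illustration under stated assumptions, not the code used in this study: the input layout (tuples of scaffold position, CS chromosome and CS position), the function name, the use of consecutive hits along the scaffold as "adjacent" ISBPs, and the reading of the 50% rule as "at least half of the scaffold's valid pairs fall on one chromosome" are all our interpretation.

```python
from collections import Counter
from statistics import median

MAX_GAP = 50_000  # pairs must be closer than 50 kb on both genomes


def anchor_scaffold(hits, max_gap=MAX_GAP, min_fraction=0.5):
    """hits: list of (scaffold_pos, cs_chrom, cs_pos) for one scaffold.
    Returns (chromosome, median CS position, orientation) or None
    when the scaffold is left unanchored (hypothetical helper)."""
    hits = sorted(hits)
    pairs = []  # (cs_chrom, CS midpoint of the pair, orientation)
    for (s1, c1, p1), (s2, c2, p2) in zip(hits, hits[1:]):
        if c1 == c2 and s2 - s1 < max_gap and abs(p2 - p1) < max_gap:
            pairs.append((c1, (p1 + p2) / 2, "+" if p2 >= p1 else "-"))
    if not pairs:
        return None
    chrom, count = Counter(c for c, _, _ in pairs).most_common(1)[0]
    if count < min_fraction * len(pairs):
        return None
    on_chrom = [(mid, strand) for c, mid, strand in pairs if c == chrom]
    position = median(mid for mid, _ in on_chrom)
    # Majority rule on the orientation of the valid pairs.
    orientation = Counter(s for _, s in on_chrom).most_common(1)[0][0]
    return chrom, position, orientation


example = [(1000, "1A", 500000), (3000, "1A", 498000),
           (6000, "1A", 495000), (9000, "3B", 100)]
print(anchor_scaffold(example))
```

Scaffolds anchored to the same chromosome would then be ordered by their median position, and those returning `None` would be assigned to chrUn.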
Two Hi-C biological replicates were prepared from ten-day-old plantlets of Triticum aestivum cv.
Renan following the Arima Hi-C protocol (Arima Hi-C User Guide for Plant Tissues DOC
A160106 v01). For each replicate, two libraries were constructed using the Kapa Hyper Prep
kit (Roche) according to Arima’s recommendations (Library Preparation using KAPA Hyper
Prep Kit DOC A160108 v01). The technical replicates were then pooled and sent to Genewiz
for sequencing on an Illumina HiSeq4000 (four lanes in total), reaching 35x coverage. We
mapped a sample of 240 million read pairs with BWA-MEM (Burrows-Wheeler Aligner, Heng
Li, 2013) to the previously built 21 pseudomolecules; the alignments were filtered for low
quality, sorted, and deduplicated using the Juicer pipeline37. We produced a Hi-C map from
the Juicer output using the candidate assembly visualizer mode of the 3D-DNA pipeline38 and
visualized it with the Juicebox Assembly Tools software. Based on abnormal contact-frequency
signals revealing a lack of contiguity, scaffold-level modifications of order and orientation,
and/or chimeric scaffolds, were identified in order to improve the assembly. For chimeric
scaffolds, the coordinates of the resulting fragments were retrieved from the Juicebox
Assembly Tools application and then recalculated to correspond precisely to the closest gap
in the scaffold. Pseudomolecules were eventually rebuilt from the initial scaffolds and the
new fragments while adding 100-N gaps
between neighboring scaffolds. A final Hi-C map was built to validate the accuracy of the final
assembly.
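The final rebuilding step, in which ordered and oriented scaffolds are joined with gaps of 100 Ns, can be sketched as follows. The function names and the representation of scaffolds as (sequence, orientation) tuples are assumptions made for illustration; the actual implementation is not described in the text.

```python
GAP = "N" * 100  # fixed-size gap inserted between neighboring scaffolds
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence (A, C, G, T and N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_pseudomolecule(ordered_scaffolds, gap: str = GAP) -> str:
    """ordered_scaffolds: iterable of (sequence, orientation), with
    orientation '+' or '-', already sorted along the chromosome."""
    parts = [
        seq if orientation == "+" else reverse_complement(seq)
        for seq, orientation in ordered_scaffolds
    ]
    return gap.join(parts)


chrom = build_pseudomolecule([("ACGT", "+"), ("AAGC", "-")])
print(len(chrom))  # 4 + 100 + 4 bases
```

Because the gap size is fixed, these 100-N stretches mark scaffold junctions of unknown size rather than estimated distances.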
Calculation of chromosome coverage
Short and long reads were aligned using minimap2 (with the parameters ‘-I 17G -2
--sam-hit-only -a -x sr’ and ‘-I 17G -2 --sam-hit-only --secondary=no -a -x map-ont’,
respectively). Coverage of individual chromosomes was calculated in 100 kb windows using
mosdepth39 (version 0.3.1) with the parameters ‘--by 100000 -n -i 2 -Q 10 -m’. For
each chromosome, 100 kb windows with a coverage higher than twice the median
coverage of the corresponding chromosome were tagged as highly covered. Only genomic
regions with at least five consecutive such windows were kept and represented in Figure 1.
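The tagging of highly covered regions can be illustrated with a minimal sketch operating on the per-window coverage values of one chromosome (for example, the fourth column of the mosdepth regions output). The function name and the in-memory list input are assumptions for illustration.

```python
from statistics import median


def highly_covered_regions(window_cov, window_size=100_000,
                           factor=2.0, min_run=5):
    """Return (start, end) intervals covering runs of at least
    `min_run` consecutive windows whose coverage exceeds
    `factor` x the chromosome median (hypothetical helper).
    End coordinates are multiples of window_size and may overshoot
    the chromosome end if the last window is shorter."""
    values = list(window_cov)
    if not values:
        return []
    threshold = factor * median(values)
    regions = []
    run_start = None
    # A sentinel below any threshold closes a run ending at the last window.
    for i, cov in enumerate(values + [float("-inf")]):
        if cov > threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_run:
                regions.append((run_start * window_size, i * window_size))
            run_start = None
    return regions


cov = [10] * 10 + [25] * 5 + [10] * 3 + [30] * 2 + [10] * 10
print(highly_covered_regions(cov))
```

In this toy example the run of five windows at 25x is reported, whereas the run of two windows at 30x is discarded because it is shorter than five windows.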
Transposable elements annotation
Transposable elements were annotated using CLARITE28. Briefly, TEs were identified
through a similarity search approach based on the ClariTeRep curated databank of repeated
elements using RepeatMasker (www.repeatmasker.org) and modelled with the CLARITE
program that was developed to resolve overlapping predictions, merge adjacent fragments
into a single element when necessary, and identify patterns of nested insertions28.
Gene prediction
We used the MAGATT pipeline (Marker Assisted Gene Annotation Transfer for Triticeae,
https://forgemia.inra.fr/umr-gdec/magatt) to map the full set of 106,801 High Confidence and
159,848 Low Confidence genes predicted in Chinese Spring IWGSC RefSeq v2.1. The
workflow implemented in this pipeline was described in Zhu et al.23. Briefly, it uses gene
flanking ISBP markers in order to determine an interval that is predicted to contain the gene
before homology-based annotation transfer, limiting problems due to multiple mapping.
When the interval is identified, MAGATT uses BLAT40 to align the gene sequence (UTRs, exons,
and introns) and recalculates all sub-feature coordinates if the alignment is full-length
and without indels. If the alignment is partial or contains indels, it runs GMAP41 to perform
spliced alignment of the candidate CDS inside the interval. If no ISBP-flanked interval was
determined or if both BLAT and GMAP failed to transfer the gene, MAGATT runs GMAP
against the whole genome, including the unanchored fraction of the Renan assembly. We
kept the best hit considering a minimum identity of 70% and a minimum coverage of 70%,
with cross_species parameter enabled.
We then masked the genome sequence based on mapped genes and predicted
transposable elements coordinates using BEDTools42 mergeBed and maskfasta v2.27.1.
We then computed a de novo gene prediction on the unannotated part of the genome. We
used TriAnnot30 to call genes based on a combination of evidence: de novo predictions of
gene finders (FGeneSH, Augustus), similarity with known proteins in Poaceae, and similarity
with transcribed sequences, as described previously7. For that purpose, we produced
RNASeq data for Renan from 28 samples corresponding to 14 different organs/conditions in
replicates (grains at four developmental stages (100, 250, 500, and 700 degree days) under
heat stress and control conditions, stems at two developmental stages, leaves at three
stages, and roots at one stage), representing on average 78.8 million read pairs per sample,
i.e., 2.2 billion read pairs in total. We mapped reads with HISAT2 (ref. 43) v2.0.5, called 277,505
transcripts with StringTie44 v2.0.3, extracted their sequences with Cufflinks45 gffread v2.2.1,
and provided this resource as input to TriAnnot. We optimized the TriAnnot workflow to ensure
smooth operation on a cloud-based HPC cluster (10 nodes with 32 CPUs/128 GB RAM each and a
shared file system) using the IaaS OpenStack infrastructure from the UCA Mesocentre.
Gene models were then filtered as follows: we discarded gene models that shared strong
identity (>=92% identity, >=95% query coverage) with an unannotated region of the Chinese
Spring RefSeq v2.1, considered as doubtful predictions. We then kept all predictions that
matched RNASeq-derived transcripts (>=99% identity, >=70% query and subject coverage).
For those that did not show evidence of transcription, we kept gene models sharing protein
similarity (>=40% identity, >=50% query and subject coverage) with a Poaceae protein
having a putative function (filtering out based on terms “unknown”, “uncharacterized”, and
“predicted protein”).
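The three filtering rules can be read as a simple decision cascade. The sketch below is an illustration only: the `Hit` record, its field names and the function name are hypothetical, and only the thresholds and the excluded description terms come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """Best alignment of a gene model against one resource
    (hypothetical record; identities and coverages in percent)."""
    identity: float
    query_cov: float
    subject_cov: float = 100.0
    description: str = ""


UNINFORMATIVE = ("unknown", "uncharacterized", "predicted protein")


def keep_gene_model(cs_unannotated: Optional[Hit],
                    transcript: Optional[Hit],
                    protein: Optional[Hit]) -> bool:
    # 1. Discard models matching an unannotated region of CS RefSeq v2.1.
    if (cs_unannotated is not None and cs_unannotated.identity >= 92
            and cs_unannotated.query_cov >= 95):
        return False
    # 2. Keep models supported by an RNASeq-derived transcript.
    if (transcript is not None and transcript.identity >= 99
            and transcript.query_cov >= 70 and transcript.subject_cov >= 70):
        return True
    # 3. Otherwise require similarity to a Poaceae protein with a putative function.
    if (protein is not None and protein.identity >= 40
            and protein.query_cov >= 50 and protein.subject_cov >= 50):
        text = protein.description.lower()
        return not any(term in text for term in UNINFORMATIVE)
    return False


print(keep_gene_model(None, Hit(99.5, 80, 75), None))
```

The order of the tests matters: a model matching an unannotated CS region is discarded even when it is supported by transcript evidence.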
Comparison of genome assemblies
Genome assemblies were downloaded from https://webblast.ipk-gatersleben.de/downloads.
Contigs were extracted by splitting input sequences at each N and standard metrics were
computed. Gene completion metrics were calculated using BUSCO v5.0 and version 10 of
the poales geneset which contains 4896 genes.
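As an illustration of how contig-level metrics can be derived from scaffolded sequences, the following sketch splits a sequence at every run of Ns and computes Nx/Lx values. The function names are ours, and the toy sequence is only meant to show the calculation.

```python
import re


def split_contigs(sequence: str):
    """Split a scaffold or pseudomolecule into contigs at each run of Ns."""
    return [contig for contig in re.split(r"[Nn]+", sequence) if contig]


def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx): the length of the contig at which the cumulative
    size, from longest to shortest, reaches `fraction` of the total,
    and the number of contigs needed to reach it."""
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("no contig lengths given")


contigs = split_contigs("ACGTNNNNACNGGGTTT")
lengths = [len(c) for c in contigs]
print(nx_lx(lengths), nx_lx(lengths, fraction=0.9))
```

With `fraction=0.5` the function returns the N50 and L50, and with `fraction=0.9` the N90 and L90, as reported in Table 1.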
We built dotplots between Renan, CS and 10 other reference-quality genomes (ArinaLrFor,
CDC Landmark, CDC Stanley, Jagger, Julius, LongReach Lancer, Mace, Norin61, SY Mattis,
spelta PI190962) by using orthologous positions of conserved ISBPs (1 ISBP every 2.5 kb
on average) identified by mapping them with BWA-MEM (maximum 2 mismatches, 100%
coverage and minimal mapping quality of 30).
Comparison of a storage protein coding gene cluster
We performed manual curation of the gene models encoding storage proteins predicted in
Renan. Protein sequences of prolamin and resistance genes33 from a 1B chromosome locus
were downloaded and aligned to the CS and Renan genomes using BLAT40 with default
parameters. Draft alignments were refined by aligning the given protein sequence to the
genomic region defined by the BLAT alignment using Genewise with default parameters.
The resulting alignments were filtered to retain only the best match for each position
by keeping only the highest-scoring alignment, and the genomic region containing the gene
cluster was extracted. Then, we used the jcvi suite46 with the mcscan pipeline to find synteny
blocks between both genomes. First, we used the “jcvi.compara.catalog” command to find
orthologs and then the “jcvi.compara.synteny mcscan” command with the “--iter=1” option to
extract synteny blocks. Finally, we generated the figure with the “jcvi.graphics.synteny” command
and manually edited the generated svg file to improve the quality of the resulting image by
changing gene colors, incorporating gaps and renaming genes. Moreover, to make the figure
clearer, we artificially reduced the intergenic space by 95% so that gene structures appear
bigger. The omega gene cluster representation figure was generated by using
DnaFeaturesViewer47 with coordinates of features generated by the mcscan pipeline used
previously.
Additional files
All the supporting data are included in two additional files: (a) A supplementary file which
contains Supplementary Tables 1-7 and Supplementary Figures 1-3; (b) A supplementary file
which contains dotplots of the 21 chromosomes of Renan with other wheat genome
assemblies.
Acknowledgements
This work was supported by the Genoscope, the Commissariat à l’Énergie Atomique et aux
Énergies Alternatives (CEA) and France Génomique (ANR-10-INBS-09-08). The biological
material (i.e. plant production, sample management, DNA and RNA extractions performed
by Caroline Pont and Cécile Huneau at GDEC) was obtained in the framework of the
France Génomique WheatOMICS project (2017-2021) coordinated by Jérôme Salse. The
authors are grateful to Oxford Nanopore Technologies Ltd for providing early access to the
PromethION device through the PEAP, and we thank the staff of Oxford Nanopore
Technologies Ltd for technical help. We are grateful to the Mésocentre Clermont Auvergne
University and/or AuBi platform for providing help and/or computing and/or storage
resources.
Availability of supporting data
The Illumina and PromethION sequencing data and the Bionano optical map are available in
the European Nucleotide Archive under the following project PRJEB46515. The genome
assembly and gene predictions are freely available from the Genoscope website
http://www.genoscope.cns.fr/plants/.
Competing interests
The authors declare that they have no competing interests. JMA received travel and
accommodation expenses to speak at Oxford Nanopore Technologies conferences. JMA
and CB received accommodation expenses to speak at Bionano Genomics user meetings.
Funding
This work was supported by the Genoscope, the Commissariat à l’Énergie Atomique et aux
Énergies Alternatives (CEA) and France Génomique (ANR-10-INBS-09-08).
Authors’ contributions
SA, ID and AB extracted the sequenced DNA and generated the optical map. KL and AA
optimized and performed the nanopore and Illumina sequencing. NP, EP and MR generated
the Hi-C libraries and sequences. JMA, SE, BI, CM, PLZ, CB, HR, PL, DG and FC
performed the bioinformatic analyses. JMA, SE, BI, CM, PLZ, CB, CC, HR, PL and FC wrote
the article. JMA, PW and FC supervised the study.
Table 1: Comparison of Triticum aestivum L. genome assemblies.

                              Renan               Chinese Spring
                              (This study)        (IWGSC7)
Number of contigs             12,608              693,050
Cumulative size (bp)          13,943,021,299      14,271,578,887
N50 (bp)                      2,164,453           51,835
L50                           1,946               81,466
N90 (bp)                      607,045             11,621
L90                           6,582               295,310
Longest contig (bp)           15,116,687          580,542

Number of chromosomes         21                  21
Cumulative size (bp)          14,195,643,615      14,547,261,565
N50 (bp)                      703,299,328         709,773,760
L50                           10                  10
N90 (bp)                      520,815,552         509,857,056
L90                           19                  19
Longest (bp)                  854,463,248         830,829,764
% of N                        1.78%               1.90%
BUSCO (N=4,896)
  Complete                    99.1%               99.3%
  Duplicated                  94.7%               96.0%
  Fragmented                  0.1%                0.1%
  Missing                     0.8%                0.6%

Number of genes               109,552             107,891
Average number of exons       5.10                5.33
BUSCO (N=4,896)
  Complete                    99.1%               99.5%
  Duplicated                  94.6%               98.2%
  Fragmented                  0.2%                0.1%
  Missing                     0.7%                0.4%
Table 2: TE classes proportions in Chinese Spring and Renan genome assemblies.

                                                 Chinese Spring      Chinese Spring      Renan
                                                 RefSeq_v1.0         RefSeq_v2.1         RefSeq_v2.0
                                                 from Zhu et al23    from Zhu et al23
Genome size (bp)                                 14,066,280,851      14,225,829,371      14,195,643,615
TE (bp)                                          11,921,309,743      12,092,094,168      11,967,447,100
TE (%)                                           84.7                85.0                84.3
Class I (Retrotransposons)                       67.6                66.9                66.6
Gypsy (RLG)                                      46.7                46.1                45.8
Copia (RLC)                                      16.7                16.5                16.5
Unclassified LTR retrotransposons (RLX)          3.24                3.3                 3.2
LINE (RIX)                                       0.9                 1.1                 1.1
SINE (SIX)                                       0.01                0.01                0.01
Class II (DNA transposons)- Subclass 1           16.5                17.0                16.9
CACTA (DTC)                                      15.5                15.9                15.8
Mutator (DTM)                                    0.38                0.44                0.44
Unclassified DNA transposons with TIR (DTX)      0.21                0.24                0.24
Harbinger (DTH)                                  0.16                0.18                0.18
Mariner (DTT)                                    0.16                0.17                0.17
Unclassified DNA transposons (DXX)               0.06                0.06                0.06
hAT (DTA)                                        0.006               0.009               0.009
Helitrons (DHH)                                  0.004               0.01                0.01
Unclassified TE (XXX)                            0.68                0.95                0.82
Figure 1. Genome overview of the 21 chromosomes of hexaploid T. aestivum Renan (the 7
A chromosomes are in blue, the 7 B chromosomes in orange and the 7 D chromosomes in
green). From inner to outer track: (i) Gene density, (ii) Density of CACTA (DNA transposon)
elements, (iii) Density of Copia elements, (iv) Density of Gypsy elements, (v) dots represent
highly covered regions (candidate regions of homoeologous exchanges or collapsed
regions) with Illumina reads (in red) and nanopore reads (in blue), (vi) Density of gaps. All
densities are calculated in 1-Mb windows; blue and red colors in density plots indicate lower
and higher values, respectively.
Figure 2. Comparison of the accuracy of different ONT basecallers. A. ONT reads from a
yeast sample. B. ONT ultra-long reads (>100 kb) from a wheat sample.
Figure 3. Comparison of existing hexaploid genome assemblies. A. Contig N50 values. B.
Complete BUSCO genes found in each assembly. C. Number of gaps in each chromosome.
D. Chromosome length.
Figure 4. Dotplot comparisons of the 21 chromosomes of Renan (y axis) with the Chinese
Spring RefSeq v2.1 assembly (x axis).
Figure 5. Comparative view of an important locus on chromosome 1B containing tandemly
duplicated prolamin and resistance genes. a. Representation of the region with gaps and
genes on the two assemblies of Renan and CS. b. Zoomed view of the omega gliadin gene
cluster. c. Proportion of the length of the proteins that were aligned in the genomic region of
Renan and CS.
References
1. Dubcovsky, J. & Dvorak, J. Genome Plasticity a Key Factor in the Success of Polyploid
Wheat Under Domestication. Science 316, 1862–1866 (2007).
2. Marcussen, T. et al. Ancient hybridizations among the ancestral genomes of bread
wheat. Science 345, 1250092 (2014).
3. Guan, J. et al. The Battle to Sequence the Bread Wheat Genome: A Tale of the Three
Kingdoms. Genomics Proteomics Bioinformatics 18, 221–229 (2020).
4. Chapman, J. A. et al. A whole-genome shotgun approach for assembling and anchoring
the hexaploid bread wheat genome. Genome Biol. 16, 26 (2015).
5. Zimin, A. V. et al. The first near-complete assembly of the hexaploid bread wheat
genome, Triticum aestivum. GigaScience 6, 1–7 (2017).
6. Clavijo, B. J. et al. An improved assembly and annotation of the allohexaploid wheat
genome identifies complete families of agronomic genes and provides genomic evidence
for chromosomal translocations. Genome Res. 27, 885–896 (2017).
7. International Wheat Genome Sequencing Consortium (IWGSC) et al. Shifting the limits in
wheat research and breeding using a fully annotated reference genome. Science 361, (2018).
8. Walkowiak, S. et al. Multiple wheat genomes reveal global variation in modern breeding.
Nature 588, 277–283 (2020).
9. Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome.
Nature 585, 79–84 (2020).
10. Belser, C. et al. Chromosome-scale assemblies of plant genomes using nanopore long
reads and optical maps. Nat. Plants 4, 879–887 (2018).
11. Rousseau-Gueutin, M. et al. Long-read assembly of the Brassica napus reference
genome Darmor-bzh. GigaScience 9, (2020).
12. Li, G. et al. A high-quality genome assembly highlights rye genomic characteristics and
agronomically important genes. Nat. Genet. 53, 574–584 (2021).
13. Liu, J. et al. Gapless assembly of maize chromosomes using long-read technologies.
Genome Biol. 21, 121 (2020).
14. Tørresen, O. K. et al. Tandem repeats lead to sequence assembly errors and impose
multi-level challenges for genome and protein databases. Nucleic Acids Res. 47,
10994–11006 (2019).
15. Li, C. et al. Long-read sequencing reveals genomic structural variations that underlie
creation of quality protein maize. Nat. Commun. 11, 1–11 (2020).
16. Ruan, J. & Li, H. Fast and accurate long-read assembly with wtdbg2. Nat. Methods 17,
155–158 (2020).
17. Liu, H., Wu, S., Li, A. & Ruan, J. SMARTdenovo: a de novo assembler using long noisy
reads. Gigabyte 2021, 1–9 (2021).
18. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads
using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
19. Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome
assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
20. Medaka : https://github.com/nanoporetech/medaka. (Oxford Nanopore Technologies,
2021).
21. Aury, J.-M. & Istace, B. Hapo-G, haplotype-aware polishing of genome assemblies with
accurate reads. NAR Genomics Bioinforma. 3, (2021).
22. Istace, B., Belser, C. & Aury, J.-M. BiSCoT: improving large eukaryotic genome
assemblies with optical maps. PeerJ 8, e10150 (2020).
23. Zhu, T. et al. Optical maps refine the bread wheat Triticum aestivum cv. Chinese Spring
genome assembly. Plant J. n/a,.
24. Rimbert, H. et al. High throughput SNP discovery and genotyping in hexaploid wheat.
PLOS ONE 13, e0186329 (2018).
25. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment
search tool. J. Mol. Biol. 215, 403–410 (1990).
26. De Oliveira, R. et al. Structural Variations Affecting Genes and Transposable Elements
of Chromosome 3B in Wheats. Front. Genet. 11, (2020).
27. Wick, R. R., Judd, L. M. & Holt, K. E. Performance of neural network basecalling tools
for Oxford Nanopore sequencing. Genome Biol. 20, 129 (2019).
28. Daron, J. et al. Organization and evolution of transposable elements along the bread
wheat chromosome 3B. Genome Biol. 15, 546 (2014).
29. Wicker, T. et al. Impact of transposable elements on genome structure and evolution in
bread wheat. Genome Biol. 19, 103 (2018).
30. Leroy, P. et al. TriAnnot: A Versatile and High Performance Pipeline for the Automated
Annotation of Plant Genomes. Front. Plant Sci. 0, (2012).
31. Kondrashov, F. A. Gene duplication as a mechanism of genomic adaptation to a
changing environment. Proc. R. Soc. B Biol. Sci. 279, 5048–5057 (2012).
32. Panchy, N., Lehti-Shiu, M. & Shiu, S.-H. Evolution of Gene Duplication in Plants. Plant
Physiol. 171, 2294–2316 (2016).
33. Huo, N. et al. Gene Duplication and Evolution Dynamics in the Homeologous Regions
Harboring Multiple Prolamin and Resistance Gene Families in Hexaploid Wheat. Front.
Plant Sci. 9, (2018).
34. Xu, J.-H. & Messing, J. Organization of the prolamin gene family provides insight into the
evolution of the maize genome and gene duplications in grass species. Proc. Natl. Acad.
Sci. U. S. A. 105, 14330–14335 (2008).
35. Alberti, A. et al. Viral to metazoan marine plankton nucleotide sequences from the Tara
Oceans expedition. Sci. Data 4, 170093 (2017).
36. Engelen S, Aury JM. fastxtend. https://www.genoscope.cns.fr/externe/fastxtend/.
37. Durand, N. C. et al. Juicer Provides a One-Click System for Analyzing Loop-Resolution
Hi-C Experiments. Cell Syst. 3, 95–98 (2016).
38. Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields
chromosome-length scaffolds. Science 356, 92–95 (2017).
39. Pedersen, B. S. & Quinlan, A. R. Mosdepth: quick coverage calculation for genomes and
exomes. Bioinforma. Oxf. Engl. 34, 867–868 (2018).
40. Kent, W. J. BLAT—The BLAST-Like Alignment Tool. Genome Res. 12, 656–664 (2002).
41. Wu, T. D. & Watanabe, C. K. GMAP: a genomic mapping and alignment program for
mRNA and EST sequences. Bioinformatics 21, 1859–1875 (2005).
42. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic
features. Bioinformatics 26, 841–842 (2010).
43. Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome
alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37,
907–915 (2019).
44. Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from
RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).
45. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals
unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol.
28, 511–515 (2010).
46. Tang, H. et al. Synteny and Collinearity in Plant Genomes. Science 320, 486–488
(2008).
47. Zulkower, V. & Rosser, S. DNA Features Viewer, a sequence annotations formatting and
plotting library for Python. bioRxiv 2020.01.09.900589 (2020)
doi:10.1101/2020.01.09.900589.