GFF files corresponding to the respective gene models (PGSC v4.04, ITAG v1.0) were retrieved from the Spud DB potato genomics resource [13]. The two models (39,431 PGSC and 35,004 ITAG genes) were then compared on the basis of their exact chromosomal location and orientation in order to create the most complete set of genes from both predicted genome models. Several combinations arose from the merge (Fig. 1a): those for which no action was required (singletons, i.e. sole PGSC or ITAG genes not covering any other genes); 1-to-1 or 1-to-2 combinations between PGSC and ITAG genes, which were solved programmatically; and lastly, combinations of more than 3 genes in various combination types, which continued on to manual curation. The latter (318 gene clusters; example in Fig. 1b) were considered to be non-trivial merge examples (overlapping genes in the two models, or multiple genes in PGSC corresponding to a single gene in ITAG). This resulted in a merged DM genome GFF3 file with 49,322 chromosome position specific sequences, of which 31,442 were assigned ITAG gene IDs and 17,880 PGSC gene IDs [14].
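The merge categories described above can be illustrated with a minimal sketch. It is not the authors' script: the `Gene` record, the simple coordinate-overlap test, and the choice to send same-source overlaps to manual curation are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str
    source: str   # "PGSC" or "ITAG"
    chrom: str
    start: int    # 1-based, inclusive (GFF convention)
    end: int
    strand: str   # "+" or "-"


def cluster_overlaps(genes):
    """Group genes that overlap on the same chromosome and strand."""
    by_locus = defaultdict(list)
    for gene in genes:
        by_locus[(gene.chrom, gene.strand)].append(gene)
    clusters = []
    for group in by_locus.values():
        group.sort(key=lambda g: g.start)
        current, current_end = [group[0]], group[0].end
        for gene in group[1:]:
            if gene.start <= current_end:   # overlaps the running cluster
                current.append(gene)
                current_end = max(current_end, gene.end)
            else:
                clusters.append(current)
                current, current_end = [gene], gene.end
        clusters.append(current)
    return clusters


def classify(cluster):
    """Map a cluster to one of the three outcomes named in the text."""
    if len(cluster) == 1:
        return "singleton"                  # keep gene as is
    sources = {gene.source for gene in cluster}
    if len(cluster) <= 3 and sources == {"PGSC", "ITAG"}:
        return "programmatic"               # 1-to-1 or 1-to-2 combination
    # Assumption: anything larger, or overlaps within one source, is curated by hand.
    return "manual_curation"
```

Sorting by start position and tracking the running cluster end keeps the grouping linear after the sort, which matters for gene sets of this size (tens of thousands of records).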

Merging DM Phureja PGSC and ITAG gene models. (a) Decision tree for the merge of both genome models, with 6 possible outcomes: singleton genes (‘keep gene as is’), manual curation (‘gene cluster to manual curation’) and programmatic solution (all remaining 4 options). Green solid lines represent a satisfied condition (Y: Yes), dashed red lines an unsatisfied condition (N: No). (b) Example of manual curation in the merged DM genome GTF, region visualisation (chr12:11405699..11418575) in the Spud DB Genome Browser [13]. ITAG-defined Sotub12g014200.1.1 spans three PGSC-defined coding sequences (PGSC0003DMT400005728, PGSC0003DMT400005745 and PGSC0003DMT400005726). Below the gene models, RNA sequencing tracks are shown, indicating how these genes are expressed in various plant organs. In this particular case, Sotub12g014200.1.1 was preferred because the RNA-Seq evidence was in concordance with it, and there was no evidence for PGSC0003DMT400005745.

The complete bioinformatic pipeline is outlined in Fig. 2. Sequence quality assessment of raw RNA-Seq data, quality trimming, and removal of adapter sequences and polyA tails were performed using CLC Genomics Workbench v6.5-v10.0.1 (Qiagen) with the maximum error probability threshold set to 0.01 (Phred quality score 20) and no ambiguous nucleotides allowed. The minimal trimmed sequence length allowed was set to 15 bp and the maximum to 1 kb. Orphaned reads were re-assigned as single-end (SE) reads. Processed reads were pooled into cultivar datasets as properly paired-end (PE) reads or SE reads per cultivar per sequencing platform. For the Velvet assembler, SOLiD reads were converted into double encoding reads using the Perl script “” [15]. To reduce the size of the cv. Désirée and cv. Rywal datasets, digital normalization was performed using khmer from bbmap suite v37.68 [16] prior to conducting de novo assembly using Velvet and rnaSPAdes.
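The correspondence between the error probability threshold and the Phred score quoted above follows from the definition Q = −10·log10(p). The sketch below shows that conversion together with a simple post-trimming read filter. The function names are hypothetical, 1 kb is taken as 1000 bp, and the filter only mirrors the stated length and ambiguity criteria, not the trimming algorithm implemented in CLC Genomics Workbench.

```python
import math


def phred_from_error(p):
    """Phred quality score for a base-call error probability p."""
    return -10 * math.log10(p)


def error_from_phred(q):
    """Base-call error probability for a Phred quality score q."""
    return 10 ** (-q / 10)


def passes_read_filters(seq, min_len=15, max_len=1000):
    """Keep trimmed reads of 15 bp to 1 kb that contain no ambiguous bases."""
    return min_len <= len(seq) <= max_len and "N" not in seq.upper()
```

With these definitions, an error probability of 0.01 corresponds to a Phred score of 20, as stated in the text.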

Bioinformatics pipeline for generation of potato transcriptomes. Software used in specific steps is given in bold. Input datasets (sequence reads) and output data (transcriptomes) are depicted as blue cylinders. Data upload steps to public repositories are coloured in orange. Abbreviations: SRA – NCBI Sequence Read Archive, PGSC – Potato Genome Sequencing Consortium, ITAG – international Tomato Annotation Group, CLC – CLC Genomics Workbench, PacBio – Pacific Biosciences Iso-Seq sequencing, Tr – transcriptome, StPanTr – potato pan-transcriptome, tr2aacds – “transcript to amino acid coding sequence” Perl script from the EvidentialGene pipeline.

PacBio long reads were processed for each sample independently using Iso-Seq 3 analysis software (Pacific Biosciences). Briefly, the pipeline included Circular Consensus Sequence (CCS) generation, full-length read identification (“classify” step), isoform clustering (“cluster” step) and a “polishing” step using the Arrow consensus algorithm. Only high-quality full-length PacBio isoforms were used as input for further steps.

Cupcake ToFU scripts [17] were used to further refine the Iso-Seq transcript set. Redundant PacBio isoforms were collapsed with “” and counts were obtained with “”. Isoforms with fewer than two supporting counts were filtered using “” and 5′-degraded isoforms were filtered using “”. Isoforms from the two samples were combined into one non-redundant Iso-Seq transcript set using “”.

Short reads were de novo assembled using Trinity v.r2013-02-25 [18], Velvet/Oases v. 1.2.10 [19], rnaSPAdes v.3.11.1 [20] and CLC Genomics Workbench v8.5.4-v10.1.1 (Qiagen). Illumina and SOLiD reads were assembled separately. For CLC Genomics de novo assemblies, combinations of three bubble sizes and 14 k-mer sizes were tested on the PW363 Illumina dataset. Varying the bubble size did not influence the assembly statistics much (Supplementary Fig. 2), therefore we decided to use a length of 85 bp for the Illumina datasets of the other two cultivars. Bubble size and k-mer length parameters used for Velvet and CLC are given in Table 1. The scaffolding option in CLC and Velvet was disabled. More detailed information per assembly is provided in Supplementary Table 2 [21].

739 million Désirée short reads were assembled into 3,765,661 potential transcripts, 284 million PW363 short reads were assembled into 6,022,291 potential transcripts, and 710 million Rywal short reads and 1.4 million Rywal PacBio sequences were assembled into 1,912,821 potential transcripts. While generating several transcriptomes from various input data and various parameter combinations increases the likelihood of capturing and accurately assembling transcripts [22], the resulting over-assemblies require redundancy reduction without loss of information, as well as error removal. For each cultivar, all transcriptome assemblies were combined into a cultivar-specific over-assembly and initially filtered with the tr2aacds pipeline (part of EvidentialGene v2016.07.11 [12]), which consists of four steps. First, all perfect redundant nucleotide sequences are removed using fastnrdb, part of the exonerate package [23], leaving only the transcript with the longest coding region. Next, all perfect fragments of the remaining transcripts are removed using cd-hit [24]. These first two steps are important in reducing transcriptome redundancy, as true transcripts are expected to be assembled independently by several of the assembly methods. Keeping the transcripts with the longest and most complete coding region helps eliminate chimeric and misassembled transcripts, as these errors tend to occur more often in UTR regions or in a manner that causes frameshifts and long, truncated coding regions [12].
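The first two tr2aacds steps described above (removal of perfect duplicates, then removal of perfect fragments) can be sketched in a few lines. This is a simplified illustration, not the fastnrdb or cd-hit implementation: the function name is hypothetical, only the forward strand is considered, the fragment search is a naive quadratic substring test, and keeping the name that sorts first among identical sequences is an assumption.

```python
def collapse_cds(cds_by_name):
    """Remove perfect duplicates, then perfect fragments, from a CDS set.

    cds_by_name maps a transcript name to its coding sequence.
    Returns a dict of the retained names and sequences.
    """
    # Step 1: identical sequences collapse to one entry;
    # the name that sorts first is kept (assumption for this sketch).
    unique = {}
    for name in sorted(cds_by_name):
        unique.setdefault(cds_by_name[name], name)

    # Step 2: process longest first, dropping any sequence that is a
    # perfect substring of a sequence already retained.
    retained = []
    for seq in sorted(unique, key=len, reverse=True):
        if not any(seq in longer for longer in retained):
            retained.append(seq)
    return {unique[seq]: seq for seq in retained}
```

Processing sequences from longest to shortest guarantees that a fragment is always compared against the full-length sequence it could belong to.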

The third and fourth steps of the tr2aacds pipeline select transcripts that are likely isoforms, alleles, or other variations found at a single locus. This is done through amino acid sequence clustering, which identifies true transcripts that differ only in silent mutations, and through reciprocal BLAST, which detects high-identity exon-sized alignments (likely isoforms). A tag is assigned to every transcript, providing detailed information on why it was dropped (e.g. perfect fragment, perfect duplicate, very high similarity, …) or why it was marked as an alternative (and which sequence it is an alternative form of). The final output of the tr2aacds pipeline consists of three sets of data – a non-redundant set of representative sequences (i.e. main set), a set of true alternatives mapped to the representative set (i.e. alt set), and a discarded set (i.e. drop set) of redundant sequences. It is important to note that not all discarded sequences are of poor quality or incorrect – many of them are discarded due to full or partial redundancy.
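The idea behind the amino acid clustering step, that coding sequences differing only in silent mutations translate to the same protein, can be shown with a minimal sketch. It groups sequences by exact translation using the standard genetic code. The actual pipeline clusters by similarity rather than exact identity, so this is an illustration of the principle under that simplifying assumption, with hypothetical function names.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate an in-frame coding sequence; unknown codons become 'X'."""
    cds = cds.upper()
    codons = (cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def group_by_protein(cds_by_name):
    """Group transcript names whose coding sequences encode the same protein."""
    groups = defaultdict(list)
    for name, cds in cds_by_name.items():
        groups[translate(cds)].append(name)
    return dict(groups)
```

Two transcripts that land in the same group but have different nucleotide sequences differ only in synonymous positions.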

Representative and alternative sets (termed okay sets by EvidentialGene) were merged into initial cultivar reference transcriptomes. As tr2aacds uses only internal evidence for data curation, these were then subjected to external-evidence assembly validation, filtering and annotation steps (Fig. 2). The de novo cultivar-specific transcript sets were first mapped to the DM reference genome by STARlong 2.6.1d [25] using parameters optimized for de novo transcriptome datasets (all scripts are deposited at the FAIRDOMHub project home page [26]). Aligned transcripts were analysed with MatchAnnot to identify transcripts that match the PGSC or ITAG gene models. Domains were assigned to the polypeptide dataset using the InterProScan software package v5.37-71.0 [27]. For all transcripts and coding sequences, annotations were generated with DIAMOND v0.9.24.125 [28] by querying databases retrieved from UniProt (E-value cutoff 10−0.5, and alignment coverage of both the query transcript/CDS and the target sequence higher than or equal to 50%; custom databases: Solanum tuberosum, Solanaceae, plants). Initially assembled transcriptomes were also screened for nucleic acid sequences that may be of vector origin (vector segment contamination) using the VecScreen plus taxonomy program v.0.16 [29] against the NCBI UniVec Database. Potential biological and artificial contamination was identified in up to 3.3% of sequences per cultivar, when also counting cases where potential contaminants covered less than 1% of the sequence (number of sequences with strong, moderate and weak proof of contamination as follows: 182, 547 and 10,509 for Désirée; 48, 228 and 7,877 for PW363; 169, 179 and 4,103 for Rywal). The results from MatchAnnot, InterProScan and DIAMOND were used as biological evidence in further filtering by in-house R scripts. Transcripts that neither mapped to the genome nor had any significant hits in either InterPro or UniProt were dropped from further analysis to obtain higher reliability of the constructed transcriptomes [30,31,32]. Pajek v5.08 [33], in-house scripts, and cdhit-2d from the CD-HIT package v4.6 [24] were used to re-assign post-filtering representative and alternative classes and to obtain finalised cultivar-specific transcriptomes.
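The evidence-based filtering rule described above can be sketched as follows. The authors used in-house R scripts that are not reproduced here. The Python function names, the input representation (plain collections of transcript IDs) and the coverage calculation are assumptions for illustration, and the E-value cutoff is deliberately left as a required argument.

```python
def passes_alignment_filter(evalue, aligned_len, query_len, target_len,
                            evalue_cutoff, min_coverage=0.5):
    """Accept a hit if it is significant and covers at least 50% of both
    the query and the target (all lengths in the same units)."""
    query_cov = aligned_len / query_len
    target_cov = aligned_len / target_len
    return (evalue <= evalue_cutoff
            and query_cov >= min_coverage
            and target_cov >= min_coverage)


def keep_supported(transcript_ids, genome_matched, interpro_hits, uniprot_hits):
    """Keep transcripts backed by at least one line of external evidence:
    a genome/gene model match, an InterPro domain, or a UniProt hit."""
    supported = set(genome_matched) | set(interpro_hits) | set(uniprot_hits)
    return [t for t in transcript_ids if t in supported]
```

A transcript is dropped only when all three evidence sources are silent, which matches the "neither mapped nor had any significant hits" wording.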

The whole redundancy removal procedure reduced the initial transcriptome assemblies 18-fold for Désirée, 38-fold for Rywal, and 24-fold for PW363. Completeness of each initial de novo assembly through to the cultivar-specific transcriptome was estimated with BUSCO (Supplementary Figs. 1–3).

Individual contributions of the various assembly methods were investigated in light of what contributed to the final, cleaned cultivar transcriptomes. SOLiD assemblies (Supplementary Fig. 1: CLCdnDe1, CLCdnDe8, VdnDe8-10), produced by either the CLC or the Velvet/Oases pipeline, contributed least to the transcriptomes, which can mostly be attributed to the short length of the input sequences. Interestingly, increasing k-mer size in the CLC pipeline for Illumina assemblies produced more complete assemblies according to BUSCO score, and more transcripts were selected for the initial transcriptome (Supplementary Fig. 1: CLCdnDe1-7, CLCdnDe9-14). On the contrary, increasing k-mer length in the Velvet/Oases pipeline led to transcripts that were less favoured by the redundancy removal procedure (Supplementary Fig. 1: VdnDe1-7). The Trinity assembly was comparable to the high k-mer CLC assemblies in transcriptome contribution and BUSCO score (Supplementary Fig. 1). It might seem that the PacBio Iso-Seq transcripts contributed less than expected to the cv. Rywal transcriptome (Supplementary Fig. 3); however, it should be noted that a large number of PacBio transcripts were assigned to the EvidentialGene drop set because they had perfect or near-perfect CDS identity with transcripts assembled by CLC. The EvidentialGene pipeline also prioritised CLC-assembled transcripts over PacBio transcripts because the redundancy removal algorithm reorders the near-perfect duplicates by transcript name and retains only the first transcript listed (Supplementary Table 6 [34]).

While the PGSC gene model defined transcripts as well as coding sequences, the ITAG gene model defined only coding sequences. Therefore, the potato pan-transcriptome construction was conducted at the level of CDS.

Cultivar-specific representative coding sequences (57,943 of Désirée, 43,883 of PW363 and 36,336 of Rywal) were combined with coding sequences from the merged Phureja DM gene models (17,880 and 31,442 non-redundant PGSC and ITAG genes, respectively) and subjected to the cdhit-est [24] algorithm (global sequence identity threshold 90%, alignment coverage for the shorter sequence 75%, bandwidth of alignment 51 nt and word length of 9) to create the potato pan-transcriptome. Sequences that did not cluster using cdhit-est were divided into tetraploid and DM datasets and subjected to the cdhit-2d [24] algorithm (local sequence identity threshold 90%, alignment coverage for the shorter sequence 45%, bandwidth of alignment 45 nt and word length of 5).
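The clustering parameters listed above plausibly map onto CD-HIT command-line options as sketched below. The mapping (`-c` identity, `-G` global/local identity, `-aS` coverage of the shorter sequence, `-b` band width, `-n` word length) is an interpretation of the description, not the authors' recorded command. The executable names and all file names are hypothetical placeholders, and the helper only builds the argument list without running anything.

```python
def cdhit_command(program, inputs, output, identity, global_identity,
                  shorter_coverage, band_width, word_length):
    """Build a CD-HIT argument list; a second input is passed as -i2."""
    cmd = [program, "-i", inputs[0]]
    if len(inputs) == 2:
        cmd += ["-i2", inputs[1]]
    cmd += [
        "-o", output,
        "-c", str(identity),                  # sequence identity threshold
        "-G", "1" if global_identity else "0",  # global (1) or local (0) identity
        "-aS", str(shorter_coverage),         # coverage of the shorter sequence
        "-b", str(band_width),                # band width of alignment
        "-n", str(word_length),               # word length
    ]
    return cmd


# First pass: cluster all coding sequences together (file names are placeholders).
est_cmd = cdhit_command("cd-hit-est", ["all_cds.fasta"], "pan_clusters",
                        0.9, True, 0.75, 51, 9)

# Second pass: compare unclustered tetraploid sequences against DM sequences.
twod_cmd = cdhit_command("cd-hit-2d", ["dm_unclustered.fasta",
                                       "tetraploid_unclustered.fasta"],
                         "second_pass", 0.9, False, 0.45, 45, 5)
```

Either list could be handed to `subprocess.run` once CD-HIT is installed; keeping command construction separate from execution makes the parameter choices easy to record alongside the results.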

Sequences that are shared by the DM merged gene model and the de novo assembled cultivar-specific transcriptomes were designated as “core” transcripts, and sequences that were assembled in only one transcriptome were designated “genotype-specific”. The entire pan-transcriptome includes 96,886 representative, non-redundant transcripts and 90,618 alternative sequences (covering alternative splice forms, allelic isoforms and partial transcripts) for those loci (Fig. 3, Supplementary Fig. 4, Supplementary Table 7 [35]). The core subset of the pan-transcriptome contains 68,708 sequences, among which 12% are partial sequences.
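The designation of clusters can be sketched as a simple membership test. The genotype labels, the cluster representation as (genotype, sequence ID) pairs, and the intermediate "shared" label are assumptions for illustration. "Core" is taken here to mean presence in all four genotypes, which is one reading of the definition, so the counts reported in the text should not be expected to follow from this sketch.

```python
GENOTYPES = ("DM", "Desiree", "PW363", "Rywal")


def classify_cluster(members):
    """Label a cluster by the genotypes that contribute sequences to it.

    members is an iterable of (genotype, sequence_id) pairs.
    """
    present = {genotype for genotype, _ in members}
    if len(present) == 1:
        return "genotype-specific"   # assembled in only one transcriptome
    if present == set(GENOTYPES):
        return "core"                # assumption: present in every genotype
    return "shared"                  # present in some, but not all, genotypes
```

For example, a cluster containing one DM gene and one Rywal transcript would be labelled "shared" under this reading.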

Structure of the potato pan-transcriptome. Stacked bar plot showing the overlap of paralogue groups in cultivar-specific transcriptomes and the merged Phureja DM gene model. Only representative and alternative transcripts of the pan-transcriptome are counted (i.e. cultivar representative sequences), while other cultivar alternative transcripts are disregarded. For Phureja DM, the merged ITAG and PGSC DM gene models were counted. DM and at least one Group Tuberosum: sequences shared by Phureja DM and at least one tetraploid genotype; core: sequences shared among all genotypes in the pan-transcriptome.

Polyploid crop pan-genomes generally consist of many cultivar-specific genes [36]. Therefore we included all genotype-specific sequences in our potato pan-transcriptome (Fig. 3, Supplementary Fig. 4). This subset contains 64,529 sequences, among which 13% are partial [35]. Genotype-specific transcripts are generally shorter than the core transcripts; however, they do not differ much in the percentage of complete transcripts.

