VITAP: a high precision tool for DNA and RNA viral classification based on meta-omic data

Database of viral reference genomic sequences used by VITAP

Published viral reference genomes were used to build the VITAP-specific database and perform benchmarking. As the high priority of ICTV in viral taxonomy, the taxonomic assignments by VITAP are highly based on viral genomes accepted by ICTV. After carefully evaluating the differences in taxonomic information between various databases, viral genomes from other databases can also be integrated into the VITAP database to enhance the sensitivity of taxonomic assignments. Genomes collected in ICTV VMRs and NCBI RefSeqs (VMR-MSL35, VMR-MSL37, VMR-MSL38, and NCBI RefSeq209) were retrieved and downloaded from GenBank. These included 10,345, 15,230, 16,238, and 13,650 viral genomes/segments from 6388, 10,245, 11,095, and 10,377 viral species respectively. Based on the viral region information provided by ICTV, viral regions that were integrated into host genomes were extracted from corresponding host chromosomes.

The gene calling of viral genomes/segments was performed using two strategies, open-reading-frame (ORF)-based and end-to-end translation. Prodigal (v.2.6) was used to predict putative viral ORFs with default parameters in “meta” mode⁸⁴. For those short sequences that could not be processed by Prodigal, the end-to-end translation strategy was applied to generate six possible reading frames using Seqkit (v.2.5)⁸⁵. The ORFs and end-to-end translated reading frames were merged into a viral reference protein set.

Determination of taxonomic thresholds

Taxon-specific alignment thresholds were calculated for each taxonomic unit based on all-to-all alignment of reference genomes. Unlike the method designed by Lavigne et al., the proportion of shared homologous ORFs within a taxonomic unit was not presupposed⁸⁶. Instead, the alignment-based sequence similarity was quantified and used to determine the taxon thresholds. This is a more flexible approach, as each taxonomic unit is assigned a specific threshold, thus avoiding the previous rigid taxonomic assignment process. Viral reference proteins were self-aligned using DIAMOND (v.0.9) with 1e-5 as the E-value and other default parameters, reporting the top 1000 hits. Each ORF was subsequently given a taxonomic lineage composition based on the taxonomic lineage of the assigned reference genomes. Different ORFs within a target genome might have different taxonomic lineage compositions. The hits for each ORF were then categorized into three types: self-hit, top-hit, and others, assigning different weights to each type. A self-hit is an ORF’s self-alignment, with a weight of 1; top-hit is the second-ranking hit, excluding the self-hit, and all other hits sharing the same taxonomic unit, with a weight of 1.2; others, including all remaining hits, were assigned a weight of 0.8. A second type of weight based on a voting strategy was then defined. Dominant taxonomic units (constituting at least 50% of all hits for an ORF, referring to geNomad) were assigned a weight of 1.2⁴⁹, while the remaining taxonomic units were assigned a weight of 1. Based on the two types of weights mentioned above (${\omega }_{1}$ and ${\omega }_{2}$) and the raw bitscore of each hit of each ORF ($b$), the taxon bitscore (${b}_{t}$) of each hit was calculated:

$${b}_{t}={\omega }_{1}\cdot {\omega }_{2}\cdot b$$

(1)

where ${b}_{t}$ describes a modified bitscore considering different hit types for each alignment.

Each taxonomic unit was scored for each genome. Firstly, the ratio of the number of homologous ORFs (with 1e-5 as the E-value cut-off of BLASTp) is defined. Let ${n}_{{{{{\mathrm{orf}}}}}}$ be the number of ORFs of a genome that fall into a taxonomic unit; let ${N}_{{{{{\mathrm{orf}}}}}}$ be the total number of ORFs of a genome. Based on the calculated ${n}_{{{{{\mathrm{orf}}}}}}$ and ${N}_{{{{{\mathrm{orf}}}}}}$, a parameter is defined by:

$$D={\left(\frac{{n}_{{{{{\mathrm{orf}}}}}}}{{N}_{{{{{\mathrm{orf}}}}}}}\right)}^{2}$$

(2)

where $D$ describes the ratio of the number of homologous ORFs (with 1e-5 as the E-value cut-off of BLASTp) appearing in a genome within a taxonomic unit to its total ORFs. For each unit within each taxonomic hierarchy, a distinct $D$ will be calculated. In other words, each realm-, kingdom-, phylum-, class-, order-, family-, genus-, and species-level unit within the viral taxonomy framework will have its own specific $D$. This ratio is squared to amplify differences. Secondly, the taxonomic scores for each genome are determined. Let $\sum {B}_{{{{{\mathrm{same}}}}}}$ be the sum of ${b}_{t}$ for a genome within a taxonomic unit; let $n$ be the number of hits within this taxonomic unit. Then, the taxonomic score ($T$) for a genome is defined by:

$$T=\, D\cdot \frac{\sum {B}_{{{{{\mathrm{same}}}}}}}{n}$$

(3)

where $T$ describes a quantified parameter that considers two features, including ORF alignment scores, and the proportion of homologous ORFs present in a taxonomic unit relative to the total number of ORFs in the genome.

The taxonomic threshold for a taxonomic unit is determined from the set of taxonomic scores of a reference viral genome. Defining ${T}_{{{{same}}\_}min }$ and ${T}_{{{{diff}}\_}\,max }$: ${T}_{{{{same}}\_}min }$ belongs to a set that contains a series of taxonomic scores, all of which are associated with taxonomic units identical to those of the query genome. The lowest value within this set is defined as ${T}_{{{{same}}\_}min }$. Similarly, ${T}_{{{{diff}}\_}max }$ belongs to another set that consists of taxonomic scores associated with taxonomic units different from those of the query genome. The highest value within this set is defined as ${T}_{{{{diff}}\_}max }$. Based on the calculated ${T}_{{{{same}}\_}min }$ and ${T}_{{{{diff}}\_}max }$, two types of taxonomic score threshold are defined by:

$$\left\{\begin{array}{c}{T}_{c1}=\frac{{T}_{{same}\_min }+{T}_{{diff}\_{max }}}{{2}}\\ {T}_{c2}=\frac{3\cdot {T}_{{same}\_{min }}}{4}\hfill \end{array}\right.$$

(4)

where ${T}_{c1}$ and ${T}_{c2}$ respectively describe the cases where multiple genomes from the same taxonomic unit are present in the database, and the cases where certain taxonomic units in the database contain only a single genome. These equations considered both different taxonomic units with close relationships and relatively independent taxonomic units. If a genome has taxonomic scores in multiple different taxonomic units, the taxonomic score threshold for its bona fide unit is the minimum value among all members that meet the consistency criteria for that unit. If all genomes in a particular unit only have taxonomic scores within that unit, the threshold is set in the upper quartile of the lowest taxonomic score in that unit, ensuring a degree of diversity tolerance.

The taxonomic thresholds were calculated for each taxonomic unit through the algorithm described above. The thresholds were subsequently used to calculate quantitative parameters for the classification of a target genome into taxonomic units at different taxonomic hierarchies. These steps provide the minimum thresholds for the taxonomic validity of each taxonomic unit. These values vary for different taxonomic units and are influenced by the diversity of the reference sequence database.

Optimal taxonomic lineage determination and confidence level assignment

The taxonomic unit assignment of VITAP independently processes in different taxonomic hierarchies (from realm-level to species-level). For annotations of a certain genome, the best-fit taxonomic units belonging to different hierarchies are not always consistent with its real hierarchies. This confusion might have resulted from horizontal gene transfers between different taxonomic units within different taxonomic ranks. Based on a range of taxon scores of a genome, an algorithm to detect all possible taxonomic lineages for that genome and determine one optimal taxonomic lineage was developed. Firstly, based on the genome-taxonomy bipartite graph and taxonomic hierarchy, taxonomic units of a genome were aligned at a range of hierarchies. The genome-taxonomy bipartite graph consists of genome IDs, taxonomic units, and taxonomic scores ($T$), describing the taxonomic score ($T$) of each genome for each taxonomic unit. A genome may have multiple taxonomic unit associations, but the mapping between each genome and each taxonomic unit is unique. This step produced all possible taxonomic lineages for a genome. Secondly, based on pre-built taxonomic thresholds (${T}_{c1}$ and ${T}_{c2}$) and taxonomic score for a genome ($T$), the ratio of each taxonomic unit’s taxon score to its corresponding threshold was calculated:

$${\omega }_{T}=\frac{T}{{T}_{c1}}{{{\rm{or}}}} \, {\omega }_{T}=\frac{T}{{T}_{c2}}$$

(5)

where the ${\omega }_{T}$ was used as the taxon weight (species to realm-level). For a genome, ${\omega }_{T}$ described a quantitative parameter for the likelihood of a genome being attributed to a specific taxonomic unit. Thirdly, the lineage score was calculated based on the array of ${\omega }_{T}$ of a genome. Let ${S}_{1}$ and ${S}_{2}$ be two types of lineage scores; let $I$ be the lowest taxonomic rank at which ${\omega }_{T}$ exceeds 0.3; $i$ is the final terminating taxonomic hierarchy, for which the calculated lineage score is the maximum among all lineage scores; let ${n}_{T}$ be the number of taxon levels included in a lineage. Then:

$$\left\{\begin{array}{c}{S}_{1}=\frac{{\sum }_{i}^{{{{realm}}}}{\omega }_{T}}{{n}_{T}}\,(i={{{species}}},{{{genus}}},{{{family}}},{{{order}}},{{{class}}})\\ {S}_{2}=\max \left(\left\{\frac{{\sum }_{i}^{{{{realm}}}}{\omega }_{T}}{{n}_{T}}\right\}\right)\,(i={{{species}}},{{{genus}}},{{{family}}},{{{order}}},{{{class}}})\end{array}\right.$$

(6)

In detail, for lineages where all taxon scores are less than 0.3, the determination of ${S}_{2}$ involves a series of calculations. Starting from the realm, the taxon scores of lower ranks were progressively summed and the average calculated, resulting in a set of lineage scores $\left(\left\{\frac{{\sum }_{i}^{{{{{\mathrm{realm}}}}}}{\omega }_{T}}{{n}_{T}}\right\}\right)$. The highest lineage score is designated as ${S}_{2}$; the corresponding lowest rank is identified as the optimal terminating hierarchy. Based on the current database and taxonomic frameworks, this step determined the lowest classifiable taxonomic hierarchy for a genome. Fourthly, the three confidence levels were assigned based on lineage scores: high-confidence for results with lineage score ≥ 0.6; medium-confidence for results with lineage score ≥ 0.3; low-confidence for the rest of other results. In terms of the significance of varying lineage scores, a lineage score of 1 indicates that a genome meets the threshold of the current taxonomic framework (a high-confidence setting of 0.6 is used to maintain a certain tolerance level). When the lineage score is less than 1, it signifies that a genome does not meet the threshold of the existing taxonomic framework but is, to some extent, close to certain taxonomic lineages (the lineage score describes this degree). Fifthly, the result with the highest lineage score with certain conditions for each genome was selected as the optimal taxonomic lineage (${l}^{*}$). Let $L$ be a set containing all possible taxonomic lineages; let $\left|{n}_{T}\right|$ be the number of weights (taxonomic hierarchies) in the set ${n}_{T}$ for $l$. Then, for each possible lineage ($l$) from the last weight to the first weight, each lineage score ${S}_{3}$:

$${S}_{3}=\frac{{\sum }_{i}^{{{{realm}}}}{\omega }_{T}}{{n}_{T}}(i={{{species}}},{{{genus}}},{{{family}}},{{{order}}},{{{class}}})$$

(7)

its ratio to ${S}_{2}$ was calculated and ${l}^{*}$ was determined:

$${l}{*}={\arg }\,{\max }_{l \in L}\left\{\left|{n}_{T}\right|:\frac{{S}_{3}}{{S}_{2}}\ge 0.6\right\}$$

(8)

In summary, this equation aims to find an $l$ in the set $L$ such that, under the condition $\frac{{S}_{3}}{{S}_{2}}\ge 0.6$, the value of $\left|{n}_{T}\right|$ is maximized. This allows VITAP to capture the condition where the optimal taxonomic lineage has the most number of weights (taxonomic hierarchies) and its average weight is at least 0.6 of the maximum weight of all possible taxonomic lineages.

The generalization ability assessment of VITAP

The 16,238 viral genomic sequences collected in VMR-MSL38 were used to characterize the generalization ability of VITAP. 70% of the sequences from the set were selected as the training set, and the remaining sequences as the test set. The sequences in the test set were sliced into fragments with diverse lengths (1-kb, 5-kb, 10-kb, 20-kb, 30-kb). The training set was used to build a database utilized by VITAP. Taxonomic assignments of the test set were performed by VITAP based on the built database in the previous step. The model’s performance on the test set was characterized using four parameters for different taxonomic levels: accuracy, precision, recall, F1 score, and annotation rate. Let ${{{{\mathrm{TP}}}}}$, ${{{{\mathrm{TN}}}}}$, ${{{{\mathrm{FP}}}}}$, ${{{{\mathrm{FN}}}}}$ be numbers of true positive, true negative, false positive, and false negative, respectively. Then:

$${{{accuracy}}}=\,\frac{{{{TP}}}+{{{TN}}}}{{{{TP}}}+{{{TN}}}+{{{FP}}}+{{{FN}}}}$$

(9)

$${{{precision}}}=\,\frac{{{{TP}}}}{{{{TP}}}+{{{FP}}}}$$

(10)

$${{{recall}}}=\,\frac{{{{TP}}}}{{{{TP}}}+{{{FN}}}}$$

(11)

$$F1=\,\frac{2\times {{{precision}}}\times {{{recall}}}}{{{{precision}}}+{{{recall}}}}$$

(12)

$${{{annotation}}\; {{rate}}}=\,\frac{{{{number}}\; {{of}}\; {{contigs}}\; {{with}}\; {{taxonomic}}\; {{assignments}}}\,}{{{{number}}\; {{of}}\; {{contigs}}}}$$

(13)

where TP refers to the number of viral sequences correctly assigned to the bona fide taxonomic units; TN refers to the number of viral sequences that do not belong to given taxonomic units and were also correctly excluded by the VITAP; FP refers to the number of viral sequences that were incorrectly assigned to a taxonomic unit; FN refers to the number of viral sequences that should have been assigned to a taxonomic unit but were incorrectly excluded by the VITAP and other pipelines. These performance matrices were calculated for family and genus, respectively.

The model is not expected to provide taxonomic units beyond those present in the training set. On the other hand, unlike traditional binary and multi-class supervised learning, viral taxonomic assignment task is a complex hierarchical classification rather than a flat classification. Therefore, applying a strategy to describe the model’s generalization ability objectively was considered: the statistics were performed only on the taxonomic units appearing in the training and test sets. For taxonomic units missing from the training set, if the related sequences were marked by VITAP as “-“ (indicating that they cannot be classified) rather than being assigned to incorrect taxonomic units, these sequences were also considered true positive results. These instances should not be considered failures of VITAP and other pipelines. These steps were repeated ten times independently (tenfold validation) to ensure the stability of the generalization ability assessment.

Benchmarking of VITAP with other pipelines on viral sequences

The performance of VITAP and other state-of-the-art pipelines was evaluated based on a simulated virome generated from ICTV reference genomes. To enable a comparison between different pipelines, three distinct VITAP databases were constructed based on VMR-MSL35 (compared to VPF-Class), VMR-MSL37 (compared to vConTACT2, PhaGCN2, and geNomad), and NCBI RefSeq209 (compared to CAT) to ensure that VITAP and the other pipelines used the same (or contemporaneous) databases. First, we sliced the viral genomes from each database corresponding to each pipeline into 1-kb, 5-kb, 20-kb, and 30-kb sequences and performed taxonomic assignments to evaluate how effectively each pipeline utilized its respective database. Second, two “new” viral datasets (viral RefSeqs released after 2022.01 and after 2020.01, respectively) were established as test datasets based on the release dates of these databases. Sequences of different lengths (1-kb, 5-kb, 10-kb, 20-kb, and 30-kb) from two “new” viruses datasets were generated by slicing genomes, using fragment lengths as the window size and half the fragment length as the step size. The sequences generated in this step exhibited at least 50% overlap in their regions and covered the entire genomes. The database with an older version (VMR-MSL35) corresponds to a larger number of “new” test sequences, while the database with newer versions (VMR-MSL37 and NCBI RefSeqs209) corresponds to fewer “new” test sequences. Consequently, pipelines tested using VMR-MSL35 as the database (e.g., VPF-Class and VITAP) include a greater number of sequences compared to those tested using VMR-MSL37/RefSeqs209 as the database (e.g., vConTACT2, CAT, PhaGCN2, geNomad, and VITAP).

Datasets with diverse lengths were subject to previously published pipelines to perform taxonomic assignments. For vConTACT2, the genomes of VMR-MSL37 were merged with the “new” virus dataset to generate VCs (with “–db none” option and other default parameters). As stated by vConTACT2, the extent to which vConTACT2 can reproduce a situation where one VC and one sub-VC represent one family and genus was monitored, respectively. We strictly followed vConTACT2’s official guideline: “If the user genome is in the same VC but not the same subcluster as a reference, then it’s highly likely the two genomes are related at roughly genus-subfamily level.”⁴⁰. Therefore, given that a VC may represent a genus or a subfamily, a VC must have purity at the family-level, meaning that all members within the same VC must belong to the same family. Otherwise, all members within that VC were considered false negatives that cannot be efficiently classified. For genus, referring to “If the user genome is within the same VC subcluster as a reference genome, then there is a very high probability that the user genome is part of the same genus,” a sub-VC produced by vConTACT2 represents a genus. All members within the same subVC must belong to the same genus. Otherwise, all members within that subVC were considered false negatives that cannot be efficiently classified. For CAT (with NCBI RefSeq209 database)⁴⁶, fragments of the “new” viral dataset (released after 2022.01) were used to perform taxonomic assignments. For VPF-Class (with VMR-MSL35-contemporary database)⁴⁴, fragments of “new” viral datasets (released after 2020.01) were used to perform taxonomic assignments. At that time, Caudovirales had not yet been abolished. For PhaGCN2 (with VMR-MSL37 database)⁴⁸, fragments of the new viral dataset (released after 2022.01) were used to perform taxonomic assignments. Only fragment sets with lengths over 10-kb can be used in taxonomic assignments of PhaGCN2. Shorter sequences cannot be recognized by the neural network module of PhaGCN2. For geNomad (with VMR-MSL37 database)⁴⁵, fragments of “new” viruses set (released after 2022.01) were used to perform taxonomic assignments. The consistency between assigned taxonomic units and bona fide taxonomic units was assessed by accuracy, precision, recall, F1 score, and annotation rate. The calculation approach is the same as the generalization ability assessment.

Comparison of taxonomic annotation on IMG/VR (v.4)

The taxonomic annotation of 2,631,412 vOTUs from the IMG/VR (v.4) database was performed based on the approach described above⁴⁵. As the GTA-like genomes have been assigned as novel viral families in the current ICTV viral metadata resource (VMR-MSL38, as of September 2023), it was expected that viral genomes belonging to these novel families would be found. The IMG/VR (v.4) genome database was downloaded from the JGI genome portal ( For DNA viruses, a vOTU-level cluster was determined if its internal members had 95% average nucleotide identity (ANI) and 85% aligned fraction coverage (AF) across the genome⁴⁵. For RNA viruses, a vOTU-level cluster was determined if its internal members had 90% ANI and 80% AF for RNA-dependent RNA polymerase (RdRp)⁶⁷. The annotation of vOTUs by VITAP was compared with their original taxonomic units to assess the consistency.

For viral contigs that showed differences in family-level taxonomic assignments between two pipelines (2,830 inconsistencies and 41,479 newly added) (Supplementary Data 6, 7), their ORFs were aligned with the NR database (as 2023.12.26), using 1e-5 as the E-value cut-off, and reporting the top 1,000 hits. Based on the alignment result and GenBank taxonomy database, let ${n}_{{{{{\mathrm{orf}}}}}}$ be the number of ORFs of a genome that fell into a family; let ${N}_{{{{{\mathrm{orf}}}}}}$ be the total number of ORFs of a genome, the participation ratio ($p$) of each viral contig at each family-level was calculated:

$$p=\frac{{n}_{{{{orf}}}}}{{N}_{{{{orf}}}}}$$

(14)

where $p$ is the compositional percentage of a viral sequence at the protein level for a given family. Subsequently, the consistency with the results of VITAP or IMG/VR based on the family with the highest $p$ was evaluated. This voting-based evaluation method does not rely on the databases of VITAP and IMG/VR. Although it is influenced by the species diversity present in public databases, and so the results may be relatively objective.

Building the database from diverse genome sources

To enhance classification sensitivity, 4807 and 785 representative genomes from NCBI RefSeq209 and IMG/VR (v. 4) were incorporated into the reference sequence of VITAP, respectively. All 13,650 viral genomes from NCBI RefSeq209 were included. For genomes from IMG/VR (v.4), only viral taxonomic units (vOTUs) labeled as high-confidence and high-quality (completeness >0.9) from IMG/VR workflow, as well as with genus as lowest taxonomic hierarchy, were included (n = 10,335)⁴⁵. Firstly, viral genomes from NCBI RefSeq209 were aligned to the VMR-MSL37 database using MMSeqs2 with parameters 95 and 100% as the alignment identity and coverage, respectively. This step led to 6840 non-redundant genomes being compared to the VMR-MSL37 database. Secondly, vOTUs from IMG/VR (v.4) were aligned to the VMR-MSL37 genome database and 6,840 non-redundant viral RefSeqs, respectively, leading to 1234 extra non-redundant genomes. Thirdly, taxonomic assignments were performed on these 8074 extra non-redundant genomes by VMR-MSL37-based VITAP. Only genomes where the VITAP-derived taxonomic information is consistent with the original taxonomic information to a certain level were retained: 3627 genomes with consistency down to genus; 909 genomes with consistency down to family (received no genus level units from VITAP); 271 genomes with consistency down to order (received no family and genus level units from VITAP); 501 genomes with consistency down to class (received no order, family and genus level units from VITAP); 284 genomes without any taxonomic assignment by VITAP. The 5592 genomes as final extra reference genomes were combined with the VMR-MSL37 database. These extra genomes and genomes from the VMR-MSL37 database were merged to calculate thresholds of all taxonomic units and build the database utilized by VITAP.

Taxonomic re-annotation of the deep ocean virome

VITAP and five other pipelines (vConTACT2, CAT, VPF-Class, PhaGCN2, and geNomad) were used to perform viral taxonomic assignments of deep ocean viromes datasets derived from four studies. The taxonomic databases used by these pipelines include VMR-MSL37/RefSeq209/IMGVR4-hybrid database (VITAP), VMR-MSL37 (vConTACT2 and VITAP), RefSeq209 (CAT), and pipeline-integrated databases (VPF-Class, PhaGCN2, and geNomad). Four viromes associated with deep sea and corresponding metadata were retrieved and downloaded from public databases. These viromes were derived from oceanic trenches and cold seep sediments^4,28,50,51. All viral contigs were performed taxonomic assignments through VITAP using the VMR-MSL37/RefSeq209/IMGVR4-hybrid database and stand-alone VMR-MSL37, respectively. The prokaryotic viral genomes with lengths over 5 kb from the top ten viral families were used to build a genome-content similarity network by vConTACT2⁴⁰ and visualized by Cytoscape⁸⁷. As the contigs assigned to Herelleviridae have no linkages with Herelleviridae RefSeqs, these contigs were further confirmed by a viral proteome tree using ViPTree (v.4)⁴¹. The proteins from two assigned giant viral families (Mimiviridae, and Phycodnaviridae) were subject to Diamond BLASTp against viral proteins from RefSeq223. The average number of hits for each target viral protein was calculated across four different viral groups (Phycodnaviridae in RefSeqs, Mimiviridae in RefSeqs, Head-tail viruses in RefSeqs, and Filamentous phages in RefSeqs).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.