Publication

Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer

Qian Zhang, Se-Ran Jun, Michael Leuze, David Ussery, Intawat Nookaew. 2017 January 19. Scientific Reports 7, Article number: 40712 (2017)

Abstract

The development of rapid, economical genome sequencing has shed new light on the classification
of viruses. As of October 2016, the National Center for Biotechnology Information (NCBI) database
contained >2 million viral genome sequences and a reference set of ~4000 viral genome sequences
that cover a wide range of known viral families. Whole-genome sequences can be used to improve viral
classification and provide insight into the viral “tree of life”. However, due to the lack of evolutionary
conservation amongst diverse viruses, it is not feasible to build a viral tree of life using traditional
phylogenetic methods based on conserved proteins. In this study, we used an alignment-free method
that uses k-mers as genomic features for a large-scale comparison of complete viral genomes available
in RefSeq. To determine the optimal feature length, k (an essential step in constructing a meaningful
dendrogram), we designed a comprehensive strategy that combines three approaches: (1) cumulative
relative entropy, (2) average number of common features among genomes, and (3) the Shannon
diversity index. This strategy was used to determine k for all 3,905 complete viral genomes in RefSeq.
The resulting dendrogram shows consistency with the viral taxonomy of the ICTV and the Baltimore
classification of viruses.
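
As a rough illustration of two of the three criteria named in the abstract, the sketch below counts overlapping k-mers in toy sequences and reports, for several values of k, the Shannon diversity index and the number of k-mers shared by two sequences. It is not the authors' implementation: the helper names, the toy sequences, the use of nucleotide strings, and the natural-log form of the Shannon index are all assumptions made for this example, and cumulative relative entropy is not attempted.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers in a sequence (hypothetical helper)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over k-mer frequencies.

    Assumes the natural logarithm; the paper's convention may differ.
    """
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def common_features(counts_a, counts_b):
    """Number of distinct k-mers present in both sequences."""
    return len(counts_a.keys() & counts_b.keys())


# Toy sequences for illustration only, not real viral genomes.
genome_a = "ATGCGATACGCTTGA"
genome_b = "ATGCGTTACGCTAGA"

for k in range(2, 6):
    a = kmer_counts(genome_a, k)
    b = kmer_counts(genome_b, k)
    print(k, round(shannon_index(a), 3), common_features(a, b))
```

Scanning a range of k in this way shows the general trade-off such criteria are meant to balance: short k-mers are shared by almost every sequence, while long k-mers are rarely shared at all.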

Citation

Zhang, Q., S.-R. Jun, M. Leuze, D. Ussery and I. Nookaew (2017). "Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer." Scientific Reports 7: 40712.