A GenomeNet resource for virus-host interaction network analyses
1Institute for Chemical Research, Kyoto University, 2SGI Japan, Ltd., 3Aix Marseille Université, CNRS
Viruses infect a wide range of cellular organisms, from higher animals to tiny prokaryotes. Given the rapid increase in sequenced genomes of viruses infecting diverse hosts, it is becoming increasingly important to study viruses from the wider taxonomic perspective of a global interaction network of both viruses and hosts. In such ecological or evolutionary studies of viruses, the taxonomic and genomic information of both viruses and hosts is often essential. Such information makes it possible to investigate the correlation in nucleotide/codon composition between viral and host genomes, to reveal their co-evolutionary scenarios, or to detect their genetic interactions through horizontal gene transfer. However, it is currently difficult to obtain comprehensive, machine-readable information on viruses and their hosts from public databases. Some databases provide such information but cover only a subset of known viruses. For example, RefSeq stores host information for only 62% of the sequenced viral genome entries, and UniProt for only 20% of sequenced viruses. Furthermore, RefSeq provides host information in the form of scientific names (e.g., Homo sapiens) or common names (e.g., human), without specifying machine-readable NCBI taxonomic identifiers (TaxIDs). By mining existing databases, we established a simple web resource that links taxonomic data between complete viral genomes and their hosts. We retrieved host information, which is represented in different formats, from the RefSeq database and assigned TaxIDs, enabling access to taxon-specific data and to related entries in other biological databases. When no host information leading to a TaxID was available, we extracted the information through literature surveys. The resulting virus-host information resource, covering over 6,000 complete viral genomes, will be useful for the systematic analysis of virus and host genome data, including data from environmental sequencing efforts.
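The TaxID assignment step described above can be illustrated with a minimal sketch. This is not the resource's actual pipeline: the function names (`normalize`, `assign_taxid`, `annotate`), the tiny lookup table, and the example records are all hypothetical, and a real table would presumably be derived from the full NCBI Taxonomy names data. The sketch only shows the general idea of mapping free-text host names (scientific or common) to TaxIDs and setting aside unresolved entries for manual literature survey.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical, tiny name -> TaxID table for illustration only.
# 9606 (Homo sapiens) and 562 (Escherichia coli) are NCBI Taxonomy IDs.
NAME_TO_TAXID: Dict[str, int] = {
    "homo sapiens": 9606,
    "human": 9606,
    "escherichia coli": 562,
}


def normalize(name: str) -> str:
    """Lowercase the name and collapse runs of whitespace."""
    return " ".join(name.strip().lower().split())


def assign_taxid(host_name: str,
                 table: Dict[str, int] = NAME_TO_TAXID) -> Optional[int]:
    """Return the TaxID for a free-text host name, or None if unknown."""
    return table.get(normalize(host_name))


def annotate(records: List[Tuple[str, str]]):
    """Split (virus, host_name) records into resolved and unresolved sets.

    Unresolved records are the ones that would need a manual
    literature survey in the workflow described in the text.
    """
    resolved: Dict[str, int] = {}
    unresolved: List[Tuple[str, str]] = []
    for virus, host in records:
        taxid = assign_taxid(host)
        if taxid is None:
            unresolved.append((virus, host))
        else:
            resolved[virus] = taxid
    return resolved, unresolved


# Hypothetical example records: (virus label, host string as annotated).
records = [
    ("virus_A", "Homo  sapiens"),
    ("virus_B", "human"),
    ("virus_C", "unknown marine host"),
]
resolved, unresolved = annotate(records)
```

Here the scientific and common names both resolve to the same TaxID, while the third record falls through to the unresolved list, mirroring the case where host information had to be recovered from the literature.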