PorthoMCL: Parallel orthology prediction using MCL for the realm of massive genome availability
© The Author(s) 2017
Received: 17 June 2016
Accepted: 21 October 2016
Published: 10 January 2017
Finding orthologous genes among multiple sequenced genomes is a primary step in comparative genomics studies. With the number of sequenced genomes increasing exponentially, comparative genomics becomes more powerful than ever for genomic analysis. However, the very large number of genomes in need of analysis makes conventional orthology prediction methods incapable of this task. Thus, an ultrafast tool is urgently needed.
Here, we present PorthoMCL, a fast tool for finding orthologous genes among a very large number of genomes. PorthoMCL can be run on a single machine or in parallel on computer clusters. We have demonstrated PorthoMCL’s capability by identifying orthologs in 2,758 prokaryotic genomes. The results are available for download at: http://ehsun.me/go/porthomcl/.
PorthoMCL is a fast, easy-to-run tool with minimal requirements for identifying orthologs among any number of genomes. PorthoMCL will facilitate comparative genomics analysis as the number of available genomes keeps increasing thanks to rapidly evolving sequencing technologies.
Keywords: Algorithms, Sequence alignment, Orthologous genes, Software
Orthologs are genes in different species derived from the last common ancestor through speciation events. Orthologous genes generally share the same biological functions in their host genomes. Therefore, identification of orthologous genes among a group of genomes is crucial to almost any comparative genomic analysis. In contrast, paralogs, which are genes that result from gene duplication within a species, may have different functions, though their sequences can be highly conserved. Depending on whether the duplication occurred before or after speciation, they are called outparalogs or inparalogs, respectively. Thus, a major challenge in predicting orthologs of a gene is differentiating its orthologs from the orthologs of its paralogs.
Furthermore, due to the rapid advancement in sequencing technologies, sequencing a prokaryotic genome now occurs at an unprecedentedly fast speed and low cost. As a result, tens of thousands of prokaryotic genomes have been fully sequenced, and this number will soon reach hundreds of thousands. The availability of a large number of completed genomes makes comparative genomics an increasingly powerful approach for genome annotations, thereby addressing many important theoretical and application-based problems. However, the rate at which genomes are sequenced outpaces that at which CPU speed increases. This poses a great challenge in comparative genomics that requires faster algorithms or adaptation of existing tools in parallel environments.
Similarly, within-genome reciprocal hits that have a better normalized score than between-genome hits are identified as paralogs. Ortholog and paralog groups are then identified by finding the heavily connected subgraphs using MCL. However, OrthoMCL relies on a relational database system to store the BLAST results and issues SQL commands to find reciprocal best hits, making it computationally inefficient when the number of genomes becomes large.
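The reciprocal-best-hit idea behind this step can be illustrated with a minimal Python sketch. It is not PorthoMCL's or OrthoMCL's implementation: it assumes hits arrive as (query, target, score) tuples, that gene identifiers take the form `genome|gene`, and that raw scores are compared rather than the normalized scores mentioned above. The function names are hypothetical.

```python
def best_hits(hits):
    """Map (query gene, target genome) -> best-scoring target gene.

    `hits` is an iterable of (query, target, score) tuples. Identifiers are
    assumed to look like "genome|gene" (an assumption made for this sketch).
    On a tie, the first hit seen is kept.
    """
    best = {}
    for query, target, score in hits:
        q_genome = query.split("|")[0]
        t_genome = target.split("|")[0]
        if q_genome == t_genome:
            continue  # within-genome hits are not ortholog candidates
        key = (query, t_genome)
        if key not in best or score > best[key][0]:
            best[key] = (score, target)
    return {key: target for key, (score, target) in best.items()}


def reciprocal_best_hits(hits):
    """Return the set of gene pairs that are each other's best hit."""
    best = best_hits(hits)
    pairs = set()
    for (query, _t_genome), target in best.items():
        q_genome = query.split("|")[0]
        # Keep the pair only if the target's best hit in the query's
        # genome points back to the query.
        if best.get((target, q_genome)) == query:
            pairs.add(tuple(sorted((query, target))))
    return pairs
```

Because everything is held in plain dictionaries, a sketch like this needs no database server; a real run would additionally have to apply score normalization and any filtering thresholds, which are omitted here.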
To overcome this problem and to speed up the method further, we developed PorthoMCL, a parallel orthology prediction tool using MCL. In addition to the parallelization, PorthoMCL uses a more efficient sparse file structure, which makes it ultrafast and highly scalable. Furthermore, PorthoMCL is platform independent and can thus be run on a wide range of high performance computing clusters and cloud computing platforms.
These steps are embarrassingly parallel computing problems and do not require shared memory, process coordination or a data exchange platform as used in orthAgogue. Hence, these steps are readily designed to be executed in parallel on a variety of high performance computing (HPC) environments. However, these steps are not totally independent, as each step needs the output of the preceding step. The outputs of these steps are eventually collated to construct a sequence similarity graph that is then cut by the MCL program to predict orthologous and paralogous gene groups.
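The execution pattern described here, independent per-genome tasks within a step and a strict ordering between steps, can be sketched in Python as follows. This is an illustrative sketch only, not PorthoMCL's code: the `run_step` and `run_pipeline` names and the step signature `step(genome, previous_output)` are assumptions, and a thread pool stands in for what would be separate processes or cluster jobs in practice.

```python
from concurrent.futures import ThreadPoolExecutor


def run_step(step, data, workers=4):
    """Apply one step to every genome independently.

    `data` maps genome name -> output of the previous step for that genome.
    Tasks share no state, so they can run in any order; the call returns
    only once all of them have finished.
    """
    genomes = list(data)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda g: step(g, data[g]), genomes)
        return dict(zip(genomes, results))


def run_pipeline(genomes, steps, workers=4):
    """Run the steps in order; each step consumes the preceding step's output."""
    data = {g: None for g in genomes}
    for step in steps:
        data = run_step(step, data, workers)
    return data
```

Each call to `run_step` acts as a barrier: the next step starts only after every genome has completed the current one, which mirrors the dependency between consecutive steps noted above.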
High performance computing support
PorthoMCL is designed to predict orthologs in a very large number of sequenced genomes in HPC environments, such as computing clusters or cloud computing platforms, without the need for a database server or a Message Passing Interface (MPI), which is an advantage over OrthoMCL and orthAgogue. We have included a TORQUE script in the repository to facilitate its use in such environments. However, PorthoMCL also runs on a desktop or a server with minimal requirements using the provided wrapper script.
Comparison of runtimes of OrthoMCL and PorthoMCL for different number of genomes
To illustrate the power of PorthoMCL, we applied it to 2,758 sequenced bacterial genomes obtained from GenBank using their annotated protein sequences. These genomes contain a total of 8,661,583 protein sequences with a median length of 270 amino acids. These sequences serve as both the query and the database for all-against-all BLAST searches. For this application, PorthoMCL split the query sequences into smaller files, each containing about 10,000 sequences, and ran in parallel mode on a cluster with 60 computing nodes (each node has 12 cores and 36 GB of RAM). PorthoMCL finished the job in 18 days: 11 days on the BLAST searches and 7 days on the remaining steps. Run on a single node, these would have taken 549 and 1,634 days, respectively. In contrast, OrthoMCL could not finish the job after 35 days running on a database server with 40 cores and 1 TB of RAM.
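Splitting the query sequences into files of a fixed number of records can be sketched as below. This is a minimal illustration under stated assumptions, not the splitting code shipped with PorthoMCL: the input is assumed to be FASTA text in which every record starts with a `>` header line, and `split_fasta` is a hypothetical name.

```python
def split_fasta(lines, chunk_size=10_000):
    """Yield lists of FASTA lines, each holding at most `chunk_size` records.

    `lines` is any iterable of lines (for example an open file). A record is
    a ">" header line plus the sequence lines that follow it, so records are
    never cut in the middle.
    """
    chunk, count = [], 0
    for line in lines:
        if line.startswith(">"):
            if count == chunk_size:
                # The current chunk is full; emit it before starting
                # the next record.
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk
```

With about 10,000 sequences per file, the 8,661,583 protein sequences mentioned above would correspond to roughly 867 query files, each of which can be searched against the database independently.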
PorthoMCL identified 763,506,331 ortholog gene pairs and 230,815 ortholog groups in these genomes. The orthologous pairs (file size: 6.2 GB) and orthologous groups (file size: 50 MB) as well as paralogous pairs are available for download at http://ehsun.me/go/porthomcl. We will periodically update our predictions when more genomes become available. The options and arguments needed at each step are discussed in detail in the documentation of the PorthoMCL package, which can be freely accessed from github.com/etabari/PorthoMCL.
PorthoMCL is a fast tool with minimal requirements for identifying orthologs and paralogs in any number of genomes. While PorthoMCL uses the same mathematical basis as OrthoMCL to investigate orthology among genomes, it is much faster and more scalable than existing tools when handling a very large number of genomes. PorthoMCL can facilitate comparative genomics analysis by exploiting the exponentially increasing number of sequenced genomes.
BLAST: Basic local alignment search tool
HPC: High performance computing
MPI: Message passing interface
SQL: Structured query language
Authors wish to thank Jonathan Halter for his technical HPC support and valuable contributions to this project. We also wish to acknowledge Katherine Jones for her help preparing the manuscript.
This work was funded by the National Science Foundation (EF0849615 and CCF1048261) and NIH (R01GM106013).
Availability of data and materials
All the source code, executables, sample datasets and documentation are available under the GPLv3 license in the GitHub repository: github.com/etabari/PorthoMCL.
ZS conceived the project. ET implemented and tested the programs. Both authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Alexeyenko A, Lindberg J, Pérez-Bercoff A, Sonnhammer ELL. Overview and comparison of ortholog databases. Drug Discov Today Technol. 2006;3:137–43.
- Sonnhammer EL, Koonin EV. Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet. 2002;18:619–20.
- Li L, Stoeckert CJ, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–89.
- Gabaldón T, Koonin EV. Functional and evolutionary implications of gene orthology. Nat Rev Genet. 2013;14:360–6.
- Kuzniar A, van Ham RCHJ, Pongor S, Leunissen JAM. The quest for orthologs: finding the corresponding gene across genomes. Trends Genet. 2008;24:539–51.
- Altschul S, Gish W, Miller W, Myers E, Lipman D. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
- Enright AJ, Dongen SV, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002;30:1575–84.
- Dongen S. Graph clustering by flow simulation. Centers Math. Comput. Sci. (CWI), 2000. http://micans.org/mcl/index.html?sec_thesisetc.
- Ekseth OK, Kuiper M, Mironov V. orthAgogue: an agile tool for the rapid prediction of orthology relations. Bioinformatics. 2014;30:734–6.
- Graham R, Woodall T, Squyres J. Open MPI: A flexible high performance MPI. Parallel Process. Appl. 2005;3911:228–39.