Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive

Published in:

International Journal of Parallel Programming 46(4), 762-775 (Aug 2018)

Author(s):

Lin, H., Su, Z. C., Meng, X. D., Jin, X., Wang, Z., Han, W. T., An, H., Chi, M. X., Wu, Z.

DOI:

10.1007/s10766-017-0524-z

Abstract:

Metagenomics, the study of all microbial species cohabiting an environment, often produces large amounts of sequence data, ranging from several GBs to a few TBs. Analyzing metagenomics data involves both data-intensive and compute-intensive steps, making the entire process hard to scale. Here we aim to optimize a metagenomics application that partitions shotgun metagenomics sequences based on their species of origin. Our solution combines the MapReduce-based BioPig analytic toolkit with MPI to provide scalability with respect to both data and compute. We also improved the existing BioPig toolkit by using simplified data types and compressed k-mer storage. These optimizations lead to up to a 193x speedup for the compute-intensive step and a 9.6x speedup for the entire pipeline. Our optimized application is also capable of processing datasets that are 16 times larger on the same hardware platform. These results suggest that integrating heterogeneous technologies such as Hadoop and MPI is an efficient way to solve large genomics problems that are both data-intensive and compute-intensive.
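
The abstract does not say how the compressed k-mer storage works. As a rough illustration of the general idea only, the sketch below packs each base of a k-mer into 2 bits, a common technique that may or may not match what the authors implemented. The function names (pack_kmer, unpack_kmer, kmers) are hypothetical and do not come from BioPig.

```python
# Hypothetical 2-bit packing of DNA k-mers; an assumption for illustration,
# not the scheme described in the paper or used by BioPig.
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"


def pack_kmer(kmer: str) -> int:
    """Pack a k-mer over A/C/G/T into an integer, 2 bits per base."""
    value = 0
    for base in kmer.upper():
        # Raises KeyError for N or any other ambiguous symbol.
        value = (value << 2) | _CODE[base]
    return value


def unpack_kmer(value: int, k: int) -> str:
    """Recover the k-mer string; k is needed because leading A's encode as zero bits."""
    bases = []
    for _ in range(k):
        bases.append(_BASE[value & 3])
        value >>= 2
    return "".join(reversed(bases))


def kmers(read: str, k: int):
    """Yield packed k-mers from a read, skipping windows with ambiguous bases."""
    for i in range(len(read) - k + 1):
        try:
            yield pack_kmer(read[i:i + k])
        except KeyError:
            continue
```

With this encoding a 20-mer fits in 40 bits, compared with 20 bytes when stored as ASCII text, which is the kind of saving that matters when k-mers are shuffled between map and reduce stages.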
