BioPig: A Hadoop-based Analytic Toolkit for Large-Scale Sequence Data

Published in:

Bioinformatics (Sep 10 2013)

Author(s):

Nordberg, H., Bhatia, K., Wang, K., Wang, Z.

DOI:

10.1093/bioinformatics/btt528

Abstract:

MOTIVATION: The recent revolution in sequencing technologies has led to an exponential growth of sequence data. As a result, most current bioinformatics tools become obsolete because they fail to scale with the data. To tackle this “data deluge”, here we introduce the BioPig sequence analysis toolkit as one of the solutions that scale to data and computation.

RESULTS: We built BioPig upon Apache’s Hadoop MapReduce system and the Pig data flow language. Compared to traditional serial and MPI-based algorithms, BioPig has three major advantages: first, BioPig’s programmability greatly reduces development time for parallel bioinformatics applications; second, testing BioPig with up to 500 Gb of sequence data demonstrates that it scales automatically with the size of the data; and finally, BioPig can be ported without modification to many Hadoop infrastructures, as tested with the Magellan system at NERSC and the Amazon Elastic Compute Cloud. In summary, BioPig represents a novel programming framework with the potential to greatly accelerate data-intensive bioinformatics analysis.

AVAILABILITY: BioPig is released as open source software under the BSD license at https://sites.google.com/a/lbl.gov/biopig/

CONTACT: [email protected]
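To make the MapReduce idea behind the abstract concrete, the following is a minimal single-process sketch of k-mer counting written in a map/reduce style. It is not BioPig code and does not use the Pig or Hadoop APIs; the function names (`map_kmers`, `reduce_counts`, `count_kmers`) and the sample reads are hypothetical, chosen only for illustration. On a real Hadoop cluster the map step would run in parallel over partitions of the reads and the framework would group the emitted pairs by key before the reduce step.

```python
from collections import Counter
from itertools import chain


def map_kmers(read, k):
    """Map step: emit a (k-mer, 1) pair for every length-k window in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct k-mer."""
    totals = Counter()
    for kmer, n in pairs:
        totals[kmer] += n
    return dict(totals)


def count_kmers(reads, k):
    """Chain the map output of every read into a single reduce call."""
    return reduce_counts(chain.from_iterable(map_kmers(r, k) for r in reads))


# Hypothetical toy input; real inputs would be FASTA/FASTQ records.
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, 3)
```

Because each read is mapped independently and the reduce step only needs the pairs that share a key, the same logic can be spread across many machines without changing the per-read code, which is the property the abstract refers to as scaling automatically with the size of the data.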
