High Performance Computing (HPC) has become an integral part of scientific research. In this article we share our experience implementing HPC at the National Institute of Allergy and Infectious Diseases (NIAID), part of the National Institutes of Health (NIH), as well as our plans to support the scientific mission of the Institute in the future.
IT’s increased role in genome sequencing stems from several major differences between the original and next-generation technologies. First and foremost, the sheer amount of data that must be managed today is orders of magnitude greater than before. This is largely due to the type of data the instruments generate: images in addition to text. Most labs do not have the storage capacity or the network and computational infrastructure needed to support and manage this amount of data. Therefore, the primary goal of NIAID’s Office of Cyber Infrastructure and Computational Biology (OCICB) has been to implement a centralized Next-Gen infrastructure that is as generic and scalable as possible, to meet not only current but also future requirements.
Figure 1: Genome Sequencing IT Requirements
In late 2009, NIAID acquired several Next-Gen sequencing machines, and what started out as a project to store more than 200 TB of sequencing information quickly morphed into a need to analyze this data with a wide range of tools for a growing number of users. It became clear that a centralized and dedicated HPC cluster for scientific computing was needed. Answering the increasing demand, OCICB established an HPC cluster as a resource available to the NIAID scientific community and their collaborators, providing the computational capacity required to analyze large amounts of research data. The cluster is used to store, analyze, and distribute results for next-generation sequencing, phylogenetics, structural biology, and other high-throughput, data-intensive research. The HPC cluster quickly became a critical resource for scientists, presently hosting 176 unique bioinformatics applications and 750 TB of data. In 2014, over 2,504,750 jobs were run on the cluster.
When the HPC cluster was first constructed, ROCKS Cluster was used to simplify the deployment and management of the compute nodes, as it includes the OS (a variant of RHEL 5) and all the tools and applications needed to get a cluster up and running. Sun Grid Engine (SGE) was chosen as the workload manager and scheduler, and it would become the primary tool for users interacting with the cluster. On the storage side, a DDN S2A9900 was purchased with a mixture of 15k and 7.2k hard drives, and a decision was made to front-end this storage with two servers running IBM's GPFS parallel file system, which ensured greater performance and scalability. Initially, all of the nodes communicated over Gigabit Ethernet. Our centralized infrastructure solution differs from a traditional cluster computing environment because the data storage component is also covered. Data resides very close to the compute nodes that will process it, with very high-speed interconnects to read and write the data. Original data and results are stored in one centralized place, without the need to fill up multiple volumes of drives and then reload them for analysis across multiple experiments. Consequently, we largely eliminated upload and download delays, such as waiting for gigabytes of data to write, as well as the overhead of keeping track of versions and dealing with duplicates when source data is copied for analysis.
OCICB also provides support to researchers beyond the physical infrastructure. One of the main impediments to wide adoption of HPC is the lack of easy-to-use tools for non-specialists. Most of the existing tools are command-line based and therefore require deep knowledge of Linux. To address this problem, OCICB implemented HPC Web, a front-end portal to the HPC cluster that provides a point-and-click Graphical User Interface (GUI) to augment Linux command-line interface (CLI) access to the cluster. Using HPC Web, users can submit jobs to the cluster, manage and share data, and create and share custom data analysis pipelines. HPC Web democratizes access to high-throughput scientific computing by providing the research community with alternative, user-friendly methods for accessing sophisticated computational tools and infrastructure.
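To make the gap between a point-and-click form and the command line concrete, the following is a minimal sketch of how a few form fields could be turned into a Grid Engine job script. It does not describe how HPC Web actually works; the `JobSpec` class, the `render_job_script` function, the example command, and the parallel environment name `threaded` are all hypothetical. Only the `#$` directives (`-N`, `-cwd`, `-j y`, `-l h_rt`, `-pe`) are standard Grid Engine options.

```python
from dataclasses import dataclass


@dataclass
class JobSpec:
    """Hypothetical job description, e.g. filled in from a web form."""
    name: str                       # job name shown in the queue
    command: str                    # shell command the job runs
    slots: int = 1                  # number of CPU slots requested
    parallel_env: str = "threaded"  # assumed PE name; this is site-specific
    runtime: str = "01:00:00"       # hard wall-clock limit (hh:mm:ss)


def render_job_script(job: JobSpec) -> str:
    """Return the text of a Grid Engine job script for the given job."""
    lines = [
        "#!/bin/bash",
        f"#$ -N {job.name}",          # job name
        "#$ -cwd",                    # run in the submission directory
        "#$ -j y",                    # merge stderr into stdout
        f"#$ -l h_rt={job.runtime}",  # wall-clock limit
    ]
    # Request a parallel environment only when more than one slot is needed.
    if job.slots > 1:
        lines.append(f"#$ -pe {job.parallel_env} {job.slots}")
    lines.append(job.command)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical command; substitute whatever analysis tool is needed.
    spec = JobSpec(name="align_sample1",
                   command="./run_alignment.sh sample1.fastq",
                   slots=4)
    print(render_job_script(spec))
```

A command-line user would save the generated text to a file and submit it with `qsub`; a portal along these lines would simply hide that step behind a form.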
As the cluster grew in terms of data, compute nodes, and the number of users and applications, we faced numerous challenges. 10 Gb Ethernet was added to the GPFS servers, and four more of these servers were added to cope with cluster-wide bandwidth concerns. File system migration and consolidation were necessary to manage the uneven spikes in data growth. A newer, faster storage system was purchased and integrated, along with SSDs to speed up both data and metadata operations. User training and clear, centralized documentation were added. Finally, a power- and space-constrained datacenter was a constant challenge until recently, when NIAID migrated to a new primary compute facility.
Going forward, NIAID can take advantage of this new compute facility to build a new cluster that supports up to 40 kW per rack. The networking will consist of a 56 Gb FDR InfiniBand network for all intra-cluster communication, which will provide both higher bandwidth and lower latency. New Dell C6220s will roughly double the number of CPU cores and increase available memory by 700 percent. Additional GPGPU servers, using NVIDIA's latest Tesla K80 GPU accelerators, will tackle a high volume of molecular dynamics work that needs analysis. Bright Computing's Cluster Manager will take over for ROCKS, as it provides similar but more extensive capabilities. OCICB has selected Univa Grid Engine (UGE) to take over for SGE, which no longer receives regular updates. IBM's Tivoli Storage Manager (TSM) is also being implemented as a more efficient backup and archive system, given its tight integration with GPFS, which allows for transparent data tiering between disk and tape. Lastly, we will be implementing Globus, a tool and service for collaboration with HPC clusters and scientific institutions around the world.
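As a rough illustration of what the network generations mentioned in this article mean for data movement, the sketch below estimates how long it would take to move 200 TB (the size of the original sequencing dataset) over a single link at each nominal rate. This is idealized arithmetic, not a measurement: it assumes decimal terabytes, the full nominal line rate, and no protocol, encoding, or storage overhead, so real transfers would be slower. The function and variable names are illustrative only.

```python
# Nominal link rates in gigabits per second, as named in the article.
LINK_RATES_GBPS = {
    "Gigabit Ethernet": 1,
    "10 Gb Ethernet": 10,
    "56 Gb FDR InfiniBand": 56,
}


def transfer_hours(data_tb: float, link_gbps: float) -> float:
    """Idealized hours to move data_tb terabytes over one link.

    Assumes 1 TB = 10**12 bytes and that the link sustains its
    nominal rate with no overhead (an optimistic lower bound).
    """
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / 3600


if __name__ == "__main__":
    for name, rate in LINK_RATES_GBPS.items():
        print(f"{name}: {transfer_hours(200, rate):.1f} hours")
```

Under these assumptions, the same 200 TB that would occupy a single Gigabit Ethernet link for roughly 18 days could cross a 56 Gb link in about 8 hours, which is why the interconnect matters as much as the compute nodes for data-intensive work.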
Established in 2009, the HPC cluster is currently freely available to NIAID researchers and their collaborators, and is continually evolving to better meet the growing needs of the scientific community.