Pushing the Limits of the Maya Cluster

 

Team Members: Adam Cunningham1, Gerald Payton2, Jack Slettebak1, and Jordi Wolfson-Pou3
Graduate Research Assistants: Jonathan Graf2, Xuan Huang2, and Samuel Khuvis2
Faculty Mentor: Matthias K. Gobbert2
Clients: Thomas Salter4 and David J. Mountain4

1Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County,
2Department of Mathematics and Statistics, University of Maryland, Baltimore County,
3Department of Physics, University of California, Santa Cruz,
4Advanced Computing Systems Research Program

Team 4, from left to right: Jack Slettebak, Jordi Wolfson-Pou, Matthias K. Gobbert, Adam Cunningham, Gerald Payton, Thomas Salter.

About the Team

Our team, consisting of Adam Cunningham, Gerald Payton, Jack Slettebak, and Jordi Wolfson-Pou, participated in the Interdisciplinary Program in High Performance Computing in the Department of Mathematics and Statistics at UMBC. Our project, proposed by our clients Thomas Salter and David J. Mountain, was to test the computing capabilities of the maya cluster using industry benchmarks. Our faculty mentor, Dr. Matthias K. Gobbert, and our graduate research assistants, Jonathan Graf, Xuan Huang, and Samuel Khuvis, provided insight and supervision throughout the project.

Benchmarking

Maya is the 240-node supercomputer in the UMBC High Performance Computing Facility.

The 72 newest nodes each have two eight-core Intel E5-2650v2 Ivy Bridge CPUs and 64 GB of memory (in eight 8 GB DIMMs), so a single node can run 16 processes or threads simultaneously.

Schematic of the node architecture.

The nodes are connected by a high-performance quad-data rate (QDR) InfiniBand interconnect.

The new hardware requires testing and benchmarking to give insight into its full potential. We report here on the High Performance Conjugate Gradient (HPCG) Benchmark developed by Sandia National Laboratories.

HPCG Benchmark

The HPCG benchmark solves the Poisson equation on a three-dimensional domain. A discretization on a global grid with a 27-point stencil at each grid point generates a system of linear equations with a large, sparse, highly structured system matrix. This system is solved by a preconditioned conjugate gradient method. The unknowns in this system are distributed to a 3-D grid of parallel MPI processes.

27-point stencil (left) and 3-D process grid (right).

A problem with a sparse system matrix and an iterative solution method is more representative of many applications than the dense linear system solved in the LINPACK benchmark.
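
The contrast with LINPACK comes directly from the stencil: each equation couples a grid point only to its at most 26 immediate neighbors, so every row of the system matrix has at most 27 nonzero entries regardless of problem size. The following minimal sketch is our own illustration, not HPCG source code: it applies a 27-point stencil operator in matrix-free form on a single local nx x ny x nz subgrid. The diagonal value 26 and off-diagonal value -1 follow our reading of the HPCG problem description; the function names and the simple truncation at the subgrid boundary are assumptions made for this illustration.

// Illustrative sketch (not HPCG code): matrix-free y = A*x for a
// 27-point stencil on one local nx x ny x nz grid, with neighbors
// outside the local grid simply dropped.
#include <vector>
#include <cstdio>

static int idx(int i, int j, int k, int nx, int ny) {
    return (k * ny + j) * nx + i;   // lexicographic ordering of grid points
}

// Each row couples a grid point to all neighbors within distance one
// in every coordinate direction -- at most 27 nonzeros per row.
void spmv27(const std::vector<double>& x, std::vector<double>& y,
            int nx, int ny, int nz) {
    for (int k = 0; k < nz; ++k)
        for (int j = 0; j < ny; ++j)
            for (int i = 0; i < nx; ++i) {
                double sum = 26.0 * x[idx(i, j, k, nx, ny)];   // diagonal entry
                for (int dk = -1; dk <= 1; ++dk)
                    for (int dj = -1; dj <= 1; ++dj)
                        for (int di = -1; di <= 1; ++di) {
                            if (di == 0 && dj == 0 && dk == 0) continue;
                            int ii = i + di, jj = j + dj, kk = k + dk;
                            if (ii < 0 || ii >= nx || jj < 0 || jj >= ny ||
                                kk < 0 || kk >= nz) continue;   // off the local grid
                            sum += -1.0 * x[idx(ii, jj, kk, nx, ny)];
                        }
                y[idx(i, j, k, nx, ny)] = sum;
            }
}

int main() {
    const int nx = 16, ny = 16, nz = 16;   // smallest local subgrid used in our runs
    std::vector<double> x(nx * ny * nz, 1.0), y(nx * ny * nz, 0.0);
    spmv27(x, y, nx, ny, nz);
    // For x = 1, interior rows give 26 - 26 = 0; boundary rows are positive.
    std::printf("y at an interior point: %g\n", y[idx(8, 8, 8, nx, ny)]);
    return 0;
}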

Experimental Design

The HPCG benchmark uses a 3-D grid of P = px x py x pz parallel MPI processes. We consider P = 1, 8, 64, 512 in our experiments. Each process hosts a local subgrid of size nx x ny x nz. Thus, the global grid has dimensions Nx = nx px, Ny = ny py, and Nz = nz pz, and the total number of unknowns Nx x Ny x Nz scales with the number of processes.

The table lists the total number of unknowns Nx x Ny x Nz for local subgrid dimensions n = nx = ny = nz; for example, for P = 512 processes the global problem ranges from about 2 million to nearly 8.6 billion unknowns:

  n        P = 1          P = 8         P = 64        P = 512
 16            4,096         32,768        262,144      2,097,152
 32           32,768        262,144      2,097,152     16,777,216
 64          262,144      2,097,152     16,777,216    134,217,728
128        2,097,152     16,777,216    134,217,728  1,073,741,824
256       16,777,216    134,217,728  1,073,741,824  8,589,934,592
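
As a sanity check, the table entries can be reproduced with a few lines of C++ (our own sketch, not part of the benchmark): the total number of unknowns is Nx x Ny x Nz = (nx px)(ny py)(nz pz) = n^3 P for cubic local subgrids with n = nx = ny = nz.

// Sketch reproducing the table above: total unknowns = n^3 * P.
#include <cstdio>

int main() {
    const int n_values[] = {16, 32, 64, 128, 256};
    const long long P_values[] = {1, 8, 64, 512};
    for (long long n : n_values) {
        std::printf("n = %3lld:", n);
        for (long long P : P_values)
            std::printf(" %15lld", n * n * n * P);   // total unknowns Nx*Ny*Nz
        std::printf("\n");
    }
    return 0;
}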

Results

We ran the HPCG Benchmark Revision 2.4 with an execution time of 60 seconds, using the Intel C++ compiler and MVAPICH2. The tables below show the observed GFLOP/s for several local subgrid dimensions nx x ny x nz, for P = 512 parallel MPI processes using N compute nodes with pN processes per node and nt OpenMP threads per MPI process. Possible combinations for P = 512 are N = 32 nodes with pN = 16 processes per node and nt = 1 thread per process, or N = 64 nodes with pN = 8 processes per node and nt = 1 or 2 threads per process.

nx = ny = nz = 16        nt = 1    nt = 2
N = 32, pN = 16           45.58       N/A
N = 64, pN = 8           113.50    112.36

nx = ny = nz = 32        nt = 1    nt = 2
N = 32, pN = 16          170.03       N/A
N = 64, pN = 8           209.92    211.84

nx = ny = nz = 64        nt = 1    nt = 2
N = 32, pN = 16          223.92       N/A
N = 64, pN = 8           209.62    238.82

nx = ny = nz = 128       nt = 1    nt = 2
N = 32, pN = 16          233.98       N/A
N = 64, pN = 8           210.94    230.42

These results allow us to conclude:

  • Larger problems allow for better performance, as calculation time dominates communication time.
  • Increasing the number of threads per MPI process may increase computational throughput as delays in memory access are masked by process switching.
  • Optimal performance is achieved by using all 16 cores per node, whether via MPI processes or via OpenMP multi-threading (see the sketch below).
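
The last conclusion can be double-checked with a small hybrid MPI+OpenMP program of the kind sketched below (our own code, not part of HPCG; compiled, e.g., with mpicxx and the compiler's OpenMP flag). Each MPI rank reports the node it runs on and the number of OpenMP threads it was given, so a 512-process launch can be verified to occupy all 16 cores of every node, for example with pN = 8 ranks per node and OMP_NUM_THREADS set to 2.

// Hybrid MPI+OpenMP placement check (illustrative sketch).
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size, namelen;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &namelen);

    // Number of OpenMP threads this rank would use (typically OMP_NUM_THREADS).
    int nthreads = omp_get_max_threads();

    std::printf("rank %3d of %3d on %s with %d OpenMP thread(s)\n",
                rank, size, host, nthreads);
    MPI_Finalize();
    return 0;
}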

Links

Sandia HPCG Benchmark: http://software.sandia.gov/hpcg/

Adam Cunningham, Gerald Payton, Jack Slettebak, Jordi Wolfson-Pou, Jonathan Graf, Xuan Huang, Samuel Khuvis, Matthias K. Gobbert, Thomas Salter, and David J. Mountain. Pushing the Limits of the Maya Cluster. Technical Report HPCF-2014-14, UMBC High Performance Computing Facility, University of Maryland, Baltimore County, 2014. Reprint available in the HPCF publications list.

Poster presented at the Summer Undergraduate Research Fest (SURF)
