Team 6 – Performance Comparison of Intel Xeon Phi Knights Landing – REU Site: Interdisciplinary Program in High Performance Computing

Team Members:	Ishmail A. Jabbie¹, George Owen², and Benjamin Whiteley³
Graduate Assistant:	Jonathan S. Graf¹
Faculty Mentor:	Matthias K. Gobbert¹
Client:	Samuel Khuvis⁴

¹Department of Mathematics and Statistics, University of Maryland, Baltimore County,
²Department of Mathematics, Louisiana State University,
³Department of Engineering and Aviation Sciences, University of Maryland, Eastern Shore,
⁴ParaTools, Inc.

About the Team

Team 6 consisted of Ishmail Jabbie, George Owen, and Benjamin Whiteley. We worked with our faculty mentor, Dr. Matthias K. Gobbert, and our graduate assistant, Jonathan Graf, who provided insight into and supervision of our research. Our project involved running performance studies on the new, second-generation Intel Xeon Phi processor, known as “Knights Landing,” and comparing these to performance studies on the first-generation Intel Xeon Phi co-processor, known as “Knights Corner.”

Motivation

The Intel Xeon Phi is a many-core processor family with a theoretical peak performance of over 1 TFLOP/s in double precision, significantly higher than that of even modern multi-core CPUs. We use a test problem implemented in C with MPI and OpenMP to compare the performance of the first and second generations of the Phi, code-named Knights Corner (KNC) from 2013 and Knights Landing (KNL) from 2016, respectively, and to contrast the performance of the two kinds of memory available on the KNL.

Intel Xeon Phi Knights Landing (KNL)

Second-generation KNL model 7250 from 2016 [compared to first-generation KNC model SE10P from 2013]:

68 cores [61 cores on KNC]
2 VPUs per core, together up to 16 simultaneous double-precision additions [1 VPU per core, up to 8, on KNC]
16 GB of on-board MCDRAM: high-performance 3D RAM designed as high-bandwidth memory, much faster than GDDR5 or DDR4 [8 GB of on-board GDDR5 on KNC]
Server contains 98 GB of DDR4 RAM: larger but slower system memory [no access to server memory in native mode on KNC]
2D mesh network [bi-directional ring bus on KNC]
Full Linux-based OS [Linux micro-OS on KNC]
Full standalone processor [KNC is a co-processor]

Test Problem

We use a classical elliptic test problem, the Poisson equation with homogeneous Dirichlet boundary conditions.
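
Written out, the problem has the following form, where the domain Ω and the right-hand side f are placeholders (neither is specified on this page), and the two-dimensional Laplacian is assumed from the 8192-by-8192 mesh used in the study:

```latex
\begin{aligned}
-\Delta u &= f && \text{in } \Omega, \\
        u &= 0 && \text{on } \partial\Omega,
\end{aligned}
\qquad\text{with}\qquad
\Delta u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}.
```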

The equation is discretized by the finite difference method and the resulting system of linear equations solved by the conjugate gradient method.

The numerical method is parallelized in C with MPI and OpenMP. We use the Intel compiler suite on all systems.

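Solution of Test Problem on 8192-by-8192 Mesh

KNC with GDDR5 RAM using 240 threads: observed wall clock time in MM:SS
------------------------------------------------------------------------------------------
MPI processes             1      2      4      8     15     16     30     60    120    240
Threads per process     240    120     60     30     16     15      8      4      2      1
------------------------------------------------------------------------------------------
GDDR5                 28:24  28:20  27:51  23:08  23:06  23:00  22:24  22:45  22:43  25:37
------------------------------------------------------------------------------------------

KNL with DDR4 and MCDRAM using 272 threads: observed wall clock time in MM:SS
------------------------------------------------------------------------------------------
MPI processes             1      2      4      8     16     17     34     68    136    272
Threads per process     272    136     68     34     17     16      8      4      2      1
------------------------------------------------------------------------------------------
DDR4                  26:02  25:07  24:38  24:25  24:24  36:29  37:40  37:54  39:06  41:00
MCDRAM                05:49  05:43  05:39  05:35  05:36  08:22  08:49  08:41  08:37  08:57
------------------------------------------------------------------------------------------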

Conclusions

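The KNL using MCDRAM is dramatically faster than the KNC in all cases: even its slowest run (08:57) takes well under half the time of the fastest KNC run (22:24).
For both MCDRAM and DDR4 on the KNL, configurations with more threads per process than MPI processes (16 or fewer processes) are significantly faster than the inverse (17 or more processes), by a factor of about 1.5.
Despite DDR4 being a slower form of memory, KNL using DDR4 is comparable to KNC using GDDR5 in those faster configurations (about 24 to 26 minutes, against 22 to 28 minutes on the KNC).
The KNL distributes cores optimally to use the resources and channels of the system. Threads allow the processor to assign the cores in order, whereas MPI assigns processes randomly.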

Links

Ishmail A. Jabbie, George Owen, Benjamin Whiteley, Jonathan S. Graf, Matthias K. Gobbert, and Samuel Khuvis. Performance Comparison of Intel Xeon Phi Knights Landing. Technical Report HPCF-2016-16, UMBC High Performance Computing Facility, University of Maryland, Baltimore County, 2016. (HPCF machines used: maya.) Reprint in HPCF publications list.

Poster presented at the Summer Undergraduate Research Fest (SURF)
