Intel Concurrent Collections as a Method for Parallel Programming

 

Team Members: Richard Adjogah¹, Randal Mckissack¹, and Ekene Sibeudu¹
Graduate Assistant: Andrew M. Raim²
Faculty Mentor: Matthias K. Gobbert²
Client: Loring Craymer³

¹Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County
²Department of Mathematics and Statistics, University of Maryland, Baltimore County
³DoD Center for Exceptional Computing

Team 4, from left to right: Richard Adjogah, Randal Mckissack, Ekene Sibeudu


Our team consisted of UMBC undergraduates Richard Adjogah, Randal Mckissack, and Ekene Sibeudu, led by faculty mentor Matthias K. Gobbert. During the 2011 REU Site, we were presented with a project by our client, Dr. Loring Craymer, involving new parallel programming software called Intel Concurrent Collections (CnC). We received assistance and feedback throughout the program from Dr. Gobbert, Dr. Craymer, graduate assistant Andrew Raim, and Kath Knobe, a research scientist on the CnC project at Intel.
Introduction:

Computer hardware has become parallel in order to run faster and more efficiently. One of the current standard parallel programming libraries is MPI (Message Passing Interface). When using MPI, the user has to explicitly send and receive messages to and from different processes with multiple function calls. These functions take numerous arguments, which makes this approach error-prone and cumbersome. The Intel Corporation is developing new parallel programming software and a translator, together called CnC (Concurrent Collections), to make programming in parallel easier. CnC uses a system of collections consisting of steps, items, and tags, which are organized by a graph. The graph is written in Intel's textual notation, and CnC's translator cnc compiles that notation into a C++ header file. This system lets the programmer identify the dependencies among code segments in the program, while the parallelization of these code segments is handled automatically by CnC at run time. Our research evaluates whether this new software is faster and more efficient to use when creating parallel code and when converting serial code to parallel code.
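To make the three kinds of collections concrete, the following is a minimal sketch, assuming the step_collection/tag_collection/item_collection interface of Intel CnC for C++; the names square_step, square_context, and the individual collections are hypothetical and chosen only for illustration, not taken from our project code, and API details may differ between CnC releases. The programmer declares the collections, states which tag collection prescribes which step, and the CnC runtime decides when and on which thread each step instance runs.

    // Minimal illustrative CnC program (hypothetical names): one step instance
    // per tag, each reading one input item and producing one output item.
    #include <cnc/cnc.h>
    #include <cstdio>

    struct square_context;  // forward declaration of the graph/context

    // Step: executed once for every tag put into the tag collection.
    struct square_step {
        int execute(const int& tag, square_context& ctx) const;
    };

    // The context bundles all collections; it corresponds to the CnC graph.
    struct square_context : public CnC::context<square_context> {
        CnC::step_collection<square_step> steps;
        CnC::tag_collection<int>          tags;     // control: which instances run
        CnC::item_collection<int, double> inputs;   // data consumed by the step
        CnC::item_collection<int, double> outputs;  // data produced by the step

        square_context()
            : steps(*this), tags(*this), inputs(*this), outputs(*this)
        {
            tags.prescribes(steps, *this);  // each tag launches one step instance
        }
    };

    int square_step::execute(const int& tag, square_context& ctx) const {
        double x;
        ctx.inputs.get(tag, x);       // waits until the item is available
        ctx.outputs.put(tag, x * x);  // publish the result
        return CnC::CNC_Success;
    }

    int main() {
        square_context ctx;
        for (int i = 0; i < 16; ++i) {
            ctx.inputs.put(i, i + 0.5);  // environment provides the data ...
            ctx.tags.put(i);             // ... and the control tags
        }
        ctx.wait();                      // block until all step instances finish
        double y;
        ctx.outputs.get(15, y);
        std::printf("result for tag 15: %f\n", y);
        return 0;
    }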
Research Conducted:

To test the difference between the two methods, we used benchmark codes in both MPI and CnC and compared the results. We started with a prime number generator provided by Intel as sample code that familiarizes programmers with CnC. Then we moved on to a pi approximation, for which we used as a starting point a classical MPI sample code that approximates pi by numerical integration. We ran it in MPI first, then stripped it of all MPI, ported it to C++, and added calls to CnC. We then ran performance studies to compare the two. Our last two tests involved parameter studies. The main test case is an extension of the Poisson equation by a term involving a non-negative parameter a. The finite difference method is used to discretize this equation, which results in a symmetric positive definite system of linear equations that is solved by the iterative conjugate gradient method. To demonstrate how readily the setup for a parameter study can be transferred to a different problem, we also applied it to a project by another team in the REU Site that estimates the entropy of DNA bindings. In both cases, we took existing serial code for each problem and were readily able to run it repeatedly in parallel with varying parameters using CnC, simply by creating a couple of new files. These last two tests showcase a clear advantage of CnC over MPI in the parallelization of parameter studies.
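For orientation, the pi benchmark uses the classical midpoint-rule approximation of an arctangent integral; the extended Poisson problem is written below only schematically, as our shorthand for the description above (the precise formulation, domain, and boundary conditions are given in the technical report listed under Links):

\[
\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; h \sum_{i=1}^{n} \frac{4}{1+x_i^2},
\qquad x_i = \Bigl(i - \tfrac{1}{2}\Bigr) h, \quad h = \frac{1}{n},
\]

\[
-\Delta u + a\,u = f \quad \text{in } \Omega, \qquad u = 0 \quad \text{on } \partial\Omega, \qquad a \ge 0.
\]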
Example of a Parameter Study:

An excellent example of how CnC is useful as a tool for parallel parameter studies is the extension of the Poisson equation. For finer mesh resolutions, this is a computationally intensive problem, and it is vital to run studies involving many parameters in parallel using several threads. Consider a study in which the problem is to be solved for many randomly chosen values of the parameter a. The iterative conjugate gradient method used in the serial code then requires a different number of iterations, and thus an unpredictable amount of run time, for each randomly chosen value of a. Therefore CnC, which has no barriers and can execute a computation whenever its control and data are available, is ideally suited to sending work to any thread that has completed the work for a previously assigned parameter value. Because CnC handles all parallelization at run time, it does not matter that the programmer cannot predict how the parameter(s) will affect time and memory usage. We study the performance of CnC on one compute node with 8 cores.
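The sketch below shows one way such a parameter study could be expressed in CnC, following the same pattern as the sketch in the Introduction. It assumes the same Intel CnC for C++ interface, and poisson_solve as well as all other names are hypothetical placeholders rather than the project's actual code: the environment puts one tag and one parameter value per study point, each tag prescribes one solver step, and the runtime hands step instances to whichever threads become free.

    // Hypothetical CnC parameter study: one step instance per randomly chosen
    // parameter value a; the CnC runtime schedules the instances onto threads.
    #include <cnc/cnc.h>
    #include <cstdio>
    #include <cstdlib>

    double poisson_solve(double a);   // placeholder for the existing serial solver

    struct study_context;

    struct solve_step {
        int execute(const int& tag, study_context& ctx) const;
    };

    struct study_context : public CnC::context<study_context> {
        CnC::step_collection<solve_step>  steps;
        CnC::tag_collection<int>          tags;     // index of each study point
        CnC::item_collection<int, double> params;   // index -> parameter value a
        CnC::item_collection<int, double> results;  // index -> computed result

        study_context()
            : steps(*this), tags(*this), params(*this), results(*this)
        {
            tags.prescribes(steps, *this);
        }
    };

    int solve_step::execute(const int& tag, study_context& ctx) const {
        double a;
        ctx.params.get(tag, a);                   // fetch this step's parameter
        ctx.results.put(tag, poisson_solve(a));   // run the serial solve, publish
        return CnC::CNC_Success;
    }

    double poisson_solve(double a) {  // stand-in so that the sketch is complete;
        return a;                     // the real code runs CG on the chosen mesh
    }

    int main(int argc, char** argv) {
        const int M = (argc > 1) ? std::atoi(argv[1]) : 8;
        study_context ctx;
        for (int i = 0; i < M; ++i) {
            ctx.params.put(i, 1000.0 * std::rand() / RAND_MAX);  // random a
            ctx.tags.put(i);   // no barriers: steps start as soon as data exists
        }
        ctx.wait();            // wait for all M solves to finish
        for (int i = 0; i < M; ++i) {
            double r;
            ctx.results.get(i, r);
            std::printf("study point %d: result %g\n", i, r);
        }
        return 0;
    }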

The table below shows the run data of our parallelized parameter study for the extended Poisson problem for a small demonstration example with only 8 randomly chosen parameter values. The mesh resolution of the finite difference method controls the amount of work for each run of the Poisson code and is selected as 256-by-256 here to obtain conveniently readable times in units of seconds. The table shows the results ordered by the value of a, which confirms the decreasing iteration count for increasing a. The last two columns show the run times in seconds of each Poisson solve, first for the case of using only 1 thread (serial code) and then for a run using 8 threads (the number of computational cores available on one node). When run originally, the lines of the table appear in random order in the output, since different threads finish their work in random order; to make the output comparable and to clearly analyze the effect of a on the iteration count, the output was manually ordered.

For the small amount of work in this demonstration example, we notice that there is some slow-down involved, since each individual run time for the study using 8 threads is larger than for the study using 1 thread. The final line of the table lists the total observed time for the entire parameter study. For the serial run, the total run time is the sum of all individual Poisson solve times, as one would expect, since each individual solve is run on the same computational core in turn. By contrast, for the run using 8 threads on a compute node with 8 computational cores, the total run time is driven only by the slowest individual run time; this is clear evidence of CnC performing correctly for this small illustrative demonstration.

       a    #iter   Time (s), 1 thread   Time (s), 8 threads
  135.44      594         1.72                  2.97
  274.75      477         1.38                  2.65
  486.90      369         1.09                  2.35
  561.38      367         1.08                  2.26
  700.98      330         0.97                  2.20
  840.19      300         0.89                  2.02
  840.19      300         0.89                  2.02
  916.46      287         0.86                  1.95
  Total                   8.88                  2.98

The next table shows the results of a study where we extended the above study from 8 to M Poisson solves for randomly chosen parameter values. All runs used 8 threads. The Worst Case column shows the predicted total run time if the Poisson solves were distributed to the threads in order of increasing run times; this would make the last thread's total time the longest, given by M/8 times the maximum individual run time. The Best Case column shows the predicted total run time if each thread had an equal share of the work, given by M/8 times the mean individual run time. Since our actual times are extremely close to the best-case times, we can conclude that CnC is effective in its parallelization.
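In formulas, with 8 threads and observed individual solve times t_1, ..., t_M, this reads (our restatement of the definitions above):

\[
T_{\text{worst}} = \frac{M}{8} \max_{1 \le i \le M} t_i,
\qquad
T_{\text{best}} = \frac{M}{8}\,\bar{t} = \frac{1}{8} \sum_{i=1}^{M} t_i .
\]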

     M   Worst Case Time (s)   Best Case Time (s)   Actual Time (s)
     8          2.97                  2.30                2.98
    64         31.75                 20.33               21.19
   256        137.82                 86.84               88.44
  1024        556.18                351.91              352.63

Conclusions:

Through our research, we have found CnC to be a useful method for parallel computing. By making the parallelization process as abstract as possible, CnC reduces the amount of coding a programmer has to do and lets task distribution be handled as effectively as possible at run time. While the graph concept behind CnC is a very different way of thinking from how parallelization is done in MPI, it is easy to follow once understood. The nature of CnC's parallelization makes operations that require accessing the results of parallel steps in a fixed order counterproductive and time-costly. However, CnC excels at parameter studies, in which the many runs of a method may each vary in memory usage and run time in unpredictable ways.
Future Work:

In the time frame of this REU Site, we only worked with CnC using all computational cores on one compute node. CnC also has a distributed-memory version that can use all cores on several nodes of a cluster. Investigating this version and analyzing its performance is the subject of future work.


Links

Richard Adjogah, Randal Mckissack, Ekene Sibeudu, Andrew M. Raim, Matthias K. Gobbert, and Loring Craymer. Intel Concurrent Collections as a Method for Parallel Programming. Technical Report HPCF-2011-14, UMBC High Performance Computing Facility, University of Maryland, Baltimore County, 2011. Reprint and associated files in HPCF publications list

Poster presented at the Summer Undergraduate Research Fest (SURF)

Click here to view Team 1’s project
Click here to view Team 2’s project
Click here to view Team 3’s project

Participants Randal Mckissack and Richard Adjogah Present at James Edward West Symposium