Block Cyclic Distribution of Data in pbdR and its Effects on Computational Efficiency

 

Team Members: Matthew G. Bachmann (1), Ashley D. Dyas (2), Shelby C. Kilmer (3), and Julian Sass (4)
Graduate Research Assistant: Andrew Raim (4)
Faculty Mentors: Nagaraj K. Neerchal (4) and Kofi P. Adragani (4)
Clients: George Ostrouchov (5) and Ian F. Thorpe (6)

(1) Department of Mathematics, Northeast Lakeview College
(2) Department of Computer Science, Contra Costa College
(3) Department of Mathematics, Bucknell University
(4) Department of Mathematics and Statistics, University of Maryland, Baltimore County
(5) Oak Ridge National Laboratory
(6) Department of Chemistry and Biochemistry, University of Maryland, Baltimore County

 


About the Team

Our team, composed of Matthew Bachmann, Ashley Dyas, Shelby Kilmer, and Julian Sass, performed an efficiency study of pbdR (Programming with Big Data in R), a package for R, a popular statistical computing language. This research took place at the UMBC REU Site: Interdisciplinary Program in High Performance Computing. Our faculty mentor, Dr. Nagaraj Neerchal, and our graduate assistant, Andrew M. Raim, assisted us in our research and provided insight and supervision. Our client, Dr. George Ostrouchov, Senior Research Staff Member at the Oak Ridge National Laboratory, proposed our project. Dr. Ian Thorpe provided the data used in an application of our study.

Introduction to our Project

pbdR is an R package used for high performance statistical computing on very large data sets. Our study examined how efficiency changes with two main factors: the block size of the block cyclic distribution and the processor grid layout. We explored the impact of block size and grid layout on computation time by running the statistical method PCA (Principal Component Analysis).
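
As a rough illustration of what block cyclic distribution means, the sketch below maps each entry of a matrix to a process on a two-dimensional grid, following the usual block cyclic convention in which blocks are dealt out round-robin along rows and columns. It is written in Python rather than R, and the function names (owner, local_index, entries_per_process) are hypothetical helpers for this illustration, not part of pbdR.

```python
from collections import Counter


def owner(i, j, block_rows, block_cols, grid_rows, grid_cols):
    """Return the (process row, process column) owning global entry (i, j).

    Indices are 0-based. Blocks are dealt out to the process grid
    round-robin along both dimensions.
    """
    return ((i // block_rows) % grid_rows, (j // block_cols) % grid_cols)


def local_index(i, block, nprocs):
    """Return the position of global index i in its owner's local storage."""
    return (i // (block * nprocs)) * block + i % block


def entries_per_process(n, k, block_rows, block_cols, grid_rows, grid_cols):
    """Count how many entries of an n x k matrix each process holds."""
    counts = Counter()
    for i in range(n):
        for j in range(k):
            counts[owner(i, j, block_rows, block_cols, grid_rows, grid_cols)] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical toy case: 8 x 8 matrix, 2 x 2 blocks, 2 x 2 process grid.
    # Each printed pair is the (row, column) of the owning process.
    for i in range(8):
        print(" ".join("%d%d" % owner(i, j, 2, 2, 2, 2) for j in range(8)))
    print(entries_per_process(8, 8, 2, 2, 2, 2))
```

Changing the block size or the grid shape in these calls changes which process holds which entries. Those are the two settings our study varied.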

Methods and Results

For our study, we ran PCA on a randomly generated data set and recorded the run time of the code. Our pilot study varied n and k, the dimensions of our data matrix. The results showed that the relationship between the dimensions of the matrix and the run time was predictable, which allowed us to keep n and k constant throughout the rest of our study.

 

When changing grid layout and block size, we found that grid layout has less of an effect on the run time than block size. We also observed that the 8×8 block size was consistently faster than the other block sizes, no matter the n, k, or grid layout. We therefore conclude that block size has a clear effect on computational efficiency.

Applications of our Study

As an application of our study, we used data from the lab of Dr. Ian Thorpe describing the movement of amino acids in a protein. The data consisted of 3100 snapshots, each containing the x, y, and z coordinates of amino acids in different atoms of a protein. We performed PCA on the data matrix and also computed a correlation matrix from the data. We then created a level plot of the correlation matrix to see how different amino acids in various atoms correlate with each other.
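
The correlation step can be sketched as follows. This is a minimal plain Python illustration, assuming the data is stored with one row per snapshot and one column per coordinate; it is not the pbdR code used in the study, and the function name correlation_matrix and the toy numbers are hypothetical.

```python
from math import sqrt


def correlation_matrix(rows):
    """Return the Pearson correlation matrix of the columns of `rows`.

    `rows` is a list of equal-length lists (one list per snapshot).
    Assumes no column is constant; a constant column would cause a
    division by zero.
    """
    n = len(rows)
    k = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    # Length of each centered column, used to normalise the cross products.
    norms = [sqrt(sum(c[j] ** 2 for c in centered)) for j in range(k)]
    return [
        [sum(c[a] * c[b] for c in centered) / (norms[a] * norms[b])
         for b in range(k)]
        for a in range(k)
    ]


if __name__ == "__main__":
    # Made-up toy snapshots: column 1 rises with column 0, column 2 falls.
    snapshots = [
        [1.0, 2.0, 3.0],
        [2.0, 4.0, 2.0],
        [3.0, 6.0, 1.0],
        [4.0, 8.0, 0.0],
    ]
    for row in correlation_matrix(snapshots):
        print(["%.2f" % value for value in row])
```

Each entry of the result lies between -1 and 1, and this matrix of values is what a level plot colours.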

 

We then greyed out the correlations that are not statistically significant and produced a level plot of the same data set. The number of visible data points drops substantially, showing that few of these correlations are statistically significant.
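
One way to decide which correlations to grey out is sketched below. The test is an assumption made for illustration: it uses the Fisher z-transformation with a normal approximation and a 0.05 significance level, which may differ from the test applied in our study. The names p_value and mask_insignificant are hypothetical, and None stands in for a greyed-out cell.

```python
from math import atanh, sqrt
from statistics import NormalDist


def p_value(r, n):
    """Two-sided p-value for correlation r from n observations.

    Assumption: Fisher z-transformation with a normal approximation.
    """
    if abs(r) >= 1.0:
        # atanh is undefined at +/-1; a perfect correlation is treated
        # as significant.
        return 0.0
    z = atanh(r) * sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def mask_insignificant(corr, n, alpha=0.05):
    """Replace entries whose p-value is not below alpha with None."""
    return [[r if p_value(r, n) < alpha else None for r in row]
            for row in corr]


if __name__ == "__main__":
    # Made-up correlation matrix; 3100 matches the number of snapshots.
    corr = [
        [1.0, 0.5, 0.01],
        [0.5, 1.0, -0.02],
        [0.01, -0.02, 1.0],
    ]
    for row in mask_insignificant(corr, 3100):
        print(row)
```

This sketch tests each correlation separately and makes no adjustment for the large number of tests performed on a full correlation matrix.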


Links

Matthew G. Bachmann, Ashley D. Dyas, Shelby C. Kilmer, Julian Sass, Andrew Raim, Nagaraj K. Neerchal, Kofi P. Adragani, George Ostrouchov, Ian F. Thorpe. Block Cyclic Distribution of Data in pbdR and its Effects on Computational Efficiency. Technical Report HPCF-2013-11, UMBC High Performance Computing Facility, University of Maryland, Baltimore County, 2013. Reprint in HPCF publications list

Poster presented at the Summer Undergraduate Research Fest (SURF)
