Adapting wave-front algorithms to efficiently utilize systems with deep communication hierarchies

Darren J. Kerbyson, Michael Lang, Scott Pakin
2011 Parallel Computing  
Large-scale systems increasingly exhibit a differential between intra-chip and inter-chip communication performance, especially in hybrid systems using accelerators. Processor cores on the same socket are able to communicate at lower latencies, and with higher bandwidths, than cores on different sockets either within the same node or between nodes. A key challenge is to efficiently use this communication hierarchy and hence optimize performance. We consider here the class of applications that contains wave-front processing. In these applications data can only be processed after their upstream neighbors have been processed. Similar dependencies result between processors in which communication is required to pass boundary data downstream and whose cost is typically impacted by the slowest communication channel in use. In this work we develop a novel hierarchical wave-front approach that reduces the use of slower communications in the hierarchy but at the cost of additional steps in the parallel computation and higher use of on-chip communications. This tradeoff is explored using a performance model. An implementation using the reverse-acceleration programming model on the petascale Roadrunner system demonstrates a 27% performance improvement at full system-scale on a kernel application. The approach is generally applicable to large-scale multi-core and accelerated systems where a differential in communication performance exists.
In this work we examine a class of applications that contain wave-front processing. These applications are characterized by a dependency in their processing flow in which grid-points can only be processed after their upstream neighbors. When mapped to a parallel system, a local sub-grid can only be processed after boundary data is received from upstream neighbors, and it in turn produces boundary data for downstream processors. Blocking of the local sub-grid has been shown to be beneficial in improving the parallel efficiency on large-scale systems [4]. The wave-front processing corresponds to that of a pipeline which takes time to fill and reach full utilization. The slowest communication channel used impacts the performance of the pipeline processing. It has been estimated that a significant number of cycles on the Advanced Simulation and Computing (ASC) machines are used to process applications with wave-front processing [5].
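The dependency pattern and pipeline behavior described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a 2D grid in which each cell depends only on its west and north neighbors, a toy update rule, and a simple step count for a `px`-by-`py` processor array in which every step costs the same. All function names and parameters are hypothetical.

```python
def wavefront_order(nx, ny):
    """Group cells (i, j) of an nx-by-ny grid by anti-diagonal i + j.

    Assumption: each cell depends on its west (i-1, j) and north (i, j-1)
    neighbors, so all cells on one anti-diagonal are mutually independent.
    """
    return [
        [(i, d - i) for i in range(max(0, d - ny + 1), min(nx, d + 1))]
        for d in range(nx + ny - 1)
    ]


def sweep(nx, ny, source=1.0):
    """Toy sweep: cell = source + west + north (0.0 outside the grid)."""
    grid = [[0.0] * ny for _ in range(nx)]
    for diagonal in wavefront_order(nx, ny):
        # Cells within one diagonal could be processed in parallel.
        for i, j in diagonal:
            west = grid[i - 1][j] if i > 0 else 0.0
            north = grid[i][j - 1] if j > 0 else 0.0
            grid[i][j] = source + west + north
    return grid


def pipeline_steps(px, py, n_blocks):
    """Steps for a px-by-py processor array to sweep n_blocks blocks each.

    Assumption: one block per step; the farthest processor starts after
    (px - 1) + (py - 1) fill steps and then works for n_blocks steps.
    """
    return (px - 1) + (py - 1) + n_blocks


def pipeline_efficiency(px, py, n_blocks):
    """Fraction of steps in which a processor does useful work."""
    return n_blocks / pipeline_steps(px, py, n_blocks)


if __name__ == "__main__":
    print(wavefront_order(2, 3))
    print(sweep(2, 2))
    print(pipeline_efficiency(4, 4, 10))
```

Under these assumptions, splitting the local sub-grid into more blocks raises the fraction of steps spent on useful work, because the fill time stays fixed while the number of working steps grows. The sketch ignores the per-block communication cost, which is the cost the slowest channel in the hierarchy dominates.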
A novel wave-front algorithm, termed the hierarchical wave-front, is developed and implemented in this work. The hierarchical wave-front performs the same processing as the standard approach and preserves all data dependencies. It uses knowledge of a local processor-core domain that typically contains all cores on the same chip. In the hierarchical wave-front the number of slower, inter-domain communications is reduced, but at the expense of more parallel computation steps and more intra-domain communications. A trade-off results between the savings in communication and increased on-chip activities. A performance model is used to quantify this trade-off and show that, for a range in the performance-space consisting of core computation performance, inter-domain communication performance and blocking factor, there is a system-size above which performance improvements will result when using the hierarchical wave-front. The hybrid petascale Roadrunner system [6] at Los Alamos is used as a test bed to demonstrate the performance improvements that can be obtained in practice. Roadrunner, a hybrid system containing both AMD Opteron host processors and PowerXCell 8i accelerators, sees a performance improvement of 27% at full-system scale when using the hierarchical wave-front, but improvements occur only when more than 16% of the system is used. The implementation uses the reverse-acceleration programming model [7] in which each core of the accelerator is a separate MPI rank, and host processors simply support their activity. The paper is organized as follows. In Section 2 we provide an overview of wave-front processing and describe how the standard implementation is modified to implement the hierarchical wave-front. In Section 3 a performance model is used to assess the performance potential of the hierarchical wave-front.
An overview of the Roadrunner system is provided in Section 4 illustrating programming models that can be employed and also the performance of its communication hierarchy. A performance comparison of the wave-front implementations on Roadrunner is given in Section 5 that shows that significant improvements are achievable from the hierarchical wave-front.
doi:10.1016/j.parco.2011.02.008 fatcat:z6x2jsgy7nc4xi7acduyxvplye