09:30 | HPC in Formula 1 Aerodynamics SPEAKER: Mark Taylor ABSTRACT. This talk will give an introduction to Formula 1 racing car aerodynamics, together with an insight into the numerical modelling technology that lies behind the car development process. The challenges that the HPC technology community must meet to fulfil the ambitions of McLaren Racing will be outlined. Our goal is to be able to efficiently exploit the next generation of aerodynamic simulation tools that will drive our technology forward in a way that leads to World Championships. |
11:00 | Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with Holistic Performance Modeling SPEAKER: Jeffrey Vetter ABSTRACT. Concerns about energy efficiency and reliability have forced our community to reexamine the full spectrum of architectures, software, and algorithms that constitute our ecosystem. While architectures and programming models have remained relatively stable for almost two decades, new architectural features, such as heterogeneous processing, nonvolatile memory, and optical interconnection networks, will demand that software systems and applications be redesigned so that they expose massive amounts of hierarchical parallelism, carefully orchestrate data movement, and balance concerns over performance, power, resiliency, and productivity. In what DOE has termed 'co-design,' teams of architects, software designers, and application scientists are working collectively to realize an integrated solution to these challenges. A key capability of this activity is accurate modeling of performance, power, and resiliency. We have developed the Aspen performance modeling language, which allows fast exploration of the holistic design space. Aspen is a domain-specific language for structured analytical modeling of applications and architectures. Aspen specifies a formal grammar to describe an abstract machine model and an application's behaviors, including available parallelism, operation counts, data structures, and control flow. Aspen's DSL constrains models to fit the formal language specification, which enforces similar concepts across models and allows for correctness checks. Aspen is designed to enable rapid exploration of new algorithms and architectures. Because of its succinctness, expressiveness, and composability, Aspen can be used to model many properties of a system, including performance, power, and resiliency. Aspen has been used to model traditional HPC applications, and has recently been extended to model scientific workflows for HPC systems and scientific instruments, like ORNL's Spallation Neutron Source. Models can be written manually or generated automatically from other structured representations, such as application source code or execution DAGs. These Aspen models can then be used for a variety of purposes, including predicting the performance of future applications, evaluating system architectures, informing runtime scheduling decisions, and identifying system anomalies. |
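As a flavor of what a structured analytical model expresses, the sketch below pairs an abstract machine model with an application's operation counts to bound runtime by the first saturated resource. It is a minimal Python illustration of the modeling style, using invented class names and numbers; it is not Aspen's actual grammar or toolchain.

```python
# Minimal sketch of a structured analytical performance model in the
# spirit of Aspen; class names and numbers are illustrative assumptions,
# and this is not actual Aspen grammar.
from dataclasses import dataclass

@dataclass
class Machine:            # abstract machine model
    peak_flops: float     # sustained FLOP/s
    bandwidth: float      # memory bandwidth in bytes/s

@dataclass
class Kernel:             # application behavior: counts, not code
    flops: float          # floating point operation count
    bytes_moved: float    # data movement

def predicted_time(k: Kernel, m: Machine) -> float:
    """Runtime bound: the kernel is limited by whichever resource,
    compute or memory traffic, it saturates first."""
    return max(k.flops / m.peak_flops, k.bytes_moved / m.bandwidth)

# Hypothetical node and kernel, for illustration only.
node = Machine(peak_flops=1e15, bandwidth=4e14)
sweep = Kernel(flops=8e17, bytes_moved=6.4e17)
print(f"predicted time: {predicted_time(sweep, node):.2f} s")
```

Because such a model is just counts and a grammar, it can be regenerated from source code or execution DAGs and swept over candidate machine parameters, which is the design-space exploration the abstract describes.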
11:25 | End-to-end Optimization of SeisSol SPEAKER: Alex Breuer ABSTRACT. In this talk we give a comprehensive overview of our end-to-end optimization of the high-order finite element software SeisSol. SeisSol simulates dynamic rupture and seismic wave propagation at petascale performance in production runs. The presented optimizations ultimately aim at minimizing time-to-solution and energy-to-solution. In this context we analyze a broad range of levels in the simulation pipeline and the hardware hierarchy. At the single-node level we show the impact of the convergence order, frequency, vector instruction sets, alignment, and chip-level parallelism. The set of architectures covers three generations of CPUs -- code-named Westmere, Sandy Bridge, and Haswell -- and the Xeon Phi coprocessor. From a performance perspective, especially on state-of-the-art and future architectures, the shift from a memory- to a compute-bound scheme with increasing order is compelling. The second part of the presentation covers large-scale optimizations for heterogeneous and homogeneous supercomputers. Here we present a novel offloading scheme for Xeon Phi-accelerated systems, which carefully schedules the wave propagation component to the coprocessors and the complex dynamic rupture computation to the strong CPUs. Additionally, for native settings, we present our completely redesigned memory and MPI layout. Both of the presented strategies allow for overlapping computation and communication in a very natural way. We conclude the presentation with large-scale results of SeisSol, including sustained DP-PFLOPS performance on homogeneous and multi-DP-PFLOPS performance on heterogeneous supercomputers. |
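The memory- to compute-bound shift with rising order can be made concrete with a roofline-style estimate. The sketch below assumes a simplified cost model for a 3D DG element update (dense small-matrix work over the polynomial basis, fixed per-degree-of-freedom traffic) and an assumed machine balance of 10 flop/byte; none of these numbers are SeisSol's.

```python
# Illustrative estimate of how arithmetic intensity grows with the
# convergence order O of a 3D DG element update. The cost model and the
# machine balance are assumptions for illustration, not SeisSol's numbers.
QUANTITIES = 9       # elastic wave equation: 6 stress + 3 velocity components
BYTES_PER_DOF = 8    # double precision

def basis_size(order: int) -> int:
    """Number of 3D polynomial basis functions of degree < order."""
    return order * (order + 1) * (order + 2) // 6

def arithmetic_intensity(order: int) -> float:
    b = basis_size(order)
    flops = 2 * QUANTITIES * b * b                # dense matrix-style work
    traffic = 2 * QUANTITIES * b * BYTES_PER_DOF  # load + store the DOFs
    return flops / traffic                        # simplifies to b / 8

MACHINE_BALANCE = 10.0   # flop/byte, assumed for a modern CPU
for order in range(2, 8):
    ai = arithmetic_intensity(order)
    regime = "compute" if ai > MACHINE_BALANCE else "memory"
    print(f"order {order}: ~{ai:4.1f} flop/byte -> {regime}-bound")
```

Under these assumed counts the intensity grows cubically with order through the basis size, so low orders sit under the memory roof while high orders cross into the compute-bound regime, which is the qualitative trend the talk reports.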
11:50 | The Impact of Process Placement and Oversubscription on Application Performance: A Case Study for Exascale Computing SPEAKER: Florian Wende ABSTRACT. With the upcoming transition from petascale to exascale computers, radically new methods for scalable and robust computing are required. Computing at exascale speed, i.e. more than 10^18 floating point operations per second, will only be possible on systems with millions of processing units. Unfortunately, the large number of functional components (computing cores, memory chips, network interfaces) will greatly increase the probability of failures, and it can thus not be expected that an exascale application will complete its execution on exactly the same resources on which it was started. |
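The link between component counts and failure rates follows from a simple estimate: with independent failures, the mean time between failures (MTBF) of the whole system shrinks linearly with the number of components. A back-of-the-envelope sketch with an assumed 5-year component MTBF:

```python
# Back-of-the-envelope system reliability estimate (assumed numbers):
# with independent, exponentially distributed failures, the system MTBF
# is the component MTBF divided by the number of components.
component_mtbf_hours = 5 * 365 * 24      # assume 5 years per component
for n_components in (1_000, 100_000, 1_000_000):
    system_mtbf_hours = component_mtbf_hours / n_components
    print(f"{n_components:>9} components -> system MTBF "
          f"~{system_mtbf_hours * 60:.1f} minutes")
```

At a million components this yields a system MTBF of a few minutes, which is why an exascale run cannot assume it will finish on the resources it started with.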
12:15 | Performance optimization of a petascale-enabled finite volume solver SPEAKER: Panagiotis Hadjidoukas ABSTRACT. Cloud cavitation collapse is detrimental to the lifetime of high pressure injection engines and ship propellers and instrumental to kidney lithotripsy and ultrasonic drug delivery. Its study presents a formidable challenge to experimental and computational studies due to its geometric complexity and the multitude of its spatiotemporal scales. Simulations of cloud cavitation collapse require compressible two phase flow solvers capable of capturing interactions between multiple deforming bubbles, traveling pressure waves, formation of shocks and their impact on solid walls. |
13:40 | Massively-parallel GPU-accelerated galaxy simulation SPEAKER: Simon Portegies Zwart ABSTRACT. Simon Portegies Zwart, Professor of Computational Astrophysics at Leiden University, will present massively parallel GPU-accelerated galaxy simulations which have been nominated for the 2014 Gordon Bell prize. |
14:40 | Radiative Transfer Modeling on High Performance Computers Using the Self-Adjoint Transport Equation SPEAKER: Olga Olkhovskaya ABSTRACT. We present a numerical tool for large-scale 3D radiative transport simulations related to high energy density plasmas (HEDP). Typical HEDP problems are intricate and computationally expensive due to the strong coupling of numerous physical processes and the wide range of spatial and temporal scales of plasma structures. These problems are a real challenge for specialists in applied mathematics and can be comprehensively studied only with the use of HPC systems. Energy transport via radiation is an indispensable part of high-temperature gas/plasma numerical analysis. HPC makes it possible to reproduce the 3D radiation field with the desired precision in both the spectral and angular photon distributions. To this end we propose a new parallel technique for radiative transfer simulation. The most popular radiation model for HEDP problems is radiation diffusion, which allows a significant economy of computing resources. The diffusion model provides a precise energy balance; however, it is valid only for states close to thermodynamic equilibrium and is thus not applicable to highly anisotropic radiation fields. Keeping in mind modern HPC capabilities, one may consider a direct solution of the photon transport equation (three-dimensional in space and two-dimensional in the angular variables). It is not difficult to solve the radiation transport equation along characteristic lines, but a precise solution requires a great number of characteristics at each computational point. The number of characteristics should be proportional to the squared number of mesh cells; otherwise a great loss of accuracy is probable because of the well-known "ray effect" in discrete models: if a pair of computational cells is not coupled by rays, and thus not involved in radiative energy exchange, the radiative flux from the "hot" cell is not accounted for in the energy balance of the "cold" cell. Another problem is nonlocality: as a characteristic crosses the entire computational domain from one end to the other, the algorithm is hardly scalable under domain decomposition. The angular non-uniformity of the photon distribution function can be accounted for in the second-order self-adjoint transport equation. In 1951 V. S. Vladimirov established the variational principle for the one-velocity transport equation and derived the appropriate boundary conditions. In 1986 B. N. Chetverushkin proposed a similar approach to radiation transport computation. We apply a discontinuous Galerkin (DG) procedure to the self-adjoint transport equation, which leads to a set of M×N elliptic-type equations. They may be solved independently, which opens the way to any paradigm of parallelization. Note that accurate simulation requires values of tens to hundreds for both M (the number of spectral groups) and N (the number of quadrature points on the sphere). Spatial discretization yields a linear system with a symmetric positive definite matrix, allowing the application of efficient linear solvers (Krylov solvers or Chebyshev-Richardson iterations). The energy balance is calculated via numerical radiative fluxes, which are restored from the discrete photon distributions by means of special quadratures (V. I. Lebedev, 1976). The numerical algorithm was incorporated into the scientific CFD code MARPLE3D (Keldysh Institute of Applied Mathematics - KIAM).
We employed mixed-element computational meshes (hexahedral, tetrahedral, and prismatic cells and their combinations) of up to tens of millions of cells. Numerical experiments demonstrated good scalability on the K-100 hybrid GPGPU-based computing system at KIAM RAS. We have obtained robust numerical procedures suitable for multiscale simulations in finely discretized computational domains. It is a promising technique for upcoming exaflops computing. |
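For reference, a standard even-parity (self-adjoint) form of the transport equation, written here for a purely absorbing medium in our own notation (the authors' exact formulation is not shown in the abstract), makes clear why the discretization decouples into M×N independent elliptic problems:

```latex
% Even-parity (self-adjoint) transport equation for one spectral group m
% and one discrete direction \Omega_n, purely absorbing medium with
% absorption coefficient \kappa_m and emission source q_m (notation assumed):
-\,\boldsymbol{\Omega}_n \cdot \nabla \left( \frac{1}{\kappa_m}\,
    \boldsymbol{\Omega}_n \cdot \nabla \psi_{mn} \right)
  + \kappa_m\, \psi_{mn} = q_m ,
\qquad m = 1,\dots,M,\; n = 1,\dots,N .
% i.e. M x N mutually independent elliptic problems, each with a
% symmetric positive definite discrete operator.
```

Each (group, direction) pair is a separate elliptic solve, so the M×N problems can be distributed across processes with no coupling between them, which is the parallelization opportunity the abstract emphasizes.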
15:05 | A Novel Kinetic Consistent Algorithm of MHD for Massive Parallel Computing SPEAKER: Boris Chetverushkin ABSTRACT. The development of new models and algorithms for the solution of actual physics problems is a high-priority task for the new generation of high performance computer systems with massive parallelism. Greater computing power makes it possible to work with more physically correct models, but the algorithms must be developed to deal with the architectural realities of a high performance computing system. The tremendous progress of kinetic Lattice Boltzmann and kinetically consistent methods in the solution of gas dynamics problems, together with the development of effective parallel algorithms for modern high performance parallel computing systems, has led to the development of advanced models and methods for the solution of magneto gas dynamic problems in critical areas such as plasma physics and astrophysics. The kinetic methods are based on the evolution of the statistical distribution function and the Boltzmann equation, which is the fundamental basis of the kinetic theory of gases. The novel method proposed here extends the Boltzmann-like distribution function with electromagnetic terms for the solution of magneto gas dynamic problems. This gives a powerful tool for the solution of the magneto gas dynamic system of equations within the framework of kinetically consistent schemes based on fundamental physical models with a common approach. As mentioned, the development of algorithms is one of the critical points in using high performance computing systems with massive parallelism. The proposed numerical algorithm is based on an explicit scheme, considered preferable for the new generation of high performance parallel computing systems. This is explained by the logical simplicity and efficiency of such algorithms and the possibility of easy adaptation to modern high performance parallel computer systems, including hybrid systems with graphics processors. However, the stability conditions of explicit algorithms are the price to be paid for algorithmic simplicity, and they impose a strong limitation on the time step for fine spatial discretizations. The proposed algorithm includes a regularization mechanism through hyperbolic terms in the equations. This mechanism improves the stability of the algorithm, with a consequent relaxation of the time step on highly detailed spatial meshes. The proposed hyperbolic-type kinetic magneto gas dynamics model provides a more stable condition for numerical calculations and the possibility of realizing high spatial resolution with an acceptable time step. The analysis shows that, in the region of physical parameters of an ionized gas (plasma), the time step for the hyperbolic type of the magneto gas dynamic equations is relaxed by an order of magnitude in comparison with the time step for the parabolic type of equations. The proposed model and algorithm are used for 3D simulations of magneto gas dynamic processes in astrophysics, in particular for detailed modeling of the accretion of interstellar matter onto a compact massive astrophysical object ("black hole"). REFERENCES: 1. B. Chetverushkin, Kinetic Schemes and Quasi-Gas Dynamic System of Equations, CIMNE, Barcelona, Spain, 2008. 2. B. Chetverushkin, N. D'Ascenzo, V. Saveliev, Kinetically Consistent Magnetogasdynamics Equations and Their Use in Supercomputer Computing, Doklady Mathematics, vol. 90, no. 1, pp. 495-498. |
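The order-of-magnitude time-step relaxation can be illustrated schematically on a model equation: adding a small second-order time derivative turns the explicit stability constraint from quadratic to linear in the mesh size h. The model problem and constants below are our schematic illustration, not the authors' full kinetic MHD system:

```latex
% Parabolic model problem: an explicit scheme needs a quadratic step bound.
\partial_t u = \kappa\,\Delta u
  \quad\Longrightarrow\quad
  \Delta t \le C\,\frac{h^{2}}{\kappa}.
% Hyperbolic regularization with a small parameter \tau > 0: the equation
% becomes a damped wave equation with finite signal speed \sqrt{\kappa/\tau},
% so the CFL bound is linear in the mesh size h.
\tau\,\partial_{tt} u + \partial_t u = \kappa\,\Delta u
  \quad\Longrightarrow\quad
  \Delta t \le C'\,h\,\sqrt{\tau/\kappa}.
```

On a finely resolved mesh the linear bound is far less restrictive than the quadratic one, which is the stated motivation for the hyperbolic regularization in an explicit, massively parallel setting.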
14:40 | Improving Performance Portability and Exascale Software Productivity with the Nabla Numerical Programming Language SPEAKER: Jean-Sylvain Camier ABSTRACT. Addressing the major challenges of software productivity and performance portability is becoming necessary to take advantage of emerging extreme-scale computing architectures. As software development costs continuously increase to address exascale hardware issues, higher-level programming abstractions will ease the path forward. There is a growing demand for new programming environments that improve scientific productivity, facilitate design and implementation, and optimize large production codes. In this context, we present the numerical-analysis specific language Nabla, which improves the productivity of applied mathematicians and enables new algorithmic developments for the construction of hierarchical and composable high-performance scientific applications. The introduction of hierarchical logical time (HLT) within the high-performance computing scientific community represents an innovation that addresses the major exascale challenges. This new dimension of parallelism is expressed explicitly to go beyond the classical single-program-multiple-data or bulk-synchronous-parallel programming models. Control and data concurrency are combined consistently to achieve statically analyzable transformations and efficient code generation. Shifting the complexity to Nabla's compiler offers ease of programming and a more intuitive approach, while retaining the ability to target new hardware and leading to performance portability. The three main parts of the toolchain are presented: the front-end raises the level of abstraction with its grammar; the back-ends hold the effective generation stages; and the middle-end provides agile software engineering practices transparently to the application developer, such as instrumentation (performance analysis, V&V, debugging at scale), data and resource optimization techniques (layout, locality, prefetching, cache awareness, vectorization, loop fusion), and the management of the hierarchical logical time, which produces the graphs of all parallel tasks. The refactoring of existing legacy scientific applications is also possible through the incremental compositional approach of the method. Feedback and grammatical patterns are given for several different numerical schemes that have been successfully ported to or written in Nabla: the explicit unstructured Lagrangian hydrodynamics Sedov blast wave problem is solved in three dimensions, along with another Cartesian scheme, and an implicit two-dimensional solution of the Schrödinger equation is also illustrated. The numerical-analysis specific language Nabla provides a productive development path for exascale HPC technologies, flexible enough to be competitive in terms of performance. As a demonstration of the potential and efficiency of this approach, we present several benchmark implementations and evaluate their performance over a wide variety of hardware architectures (Xeon, Xeon Phi and CUDA). Raising the level of abstraction prepares the framework to address growing concerns of future systems: the generation stages will be able to incorporate and exploit algorithmic or low-level resiliency methods by coordinating co-designed techniques between the software stack and the underlying runtime system. |
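To convey the flavor of hierarchical logical time, the sketch below tags jobs with nested logical-time labels, executes the labels in lexicographic order, and dispatches jobs that share a label concurrently. This is a conceptual Python illustration only; it is not Nabla syntax, and the job names are invented:

```python
# Conceptual sketch of hierarchical logical time (HLT): each job carries a
# nested logical-time label; labels run in lexicographic order, and jobs
# sharing a label are independent and may run concurrently. This is an
# illustration of the idea, not Nabla's grammar or its compiler machinery.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

jobs = [
    ((2.0,),     "apply_boundary_conditions"),   # hypothetical job names
    ((1.0,),     "compute_pressure"),
    ((1.0,),     "compute_viscosity"),           # same label: concurrent
    ((1.0, 0.5), "reduce_min_timestep"),         # nested sub-step of 1.0
]

def run(name: str) -> None:
    print(f"running {name}")

# Group jobs by logical time, then execute the groups in order; within a
# group, tasks are independent and dispatched in parallel.
groups = defaultdict(list)
for label, name in jobs:
    groups[label].append(name)

with ThreadPoolExecutor() as pool:
    for label in sorted(groups):                 # (1.0,) < (1.0, 0.5) < (2.0,)
        list(pool.map(run, groups[label]))       # barrier between time levels
```

The point of the hierarchy is that the ordering constraints are data the compiler can analyze statically, which is how Nabla derives the task graphs mentioned in the abstract.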
15:05 | Exploiting Hierarchical Exascale Hardware using a PGAS Approach SPEAKER: Karl Fuerlinger ABSTRACT. A number of daunting challenges have been identified [1] on the way to exascale computing. Hardware architecture (particularly on the node level) must change to achieve the desired performance and efficiency goals, and this will have profound implications for the way in which high performance applications have to be written. Locality of data access will become an even more important aspect than it is today, as hardware vendors are forced to abandon node-wide cache coherence, several types of RAM (3D-stacked/DRAM/NVRAM) are introduced, and complex on-chip interconnection networks are employed. |
16:00 | Evaluating the Performance of Stencil Codes at Scale SPEAKER: Manish Modani ABSTRACT. Stencil-based codes are widely used in scientific computing and are considered to be good candidates for running at scale. A halo kernel has been run for a range of halo sizes and architectures on up to 64,000 cores. Results demonstrate that stencil communication does not show weak scaling properties when running on these architectures. This lack of scaling is analysed on the Blue Gene/Q and is found to be due to network contention in the torus. A task-to-topology mapping is devised which allows the kernel to scale as expected. |
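A minimal version of such a halo-exchange kernel, reduced here to a 1D periodic process grid with mpi4py (the talk's actual benchmark spans multiple halo widths, higher-dimensional decompositions, and up to 64,000 cores), looks roughly as follows:

```python
# Minimal 1D halo-exchange kernel sketch with mpi4py; this only
# illustrates the communication pattern being measured, not the
# benchmark or the task-to-topology mapping from the talk.
# Run with e.g.: mpirun -n 4 python halo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
cart = comm.Create_cart(dims=[comm.Get_size()], periods=[True])
left, right = cart.Shift(0, 1)           # neighbor ranks in the 1D torus

halo = 2                                 # halo width in grid points
local = np.ones(1000 + 2 * halo)         # interior plus two halo regions
recv_l, recv_r = np.empty(halo), np.empty(halo)

reqs = [
    cart.Irecv(recv_l, source=left),                     # fill left halo
    cart.Irecv(recv_r, source=right),                    # fill right halo
    cart.Isend(local[halo:2 * halo].copy(), dest=left),  # my left edge
    cart.Isend(local[-2 * halo:-halo].copy(), dest=right),
]
MPI.Request.Waitall(reqs)                # the step that is timed at scale
local[:halo], local[-halo:] = recv_l, recv_r
```

At scale, how MPI ranks are placed onto the physical torus determines how many of these messages contend for the same links, which is why the task-to-topology mapping restores the expected scaling.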
16:25 | PHG: a parallel adaptive finite element toolbox towards exascale and its applications SPEAKER: Tao Cui ABSTRACT. PHG is a toolbox for developing parallel adaptive finite element programs, currently under active development at the State Key Laboratory of Scientific and Engineering Computing of the Chinese Academy of Sciences. PHG deals with conforming tetrahedral meshes, uses bisection for adaptive local mesh refinement, and uses MPI for message passing. PHG has an object-oriented design which hides parallelization details and provides common operations on meshes and finite element functions in an abstract way, allowing users to concentrate on their numerical algorithms. In this talk, the main algorithms in PHG and some applications, such as parasitic extraction simulation of large-scale interconnects and simulation of 3D seismic waves, will be introduced. Numerical experiments with up to 8 billion unknowns, using up to 196,608 CPU cores on Tianhe-2, will be presented to demonstrate that PHG is robust and scalable. |
16:00 | ExaShark+GASPI: Reducing the burden to program large HPC systems since 2014 SPEAKER: Tom Vander Aa ABSTRACT. Several trends in HPC systems make it challenging to quickly and easily develop applications that perform well. One important trend is the increased number of levels in HPC systems: more levels of memory and interconnect, and more levels of parallelism (SIMD, multi-threading, multi-core, …). Since each level of each type has its own properties, this has led to a plethora of different programming models: shared-memory threading techniques like Pthreads, OpenMP or TBB, and inter-node distribution techniques like MPI. It is challenging for an HPC programmer to choose the right mix of libraries and learn how to use each of them. Another trend, a side effect of this increased number of levels, is that there are more entities in the system and those entities are on average further apart. Communication latencies increase, making it unfeasible to get good performance with applications that rely solely on global and/or synchronous communication (such as “traditional MPI” applications). An alternative and less traditional approach is offered by PGAS (Partitioned Global Address Space) languages, which are asynchronous and more distributed by nature and allow for overlap of communication and computation. Since PGAS languages are relatively new and require programmers to rethink their applications’ communication patterns, they are not yet widely used nor widely mastered. This talk is about how ExaShark, a library for handling n-dimensional distributed arrays, combined with the GASPI PGAS language, aims to reduce the increasing programming burden while still providing good performance. ExaShark offers its users global-array-like usability while its underlying runtime builds on GASPI to take optimal advantage of the PGAS paradigm. Next to GASPI, ExaShark can be configured to use any of the aforementioned programming models. ExaShark has been used to develop applications of different scales, ranging from standalone advanced pipelined conjugate gradient solvers to a complete particle-in-cell simulator. In this talk we will demonstrate that, by using GASPI as the main underlying programming model for ExaShark, an application can take advantage of the PGAS library without having to know it is underneath (code portability). On the other hand, we will also show that significant performance can be gained by actively changing the communication patterns in the application to better exploit the asynchronous nature of GASPI, i.e. performance portability does not come for free. |
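The kind of restructuring this last point refers to, replacing "exchange, then compute" with communication that overlaps interior work, follows a generic pattern sketched below with plain MPI nonblocking calls as stand-ins; the actual ExaShark and GASPI APIs are not shown in the abstract and are not reproduced here:

```python
# Generic overlap of communication and computation, the pattern favored by
# asynchronous PGAS runtimes such as GASPI; written with plain mpi4py
# nonblocking calls as stand-ins, not the ExaShark or GASPI API.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

u = np.random.rand(10002)                # local slab with one-point halos
new = np.empty_like(u)
ghost_l, ghost_r = np.empty(1), np.empty(1)

# 1. Post the halo communication ...
reqs = [comm.Irecv(ghost_l, source=left),
        comm.Irecv(ghost_r, source=right),
        comm.Isend(u[1:2].copy(), dest=left),
        comm.Isend(u[-2:-1].copy(), dest=right)]

# 2. ... and update the interior, which needs no remote data, while the
#    halo messages are in flight.
new[2:-2] = 0.5 * (u[1:-3] + u[3:-1])

# 3. Only the two boundary points have to wait for the network.
MPI.Request.Waitall(reqs)
u[0], u[-1] = ghost_l[0], ghost_r[0]
new[1] = 0.5 * (u[0] + u[2])
new[-2] = 0.5 * (u[-3] + u[-1])
```

The library's promise is that the distributed-array interface hides step 1 and step 3 from the user, while the performance study shows that arranging the application so step 2 has enough work is what actually buys the speedup.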
16:25 | Towards Resilient Chapel SPEAKER: Konstantina Panagiotopoulou ABSTRACT. The rapidly increasing number of components in modern High Performance Computing (HPC) systems poses a challenge to their resilience: predictions of the time between failures on exascale systems range from hours to minutes [6]. Yet, the prevalent HPC programming model today does not tolerate faults. In this work, we outline the design of transparent resilience for Chapel [2], a parallel HPC language with a focus on scalability, portability and productivity, following the Partitioned Global Address Space (PGAS) [5] programming model. The PGAS model assumes a global address space, logically partitioned so that each portion has affinity to a process, and exploits locality of reference. |