09:30 | HPC in Formula 1 Aerodynamics SPEAKER: Mark Taylor ABSTRACT. This talk will give an introduction to Formula 1 racing car aerodynamics, together with an insight into the numerical modelling technology that lies behind the car development process. The challenges that the HPC technology community must meet to fulfil the ambitions of McLaren Racing will be outlined. Our goal is to be able to efficiently exploit the next generation of aerodynamic simulation tools that will drive our technology forward in a way that leads to World Championships. |
11:00 | Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with Holistic Performance Modeling SPEAKER: Jeffrey Vetter ABSTRACT. Concerns about energy efficiency and reliability have forced our community to reexamine the full spectrum of architectures, software, and algorithms that constitute our ecosystem. While architectures and programming models have remained relatively stable for almost two decades, new architectural features, such as heterogeneous processing, nonvolatile memory, and optical interconnection networks, will demand that software systems and applications be redesigned so that they expose massive amounts of hierarchical parallelism, carefully orchestrate data movement, and balance concerns over performance, power, resiliency, and productivity. In what DOE has termed 'co-design,' teams of architects, software designers, and application scientists are working collectively to realize an integrated solution to these challenges. A key capability of this activity is accurate modeling of performance, power, and resiliency. We have developed the Aspen performance modeling language, which allows fast exploration of the holistic design space. Aspen is a domain-specific language for structured analytical modeling of applications and architectures. Aspen specifies a formal grammar to describe an abstract machine model and an application's behaviors, including available parallelism, operation counts, data structures, and control flow. Aspen's DSL constrains models to fit the formal language specification, which enforces similar concepts across models and allows for correctness checks. Aspen is designed to enable rapid exploration of new algorithms and architectures. Because of its succinctness, expressiveness, and composability, Aspen can be used to model many properties of a system, including performance, power, and resiliency. Aspen has been used to model traditional HPC applications, and has recently been extended to model scientific workflows for HPC systems and scientific instruments, like ORNL's Spallation Neutron Source. Models can be written manually or generated automatically from other structured representations, such as application source code or execution DAGs. These Aspen models can then be used for a variety of purposes, including predicting the performance of future applications, evaluating system architectures, informing runtime scheduling decisions, and identifying system anomalies. |
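As a flavor of what a structured analytical model expresses, the sketch below pairs an abstract machine model with an application's operation counts to bound runtime by the first saturated resource. It is a minimal Python illustration of the modeling style, using invented class names and numbers; it is not Aspen's actual grammar or toolchain.

```python
# Minimal sketch of a structured analytical performance model in the
# spirit of Aspen; class names and numbers are illustrative assumptions,
# and this is not actual Aspen grammar.
from dataclasses import dataclass

@dataclass
class Machine:            # abstract machine model
    peak_flops: float     # sustained FLOP/s
    bandwidth: float      # memory bandwidth in bytes/s

@dataclass
class Kernel:             # application behavior: counts, not code
    flops: float          # floating point operation count
    bytes_moved: float    # data movement

def predicted_time(k: Kernel, m: Machine) -> float:
    """Runtime bound: the kernel is limited by whichever resource,
    compute or memory traffic, it saturates first."""
    return max(k.flops / m.peak_flops, k.bytes_moved / m.bandwidth)

# Hypothetical node and kernel, for illustration only.
node = Machine(peak_flops=1e15, bandwidth=4e14)
sweep = Kernel(flops=8e17, bytes_moved=6.4e17)
print(f"predicted time: {predicted_time(sweep, node):.2f} s")
```

Because such a model is just counts and a grammar, it can be regenerated from source code or execution DAGs and swept over candidate machine parameters, which is the design-space exploration the abstract describes.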
11:25 | End-to-end Optimization of SeisSol SPEAKER: Alex Breuer ABSTRACT. In this talk we give a comprehensive overview of our end-to-end optimization of the high-order finite element software SeisSol. SeisSol simulates dynamic rupture and seismic wave propagation at petascale performance in production runs. The presented optimizations ultimately aim at minimizing time-to-solution and energy-to-solution. In this context we analyze a broad range of levels in the simulation pipeline and the hardware hierarchy. At the single-node level we show the impact of the convergence order, frequency, vector instruction sets, alignment, and chip-level parallelism. The set of architectures covers three generations of CPUs -- code-named Westmere, Sandy Bridge, and Haswell -- and the Xeon Phi coprocessor. From a performance perspective, especially on state-of-the-art and future architectures, the shift from a memory- to a compute-bound scheme with increasing order is compelling. The second part of the presentation covers large-scale optimizations for heterogeneous and homogeneous supercomputers. Here we present a novel offloading scheme for Xeon Phi-accelerated systems, which carefully schedules the wave propagation component to the coprocessors and the complex dynamic rupture computation to the strong CPUs. Additionally, for native settings, we present our completely redesigned memory and MPI layout. Both of the presented strategies allow for overlapping computation and communication in a very natural way. We conclude the presentation with large-scale results of SeisSol, including sustained DP-PFLOPS performance on homogeneous and multi-DP-PFLOPS performance on heterogeneous supercomputers. |
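The memory- to compute-bound shift with rising order can be made concrete with a roofline-style estimate. The sketch below assumes a simplified cost model for a 3D DG element update (dense small-matrix work over the polynomial basis, fixed per-degree-of-freedom traffic) and an assumed machine balance of 10 flop/byte; none of these numbers are SeisSol's.

```python
# Illustrative estimate of how arithmetic intensity grows with the
# convergence order O of a 3D DG element update. The cost model and the
# machine balance are assumptions for illustration, not SeisSol's numbers.
QUANTITIES = 9       # elastic wave equation: 6 stress + 3 velocity components
BYTES_PER_DOF = 8    # double precision

def basis_size(order: int) -> int:
    """Number of 3D polynomial basis functions of degree < order."""
    return order * (order + 1) * (order + 2) // 6

def arithmetic_intensity(order: int) -> float:
    b = basis_size(order)
    flops = 2 * QUANTITIES * b * b                # dense matrix-style work
    traffic = 2 * QUANTITIES * b * BYTES_PER_DOF  # load + store the DOFs
    return flops / traffic                        # simplifies to b / 8

MACHINE_BALANCE = 10.0   # flop/byte, assumed for a modern CPU
for order in range(2, 8):
    ai = arithmetic_intensity(order)
    regime = "compute" if ai > MACHINE_BALANCE else "memory"
    print(f"order {order}: ~{ai:4.1f} flop/byte -> {regime}-bound")
```

Under these assumed counts the intensity grows cubically with order through the basis size, so low orders sit under the memory roof while high orders cross into the compute-bound regime, which is the qualitative trend the talk reports.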
11:50 | The Impact of Process Placement and Oversubscription on Application Performance: A Case Study for Exascale Computing SPEAKER: Florian Wende ABSTRACT. With the upcoming transition from petascale to exascale computers, radically new methods for scalable and robust computing are required. Computing at exascale speed, i.e. more than 10^18 floating point operations per second, will only be possible on systems with millions of processing units. Unfortunately, the large number of functional components (computing cores, memory chips, network interfaces) will greatly increase the probability of failures, and it can thus not be expected that an exascale application will complete its execution on exactly the same resources on which it was started. |
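The link between component counts and failure rates follows from a simple estimate: with independent failures, the mean time between failures (MTBF) of the whole system shrinks linearly with the number of components. A back-of-the-envelope sketch with an assumed 5-year component MTBF:

```python
# Back-of-the-envelope system reliability estimate (assumed numbers):
# with independent, exponentially distributed failures, the system MTBF
# is the component MTBF divided by the number of components.
component_mtbf_hours = 5 * 365 * 24      # assume 5 years per component
for n_components in (1_000, 100_000, 1_000_000):
    system_mtbf_hours = component_mtbf_hours / n_components
    print(f"{n_components:>9} components -> system MTBF "
          f"~{system_mtbf_hours * 60:.1f} minutes")
```

At a million components this yields a system MTBF of a few minutes, which is why an exascale run cannot assume it will finish on the resources it started with.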
12:15 | Performance optimization of a petascale-enabled finite volume solver SPEAKER: Panagiotis Hadjidoukas ABSTRACT. Cloud cavitation collapse is detrimental to the lifetime of high pressure injection engines and ship propellers and instrumental to kidney lithotripsy and ultrasonic drug delivery. Its study presents a formidable challenge to experimental and computational studies due to its geometric complexity and the multitude of its spatiotemporal scales. Simulations of cloud cavitation collapse require compressible two phase flow solvers capable of capturing interactions between multiple deforming bubbles, traveling pressure waves, formation of shocks and their impact on solid walls. |
13:40 | Massively-parallel GPU-accelerated galaxy simulation SPEAKER: Simon Portegies Zwart ABSTRACT. Simon Portegies Zwart, Professor of Computational Astrophysics at Leiden University, will present massively parallel GPU-accelerated galaxy simulations which have been nominated for the 2014 Gordon Bell prize. |
14:40 | Radiative Transfer Modeling on High Performance Computers Using the Self-Adjoint Transport Equation SPEAKER: Olga Olkhovskaya ABSTRACT. We present a numerical tool for large-scale 3D radiative transport simulations related to high energy density plasmas (HEDP). Typical HEDP problems are intricate and computationally expensive due to the strong coupling of numerous physical processes and the wide range of spatial and temporal scales of plasma structures. These problems are a real challenge for specialists in applied mathematics and can be comprehensively studied only with the use of HPC systems. Energy transport via radiation is an indispensable part of high-temperature gas/plasma numerical analysis. HPC makes it possible to reproduce the 3D radiation field with the desired precision in both the spectral and angular photon distributions. To this end we propose a new parallel technique for radiative transfer simulation. The most popular radiation model for HEDP problems is radiation diffusion, which allows a significant economy of computing resources. The diffusion model provides a precise energy balance; however, it is valid only for states close to thermodynamic equilibrium and is thus not applicable to highly anisotropic radiation fields. Keeping in mind modern HPC capabilities, one may consider a direct solution of the photon transport equation (three-dimensional in space and two-dimensional in the angular variables). It is not difficult to solve the radiation transport equation along characteristic lines, but a precise solution requires a great number of characteristics at each computational point. The number of characteristics should be proportional to the squared number of mesh cells; otherwise a great loss of accuracy is probable because of the well-known "ray effect" in discrete models: if a pair of computational cells is not coupled by rays, and thus not involved in radiative energy exchange, the radiative flux from the "hot" cell is not accounted for in the energy balance of the "cold" cell. Another problem is nonlocality: as a characteristic crosses the entire computational domain from one end to the other, the algorithm is hardly scalable under domain decomposition. The angular non-uniformity of the photon distribution function can be accounted for in the second-order self-adjoint transport equation. In 1951 V. S. Vladimirov established the variational principle for the one-velocity transport equation and derived the appropriate boundary conditions. In 1986 B. N. Chetverushkin proposed a similar approach to radiation transport computation. We apply a discontinuous Galerkin (DG) procedure to the self-adjoint transport equation, which leads to a set of M×N elliptic-type equations. They may be solved independently, which opens the way to any paradigm of parallelization. Note that accurate simulation requires values of tens to hundreds for both M (the number of spectral groups) and N (the number of quadrature points on the sphere). Spatial discretization yields a linear system with a symmetric positive definite matrix, allowing the application of efficient linear solvers (Krylov solvers or Chebyshev-Richardson iterations). The energy balance is calculated via numerical radiative fluxes, which are restored from the discrete photon distributions by means of special quadratures (V. I. Lebedev, 1976). The numerical algorithm was incorporated into the scientific CFD code MARPLE3D (Keldysh Institute of Applied Mathematics - KIAM).
We employed mixed-element computational meshes (hexahedral, tetrahedral, and prismatic cells and their combinations) of up to tens of millions of cells. Numerical experiments demonstrated good scalability on the K-100 hybrid GPGPU-based computing system at KIAM RAS. We have obtained robust numerical procedures suitable for multiscale simulations in finely discretized computational domains. It is a promising technique for upcoming exaflops computing. |
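For reference, a standard even-parity (self-adjoint) form of the transport equation, written here for a purely absorbing medium in our own notation (the authors' exact formulation is not shown in the abstract), makes clear why the discretization decouples into M×N independent elliptic problems:

```latex
% Even-parity (self-adjoint) transport equation for one spectral group m
% and one discrete direction \Omega_n, purely absorbing medium with
% absorption coefficient \kappa_m and emission source q_m (notation assumed):
-\,\boldsymbol{\Omega}_n \cdot \nabla \left( \frac{1}{\kappa_m}\,
    \boldsymbol{\Omega}_n \cdot \nabla \psi_{mn} \right)
  + \kappa_m\, \psi_{mn} = q_m ,
\qquad m = 1,\dots,M,\; n = 1,\dots,N .
% i.e. M x N mutually independent elliptic problems, each with a
% symmetric positive definite discrete operator.
```

Each (group, direction) pair is a separate elliptic solve, so the M×N problems can be distributed across processes with no coupling between them, which is the parallelization opportunity the abstract emphasizes.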
15:05 | A Novel Kinetic Consistent Algorithm of MHD for Massive Parallel Computing SPEAKER: Boris Chetverushkin ABSTRACT. The development of new models and algorithms for the solution of actual physics problems is a high-priority task for the new generation of high performance computer systems with massive parallelism. Greater computing power makes it possible to work with more physically correct models, but the algorithms must be developed to deal with the architectural realities of a high performance computing system. The tremendous progress of kinetic Lattice Boltzmann and kinetically consistent methods in the solution of gas dynamics problems, together with the development of effective parallel algorithms for modern high performance parallel computing systems, has led to the development of advanced models and methods for the solution of magneto gas dynamic problems in critical areas such as plasma physics and astrophysics. The kinetic methods are based on the evolution of the statistical distribution function and the Boltzmann equation, which is the fundamental basis of the kinetic theory of gases. The novel method proposed here extends the Boltzmann-like distribution function with electromagnetic terms for the solution of magneto gas dynamic problems. This gives a powerful tool for the solution of the magneto gas dynamic system of equations within the framework of kinetically consistent schemes based on fundamental physical models with a common approach. As mentioned, the development of algorithms is one of the critical points in using high performance computing systems with massive parallelism. The proposed numerical algorithm is based on an explicit scheme, considered preferable for the new generation of high performance parallel computing systems. This is explained by the logical simplicity and efficiency of such algorithms and the possibility of easy adaptation to modern high performance parallel computer systems, including hybrid systems with graphics processors. However, the stability conditions of explicit algorithms are the price to be paid for algorithmic simplicity, and they impose a strong limitation on the time step for fine spatial discretizations. The proposed algorithm includes a regularization mechanism through hyperbolic terms in the equations. This mechanism improves the stability of the algorithm, with a consequent relaxation of the time step on highly detailed spatial meshes. The proposed hyperbolic-type kinetic magneto gas dynamics model provides a more stable condition for numerical calculations and the possibility of realizing high spatial resolution with an acceptable time step. The analysis shows that, in the region of physical parameters of an ionized gas (plasma), the time step for the hyperbolic type of the magneto gas dynamic equations is relaxed by an order of magnitude in comparison with the time step for the parabolic type of equations. The proposed model and algorithm are used for 3D simulations of magneto gas dynamic processes in astrophysics, in particular for detailed modeling of the accretion of interstellar matter onto a compact massive astrophysical object ("black hole"). REFERENCES: 1. B. Chetverushkin, Kinetic Schemes and Quasi-Gas Dynamic System of Equations, CIMNE, Barcelona, Spain, 2008. 2. B. Chetverushkin, N. D'Ascenzo, V. Saveliev, Kinetically Consistent Magnetogasdynamics Equations and Their Use in Supercomputer Computing, Doklady Mathematics, vol. 90, no. 1, pp. 495-498. |
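The order-of-magnitude time-step relaxation can be illustrated schematically on a model equation: adding a small second-order time derivative turns the explicit stability constraint from quadratic to linear in the mesh size h. The model problem and constants below are our schematic illustration, not the authors' full kinetic MHD system:

```latex
% Parabolic model problem: an explicit scheme needs a quadratic step bound.
\partial_t u = \kappa\,\Delta u
  \quad\Longrightarrow\quad
  \Delta t \le C\,\frac{h^{2}}{\kappa}.
% Hyperbolic regularization with a small parameter \tau > 0: the equation
% becomes a damped wave equation with finite signal speed \sqrt{\kappa/\tau},
% so the CFL bound is linear in the mesh size h.
\tau\,\partial_{tt} u + \partial_t u = \kappa\,\Delta u
  \quad\Longrightarrow\quad
  \Delta t \le C'\,h\,\sqrt{\tau/\kappa}.
```

On a finely resolved mesh the linear bound is far less restrictive than the quadratic one, which is the stated motivation for the hyperbolic regularization in an explicit, massively parallel setting.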
14:40 | Improving Performance Portability and Exascale Software Productivity with the Nabla Numerical Programming Language SPEAKER: Jean-Sylvain Camier ABSTRACT. Addressing the major challenges of software productivity and performance portability is becoming necessary to take advantage of emerging extreme-scale computing architectures. As software development costs continuously increase to address exascale hardware issues, higher-level programming abstractions will ease the path forward. There is a growing demand for new programming environments that improve scientific productivity, facilitate design and implementation, and optimize large production codes. In this context, we present the numerical-analysis specific language Nabla, which improves the productivity of applied mathematicians and enables new algorithmic developments for the construction of hierarchical and composable high-performance scientific applications. The introduction of hierarchical logical time (HLT) within the high-performance computing scientific community represents an innovation that addresses the major exascale challenges. This new dimension of parallelism is expressed explicitly to go beyond the classical single-program-multiple-data or bulk-synchronous-parallel programming models. Control and data concurrency are combined consistently to achieve statically analyzable transformations and efficient code generation. Shifting the complexity to Nabla's compiler offers ease of programming and a more intuitive approach, while retaining the ability to target new hardware and leading to performance portability. The three main parts of the toolchain are presented: the front-end raises the level of abstraction with its grammar; the back-ends hold the effective generation stages; and the middle-end provides agile software engineering practices transparently to the application developer, such as instrumentation (performance analysis, V&V, debugging at scale), data and resource optimization techniques (layout, locality, prefetching, cache awareness, vectorization, loop fusion), and the management of the hierarchical logical time, which produces the graphs of all parallel tasks. The refactoring of existing legacy scientific applications is also possible through the incremental compositional approach of the method. Feedback and grammatical patterns are given for several different numerical schemes that have been successfully ported to or written in Nabla: the explicit unstructured Lagrangian hydrodynamics Sedov blast wave problem is solved in three dimensions, along with another Cartesian scheme, and an implicit two-dimensional solution of the Schrödinger equation is also illustrated. The numerical-analysis specific language Nabla provides a productive development path for exascale HPC technologies, flexible enough to be competitive in terms of performance. As a demonstration of the potential and efficiency of this approach, we present several benchmark implementations and evaluate their performance over a wide variety of hardware architectures (Xeon, Xeon Phi and CUDA). Raising the level of abstraction prepares the framework to address growing concerns of future systems: the generation stages will be able to incorporate and exploit algorithmic or low-level resiliency methods by coordinating co-designed techniques between the software stack and the underlying runtime system. |
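To convey the flavor of hierarchical logical time, the sketch below tags jobs with nested logical-time labels, executes the labels in lexicographic order, and dispatches jobs that share a label concurrently. This is a conceptual Python illustration only; it is not Nabla syntax, and the job names are invented:

```python
# Conceptual sketch of hierarchical logical time (HLT): each job carries a
# nested logical-time label; labels run in lexicographic order, and jobs
# sharing a label are independent and may run concurrently. This is an
# illustration of the idea, not Nabla's grammar or its compiler machinery.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

jobs = [
    ((2.0,),     "apply_boundary_conditions"),   # hypothetical job names
    ((1.0,),     "compute_pressure"),
    ((1.0,),     "compute_viscosity"),           # same label: concurrent
    ((1.0, 0.5), "reduce_min_timestep"),         # nested sub-step of 1.0
]

def run(name: str) -> None:
    print(f"running {name}")

# Group jobs by logical time, then execute the groups in order; within a
# group, tasks are independent and dispatched in parallel.
groups = defaultdict(list)
for label, name in jobs:
    groups[label].append(name)

with ThreadPoolExecutor() as pool:
    for label in sorted(groups):                 # (1.0,) < (1.0, 0.5) < (2.0,)
        list(pool.map(run, groups[label]))       # barrier between time levels
```

The point of the hierarchy is that the ordering constraints are data the compiler can analyze statically, which is how Nabla derives the task graphs mentioned in the abstract.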
15:05 | Exploiting Hierarchical Exascale Hardware using a PGAS Approach SPEAKER: Karl Fuerlinger ABSTRACT. A number of daunting challenges have been identified [1] on the way to exascale computing. Hardware architecture (particularly on the node level) must change to achieve the desired performance and efficiency goals, and this will have profound implications for the way in which high performance applications have to be written. Locality of data access will become an even more important aspect than it is today, as hardware vendors are forced to abandon node-wide cache coherence, several types of RAM (3D-stacked/DRAM/NVRAM) are introduced, and complex on-chip interconnection networks are employed. |
16:00 | Evaluating the Performance of Stencil Codes at Scale SPEAKER: Manish Modani ABSTRACT. Stencil-based codes are widely used in scientific computing and are considered to be good candidates for running at scale. A halo kernel has been run for a range of halo sizes and architectures on up to 64,000 cores. Results demonstrate that stencil communication does not show weak scaling properties when running on these architectures. This lack of scaling is analysed on the Blue Gene/Q and is found to be due to network contention in the torus. A task-to-topology mapping is devised which allows the kernel to scale as expected. |
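A minimal version of such a halo-exchange kernel, reduced here to a 1D periodic process grid with mpi4py (the talk's actual benchmark spans multiple halo widths, higher-dimensional decompositions, and up to 64,000 cores), looks roughly as follows:

```python
# Minimal 1D halo-exchange kernel sketch with mpi4py; this only
# illustrates the communication pattern being measured, not the
# benchmark or the task-to-topology mapping from the talk.
# Run with e.g.: mpirun -n 4 python halo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
cart = comm.Create_cart(dims=[comm.Get_size()], periods=[True])
left, right = cart.Shift(0, 1)           # neighbor ranks in the 1D torus

halo = 2                                 # halo width in grid points
local = np.ones(1000 + 2 * halo)         # interior plus two halo regions
recv_l, recv_r = np.empty(halo), np.empty(halo)

reqs = [
    cart.Irecv(recv_l, source=left),                     # fill left halo
    cart.Irecv(recv_r, source=right),                    # fill right halo
    cart.Isend(local[halo:2 * halo].copy(), dest=left),  # my left edge
    cart.Isend(local[-2 * halo:-halo].copy(), dest=right),
]
MPI.Request.Waitall(reqs)                # the step that is timed at scale
local[:halo], local[-halo:] = recv_l, recv_r
```

At scale, how MPI ranks are placed onto the physical torus determines how many of these messages contend for the same links, which is why the task-to-topology mapping restores the expected scaling.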
16:25 | PHG: a parallel adaptive finite element toolbox towards exascale and its applications SPEAKER: Tao Cui ABSTRACT. PHG is a toolbox for developing parallel adaptive finite element programs, currently under active development at the State Key Laboratory of Scientific and Engineering Computing of the Chinese Academy of Sciences. PHG deals with conforming tetrahedral meshes, uses bisection for adaptive local mesh refinement, and uses MPI for message passing. PHG has an object-oriented design which hides parallelization details and provides common operations on meshes and finite element functions in an abstract way, allowing users to concentrate on their numerical algorithms. In this talk, the main algorithms in PHG and some applications, such as parasitic extraction simulation of large-scale interconnects and simulation of 3D seismic waves, will be introduced. Numerical experiments with up to 8 billion unknowns, using up to 196,608 CPU cores on Tianhe-2, will be presented to demonstrate that PHG is robust and scalable. |
16:00 | ExaShark+GASPI: Reducing the burden to program large HPC systems since 2014 SPEAKER: Tom Vander Aa ABSTRACT. Several trends in HPC systems make it challenging to quickly and easily develop applications that perform well. One important trend is the increased number of levels in HPC systems: more levels of memory and interconnect, and more levels of parallelism (SIMD, multi-threading, multi-core, …). Since each level of each type has its own properties, this has led to a plethora of different programming models: shared-memory threading techniques like Pthreads, OpenMP or TBB, and inter-node distribution techniques like MPI. It is challenging for an HPC programmer to choose the right mix of libraries and learn how to use each of them. Another trend, a side effect of this increased number of levels, is that there are more entities in the system and those entities are on average further apart. Communication latencies increase, making it unfeasible to get good performance with applications that rely solely on global and/or synchronous communication (such as “traditional MPI” applications). An alternative and less traditional approach is offered by PGAS (Partitioned Global Address Space) languages, which are asynchronous and more distributed by nature and allow for overlap of communication and computation. Since PGAS languages are relatively new and require programmers to rethink their applications’ communication patterns, they are not yet widely used nor widely mastered. This talk is about how ExaShark, a library for handling n-dimensional distributed arrays, combined with the GASPI PGAS language, aims to reduce the increasing programming burden while still providing good performance. ExaShark offers its users global-array-like usability while its underlying runtime builds on GASPI to take optimal advantage of the PGAS paradigm. Next to GASPI, ExaShark can be configured to use any of the aforementioned programming models. ExaShark has been used to develop applications of different scales, ranging from standalone advanced pipelined conjugate gradient solvers to a complete particle-in-cell simulator. In this talk we will demonstrate that, by using GASPI as the main underlying programming model for ExaShark, an application can take advantage of the PGAS library without having to know it is underneath (code portability). On the other hand, we will also show that significant performance can be gained by actively changing the communication patterns in the application to better exploit the asynchronous nature of GASPI, i.e. performance portability does not come for free. |
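The kind of restructuring this last point refers to, replacing "exchange, then compute" with communication that overlaps interior work, follows a generic pattern sketched below with plain MPI nonblocking calls as stand-ins; the actual ExaShark and GASPI APIs are not shown in the abstract and are not reproduced here:

```python
# Generic overlap of communication and computation, the pattern favored by
# asynchronous PGAS runtimes such as GASPI; written with plain mpi4py
# nonblocking calls as stand-ins, not the ExaShark or GASPI API.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

u = np.random.rand(10002)                # local slab with one-point halos
new = np.empty_like(u)
ghost_l, ghost_r = np.empty(1), np.empty(1)

# 1. Post the halo communication ...
reqs = [comm.Irecv(ghost_l, source=left),
        comm.Irecv(ghost_r, source=right),
        comm.Isend(u[1:2].copy(), dest=left),
        comm.Isend(u[-2:-1].copy(), dest=right)]

# 2. ... and update the interior, which needs no remote data, while the
#    halo messages are in flight.
new[2:-2] = 0.5 * (u[1:-3] + u[3:-1])

# 3. Only the two boundary points have to wait for the network.
MPI.Request.Waitall(reqs)
u[0], u[-1] = ghost_l[0], ghost_r[0]
new[1] = 0.5 * (u[0] + u[2])
new[-2] = 0.5 * (u[-3] + u[-1])
```

The library's promise is that the distributed-array interface hides step 1 and step 3 from the user, while the performance study shows that arranging the application so step 2 has enough work is what actually buys the speedup.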
16:25 | Towards Resilient Chapel SPEAKER: Konstantina Panagiotopoulou ABSTRACT. The rapidly increasing number of components in modern High Performance Computing (HPC) systems poses a challenge to their resilience: predictions of the time between failures on exascale systems range from hours to minutes [6]. Yet, the prevalent HPC programming model today does not tolerate faults. In this work, we outline the design of transparent resilience for Chapel [2], a parallel HPC language with a focus on scalability, portability and productivity, following the Partitioned Global Address Space (PGAS) [5] programming model. The PGAS model assumes a global address space, logically partitioned so that each portion has affinity to a process, and exploits locality of reference. |