09:50 | Challenges of the exascale SPEAKER: Pete Beckman ABSTRACT. Pete Beckman, Director of the Exascale Technology and Computing Institute at Argonne National Laboratory, will provide an update on recent progress in the US with a particular focus on software for the exascale. |
11:20 | Leveraging a Cluster-Booster Architecture for Brain-Scale Simulations SPEAKER: Pramod Kumbhar ABSTRACT. HPC architectures need dramatic changes to reach sustainable exascale computing. One exploratory European effort towards exascale is the DEEP (Dynamic Exascale Entry Platform) project. The DEEP concept is motivated by the fact that many HPC applications show more than one level of concurrency: highly scalable kernels with O(N) concurrency and other kernels with O(K) concurrency, 1 ≤ K ≪ N, where N is the large number of cores exploited in parallel. To help overcome the performance and scaling challenges of these applications on current accelerator-based systems, the DEEP platform consists of an x86-based “Cluster” with an InfiniBand interconnect to run complex, less scalable code parts and an Intel MIC-based “Booster” with the EXTOLL network to run highly scalable compute kernels. |
11:45 | Energy measurement at the Exascale SPEAKER: Nick Johnson ABSTRACT. When exascale is being discussed, the figures of 1 Exaflop and 20MW are usually cited as the targets to aim for. Runtime performance measurement is well established, and the tools available on HPC systems can give a rich range of accurate information, from cache misses to peak flop rate, for scientific applications. We find that there is often little consideration given to energy monitoring and, where it is present, it is often at low resolution compared to runtime measurements. |
12:10 | System Software Stack for Efficiency of Exascale Supercomputer Centers SPEAKER: Vladimir Voevodin ABSTRACT. We take supercomputers’ colossal abilities for granted and expect returns to match. These expectations may be justified, but it turns out real life is not that optimistic. Many people know that the performance of supercomputers on real-life applications is very low: in most cases it’s just a few percent of the peak values. But few really know how inefficient a supercomputing center generally is. Nothing is too small here, and every element of a supercomputing center has to be reviewed thoroughly – from the accepted policy of task queuing to system software configuration and engineering infrastructure. |
12:35 | Efficiently Scheduling Task Dataflow Parallelism SPEAKER: Hans Vandierendonck ABSTRACT. Increased system variability and irregularity of parallelism in applications put increasing demands on the efficiency of dynamic task schedulers. This paper presents a new design for a work-stealing scheduler supporting both Cilk-style recursively parallel code and parallelism deduced from dataflow dependences. Initial evaluation on a set of linear algebra kernels demonstrates that our scheduler outperforms PLASMA’s QUARK scheduler by up to 16.3% on 32 threads. Moreover, by reducing scheduling overhead, our new design supports finer-grain tasks, which improves strong scaling. The many-core roadmap for processors dictates that the number of cores on a processor chip increases at an exponential rate. Moreover, cores tend to operate at different speeds due to process variability and thermal constraints. As such, parallel task schedulers in the exascale era must make dynamic (runtime) scheduling decisions. The task dataflow notation has been studied widely as a viable approach to facilitate the specification of highly parallel codes. Task dataflow dependences specify an ordering of tasks (they leverage a task graph), which by its nature exposes a higher degree of parallelism than barrier-based models where one must wait periodically for all running tasks to complete. Dynamic schedulers are, however, prone to result in less performance than static schedulers due to runtime task scheduling overhead. This work investigates a new design for a task dataflow dynamic scheduler. The key aim for this design is to minimize runtime overhead without affecting the task dataflow programming interface. The scheduler supports programs mixing recursive divide-and-conquer parallelism and task dataflow parallelism. This hybrid design simplifies, for instance, the exploitation of parallelism across multiple kernels called in succession.
The scheduler combines the efficiency of Cilk’s work stealing scheduler for recursively parallel programs with the efficiency of Aurora’s task queue for programs generating large numbers of simultaneously ready tasks. We evaluate our design experimentally and compare against a prior design and against PLASMA’s QUARK scheduler on a set of BLAS kernels. The experimental platform is a 2-node, 32-core AMD 6272 Bulldozer machine with 2.1 GHz clock frequency. Our new scheduler reduces end-to-end execution time on several linear algebra kernels by 7%–9% on 32 threads over our previous design. Moreover, our new scheduler outperforms the QUARK scheduler by up to 16.3%. Further evaluation shows that our scheduler also works efficiently at smaller tile sizes than QUARK. This has important implications for strong scaling on exascale systems, as a higher degree of fine-grain parallelism can be utilized by our scheduler. |
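The work-stealing discipline described above (Cilk-style local LIFO execution plus FIFO stealing from other workers) can be illustrated with a minimal sketch. This is a generic illustration in Python, not the authors' scheduler; the class and method names are hypothetical, and a real implementation would use lock-free deques rather than mutexes.

```python
import collections
import random
import threading

class WorkStealingPool:
    """Minimal sketch of a Cilk-style work-stealing pool: each worker owns a
    deque, pops tasks LIFO from its own end, and steals FIFO from others."""

    def __init__(self, n_workers=4):
        self.n = n_workers
        self.deques = [collections.deque() for _ in range(n_workers)]
        self.locks = [threading.Lock() for _ in range(n_workers)]
        self.results = []
        self.results_lock = threading.Lock()

    def submit(self, worker_id, task):
        with self.locks[worker_id]:
            self.deques[worker_id].append(task)

    def _next_task(self, wid):
        # Try the worker's own deque first (LIFO keeps cache-hot work local)...
        with self.locks[wid]:
            if self.deques[wid]:
                return self.deques[wid].pop()
        # ...then steal the oldest task from a random victim (FIFO).
        for victim in random.sample(range(self.n), self.n):
            with self.locks[victim]:
                if self.deques[victim]:
                    return self.deques[victim].popleft()
        return None  # no work anywhere

    def run(self):
        def worker(wid):
            while True:
                task = self._next_task(wid)
                if task is None:
                    return  # tasks here spawn no children, so idle means done
                result = task()
                with self.results_lock:
                    self.results.append(result)
        threads = [threading.Thread(target=worker, args=(i,))
                   for i in range(self.n)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

pool = WorkStealingPool(4)
for i in range(100):
    pool.submit(0, lambda i=i: i * i)  # all tasks start on worker 0
pool.run()
print(sum(pool.results))  # 0^2 + 1^2 + ... + 99^2 = 328350
```

Because every task initially lands on worker 0, the other three workers make progress only by stealing, which is exactly the load-balancing mechanism the abstract relies on.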
11:20 | Portability and Performance of Nuclear Reactor Simulations on Many-Core Architectures SPEAKER: Ronald Rahaman ABSTRACT. High-fidelity simulation of a full scale nuclear reactor core is a computational challenge that has yet to be met but is predicted to be achievable on exascale-class supercomputers. Several reactor simulations (such as OpenMC and SimpleMOC) are being developed specifically to run efficiently on exascale machines through established hardware-specific programming models (such as OpenMP and CUDA). Recently developed, hardware-agnostic programming models offer opportunities to express multi-threaded parallelism in a portable fashion and allow a single, more unified code base to run on many divergent high-performance computing architectures. Though the benefits of portability are clear, questions remain as to what practical performance tradeoffs apply to real-world applications. In the present study, we port two existing proxy applications that represent key algorithms in nuclear reactor simulations to the hardware-agnostic language of OCCA. Performance and efficiency of the OCCA ports are compared to the native OpenMP versions on CPU and CUDA versions on GPU architectures. This study attempts to quantify tradeoffs between performance and portability of real-world applications, specifically exascale-class simulations for the nuclear industry, using newer programming models. |
11:45 | A highly scalable Met Office NERC Cloud model SPEAKER: Nick Brown ABSTRACT. The existing Met Office Large Eddy Simulation model, the LEM, was initially developed in the 1980s and whilst the scientific output from the model is cutting edge, the code itself has become outdated. Hardcoded assumptions made about parallelism, which were sensible 20 years ago, are now the source of severe limitations and this means that the model does not scale beyond 512 cores. Whilst scientists wish to use this application to model clouds at a very high resolution (down to 1m or on domains of many 1000s of km) on the latest HPC machines, this is currently impossible due to the inherent limits of the code. As machines become larger, and significantly different from the architectures that a code was initially designed for, it can sometimes be easier to rewrite poorly performing applications than to attempt to modernise them through refactoring. We present here MONC (Met Office NERC Cloud model), a complete rewrite of the LEM, designed to exploit parallelism on the order of hundreds of thousands of cores. This provides the scientific community with a tool for modelling clouds at very high resolutions and/or near real time. The fact that this is a community code, along with the desire to future proof to as great an extent as possible, has heavily influenced the design of the code. We have adopted a “plug in” architecture, where the model is organised as a series of distinctive, independent components which can be selected at runtime. This approach, which we will discuss in detail, not only allows a variety of science to be easily integrated but also supports development targeting different architectures and technologies by simply replacing one component with another. Other innovative aspects of the code will also be presented and these include our approach to data analysis and processing, which is a major feature of the model.
These models analyse their raw data to produce higher level information, for instance the average temperature in a cloud or tracking of how specific clouds move through the atmosphere. The existing LEM, like many codes, performs data analysis inline as part of the model timestep which, along with the I/O operation time, is a major bottleneck. Instead MONC uses the notion of an I/O server, where typically one core per processor is dedicated to handling and analysing the data produced by the model, which is running on the remaining cores. In this manner, MONC can act in a “fire and forget” fashion, asynchronously sending data to the “local” I/O server and continuing on with the next timestep whilst it is being processed. Based upon the innovative approaches adopted, we will present the performance and scalability improvements in MONC compared to the existing LEM. We will discuss some of the lessons learnt, both in terms of large scale parallelism and software engineering techniques, that have become apparent whilst redeveloping this code. |
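The "fire and forget" hand-off to a dedicated I/O server can be sketched in miniature: the model loop pushes raw fields onto a queue and immediately starts its next timestep, while a separate worker reduces the data to higher-level diagnostics. This is a single-process, thread-based analogy for illustration only; MONC's actual I/O server runs on dedicated cores and communicates over MPI, and all names below are hypothetical.

```python
import queue
import threading

def io_server(inbox, summaries):
    """Dedicated 'I/O server': drains raw fields and reduces them to
    higher-level diagnostics while the model keeps timestepping."""
    while True:
        step, field = inbox.get()
        if field is None:  # sentinel: the model has finished
            return
        summaries[step] = sum(field) / len(field)  # e.g. mean temperature

inbox = queue.Queue()
summaries = {}
server = threading.Thread(target=io_server, args=(inbox, summaries))
server.start()

# Model loop: compute a timestep, hand the raw data off asynchronously,
# and move straight on to the next step ("fire and forget").
for step in range(5):
    field = [step + 0.1 * i for i in range(10)]  # stand-in for model data
    inbox.put((step, field))

inbox.put((None, None))  # shut the server down
server.join()
```

The key property is that `inbox.put` returns immediately, so analysis and I/O cost no longer sit inside the model timestep.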
12:10 | Performance Analysis and Optimisation of MetUM on CRAY XC30 SPEAKER: Karthee Sivalingam ABSTRACT. The Met Office Unified Model (MetUM) needs to be ported to different supercomputing platforms to facilitate effective collaboration of climate scientists around the world. We present an update on the ongoing efforts to port and optimise the MetUM on ARCHER. ARCHER is a Cray XC30 supercomputer and can achieve performance on the order of petaflops. The performance of the UM on ARCHER is analysed using the Cray Performance Analysis Tools (CrayPAT) and is compared with other supercomputers such as the Cray XE6 and IBM 775. With significant improvement in network bandwidth, a 10 to 15% speedup is achieved using MPI rank placements and other MPI settings. The UM has 65% thread load imbalance, and thread performance is improved by adding new OpenMP regions using Cray Reveal. We also discuss issues in scaling the UM to petaflop performance and beyond. |
12:35 | Exploring the Memory-Efficient Implementation Model for Incompressible Smoothed Particle Hydrodynamics (ISPH) SPEAKER: Xiaohu Guo ABSTRACT. As we transition from the current petascale era to the future exascale, many factors will force application developers to reformulate their fundamental algorithms and implementation approaches: overall levels of concurrency, the relative cost of FLOP/s compared to data movement, available memory per floating point unit, the depth and complexity of the memory hierarchy, awareness of power costs, and overall resilience characteristics. Building on a previous EPSRC project and the current eCSE project, this paper presents a memory-efficient implementation of incompressible SPH targeting architectures with large numbers of lower-frequency cores and a decreasing memory-to-core ratio. This trend imposes a strong evolutionary pressure on numerical algorithms and software to utilise the available memory and network bandwidth efficiently. Smoothed Particle Hydrodynamics is an efficient numerical method which has been applied to many industrial problems, in areas including coastal, offshore and naval hydrodynamics, multi-phase mixtures such as oil & water, geotechnics, and high-temperature applications in manufacturing. Developing an efficient, scalable parallel SPH toolkit would attract significant interest from industrial companies. In HPC research, SPH is one of the few applications with the potential to scale on future exascale platforms. The portability of SPH and other particle-based methods to heterogeneous architectures has already been demonstrated, owing to their high floating-point intensity and the relative ease of implementing them in a cache-friendly manner.
Because it requires solving a pressure Poisson equation, ISPH [lind] not only inherits the floating-point intensity of SPH but also the memory-bound nature of a large sparse linear system solver. Developing such exascale-oriented software requires careful design of data structures and algorithms in order to achieve strong scalability on large-scale systems. The parallel implementation typically combines data-locality-aware algorithms (e.g. space-filling curves) with unstructured communication mechanisms (e.g. fast distributed hash table searching algorithms). Building on a previous EPSRC project and the current eCSE project [spheric13], this paper presents a memory-efficient implementation of incompressible SPH. We will discuss the use of a preconditioned dynamic vector [Dominguez] for nearest-neighbour list construction, and the trade-off between reducing memory accesses and memory footprint versus increased floating-point operations when applying loop optimisation techniques to the nearest-neighbour list searching kernels and the other SPH kernels. Performance analysis and results with large numbers of particles show promising efficiency at very large core counts. |
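The nearest-neighbour search at the heart of any SPH kernel is usually built on a cell list: particles are hashed into grid cells of side equal to the interaction radius, so each particle only needs to be compared against the particles in its own and adjacent cells. The sketch below illustrates that generic idea; it is not the preconditioned dynamic vector scheme the abstract refers to, and all names are hypothetical.

```python
from collections import defaultdict

def build_neighbour_lists(positions, h):
    """Cell-list neighbour search: hash each particle into a grid cell of
    side h, then compare only against the 27 surrounding cells instead of
    all N particles (O(N) expected cost for roughly uniform densities)."""
    cells = defaultdict(list)
    for i, (x, y, z) in enumerate(positions):
        cells[(int(x // h), int(y // h), int(z // h))].append(i)

    neighbours = {i: [] for i in range(len(positions))}
    for (cx, cy, cz), members in cells.items():
        for i in members:
            xi, yi, zi = positions[i]
            # Scan the 3x3x3 block of cells around particle i's cell.
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    for dz in (-1, 0, 1):
                        for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                            if j == i:
                                continue
                            xj, yj, zj = positions[j]
                            r2 = (xi-xj)**2 + (yi-yj)**2 + (zi-zj)**2
                            if r2 < h * h:  # within the interaction radius
                                neighbours[i].append(j)
    return neighbours

pts = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (5.0, 5.0, 5.0)]
nl = build_neighbour_lists(pts, h=1.0)
print(nl[0], nl[2])  # particle 0 sees particle 1; particle 2 is isolated
```

The memory-versus-FLOP trade-off the abstract mentions shows up directly here: one can store the full `neighbours` lists (more memory traffic, fewer repeated distance computations) or recompute candidate pairs from the cell hash on each kernel sweep.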
14:00 | HPC in the private sector SPEAKER: Cynthia McIntyre ABSTRACT. Cynthia McIntyre, Senior Vice President, Council on Competitiveness, will focus on the value of HPC engagement with industry. |
15:30 | Zacros Software Package Development: Pushing the Frontiers of Kinetic Monte Carlo Simulation in Catalysis SPEAKER: Jens Hedegaard Nielsen ABSTRACT. We present work done in the embedded CSE project “Zacros Software Package Development: Pushing the Frontiers of Kinetic Monte Carlo Simulation in Catalysis”. Zacros (http://zacros.org) is a graph-theoretical kinetic Monte Carlo (KMC) code for simulating chemical reactions on catalytic surfaces. During the course of the simulation, the application stores all the possible chemical reaction steps that can happen for the current lattice configuration. The simulation proceeds by executing one of these steps at a time. Event selection is random and based on the events’ kinetic rate constants, also known as propensities. Events with higher propensity are executed more frequently during the course of the simulation. |
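Propensity-weighted event selection, as described above, is the standard KMC step: draw a uniform random number scaled by the total rate and find which event's cumulative rate it falls into. A minimal generic sketch follows (it illustrates the textbook method, not Zacros's internal implementation, which maintains the propensity list incrementally for efficiency):

```python
import bisect
import random

def select_event(propensities, rng=random):
    """Pick one event index with probability proportional to its kinetic
    rate constant (propensity), via inverse transform sampling on the
    cumulative rate list."""
    cumulative = []
    total = 0.0
    for k in propensities:
        total += k
        cumulative.append(total)
    r = rng.random() * total          # uniform in [0, total)
    return bisect.bisect_right(cumulative, r)

# Events with rates 1, 10 and 100: the fastest event dominates the trajectory.
rates = [1.0, 10.0, 100.0]
counts = [0, 0, 0]
rng = random.Random(42)               # fixed seed for reproducibility
for _ in range(10000):
    counts[select_event(rates, rng)] += 1
print(counts)  # roughly in the ratio 1 : 10 : 100
```

With a sorted cumulative list the lookup is O(log M) per event for M possible reactions, which matters when the lattice admits many thousands of candidate steps.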
15:55 | Swift: task-based hydrodynamics at Durham’s IPCC SPEAKER: Tom Theuns ABSTRACT. Simulations of the evolution of structures in the Universe have become increasingly successful at reproducing observations of the cosmos. Such calculations use major HPC resources on national and international supercomputers, and play a crucial role in interpreting observations as well as planning new observatories. The huge dynamic range of such calculations severely limits good strong scaling behaviour of the community codes in use, mostly due to poor load balancing, especially on newer shared-memory massively parallel architectures, i.e. multicores. This limits the science return from such calculations. The rapid increase in parallelism of current and future architectures, combined with the promise of accelerators, is likely to increasingly exacerbate these limitations. In this talk I will present results from a new cosmological hydrodynamical code called Swift, which aims to overcome these limitations. Swift uses task-based parallelism designed for many-core compute nodes interacting over MPI using asynchronous communication, i.e. hybrid shared/distributed-memory parallelism. A graph-based domain decomposition schedules interdependent tasks over available resources, using the QuickSched library. Strong scaling tests on realistic particle distributions yield excellent parallel efficiency (60 per cent parallel efficiency for a run with a 1000-fold increase in core count), and efficient cache usage provides a large speedup compared to current codes even on a single core. Future work will concentrate on developing a self-tuning strategy to adapt to hardware specific parameters (e.g. advanced vector extensions and scatter-gather intrinsics), as well as extensions to accelerators. The techniques and algorithms used in Swift are likely to benefit other computational physics areas as well, for example that of compressible hydrodynamics. 
For details of this open-source project, see www.swiftsim.com |
15:30 | Exascale Computing for Everyone: Cloud-based, Distributed and Heterogeneous SPEAKER: Gordon Inggs ABSTRACT. We argue that widespread adoption of exascale computing will be in the form of distributed computing resources provided on a utility basis, i.e. by a Cloud Computing vendor. Furthermore, these resources will be heterogeneous, comprised of not only Von Neumann Machine CPUs, but also massively parallel compute devices such as Graphics Processing Units (GPUs) and accelerators implemented using reconfigurable computing devices, i.e. large Field Programmable Gate Arrays (FPGAs). A further consideration is the growing range of hybrid devices that incorporate these elements within a single device, such as Intel’s Xeon Phi and Xilinx’s Zynq system-on-chips. |
15:55 | Paradigm Shift for EXASCALE Computing SPEAKER: Paraskevas Evripidou ABSTRACT. HPC/Supercomputers target large problems that have a high degree of parallelism. Programming of such machines is mainly done through parallel extensions of the sequential model such as MPI and OpenMP. These extensions do facilitate high-productivity parallel programming, but also suffer from the limitations of sequential synchronization and their inability to tolerate long latencies. Arvind and Iannucci, Data-Flow proponents, have been warning us since the 1980s about the two fundamental issues in multiprocessing: “long memory latencies and waits due to synchronization events”. Peter Kogge, the leader of the DARPA/USA study group for exascale computing, confirmed that the communication and synchronization latency of the sequential model is getting out of hand for HPC/Exascale machines. Furthermore, Kogge states that the power consumption of an exascale computer will be around 500 MW. Michael Flynn stated in his keynote speech at PFL 2012 that “We have multi-threaded, superscalar cores with limited ILP; worse yet, most of the die area (80%) is devoted to two or three levels of cache to support the illusion of sequential model”. Hence a paradigm shift for exascale computing is necessary. Data-Flow is a programming/execution model that provides tolerance to communication and synchronization latencies. Data-Flow has been proposed by a number of researchers as an alternative to the control flow model. Data Driven Multithreading (DDM) is a dynamic threaded Data-Flow model that schedules threads based on data availability on sequential processors. A DDM program consists of a number of threads that have producer-consumer relationships. Data dependencies between threads are uncovered at compile time, while their scheduling is done dynamically at runtime by the Thread Scheduling Units (TSU).
Threads are scheduled for execution using Data-Flow semantics, while the instructions within threads are executed in a control flow manner. Evaluation results of DDM on a variety of platforms showed that DDM can indeed tolerate synchronization and communication latency. When comparing DDM with OpenMP, it was found that DDM performed better for all benchmarks used. This is primarily due to the fact that DDM effectively tolerates latency. In the case of benchmarks with low thread granularities, DDM also had the advantage of its inherently low parallelization and thread-switching overheads. Similar results were obtained when comparing DDM implemented on a cluster of four Cell processors with CellSs and Sequoia. Our work on CacheFlow, the memory hierarchy system for DDM, showed that the TSU is aware of the threads scheduled for execution in the near future, and hence of the data that will be needed in the near future. This enables the implementation of optimized cache placement and replacement policies, resulting in the need for smaller caches, or the replacement of the cache with a small scratch-pad memory. Furthermore, taking advantage of the near-future memory references results in more efficient cache replacement policies that reduce bus traffic. Data-Flow provides tolerance to communication and synchronization latency, and has the potential of making the processor smaller and more power efficient. Thus, it makes sense to consider a shift to the Data-Flow paradigm now. |
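The DDM firing rule described above (a thread becomes runnable only once all of its producers have delivered their data, with a TSU decrementing per-thread dependency counts) can be sketched as follows. This is an illustrative sequential simulation of the scheduling logic only, with hypothetical names; a real TSU runs concurrently with the processors and in some DDM systems is implemented in hardware.

```python
from collections import deque

def ddm_schedule(threads, consumers):
    """Sketch of Data-Driven Multithreading scheduling: each thread fires
    only when its dependency count reaches zero; completing a thread
    decrements the counts of its consumers (the TSU's bookkeeping).
    `threads` maps name -> (dependency_count, body);
    `consumers` maps name -> names of threads consuming its output."""
    counts = {name: cnt for name, (cnt, _) in threads.items()}
    ready = deque(name for name, cnt in counts.items() if cnt == 0)
    order = []
    while ready:
        name = ready.popleft()
        threads[name][1]()           # run the thread body (control flow)
        order.append(name)
        for c in consumers.get(name, []):
            counts[c] -= 1
            if counts[c] == 0:       # all inputs produced: ready to fire
                ready.append(c)
    return order

log = []
threads = {
    "load_a":  (0, lambda: log.append("a")),
    "load_b":  (0, lambda: log.append("b")),
    "combine": (2, lambda: log.append("a+b")),  # waits for both producers
}
consumers = {"load_a": ["combine"], "load_b": ["combine"]}
order = ddm_schedule(threads, consumers)
print(order)  # ['load_a', 'load_b', 'combine']
```

Note there are no barriers: `combine` fires the moment its own two inputs arrive, regardless of any other work in flight, which is the source of Data-Flow's latency tolerance.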
16:20 | Lightning Poster Talks (sponsored by ICEOTOPE) SPEAKER: Multiple Presenters ABSTRACT. 16:20 Peter Hopton (ICEOTOPE) - Poster session welcome talk 16:30 Multiple presenters - Lightning poster presentations |