09:50 | Challenges of the exascale SPEAKER: Pete Beckman ABSTRACT. Pete Beckman, Director of the Exascale Technology and Computing Institute at Argonne National Laboratory, will provide an update on recent progress in the US with a particular focus on software for the exascale. |
11:20 | Leveraging a Cluster-Booster Architecture for Brain-Scale Simulations SPEAKER: Pramod Kumbhar ABSTRACT. HPC architectures need dramatic changes to reach sustainable exascale computing. One exploratory European effort towards exascale is the DEEP (Dynamic Exascale Entry Platform) project. The DEEP concept is motivated by the fact that many HPC applications show more than one level of concurrency: highly scalable kernels with O(N) concurrency and other kernels with O(K) concurrency, 1 ≤ K ≪ N, where N is the large number of cores exploited in parallel. To help overcome the performance and scaling challenges of these applications on current accelerator-based systems, the DEEP platform consists of an x86-based “Cluster” with an InfiniBand interconnect to run complex, less scalable code parts and an Intel MIC-based “Booster” with the EXTOLL network to run highly scalable compute kernels. |
11:45 | Energy measurement at the Exascale SPEAKER: Nick Johnson ABSTRACT. When exascale is being discussed, the figures of 1 Exaflop and 20MW are usually cited as the targets to aim for. Runtime performance measurement is well established, and the tools available on HPC systems can give a rich range of accurate information, from cache misses to peak flop rate, for scientific applications. We find that there is often little consideration given to energy monitoring and, where it is present, it is often at low resolution compared to runtime measurements. |
12:10 | System Software Stack for Efficiency of Exascale Supercomputer Centers SPEAKER: Vladimir Voevodin ABSTRACT. We take supercomputers’ colossal abilities for granted and expect returns to match. These expectations may be justified, but it turns out real life is not that optimistic. Many people know that the performance of supercomputers on real-life applications is very low: in most cases it’s just a few percent of the peak values. But few really know how inefficient a supercomputing center generally is. Nothing is too small here, and every element of a supercomputing center has to be reviewed thoroughly – from the accepted policy of task queuing to system software configuration and engineering infrastructure. |
12:35 | Efficiently Scheduling Task Dataflow Parallelism SPEAKER: Hans Vandierendonck ABSTRACT. Increased system variability and irregularity of parallelism in applications put increasing demands on the efficiency of dynamic task schedulers. This paper presents a new design for a work-stealing scheduler supporting both Cilk-style recursively parallel code and parallelism deduced from dataflow dependences. Initial evaluation on a set of linear algebra kernels demonstrates that our scheduler outperforms PLASMA’s QUARK scheduler by up to 16.3% on 32 threads. Moreover, by reducing scheduling overhead, our new design supports finer-grain tasks, which improves strong scaling. The many-core roadmap for processors dictates that the number of cores on a processor chip increases at an exponential rate. Moreover, cores tend to operate at different speeds due to process variability and thermal constraints. As such, parallel task schedulers in the exascale era must make dynamic (runtime) scheduling decisions. The task dataflow notation has been studied widely as a viable approach to facilitate the specification of highly parallel codes. Task dataflow dependences specify an ordering of tasks (they leverage a task graph), which by its nature exposes a higher degree of parallelism than barrier-based models where one must wait periodically for all running tasks to complete. Dynamic schedulers are, however, prone to result in less performance than static schedulers due to runtime task scheduling overhead. This work investigates a new design for a task dataflow dynamic scheduler. The key aim for this design is to minimize runtime overhead without affecting the task dataflow programming interface. The scheduler supports programs mixing recursive divide-and-conquer parallelism and task dataflow parallelism. This hybrid design simplifies, for instance, the exploitation of parallelism across multiple kernels called in succession.
The scheduler combines the efficiency of Cilk’s work stealing scheduler for recursively parallel programs with the efficiency of Aurora’s task queue for programs generating large numbers of simultaneously ready tasks. We evaluate our design experimentally and compare against a prior design and against PLASMA’s QUARK scheduler on a set of BLAS kernels. The experimental platform is a 2-node, 32-core AMD 6272 Bulldozer machine with 2.1 GHz clock frequency. Our new scheduler reduces end-to-end execution time on several linear algebra kernels by 7%–9% on 32 threads over our previous design. Moreover, our new scheduler outperforms the QUARK scheduler by up to 16.3%. Further evaluation shows that our scheduler also works efficiently at smaller tile sizes than QUARK. This has important implications for strong scaling on exascale systems, as a higher degree of fine-grain parallelism can be utilized by our scheduler. |
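The work-stealing discipline described above (Cilk-style local LIFO execution plus FIFO stealing from other workers) can be illustrated with a minimal sketch. This is a generic illustration in Python, not the authors' scheduler; the class and method names are hypothetical, and a real implementation would use lock-free deques rather than mutexes.

```python
import collections
import random
import threading

class WorkStealingPool:
    """Minimal sketch of a Cilk-style work-stealing pool: each worker owns a
    deque, pops tasks LIFO from its own end, and steals FIFO from others."""

    def __init__(self, n_workers=4):
        self.n = n_workers
        self.deques = [collections.deque() for _ in range(n_workers)]
        self.locks = [threading.Lock() for _ in range(n_workers)]
        self.results = []
        self.results_lock = threading.Lock()

    def submit(self, worker_id, task):
        with self.locks[worker_id]:
            self.deques[worker_id].append(task)

    def _next_task(self, wid):
        # Try the worker's own deque first (LIFO keeps cache-hot work local)...
        with self.locks[wid]:
            if self.deques[wid]:
                return self.deques[wid].pop()
        # ...then steal the oldest task from a random victim (FIFO).
        for victim in random.sample(range(self.n), self.n):
            with self.locks[victim]:
                if self.deques[victim]:
                    return self.deques[victim].popleft()
        return None  # no work anywhere

    def run(self):
        def worker(wid):
            while True:
                task = self._next_task(wid)
                if task is None:
                    return  # tasks here spawn no children, so idle means done
                result = task()
                with self.results_lock:
                    self.results.append(result)
        threads = [threading.Thread(target=worker, args=(i,))
                   for i in range(self.n)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

pool = WorkStealingPool(4)
for i in range(100):
    pool.submit(0, lambda i=i: i * i)  # all tasks start on worker 0
pool.run()
print(sum(pool.results))  # 0^2 + 1^2 + ... + 99^2 = 328350
```

Because every task initially lands on worker 0, the other three workers make progress only by stealing, which is exactly the load-balancing mechanism the abstract relies on.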
11:20 | Portability and Performance of Nuclear Reactor Simulations on Many-Core Architectures SPEAKER: Ronald Rahaman ABSTRACT. High-fidelity simulation of a full scale nuclear reactor core is a computational challenge that has yet to be met but is predicted to be achievable on exascale-class supercomputers. Several reactor simulations (such as OpenMC and SimpleMOC) are being developed specifically to run efficiently on exascale machines through established hardware-specific programming models (such as OpenMP and CUDA). Recently developed, hardware-agnostic programming models offer opportunities to express multi-threaded parallelism in a portable fashion and allow a single, more unified code base to run on many divergent high-performance computing architectures. Though the benefits of portability are clear, questions remain as to what practical performance tradeoffs apply to real-world applications. In the present study, we port two existing proxy applications that represent key algorithms in nuclear reactor simulations to the hardware-agnostic language of OCCA. Performance and efficiency of the OCCA ports are compared to the native OpenMP versions on CPU and CUDA versions on GPU architectures. This study attempts to quantify tradeoffs between performance and portability of real-world applications, specifically exascale-class simulations for the nuclear industry, using newer programming models. |
11:45 | A highly scalable Met Office NERC Cloud model SPEAKER: Nick Brown ABSTRACT. The existing Met Office Large Eddy Simulation model, the LEM, was initially developed in the 1980s and whilst the scientific output from the model is cutting edge, the code itself has become outdated. Hardcoded assumptions made about parallelism, which were sensible 20 years ago, are now the source of severe limitations and this means that the model does not scale beyond 512 cores. Whilst scientists wish to use this application to model clouds at a very high resolution (down to 1m or on domains of many 1000s of km) on the latest HPC machines, this is currently impossible due to the inherent limits of the code. As machines become larger, and significantly different from the architectures that a code was initially designed for, it can sometimes be easier to rewrite poorly performing applications than to attempt to modernise them through refactoring. We present here MONC (Met Office NERC Cloud model), a complete rewrite of the LEM, designed to exploit parallelism on the order of hundreds of thousands of cores. This provides the scientific community with a tool for modelling clouds at very high resolutions and/or near real time. The fact that this is a community code, along with the desire to future proof to as great an extent as possible, has heavily influenced the design of the code. We have adopted a “plug in” architecture, where the model is organised as a series of distinctive, independent components which can be selected at runtime. This approach, which we will discuss in detail, not only allows a variety of science to be easily integrated but also supports development targeting different architectures and technologies by simply replacing one component with another. Other innovative aspects of the code will also be presented and these include our approach to data analysis and processing, which is a major feature of the model.
These models analyse their raw data to produce higher level information, for instance the average temperature in a cloud or tracking of how specific clouds move through the atmosphere. The existing LEM, like many codes, performs data analysis inline as part of the model timestep which, along with the I/O operation time, is a major bottleneck. Instead MONC uses the notion of an I/O server, where typically one core per processor is dedicated to handling and analysing the data produced by the model, which is running on the remaining cores. In this manner, MONC can act in a “fire and forget” fashion, asynchronously sending data to the “local” I/O server and continuing on with the next timestep whilst it is being processed. Based upon the innovative approaches adopted, we will present the performance and scalability improvements in MONC compared to the existing LEM. We will discuss some of the lessons learnt, both in terms of large scale parallelism and software engineering techniques, that have become apparent whilst redeveloping this code. |
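The "fire and forget" hand-off to a dedicated I/O server can be sketched in miniature: the model loop pushes raw fields onto a queue and immediately starts its next timestep, while a separate worker reduces the data to higher-level diagnostics. This is a single-process, thread-based analogy for illustration only; MONC's actual I/O server runs on dedicated cores and communicates over MPI, and all names below are hypothetical.

```python
import queue
import threading

def io_server(inbox, summaries):
    """Dedicated 'I/O server': drains raw fields and reduces them to
    higher-level diagnostics while the model keeps timestepping."""
    while True:
        step, field = inbox.get()
        if field is None:  # sentinel: the model has finished
            return
        summaries[step] = sum(field) / len(field)  # e.g. mean temperature

inbox = queue.Queue()
summaries = {}
server = threading.Thread(target=io_server, args=(inbox, summaries))
server.start()

# Model loop: compute a timestep, hand the raw data off asynchronously,
# and move straight on to the next step ("fire and forget").
for step in range(5):
    field = [step + 0.1 * i for i in range(10)]  # stand-in for model data
    inbox.put((step, field))

inbox.put((None, None))  # shut the server down
server.join()
```

The key property is that `inbox.put` returns immediately, so analysis and I/O cost no longer sit inside the model timestep.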
12:10 | Performance Analysis and Optimisation of MetUM on CRAY XC30 SPEAKER: Karthee Sivalingam ABSTRACT. The Met Office Unified Model (MetUM) needs to be ported to different supercomputing platforms to facilitate effective collaboration of climate scientists around the world. We present an update on the ongoing efforts to port and optimise the MetUM on ARCHER. ARCHER is a Cray XC30 supercomputer and can achieve performance on the order of petaflops. The performance of the UM on ARCHER is analysed using the Cray Performance Analysis Tools (CrayPAT) and is compared with other supercomputers such as the Cray XE6 and IBM 775. With significant improvement in network bandwidth, a 10 to 15% speedup is achieved using MPI rank placements and other MPI settings. The UM has 65% thread load imbalance, and thread performance is improved by adding new OpenMP regions using Cray Reveal. We also discuss issues in scaling the UM to petaflop performance and beyond. |
12:35 | Exploring the Memory-Efficient Implementation Model for Incompressible Smoothed Particle Hydrodynamics (ISPH) SPEAKER: Xiaohu Guo ABSTRACT. As we transition from the current petascale era to the future exascale, many factors will force application developers to reformulate their fundamental algorithms and implementation approaches: overall levels of concurrency, the relative cost of FLOP/s compared to data movement, available memory per floating point unit, the depth and complexity of the memory hierarchy, awareness of power costs, and overall resilience characteristics. Building on a previous EPSRC project and the current eCSE project, this paper presents a memory-efficient implementation of incompressible SPH targeting architectures with large numbers of lower-frequency cores and a decreasing memory-to-core ratio. This trend imposes a strong evolutionary pressure on numerical algorithms and software to utilise the available memory and network bandwidth efficiently. Smoothed Particle Hydrodynamics is an efficient numerical method which has been applied to many industrial problems, in areas including coastal, offshore and naval hydrodynamics, multi-phase mixtures such as oil & water, geotechnics, and high-temperature applications in manufacturing. Developing an efficient, scalable parallel SPH toolkit would attract significant interest from industrial companies. In HPC research, SPH is one of the few applications with the potential to scale on future exascale platforms. The portability of SPH and other particle-based methods to heterogeneous architectures has already been demonstrated, owing to their high floating-point intensity and the relative ease of implementing them in a cache-friendly manner.
Because it requires solving a pressure Poisson equation, ISPH [lind] not only inherits the floating-point intensity of SPH but also the memory-bound nature of a large sparse linear system solver. Developing such exascale-oriented software requires careful design of data structures and algorithms in order to achieve strong scalability on large-scale systems. The parallel implementation typically combines data-locality-aware algorithms (e.g. space-filling curves) with unstructured communication mechanisms (e.g. fast distributed hash table searching algorithms). Building on a previous EPSRC project and the current eCSE project [spheric13], this paper presents a memory-efficient implementation of incompressible SPH. We will discuss the use of a preconditioned dynamic vector [Dominguez] for nearest-neighbour list construction, and the trade-off between reducing memory accesses and memory footprint versus increased floating-point operations when applying loop optimisation techniques to the nearest-neighbour list searching kernels and the other SPH kernels. Performance analysis and results with large numbers of particles show promising efficiency at very large core counts. |
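The nearest-neighbour search at the heart of any SPH kernel is usually built on a cell list: particles are hashed into grid cells of side equal to the interaction radius, so each particle only needs to be compared against the particles in its own and adjacent cells. The sketch below illustrates that generic idea; it is not the preconditioned dynamic vector scheme the abstract refers to, and all names are hypothetical.

```python
from collections import defaultdict

def build_neighbour_lists(positions, h):
    """Cell-list neighbour search: hash each particle into a grid cell of
    side h, then compare only against the 27 surrounding cells instead of
    all N particles (O(N) expected cost for roughly uniform densities)."""
    cells = defaultdict(list)
    for i, (x, y, z) in enumerate(positions):
        cells[(int(x // h), int(y // h), int(z // h))].append(i)

    neighbours = {i: [] for i in range(len(positions))}
    for (cx, cy, cz), members in cells.items():
        for i in members:
            xi, yi, zi = positions[i]
            # Scan the 3x3x3 block of cells around particle i's cell.
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    for dz in (-1, 0, 1):
                        for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                            if j == i:
                                continue
                            xj, yj, zj = positions[j]
                            r2 = (xi-xj)**2 + (yi-yj)**2 + (zi-zj)**2
                            if r2 < h * h:  # within the interaction radius
                                neighbours[i].append(j)
    return neighbours

pts = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (5.0, 5.0, 5.0)]
nl = build_neighbour_lists(pts, h=1.0)
print(nl[0], nl[2])  # particle 0 sees particle 1; particle 2 is isolated
```

The memory-versus-FLOP trade-off the abstract mentions shows up directly here: one can store the full `neighbours` lists (more memory traffic, fewer repeated distance computations) or recompute candidate pairs from the cell hash on each kernel sweep.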
14:00 | HPC in the private sector SPEAKER: Cynthia McIntyre ABSTRACT. Cynthia McIntyre, Senior Vice President, Council on Competitiveness, will focus on the value of HPC engagement with industry. |
15:30 | Zacros Software Package Development: Pushing the Frontiers of Kinetic Monte Carlo Simulation in Catalysis SPEAKER: Jens Hedegaard Nielsen ABSTRACT. We present work done in the embedded CSE project “Zacros Software Package Development: Pushing the Frontiers of Kinetic Monte Carlo Simulation in Catalysis”. Zacros (http://zacros.org) is a graph-theoretical kinetic Monte Carlo (KMC) code for simulating chemical reactions on catalytic surfaces. During the course of the simulation, the application stores all the possible chemical reaction steps that can happen for the current lattice configuration. The simulation proceeds by executing one of these steps at a time. Event selection is random and based on the events’ kinetic rate constants, also known as propensities. Events with higher propensity are executed more frequently during the course of the simulation. |
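Propensity-weighted event selection, as described above, is the standard KMC step: draw a uniform random number scaled by the total rate and find which event's cumulative rate it falls into. A minimal generic sketch follows (it illustrates the textbook method, not Zacros's internal implementation, which maintains the propensity list incrementally for efficiency):

```python
import bisect
import random

def select_event(propensities, rng=random):
    """Pick one event index with probability proportional to its kinetic
    rate constant (propensity), via inverse transform sampling on the
    cumulative rate list."""
    cumulative = []
    total = 0.0
    for k in propensities:
        total += k
        cumulative.append(total)
    r = rng.random() * total          # uniform in [0, total)
    return bisect.bisect_right(cumulative, r)

# Events with rates 1, 10 and 100: the fastest event dominates the trajectory.
rates = [1.0, 10.0, 100.0]
counts = [0, 0, 0]
rng = random.Random(42)               # fixed seed for reproducibility
for _ in range(10000):
    counts[select_event(rates, rng)] += 1
print(counts)  # roughly in the ratio 1 : 10 : 100
```

With a sorted cumulative list the lookup is O(log M) per event for M possible reactions, which matters when the lattice admits many thousands of candidate steps.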
15:55 | Swift: task-based hydrodynamics at Durham’s IPCC SPEAKER: Tom Theuns ABSTRACT. Simulations of the evolution of structures in the Universe have become increasingly successful at reproducing observations of the cosmos. Such calculations use major HPC resources on national and international supercomputers, and play a crucial role in interpreting observations as well as planning new observatories. The huge dynamic range of such calculations severely limits good strong scaling behaviour of the community codes in use, mostly due to poor load balancing, especially on newer shared-memory massively parallel architectures, i.e. multicores. This limits the science return from such calculations. The rapid increase in parallelism of current and future architectures, combined with the promise of accelerators, is likely to increasingly exacerbate these limitations. In this talk I will present results from a new cosmological hydrodynamical code called Swift, which aims to overcome these limitations. Swift uses task-based parallelism designed for many-core compute nodes interacting over MPI using asynchronous communication, i.e. hybrid shared/distributed-memory parallelism. A graph-based domain decomposition schedules interdependent tasks over available resources, using the QuickSched library. Strong scaling tests on realistic particle distributions yield excellent parallel efficiency (60 per cent parallel efficiency for a run with a 1000-fold increase in core count), and efficient cache usage provides a large speedup compared to current codes even on a single core. Future work will concentrate on developing a self-tuning strategy to adapt to hardware specific parameters (e.g. advanced vector extensions and scatter-gather intrinsics), as well as extensions to accelerators. The techniques and algorithms used in Swift are likely to benefit other computational physics areas as well, for example that of compressible hydrodynamics. 
For details of this open-source project, see www.swiftsim.com |
15:30 | Exascale Computing for Everyone: Cloud-based, Distributed and Heterogeneous SPEAKER: Gordon Inggs ABSTRACT. We argue that widespread adoption of exascale computing will be in the form of distributed computing resources provided on a utility basis, i.e. by a Cloud Computing vendor. Furthermore, these resources will be heterogeneous, comprised of not only Von Neumann Machine CPUs, but also massively parallel compute devices such as Graphics Processing Units (GPUs) and accelerators implemented using reconfigurable computing devices, i.e. large Field Programmable Gate Arrays (FPGAs). A further consideration is the growing range of hybrid devices that incorporate these elements within a single device, such as Intel’s Xeon Phi and Xilinx’s Zynq system-on-chips. |
15:55 | Paradigm Shift for EXASCALE Computing SPEAKER: Paraskevas Evripidou ABSTRACT. HPC/Supercomputers target large problems that have a high degree of parallelism. Programming of such machines is mainly done through parallel extensions of the sequential model such as MPI and OpenMP. These extensions do facilitate high-productivity parallel programming, but also suffer from the limitations of sequential synchronization and their inability to tolerate long latencies. Arvind and Iannucci, Data-Flow proponents, have been warning us since the 1980s about the two fundamental issues in multiprocessing: “long memory latencies and waits due to synchronization events”. Peter Kogge, the leader of the DARPA/USA study group for exascale computing, confirmed that the communication and synchronization latency of the sequential model is getting out of hand for HPC/Exascale machines. Furthermore, Kogge states that the power consumption of an exascale computer will be around 500 MW. Michael Flynn stated in his keynote speech at PFL 2012 that “We have multi-threaded, superscalar cores with limited ILP; worse yet, most of the die area (80%) is devoted to two or three levels of cache to support the illusion of sequential model”. Hence a paradigm shift for exascale computing is necessary. Data-Flow is a programming/execution model that provides tolerance to communication and synchronization latencies. Data-Flow has been proposed by a number of researchers as an alternative to the control flow model. Data Driven Multithreading (DDM) is a dynamic threaded Data-Flow model that schedules threads based on data availability on sequential processors. A DDM program consists of a number of threads that have producer-consumer relationships. Data dependencies between threads are uncovered at compile time, while their scheduling is done dynamically at runtime by the Thread Scheduling Units (TSU).
Threads are scheduled for execution using Data-Flow semantics, while the instructions within threads are executed in a control flow manner. Evaluation results of DDM on a variety of platforms showed that DDM can indeed tolerate synchronization and communication latency. When comparing DDM with OpenMP, it was found that DDM performed better for all benchmarks used. This is primarily due to the fact that DDM effectively tolerates latency. In the case of benchmarks with low thread granularities, DDM also had the advantage of its inherently low parallelization and thread-switching overheads. Similar results were obtained when comparing DDM implemented on a cluster of four Cell processors with CellSs and Sequoia. Our work on CacheFlow, the memory hierarchy system for DDM, showed that the TSU is aware of the threads scheduled for execution in the near future, and hence of the data that will be needed in the near future. This enables the implementation of optimized cache placement and replacement policies, resulting in the need for smaller caches, or the replacement of the cache with a small scratch-pad memory. Furthermore, taking advantage of the near-future memory references results in more efficient cache replacement policies that reduce bus traffic. Data-Flow provides tolerance to communication and synchronization latency, and has the potential of making the processor smaller and more power efficient. Thus, it makes sense to consider a shift to the Data-Flow paradigm now. |
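The DDM firing rule described above (a thread becomes runnable only once all of its producers have delivered their data, with a TSU decrementing per-thread dependency counts) can be sketched as follows. This is an illustrative sequential simulation of the scheduling logic only, with hypothetical names; a real TSU runs concurrently with the processors and in some DDM systems is implemented in hardware.

```python
from collections import deque

def ddm_schedule(threads, consumers):
    """Sketch of Data-Driven Multithreading scheduling: each thread fires
    only when its dependency count reaches zero; completing a thread
    decrements the counts of its consumers (the TSU's bookkeeping).
    `threads` maps name -> (dependency_count, body);
    `consumers` maps name -> names of threads consuming its output."""
    counts = {name: cnt for name, (cnt, _) in threads.items()}
    ready = deque(name for name, cnt in counts.items() if cnt == 0)
    order = []
    while ready:
        name = ready.popleft()
        threads[name][1]()           # run the thread body (control flow)
        order.append(name)
        for c in consumers.get(name, []):
            counts[c] -= 1
            if counts[c] == 0:       # all inputs produced: ready to fire
                ready.append(c)
    return order

log = []
threads = {
    "load_a":  (0, lambda: log.append("a")),
    "load_b":  (0, lambda: log.append("b")),
    "combine": (2, lambda: log.append("a+b")),  # waits for both producers
}
consumers = {"load_a": ["combine"], "load_b": ["combine"]}
order = ddm_schedule(threads, consumers)
print(order)  # ['load_a', 'load_b', 'combine']
```

Note there are no barriers: `combine` fires the moment its own two inputs arrive, regardless of any other work in flight, which is the source of Data-Flow's latency tolerance.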
16:20 | Lightning Poster Talks (sponsored by ICEOTOPE) SPEAKER: Multiple Presenters ABSTRACT. 16:20 Peter Hopton (ICEOTOPE) - Poster session welcome talk 16:30 Multiple presenters - Lightning poster presentations |