EASC2015: EXASCALE APPLICATIONS AND SOFTWARE CONFERENCE 2015
PROGRAM FOR THURSDAY, APRIL 23RD

09:30-10:30 Session 15: Keynote
Location: Pentland
09:30
Managing Tianhe-2, the world's fastest supercomputer
SPEAKER: Xue-Feng Yuan

ABSTRACT. Xue-Feng Yuan, Director of the National Supercomputer Centre in Guangzhou, will describe the experience of managing Tianhe-2, the world's largest supercomputer.

10:30-10:50 Coffee Break
10:50-12:30 Session 16A: Contributed Talks
Location: Pentland
10:50
Multiscale modelling: a paradigm case for the exascale
SPEAKER: James Suter

ABSTRACT. One future approach for the exascale is to use multiscale simulation, where many submodels at the petascale are combined. However, the concept of having multiple models form a single scientific simulation, with each model operating on its own spatiotemporal scale, gives rise to a range of new challenges. Multiscale simulations often require interdisciplinary scientific expertise to construct, as well as new advances in reliable and accurate model coupling methods, and new approaches to multi-model validation and error assessment. We have found that different communities adopt very different approaches to constructing multiscale simulations, and communities may receive additional benefit from sharing methods and tools [1]. In the context of simulations of clay-polymer nanocomposites, we have developed several tools to enable our multiscale simulations.
Within the MAPPER project (http://www.mapper-project.eu), we have developed a range of tools that allow us to create, deploy and run a variety of multiscale simulations based on the xMML specification of a multiscale model. These have been prototyped across a range of simulation communities.
Two of the main challenges in realizing successful multiscale simulations are to establish a fast and flexible coupling between submodels and to deploy and execute the implementations of these submodels efficiently on computer resources of any size. Our approach relies on coupling existing submodels and supports the use of resources ranging from a local laptop to large international supercomputing infrastructures, and distributed combinations of these. This is achieved by using coupling tools developed within MAPPER, such as the Multiscale Coupling Library and Environment (MUSCLE) [1,2], FabMD [3] and GridSpace [1].
Here we will present some of our findings from modelling chemically specific combinations of clay and polymers, where we use multiscale methods and tools that take us from a parameter-free quantum description to predictions of the materials properties of these nanocomposites [4]. The majority of our multiscale simulations take place at the coarse-grained level, which scales well at the petascale. Until now, the only available approach to designing such materials has been experimental, largely using trial and error. Our multiscale approach provides us with predictions of the melt intercalation behaviour of montmorillonite clay – polyvinyl-alcohol and montmorillonite clay – polyethylene-glycol systems. Many hitherto unobserved phenomena heave into view as a result of this study. Inter alia, we observe the dynamical process of polymer intercalation into clay tactoids and the ensuing aggregation of polymer-entangled tactoids into larger structures, obtaining various characteristics of these nanocomposites, including clay-layer spacings, out-of-plane clay sheet bending energies, X-ray diffractograms and materials properties. Our approach opens up a route to computing the properties of complex soft materials based on knowledge of their chemical composition, molecular structure and processing conditions.

[1] D. Groen, S. J. Zasada, P. V. Coveney, Comput. Sci. Eng., 2012, 99, 1.
[2] J. Borgdorff et al., Proc. Comp. Sci., 2012, 9, 596.
[3] http://www.github.com/djgroen/FabSim
[4] J. L. Suter, D. Groen, P.V. Coveney, Advanced Materials (2014), Available Online, DOI: 10.1002/adma.201403361

11:15
Support Operator Technique for 3D Simulations of Dissipative Processes at High Performance Computers

ABSTRACT. Problems related to transient high-temperature gas/plasma flows often reveal an essential dependence of the solution on various dissipative effects. Predictive simulation in this area of CFD is supported by finite difference (or finite volume) methods which satisfy common requirements such as conservativity, monotonicity, stability over a wide range of flow parameters, high resolution, etc. There are further requirements, some of which arise from multiscale modeling in domains with complex shapes. An important property of any CFD technique is the possibility of implementing it on general computational meshes (block-structured, unstructured, mortar, etc.) while the general properties of the original differential operators persist in their difference analogues. Traditionally the theory of difference approximations deals with properties such as positivity/non-positivity, self-adjointness of difference operators and other basic mathematical properties. The support operators method is remarkable in that it allows approximations to differential operators to be built on general meshes while the resulting difference operators preserve not only the basic properties mentioned above but additionally yield rotationally invariant difference schemes. It is important to pay special attention to rotational invariance when working with systems describing deformations and dissipation in gas or liquid media, e.g. the Navier-Stokes equations. Differencing via the support operator method guarantees that this property is preserved, since it is inherent to the original differential operators. The method allows robust numerical procedures to be built for multiscale simulations requiring very finely discretized computational domains (billions of grid cells) – precisely the case that requires the use of high performance computing.

The support operator technique has been developed for meshes formed from hexahedral, tetrahedral and prismatic cells and their various combinations. The corresponding numerical algorithms are incorporated into the scientific CFD code MARPLE3D (Keldysh Institute of Applied Mathematics - KIAM). MARPLE3D is an object-oriented, parallel code designed for scientific simulations on systems performing distributed computations. It is based primarily on a radiative-magnetohydrodynamic model taking into account modern knowledge about high-temperature plasmas. MARPLE3D includes numerical tools able to perform large-scale 3D simulations. The code's versatility enables its application to diverse CFD problems in regions with complex shapes. As numerical experiments show, MARPLE3D ensures quite good scalability when the calculations are performed using a large number of processors.

On the way towards "exaflops" systems it is necessary to work further on rational decomposition methods that remain effective for extremely large computational meshes. We believe that the GRIDSPIDERPAR application software, which is under development at KIAM, is promising in this respect. The paper deals with the support operator method of constructing difference approximations on 3D unstructured grids. The data models and features of the parallel implementation are discussed. Examples of physical problems solved via HPC are presented.
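For readers unfamiliar with the approach, the defining mimetic property of support operator discretisations can be summarised as follows. This is a textbook-level sketch, with boundary terms assumed to vanish for brevity, rather than a detail of the MARPLE3D implementation:

% Support (mimetic) operator idea: choose a "prime" discrete operator, e.g. the
% divergence DIV_h, and construct the conjugate GRAD_h so that the discrete
% analogue of integration by parts holds exactly in the chosen mesh inner products:
\begin{aligned}
  (\operatorname{DIV}_h \mathbf{u},\, p)_{\Omega_h} + (\mathbf{u},\, \operatorname{GRAD}_h p)_{\Omega_h} &= 0,
  &&\text{i.e.}\quad \operatorname{GRAD}_h = -\operatorname{DIV}_h^{*},\\
  \bigl(-\operatorname{DIV}_h(k\,\operatorname{GRAD}_h p),\, p\bigr)_{\Omega_h} &\ge 0
  &&\text{for } k > 0,
\end{aligned}
% so self-adjointness and sign-definiteness of the continuum operators carry
% over to the difference operators on arbitrary unstructured meshes.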

11:40
Large-scale Ultrasound Simulations Using the Hybrid OpenMP/MPI Decomposition
SPEAKER: Jiri Jaros

ABSTRACT. The simulation of ultrasound wave propagation through biological tissue has a wide range of practical applications. Recently, high intensity focused ultrasound has been applied to functional neurosurgery as an alternative, non-invasive treatment of various brain disorders such as brain tumours, cerebral haemorrhage, essential tremor, and Parkinson’s disease. The technique works by sending a focused beam of ultrasound into the tissue, typically using a large transducer. At the focus, the acoustic energy is sufficient to cause cell death in a localised region while the surrounding tissue is left unharmed. The major challenge is to ensure the focus is accurately placed at the desired target within the brain, because the skull can significantly distort the beam. Accurate ultrasound simulations, however, require the simulation code to be able to exploit several thousands of processor cores and work with datasets on the order of tens of TB.

We have recently developed a pure-MPI pseudo-spectral simulation code using a 1D domain decomposition that has allowed us to run decent simulations using up to 2048 compute cores. However, this implementation suffers from the maximum parallelization being limited by the largest size along an axis of the 3D grid used. As we approach the era of exascale systems, more and more machines have numbers of processing cores far exceeding this limit. For example, cutting-edge ultrasound simulations performed by the k-Wave toolbox use 2048^3 grids and so, with the 1D pure-MPI decomposition, would scale to at most 2048 cores, leading to calculation times exceeding the clinically acceptable limit of 24 hours (here between 50 and 100 hours).

This paper presents an improved implementation that exploits a 2D hybrid OpenMP/MPI decomposition. The 3D grid is first decomposed by MPI processes into slabs. The slabs are further partitioned into pencils assigned to threads on demand. This allows us to employ 16 to 32 times more compute cores than the pure-MPI code while reducing the amount of communication among processes due to efficient use of shared memory within compute nodes. The proposed technique was tested on the Fermi supercomputer (CINECA, Italy) with up to 16,384 compute cores and a domain size of 1024^3 grid points, where the original code was able to use only 1024 compute cores. The speed-up reached by the proposed code saturated at a factor of 9, which reflects the nature of the method, being highly dependent on 3D fast Fourier transforms. This result has a huge practical impact on many ultrasound simulations. For the k-Wave project, deploying the hybrid decomposition has the potential to decrease the simulation time by a factor of 9, bringing the simulation time within the clinically meaningful timespan of 24 hours and allowing patient-specific treatment plans to be created.
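As a rough illustration of the decomposition described above, the hedged C sketch below shows MPI ranks owning z-slabs of the grid while OpenMP threads pick up individual pencils of the local slab on demand; the grid sizes, names and placeholder kernel are illustrative and not taken from the k-Wave code.

/* Hedged sketch of a 2D hybrid OpenMP/MPI decomposition of an Nx*Ny*Nz grid.
 * Assumption: Nz is divisible by the number of MPI ranks; the pencil loop is
 * scheduled dynamically so that threads grab pencils "on demand".            */
#include <mpi.h>
#include <stdlib.h>

static void process_pencil(float *pencil, int nx) {
    for (int i = 0; i < nx; ++i) pencil[i] *= 0.5f;   /* placeholder work */
}

int main(int argc, char **argv) {
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int Nx = 1024, Ny = 1024, Nz = 1024;
    const int slab_nz = Nz / nranks;               /* MPI level: 1D slabs in z */
    float *slab = calloc((size_t)slab_nz * Ny * Nx, sizeof(float));

    /* OpenMP level: threads take y-pencils of the local slab on demand. */
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int k = 0; k < slab_nz; ++k)
        for (int j = 0; j < Ny; ++j)
            process_pencil(&slab[((size_t)k * Ny + j) * Nx], Nx);

    /* Global transposes for the 3D FFTs would use MPI_Alltoall here (omitted). */
    free(slab);
    MPI_Finalize();
    return 0;
}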

12:05
Multiscale ensemble simulation for vascular blood flow: a voyage towards the exascale
SPEAKER: Ulf Schiller

ABSTRACT. Cerebrovascular diseases such as brain aneurysms are a primary cause of adult disability. The flow dynamics in brain arteries, both during periods of rest and increased activity, are known to be a major factor in the risk of aneurysm formation and rupture, although the precise relation is still an open field of investigation. We present an automated ensemble simulation method for modelling cerebrovascular blood flow under a range of flow regimes. By automatically constructing and performing an ensemble of multiscale simulations, with a 1D solver unidirectionally coupled to a 3D lattice-Boltzmann simulation code, we are able to model the blood flow in a patient artery over a range of flow regimes. In addition, this approach allows us to use a very large number of cores to obtain more accurate and robust scientific results from our simulation. We apply the method to a model of a middle cerebral artery, which we model using 10,000s of cores in total, and find that this exercise helps us to fine-tune our modelling techniques and opens up new ways to investigate cerebrovascular flow properties. Building on our recent investigations into modelling the Circle of Willis, our approach can be trivially extended to use 100,000s of cores efficiently in this ensemble setup. In addition, our previously reported improvements in weighted domain decomposition allow us to introduce highly sophisticated boundary conditions with a very limited computational overhead.

10:50-12:30 Session 16B: Contributed Talks
Chair: Mark Bull
Location: Prestonfield
10:50
Enabling Adaptive, Fault-tolerant MPI Applications with Dynamic Resource Allocation

ABSTRACT. Recent HPC systems build their computational capabilities and performance on an extreme number of processing elements - multicore nodes, hyperthreaded cores, accelerated nodes or others. This requires applications to be extremely parallel and scalable in order to run efficiently on these systems and benefit from their peak performance. On the other hand, in real usage several applications share one large system at the same time. This is natural, since such systems are expensive and consume a lot of resources even when idle. While applications are parallelized using either classic approaches such as MPI and PGAS runtimes or novel models for large data processing, fair and optimal sharing of the computational resources is provided by a resource manager (queuing system).

One of the common constraints for programmers in the described shared computing environment is static resource allocation. Although MPI and other parallel programming models have constructs that allow dynamic process creation and management, this is not easily implementable in a shared multi-task system. Newly created processes must use already allocated resources, which leads either to a waste of resources that need to be preallocated or to process oversubscription, which is likely to result in performance degradation. Moreover, fault tolerance of parallel applications is becoming a real need. Common scenarios for process fault recovery assume a repair stage, which implies dynamic creation of new processes.

In this work a simple model for complete dynamic multi-process execution is proposed. The approach uses existing methods for resource allocation resizing (dynamic job allocation) in the SLURM environment. SLURM implements the PMI interim process management layer, a quasi-standard for process management. This allows communication between MPI methods from the user application and the job management layers handled by the SLURM system. The proposed approach is based on the MPI process spawning model, which only allows a blocking mode of process creation. Since resource allocation is usually not immediate in a system with many users and jobs, a non-blocking mode is more practical. A basic implementation which extends the MPI standard is proposed. In our approach we allocate new resources and create new processes that enable an MPI application to utilize these resources dynamically within one SLURM job. This allows dynamic multi-process applications in a shared environment, which in turn gives more opportunities for better resource utilization and allows programmers to create more flexible applications that may allocate resources dynamically and exclusively, which is usually required to achieve full performance and is essential when addressing fault recovery.
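The blocking spawning model that the proposal starts from can be illustrated with standard MPI-2 calls. The SLURM allocation resizing and the proposed non-blocking extension are not shown, so this is only a hedged baseline sketch; the "./worker" binary and the process count are illustrative.

/* Baseline sketch: blocking dynamic process creation with standard MPI-2.
 * The paper's contribution (growing the SLURM allocation and a non-blocking
 * spawn) extends this pattern and is not reproduced here.                    */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Comm intercomm;
    int errcodes[4];
    /* Blocks until 4 new worker processes exist; with a static allocation
     * they must fit into resources that were reserved up front.              */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &intercomm, errcodes);

    /* Parent and children communicate through the intercommunicator. */
    int token = 42;
    MPI_Bcast(&token, 1, MPI_INT,
              rank == 0 ? MPI_ROOT : MPI_PROC_NULL, intercomm);

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}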

11:15
Simulation-based instruction-level statistics for optimizing software on future architectures
SPEAKER: Wim Heirman

ABSTRACT. In this paper we address the problem of optimizing applications for future hardware platforms. Performance projections of future systems are crucial for both software developers and CPU architects. Developers of applications, runtime libraries and compilers need predictions for tuning their software before the actual systems are available, and architects need them for architecture exploration and design optimization. Simulation is one of the most commonly used methods for performance prediction, and developing detailed simulators constitutes a major part of processor design. However, most detailed simulators, while very accurate, are too slow to simulate meaningful parts of applications running on many-core systems with large caches. But by trading off accuracy for the ability to run larger parts of the application, approximate simulators can play a significant role in software tuning and architectural exploration.

Sniper is an x86 many-core simulator that has acceptable accuracy (around 20% average absolute error compared to Nehalem hardware) and high simulation speed (around 1 MIPS). To increase its utility to software developers, we extend Sniper to collect hardware events and timing effects at a per-instruction granularity. Comparable statistics can be obtained on existing systems using hardware performance counters; however, these suffer from a number of drawbacks: many hardware counters have inaccuracies such as double-counting under certain conditions, skidding (meaning that events are not always associated with the correct instruction), sampling errors (instruction pointers are typically only sampled when a counter overflows), or a lack of insight into how hardware events such as cache misses contribute to execution time (which is non-trivial in the context of out-of-order processors that can overlap many types of stalls with useful computation).

In contrast, Sniper’s instruction-level statistics are based on the concept of cycle stacks. These break down the total execution time of the application and assign each clock cycle of execution to the hardware component that was on the critical path of execution. Removing a stall that is visible on the cycle stack is therefore guaranteed to lead to increased performance. We extend Sniper’s existing method of cycle stack collection to keep track of the address of each instruction that causes an increment of the cycle stack, converting the stack into a matrix of cycle stack components and instruction pointer elements. We then combine this information with the application’s disassembly and (through debug symbols) its source code to construct cycle stack contributions at instruction or source-line granularity. This gives application developers actionable information (cycle stack contributions have a direct correlation with obtainable performance improvements) that is moreover more accurate than what is provided by hardware performance counters (no sampling errors or skidding, as the simulator can fully track the performance effects of each individual instruction), which allows them to start today with optimizing their software for the architectures of tomorrow.
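Conceptually, the extension turns a one-dimensional cycle stack into a two-dimensional accumulator indexed by (component, instruction address). The C sketch below illustrates that idea only; it is not Sniper's internal code, and the component names and bucketing scheme are assumptions.

/* Simplified sketch of a per-instruction cycle-stack matrix: rows are
 * cycle-stack components, columns are instruction addresses (bucketed).
 * Illustration of the concept, not Sniper's actual data structures.        */
#include <stdint.h>

enum CycleComponent { COMP_BASE, COMP_BRANCH, COMP_ICACHE, COMP_DCACHE,
                      COMP_DRAM, COMP_COUNT };

#define IP_BUCKETS 4096            /* hash instruction pointers into buckets */

typedef struct {
    uint64_t cycles[COMP_COUNT][IP_BUCKETS];
} CycleMatrix;

static inline unsigned ip_bucket(uint64_t ip) {
    return (unsigned)((ip >> 2) % IP_BUCKETS);
}

/* Called whenever the timing model attributes 'n' cycles on the critical
 * path to component 'c' for the instruction at address 'ip'.               */
static inline void account(CycleMatrix *m, enum CycleComponent c,
                           uint64_t ip, uint64_t n) {
    m->cycles[c][ip_bucket(ip)] += n;
}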

11:40
Parallelisation of an Explicit MOT-TDVIE Solver for Transient Electromagnetics Simulations Using Hybrid MPI/OpenMP on Many-core Xeon Phi Coprocessors

ABSTRACT. Hardware accelerators, including multi- and many-core processing technologies, are becoming ubiquitous in scientific computing, as their ability to achieve considerable speedup over conventional CPUs has been demonstrated in many computational science and engineering applications. Further, accelerators are increasing in popularity as the preferred choice in many emerging HPC centers, as they can provide cost effectiveness, power efficiency and physical density. Nevertheless, one of the limiting factors to the widespread use of accelerators is the relatively expensive porting process required when using low-level programming models; for example, CUDA and OpenCL for GPUs. To this end, recent development efforts have focused on the provision of new high-level directive-based programming models, such as OpenACC for NVIDIA GPUs, as well as extending the scope of existing standards; for example, OpenMP 4.0 for porting applications onto Intel’s many-core coprocessors (Xeon Phi).
In this work, we report on our recent research and development efforts in parallelising and accelerating a fully explicit marching-on-in-time (MOT)-based time domain volume integral equation (TDVIE) solver [1]. The MOT-TDVIE solver is used for analysing transient electromagnetic wave interaction with inhomogeneous dielectric objects, which has various scientific applications in photonics, optoelectronics, and bio-electromagnetics. Crucially, in the simulation of electrically large structures, where millions of degrees of freedom are required for their solution, efficient computations with the MOT-TDVIE solver rely on large-scale parallelism [2]. Further, for the hardware-based acceleration of the solver, we have recently demonstrated OpenACC to be an efficient mechanism for its execution on GPUs, achieving significant performance improvements with up to 30X and 11X speedups relative to the sequential and parallel CPU codes, respectively. Moreover, we have demonstrated that the GPU-accelerated MOT-TDVIE solver can deliver energy consumption gains of the order of 3X relative to its multi-threaded CPU version [3].
To exploit the computational capability of Intel’s Xeon Phi coprocessor, we present in this work a new parallelisation scheme for the MOT-TDVIE solver, carefully fine-tuned for efficient execution on this emerging many-core architecture. The parallel solver is executed on the Xeon Phi using OpenMP and hybrid MPI/OpenMP for the shared and distributed memory environments, respectively. This high-level OpenMP approach facilitates porting the code, originally parallelised for the CPU, onto the Xeon Phi with minimal re-coding effort. Specifically, by exploiting the single instruction multiple data (SIMD) capabilities of the new OpenMP 4.0 standard, we demonstrate coprocessor execution times faster than conventional CPUs by factors of up to 6.26X and 2.32X relative to single-threaded and multi-threaded CPU performance, respectively. Details of the code parallelisation and performance analysis, comparing the computational efficiency of the MOT-TDVIE solver’s implementation on the Xeon Phi to that on the CPU and GPU, will be presented at the conference.
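As a hedged illustration of the kind of OpenMP 4.0 SIMD parallelisation referred to above (not the actual MOT-TDVIE kernels, whose loop structure is more involved), a simple threaded and vectorised update might look like this in C:

/* Hedged sketch: a time-marching update threaded across cores and vectorised
 * across unknowns with OpenMP 4.0, as used in native Xeon Phi execution.
 * Array names and the trivial kernel are illustrative assumptions.           */
#include <stddef.h>

void mot_update(float *restrict field_new, const float *restrict field_old,
                const float *restrict kernel, size_t n)
{
    /* Each thread's chunk is further vectorised on the 512-bit SIMD units. */
    #pragma omp parallel for simd aligned(field_new, field_old, kernel : 64) \
                                  schedule(static)
    for (size_t i = 0; i < n; ++i)
        field_new[i] = field_old[i] + kernel[i] * field_old[i];
}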


References:
[1] A. Al-Jarro et al., IEEE Trans. Antennas Propag., vol. 60, no. 11, pp. 5203-5214, 2012.

[2] A. Al-Jarro et al., 28th Review of Progress in ACES, Ohio, USA, April 10-14, 2012.

[3] S. Feki et al., IEEE Antennas Propag. Mag., vol. 36, no. 2, pp. 256-277, 2014.

12:05
Performance optimisation on Xeon Phi

ABSTRACT. The flops-per-watt performance potentially available through Intel’s Xeon Phi coprocessor makes it very attractive for computational simulation. With its full x86 instruction set, cache-coherent architecture, and support for MPI and OpenMP parallelisations, it is generally straightforward to port applications to the platform. However, the low clock speed of the cores, the in-order instruction restrictions, the extra-wide vector units, and the low memory per core make it difficult to obtain good performance for a lot of codes. EPCC has been working with Intel, through our IPCC, to optimise a range of large-scale simulation codes on Xeon Phi processors. In our talk we will describe the work undertaken; the performance achieved; and discuss the impact of the optimisations on the source code.

We have been working on three simulation codes, GS2, COSA, and CP2K. All are large, FORTRAN-based, MPI-parallelised simulation codes that are widely used by simulation scientists in the UK, Europe, and around the world. They are also mature codes, which have a range of different simulation functionality incorporated within them. For instance, CP2K is a molecular dynamics code that can undertake classical molecular mechanics, density functional calculations, quantum mechanics, molecular dynamics, and much more. COSA is a CFD system that implements both frequency-domain and time-domain Navier-Stokes solvers for different simulation categories. GS2 is a plasma simulation code that is used for full collisional, nonlinear tokamak plasma simulations, for simple linear space plasma simulations, and as the core component of a multiscale gyrokinetic transport code (TRINITY). Whilst optimising a code that implements a single operation or type of simulation may be straightforward for hardware such as the Xeon Phi, optimising codes that have many different functional features is more challenging.

We have, initially, worked on optimising vectorisation of these codes on standard processors and improving the performance of the MPI parallelisations as both of these activities will improve both the standard performance of the applications and the performance on Xeon Phi. We then moved on to optimising specifically for Xeon Phi, including implementing or improving hybrid parallelisations for each code, targeting the AVX-512 instructions on the Xeon Phi, and re-structuring some of the complex data structures used by the codes to improve cache and vector performance.
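The data-structure restructuring mentioned above typically amounts to converting arrays of structures into structures of arrays so that the compiler can generate unit-stride vector accesses. The C sketch below is a generic, hedged illustration of that transformation and is not taken from GS2, COSA or CP2K:

/* Generic illustration of array-of-structures (AoS) to structure-of-arrays
 * (SoA) restructuring, the kind of change that helps the compiler generate
 * unit-stride vector loads on the Xeon Phi.                                  */
#include <stddef.h>

/* AoS layout: x, y, z of one element are adjacent, so a loop over x values
 * strides through memory and vectorises poorly.                              */
typedef struct { double x, y, z; } ParticleAoS;

/* SoA layout: all x values are contiguous, giving unit-stride, alignable
 * accesses that vectorise well.                                              */
typedef struct { double *x, *y, *z; } ParticlesSoA;

void push_aos(ParticleAoS *p, size_t n, double dt, const double *vx) {
    for (size_t i = 0; i < n; ++i) p[i].x += dt * vx[i];   /* stride-3 loads */
}

void push_soa(ParticlesSoA *p, size_t n, double dt, const double *vx) {
    #pragma omp simd
    for (size_t i = 0; i < n; ++i) p->x[i] += dt * vx[i];  /* unit-stride    */
}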

We will present results for these applications run on single and multiple Xeon Phis and compare them with benchmarks on standard hardware. We will demonstrate the comparative performance and where the Xeon Phi can benefit the applications.

12:30-13:30 Lunch Break
13:30-14:45 Session 17A: Contributed Talks
Location: Pentland
13:30
Automatic Code Orchestration from Descriptive Implementations
SPEAKER: Brian Vinter

ABSTRACT. While achieving exascale computing is in itself a huge technical task, bringing scientific users to a competence level where they can utilize an exascale machine is likely to pose problems of the same scale. This talk will describe ongoing work on enabling users to specify their algorithms descriptively, but still within existing programming languages: C++, .NET, or Python, for instance. The advantages of descriptive programming are, first, that the programmer is less likely to make programming errors, since traversing arrays and lists is expressed at an abstraction level that excludes the use of index variables and links, and, more importantly, that since the descriptive expression specifies what to do rather than how to do it, the runtime environment has a higher degree of freedom to produce a solution that is efficient on the available hardware.
By providing a descriptive model of the algorithm, typically provided as a vectorized operation, such as:
a[1:-1,1:-1] = (a[:-2,1:-1] + a[2:,1:-1] + a[1:-1,:-2] + a[1:-1,2:])/4,
for a four point stencil operation, the runtime environment can efficiently decide on a data-layout and contract multiple array operations into a single loop for improved efficiency. The array contraction also allows optimizations to reduce temporary arrays into temporary scalars for a smaller memory footprint. The contracted array operations can then be laid out in a manner that fits the underlying hardware, conventional multicores, GPGPUs, or distributed memory architectures, such as clusters, and communication between components can be included in the cost for a static execution schedule.
We will show how the same, unmodified, Python implementation of a Jacobi solver, a Black-Scholes pricing, an n^2-complexity NBody simulation, and a Shallow-Water simulation scales to a 32-core machine with speedups of 50.1, 29.8, 17.3, and 44.4 compared to the NumPy execution, while the same Python benchmarks run on an NVIDIA GTX 680 achieve speedups of 55.7, 43.0, 77.1, and 140.2, and an eight-node cluster with gigabit Ethernet interconnect (256 cores in total) obtains speedups of 4.8, 4.4, 3.5 and 5.1, compared to a single 32-core node.
The end result is that a scientist may move seamlessly from a laptop version of a code to a large, heterogeneous, parallel machine, without any changes to the code. In fact, we will show that the scientist can continue to work from a laptop, with interactive graphics if needed, while the contracted array operations are all executed on a remote machine, including supercomputers, granted that the scientist is willing to wait online for the job to be scheduled at the SC site. The descriptive approach not only makes scientists more productive, but also reduces the number of errors as no explicit parallelism is expressed, and synchronization requirements are fully derived from the descriptive implementation of the algorithm.

13:55
The Cost of Synchronizing a Billion Processes

ABSTRACT. A recent work [1] shows that prolonged local work causes idle periods that can propagate through the whole system, affecting the global performance of the parallel application. In message passing applications, this work imbalance generates idle periods that propagate via point-to-point communication. Synchronization of processes on distributed memory machines is achieved by sending and receiving messages. For this reason, a synchronization point in a parallel application allows idle periods to propagate through processes. The goal of this work is to quantify the effect of local work imbalances on the synchronization of up to one billion processes.
In parallel applications using the message passing programming model, synchronization is achieved by exchanging messages among processes. For instance, in the case of a barrier, each process sends a message to another process to notify that it has reached the barrier. The synchronization cost of message passing applications consists of a cost due to the communication of messages (communication cost) and a cost due to the fact that processes need to wait for the slowest process to enter the barrier (imbalance cost). This is evident from Figure 1, which presents the synchronization of four processes in a message passing application using a butterfly barrier [2]. In this figure, the red bar denotes local work on a process (b stands for "busy period"). Different processes have different workloads, leading to a work imbalance among processes. The time interval between the longest and the shortest busy periods is the "maximum imbalance time". The time when a process is waiting for a message from another process is denoted with i (idle time).
The communication cost is modeled with the LogP model [3], using L for the latency and o_s and o_r for the CPU overheads of sending and receiving a message, respectively. In the case of Figure 1, the communication cost of a four-process barrier is log2(4) * (L + o_s + o_r) = 2 * (L + o_s + o_r). If the workload on each of the n processes is assumed to be a random variable with an associated probability density function, the maximum imbalance can be calculated from the expected value of the maximum of n random variables. Intuitively, increasing the number of processes increases the maximum imbalance time. By running simulations of barrier algorithms with different random distributions, we determine the expected value of the maximum imbalance time and compare it with the communication cost. By simulating different barrier algorithms, we estimate the synchronization cost of a message-passing-based barrier with varying parameters: the number of application processes n, the number of processes per node, the random distribution function and its parameters, i.e. the standard deviation for a Gaussian distribution. We identified two different regimes: in the first regime the cost of work imbalance dominates the synchronization, while in the second regime the communication cost dominates. Finally, we discuss the importance of quantifying synchronization cost on exascale machines.
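In the notation above, the two contributions to the barrier cost can be written compactly. The closed form for the imbalance term below uses the standard asymptotic result for the maximum of n independent Gaussian variables and is only indicative; the paper studies several distributions by simulation.

\begin{aligned}
T_{\mathrm{barrier}}(n) &\approx
  \underbrace{\log_2(n)\,(L + o_s + o_r)}_{\text{communication cost}}
  \;+\;
  \underbrace{\mathbb{E}\Bigl[\max_{1 \le i \le n} W_i\Bigr] - \mathbb{E}[W]}_{\text{imbalance cost}},\\
\mathbb{E}\Bigl[\max_{1 \le i \le n} W_i\Bigr] - \mu
  &\sim \sigma \sqrt{2 \ln n}
  \qquad \text{for i.i.d. } W_i \sim \mathcal{N}(\mu, \sigma^2).
\end{aligned}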

References
[1] Stefano Markidis, Juris Vencels, Ivy Bo Peng, Dana Akhmetova, Erwin Laure, and Pierre Henri. Idle waves in high-performance computing. Phys. Rev. E, 91:013306, Jan 2015.
[2] Eugene D Brooks III. The butterfly barrier. International Journal of Parallel Programming, 15(4):295–307, 1986.
[3] Richard M Karp, Abhijit Sahay, Eunice E Santos, and Klaus Erik Schauser. Optimal broadcast and summation in the logp model. In Proceedings of the fifth annual ACM symposium on Parallel algorithms and architectures, pages 142–153. ACM, 1993.

14:20
A new thread support level for hybrid programming with MPI endpoints
SPEAKER: Daniel Holmes

ABSTRACT. Exascale is driving hardware towards nodes with many cores in order to keep down power consumption. On such hardware, running one MPI process per core will increasingly suffer from overheads in both time and memory, and we can expect the current trend of using hybrid programming, with multiple threads running in each MPI process, to continue.
A major problem with this approach is how the threading model accommodates communication: either one thread must perform all the off-node communication, or else the threads need to synchronise access to the communication network, either manually in the application code or by using the MPI_THREAD_MULTIPLE thread support level. Currently, implementing MPI_THREAD_MULTIPLE requires locking (or other protection) for all mutable internal shared state and shared resources, e.g. matching queues and communication hardware. This protection guarantees thread-safety and correct operation, but degrades performance in comparison to all other thread support levels.
To address this issue, the MPI Forum is currently considering a proposal to add endpoints to the MPI Standard, which will enable the creation of MPI communicators that have multiple ranks in each MPI process. Restricting the application code so that each MPI endpoint is accessed by only one thread obviates the need for all of, or at least the majority of, these performance overheads.
The EPiGRAM project has designed an implementation of endpoints for the T3DMPI library. This MPI library has a single protocol queue in each MPI process, which is a strict FIFO in symmetrically allocated memory. All communication from other MPI processes is achieved by inserting protocol messages into the appropriate remote protocol queue using SHMEM primitives. The progress engine in each MPI process removes items from its queue, in FIFO order. Both the network hardware performing the remotely-initiated SHMEM operations and the local progress engine use atomic operations to manipulate the protocol queue. In common with many other MPI libraries, multi-threaded operation requires that the internal matching structures, such as the unexpected queue, are protected by locks.
Decreasing the granularity of the locks can improve concurrency, which in turn improves performance in terms of throughput and message-rate. The creation of multiple local endpoints introduces the possibility of partitioning the matching structures by target, a parameter that cannot be replaced by a wildcard in either send or receive operations. The locking granularity can therefore be reduced to a single partition.
Furthermore, restricting the application code so that each MPI endpoint is never concurrently accessed by more than one thread removes the need for these locks altogether because each partition of the matching structures is accessed in accordance with the rules for MPI_THREAD_SERIALIZED. To enable the programmer to guarantee such behaviour, we propose MPI_THREAD_SERIAL_EP, a new thread support level that allows concurrent MPI calls from multiple application threads as long as a different MPI endpoint is used for each such call.
The EPiGRAM project is adding MPI endpoints to two applications in order to verify the theoretical performance advantages and to demonstrate the best methods to hybridise MPI software and take advantage of exascale hardware.
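For concreteness, the sketch below shows how the proposed interface was intended to be used together with the suggested thread level. Neither MPI_Comm_create_endpoints nor MPI_THREAD_SERIAL_EP is part of the MPI standard; the names and signature follow the MPI Forum endpoints proposal and this abstract, so the code is illustrative only and will not compile against a standard MPI library.

/* Hedged sketch of the proposed endpoints usage (NOT standard MPI).         */
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided;
    /* Proposed thread level: concurrent MPI calls are allowed as long as each
     * endpoint communicator is driven by at most one thread at a time.       */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIAL_EP /* proposed */, &provided);

    int nthreads = omp_get_max_threads();
    MPI_Comm ep_comm[nthreads];

    /* Proposed call: create 'nthreads' ranks (endpoints) for this process in
     * a new communicator; each thread then drives its own endpoint.          */
    MPI_Comm_create_endpoints(MPI_COMM_WORLD, nthreads, MPI_INFO_NULL, ep_comm);

    #pragma omp parallel
    {
        MPI_Comm my_ep = ep_comm[omp_get_thread_num()];
        int me;
        MPI_Comm_rank(my_ep, &me);      /* one rank per thread              */
        /* ... per-thread sends/receives on my_ep, lock-free matching ...    */
        MPI_Comm_free(&my_ep);
    }
    MPI_Finalize();
    return 0;
}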

13:30-14:45 Session 17B: Contributed Talks
Location: Prestonfield
13:30
Flexible, Scalable Mesh and Data Management using PETSc DMPlex
SPEAKER: Michael Lange

ABSTRACT. Scalable file I/O and efficient domain topology management present important challenges for many scientific applications if they are to fully utilise future exascale computing resources. Although these operations are common to many scientific codes, they have received little attention in recent optimisation efforts, resulting in potentially severe performance bottlenecks for realistic simulations that require and generate large data sets. Moreover, due to a multitude of formats and a lack of standards for mesh and output data in the community, there is only limited interoperability and very little code reuse among scientific applications for common operations, such as reading and partitioning input meshes. Thus developers are often forced to create custom I/O routines or even use application-specific file formats, which further limits application portability.

Designing a scientific software stack to meet next-generation simulation demands not only requires scalable and efficient algorithms to perform data I/O and mesh management at scale, but also an abstraction layer that allows a wide variety of application codes to utilise them and thus promotes code reuse and interoperability. Such an intermediate representation of mesh topology has recently been added to PETSc, a widely used scientific library for the scalable solution of partial differential equations, in the form of the DMPlex data management API. DMPlex represents the topology of unstructured computational meshes in a graph that decouples mesh management tasks, such as partitioning and parallel data distribution, from the mesh file format.

In this paper we demonstrate the use of PETSc’s DMPlex API to perform mesh input and domain topology management in multiple applications. We focus on the integration with Fluidity, a large-scale CFD application code that already uses the PETSc library as its linear solver engine, while also highlighting how the same features are used in Firedrake, an automated system for Finite Element computation. By utilising DMPlex as a common mesh management abstraction we not only add support for new mesh and output file formats, such as ExodusII, CGNS, Gmsh, Fluent Case and HDF5/Xdmf, to both applications, but also enable the use of domain decomposition methods and mesh renumbering techniques at runtime. This provides significant performance benefits to both applications by improving cache locality and drastically reducing the amount of file I/O required during pre-processing in the CFD application.
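A hedged sketch of the DMPlex workflow described above (read an unstructured mesh, then distribute it across MPI ranks) is given below in C; the exact call signatures vary between PETSc releases, so this is indicative rather than definitive, and the file name is an assumption.

/* Hedged sketch of mesh input and distribution with DMPlex.                 */
#include <petscdmplex.h>

int main(int argc, char **argv)
{
    DM dm, dmDist = NULL;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

    /* Read the mesh from a file (format detected from the extension,
     * e.g. ExodusII or Gmsh) into the DMPlex topology representation.        */
    ierr = DMPlexCreateFromFile(PETSC_COMM_WORLD, "mesh.exo", PETSC_TRUE, &dm);CHKERRQ(ierr);

    /* Partition and migrate the mesh; 0 means no overlap (halo) cells here.  */
    ierr = DMPlexDistribute(dm, 0, NULL, &dmDist);CHKERRQ(ierr);
    if (dmDist) { ierr = DMDestroy(&dm);CHKERRQ(ierr); dm = dmDist; }

    /* The application would now attach PetscSections/fields to 'dm' and hand
     * it to its discretisation and solver layers.                            */
    ierr = DMDestroy(&dm);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
}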

13:55
Algorithms in the parallel partitioning tool GridSpiderPar for large mesh decomposition

ABSTRACT. The problem of load balancing arises in parallel mesh-based numerical solution of problems of continuum mechanics, energetics, electrodynamics etc. on high-performance computing systems. Geometric parallelism is commonly used in most applications for large-scale 3D simulations of the problems listed above. It implies that every branch of an application code processes a subset of the computational mesh (a subdomain). In order to increase processor efficiency it is necessary to provide a rational domain decomposition, taking into account the requirements of balanced mesh distribution among processors and reduction of interprocessor communications, which depend on the number of bonds between subdomains.

The number of processors to run a computational problem is often unknown. It makes sense, therefore, to partition a mesh into a great number of microdomains which then are used to create subdomains.

Graph partitioning methods implemented in state-of-the-art parallel partitioning tools such as ParMETIS, JOSTLE, PT-SCOTCH and Zoltan are based on multilevel algorithms consisting of three phases: graph coarsening, initial partitioning and uncoarsening with refinement of the partitions. This approach has the shortcoming of producing subdomains with long frontiers or irregular shapes. In particular, these methods can form unconnected subdomains. Such worsening of subdomain quality adversely affects the performance of subsequent computations. For example, it may result in a larger number of iterations being needed for iterative linear system solvers to converge.

Another shortcoming of present graph partitioning methods is the generation of strongly imbalanced partitions. This shortcoming is most prominent in partitions made by ParMETIS, where the number of vertices in some subdomains can be twice as large as in others. The imbalance can cause significant performance problems, especially in exascale computing.

To address the above-mentioned problems, the GridSpiderPar program package for parallel decomposition of large meshes was developed. Two algorithms were implemented in the GridSpiderPar package: a parallel geometric algorithm for mesh partitioning and a parallel incremental algorithm for graph partitioning. The devised parallel algorithms support the two main stages of large mesh partitioning: a preliminary mesh partitioning among processors and a high-quality parallel mesh partitioning. Both work with unstructured meshes with up to 10^9 elements. The main advantage of the second algorithm, which is based on the incremental idea, is the creation of principally connected subdomains.

We compared different partitions into microdomains, microdomain graph partitions and partitions into subdomains of several meshes (10^8 vertices, 10^9 elements) obtained by means of the partitioning tool GridSpiderPar and the packages ParMETIS, Zoltan and PT-SCOTCH. Balance of the partitions, edge-cut and number of unconnected subdomains in different partitions were compared as well as the computational performance of gas-dynamic problem simulations run on different partitions. For the gas-dynamic problem simulations we used the MARPLE3D research code designed in KIAM RAS. The obtained results demonstrate advantages of the devised algorithms.

The work was supported by RFBR grants 13-01-12073-ofi_m, 14-01-00663-a, 14-07-00712-a, and 15-07-04213-a.

14:20
Scalable Multithreaded Algorithms for Mutable Irregular Data with Application to Anisotropic Mesh Adaptivity

ABSTRACT. Thread-level parallelism for applications with irregular data structures and mutable dependencies presents significant challenges because the underlying data is extensively modified during execution of the algorithm and a high degree of parallelism must be realized while keeping the code race-free.

In this talk I will describe a new irregular compute methodology for shared-memory parallelism, which guarantees safe parallel execution via processing workitems in batches of independent sets and using a deferred operations strategy to update the underlying data structures in parallel without data contention. Scalability is assisted by creating worklists using atomic operations, a synchronization-free alternative to reduction-based worklist algorithms, and by creating independent sets using an improved graph coloring method which runs up to 1.5x faster than existing techniques. Finally, I will describe some early work on an interrupt-driven work-sharing for-loop scheduler which performs better than existing options.
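As a hedged illustration of the atomic worklist creation mentioned above (not the actual adaptive-mesh kernels), threads can append work items through an atomic capture instead of building per-thread lists and merging them with a reduction:

/* Hedged sketch: build a shared worklist with an atomic fetch-and-add.      */
#include <stddef.h>

/* Scan all mesh vertices in parallel; every vertex that needs processing is
 * appended to a single shared worklist via an atomic capture.                */
size_t build_worklist(int *worklist, size_t nvertices,
                      int (*needs_work)(size_t v))
{
    size_t tail = 0;                      /* shared append position */

    #pragma omp parallel for schedule(guided)
    for (size_t v = 0; v < nvertices; ++v) {
        if (needs_work(v)) {
            size_t slot;
            #pragma omp atomic capture
            slot = tail++;                /* reserve a unique slot   */
            worklist[slot] = (int)v;
        }
    }
    return tail;                          /* number of work items    */
}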

Using this compute methodology and applying it to adaptive mesh algorithms, it is shown that our codes achieve a parallel efficiency of 60% on an 8-core Intel(R) Xeon(R) Sandy Bridge and 40% using 16 cores on a dual-socket Intel(R) Xeon(R) Sandy Bridge ccNUMA system.

14:45-15:05 Coffee Break
15:05-16:45 Session 18: Contributed Talks
Location: Pentland
15:05
Evaluating New Communication Models in the Nek5000 Code for Exascale
SPEAKER: Ilya Ivanov

ABSTRACT. Nek5000 is a spectral CFD code for modeling incompressible flow in a large spectrum of applications, ranging from nuclear reactor core modeling to oceanography. One of the most important features of the Nek5000 code is its scalability up to 1 million cores on IBM Blue Gene machines. Moreover, we have previously presented a case study of porting Nek5000 to parallel GPU-accelerated systems with up to 16384 GPUs [1, 2].
Future exascale supercomputers will provide an unprecedented amount of parallelism, most likely from hybrid architectures. In addition, more and more interconnection networks will provide hardware support for Remote Direct Memory Access (RDMA) and fast one-sided communication. The goal of this work is to investigate whether new parallel communication models, taking advantage of hybrid architectures and RDMA, can be used in Nek5000 to improve its scalability towards exascale.
The Nek5000 parallel communication is based on non-blocking MPI point-to-point communication. These communication operations are encapsulated in the so-called "gather-scatter operator" (gs_ops). This communication is needed to make the nodal values on element boundaries consistent. At the moment three gs_ops strategies are available: pairwise communication between spectral elements, a crystal-router algorithm and an all-reduce approach. In this talk, we describe the new implementation of communication models in NekBone (a skeleton version of the full Nek5000 application):
i) Use of GPU-direct to communicate directly between GPU memory spaces without involving the CPU memory. For this work, we use an OpenACC accelerated version of NekBone.
ii) Use of the GPI-2 communication library to perform RDMA one-sided operations.
In addition, we report the initial performance results of the new communication models. Finally, we discuss the experience and the challenges we faced during this work.
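As a hedged illustration of point (i) above, halo data can be exchanged directly from GPU memory by combining OpenACC's host_data construct with a CUDA-aware (GPU-direct) MPI; the buffer names and sizes are illustrative and not taken from NekBone.

/* Hedged sketch: GPU-direct halo exchange with OpenACC and CUDA-aware MPI.  */
#include <mpi.h>

void exchange_faces(double *face_send, double *face_recv, int count,
                    int neighbour, MPI_Comm comm)
{
    MPI_Request reqs[2];

    /* host_data exposes the device addresses of the arrays to MPI; a
     * GPU-direct capable MPI then moves the data without staging on the host. */
    #pragma acc host_data use_device(face_send, face_recv)
    {
        MPI_Irecv(face_recv, count, MPI_DOUBLE, neighbour, 0, comm, &reqs[0]);
        MPI_Isend(face_send, count, MPI_DOUBLE, neighbour, 0, comm, &reqs[1]);
    }
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}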

References:
[1] S. Markidis, J. Gong, M. Schliephake, E. Laure, A. Hart, D. Henty, K. Heisey and P. F. Fischer, OpenACC Acceleration of Nek5000, Spectral Element Code, Advances in Engineering Software Journal (under review)
[2] J. Gong, S. Markidis, M. Schliephake, E. Laure, D.S. Henningson, P. Schlatter, A. Peplinski, A. Hart, J. Doleschal, D. Henty and P. F. Fischer, Nek5000 with OpenACC, Lecture Notes in Computer Science, Springer (accepted)

15:30
Performance of Parallel IO on Lustre and GPFS
SPEAKER: David Henty

ABSTRACT. File input and output often become a severe bottleneck when parallel applications run on large numbers of processors. Simple methods such as writing a separate file per process or performing all IO via a single master process are no longer feasible at scale. In order to take advantage of the full potential of modern parallel file systems, IO also needs to be done in parallel.

In this paper we investigate the performance that can be achieved using MPI-IO, HDF5 and NetCDF. We benchmark two codes: a simple benchmark test case using a three-dimensional distributed dataset, and Code_Saturne, a real fluid dynamics application using unstructured meshes. These are benchmarked on a Cray XC30 system with the Lustre file system, and on an IBM BG/Q with GPFS. There are substantial differences between these two architectures: for example, the Lustre IO servers are decoupled from the compute nodes on the XC30, whereas on the BG/Q they are integrated into the same network as the compute nodes.

Our current results are mainly from MPI-IO. For both the benchmark and the real code, we find that MPI-IO can improve performance by at least an order of magnitude over more naive IO strategies on large core counts. However, this requires some tuning of parameters. First, MPI-IO must be configured to use collective routines; second, the Lustre file system must be configured to use appropriate parallel striping to ensure the available IO bandwidth scales with system size (this happens automatically on the BG/Q).
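A hedged C sketch of the two tuning steps just described, collective MPI-IO plus Lustre striping requested through MPI info hints, is given below; the stripe count and stripe size are illustrative values, and striping can equally be set outside the code with lfs setstripe.

/* Hedged sketch: collective MPI-IO write with Lustre striping hints.
 * "striping_factor" and "striping_unit" are ROMIO info hints.               */
#include <mpi.h>

void write_block(const double *local, MPI_Offset nlocal, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "48");      /* number of OSTs     */
    MPI_Info_set(info, "striping_unit", "4194304");   /* 4 MiB stripe size  */

    MPI_File fh;
    MPI_File_open(comm, "output.dat", MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  info, &fh);

    /* Collective write: all ranks participate, so the MPI library can
     * aggregate requests through a few collective-buffering nodes.           */
    MPI_Offset offset = (MPI_Offset)rank * nlocal * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, offset, local, (int)nlocal, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
}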

We are currently extending these studies to HDF5 and NetCDF. Initial results indicate the same general behaviour, i.e. that collective IO is crucial to achieving performance on both systems. By the time of the conference we will have complete sets of data on both systems. These results are very important as codes scale up in core count: increased IO bandwidth is required to ensure that overall application performance continues to increase and does not become IO-limited.

15:55
HPC and CFD in the Marine Industry: Past, Present and Future
SPEAKER: Kurt Mizzi

ABSTRACT. This paper explores the use of Computational Fluid Dynamics applications on High Performance Computing (HPC) platforms from the point of view of a user engaged in Naval Architecture research. The paper will consider the significant limitations which were imposed on research boundaries prior to present HPC capabilities, how this impacted development in the field and the implications for industry. One particular example is the costly experimental testing which, due to resource constraints, is generally restricted to model scale.
It will then present an overview of the numerical simulation capabilities using current HPC performance and capacity. Several examples of research fields which are extremely important in the marine sector, both academically and also, crucially, for industry, will be used to further these discussions. More specifically, the ability to directly simulate resistance, the effects of fouling on ship performance and underwater noise experimental tests at model and full scale will be discussed. HPC capabilities for the optimisation of Propeller Boss Cap Fin (PBCF) designs will also be outlined. The ability to simulate at full scale is very important as it removes the error and correction factors associated with model scale testing due to scaling effects, and in cases where improvements in the range of 2-3% are being sought, it is very beneficial to have the sources of error minimised as far as possible.
With the increase in computational power and capacity, CFD simulations are proving to be more accurate and reliable. Being cheaper and more time-efficient, numerical methods are becoming the preferred choice within the industry compared to traditional experimental tests. That said, certain experimental procedures cannot be numerically replicated with current levels of computational capacity. A typical example is propeller cavitation, which is best predicted using a Detached Eddy Simulation (DES) solver that requires high computational capacity. This indicates that computational capacity dictates research boundaries as well as processes, which can be significant in an industry which tends to be conservative and reactive. The future needs and challenges in research and development will be outlined and discussed, highlighting the significant impact exascale computing will have.
Although maximising HPC capabilities has broadened the horizons of research and analyses in recent years, the issue remains as to how to convince industry to rely on numerical simulations. Although these methods may be beneficial financially while generating high levels of useful data, quality assurance of any product is always a prime concern and new procedures need to be well proven and acceptable to industry. Given the important role played by industry in funding and driving research in this field, the paper will also discuss possible strategies to earn its trust.

16:20
GPU porting with directives

ABSTRACT. CASTEP is a density functional theory software package used for both commercial and academic research. Using ab initio molecular dynamics and first-principles calculations it is able to calculate physical properties of atomistic systems. We will discuss the process of porting such a large code to GPUs, the challenges and benefits of using a directive-based system such as OpenACC, and the performance we can achieve when running simulations on GPUs.

As CASTEP is a large, FORTRAN-based simulation code that has been extensively parallelised using MPI, there are significant challenges in porting the code to GPUs. Re-writing the full code base using a different programming language or language extension, such as CUDA Fortran, would require a very large amount of work and would potentially create two different versions of the code base, which is not desirable for future development and maintenance work. Therefore, we have investigated using OpenACC, which offers GPU functionality using OpenMP-like directives, to port CASTEP to GPUs. This should enable a common code base to exploit both GPUs and standard processors.
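As a hedged illustration of the directive-based approach (shown on a generic kernel in C rather than actual CASTEP Fortran, where equivalent !$acc directives apply), a loop offloaded with OpenACC looks like this:

/* Illustrative only: a simple kernel offloaded to the GPU with OpenACC.
 * The function and array names are assumptions, not CASTEP code.            */
void apply_potential(int n, double *restrict psi, const double *restrict v)
{
    /* copy/copyin clauses (or enclosing data regions) control host-device
     * transfers; the loop itself executes on the accelerator.                */
    #pragma acc parallel loop copy(psi[0:n]) copyin(v[0:n])
    for (int i = 0; i < n; ++i)
        psi[i] *= v[i];
}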

However, as OpenACC is a maturing standard, it does not yet fully support some of the features of FORTRAN that are used in CASTEP. We will discuss the challenges of porting CASTEP to GPUs using OpenACC, highlight where code modifications were necessary, and describe the design decisions and trade-offs we encountered.

We will also present the performance we have achieved using this strategy, comparing the GPU version of the code with the MPI version and demonstrating that we can obtain performance comparable with a single node of an HPC system. Furthermore, we will discuss the challenges of extending this approach to a multi-GPU parallelisation, fully integrating the functionality with the MPI parallelisation in the code.

 