09:30 | Managing Tianhe-2, the world's fastest supercomputer SPEAKER: Xue-Feng Yuan ABSTRACT. Xue-Feng Yuan, Director of the National Supercomputer Centre in Guangzhou, will describe the experience of managing Tianhe-2, the world's fastest supercomputer. |
10:50 | Multiscale modelling: a paradigm case for the exascale SPEAKER: James Suter ABSTRACT. One future approach for the exascale is to use multiscale simulation, where many submodels on the petascale are combined. However, the concept of having multiple models form a single scientific simulation, with each model operating on its own spatiotemporal scale, gives rise to a range of new challenges. Multiscale simulations often require interdisciplinary scientific expertise to construct, as well as new advances in reliable and accurate model coupling methods, and new approaches in multi-model validation and error assessment. We have found that different communities adopt |
11:15 | Support Operator Technique for 3D Simulations of Dissipative Processes at High Performance Computers SPEAKER: Vladimir Gasilov ABSTRACT. Problems related to transient high-temperature gas/plasma flows often reveal an essential dependence of the solution on various dissipative effects. Predictive simulation in this area of CFD relies on finite difference (or finite volume) methods that satisfy common requirements such as conservativity, monotonicity, stability over a wide range of flow parameters, and high resolution. Further requirements arise from multiscale modelling in domains with complex shapes. An important property of any CFD technique is that it can be implemented on general computational meshes (block-structured, unstructured, mortar, etc.) while the general properties of the original differential operators persist in their difference analogues. Traditionally, the theory of difference approximations deals with properties such as positiveness/non-positiveness, self-adjointness of difference operators, and other basic mathematical properties. The support operators method is remarkable in that it allows approximations to differential operators to be built on general meshes while the resulting difference operators preserve not only the basic properties mentioned above but additionally yield rotationally invariant difference schemes. Rotational invariance deserves special attention when working with systems describing deformation and dissipation in gaseous or liquid media, e.g. the Navier-Stokes equations. Differencing via the support operator method guarantees that this property is preserved, since it is inherent in the original differential operators. The method allows building robust numerical procedures suitable for multiscale simulations requiring very finely discretised computational domains (billions of grid cells) – precisely the case that calls for high performance computing. 
The support operator technique is developed for meshes formed from hexahedral, tetrahedral and prismatic cells and their various combinations. The corresponding numerical algorithms are incorporated into the scientific CFD code MARPLE3D (Keldysh Institute of Applied Mathematics - KIAM). MARPLE3D is an object-oriented, parallel code designed for scientific simulations on distributed-memory systems. It is based primarily on a radiative magnetohydrodynamic model that incorporates current knowledge of high-temperature plasmas. MARPLE3D includes numerical tools able to perform large-scale 3D simulations, and the code's versatility enables its application to diverse CFD problems in regions with complex shapes. Numerical experiments show that MARPLE3D scales well when calculations use a large number of processors. On the way to “exaflops” systems, further work is needed on rational decomposition methods that remain effective for extremely large computational meshes. We believe that the GRIDSPIDERPAR software under development at KIAM is promising in this respect. The paper deals with the support operator method of constructing difference approximations on 3D unstructured grids. The data models and features of the parallel implementation are discussed. Examples of physical problems solved via HPC are presented. |
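As a minimal illustration of the support-operator principle described above (our own sketch, not MARPLE3D code), one can define a discrete divergence on a 1D staggered mesh and derive the gradient as its negative adjoint, so that the discrete integration-by-parts identity holds by construction:

```python
# Illustrative sketch of the support-operator idea in 1D: pick a "prime"
# operator (a cell-centred divergence DIV) and derive the gradient GRAD
# as its negative adjoint with respect to the natural inner products.

def div(u, h):
    # u: fluxes at n+1 nodes -> divergence in the n cells between them
    return [(u[i + 1] - u[i]) / h for i in range(len(u) - 1)]

def grad(p, h):
    # negative adjoint of div: gradient at interior nodes,
    # zero at the boundary nodes (zero-flux boundary assumed)
    g = [0.0] * (len(p) + 1)
    for i in range(1, len(p)):
        g[i] = (p[i] - p[i - 1]) / h
    return g

def inner_cells(a, b, h):
    # discrete L2 inner product over cell-centred quantities
    return h * sum(x * y for x, y in zip(a, b))

def inner_nodes(a, b, h):
    # discrete L2 inner product over node-centred quantities
    return h * sum(x * y for x, y in zip(a, b))
```

With zero boundary fluxes, `inner_cells(div(u, h), p, h)` equals `-inner_nodes(u, grad(p, h), h)` exactly, mirroring how the method transfers properties of the differential operators to their difference analogues.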
11:40 | Large-scale Ultrasound Simulations Using the Hybrid OpenMP/MPI Decomposition SPEAKER: Jiri Jaros ABSTRACT. The simulation of ultrasound wave propagation through biological tissue has a wide range of practical applications. Recently, high intensity focused ultrasound has been applied to functional neurosurgery as an alternative, non-invasive treatment of various brain disorders such as brain tumours, cerebral haemorrhage, essential tremor, and Parkinson’s disease. The technique works by sending a focused beam of ultrasound into the tissue, typically using a large transducer. At the focus, the acoustic energy is sufficient to cause cell death in a localised region while the surrounding tissue is left unharmed. The major challenge is to ensure the focus is accurately placed at the desired target within the brain, because the skull can significantly distort it. Accurate ultrasound simulations, however, require the simulation code to exploit several thousand processor cores and work with datasets on the order of tens of TB. We recently developed a pure-MPI pseudo-spectral simulation code using a 1D domain decomposition, which has allowed us to run realistic simulations on up to 2048 compute cores. However, the maximum parallelisation of this implementation is limited by the largest size along an axis of the 3D grid used. As we approach the exascale era, systems increasingly have core counts far exceeding this limit. For example, cutting-edge ultrasound simulations performed by the k-Wave toolbox use 2048^3 grids, and so with the 1D pure-MPI decomposition would scale to at most 2048 cores, leading to calculation times (here between 50 and 100 hours) that exceed the clinically acceptable limit of 24 hours. This paper presents an improved implementation that exploits a 2D hybrid OpenMP/MPI decomposition. The 3D grid is first decomposed by MPI processes into slabs. 
The slabs are further partitioned into pencils assigned to threads on demand. This allows us to employ 16 to 32 times more compute cores than the pure-MPI code while reducing the amount of communication among processes thanks to efficient use of shared memory within compute nodes. The proposed technique was tested on the Fermi supercomputer (CINECA, Italy) with up to 16,384 compute cores and a domain size of 1024^3 grid points, where the original code was able to use only 1024 compute cores. The speed-up achieved by the proposed code saturated at a factor of 9, a limit set by the method's heavy reliance on 3D fast Fourier transforms. This result has a huge practical impact on many ultrasound simulations. For the k-Wave project, deploying the hybrid decomposition has the potential to decrease the simulation time by a factor of 9, bringing the simulation time within the clinically meaningful timespan of 24 hours and allowing patient-specific treatment plans to be created. |
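The scaling ceiling that motivates the hybrid scheme can be sketched in a few lines (an illustration under our own assumptions, not the k-Wave code): a 1D slab decomposition of an N^3 grid admits at most N MPI processes, while cutting each rank's slab into pencils for T threads multiplies the usable core count by T:

```python
# Sketch of the core-count limits of 1D (slab) vs hybrid 2D (pencil)
# decomposition of an N x N x N grid.

def max_workers_slab(n):
    # 1D decomposition: one x-y slab per MPI rank, so at most N ranks
    return n

def max_workers_hybrid(n, threads_per_process):
    # slabs across MPI ranks, each slab cut into pencils for the threads
    return n * threads_per_process

def pencil_shape(n, threads_per_process):
    # each thread handles an (N/T) x N column of its rank's slab
    return (n // threads_per_process, n)
```

For the 2048^3 grid quoted above, the slab scheme stops at 2048 cores, while 16 threads per process extends that to 32768, in line with the 16-32x figure in the abstract.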
12:05 | Multiscale ensemble simulation for vascular blood flow: a voyage towards the exascale SPEAKER: Ulf Schiller ABSTRACT. Cerebrovascular diseases such as brain aneurysms are a primary cause of adult disability. The flow dynamics in brain arteries, both during periods of rest and increased activity, are known to be a major factor in the risk of aneurysm formation and rupture, although the precise relation is still an open field of investigation. We present an automated ensemble simulation method for modelling cerebrovascular blood flow under a range of flow regimes. By automatically constructing and performing an ensemble of multiscale simulations, with a 1D solver unidirectionally coupled to a 3D lattice-Boltzmann simulation code, we are able to model the blood flow in a patient artery over a range of flow regimes. In addition, this approach allows us to use a very large number of cores to obtain more accurate and robust scientific results from our simulations. We apply the method to a model of a middle cerebral artery, using 10,000s of cores in total, and find that this exercise helps us to fine-tune our modelling techniques and opens up new ways to investigate cerebrovascular flow properties. Building on our recent investigations of modelling the Circle of Willis, our approach can be trivially extended to use 100,000s of cores efficiently in this ensemble setup. In addition, our previously reported improvements in weighted domain decomposition allow us to introduce highly sophisticated boundary conditions with very limited computational overhead. |
10:50 | Enabling Adaptive, Fault-tolerant MPI Applications with Dynamic Resource Allocation SPEAKER: Maciej Szpindler ABSTRACT. Recent HPC systems build their computational capability and performance on extreme numbers of processing elements - multicore nodes, hyperthreaded cores, accelerated nodes and others. This requires applications to be extremely parallel and scalable to run efficiently on these systems and benefit from their peak performance. On the other hand, in real usage several applications share one large system at the same time. This is natural, since such systems are expensive and consume considerable resources even when idle. While applications are parallelised using classic approaches such as MPI and PGAS runtimes, or novel models for large data processing, fair and optimal sharing of the computational resources is provided by the resource manager (queuing system). One common constraint for programmers in this shared computing environment is static resource allocation. Although MPI and other parallel programming models have constructs that allow dynamic process creation and management, they are not easily implementable in a shared multi-task system. Newly created processes must use already-allocated resources, which leads either to a waste of resources that must be preallocated, or to process oversubscription, which likely results in performance degradation. Moreover, fault tolerance of parallel applications is becoming a real need. Common scenarios for process fault recovery assume a repair stage, which implies dynamic creation of new processes. In this work a simple model for fully dynamic multi-process execution is proposed. The approach uses existing methods for resizing resource allocations (dynamic job allocation) in the SLURM environment. SLURM implements PMI, a quasi-standard process management interface layer. 
This allows communication between MPI routines in the user application and the job management layer handled by the SLURM system. The proposed approach builds on the MPI process spawning model, which only allows a blocking mode of process creation. Since resource allocation is usually not immediate on a system shared by many users and jobs, a non-blocking mode is more practical. A basic implementation that extends the MPI standard is proposed. In our approach we allocate new resources and create new processes that enable an MPI application to utilise these resources dynamically within one SLURM job. This permits dynamic multi-process applications in a shared environment, gives more opportunities for better resource utilisation, and allows programmers to create more flexible applications that may allocate resources dynamically and exclusively - something usually required to achieve full performance and essential when addressing fault recovery. |
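As a rough illustration of the blocking versus non-blocking distinction (a hypothetical sketch using a Python thread as a stand-in for the SLURM allocation-resize plus spawn step; all names are invented, this is not the proposed MPI extension itself):

```python
# Blocking spawn waits for the new processes to exist; a non-blocking
# variant returns a request handle immediately, letting the application
# overlap useful work with a possibly slow allocation resize.

from concurrent.futures import ThreadPoolExecutor
import time

def allocate_and_spawn(n_procs):
    # stand-in for: grow the SLURM allocation, then spawn MPI processes
    time.sleep(0.01)  # allocation is rarely immediate on a busy system
    return f"intercomm({n_procs} ranks)"

def ispawn(executor, n_procs):
    # non-blocking spawn: returns at once with a future ("request")
    return executor.submit(allocate_and_spawn, n_procs)

with ThreadPoolExecutor() as pool:
    req = ispawn(pool, 4)      # returns immediately; do other work here
    intercomm = req.result()   # a later "wait" completes the spawn
```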
11:15 | Simulation-based instruction-level statistics for optimizing software on future architectures SPEAKER: Wim Heirman ABSTRACT. In this paper we address the problem of optimizing applications for future hardware platforms. Performance projections of future systems are crucial for both software developers and CPU architects. Developers of applications, runtime libraries and compilers need predictions for tuning their software before the actual systems are available, and architects need them for architecture exploration and design optimization. Simulation is one of the most commonly used methods for performance prediction, and developing detailed simulators constitutes a major part of processor design. However, most detailed simulators, while very accurate, are too slow to simulate meaningful parts of applications running on many-core systems with large caches. But by trading off accuracy for the ability to run larger parts of the application, approximate simulators can play a significant role in software tuning and architectural exploration. Sniper is an x86 many-core simulator that has acceptable accuracy (around 20% average absolute error compared to Nehalem hardware) and high simulation speed (around 1 MIPS). To increase its utility to software developers, we extend Sniper to collect hardware events and timing effects at a per-instruction granularity. 
Comparable statistics can be obtained on existing systems using hardware performance counters; however, these suffer from a number of drawbacks: many hardware counters have inaccuracies such as double-counting under certain conditions, skidding (meaning that events are not always associated with the correct instruction), sampling errors (instruction pointers are typically only sampled when a counter overflows), or a lack of insight into how hardware events such as cache misses contribute to execution time (which is non-trivial in the context of out-of-order processors that can overlap many types of stalls with useful computation). In contrast, Sniper’s instruction-level statistics are based on the concept of cycle stacks. These break down the total execution time of the application, assigning each clock cycle of execution to the hardware component that was on the critical path of execution. Resolving a stall that is visible on the cycle stack is therefore guaranteed to improve performance. We extend Sniper’s existing method of cycle stack collection to keep track of the address of each instruction that causes an increment of the cycle stack, converting the stack into a matrix of cycle stack component and instruction pointer elements. We then combine this information with the application’s disassembly and (through debug symbols) its source code to construct cycle stack contributions at instruction or source-line granularity. This gives application developers actionable information (cycle stack contributions correlate directly with obtainable performance improvements) that is moreover more accurate than what hardware performance counters provide (no sampling errors or skidding, as the simulator can fully track the performance effects of each individual instruction), allowing them to start optimizing their software today for the architectures of tomorrow. |
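A toy sketch of the cycle-stack matrix described above (our illustration with an invented event stream, not Sniper's implementation): each critical-path increment is keyed by (component, instruction pointer), and a debug-symbol map then folds the matrix onto source lines:

```python
# Accumulate cycle-stack increments into a (component, IP) matrix, then
# fold it onto source lines via an IP -> source-line map.

from collections import defaultdict

def build_cycle_matrix(events):
    # events: (cycles, component, instruction_pointer) tuples, one per
    # increment of the cycle stack on the simulated critical path
    matrix = defaultdict(int)
    for cycles, component, ip in events:
        matrix[(component, ip)] += cycles
    return matrix

def fold_to_source(matrix, ip_to_line):
    # aggregate per-instruction contributions at source-line granularity
    per_line = defaultdict(int)
    for (component, ip), cycles in matrix.items():
        per_line[(component, ip_to_line[ip])] += cycles
    return per_line
```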
11:40 | Parallelisation of an Explicit MOT-TDVIE Solver for Transient Electromagnetics Simulations Using Hybrid MPI/OpenMP on Many-core Xeon Phi Coprocessors SPEAKER: Ahmed Al-Jarro ABSTRACT. Hardware accelerators, including multi- and many-core processing technologies, are becoming ubiquitous in scientific computing, as their ability to achieve considerable speedup over conventional CPUs has been demonstrated in many computational science and engineering applications. Further, accelerators are increasingly popular as the preferred choice in many emerging HPC centres, since they offer cost effectiveness, power efficiency and physical density. Nevertheless, one of the factors limiting the widespread use of accelerators is the relatively expensive porting process required when using low-level programming models, for example CUDA and OpenCL for GPUs. To this end, recent development efforts have focused on providing new high-level directive-based programming models, such as OpenACC for NVIDIA's accelerators (GPUs), as well as extending the scope of existing standards, for example OpenMP 4.0 for porting applications onto Intel's many-core coprocessors (Xeon Phi).
|
12:05 | Performance optimisation on Xeon Phi SPEAKER: Adrian Jackson ABSTRACT. The flops-per-watt performance potentially available through Intel's Xeon Phi co-processor makes it very attractive for computational simulation. With its full x86 instruction set, cache-coherent architecture, and support for MPI and OpenMP parallelisations, it is generally straightforward to port applications to the platform. However, the low clock speed of the cores, the in-order instruction restrictions, extra-wide vector units, and low memory per core make it difficult to obtain good performance for many codes. EPCC has been working with Intel, through our IPCC, to optimise a range of large-scale simulation codes on Xeon Phi processors. In our talk we will describe the work undertaken and the performance achieved, and discuss the impact of the optimisations on the source code. We have been working on three simulation codes: GS2, COSA, and CP2K. All are large, FORTRAN-based, MPI-parallelised simulation codes that are widely used by simulation scientists in the UK, Europe, and around the world. They are also mature codes that incorporate a range of different simulation functionality. For instance, CP2K is a molecular dynamics code that can undertake classical molecular mechanics, density functional calculations, quantum mechanics, molecular dynamics, and much more. COSA is a CFD system that implements both frequency-domain and time-domain Navier-Stokes solvers for different simulation categories. GS2 is a plasma simulation code that is used for full collisional, nonlinear tokamak plasma simulations, for simple linear space plasma simulations, and as the core component of a multiscale gyrokinetic transport code (TRINITY). Whilst optimising a code that implements a single operation or type of simulation may be straightforward for hardware such as Xeon Phi, optimising codes with many different functional features is more challenging. 
We initially worked on optimising the vectorisation of these codes on standard processors and improving the performance of the MPI parallelisations, as both activities improve the standard performance of the applications as well as their performance on Xeon Phi. We then moved on to optimising specifically for Xeon Phi, including implementing or improving hybrid parallelisations for each code, targeting the AVX-512 instructions on the Xeon Phi, and restructuring some of the complex data structures used by the codes to improve cache and vector performance. We will present results for these applications run on single and multiple Xeon Phis, compare them with benchmarks on standard hardware, and demonstrate the comparative performance and where Xeon Phi can benefit the applications. |
13:30 | Automatic Code Orchestration from Descriptive Implementations SPEAKER: Brian Vinter ABSTRACT. While achieving exascale computing is in itself a huge technical task, bringing scientific users to a competence level where they can utilise an exascale machine is likely to pose problems of the same scale. This talk will describe ongoing work on enabling users to specify their algorithms descriptively, but still within existing programming languages: C++, .NET, or Python, for instance. The advantages of descriptive programming are, first, that the programmer is less likely to make programming errors, since traversing arrays and lists is expressed at an abstraction level that excludes the use of index variables and links, and, more importantly, that because a descriptive expression specifies what to do rather than how to do it, the runtime environment has a higher degree of freedom to produce a solution that is efficient on the available hardware. |
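A minimal Python contrast of the two styles (our own example, not from the talk): the imperative loop fixes how the traversal happens, index variable and all, while the descriptive form states only what is computed, leaving the runtime free to choose an efficient realisation:

```python
# Imperative vs descriptive expression of the same computation.

def scale_imperative(xs, a):
    out = []
    for i in range(len(xs)):    # explicit index: a frequent source of bugs
        out.append(a * xs[i])
    return out

def scale_descriptive(xs, a):
    return [a * x for x in xs]  # no index, no evaluation order implied
```

Because the descriptive form carries no ordering constraint, a runtime could vectorise, parallelise or fuse it with neighbouring operations without changing the program text.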
13:55 | The Cost of Synchronizing a Billion Processes SPEAKER: Stefano Markidis ABSTRACT. Recent work [1] shows that prolonged local work causes idle periods that can propagate through the whole system, affecting the global performance of the parallel application. In message-passing applications, this work imbalance generates idle periods that propagate via point-to-point communication. Synchronization of processes on distributed-memory machines is achieved by sending and receiving messages; for this reason, a synchronization point in a parallel application allows idle periods to propagate through processes. The goal of this work is to quantify the effect of local work imbalances on the synchronization of up to one billion processes. |
14:20 | A new thread support level for hybrid programming with MPI endpoints SPEAKER: Daniel Holmes ABSTRACT. Exascale is driving hardware towards nodes with many cores in order to keep down power consumption. On such hardware, running one MPI process per core will increasingly suffer from overheads in both time and memory, and we can expect the current trend of using hybrid programming, with multiple threads running in each MPI process, to continue. |
13:30 | Flexible, Scalable Mesh and Data Management using PETSc DMPlex SPEAKER: Michael Lange ABSTRACT. Scalable file I/O and efficient domain topology management present important challenges for many scientific applications if they are to fully utilise future exascale computing resources. Although these operations are common to many scientific codes, they have received little attention in recent optimisation efforts, resulting in potentially severe performance bottlenecks for realistic simulations that require and generate large data sets. Moreover, due to a multitude of formats and a lack of standards for mesh and output data in the community, there is only limited interoperability and very little code reuse among scientific applications for common operations such as reading and partitioning input meshes. Developers are thus often forced to create custom I/O routines or even use application-specific file formats, which further limits application portability. Designing a scientific software stack to meet next-generation simulation demands requires not only scalable and efficient algorithms to perform data I/O and mesh management at scale, but also an abstraction layer that allows a wide variety of application codes to utilise them, thus promoting code reuse and interoperability. Such an intermediate representation of mesh topology has recently been added to PETSc, a widely used scientific library for the scalable solution of partial differential equations, in the form of the DMPlex data management API. DMPlex represents the topology of unstructured computational meshes as a graph that decouples mesh management tasks, such as partitioning and parallel data distribution, from the mesh file format. In this paper we demonstrate the use of PETSc’s DMPlex API to perform mesh input and domain topology management in multiple applications. 
We focus on the integration with Fluidity, a large-scale CFD application code that already uses the PETSc library as its linear solver engine, while also highlighting how the same features are used in Firedrake, an automated system for finite element computation. By utilising DMPlex as a common mesh management abstraction we not only add support for new mesh and output file formats, such as ExodusII, CGNS, Gmsh, Fluent Case and HDF5/Xdmf, to both applications, but also enable the use of domain decomposition methods and mesh renumbering techniques at runtime. This provides significant performance benefits to both applications by improving cache locality and drastically reducing the amount of file I/O required during pre-processing in the CFD application. |
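The flavour of such a topology abstraction can be sketched in a few lines (a simplified illustration of the idea, not the PETSc DMPlex API): every mesh entity is a point, and the topology is the graph of "cone" relations from each point to the entities one dimension below it:

```python
# A single triangle expressed as a point graph: the cell covers three
# edges, each edge covers two vertices. File format, dimension and
# element type are all hidden behind the same graph structure.

cone = {
    "cell0": ["edge0", "edge1", "edge2"],
    "edge0": ["v0", "v1"],
    "edge1": ["v1", "v2"],
    "edge2": ["v2", "v0"],
    "v0": [], "v1": [], "v2": [],
}

def closure(point):
    # transitive closure of the cone: everything needed to assemble
    # an operator over `point`
    seen, stack = [], [point]
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.append(p)
            stack.extend(cone[p])
    return seen
```

Because partitioning and data distribution operate on this graph rather than on a file format, the same code path serves meshes read from any of the formats listed above.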
13:55 | Algorithms in the parallel partitioning tool GridSpiderPar for large mesh decomposition SPEAKER: Evdokia N. Golovchenko ABSTRACT. The problem of load balancing arises in parallel mesh-based numerical solution of problems of continuum mechanics, energetics, electrodynamics, etc. on high-performance computing systems. Geometric parallelism is commonly used in most applications for large-scale 3D simulations of the problems listed above: every branch of an application code processes a subset of the computational mesh (a subdomain). To increase processor efficiency it is necessary to provide a rational domain decomposition, taking into account the requirements of balanced mesh distribution among processors and reduction of interprocessor communication, which depends on the number of bonds between subdomains. The number of processors on which a computational problem will run is often unknown in advance. It therefore makes sense to partition a mesh into a great number of microdomains, which are then used to build subdomains. Graph partitioning methods implemented in state-of-the-art parallel partitioning tools (ParMETIS, JOSTLE, PT-SCOTCH and Zoltan) are based on multilevel algorithms consisting of three phases: graph coarsening, initial partitioning, and uncoarsening with refinement of the partitions. A shortcoming of that approach is that it can produce subdomains with long frontiers or irregular shapes; in particular, these methods can form unconnected subdomains. Such worsening of subdomain quality adversely affects the performance of subsequent computations; for example, it may result in a larger number of iterations for iterative linear solvers to converge. Another shortcoming of present graph partitioning methods is the generation of strongly imbalanced partitions. This shortcoming is most prominent in partitions made by ParMETIS, where the number of vertices in some subdomains can be twice as large as in others. 
The imbalance can cause significant performance problems, especially in exascale computing. To address the above problems, the GridSpiderPar program package for parallel decomposition of large meshes was developed. Two algorithms were implemented in the GridSpiderPar package: a parallel geometric algorithm for mesh partitioning and a parallel incremental algorithm for graph partitioning. The devised parallel algorithms support the two main stages of large mesh partitioning: a preliminary distribution of the mesh among processors, and a high-quality parallel partitioning. Both work with unstructured meshes of up to 10^9 elements. The main advantage of the second, incremental algorithm is that it creates subdomains that are connected by construction. We compared different partitions into microdomains, microdomain graph partitions, and partitions into subdomains of several meshes (10^8 vertices, 10^9 elements) obtained with the GridSpiderPar partitioning tool and with the ParMETIS, Zoltan and PT-SCOTCH packages. The balance of the partitions, the edge-cut, and the number of unconnected subdomains in the different partitions were compared, as well as the computational performance of gas-dynamics simulations run on the different partitions. For the gas-dynamics simulations we used the MARPLE3D research code developed at KIAM RAS. The results demonstrate the advantages of the devised algorithms. The work was supported by RFBR grants 13-01-12073-ofi_m, 14-01-00663-a, 14-07-00712-a, and 15-07-04213-a. |
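The incremental idea can be sketched as follows (our reconstruction of the general principle, not GridSpiderPar's actual algorithm): grow subdomains breadth-first from seed vertices, so that each subdomain is connected by construction:

```python
# Grow partitions breadth-first from seeds; round-robin growth keeps the
# parts roughly balanced, and BFS growth keeps each part connected.

from collections import deque

def grow_connected_parts(adjacency, seeds):
    part = {v: None for v in adjacency}
    queues = [deque([s]) for s in seeds]
    for i, s in enumerate(seeds):
        part[s] = i
    active = True
    while active:
        active = False
        for i, q in enumerate(queues):  # one growth step per part per round
            if q:
                v = q.popleft()
                for w in adjacency[v]:
                    if part[w] is None:
                        part[w] = i
                        q.append(w)
                active = True
    return part
```

On a connected mesh graph every vertex is reached, and each part is a connected region around its seed, which is exactly the property the multilevel tools cannot guarantee.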
14:20 | Scalable Multithreaded Algorithms for Mutable Irregular Data with Application to Anisotropic Mesh Adaptivity SPEAKER: Georgios Rokos ABSTRACT. Thread-level parallelism for applications with irregular data structures and mutable dependencies presents significant challenges because the underlying data is extensively modified during execution of the algorithm and a high degree of parallelism must be realized while keeping the code race-free. In this talk I will describe a new irregular compute methodology for shared-memory parallelism, which guarantees safe parallel execution via processing workitems in batches of independent sets and using a deferred operations strategy to update the underlying data structures in parallel without data contention. Scalability is assisted by creating worklists using atomic operations, a synchronization-free alternative to reduction-based worklist algorithms, and by creating independent sets using an improved graph coloring method which runs up to 1.5x faster than existing techniques. Finally, I will describe some early work on an interrupt-driven work-sharing for-loop scheduler which performs better than existing options. Using this compute methodology and applying it to adaptive mesh algorithms, it is shown that our codes achieve a parallel efficiency of 60% on an 8-core Intel(R) Xeon(R) Sandy Bridge and 40% using 16 cores on a dual-socket Intel(R) Xeon(R) Sandy Bridge ccNUMA system. |
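A basic version of the independent-set machinery can be sketched with greedy colouring (a simple illustration, not the improved method from the talk): every colour class is an independent set, so all work-items of one colour can be processed in parallel without races:

```python
# Greedy graph colouring: assign each vertex the smallest colour not
# used by its neighbours. Each colour class is an independent set.

def greedy_colour(adjacency):
    colour = {}
    for v in sorted(adjacency):  # deterministic visiting order
        taken = {colour[w] for w in adjacency[v] if w in colour}
        c = 0
        while c in taken:
            c += 1
        colour[v] = c
    return colour

def independent_sets(colour):
    # group vertices by colour: items in one set share no edge
    sets = {}
    for v, c in colour.items():
        sets.setdefault(c, set()).add(v)
    return sets
```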
15:05 | Evaluating New Communication Models in the Nek5000 Code for Exascale SPEAKER: Ilya Ivanov ABSTRACT. Nek5000 is a spectral CFD code for modelling incompressible flow in a large spectrum of applications, ranging from nuclear reactor core modelling to oceanography. One of the most important features of the Nek5000 code is its scalability to 1 million cores on IBM Blue Gene machines. Moreover, we have previously presented a case study of porting Nek5000 to parallel GPU-accelerated systems with up to 16384 GPUs [1] and [2]. |
15:30 | Performance of Parallel IO on Lustre and GPFS SPEAKER: David Henty ABSTRACT. File input and output often become a severe bottleneck when parallel applications run on large numbers of processors. Simple methods such as writing a separate file per process or performing all IO via a single master process are no longer feasible at scale. In order to take advantage of the full potential of modern parallel file systems, IO also needs to be done in parallel. In this paper we investigate the performance that can be achieved using MPI-IO, HDF5 and NetCDF. We benchmark two codes: a simple test case using a three-dimensional distributed dataset, and Code_Saturne, a real fluid dynamics application using unstructured meshes. These are benchmarked on a Cray XC30 system with the Lustre file system, and on an IBM BG/Q with GPFS. There are substantial differences between these two architectures: for example, the Lustre IO servers are decoupled from the compute nodes on the XC30, whereas on the BG/Q they are integrated into the same network as the compute nodes. Our current results are mainly from MPI-IO. For both the benchmark and the real code, we find that MPI-IO can improve performance by at least an order of magnitude over more naive IO strategies at large core counts. However, this requires some tuning of parameters. First, MPI-IO must be configured to use collective routines; second, the Lustre file system must be configured to use appropriate parallel striping to ensure the available IO bandwidth scales with system size (this happens automatically on the BG/Q). We are currently extending these studies to HDF5 and NetCDF. Initial results indicate the same general behaviour, i.e. that collective IO is crucial to achieving performance on both systems. By the time of the conference we will have complete sets of data on both systems. 
These results are very important as codes scale up in core count: increased IO bandwidth is required to ensure that overall application performance continues to increase and does not become IO-limited. |
15:55 | HPC and CFD in the Marine Industry: Past, Present and Future SPEAKER: Kurt Mizzi ABSTRACT. This paper explores the use of Computational Fluid Dynamics applications on High Performance Computing (HPC) platforms from the point of view of a user engaged in Naval Architecture research. The paper will consider the significant limitations which were imposed on research boundaries prior to present HPC capabilities, how this impacted development in the field and the implications for industry. One particular example is the costly experimental testing which, due to resource constraints, is generally restricted to model scale. |
16:20 | GPU porting with directives SPEAKER: Adrian Jackson ABSTRACT. CASTEP is a density functional theory software package used for both commercial and academic research. Using ab initio molecular dynamics and first-principles calculations it is able to calculate physical properties of atomistic systems. We will discuss the process of porting such a large code to GPUs, the challenges and benefits of using a directive-based system such as OpenACC, and the performance we can achieve when running simulations on GPUs. As CASTEP is a large, FORTRAN-based simulation code that has been extensively parallelised using MPI, there are significant challenges in porting it to GPUs. Rewriting the full code base in a different programming language or language extension, such as CUDA Fortran, would require a very large amount of work and would potentially create two different versions of the code base, which is not desirable for future development and maintenance. We have therefore investigated using OpenACC, which offers GPU functionality through OpenMP-like directives, to port CASTEP to GPUs. This should enable a common code base to exploit both GPUs and standard processors. However, as OpenACC is a maturing language, it does not yet fully support some of the FORTRAN features used in CASTEP. We will discuss the challenges of porting this code to GPUs, highlight where code modifications were necessary, and describe the design decisions and trade-offs we encountered. We will also present the performance achieved using this strategy, comparing the GPU version of the code with the MPI version and demonstrating that we can obtain performance comparable with a single node of an HPC system. Furthermore, we will discuss the challenges of extending this approach to a multi-GPU parallelisation, fully integrating the functionality with the MPI parallelisation in the code. |