SwePub

Hit list for the search "WFRF:(Pall Szilard)"


  • Results 1-10 of 15
1.
  • Abraham, Mark James, 1977-, et al. (author)
  • GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers
  • 2015
  • In: SoftwareX. - : Elsevier. - 2352-7110. ; 1-2, p. 19-25
  • Journal article (peer-reviewed) abstract
    • GROMACS is one of the most widely used open-source and free software codes in chemistry, used primarily for dynamical simulations of biomolecules. It provides a rich set of calculation types, preparation and analysis tools. Several advanced techniques for free-energy calculations are supported. In version 5, it reaches new performance heights through several new and enhanced parallelization algorithms. These work on every level: SIMD registers inside cores, multithreading, heterogeneous CPU–GPU acceleration, state-of-the-art 3D domain decomposition, and ensemble-level parallelization through built-in replica exchange and the separate Copernicus framework. The latest best-in-class compressed trajectory storage format is supported.
2.
  • Alekseenko, Andrey, 1990-, et al. (author)
  • Comparing the Performance of SYCL Runtimes for Molecular Dynamics Applications
  • 2023
  • In: International Workshop on OpenCL (IWOCL ’23). - : ACM Digital Library. - 9798400707452
  • Conference paper (peer-reviewed) abstract
    • SYCL is a cross-platform, royalty-free standard for programming a wide range of hardware accelerators. It is a powerful and convenient way to write standard C++17 code that can take full advantage of available devices. There are already multiple SYCL implementations targeting a wide range of platforms, from embedded devices to HPC clusters. Since several implementations can target the same hardware, application developers and users must know how to choose the most fitting runtime for their needs. In this talk, we compare the runtime performance of two major SYCL runtimes targeting GPUs, oneAPI DPC++ and Open SYCL [3], to the native implementations for the purposes of GROMACS, a high-performance molecular dynamics engine. Molecular dynamics (MD) applications were among the earliest adopters of GPU acceleration, with force calculations being an obvious target for offloading. MD is an iterative algorithm where, in its most basic form, on each step the forces acting between particles are computed and then the equations of motion are integrated. As the computational power of GPUs grew, the strong-scaling problem became apparent: the biophysical systems modeled with molecular dynamics typically have fixed sizes, and the goal is to perform more time steps, each taking less than a millisecond of wall time. This places high demands on the underlying GPU framework, requiring it to efficiently schedule many small tasks with minimal overhead, making it possible to overlap CPU and GPU work for large systems and to keep the GPU occupied for smaller systems. Another requirement is that application developers have control over the scheduling to optimize for external dependencies, such as MPI communication. GROMACS is a widely used MD engine supporting a wide range of hardware and software platforms, from laptops to the largest supercomputers [1]. Portability and performance across multiple architectures have always been among the primary goals of the project, necessary to keep the code not only efficient but also maintainable. The initial support for NVIDIA accelerators, using CUDA, was added to GROMACS in 2010. Since then, heterogeneous parallelization has been a major target for performance optimization, not limited to NVIDIA devices but later adding support for GPUs of other vendors, as well as Xeon Phi accelerators. GROMACS initially adopted SYCL in its 2021 release to replace its previous GPU portability layer, OpenCL [2]. In further releases, the number of offloading modes supported by the SYCL backend steadily increased. As of GROMACS 2023, SYCL support has achieved near feature parity with CUDA while allowing a single code base to target the GPUs of all three major vendors with minimal specialization. While this clearly supports the portability promise of modern SYCL implementations, the performance of such portable code remains an open question, especially given the strict requirements of MD algorithms. In this talk, we compare the performance of GROMACS across a wide range of system sizes when using the oneAPI DPC++ and Open SYCL runtimes on high-performance NVIDIA, AMD, and Intel GPUs. Besides the analysis of individual kernel performance, we focus on the runtime overhead and the efficiency of task scheduling compared to a highly optimized implementation using the native frameworks, and discuss the possible sources of suboptimal performance and the amount of vendor-specific code, such as intrinsics or workarounds for compiler bugs, required to achieve optimal performance.
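The basic MD iteration the abstract describes — compute forces between particles, then integrate the equations of motion — can be sketched in plain C++. This is a toy velocity-Verlet step with a hypothetical harmonic-chain force, not GROMACS code; all names and the force model are this sketch's own:

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { double x = 0, y = 0, z = 0; };

struct System {
    std::vector<Vec3> r, v, f;  // positions, velocities, forces
    double mass = 1.0;
};

// Hypothetical pair force for illustration: harmonic springs between
// consecutive particles. Real MD codes evaluate nonbonded interactions
// within a cutoff, which dominates the cost and is what gets offloaded.
void compute_forces(System& s) {
    const double k = 1.0;  // spring constant (assumed)
    for (auto& fi : s.f) fi = Vec3{};
    for (std::size_t i = 0; i + 1 < s.r.size(); ++i) {
        Vec3 d{s.r[i + 1].x - s.r[i].x,
               s.r[i + 1].y - s.r[i].y,
               s.r[i + 1].z - s.r[i].z};
        s.f[i].x     += k * d.x; s.f[i].y     += k * d.y; s.f[i].z     += k * d.z;
        s.f[i + 1].x -= k * d.x; s.f[i + 1].y -= k * d.y; s.f[i + 1].z -= k * d.z;
    }
}

// One velocity-Verlet time step: half-kick, drift, new forces, half-kick.
// Assumes compute_forces() was called once before the first step.
void md_step(System& s, double dt) {
    const double h = 0.5 * dt / s.mass;
    for (std::size_t i = 0; i < s.r.size(); ++i) {
        s.v[i].x += h * s.f[i].x;  s.v[i].y += h * s.f[i].y;  s.v[i].z += h * s.f[i].z;
        s.r[i].x += dt * s.v[i].x; s.r[i].y += dt * s.v[i].y; s.r[i].z += dt * s.v[i].z;
    }
    compute_forces(s);
    for (std::size_t i = 0; i < s.r.size(); ++i) {
        s.v[i].x += h * s.f[i].x;  s.v[i].y += h * s.f[i].y;  s.v[i].z += h * s.f[i].z;
    }
}
```

Each call to `md_step` is one iteration of the loop the abstract refers to; production runs execute millions of such steps, which is why sub-millisecond per-step wall times matter.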
3.
  • Alekseenko, Andrey, et al. (author)
  • Experiences with Adding SYCL Support to GROMACS
  • 2021
  • In: IWOCL'21. - New York, NY, USA : Association for Computing Machinery (ACM).
  • Conference paper (peer-reviewed) abstract
    • GROMACS is an open-source, high-performance molecular dynamics (MD) package primarily used for biomolecular simulations, accounting for 5% of HPC utilization worldwide. Due to the extreme computing needs of MD, significant efforts are invested in improving the performance and scalability of simulations. Target hardware ranges from supercomputers to the laptops of individual researchers and volunteers of distributed computing projects such as Folding@Home. The code has been designed both for portability and performance by explicitly adapting algorithms to SIMD and data-parallel processors. A SIMD intrinsic abstraction layer provides high CPU performance. Explicit GPU acceleration has long used CUDA to target NVIDIA devices and OpenCL for AMD/Intel devices. In this talk, we discuss the experiences and challenges of adding support for the SYCL platform to the established GROMACS codebase and share considerations in porting and optimization. While OpenCL offers the benefit of using the same code to target different hardware, it suffers from several drawbacks that add significant development friction. Its separate-source model leads to code duplication and makes changes complicated. The need to use C99 for kernels, while the rest of the codebase uses C++17, exacerbates these issues. Another problem is that OpenCL, while supported by most GPU vendors, is never the main framework and thus does not receive primary support or tuning efforts. SYCL alleviates many of these issues, employing a single-source model based on the modern C++ standard. In addition to being the primary platform for Intel GPUs, the possibility of targeting AMD and NVIDIA GPUs through other implementations (e.g., hipSYCL) might make it possible to reduce the number of separate GPU ports that have to be maintained. Some design differences from OpenCL, such as flow directed acyclic graphs (DAGs) instead of in-order queues, made it necessary to reconsider GROMACS's task-scheduling approach and architectural choices in the GPU backend. Additionally, supporting multiple GPU platforms presents the challenge of balancing performance (low-level, hardware-specific code) and maintainability (more generalization and code reuse). We will discuss the limitations of the existing codebase and interoperability layers with regard to adding the new platform; compute performance and latency comparisons; code quality considerations; and the issues we encountered with the SYCL implementations we tested. Finally, we will discuss our goals for the next release cycle for the SYCL backend and the overall architecture of the GPU acceleration code in GROMACS.
4.
  • Alekseenko, Andrey, 1990-, et al. (author)
  • GROMACS on AMD GPU-Based HPC Platforms: Using SYCL for Performance and Portability
  • 2024
  • In: CUG2024 Proceedings.
  • Conference paper (peer-reviewed) abstract
    • GROMACS is a widely used molecular dynamics software package with a focus on performance, portability, and maintainability across a broad range of platforms. Thanks to its early algorithmic redesign and flexible heterogeneous parallelization, GROMACS has successfully harnessed GPU accelerators for more than a decade. With the diversification of accelerator platforms in HPC and no obvious choice for a well-suited multi-vendor programming model, the GROMACS project found itself at a crossroads. The performance and portability requirements, as well as a strong preference for a standards-based programming model, motivated our choice to use SYCL for production on both new HPC GPU platforms: AMD and Intel. Since the GROMACS 2022 release, the SYCL backend has been the primary means to target AMD GPUs in preparation for exascale HPC architectures like LUMI and Frontier. SYCL is a cross-platform, royalty-free, C++17-based standard for programming hardware accelerators, from embedded to HPC. It allows using the same code to target GPUs from all three major vendors with minimal specialization, which offers major portability benefits. While SYCL implementations build on native compilers and runtimes, whether such an approach is performant is not immediately evident. Biomolecular simulations have challenging performance characteristics: latency sensitivity, the need for strong scaling, and typical iteration times as short as hundreds of microseconds. Hence, obtaining good performance across the range of problem sizes and scaling regimes is particularly challenging. Here, we share the results of our work on readying GROMACS for AMD GPU platforms using SYCL, and demonstrate performance on Cray EX235a machines with MI250X accelerators. Our findings illustrate that portability is possible without major performance compromises. We provide a detailed analysis of node-level kernel and runtime performance with the aim of sharing best practices with the HPC community on using SYCL as a performance-portable GPU framework.
5.
6.
  • Jansson, Niclas, 1983-, et al. (author)
  • Exploring the Ultimate Regime of Turbulent Rayleigh–Bénard Convection Through Unprecedented Spectral-Element Simulations
  • 2023
  • In: SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. - : Association for Computing Machinery (ACM). ; p. 1-9
  • Conference paper (peer-reviewed) abstract
    • We detail our developments in the high-fidelity spectral-element code Neko that are essential for unprecedented large-scale direct numerical simulations of fully developed turbulence. Major innovations are a modular multi-backend design enabling performance portability across a wide range of GPUs and CPUs, a GPU-optimized preconditioner with task overlapping for the pressure-Poisson equation, and in-situ data compression. We carry out initial runs of Rayleigh–Bénard Convection (RBC) at extreme scale on the LUMI and Leonardo supercomputers. We show how Neko is able to strongly scale to 16,384 GPUs and obtain results that are not possible without careful consideration and optimization of the entire simulation workflow. These developments in Neko will help resolve the long-standing question regarding the ultimate regime in RBC.
7.
  • Kutzner, Carsten, et al. (author)
  • Best bang for your buck: GPU nodes for GROMACS biomolecular simulations
  • 2015
  • In: Journal of Computational Chemistry. - : Wiley. - 0192-8651 .- 1096-987X. ; 36:26, p. 1990-2008
  • Journal article (peer-reviewed) abstract
    • The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware, from commodity workstations to high-performance computing clusters. Hardware features are well exploited with a combination of single instruction multiple data, multithreading, and message passing interface (MPI)-based single program multiple data/multiple program multiple data parallelism, while graphics processing units (GPUs) can be used as accelerators to compute interactions off-loaded from the CPU. Here, we evaluate which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked compute nodes with various CPU/GPU combinations to identify optimal compositions in terms of raw trajectory production rate, performance-to-price ratio, energy efficiency, and several other criteria. Although hardware prices are naturally subject to trends and fluctuations, general tendencies are clearly visible. Adding any type of GPU significantly boosts a node's simulation performance. For inexpensive consumer-class GPUs, this improvement is equally reflected in the performance-to-price ratio. Although memory issues in consumer-class GPUs could pass unnoticed, as these cards do not support error-checking and correction (ECC) memory, unreliable GPUs can be sorted out with memory-checking tools. Apart from the obvious determinants of cost-efficiency, like hardware expenses and raw performance, the energy consumption of a node is a major cost factor. Over the typical hardware lifetime of a few years until replacement, the costs for electrical power and cooling can become larger than the costs of the hardware itself. Taking that into account, nodes with a well-balanced ratio of CPU and consumer-class GPU resources produce the maximum amount of GROMACS trajectory over their lifetime.
8.
  • Kutzner, Carsten, et al. (author)
  • More bang for your buck: Improved use of GPU nodes for GROMACS 2018
  • 2019
  • In: Journal of Computational Chemistry. - : Wiley. - 0192-8651 .- 1096-987X. ; 40:27, p. 2418-2431
  • Journal article (peer-reviewed) abstract
    • We identify hardware that is optimal for producing molecular dynamics (MD) trajectories on Linux compute clusters with the GROMACS 2018 simulation package. To that end, we benchmark the GROMACS performance on a diverse set of compute nodes and relate it to the costs of the nodes, which may include their lifetime costs for energy and cooling. In agreement with our earlier investigation using GROMACS 4.6 on hardware of 2014, the performance-to-price ratio of consumer GPU nodes is considerably higher than that of CPU nodes. However, with GROMACS 2018, the optimal CPU-to-GPU processing power balance has shifted even more toward the GPU. Hence, nodes optimized for GROMACS 2018 and later versions enable a significantly higher performance-to-price ratio than nodes optimized for older GROMACS versions. Moreover, the shift toward GPU processing makes it possible to cheaply upgrade old nodes with recent GPUs, yielding essentially the same performance as comparable brand-new hardware.
9.
  • Páll, Szilard, et al. (author)
  • A flexible algorithm for calculating pair interactions on SIMD architectures
  • 2013
  • In: Computer Physics Communications. - : Elsevier BV. - 0010-4655 .- 1879-2944. ; 184:12, p. 2641-2650
  • Journal article (peer-reviewed) abstract
    • Calculating interactions or correlations between pairs of particles is typically the most time-consuming task in particle simulation or correlation analysis. Straightforward implementations using a double loop over particle pairs have traditionally worked well, especially since compilers usually do a good job of unrolling the inner loop. In order to reach high performance on modern CPU and accelerator architectures, single-instruction multiple-data (SIMD) parallelization has become essential. Avoiding memory bottlenecks is also increasingly important and requires reducing the ratio of memory to arithmetic operations. Moreover, when pairs only interact within a certain cut-off distance, good SIMD utilization can only be achieved by reordering input and output data, which quickly becomes a limiting factor. Here we present an algorithm for SIMD parallelization based on grouping a fixed number of particles, e.g. 2, 4, or 8, into spatial clusters. Calculating all interactions between particles in a pair of such clusters improves data reuse compared to the traditional scheme and results in a more efficient SIMD parallelization. Adjusting the cluster size allows the algorithm to map to SIMD units of various widths. This flexibility not only enables fast and efficient implementation on current CPUs and accelerator architectures like GPUs or Intel MIC, but it also makes the algorithm future-proof. We present the algorithm with an application to molecular dynamics simulations, where we can also make use of the effective buffering the method introduces.
10.
  • Pall, Szilard, et al. (author)
  • Advances in the OpenCL offload support in GROMACS
  • 2019
  • In: Proceedings of the International Workshop on OpenCL (IWOCL'19). - New York, NY, USA : Association for Computing Machinery (ACM).
  • Conference paper (peer-reviewed) abstract
    • GROMACS is a molecular dynamics (MD) simulation package widely used in research and education on machines ranging from laptops to workstations to the largest supercomputers. Built on a highly portable, free and open-source codebase, GROMACS is known to have among the fastest simulation engines, thanks to highly tuned kernels for more than a dozen processor architectures. For CPU architectures it relies on SIMD intrinsics-based code, while for GPUs, besides the dominant CUDA platform, OpenCL is also supported on NVIDIA, AMD, and Intel GPUs and is actively developed. This talk aims to present the recent advances in improved offload capabilities and broader platform support of the GROMACS OpenCL codebase. With a long history of CUDA support, in an effort to maintain portability to platforms alternative to the dominant accelerator platform, an OpenCL port was developed four years ago and has been used successfully, predominantly on AMD GPUs. Despite the modest user base, recent efforts have focused on achieving feature parity with the CUDA codebase. The offload of additional computation (the particle mesh Ewald solver) aims to compensate for the shift in the performance advantage of GPUs and the resulting runtime imbalance, as well as to better support dense accelerator nodes. Performance improvements of up to 1.5x can be seen on workstations equipped with AMD Vega GPUs. Additionally, platform support has been expanded to Intel iGPUs. Tweaks to the underlying pair-interaction algorithm setup were necessary to reach good performance. We observe a 5-25% performance benefit in an asynchronous offload scenario running concurrently on both the CPU cores and the iGPU, compared to only using the highly tuned SIMD intrinsics code on the CPU cores. By leaving a larger fraction of the limited power budget of a mobile processor to the iGPU, application performance improved, which suggests that a configurable TDP allocation to match the computational load with the hardware balance would be beneficial. Such results will become especially useful as most future high-performance processor architectures will increase integration and feature on-chip heterogeneity, with different components more or less well suited to different parts of an HPC application.