SwePub - search: WFRF:(Jansson Niclas 1983 )

No.	Reference	Cover image	Find
1.	Hoffman, Johan, 1974-, et al. (author) Computability and Adaptivity in CFD 2018 In: Encyclopedia of Computational Mechanics. - : John Wiley & Sons. Book chapter (peer-reviewed)
2.	Hoffman, Johan, et al. (author) Computation of slat noise sources using adaptive FEM and Lighthill's analogy 2013 In: 19th AIAA/CEAS Aeroacoustics Conference. Conference paper (peer-reviewed) Abstract: This is a summary of preliminary results from simulations with the 30P30N high-lift device. We used the General Galerkin finite element method (G2), where no explicit subgrid model is used, and where the computational mesh is adaptively refined with respect to a posteriori error estimates for a quantity of interest. The mesh is fully unstructured and the solutions are time-resolved, which are key ingredients for solving challenging industrial applications in the field of aeroacoustics. We present preliminary results containing time-averaged quantities and snapshots of unsteady quantities, all agreeing reasonably well with previous computational efforts. One important finding is that the use of adaptively generated meshes seems to be a more efficient way of computing aeroacoustic sources than using "handmade" meshes.
3.	Hoffman, Johan, 1974-, et al. (author) Turbulent flow and Fluid–structure interaction 2012 In: Lecture Notes in Computational Science and Engineering. - : Springer Science and Business Media Deutschland GmbH. ; , pp. 543-552 Book chapter (peer-reviewed) Abstract: The FEniCS Project aims towards the goals of generality, efficiency, and simplicity, concerning mathematical methodology, implementation and application, and the Unicorn project is an implementation aimed at FSI and high Re turbulent flow guided by these principles. Unicorn is based on the DOLFIN/FFC/FIAT suite and the linear algebra package PETSc. We here present some key elements of Unicorn, and a set of computational results from applications. The details of the Unicorn implementation are described in Chapter 18.
4.	Hoffman, Johan, 1974-, et al. (author) Unicorn : A unified continuum mechanics solver 2012 In: Lecture Notes in Computational Science and Engineering. - : Springer Science and Business Media Deutschland GmbH. ; , pp. 339-361 Book chapter (peer-reviewed) Abstract: This chapter provides a description of the technology of Unicorn focusing on simple, efficient and general algorithms and software for the Unified Continuum (UC) concept and the adaptive General Galerkin (G2) discretization as a unified approach to continuum mechanics. We describe how Unicorn fits into the FEniCS framework, how it interfaces to other FEniCS components, what interfaces and functionality Unicorn provides itself and how the implementation is designed. We also present some examples in fluid–structure interaction and adaptivity computed with Unicorn.
5.	Jansson, Niclas, 1983- (author) High Performance Adaptive Finite Element Methods : With Applications in Aerodynamics 2013 Doctoral thesis (other academic/artistic) Abstract: The massive computational cost for resolving all scales in a turbulent flow makes a direct numerical simulation of the underlying Navier-Stokes equations impossible in most engineering applications. Recent advances in adaptive finite element methods offer a new powerful tool in Computational Fluid Dynamics (CFD). The computational cost for simulating turbulent flow can be minimized by adaptive resolution of the mesh, based on a posteriori error estimation. Such adaptive methods have previously been implemented for efficient serial computations, but the extension to an efficient parallel solver is a challenging task. This work concerns the development of an adaptive finite element method that enables efficient computation of time-resolved approximations of turbulent flow for complex geometries with a posteriori error control. We present efficient data structures and data decomposition methods for distributed unstructured tetrahedral meshes. Our work also concerns an efficient parallelization of local mesh refinement methods such as recursive longest edge bisection, and the development of an a priori predictive dynamic load balancing method, based on a weighted dual graph. We also address the challenges of emerging supercomputer architectures with the development of new hybrid parallel programming models, combining traditional message passing with lightweight one-sided communication. Our implementation has proven to be both general and efficient, scaling up to more than twelve thousand cores.
6.	Spühler, Jeannette Hiromi, 1981-, et al. (author) 3D Fluid-Structure Interaction Simulation of Aortic Valves Using a Unified Continuum ALE-FEM Model Other publication (other academic/artistic)
7.	Spühler, Jeannette H., et al. (author) 3D Fluid-Structure Interaction Simulation of Aortic Valves Using a Unified Continuum ALE FEM Model 2018 In: Frontiers in Physiology. - : Frontiers Media S.A.. - 1664-042X. ; 9 Journal article (peer-reviewed) Abstract: Due to advances in medical imaging, computational fluid dynamics algorithms and high performance computing, computer simulation is developing into an important tool for understanding the relationship between cardiovascular diseases and intraventricular blood flow. The field of cardiac flow simulation is challenging and highly interdisciplinary. We apply a computational framework for automated solutions of partial differential equations using Finite Element Methods where any mathematical description can be translated directly to code. This allows us to develop a cardiac model where specific properties of the heart such as fluid-structure interaction of the aortic valve can be added in a modular way without extensive efforts. In previous work, we simulated the blood flow in the left ventricle of the heart. In this paper, we extend this model by placing prototypes of both a native and a mechanical aortic valve in the outflow region of the left ventricle. Numerical simulation of the blood flow in the vicinity of the valve offers the possibility to improve the treatment of aortic valve diseases such as aortic stenosis (narrowing of the valve opening) or regurgitation (leaking) and to optimize the design of prosthetic heart valves in a controlled and specific way. The fluid-structure interaction and contact problem are formulated in a unified continuum model using the conservation laws for mass and momentum and a phase function. The discretization is based on an Arbitrary Lagrangian-Eulerian space-time finite element method with streamline diffusion stabilization, and it is implemented in the open source software Unicorn which shows near optimal scaling up to thousands of cores. Computational results are presented to demonstrate the capability of our framework.
8.	Spühler, Jeannette Hiromi, 1981-, et al. (author) A High Performance Computing Framework for Finite Element Simulation of Blood Flow in the Left Ventricle of the Human Heart 2020 In: Lecture Notes in Computational Science and Engineering. - Cham : Springer. ; , pp. 155-164 Conference paper (peer-reviewed) Abstract: We present a high performance computing framework for finite element simulation of blood flow in the left ventricle of the human heart. The mathematical model is described together with the discretization method and the parallel implementation in Unicorn which is part of the open source software framework FEniCS-HPC. We show results based on patient-specific data that capture essential features observed with other computational models and imaging techniques, and thus indicate that our framework possesses the potential to provide relevant clinical information for diagnosis and medical treatment. Several other studies have been conducted to simulate the three dimensional blood flow in the left ventricle of the human heart with prescribed wall movement. Our contribution to the field of cardiac research lies in establishing an open source framework modular both in modelling and numerical algorithms.
9.	Andersson, Måns (author) Leveraging Intermediate Representations for High-Performance Portable Discrete Fourier Transform Frameworks : with Application to Molecular Dynamics 2023 Licentiate thesis (other academic/artistic) Abstract: The Discrete Fourier Transform (DFT) and its improved formulations, the Fast Fourier Transforms (FFTs), are vital for scientists and engineers in a range of domains from signal processing to the solution of partial differential equations. A growing trend in Scientific Computing is heterogeneous computing, where accelerators are used instead of or together with CPUs. This has led to problems for developers in unifying portability, performance, and productivity. This thesis first motivates this work by showing the importance of having efficient DFT calculations, and then describes the DFT algorithm and a formulation based on matrix factorizations which has been developed to formulate FFT algorithms and express their parallelism to exploit modern computer architectures, such as accelerators. The first paper is a motivating study of the breakdown of the performance and scalability of the high-performance Molecular Dynamics code GROMACS where DFT calculations are a main performance bottleneck. In particular, the long-range interactions are solved with the Particle-Mesh Ewald algorithm which uses a three-dimensional Fast Fourier Transform. The two following papers present two approaches to leverage factorization with the help of two different frameworks using Intermediate Representation and compiler technology, for the development of fast and portable code. The second paper presents a front-end and a pipeline for code generation in a domain-specific language based on Multi-Level Intermediate Representation (MLIR) for developing Fast Fourier Transform libraries. The last paper investigates and optimizes an implementation of an important kernel within the matrix-factorization framework: the batched DFT. It is implemented with data-centric programming and a data-centric intermediate representation called Stateful Dataflow multi-graphs (SDFG). The paper evaluates strategies for complex-valued data layout for performance and portability and we show that there is a trade-off between portability and maintainability in using the native complex data type and that an SDFG-level abstraction could be beneficial for developing higher-level applications.
10.	Atzori, Marco, et al. (author) In-situ visualization of large-scale turbulence simulations in Nek5000 with ParaView Catalyst 2021 Report (other academic/artistic) Abstract: In-situ visualization on HPC systems allows us to analyze simulation results that would otherwise be impossible, given the size of the simulation data sets and offline post-processing execution time. We design and develop in-situ visualization with ParaView Catalyst in Nek5000, a massively parallel Fortran and C code for computational fluid dynamics applications. We perform strong scalability tests up to 2,048 cores on KTH's Beskow Cray XC40 supercomputer and assess in-situ visualization's impact on the Nek5000 performance. In our study case, a high-fidelity simulation of turbulent flow, we observe that in-situ operations significantly limit the strong scalability of the code, reducing the relative parallel efficiency to only ~21% on 2,048 cores (the relative efficiency of Nek5000 without in-situ operations is ~99%). Through profiling with Arm MAP, we identified a bottleneck in the image composition step (that uses the Radix-kr algorithm) where a majority of the time is spent on MPI communication. We also identified an imbalance of in-situ processing time between rank 0 and all other ranks. Better scaling and load-balancing in the parallel image composition would considerably improve the performance and scalability of Nek5000 with in-situ capabilities in large-scale simulation.
11.	Atzori, Marco, 1992-, et al. (author) In situ visualization of large-scale turbulence simulations in Nek5000 with ParaView Catalyst 2022 In: Journal of Supercomputing. - : Springer. - 0920-8542 .- 1573-0484. ; 78:3, pp. 3605-3620 Journal article (peer-reviewed) Abstract: In situ visualization on high-performance computing systems allows us to analyze simulation results that would otherwise be impossible, given the size of the simulation data sets and offline post-processing execution time. We develop an in situ adaptor for ParaView Catalyst and Nek5000, a massively parallel Fortran and C code for computational fluid dynamics. We perform a strong scalability test up to 2048 cores on KTH’s Beskow Cray XC40 supercomputer and assess in situ visualization’s impact on the Nek5000 performance. In our study case, a high-fidelity simulation of turbulent flow, we observe that in situ operations significantly limit the strong scalability of the code, reducing the relative parallel efficiency to only ≈ 21 % on 2048 cores (the relative efficiency of Nek5000 without in situ operations is ≈ 99 %). Through profiling with Arm MAP, we identified a bottleneck in the image composition step (that uses the Radix-kr algorithm) where a majority of the time is spent on MPI communication. We also identified an imbalance of in situ processing time between rank 0 and all other ranks. In our case, better scaling and load-balancing in the parallel image composition would considerably improve the performance of Nek5000 with in situ capabilities. In general, the result of this study highlights the technical challenges posed by the integration of high-performance simulation codes and data-analysis libraries and their practical use in complex cases, even when efficient algorithms already exist for a certain application scenario.
12.	Bale, Rahul, et al. (author) Stencil Penalty approach based constraint immersed boundary method 2020 In: Computers & Fluids. - : Elsevier BV. - 0045-7930 .- 1879-0747. ; 2000, p. 104457- Journal article (peer-reviewed) Abstract: The constraint-based immersed boundary (cIB) method has been shown to be accurate for low to moderate Reynolds number (Re) flows when the immersed body constraint is imposed as a volumetric constraint force. When the IB is modelled as a zero-thickness interface, where it is no longer possible to model a volumetric constraint force, we found that cIB is not able to produce accurate results. The main source of inaccuracies in the cIB method is the distribution of the pressure field around the IB surface. An IB surface results in a jump in the pressure field across the IB. Evaluation of the discrete gradient of pressure close to the IB leads to a pressure gradient that does not satisfy the Neumann boundary condition for pressure at the IB. Furthermore, a non-zero discrete pressure gradient on the IB results in spurious flow at grid points close to the IB. We present a novel numerical formulation which adapts the cIB formulation for ‘zero-thickness’ immersed bodies. In order to impose the Neumann boundary condition on pressure on the IB more accurately, we introduce an additional body force to the momentum equation. A WENO based stencil penalization technique is used to define the new force term. Due to the more accurate imposition of the Neumann pressure boundary condition on the IB, spurious flow is reduced and the accuracy of the no-penetration velocity boundary condition on the IB is improved.
13.	Chien, Steven W.D., et al. (author) Improving Cloud Storage Network Bandwidth Utilization of Scientific Applications 2023 In: Proceedings of the 7th Asia-Pacific Workshop on Networking, APNET 2023. - : Association for Computing Machinery (ACM). ; , pp. 172-173 Conference paper (peer-reviewed) Abstract: Cloud providers began to provide managed services to attract scientific applications, which have been traditionally executed on supercomputers. One example is AWS FSx for Lustre, a fully managed parallel file system (PFS) released in 2018. However, due to the nature of scientific applications, the frontend storage network bandwidth is left completely idle for the majority of its lifetime. Furthermore, the pricing model does not match the scalability requirement. We propose iFast, a novel host-side caching mechanism for scientific applications that improves storage bandwidth utilization and end-to-end application performance by overlapping compute and data writeback through inexpensive local storage. iFast supports the Message Passing Interface (MPI) library that is widely used by scientific applications and is implemented as a preloaded library. It requires no change to applications or the MPI library, and no support from cloud operators. We demonstrate how iFast can accelerate the end-to-end time of a representative scientific application, Neko, by 13-40%.
14.	Jansson, Niclas, 1983- (author) A Hybrid MPI+PGAS Approach to Improve Strong Scalability Limits of Finite Element Solvers 2020 In: Proceedings - IEEE International Conference on Cluster Computing, ICCC. - : Institute of Electrical and Electronics Engineers (IEEE). ; , pp. 303-313 Conference paper (peer-reviewed) Abstract: Current finite element codes scale reasonably well as long as each core has a sufficient amount of local work that can balance communication costs. However, achieving efficient performance at exascale will require unreasonably large problem sizes, in particular for low-order methods, where the small amount of work per element already is a limiting factor on current post petascale machines. Key bottlenecks for these methods are sparse matrix assembly, where communication latency starts to limit performance as the number of cores increases, and linear solvers, where efficient overlapping is necessary to amortize communication and synchronization cost of sparse matrix vector multiplication and dot products. We present our work on improving strong scalability limits of message passing based general low-order finite element based solvers. Using lightweight one-sided communication offered by partitioned global address space languages (PGAS), we demonstrate that the scalability of performance critical, latency sensitive sparse matrix assembly can achieve almost an order of magnitude better scalability. Linear solvers are also addressed via a signaling put algorithm for low-cost point-to-point synchronization, achieving similar performance as message passing based linear solvers. We introduce a new hybrid MPI+PGAS implementation of the open source general finite element framework FEniCS, replacing the linear algebra backend with a new library written in Unified Parallel C (UPC). A detailed description of the implementation and the hybrid interface to FEniCS is given, and the feasibility of the approach is demonstrated via a performance study of the hybrid implementation on Cray XC40 machines.
15.	Jansson, Niclas, 1983-, et al. (author) CUBE: A scalable framework for large-scale industrial simulations 2019 In: The international journal of high performance computing applications. - : Sage Publications. - 1094-3420 .- 1741-2846. ; 33:4, pp. 678-698 Journal article (peer-reviewed)
16.	Jansson, Niclas, 1983-, et al. (author) Design of Neko - A Scalable High-Fidelity Simulation Framework with Extensive Accelerator Support Other publication (other academic/artistic)
17.	Jansson, Niclas, 1983-, et al. (author) Exploring the Ultimate Regime of Turbulent Rayleigh–Bénard Convection Through Unprecedented Spectral-Element Simulations 2023 In: SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. - : Association for Computing Machinery (ACM). ; , pp. 1-9 Conference paper (peer-reviewed) Abstract: We detail our developments in the high-fidelity spectral-element code Neko that are essential for unprecedented large-scale direct numerical simulations of fully developed turbulence. Major innovations are modular multi-backend design enabling performance portability across a wide range of GPUs and CPUs, a GPU-optimized preconditioner with task overlapping for the pressure-Poisson equation and in-situ data compression. We carry out initial runs of Rayleigh–Bénard Convection (RBC) at extreme scale on the LUMI and Leonardo supercomputers. We show how Neko is able to strongly scale to 16,384 GPUs and obtain results that are not possible without careful consideration and optimization of the entire simulation workflow. These developments in Neko will help resolving the long-standing question regarding the ultimate regime in RBC.
18.	Jansson, Niclas, 1983-, et al. (author) Neko: A modern, portable, and scalable framework for high-fidelity computational fluid dynamics 2024 In: Computers & Fluids. - : Elsevier BV. - 0045-7930 .- 1879-0747. ; 275, pp. 106243-106243 Journal article (peer-reviewed) Abstract: Computational fluid dynamics (CFD), in particular applied to turbulent flows, is a research area with great engineering and fundamental physical interest. However, already at moderately high Reynolds numbers the computational cost becomes prohibitive as the range of active spatial and temporal scales is quickly widening. Specifically scale-resolving simulations, including large-eddy simulation (LES) and direct numerical simulations (DNS), thus need to rely on modern efficient numerical methods and corresponding software implementations. Recent trends and advancements, including more diverse and heterogeneous hardware in High-Performance Computing (HPC), are challenging software developers in their pursuit of good performance and numerical stability. The well-known maxim “software outlives hardware” may no longer necessarily hold true, and developers are today forced to re-factor their codebases to leverage these powerful new systems. In this paper, we present Neko, a new portable framework for high-order spectral element discretization, targeting turbulent flows in moderately complex geometries. Neko is fully available as open software. Unlike prior works, Neko adopts a modern object-oriented approach in Fortran 2008, allowing multi-tier abstractions of the solver stack and facilitating hardware backends ranging from general-purpose processors (CPUs) down to exotic vector processors and FPGAs. We show that Neko’s performance and accuracy are comparable to NekRS, and thus on-par with Nek5000’s successor on modern CPU machines. Furthermore, we develop a performance model, which we use to discuss challenges and opportunities for high-order solvers on emerging hardware.
19.	Jansson, Niclas, 1983- (author) Spectral Element Simulations on the NEC SX-Aurora TSUBASA 2021 In: HPC Asia 2021: The International Conference on High Performance Computing in Asia-Pacific Region. - New York, NY, USA : Association for Computing Machinery (ACM). Conference paper (peer-reviewed) Abstract: Following the recent transition in the high performance computing landscape to more heterogeneous architectures, application developers are faced with the challenge of ensuring good performance across a diverse set of platforms. In this paper, we present our work on porting the spectral element code Nek5000 to the recent vector architecture SX-Aurora TSUBASA. Using Nek5000's mini-app Nekbone, we formulate suitable loop transformations in key kernels, allowing for better vectorization, increasing the baseline performance by a factor of six. Using the new transformations, we demonstrate that the main compute intensive matrix-vector and matrix-matrix multiplication kernels achieve close to half the peak performance of an SX-Aurora core. Our work also addresses the gather-scatter operations, a key kernel for efficient matrix-free spectral element formulation. We introduce a new implementation of Nek5000's gather-scatter library with mesh topology awareness for improved vectorization via exploitation of the SX-Aurora's hardware gather-scatter instructions, improving performance by up to 116%. A detailed description of the implementation is given together with a performance study, comparing both single node performance and strong scalability characteristics, running across multiple SX-Aurora cards.
20.	Ju, Yi, et al. (author) In-Situ Techniques on GPU-Accelerated Data-Intensive Applications 2023 In: Proceedings 2023 IEEE 19th International Conference on e-Science, e-Science 2023. - : Institute of Electrical and Electronics Engineers (IEEE). Conference paper (peer-reviewed) Abstract: The computational power of High-Performance Computing (HPC) systems is constantly increasing; however, their input/output (I/O) performance grows relatively slowly, and their storage capacity is also limited. This imbalance presents significant challenges for applications such as Molecular Dynamics (MD) and Computational Fluid Dynamics (CFD), which generate massive amounts of data for further visualization or analysis. At the same time, checkpointing is crucial for long runs on HPC clusters, due to limited walltimes and/or failures of system components, and typically requires the storage of large amounts of data. Thus, restricted I/O performance and storage capacity can lead to bottlenecks for the performance of full application workflows (as compared to computational kernels without I/O). In-situ techniques, where data is further processed while still in memory rather than written out over the I/O subsystem, can help to tackle these problems. In contrast to traditional post-processing methods, in-situ techniques can reduce or avoid the need to write or read data via the I/O subsystem. They offer a promising approach for applications aiming to leverage the full power of large scale HPC systems. In-situ techniques can also be applied to hybrid computational nodes on HPC systems consisting of graphics processing units (GPUs) and central processing units (CPUs). On one node, the GPUs would have significant performance advantages over the CPUs. Therefore, current approaches for GPU-accelerated applications often focus on maximizing GPU usage, leaving CPUs underutilized. In-situ tasks that use CPUs to perform data analysis or preprocess data concurrently with the running simulation offer a possibility to improve this underutilization.
21.	Karp, Martin, 1996-, et al. (author) Appendix to High-Performance Spectral Element Methods on Field-Programmable Gate Arrays 2020 Other publication (other academic/artistic) Abstract: In this Appendix we display some results we omitted from our article "High-Performance Spectral Element Methods on Field-Programmable Gate Arrays". In particular we showcase the measured bandwidth for the FPGA we used (Stratix 10) as well as the performance for our accelerator at different stages of optimization. In addition to this, we illustrate more practical aspects of our performance/resource modeling. Improvements in computer systems have historically relied on two well-known observations: Moore's law and Dennard's scaling. Today, both these observations are ending, forcing computer users, researchers, and practitioners to abandon the comforts of general-purpose architectures in favor of emerging post-Moore systems. Among the most salient of these post-Moore systems is the Field-Programmable Gate Array (FPGA), which strikes a good balance between complexity and performance. In this paper, we study modern FPGAs' applicability for use in accelerating the Spectral Element Method (SEM) core to many computational fluid dynamics (CFD) applications. We design a custom SEM hardware accelerator that we empirically evaluate on the latest Stratix 10 SX-series FPGAs and position its performance (and power-efficiency) against state-of-the-art systems such as ARM ThunderX2, NVIDIA Pascal/Volta/Ampere Tesla-series cards, and general-purpose manycore CPUs. Finally, we develop a performance model for our SEM-accelerator, which we use to project the performance and role of future FPGAs to accelerate CFD applications, ultimately answering the question: what characteristics would a perfect FPGA for CFD applications have?
22.	Karp, Martin, 1996- (author) Direct Numerical Simulation of Turbulence on Heterogeneous Computer Systems : Architectures, Algorithms, and Applications 2024 Doctoral thesis (other academic/artistic) Abstract: Direct numerical simulations (DNS) of turbulence have a virtually unbounded need for computing power. To carry out these simulations, software, computer architectures, and algorithms must operate as efficiently as possible to amortize the large computational cost. However, in a computing landscape increasingly incorporating heterogeneous computer systems, changes are necessary. In this thesis, we consider how DNS can be carried out efficiently on upcoming heterogeneous computer systems. This work relates to developing algorithms for upcoming heterogeneous computer architectures, overcoming software challenges associated with large-scale DNS on these platforms, and applying these developments to new flow cases that were previously too costly to carry out. We consider in particular the spectral element method for DNS and evaluate how this method maps to field-programmable gate arrays, graphics processing units, as well as conventional processors. We also consider the issue of trading arithmetic operations for less communication, reducing the cost of solving the linear systems that arise in the spectral element method. Our developments are incorporated into the spectral element framework Neko, enabling Neko to strong-scale efficiently on the largest supercomputers in the world. Finally, we have carried out several DNS such as the simulation of a Flettner rotor in a turbulent boundary layer and the simulation of Rayleigh-Bénard convection at very high Rayleigh numbers. The developments in this thesis enable the high-fidelity simulation of turbulence on emerging computer systems with high parallel efficiency and performance.
23.	Karp, Martin, 1996-, et al. (author) Experience and Analysis of Scalable High-Fidelity Computational Fluid Dynamics on Modular Supercomputing Architectures Other publication (other academic/artistic) Abstract: The never-ending computational demand from simulations of turbulence makes computational fluid dynamics (CFD) a prime application use case for current and future exascale systems. High-order finite element methods, such as the spectral element method, have been gaining traction as they offer high performance on both multicore CPUs and modern GPU-based accelerators. In this work, we assess how high-fidelity CFD using the spectral element method can exploit the modular supercomputing architecture (MSA) at scale through domain partitioning, where the computational domain is split between GPUs and CPUs. We investigate several different flow cases and computer systems based on the MSA. We observe that for our simulations, the communication overhead and load balancing issues incurred by incorporating different computing architectures are seldom worthwhile, especially when I/O is also considered, but when the simulation at hand requires more than the combined global memory on the GPUs, utilizing additional CPUs to increase the available memory can be fruitful. We support our results with a simple performance model to assess when running across modules might be beneficial. For a smaller supercomputer where the computation takes significant amounts of time on the CPU module, it can be beneficial to also use a GPU module to decrease the execution time significantly.
24.	Karp, Martin, 1996-, et al. (author) High-Performance Spectral Element Methods on Field-Programmable Gate Arrays : Implementation, Evaluation, and Future Projection 2021 In: Proceedings of the 35th IEEE International Parallel & Distributed Processing Symposium, May 17-21, 2021 Portland, Oregon, USA. - : Institute of Electrical and Electronics Engineers (IEEE). Conference paper (peer-reviewed) Abstract: Improvements in computer systems have historically relied on two well-known observations: Moore's law and Dennard's scaling. Today, both these observations are ending, forcing computer users, researchers, and practitioners to abandon the comforts of general-purpose architectures in favor of emerging post-Moore systems. Among the most salient of these post-Moore systems is the Field-Programmable Gate Array (FPGA), which strikes a convenient balance between complexity and performance. In this paper, we study modern FPGAs' applicability in accelerating the Spectral Element Method (SEM) core to many computational fluid dynamics (CFD) applications. We design a custom SEM hardware accelerator operating in double-precision that we empirically evaluate on the latest Stratix 10 GX-series FPGAs and position its performance (and power-efficiency) against state-of-the-art systems such as ARM ThunderX2, NVIDIA Pascal/Volta/Ampere Tesla-series cards, and general-purpose manycore CPUs. Finally, we develop a performance model for our SEM-accelerator, which we use to project future FPGAs' performance and role to accelerate CFD applications, ultimately answering the question: what characteristics would a perfect FPGA for CFD applications have?
25.	Karp, Martin, 1996-, et al. (authors) Large-scale direct numerical simulations of turbulence using GPUs and modern Fortran 2023 In: The international journal of high performance computing applications. - : SAGE Publications. - 1094-3420 .- 1741-2846. ; , pp. 109434202311586- Journal article (peer-reviewed). Abstract: We present our approach to making direct numerical simulations of turbulence with applications in sustainable shipping. We use modern Fortran and the spectral element method to leverage and scale on supercomputers powered by the Nvidia A100 and the recent AMD Instinct MI250X GPUs, while still providing support for user software developed in Fortran. We demonstrate the efficiency of our approach by performing the world’s first direct numerical simulation of the flow around a Flettner rotor at Re = 30,000 and its interaction with a turbulent boundary layer. We present a performance comparison between the AMD Instinct MI250X and Nvidia A100 GPUs for scalable computational fluid dynamics. Our results show that one MI250X offers performance on par with two A100 GPUs and has a similar power efficiency based on readings from on-chip energy sensors.
26.	Karp, Martin, 1996-, et al. (authors) Optimization of Tensor-product Operations in Nekbone on GPUs 2020 Conference paper (peer-reviewed). Abstract: In the CFD solver Nek5000, the computation is dominated by the evaluation of small tensor operations. Nekbone is a proxy app for Nek5000 and has previously been ported to GPUs with a mixed OpenACC and CUDA approach. In this work, we continue this effort and optimize the main tensor-product operation in Nekbone further. Our optimization is done in CUDA and uses a different, 2D, thread structure to make the computations layer by layer. This enables us to use loop unrolling as well as utilize registers and shared memory efficiently. Our implementation is then compared on both the Pascal and Volta GPU architectures to previous GPU versions of Nekbone as well as a measured roofline. The results show that our implementation outperforms previous GPU Nekbone implementations by 6-10%. Compared to the measured roofline, we obtain 77-92% of the peak performance for both Nvidia P100 and V100 GPUs for inputs with 1024-4096 elements and polynomial degree 9.
27.	Karp, Martin, 1996-, et al. (authors) Reducing Communication in the Conjugate Gradient Method : A Case Study on High-Order Finite Elements 2022 In: Proceedings of the Platform for Advanced Scientific Computing Conference, PASC 2022. - New York, NY, USA : Association for Computing Machinery (ACM). Conference paper (peer-reviewed). Abstract: Currently, a major bottleneck for several scientific computations is communication, both communication between different processors, so-called horizontal communication, and vertical communication between different levels of the memory hierarchy. With this bottleneck in mind, we target a notoriously communication-bound solver at the core of many high-performance applications, namely the conjugate gradient method (CG). To reduce the communication we present lower bounds on the vertical data movement in CG and go on to make a CG solver with reduced data movement. Using our theoretical analysis we apply our CG solver on a high-performance discretization used in practice, the spectral element method (SEM). Guided by our analysis, we show that for the Poisson equation on modern GPUs we can improve the performance by 30% by both rematerializing the discrete system and by reformulating the system to work on unique degrees of freedom. In order to investigate how horizontal communication can be reduced, we compare CG to two communication-reducing techniques, namely communication-avoiding and pipelined CG. We strong scale up to 4096 CPU cores and showcase performance improvements of upwards of 70% for pipelined CG compared to standard CG when applied on SEM at scale. We show that in addition to improving the scaling capabilities of the solver, initial measurements indicate that the convergence of SEM is largely unaffected by pipelined CG.
28.	Karp, Martin, 1996-, et al. (authors) Uncertainty Quantification of Reduced-Precision Time Series in Turbulent Channel Flow 2023 In: Proceedings of 2023 SC Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC Workshops 2023. - : Association for Computing Machinery (ACM). ; , pp. 387-390 Conference paper (peer-reviewed). Abstract: With increased computational power through the use of arithmetic in low-precision, a relevant question is how lower precision affects simulation results, especially for chaotic systems where analytical round-off estimates are non-trivial to obtain. In this work, we consider how the uncertainty of the time series of a direct numerical simulation of turbulent channel flow at Re_τ = 180 is affected when restricted to a reduced-precision representation. We utilize a non-overlapping batch means estimator and find that the mean statistics can, in this case, be obtained with significantly fewer mantissa bits than conventional IEEE-754 double precision, but that the mean values are observed to be more sensitive in the middle of the channel than in the near-wall region. This indicates that using lower precision in the near-wall region, where the majority of the computational efforts are required, may benefit from low-precision floating point units found in upcoming computer hardware.
29.	Massaro, Daniele, et al. (authors) Direct numerical simulation of the turbulent flow around a Flettner rotor 2024 In: Scientific Reports. - : Springer Nature. - 2045-2322. ; 14:1 Journal article (peer-reviewed). Abstract: The three-dimensional turbulent flow around a Flettner rotor, i.e. an engine-driven rotating cylinder in an atmospheric boundary layer, is studied via direct numerical simulations (DNS) for three different rotation speeds (α). This technology offers a sustainable alternative mainly for marine propulsion, underscoring the critical importance of comprehending the characteristics of such flow. In this study, we evaluate the aerodynamic loads produced by the rotor of height h, with a specific focus on the changes in lift and drag force along the vertical axis of the cylinder. Correspondingly, we observe that vortex shedding is inhibited at the highest α values investigated. However, in the case of intermediate α, vortices continue to be shed in the upper section of the cylinder (y/h>0.3). As the cylinder begins to rotate, a large-scale motion becomes apparent on the high-pressure side, close to the bottom wall. We offer both a qualitative and quantitative description of this motion, outlining its impact on the wake deflection. This finding is significant as it influences the rotor wake to an extent of approximately one hundred diameters downstream. In practical applications, this phenomenon could influence the performance of subsequent boats and have an impact on the cylinder drag, affecting its fuel consumption. This fundamental study, which investigates a limited yet significant (for DNS) Reynolds number and explores various spinning ratios, provides valuable insights into the complex flow around a Flettner rotor. The simulations were performed using a modern GPU-based spectral element method, leveraging the power of modern supercomputers towards fundamental engineering problems.
30.	Svedin, Martin, et al. (authors) Benchmarking the Nvidia GPU Lineage : From Early K80 to Modern A100 with Asynchronous Memory Transfers 2021 In: ACM International Conference Proceeding Series. - New York, NY, USA : Association for Computing Machinery (ACM). Conference paper (peer-reviewed). Abstract: For many, Graphics Processing Units (GPUs) provide a source of reliable computing power. Recently, Nvidia introduced its 9th generation HPC-grade GPUs, the Ampere 100 (A100), claiming significant performance improvements over previous generations, particularly for AI-workloads, as well as introducing new architectural features such as asynchronous data movement. But how well does the A100 perform on non-AI benchmarks, and can we expect the A100 to deliver the application improvements we have grown used to with previous GPU generations? In this paper, we benchmark the A100 GPU and compare it to four previous generations of GPUs, with a particular focus on empirically quantifying our derived performance expectations. We find that the A100 delivers less performance increase than previous generations for the well-known Rodinia benchmark suite; we show that some of these performance anomalies can be remedied through clever use of the new data-movement features, which we microbenchmark and demonstrate where (and more importantly, how) they should be used.
