SwePub
Search the SwePub database

Hit list for search: WFRF:(Peng Ivy Bo)

  • Results 1-43 of 43
2.
  • Andersson, Måns (author)
  • Leveraging Intermediate Representations for High-Performance Portable Discrete Fourier Transform Frameworks : with Application to Molecular Dynamics
  • 2023
  • Licentiate thesis (other academic/artistic), abstract:
    • The Discrete Fourier Transform (DFT) and its improved formulations, the Fast Fourier Transforms (FFTs), are vital for scientists and engineers in a range of domains, from signal processing to the solution of partial differential equations. A growing trend in scientific computing is heterogeneous computing, where accelerators are used instead of, or together with, CPUs. This has led to problems for developers in unifying portability, performance, and productivity. This thesis first motivates the work by showing the importance of efficient DFT calculations, then describes the DFT algorithm and a matrix-factorization formulation that has been developed to express FFT algorithms and their parallelism so as to exploit modern computer architectures, such as accelerators. The first paper is a motivating study of the breakdown of the performance and scalability of the high-performance Molecular Dynamics code GROMACS, in which DFT calculations are a main performance bottleneck. In particular, the long-range interactions are solved with the Particle-Mesh Ewald algorithm, which uses a three-dimensional Fast Fourier Transform. The two following papers present two approaches to leverage factorization with the help of two different frameworks that use intermediate representations and compiler technology for the development of fast and portable code. The second paper presents a front-end and a pipeline for code generation in a domain-specific language based on the Multi-Level Intermediate Representation (MLIR) for developing Fast Fourier Transform libraries. The last paper investigates and optimizes an implementation of an important kernel within the matrix-factorization framework: the batched DFT. It is implemented with data-centric programming and a data-centric intermediate representation called Stateful DataFlow multiGraphs (SDFG). The paper evaluates strategies for complex-valued data layouts for performance and portability; we show that there is a trade-off between portability and maintainability in using the native complex data type, and that an SDFG-level abstraction could be beneficial for developing higher-level applications.
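The factorization idea at the heart of the thesis above can be illustrated with the classic radix-2 Cooley-Tukey split, in which the DFT is factored into two half-size DFTs combined with twiddle factors. The Python below is an illustrative sketch only; the thesis itself targets MLIR- and SDFG-based code generation, not this toy implementation.

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform, straight from the definition."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
            for k in range(n)]

def fft(x):
    """Radix-2 Cooley-Tukey FFT: the factorized form of the DFT.
    Requires len(x) to be a power of two."""
    n = len(x)
    if n == 1:
        return x[:]
    even = fft(x[0::2])   # DFT of even-indexed samples
    odd = fft(x[1::2])    # DFT of odd-indexed samples
    twiddle = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + twiddle[k] for k in range(n // 2)] +
            [even[k] - twiddle[k] for k in range(n // 2)])

# The factorized O(n log n) form agrees with the naive definition.
x = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, -3.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft(x), dft(x)))
```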
3.
  • Araújo De Medeiros, Daniel (author)
  • Emerging Paradigms in the Convergence of Cloud and High-Performance Computing
  • 2023
  • Licentiate thesis (other academic/artistic), abstract:
    • Traditional HPC scientific workloads are tightly coupled, while emerging scientific workflows exhibit even more complex patterns, consisting of multiple characteristically different stages that may be IO-intensive, compute-intensive, or memory-intensive. New high-performance computer systems are evolving to adapt to these new requirements, motivated by the need for performance and efficiency in resource usage. On the other hand, cloud workloads are loosely coupled, and their systems have matured technologies under different constraints from HPC. In this thesis, the use of cloud technologies designed for loosely coupled, dynamic, and elastic workloads is explored, repurposed, and examined in the landscape of HPC in three major parts. The first part deals with the deployment of HPC workloads in cloud-native environments through the use of containers and analyses the feasibility and trade-offs of elastic scaling. The second part relates to the use of workflow management systems in HPC workflows; in particular, a molecular docking workflow executed through Airflow is discussed. Finally, object storage systems, a cost-effective and scalable solution widely used in the cloud, and their usage in HPC applications through MPI I/O are discussed in the third part of this thesis.
4.
  • Araújo De Medeiros, Daniel, et al. (authors)
  • Kub : Enabling Elastic HPC Workloads on Containerized Environments
  • 2023
  • In: Proceedings of the 35th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). - : Institute of Electrical and Electronics Engineers (IEEE).
  • Conference paper (peer-reviewed), abstract:
    • The conventional model of resource allocation in HPC systems is static. Thus, a job cannot leverage newly available resources in the system or release underutilized resources during the execution. In this paper, we present Kub, a methodology that enables elastic execution of HPC workloads on Kubernetes so that the resources allocated to a job can be dynamically scaled during the execution. One main optimization of our method is to maximize the reuse of the originally allocated resources so that the disruption to the running job can be minimized. The scaling procedure is coordinated among nodes through remote procedure calls on Kubernetes for deploying workloads in the cloud. We evaluate our approach using one synthetic benchmark and two production-level MPI-based HPC applications - GROMACS and CM1. Our results demonstrate that the benefits of adapting the allocated resources depend on the workload characteristics. In the tested cases, a properly chosen scaling point for increasing resources during execution achieved up to 2x speedup. Also, the overhead of checkpointing and data reshuffling significantly influences the selection of optimal scaling points and requires application-specific knowledge.
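The scaling-point trade-off described in the abstract above can be captured in a toy cost model. The function and numbers below are hypothetical, not Kub's actual API or measurements: a job performs a fixed total amount of work, switches from p0 to p1 nodes at some fraction of its progress, and pays a fixed checkpoint/reshuffle overhead at the switch.

```python
def elastic_runtime(total_work, p0, p1, scale_frac, overhead):
    """Toy model: work proceeds on p0 nodes until scale_frac of it is done,
    then on p1 nodes, paying a fixed reshuffle/checkpoint overhead once."""
    return (scale_frac * total_work / p0
            + overhead
            + (1 - scale_frac) * total_work / p1)

static = elastic_runtime(1000.0, 4, 4, 1.0, 0.0)   # never scales
early  = elastic_runtime(1000.0, 4, 8, 0.2, 10.0)  # doubles nodes at 20% progress
late   = elastic_runtime(1000.0, 4, 8, 0.8, 10.0)  # doubles nodes at 80% progress
# Scaling earlier amortizes the extra nodes over more remaining work,
# as long as the fixed overhead does not dominate.
assert early < late < static
```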
5.
  • Araújo De Medeiros, Daniel, et al. (authors)
  • LibCOS : Enabling Converged HPC and Cloud Data Stores with MPI
  • 2023
  • In: Proceedings of International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia 2023. - New York, NY, USA : Association for Computing Machinery (ACM). ; pp. 106-116
  • Conference paper (peer-reviewed), abstract:
    • Recently, federated HPC and cloud resources are becoming increasingly strategic for providing diversified and geographically available computing resources. However, accessing data stores across HPC and cloud storage systems is challenging. Many cloud providers use object storage systems to support their clients in storing and retrieving data over the internet. One popular method is REST APIs atop the HTTP protocol, with Amazon's S3 APIs being supported by most vendors. In contrast, HPC systems are contained within their networks and tend to use parallel file systems with POSIX-like interfaces. This work addresses the challenge of diverse data stores on HPC and cloud systems by providing native object storage support through the unified MPI I/O interface in HPC applications. In particular, we provide a prototype library called LibCOS that transparently enables MPI applications running on HPC systems to access object storage on remote cloud systems. We evaluated LibCOS on a Ceph object storage system and a traditional HPC system. In addition, we conducted performance characterization of core S3 operations that enable individual and collective MPI I/O. Our evaluation in HACC, IOR, and BigSort shows that enabling diverse data stores on HPC and Cloud storage is feasible and can be transparently achieved through the widely adopted MPI I/O. Also, we show that a native object storage system like Ceph could improve the scalability of I/O operations in parallel applications.
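A core task for a library bridging MPI I/O and object storage, as described above, is translating a contiguous file read into object-storage range requests. The helper below is a hypothetical sketch of that offset arithmetic (fixed-size objects, S3-style inclusive byte ranges), not LibCOS code.

```python
def byte_ranges(offset, count, object_size):
    """Map a contiguous MPI-style file read (offset, count) onto
    (object_index, first_byte, last_byte) tuples over fixed-size objects,
    i.e. the spans an S3 'Range: bytes=a-b' GET would request."""
    ranges = []
    end = offset + count              # exclusive end of the read
    while offset < end:
        idx = offset // object_size   # which object holds this offset
        first = offset % object_size  # start within that object
        last = min(object_size, first + (end - offset)) - 1  # inclusive end
        ranges.append((idx, first, last))
        offset += last - first + 1
    return ranges

# A 10-byte read at file offset 12 over 8-byte objects spans objects 1 and 2.
assert byte_ranges(12, 10, 8) == [(1, 4, 7), (2, 0, 5)]
```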
6.
  • Chen, Yuxi, et al. (authors)
  • Global Three-Dimensional Simulation of Earth's Dayside Reconnection Using a Two-Way Coupled Magnetohydrodynamics With Embedded Particle-in-Cell Model : Initial Results
  • 2017
  • In: Journal of Geophysical Research - Space Physics. - : American Geophysical Union. - 2169-9380 .- 2169-9402. ; 122:10, pp. 10318-10335
  • Journal article (peer-reviewed), abstract:
    • We perform a three-dimensional (3-D) global simulation of Earth's magnetosphere with kinetic reconnection physics to study the flux transfer events (FTEs) and dayside magnetic reconnection with the recently developed magnetohydrodynamics with embedded particle-in-cell model. During the 1 h long simulation, the FTEs are generated quasi-periodically near the subsolar point and move toward the poles. We find that the magnetic field signature of FTEs at their early formation stage is similar to a "crater FTE," which is characterized by a magnetic field strength dip at the FTE center. After the FTE core field grows to a significant value, it becomes an FTE with typical flux rope structure. When an FTE moves across the cusp, reconnection between the FTE field lines and the cusp field lines can dissipate the FTE. The kinetic features are also captured by our model. A crescent electron phase space distribution is found near the reconnection site. A similar distribution is found for ions at the location where the Larmor electric field appears. The lower hybrid drift instability (LHDI) along the current sheet direction also arises at the interface of magnetosheath and magnetosphere plasma. The LHDI electric field is about 8 mV/m, and its dominant wavelength relative to the electron gyroradius agrees reasonably with Magnetospheric Multiscale (MMS) observations.
7.
  • Chien, Steven Wei Der, et al. (authors)
  • An Evaluation of the TensorFlow Programming Model for Solving Traditional HPC Problems
  • 2018
  • In: Proceedings of the 5th International Conference on Exascale Applications and Software. - : The University of Edinburgh. - 9780992661533 ; pp. 34-
  • Conference paper (peer-reviewed), abstract:
    • Computationally intensive applications, such as pattern recognition and natural language processing, are increasingly popular on HPC systems. Many of these applications use deep learning, a branch of machine learning, to determine the weights of artificial neural network nodes by minimizing a loss function. Such applications depend heavily on dense matrix multiplications, also called tensorial operations. The use of Graphics Processing Units (GPUs) has considerably sped up deep-learning computations, leading to a renaissance of the artificial neural network. Recently, the NVIDIA Volta GPU and the Google Tensor Processing Unit (TPU) have been specially designed to support deep-learning workloads. New programming models have also emerged for convenient expression of tensorial operations and deep-learning computational paradigms. An example of such new programming frameworks is TensorFlow, an open-source deep-learning library released by Google in 2015. TensorFlow expresses algorithms as a computational graph where nodes represent operations and edges between nodes represent data flow. Multi-dimensional data such as vectors and matrices that flow between operations are called tensors. For this reason, computational problems need to be expressed as a computational graph. In particular, TensorFlow supports distributed computation with flexible assignment of operations and data to devices such as GPUs and CPUs on different computing nodes. Computation on devices is based on optimized kernels such as MKL, Eigen, and cuBLAS. Inter-node communication can be through TCP and RDMA. This work attempts to evaluate the usability and expressiveness of the TensorFlow programming model for traditional HPC problems. As an illustration, we prototyped a distributed block matrix multiplication for large dense matrices that cannot be co-located on a single device, and a Conjugate Gradient (CG) solver. We evaluate the difficulty of expressing traditional HPC algorithms using computational graphs and study the scalability of distributed TensorFlow on accelerated systems. Our preliminary result with distributed matrix multiplication shows that distributed computation on TensorFlow is extremely scalable. This study provides an initial investigation of new emerging programming models for HPC.
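The computational-graph model the abstract describes treats nodes as operations and edges as data flow, with evaluation deferred until the graph is run. The toy Python below illustrates that idea on its own; it is not the TensorFlow API.

```python
class Node:
    """One operation in a tiny computational graph. Inputs are edges
    carrying values (stand-ins for tensors) from other nodes; nothing
    is computed until eval() walks the graph."""
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def eval(self):
        return self.op(*(n.eval() for n in self.inputs))

# Graph-building helpers: constants and two arithmetic operations.
const = lambda v: Node(lambda: v)
add = lambda a, b: Node(lambda x, y: x + y, a, b)
mul = lambda a, b: Node(lambda x, y: x * y, a, b)

# (2 + 3) * 4 expressed as a graph; building it does no arithmetic.
graph = mul(add(const(2), const(3)), const(4))
assert graph.eval() == 20
```

In a real framework the same deferred-graph structure is what allows operations to be assigned to different devices and nodes before execution.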
8.
  • Cllasun, Hüsrev, et al. (authors)
  • FPGA-accelerated simulation of variable latency memory systems
  • 2022
  • In: MEMSYS 2022 - Proceedings of the International Symposium on Memory Systems. - : Association for Computing Machinery (ACM).
  • Conference paper (peer-reviewed), abstract:
    • With the growing complexity of memory types, organizations, and placement, efficient use of memory systems remains a key objective in processing data-rich workloads. Heterogeneous memories, including HBM, conventional DRAM, and persistent memory, both local and network-attached, exhibit a wide range of latencies and bandwidths. The performance delivered to an application may vary widely depending on workload and interference from competing clients. Evaluating the impact of these emerging memory systems on applications challenges traditional simulation techniques. In this work, we describe VLD-sim, an FPGA-accelerated simulator designed to evaluate application performance in the presence of varying, non-deterministic latency. VLD-sim implements a statistical approach in which memory system access latency is non-deterministic, as would occur when request traffic is generated by a large collection of possibly unrelated threads and compute nodes. VLD-sim runs on a Multi-Processor System on Chip with a hard CPU plus configurable logic to enable fast evaluation of workloads or of individual applications. We evaluate VLD-sim with CPU-only and near-memory accelerator-enabled applications and compare against an idealized fixed-latency baseline. Our findings reveal and quantify the performance impact on applications due to non-deterministic latency. With high flexibility and fast execution time, VLD-sim enables system-level evaluation of a large memory architecture design space.
9.
  • Coti, Camille, et al. (authors)
  • Integration of Modern HPC Performance Tools in Vlasiator for Exascale Analysis and Optimization
  • 2024
  • In: IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 27-31, San Francisco, California, USA. - 9798350364606
  • Conference paper (peer-reviewed), abstract:
    • Key to the success of developing high-performance applications for present and future heterogeneous supercomputers will be the systematic use of measurement and analysis to understand the factors that affect delivered performance in the context of parallelization strategy, heterogeneous programming methodology, data partitioning, and scalable algorithm design. The evolving complexity of future exascale platforms makes it unrealistic for application teams to implement their own tools. Similarly, it is naive to expect available robust performance tools to work effectively out of the box, without integration and specialization with respect to application-specific requirements and knowledge. Vlasiator is a powerful massively parallel code for accurate magnetospheric and solar wind plasma simulations. It is being ported to the LUMI HPC system for advanced modeling of the Earth's magnetosphere and the surrounding solar wind. Building on a preexisting Vlasiator performance API called Phiprof, our work significantly advances the performance measurement and analysis capabilities offered to Vlasiator using the TAU, APEX, and IPM tools. The results presented show in-depth characterization of node-level CPU/GPU and MPI communication performance. We highlight the integration of high-level Phiprof events with detailed performance data to expose opportunities for performance tuning. Our results provide important insights for optimizing Vlasiator for the upcoming exascale machines.
10.
  • Faj, Jennifer, et al. (authors)
  • MPI Performance Analysis in Vlasiator : Unraveling Communication Bottlenecks
  • 2023
  • In: SC23 Proceedings. - Denver, Colorado, USA.
  • Conference paper (peer-reviewed), abstract:
    • Vlasiator is a popular and powerful massively parallel code for accurate magnetospheric and solar wind plasma simulations. This work provides an in-depth analysis of Vlasiator, focusing on MPI performance using the Integrated Performance Monitoring (IPM) tool. We show that MPI non-blocking point-to-point communication accounts for most of the communication time. The communication topology shows a large number of MPI messages exchanging data in a six-dimensional grid. We also show that relatively large messages are used in MPI communication, reaching up to 256MB. As a communication-bound application, we found that using OpenMP in Vlasiator is critical for eliminating intra-node communication. Our results provide important insights for optimizing Vlasiator for the upcoming Exascale machines.
11.
  • Faj, Jennifer, et al. (authors)
  • Quantum Computer Simulations at Warp Speed : Assessing the Impact of GPU Acceleration
  • 2023
  • In: Proceedings 2023 IEEE 19th International Conference on e-Science, e-Science 2023. - : Institute of Electrical and Electronics Engineers (IEEE).
  • Conference paper (peer-reviewed), abstract:
    • Quantum computer simulators are crucial for the development of quantum computing. This work investigates the suitability and performance impact of GPU and multi-GPU systems for a widely used simulation tool - the state vector simulator Qiskit Aer. In particular, we evaluate the performance of both Qiskit's default Nvidia Thrust backend and the recent Nvidia cuQuantum backend on Nvidia A100 GPUs. We provide a benchmark suite of representative quantum applications for characterization. For simulations with a large number of qubits, the two GPU backends can provide up to 14× speedup over the CPU backend, with Nvidia cuQuantum providing a further 1.5-3× speedup over the default Thrust backend. Our evaluation on a single GPU identifies the most important functions in Nvidia Thrust and cuQuantum for different quantum applications, along with their compute and memory bottlenecks. We also evaluate the gate-fusion and cache-blocking optimizations on different quantum applications. Finally, we evaluate quantum applications with large qubit counts on multi-GPU systems and identify data movement between host and GPU as the limiting factor for performance.
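A state vector simulator of the kind evaluated above stores all 2^n complex amplitudes and applies a single-qubit gate by mixing each pair of amplitudes whose indices differ only in the target qubit's bit. The pure-Python sketch below shows that kernel in miniature; Qiskit Aer's actual backends are heavily optimized C++/CUDA, not this.

```python
import math

def apply_1q_gate(state, gate, target):
    """Apply a 2x2 gate to the `target` qubit of a state vector.
    Each amplitude pair (index with target bit 0, index with bit 1)
    is mixed by the gate matrix."""
    out = state[:]
    for i in range(len(state)):
        if not (i >> target) & 1:      # i has the target bit clear
            j = i | (1 << target)      # partner index with the bit set
            a, b = state[i], state[j]
            out[i] = gate[0][0] * a + gate[0][1] * b
            out[j] = gate[1][0] * a + gate[1][1] * b
    return out

# Hadamard gate.
H = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]

state = [1.0, 0.0, 0.0, 0.0]           # two-qubit |00> state
state = apply_1q_gate(state, H, 0)     # equal superposition on qubit 0
assert abs(state[0] - 1 / math.sqrt(2)) < 1e-12
assert abs(state[1] - 1 / math.sqrt(2)) < 1e-12
```

The memory cost (2^n amplitudes) and the strided partner-index access pattern are exactly what make GPU memory bandwidth and host-device data movement the bottlenecks the paper measures.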
12.
  • Hegde, Pratibha Raghupati, et al. (authors)
  • Beyond the Buzz : Strategic Paths for Enabling Useful NISQ Applications
  • 2024
  • In: Proceedings of the 21st ACM International Conference on Computing Frontiers, CF 2024. - : Association for Computing Machinery (ACM). ; pp. 310-313
  • Conference paper (peer-reviewed), abstract:
    • There is much debate on whether quantum computing on current NISQ devices, consisting of hundreds of noisy qubits and requiring non-negligible classical computing as part of the algorithms, has utility and will ever offer advantages over traditional computing for scientific and industrial applications. In this position paper, we argue that while real-world NISQ quantum applications have yet to surpass their classical counterparts, strategic approaches can be used to facilitate advancements in both industrial and scientific applications. We have identified three key strategies to guide NISQ computing towards practical and useful implementations. Firstly, prioritizing the identification of a "killer app" is a key point. An application demonstrating the distinctive capabilities of NISQ devices can catalyze broader development. We suggest focusing on applications that are inherently quantum, pointing towards quantum chemistry and materials science as promising domains. These fields hold the potential to exhibit benefits, setting benchmarks for other applications to follow. Secondly, integrating AI and deep-learning methods into NISQ computing is a promising approach. Examples such as quantum Physics-Informed Neural Networks and Differentiable Quantum Circuits (DQC) demonstrate the synergy between quantum computing and AI. Lastly, recognizing the interdisciplinary nature of NISQ computing, we advocate for a co-design approach. Achieving synergy between classical and quantum computing necessitates an effort in co-designing quantum applications, algorithms, and programming environments, and the integration of HPC with quantum hardware. The interoperability of these components is crucial for enabling the full potential of NISQ computing. In conclusion, through these three approaches, we argue that NISQ computing can surpass current limitations and evolve into a valuable tool for scientific and industrial applications. This requires an approach that integrates domain-specific killer apps, harnesses the power of quantum-enhanced AI, and embraces a collaborative co-design methodology.
13.
  • Iakymchuk, Roman, et al. (authors)
  • A Particle-in-Cell Method for Automatic Load-Balancing with the AllScale Environment
  • 2016
  • Conference paper (other academic/artistic), abstract:
    • We present an initial design and implementation of a Particle-in-Cell (PIC) method based on the work carried out in the European Exascale AllScale project. AllScale provides a unified programming system for the effective development of highly scalable, resilient and performance-portable parallel applications for Exascale systems. The AllScale approach is based on task-based nested recursive parallelism and it provides mechanisms for automatic load-balancing in the PIC simulations. We provide the preliminary results of the AllScale-based PIC implementation and draw directions for its future development. 
14.
  • Ivanov, Ilya, et al. (authors)
  • Evaluation of Parallel Communication Models in Nekbone, a Nek5000 mini-application
  • 2015
  • In: 2015 IEEE International Conference on Cluster Computing. - : IEEE. ; pp. 760-767
  • Conference paper (peer-reviewed), abstract:
    • Nekbone is a proxy application of Nek5000, a scalable Computational Fluid Dynamics (CFD) code used for modelling incompressible flows. The Nekbone mini-application is used by several international co-design centers to explore new concepts in computer science and to evaluate their performance. We present the design and implementation of a new communication kernel in the Nekbone mini-application with the goal of studying the performance of different parallel communication models. First, a new MPI blocking communication kernel has been developed to solve Nekbone problems in a three-dimensional Cartesian mesh and process topology. The new MPI implementation delivers a 13% performance improvement compared to the original implementation. The new MPI communication kernel consists of approximately 500 lines of code against the original 7,000 lines of code, allowing experimentation with new approaches in Nekbone parallel communication. Second, the MPI blocking communication in the new kernel was changed to MPI non-blocking communication. Third, we developed a new Partitioned Global Address Space (PGAS) communication kernel based on the GPI-2 library. This approach reduces synchronization among neighboring processes; in our tests on 8,192 processes, the GPI-2 communication kernel is on average 3% faster than the new MPI non-blocking communication kernel. In addition, we have used OpenMP in all versions of the new communication kernel. Finally, we highlight the future steps for using the new communication kernel in the parent application Nek5000.
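The three-dimensional Cartesian process topology used by the new Nekbone kernel pairs each rank with six face neighbors, as MPI_Cart_shift would report them for a periodic grid. The rank arithmetic below is an illustrative sketch of that mapping, not the Nekbone source.

```python
def neighbors(rank, dims):
    """Ranks of the six face neighbors of `rank` in a periodic 3D
    Cartesian process grid laid out in row-major (MPI) order."""
    px, py, pz = dims
    # Recover grid coordinates from the rank.
    x, y, z = rank // (py * pz), (rank // pz) % py, rank % pz

    def r(i, j, k):
        # Wrap coordinates (periodic boundaries), then flatten back to a rank.
        return ((i % px) * py + (j % py)) * pz + (k % pz)

    return [r(x - 1, y, z), r(x + 1, y, z),   # -x, +x
            r(x, y - 1, z), r(x, y + 1, z),   # -y, +y
            r(x, y, z - 1), r(x, y, z + 1)]   # -z, +z

# Rank 0 in a 2x2x2 grid: each shift wraps onto the same partner rank.
assert neighbors(0, (2, 2, 2)) == [4, 4, 2, 2, 1, 1]
```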
15.
  • Liu, Xueyang, et al. (authors)
  • Accelerator integration in a tile-based SoC : lessons learned with a hardware floating point compression engine
  • 2023
  • In: Proceedings of 2023 SC Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC Workshops 2023. - : Association for Computing Machinery (ACM). ; pp. 1662-1669
  • Conference paper (peer-reviewed), abstract:
    • Heterogeneous Intellectual Property (IP) hardware acceleration engines have emerged as a viable path forward to improving performance amid the waning of Moore's Law and Dennard scaling. In this study, we design, prototype, and evaluate the HPC-specialized ZHW floating point compression accelerator as a resource on a System on Chip (SoC). Our full hardware/software implementation and evaluation reveal inefficiencies at the system level that significantly throttle the potential speedup of the ZHW accelerator. By optimizing data movement between CPU, memory, and accelerator, a speedup of 6.9X over a RISC-V64 core is possible, and 2.9X over a Mac M1 ARM core.
16.
  • Ma, Yingjuan, et al. (authors)
  • Reconnection in the Martian Magnetotail : Hall-MHD With Embedded Particle-in-Cell Simulations
  • 2018
  • In: Journal of Geophysical Research - Space Physics. - : American Geophysical Union. - 2169-9380 .- 2169-9402. ; 123:5, pp. 3742-3763
  • Journal article (peer-reviewed), abstract:
    • Mars Atmosphere and Volatile EvolutioN (MAVEN) mission observations show clear evidence of the occurrence of the magnetic reconnection process in the Martian plasma tail. In this study, we use sophisticated numerical models to help us understand the effects of magnetic reconnection in the plasma tail. The numerical models used in this study are (a) a multispecies global Hall-magnetohydrodynamic (HMHD) model and (b) a global HMHD model two-way coupled to an embedded fully kinetic particle-in-cell code. Comparison with MAVEN observations clearly shows that the general interaction pattern is well reproduced by the global HMHD model. The coupled model takes advantage of both the efficiency of the MHD model and the ability to incorporate kinetic processes of the particle-in-cell model, making it feasible to conduct kinetic simulations for Mars under realistic solar wind conditions for the first time. Results from the coupled model show that the Martian magnetotail is highly dynamic due to magnetic reconnection, and the resulting Mars-ward plasma flow velocities are significantly higher for the lighter ion fluid, which are quantitatively consistent with MAVEN observations. The HMHD with Embedded Particle-in-Cell model predicts that the ion loss rates are more variable but with similar mean values as compared with HMHD model results.
17.
  • Markidis, Stefano, et al. (authors)
  • A performance characterization of streaming computing on supercomputers
  • 2016
  • In: Procedia Computer Science. - : Elsevier. - 1877-0509. ; pp. 98-107
  • Conference paper (peer-reviewed), abstract:
    • Streaming computing models allow for on-the-fly processing of large data sets. With the increased demand for processing large amounts of data in a reasonable period of time, streaming models are more and more used on supercomputers to solve data-intensive problems. Because supercomputers have mainly been used for compute-intensive workloads, supercomputer performance metrics focus on the number of floating point operations per unit time and cannot fully characterize the performance of a streaming application on supercomputers. We introduce the injection and processing rates as the main metrics to characterize the performance of streaming computing on supercomputers. We analyze the dynamics of these quantities in a modified STREAM benchmark developed atop an MPI streaming library in a series of different configurations. We show that after a brief transient the injection and processing rates converge to sustained rates. We also demonstrate that streaming computing performance strongly depends on the number of connections between data producers and consumers and on the processing task granularity.
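The injection and processing rates introduced above can be seen in a toy queue model: a producer injects items into a buffer that a consumer drains at a bounded rate, and after a transient the delivered rate settles at the smaller of the two. This is a hypothetical sketch, not the paper's MPI streaming library.

```python
def sustained_rates(inject_rate, process_rate, steps):
    """Toy stream: each step a producer injects `inject_rate` items into a
    queue, and the consumer drains at most `process_rate` items from it.
    Returns the number of items delivered per step."""
    queue, delivered = 0.0, []
    for _ in range(steps):
        queue += inject_rate            # injection phase
        done = min(queue, process_rate) # processing phase, capacity-limited
        queue -= done
        delivered.append(done)
    return delivered

# Processing-bound: the sustained rate equals the processing rate.
assert all(r == 3.0 for r in sustained_rates(5.0, 3.0, 10))
# Injection-bound: the sustained rate equals the injection rate.
assert sustained_rates(2.0, 3.0, 10)[-1] == 2.0
```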
18.
  • Markidis, Stefano, et al. (authors)
  • Idle waves in high-performance computing
  • 2015
  • In: Physical Review E. Statistical, Nonlinear, and Soft Matter Physics. - 1539-3755 .- 1550-2376. ; 91:1, art. 013306
  • Journal article (peer-reviewed), abstract:
    • The vast majority of parallel scientific applications distribute computation among processes that are in a busy state when computing and in an idle state when waiting for information from other processes. We identify the propagation of idle waves through the processes of scientific applications that use local information exchange between neighboring processes. Idle waves are nondispersive and have a phase velocity inversely proportional to the average busy time. The physical mechanism enabling the propagation of idle waves is the local synchronization between two processes due to remote data dependency. This study provides a description of the large number of processes in parallel scientific applications as a continuous medium. This work is also a step towards an understanding of how localized idle periods can affect remote processes, leading to the degradation of global performance in parallel scientific applications.
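The idle-wave mechanism can be reproduced with a toy lockstep model: each process waits on its neighbors' previous step before computing, so a single injected delay travels outward one process per step, and the wave's speed scales inversely with the busy time. The simulation below is an illustrative sketch, not the paper's continuum description.

```python
def completion_times(n_procs, n_steps, busy, extra):
    """Step-completion times for processes that each wait on their
    neighbors' previous step (local synchronization from remote data
    dependency) before computing for `busy` time units. Process 0
    suffers an `extra` delay in its first step, which then propagates
    outward as an idle wave."""
    t = [[0.0] * n_procs for _ in range(n_steps + 1)]
    for s in range(1, n_steps + 1):
        for p in range(n_procs):
            # Dependencies: own and nearest neighbors' previous step.
            deps = [t[s - 1][q] for q in (p - 1, p, p + 1) if 0 <= q < n_procs]
            delay = extra if (s, p) == (1, 0) else 0.0
            t[s][p] = max(deps) + busy + delay
    return t

t = completion_times(n_procs=8, n_steps=8, busy=1.0, extra=5.0)
# The delay on process 0 first reaches process p at step p + 1: the wave
# advances one process per step, i.e. its velocity goes as 1 / busy.
for p in range(8):
    assert t[p + 1][p] == (p + 1) * 1.0 + 5.0
```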
19.
  • Markidis, Stefano, et al. (authors)
  • Kinetic Modeling in the Magnetosphere
  • 2021
  • In: Magnetospheres in the Solar System. - : Wiley. ; pp. 607-615
  • Book chapter (other academic/artistic), abstract:
    • This paper presents the state of the art of kinetic modeling techniques for simulating plasma kinetic dynamics in magnetospheres. We describe the key numerical techniques for enabling large-scale kinetic simulations of magnetospheres: parameter scaling, implicit Particle-in-Cell schemes and fluid-kinetic coupling. We show an application of these techniques to study particle acceleration and heating in asymmetric magnetic reconnection in the Ganymede magnetosphere. 
20.
  • Markidis, Stefano, et al. (authors)
  • The EPiGRAM Project : Preparing Parallel Programming Models for Exascale
  • 2016
  • In: High Performance Computing, ISC High Performance 2016 International Workshops. - Cham : Springer. - 9783319460796 - 9783319460789 ; pp. 56-68
  • Conference paper (peer-reviewed), abstract:
    • EPiGRAM is a European Commission funded project to improve existing parallel programming models to run efficiently large scale applications on exascale supercomputers. The EPiGRAM project focuses on the two current dominant petascale programming models, message-passing and PGAS, and on the improvement of two of their associated programming systems, MPI and GASPI. In EPiGRAM, we work on two major aspects of programming systems. First, we improve the performance of communication operations by decreasing the memory consumption, improving collective operations and introducing emerging computing models. Second, we enhance the interoperability of message-passing and PGAS by integrating them in one PGAS-based MPI implementation, called EMPI4Re, implementing MPI endpoints and improving GASPI interoperability with MPI. The new EPiGRAM concepts are tested in two large-scale applications, iPIC3D, a Particle-in-Cell code for space physics simulations, and Nek5000, a Computational Fluid Dynamics code.
22.
  • Narasimhamurthy, Sai, et al. (authors)
  • SAGE : Percipient Storage for Exascale Data Centric Computing
  • 2019
  • In: Parallel Computing. - : Elsevier. - 0167-8191 .- 1872-7336. ; 83, pp. 22-33
  • Journal article (peer-reviewed), abstract:
    • We aim to implement a Big Data/Extreme Computing (BDEC) capable system infrastructure as we head towards the era of Exascale computing - termed SAGE (Percipient StorAGe for Exascale Data Centric Computing). The SAGE system will be capable of storing and processing immense volumes of data at the Exascale regime, and provide the capability for Exascale class applications to use such a storage infrastructure. SAGE addresses the increasing overlaps between Big Data Analysis and HPC in an era of next-generation data centric computing that has developed due to the proliferation of massive data sources, such as large, dispersed scientific instruments and sensors, whose data needs to be processed, analysed and integrated into simulations to derive scientific and innovative insights. Indeed, Exascale I/O, as a problem that has not been sufficiently dealt with for simulation codes, is appropriately addressed by the SAGE platform. The objective of this paper is to discuss the software architecture of the SAGE system and look at early results we have obtained employing some of its key methodologies, as the system continues to evolve.
  •  
23.
  • Narasimhamurthy, S., et al. (author)
  • The SAGE project : A storage centric approach for exascale computing
  • 2018
  • In: 2018 ACM International Conference on Computing Frontiers, CF 2018 - Proceedings. - New York, NY, USA : Association for Computing Machinery (ACM). - 9781450357616 ; pp. 287-292
  • Conference paper (peer-reviewed) Abstract:
    • SAGE (Percipient StorAGe for Exascale Data Centric Computing) is a European Commission funded project towards the era of Exascale computing. Its goal is to design and implement a Big Data/Extreme Computing (BDEC) capable infrastructure with associated software stack. The SAGE system follows a storage-centric approach as it is capable of storing and processing large data volumes at the Exascale regime. SAGE addresses the convergence of Big Data Analysis and HPC in an era of next-generation data centric computing. This convergence is driven by the proliferation of massive data sources, such as large, dispersed scientific instruments and sensors where data needs to be processed, analyzed and integrated into simulations to derive scientific and innovative insights. A first prototype of the SAGE system has been implemented and installed at the Jülich Supercomputing Center. The SAGE storage system consists of multiple types of storage device technologies in a multi-tier I/O hierarchy, including flash, disk, and non-volatile memory technologies. The main SAGE software component is the Seagate Mero Object Storage that is accessible via the Clovis API and higher level interfaces. The SAGE project also includes scientific applications for the validation of the SAGE concepts. The objective of this paper is to present the SAGE project concepts, the prototype of the SAGE platform and discuss the software architecture of the SAGE system.
  •  
24.
  • Olshevsky, Vyacheslav, et al. (author)
  • Magnetic Null Points In Kinetic Simulations of Space Plasmas
  • 2016
  • In: Astrophysical Journal. - : Institute of Physics Publishing (IOPP). - 0004-637X .- 1538-4357. ; 819:1
  • Journal article (peer-reviewed) Abstract:
    • We present a systematic attempt to study magnetic null points and the associated magnetic energy conversion in kinetic particle-in-cell simulations of various plasma configurations. We address three-dimensional simulations performed with the semi-implicit kinetic electromagnetic code iPic3D in different setups: variations of a Harris current sheet, dipolar and quadrupolar magnetospheres interacting with the solar wind, and a relaxing turbulent configuration with multiple null points. Spiral nulls are more likely to be created in space plasmas: in all our simulations except the lunar magnetic anomaly (LMA) and quadrupolar mini-magnetosphere cases, the number of spiral nulls prevails over the number of radial nulls by a factor of 3-9. We show that often magnetic nulls do not indicate the regions of intensive energy dissipation. Energy dissipation events caused by topological bifurcations at radial nulls are rather rare and short-lived. The so-called X-lines formed by the radial nulls in the Harris current sheet and LMA simulations are rather stable and do not exhibit any energy dissipation. Energy dissipation is more powerful in the vicinity of spiral nulls enclosed by magnetic flux ropes with strong currents at their axes (their cross sections resemble 2D magnetic islands). These null lines reminiscent of Z-pinches efficiently dissipate magnetic energy due to secondary instabilities such as the two-stream or kinking instability, accompanied by changes in magnetic topology. Current enhancements accompanied by spiral nulls may signal magnetic energy conversion sites in the observational data.
  •  
25.
  • Peng, Ivy Bo, et al. (author)
  • A Data streaming model in MPI
  • 2015
  • In: Proceedings of the 3rd ExaMPI Workshop at the International Conference on High Performance Computing, Networking, Storage and Analysis, SC 2015. - New York, NY, USA : ACM Digital Library. - 9781450339988
  • Conference paper (peer-reviewed) Abstract:
    • The data streaming model is an effective way to tackle the challenge of data-intensive applications. As traditional HPC applications generate large volumes of data and more data-intensive applications move to HPC infrastructures, it is necessary to investigate the feasibility of combining message-passing and streaming programming models. MPI, the de facto standard for programming on HPC, cannot intuitively express the communication pattern and the functional operations required in streaming models. In this work, we designed and implemented a data streaming library, MPIStream, atop MPI to allocate data producers and consumers, to stream data continuously or irregularly and to process data at runtime. In the same spirit as the STREAM benchmark, we developed a parallel stream benchmark to measure the data processing rate. The performance of the library largely depends on the size of the stream element, the number of data producers and consumers and the computational intensity of processing one stream element. With 2,048 data producers and 2,048 data consumers in the parallel benchmark, MPIStream achieved a 200 GB/s processing rate on a Blue Gene/Q supercomputer. We illustrate that a streaming library for HPC applications can effectively enable irregular parallel I/O, application monitoring and threshold collective operations.
  •  
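The producer/consumer streaming pattern described in the abstract above can be sketched, very loosely, in plain Python, with threads and a queue standing in for MPI processes and messages (MPIStream itself is an MPI-based library; all names below are illustrative):

```python
import queue
import threading

def producer(stream, items):
    # Continuously push stream elements into the channel
    # (a stand-in for an MPI rank acting as a data producer).
    for item in items:
        stream.put(item)
    stream.put(None)  # end-of-stream marker

def consumer(stream, results):
    # Process elements at runtime as they arrive
    # (a stand-in for an MPI rank acting as a data consumer).
    while True:
        item = stream.get()
        if item is None:
            break
        results.append(item * 2)  # placeholder per-element operation

stream = queue.Queue()
results = []
p = threading.Thread(target=producer, args=(stream, range(4)))
c = threading.Thread(target=consumer, args=(stream, results))
p.start(); c.start()
p.join(); c.join()
print(results)  # [0, 2, 4, 6]
```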
26.
  • Peng, Ivy Bo (author)
  • Data Movement on Emerging Large-Scale Parallel Systems
  • 2017
  • Doctoral thesis (other academic/artistic) Abstract:
    • Large-scale HPC systems are an important driver for solving computational problems in scientific communities. Next-generation HPC systems will not only grow in scale but also in heterogeneity. This increased system complexity entails more challenges to data movement in HPC applications. Data movement on emerging HPC systems requires asynchronous fine-grained communication and efficient data placement in the main memory. This thesis proposes an innovative programming model and algorithm to prepare HPC applications for the next computing era: (1) a data streaming model that supports emerging data-intensive applications on supercomputers, (2) a decoupling model that improves parallelism and mitigates the impact of imbalance in applications, (3) a new framework and methodology for predicting the impact of large-scale heterogeneous memory systems on HPC applications, and (4) a data placement algorithm that uses a set of rules and a decision tree to determine the data-to-memory mapping in heterogeneous main memory. The proposed approaches in this thesis are evaluated on multiple supercomputers with different processors and interconnect networks. The evaluation uses a diverse set of applications that represent conventional scientific applications and emerging data-analytic workloads on HPC systems. The experimental results on the petascale testbed show that the approaches obtain increasing performance improvements as system scale increases and this trend supports the approaches as a valuable contribution towards future HPC systems.
  •  
27.
  • Peng, Ivy Bo, et al. (author)
  • Energetic particles in magnetotail reconnection
  • 2015
  • In: Journal of Plasma Physics. - 0022-3778 .- 1469-7807. ; 81
  • Journal article (peer-reviewed) Abstract:
    • We carried out a 3D fully kinetic simulation of Earth's magnetotail magnetic reconnection to study the dynamics of energetic particles. We developed and implemented a new relativistic particle mover in iPIC3D, an implicit Particle-in-Cell code, to correctly model the dynamics of energetic particles. Before the onset of magnetic reconnection, energetic electrons are found localized close to the current sheet and accelerated by the lower hybrid drift instability. During magnetic reconnection, energetic particles are found in the reconnection region along the x-line and in the separatrices region. The energetic electrons are first present in localized stripes of the separatrices and finally cover all the separatrix surfaces. Along the separatrices, regions with strong electron deceleration are found. In the reconnection region, two categories of electron trajectories are identified. First, some of the electrons are trapped in the reconnection region, bouncing a few times between the outflow jets. Second, some of the electrons pass through the reconnection region without being trapped. Unlike electrons, energetic ions are localized on the reconnection fronts of the outflow jets.
  •  
28.
  • Peng, Ivy Bo, et al. (author)
  • Exploring the performance benefit of hybrid memory system on HPC environments
  • 2017
  • In: Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017. - : Institute of Electrical and Electronics Engineers (IEEE). - 9781538634080 ; pp. 683-692
  • Conference paper (peer-reviewed) Abstract:
    • Hardware accelerators have become a de-facto standard to achieve high performance on current supercomputers and there are indications that this trend will increase in the future. Modern accelerators feature high-bandwidth memory next to the computing cores. For example, the Intel Knights Landing (KNL) processor is equipped with 16 GB of high-bandwidth memory (HBM) that works together with conventional DRAM memory. Theoretically, HBM can provide ∼4× higher bandwidth than conventional DRAM. However, many factors impact the effective performance achieved by applications, including the application memory access pattern, the problem size, the threading level and the actual memory configuration. In this paper, we analyze the Intel KNL system and quantify the impact of the most important factors on the application performance by using a set of applications that are representative of scientific and data-analytics workloads. Our results show that applications with regular memory access benefit from MCDRAM, achieving up to 3× performance when compared to the performance obtained using only DRAM. On the contrary, applications with random memory access pattern are latency-bound and may suffer from performance degradation when using only MCDRAM. For those applications, the use of additional hardware threads may help hide latency and achieve higher aggregated bandwidth when using HBM.
  •  
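The latency-hiding observation in the last sentence of the abstract above can be made concrete with a Little's-law estimate: sustained bandwidth is bounded by the bytes in flight divided by memory latency. A rough sketch, with illustrative (not measured) numbers:

```python
def achieved_bandwidth_gbs(concurrency, line_bytes=64, latency_ns=150.0):
    # Little's law: sustained bandwidth = bytes in flight / memory latency.
    # Bytes per nanosecond is numerically equal to GB/s.
    return concurrency * line_bytes / latency_ns

# One thread sustaining ~10 outstanding cache-line misses is far from
# saturating high-bandwidth memory; adding hardware threads raises the
# total concurrency and hence the aggregated bandwidth.
one_thread = achieved_bandwidth_gbs(10)    # ~4.3 GB/s
four_threads = achieved_bandwidth_gbs(40)  # ~17.1 GB/s
```

This is why latency-bound, random-access applications may need extra hardware threads before HBM pays off.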
29.
  • Peng, Ivy Bo, et al. (author)
  • MPI Streams for HPC Applications
  • 2017
  • In: New Frontiers in High Performance Computing and Big Data. - : IOS Press. - 9781614998150 - 9781614998167 ; pp. 75-92
  • Conference paper (peer-reviewed) Abstract:
    • Data streams are a sequence of data flowing between source and destination processes. Streaming is widely used for signal, image and video processing for its efficiency in pipelining and effectiveness in reducing demand for memory. The goal of this work is to extend the use of data streams to support both conventional scientific applications and emerging data analytics applications running on HPC platforms. We introduce an extension called MPIStream to the de-facto programming standard on HPC, MPI. MPIStream supports data streams either within a single application or among multiple applications. We present three use cases using MPI streams in HPC applications together with their parallel performance. We show the convenience of using MPI streams to support the needs from both traditional HPC and emerging data analytics applications running on supercomputers.
  •  
30.
  • Peng, Ivy Bo, et al. (author)
  • OpenCUBE : Building an Open Source Cloud Blueprint with EPI Systems
  • 2024
  • In: Euro-Par 2023. - : Springer Nature. ; pp. 260-264
  • Conference paper (peer-reviewed) Abstract:
    • OpenCUBE aims to develop an open-source, full software stack for a Cloud computing blueprint deployed on EPI hardware, adaptable to emerging workloads across the computing continuum. OpenCUBE prioritizes energy awareness and utilizes open APIs, Open Source components, advanced SiPearl Rhea processors, and a RISC-V accelerator. The project leverages representative workloads, such as cloud-native workloads and workflows of weather forecast data management, molecular docking, and space weather, for evaluation and validation.
  •  
31.
  • Peng, Ivy Bo, et al. (author)
  • Preparing HPC Applications for the Exascale Era: A Decoupling Strategy
  • 2017
  • In: 2017 46th International Conference on Parallel Processing (ICPP). - : IEEE Computer Society. - 9781538610428 ; pp. 1-10
  • Conference paper (peer-reviewed) Abstract:
    • Production-quality parallel applications are often a mixture of diverse operations, such as computation- and communication-intensive, regular and irregular, tightly coupled and loosely linked operations. In the conventional construction of parallel applications, each process performs all the operations, which can be inefficient and seriously limit scalability, especially at large scale. We propose a decoupling strategy to improve the scalability of applications running on large-scale systems. Our strategy separates application operations onto groups of processes and enables a dataflow processing paradigm among the groups. This mechanism is effective in reducing the impact of load imbalance and increases the parallel efficiency by pipelining multiple operations. We provide a proof-of-concept implementation using MPI, the de-facto programming system on current supercomputers. We demonstrate the effectiveness of this strategy by decoupling the reduce, particle communication, halo exchange and I/O operations in a set of scientific and data-analytics applications. A performance evaluation on 8,192 processes of a Cray XC40 supercomputer shows that the proposed approach can achieve up to 4x performance improvement.
  •  
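The process-group separation in the abstract above can be sketched as a simple rank-partitioning rule. This illustrates the idea only, not the paper's implementation, and the 25% I/O fraction is an arbitrary assumption:

```python
def assign_role(rank, nprocs, io_fraction=0.25):
    # Dedicate the first io_fraction of ranks to communication/I/O-style
    # operations; the remaining ranks run computation. The two groups then
    # exchange data in a pipelined, dataflow fashion.
    n_io = max(1, int(nprocs * io_fraction))
    return "io" if rank < n_io else "compute"

roles = [assign_role(rank, 8) for rank in range(8)]
print(roles)  # ['io', 'io', 'compute', 'compute', ..., 'compute']
```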
32.
  • Peng, Ivy Bo, et al. (author)
  • RTHMS : A Tool for Data Placement on Hybrid Memory System
  • 2017
  • In: Proceedings of the 2017 ACM SIGPLAN International Symposium on Memory Management, ISMM 2017. - New York, NY, USA : Association for Computing Machinery (ACM). ; pp. 82-91
  • Conference paper (peer-reviewed) Abstract:
    • Traditional scientific and emerging data analytics applications require fast, power-efficient, large, and persistent memories. Combining all these characteristics within a single memory technology is expensive and hence future supercomputers will feature different memory technologies side-by-side. However, it is a complex task to program hybrid-memory systems and to identify the best object-to-memory mapping. We envision that programmers will probably resort to use default configurations that only require minimal interventions on the application code or system settings. In this work, we argue that intelligent, fine-grained data placement can achieve higher performance than default setups. We present an algorithm for data placement on hybrid-memory systems. Our algorithm is based on a set of single-object allocation rules and global data placement decisions. We also present RTHMS, a tool that implements our algorithm and provides recommendations about the object-to-memory mapping. Our experiments on a hybrid memory system, an Intel Knights Landing processor with DRAM and HBM, show that RTHMS is able to achieve higher performance than the default configuration. We believe that RTHMS will be a valuable tool for programmers working on complex hybrid-memory systems.
  •  
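The flavor of rule-based, per-object placement described above can be sketched as follows; the attribute names and thresholds are invented for illustration and do not reproduce RTHMS's actual rules:

```python
def recommend_placement(obj, hbm_capacity_mb=16 * 1024):
    # Toy single-object rules: objects that cannot fit in the 16 GB HBM go
    # to DRAM; bandwidth-sensitive objects with regular access go to HBM;
    # everything else (e.g. latency-bound random access) defaults to DRAM.
    if obj["size_mb"] > hbm_capacity_mb:
        return "DRAM"
    if obj["access"] == "regular" and obj["bandwidth_sensitive"]:
        return "HBM"
    return "DRAM"

stream_buffer = {"size_mb": 512, "access": "regular", "bandwidth_sensitive": True}
hash_table = {"size_mb": 512, "access": "random", "bandwidth_sensitive": False}
print(recommend_placement(stream_buffer), recommend_placement(hash_table))  # HBM DRAM
```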
33.
  • Rivas-Gomez, Sergio, et al. (author)
  • Extending message passing interface windows to storage
  • 2017
  • In: Proceedings - 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017. - : Institute of Electrical and Electronics Engineers Inc. - 9781509066100 ; pp. 728-730
  • Conference paper (peer-reviewed) Abstract:
    • This paper presents an extension to MPI supporting the one-sided communication model and window allocations in storage. Our design transparently integrates with the current MPI implementations, enabling applications to target MPI windows in storage, memory or both simultaneously, without major modifications. Initial performance results demonstrate that the presented MPI window extension could potentially be helpful for a wide range of use-cases with low overhead.
  •  
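The idea of addressing storage through a window can be illustrated, by analogy only, with a memory-mapped file in Python; the actual extension exposes this through MPI one-sided windows rather than mmap:

```python
import mmap
import os
import tempfile

# The file below plays the role of a window allocated on storage: loads and
# stores on the mapping act like one-sided get/put operations whose data
# ultimately lands on the storage device.
path = os.path.join(tempfile.mkdtemp(), "window.bin")
with open(path, "wb") as f:
    f.truncate(4096)                   # reserve the window's extent on storage
with open(path, "r+b") as f:
    win = mmap.mmap(f.fileno(), 4096)  # map storage into the address space
    win[0:5] = b"hello"                # "put" into the window
    win.flush()                        # ensure the data reaches storage
    data = bytes(win[0:5])             # "get" from the window
    win.close()
print(data)  # b'hello'
```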
34.
  • Rivas-Gomez, Sergio, et al. (author)
  • MPI windows on storage for HPC applications
  • 2018
  • In: Parallel Computing. - : Elsevier. - 0167-8191 .- 1872-7336. ; 77, pp. 38-56
  • Journal article (peer-reviewed) Abstract:
    • Upcoming HPC clusters will feature hybrid memories and storage devices per compute node. In this work, we propose to use the MPI one-sided communication model and MPI windows as unique interface for programming memory and storage. We describe the design and implementation of MPI storage windows, and present its benefits for out-of-core execution, parallel I/O and fault-tolerance. In addition, we explore the integration of heterogeneous window allocations, where memory and storage share a unified virtual address space. When performing large, irregular memory operations, we verify that MPI windows on local storage incurs a 55% performance penalty on average. When using a Lustre parallel file system, "asymmetric" performance is observed with over 90% degradation in writing operations. Nonetheless, experimental results of a Distributed Hash Table, the HACC I/O kernel mini-application, and a novel MapReduce implementation based on the use of MPI one-sided communication, indicate that the overall penalty of MPI windows on storage can be negligible in most cases in real-world applications.
  •  
35.
  • Rivas-Gomez, Sergio, et al. (author)
  • MPI windows on storage for HPC applications
  • 2017
  • In: ACM International Conference Proceeding Series. - New York, NY, USA : Association for Computing Machinery (ACM).
  • Conference paper (peer-reviewed) Abstract:
    • Upcoming HPC clusters will feature hybrid memories and storage devices per compute node. In this work, we propose to use the MPI one-sided communication model and MPI windows as unique interface for programming memory and storage. We describe the design and implementation of MPI windows on storage, and present its benefits for out-of-core execution, parallel I/O and fault-tolerance. Using a modified STREAM micro-benchmark, we measure the sustained bandwidth of MPI windows on storage against MPI memory windows and observe that only a 10% performance penalty is incurred. When using parallel file systems such as Lustre, asymmetric performance is observed with a 10% performance penalty in reading operations and a 90% in writing operations. Nonetheless, experimental results of a Distributed Hash Table and the HACC I/O kernel mini-application show that the overall penalty of MPI windows on storage can be negligible in most cases on real-world applications. 
  •  
36.
  • Schieffer, Gabin, et al. (author)
  • Boosting the Performance of Object Tracking with a Half-Precision Particle Filter on GPU
  • 2024
  • In: Euro-Par 2023: Parallel Processing Workshops - Euro-Par 2023 International Workshops, Limassol, Cyprus, August 28 – September 1, 2023, Revised Selected Papers. - : Springer Nature. ; pp. 294-305
  • Conference paper (peer-reviewed) Abstract:
    • High-performance GPU-accelerated particle filter methods are critical for object detection applications, ranging from autonomous driving, robot localization, to time-series prediction. In this work, we investigate the design, development and optimization of particle-filter using half-precision on CUDA cores and compare their performance and accuracy with single- and double-precision baselines on Nvidia V100, A100, A40 and T4 GPUs. To mitigate numerical instability and precision losses, we introduce algorithmic changes in the particle filters. Using half-precision leads to a performance improvement of 1.5–2 × and 2.5–4.6 × with respect to single- and double-precision baselines respectively, at the cost of a relatively small loss of accuracy.
  •  
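A core step of any particle filter is resampling, which is also where reduced precision tends to be delicate, since small normalized weights underflow first in half precision. A pure-Python sketch of systematic resampling (a standard algorithm, not the paper's CUDA implementation):

```python
import random

def systematic_resample(weights):
    # Draw n ancestor indices using one random offset and evenly spaced
    # positions over the normalized cumulative weights.
    n = len(weights)
    total = sum(weights)
    cumulative, running = [], 0.0
    for w in weights:
        running += w / total   # normalization: a precision-sensitive step
        cumulative.append(running)
    offset = random.random() / n
    indices, idx = [], 0
    for i in range(n):
        u = offset + i / n
        while cumulative[idx] < u:
            idx += 1
        indices.append(idx)
    return indices

ancestors = systematic_resample([1.0, 1.0, 1.0, 1.0])
```

In half precision, the `w / total` normalization is one place where accuracy is typically lost, which is the kind of instability the abstract's algorithmic changes aim to mitigate.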
37.
  • Schieffer, Gabin, et al. (author)
  • On the Rise of AMD Matrix Cores: Performance, Power Efficiency, and Programmability
  • 2024
  • In: Proceedings - 2024 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2024. - : Institute of Electrical and Electronics Engineers (IEEE). ; pp. 132-143
  • Conference paper (peer-reviewed) Abstract:
    • Matrix multiplication is a core computational part of deep learning and scientific workloads. The emergence of Matrix Cores in high-end AMD GPUs, a building block of Exascale computers, opens new opportunities for optimizing the performance and power efficiency of compute-intensive applications. This work provides a timely, comprehensive characterization of the novel Matrix Cores in AMD GPUs. We develop low-level micro-benchmarks for leveraging Matrix Cores at different levels of parallelism, achieving up to 350, 88, and 69 TFLOPS for mixed, float, and double precision on one GPU. Using results obtained from the micro-benchmarks, we provide a performance model of Matrix Cores that can guide application developers in performance tuning. We also provide the first quantitative study and modeling of the power efficiency of Matrix Cores at different floating-point data types. Finally, we evaluate the high-level programmability of Matrix Cores through the rocBLAS library in a wide range of matrix sizes from 16 to 64K. Our results indicate that application developers can transparently leverage Matrix Cores to deliver more than 92% peak computing throughput by properly selecting data types and interfaces.
  •  
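As a reference point for the TFLOPS figures quoted in the abstract above, GEMM throughput is conventionally computed as 2·M·N·K operations divided by runtime; a minimal helper (the example dimensions and time are illustrative):

```python
def matmul_tflops(m, n, k, seconds):
    # An m x k by k x n matrix multiplication performs 2*m*n*k floating-point
    # operations (one multiply and one add per inner-product term).
    return 2 * m * n * k / seconds / 1e12

# e.g. a 4096x4096x4096 GEMM finishing in 1 ms corresponds to ~137 TFLOPS
tflops = matmul_tflops(4096, 4096, 4096, 1e-3)
```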
38.
  • Toth, Gabor, et al. (author)
  • Extended magnetohydrodynamics with embedded particle-in-cell simulation of Ganymede's magnetosphere
  • 2016
  • In: Journal of Geophysical Research - Space Physics. - : American Geophysical Union (AGU). - 2169-9380 .- 2169-9402. ; 121:2, pp. 1273-1293
  • Journal article (peer-reviewed) Abstract:
    • We have recently developed a new modeling capability to embed the implicit particle-in-cell (PIC) model iPIC3D into the Block-Adaptive-Tree-Solarwind-Roe-Upwind-Scheme magnetohydrodynamic (MHD) model. The MHD with embedded PIC domains (MHD-EPIC) algorithm is a two-way coupled kinetic-fluid model. As one of the very first applications of the MHD-EPIC algorithm, we simulate the interaction between Jupiter's magnetospheric plasma and Ganymede's magnetosphere. We compare the MHD-EPIC simulations with pure Hall MHD simulations and compare both model results with Galileo observations to assess the importance of kinetic effects in controlling the configuration and dynamics of Ganymede's magnetosphere. We find that the Hall MHD and MHD-EPIC solutions are qualitatively similar, but there are significant quantitative differences. In particular, the density and pressure inside the magnetosphere show different distributions. For our baseline grid resolution the PIC solution is more dynamic than the Hall MHD simulation and it compares significantly better with the Galileo magnetic measurements than the Hall MHD solution. The power spectra of the observed and simulated magnetic field fluctuations agree extremely well for the MHD-EPIC model. The MHD-EPIC simulation also produced a few flux transfer events (FTEs) that have magnetic signatures very similar to an observed event. The simulation shows that the FTEs often exhibit complex 3-D structures with their orientations changing substantially between the equatorial plane and the Galileo trajectory, which explains the magnetic signatures observed during the magnetopause crossings. The computational cost of the MHD-EPIC simulation was only about 4 times more than that of the Hall MHD simulation.
  •  
39.
  • Toth, Gabor, et al. (author)
  • Scaling the Ion Inertial Length and Its Implications for Modeling Reconnection in Global Simulations
  • 2017
  • In: Journal of Geophysical Research - Space Physics. - : American Geophysical Union (AGU). - 2169-9380 .- 2169-9402. ; 122:10, pp. 10336-10355
  • Journal article (peer-reviewed) Abstract:
    • We investigate the use of artificially increased ion and electron kinetic scales in global plasma simulations. We argue that as long as the global and ion inertial scales remain well separated, (1) the overall global solution is not strongly sensitive to the value of the ion inertial scale, while (2) the ion inertial scale dynamics will also be similar to the original system, but it occurs at a larger spatial scale, and (3) structures at intermediate scales, such as magnetic islands, grow in a self-similar manner. To investigate the validity and limitations of our scaling hypotheses, we carry out many simulations of a two-dimensional magnetosphere with the magnetohydrodynamics with embedded particle-in-cell (MHD-EPIC) model. The PIC model covers the dayside reconnection site. The simulation results confirm that the hypotheses are true as long as the increased ion inertial length remains less than about 5% of the magnetopause standoff distance. Since the theoretical arguments are general, we expect these results to carry over to three dimensions. The computational cost is reduced by the third and fourth powers of the scaling factor in two-and three-dimensional simulations, respectively, which can be many orders of magnitude. The present results suggest that global simulations that resolve kinetic scales for reconnection are feasible. This is a crucial step for applications to the magnetospheres of Earth, Saturn, and Jupiter and to the solar corona.
  •  
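The cost argument in the abstract above can be written down directly: the ion inertial length is d_i = c/ω_pi, and since grid spacing and time step both scale with the artificial inflation factor, the cost shrinks with its third power in 2D and fourth power in 3D. A sketch in SI units (constants rounded):

```python
import math

def ion_inertial_length(n_i, m_i=1.6726e-27, q=1.602e-19):
    # d_i = c / omega_pi, with omega_pi = sqrt(n_i * q**2 / (eps0 * m_i)).
    eps0, c = 8.854e-12, 2.998e8
    omega_pi = math.sqrt(n_i * q**2 / (eps0 * m_i))
    return c / omega_pi

def scaled_cost_reduction(factor, dims=3):
    # Inflating the ion inertial length by `factor` reduces the cost by
    # factor**3 in 2D and factor**4 in 3D, since grid spacing and time
    # step both scale with the factor.
    return factor ** (dims + 1)

d_i = ion_inertial_length(1e6)  # ~230 km for a density of 1 proton per cm^3
```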
40.
  • Wahlgren, Jacob, et al. (author)
  • A Quantitative Approach for Adopting Disaggregated Memory in HPC Systems
  • 2023
  • In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2023. - : Association for Computing Machinery (ACM).
  • Conference paper (peer-reviewed) Abstract:
    • Memory disaggregation has recently been adopted in data centers to improve resource utilization, motivated by cost and sustainability. Recent studies on large-scale HPC facilities have also highlighted memory underutilization. A promising and non-disruptive option for memory disaggregation is rack-scale memory pooling, where node-local memory is supplemented by shared memory pools. This work outlines the prospects and requirements for adoption and clarifies several misconceptions. We propose a quantitative method for dissecting application requirements on the memory system from the top down in three levels, moving from general, to multi-tier memory systems, and then to memory pooling. We provide a multi-level profiling tool and LBench to facilitate the quantitative approach. We evaluate a set of representative HPC workloads on an emulated platform. Our results show that prefetching activities can significantly influence memory traffic profiles. Interference in memory pooling has varied impacts on applications, depending on their access ratios to memory tiers and arithmetic intensities. Finally, in two case studies, we show the benefits of our findings at the application and system levels, achieving 50% reduction in remote access and 13% speedup in BFS, and reducing performance variation of co-located workloads in interference-aware job scheduling.
  •  
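The dependence on tier access ratios noted in the abstract above can be captured by a minimal two-tier latency model; the latency numbers below are illustrative placeholders, not measurements from the paper:

```python
def avg_access_ns(local_ratio, t_local_ns=100.0, t_remote_ns=250.0):
    # Average memory access time as a weighted mix of node-local accesses
    # and accesses to a shared (pooled) remote memory tier.
    return local_ratio * t_local_ns + (1.0 - local_ratio) * t_remote_ns

# Cutting remote accesses from 50% to 25% of the total lowers the average
# latency from 175 ns to 137.5 ns in this toy model, which is the kind of
# effect behind the reported 50% remote-access reduction and BFS speedup.
before = avg_access_ns(0.50)
after = avg_access_ns(0.75)
```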
41.
  • Williams, Jeremy J., et al. (author)
  • Characterizing the Performance of the Implicit Massively Parallel Particle-in-Cell iPIC3D Code
  • 2023
  • In: SC23 Proceedings. - Denver, Colorado, USA.
  • Conference paper (peer-reviewed) Abstract:
    • Optimizing iPIC3D, an implicit Particle-in-Cell (PIC) code, for large-scale 3D plasma simulations is crucial for space and astrophysical applications. This work focuses on characterizing iPIC3D's communication efficiency through strategic measures like optimal node placement, communication and computation overlap, and load balancing. Profiling and tracing tools are employed to analyze iPIC3D's communication efficiency and provide practical recommendations. Implementing optimized communication protocols addresses the Geospace Environmental Modeling (GEM) magnetic reconnection challenges in plasma physics with more precise simulations. This approach captures the complexities of 3D plasma simulations, particularly in magnetic reconnection, advancing space and astrophysical research.
  •  
42.
  • Williams, Jeremy J., et al. (author)
  • Leveraging HPC Profiling and Tracing Tools to Understand the Performance of Particle-in-Cell Monte Carlo Simulations
  • 2024
  • In: Euro-Par 2023: Parallel Processing Workshops - Euro-Par 2023 International Workshops, Limassol, Cyprus, August 28 – September 1, 2023, Revised Selected Papers. - : Springer Nature. ; pp. 123-134
  • Conference paper (peer-reviewed) Abstract:
    • Large-scale plasma simulations are critical for designing and developing next-generation fusion energy devices and modeling industrial plasmas. BIT1 is a massively parallel Particle-in-Cell code designed specifically for studying plasma-material interaction in fusion devices. Its most salient characteristic is the inclusion of collision Monte Carlo models for different plasma species. In this work, we characterize the single-node, multi-node, and I/O performance of the BIT1 code in two realistic cases by using several HPC profilers, such as perf, IPM, Extrae/Paraver, and Darshan. We find that the on-node performance of the BIT1 sorting function is the main performance bottleneck. Strong scaling tests show a parallel performance of 77% and 96% on 2,560 MPI ranks for the two test cases. We demonstrate that communication, load imbalance and self-synchronization are important factors impacting the performance of BIT1 in large-scale runs.
  •  
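The 77%/96% figures in the abstract above are strong-scaling efficiencies; relative efficiency against a baseline run is computed as below (the rank counts and timings here are made-up numbers for illustration):

```python
def strong_scaling_efficiency(t_base, p_base, t_n, p_n):
    # Ideal strong scaling from p_base to p_n ranks divides the runtime by
    # p_n / p_base; efficiency is the ratio of ideal to measured time.
    ideal_t_n = t_base * p_base / p_n
    return ideal_t_n / t_n

# e.g. a baseline on 80 ranks taking 100 s, and 2,560 ranks taking 4 s:
# ideal time is 100 * 80 / 2560 = 3.125 s, i.e. ~78% efficiency.
eff = strong_scaling_efficiency(100.0, 80, 4.0, 2560)
```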
43.
  • Yu, Yiqun, et al. (author)
  • PIC simulations of wave-particle interactions with an initial electron velocity distribution from a kinetic ring current model
  • 2018
  • In: Journal of Atmospheric and Solar-Terrestrial Physics. - : Pergamon-Elsevier Science Ltd. - 1364-6826 .- 1879-1824. ; 177, pp. 169-178
  • Journal article (peer-reviewed) Abstract:
    • Whistler wave-particle interactions play an important role in the dynamics of Earth's inner magnetosphere and have been the subject of numerous investigations. By running a global kinetic ring current model (RAM-SCB) for a storm event that occurred on October 23-24, 2002, we obtain the ring current electron distribution at a selected location at MLT of 9 and L of 6, where the electron distribution is composed of a warm population in the form of a partial ring in velocity space (with energy around 15 keV) in addition to a cool population with a Maxwellian-like distribution. The warm population likely originates from plasma sheet electrons injected during substorms, which supply a fresh source to the inner magnetosphere. These electron distributions are then used as input to an implicit particle-in-cell code (iPIC3D) to study whistler-wave generation and the subsequent wave-particle interactions. We find that whistler waves are excited and propagate in the quasi-parallel direction along the background magnetic field. Several different wave modes are instantaneously generated with different growth rates and frequencies. The wave mode at the maximum growth rate has a frequency around 0.62 ω_ce, which corresponds to a parallel resonant energy of 2.5 keV. Linear theory analysis of the wave growth is in excellent agreement with the simulation results. These waves grow initially due to the injected warm electrons and are later damped by cyclotron absorption by electrons whose energy is close to the resonant energy and can effectively attenuate the waves. The warm electron population overall experiences a net energy loss and a drop in anisotropy while moving along the diffusion surfaces towards regions of lower phase space density, while the cool electron population undergoes heating as the waves grow, suggesting cross-population interactions.
  •  
  • Results 1-43 of 43
Publication type
conference paper (28)
journal article (10)
licentiate thesis (2)
proceedings (editorship) (1)
doctoral thesis (1)
book chapter (1)
Type of content
peer-reviewed (36)
other academic/artistic (7)
Author/editor
Peng, Ivy Bo (42)
Markidis, Stefano (34)
Laure, Erwin (16)
Araújo De Medeiros, ... (7)
Schieffer, Gabin (5)
Wahlgren, Jacob (5)
Chen, Yuxi (5)
Toth, Gabor (5)
Kestor, G. (4)
Gioiosa, R. (4)
Gokhale, Maya (4)
Gombosi, Tamas I. (3)
Palmroth, Minna (2)
Akhmetova, Dana (2)
Iakymchuk, Roman (2)
Brightwell, Ron (2)
Battarbee, Markus (2)
Ganse, Urs (2)
Pfau-Kempf, Yann (2)
Rahn, Mirko (2)
Cassak, Paul (2)
Jia, Xianzhe (2)
Vinuesa, Ricardo (1)
Podobas, Artur (1)
Laure, Erwin, Profes ... (1)
Wu, S (1)
Vaivads, Andris (1)
Pleiter, Dirk (1)
Netzer, Gilbert (1)
Allen, Tyler (1)
Andersson, Måns (1)
Jansson, Niclas, 198 ... (1)
Anzt, Hartwig (1)
Peng, Ivy Bo, Assist ... (1)
Markidis, Stefano, P ... (1)
Herman, Pawel, Assoc ... (1)
Cardellini, Valeria, ... (1)
Russell, Christopher ... (1)
Henri, Pierre (1)
Laure, E. (1)
Fischer, Paul (1)
Kyriienko, Oleksandr (1)
Tolias, Panagiotis, ... (1)
Larsson Träff, Jespe ... (1)
Holmes, D (1)
Slavin, James A. (1)
Gong, Jing (1)
Bull, M (1)
Schulz, Martin (1)
Jordanova, Vania K. (1)
Higher education institution
Kungliga Tekniska Högskolan (43)
Uppsala universitet (1)
Language
English (43)
Research subject (UKÄ/SCB)
Natural sciences (36)
Engineering and technology (10)

Year
