
Result list for search "WFRF:(Podobas Artur)"

Search: WFRF:(Podobas Artur)

Result 1-10 of 59


1.	Adhi, Boma, et al. (author) Exploration Framework for Synthesizable CGRAs Targeting HPC : Initial Design and Evaluation 2022 In: 2022 IEEE 36th International Parallel And Distributed Processing Symposium Workshops (IPDPSW 2022). - : Institute of Electrical and Electronics Engineers (IEEE). ; , s. 639-646 Conference paper (peer-reviewed)abstract Among the more salient accelerator technologies to continue performance scaling in High-Performance Computing (HPC) are Coarse-Grained Reconfigurable Arrays (CGRAs). However, what benefits CGRAs will bring to HPC workloads and how those benefits will be reaped is an open research question today. In this work, we propose a framework to explore the design space of CGRAs for HPC workloads, which includes a tool flow of compilation and simulation, a CGRA HDL library written in SystemVerilog, and a synthesizable CGRA design as a baseline. Using RTL simulation, we evaluate two well-known computation kernels with the baseline CGRA for multiple different architectural parameters. The simulation results demonstrate both correctness and usefulness of our exploration framework.
2.	Adhi, Boma, et al. (author) Exploring Inter-tile Connectivity for HPC-oriented CGRA with Lower Resource Usage 2022 In: FPT 2022. - : Institute of Electrical and Electronics Engineers (IEEE). Conference paper (peer-reviewed)abstract This research aims to explore the tradeoffs between routing flexibility and hardware resource usage, ultimately reducing the resource usage of our CGRA architecture while maintaining compute efficiency. We investigate statistics of connection usage among switch blocks for benchmark DFGs, propose several CGRA architectures with reduced connections, and evaluate their hardware cost, routability of DFGs, and computational throughput for benchmarks. We found that the topology with horizontal plus diagonal connections saves about 30% of the resource usage while maintaining virtually the same routing flexibility as the full connectivity topology.
3.	Adhi, Boma, et al. (author) Less for more : reducing intra-cgra connectivity for higher performance and efficiency in hpc 2023 In: 2023 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2023. - : Institute of Electrical and Electronics Engineers (IEEE). ; , s. 452-459 Conference paper (peer-reviewed)abstract Coarse-Grained Reconfigurable Arrays (CGRAs) are a class of reconfigurable architectures that inherit the performance of Domain-specific accelerators and the reconfigurability aspects of Field-Programmable Gate Arrays (FPGAs). Historically, CGRAs have been successfully used to accelerate embedded applications and are now considered to accelerate High-Performance Computing (HPC) applications in future supercomputers. However, embedded systems and supercomputers are two vastly different domains with different applications and constraints, and it is today not fully understood what CGRA design decisions adequately cater to the HPC market. One such unknown design decision is regarding the interconnect that facilitates intra-CGRA communication. Our findings show that even the typical king-style mesh-like topology is often under-utilized with a typical HPC workload, leading to inefficiency. This research aims to explore the provisioning of intra-CGRA interconnect for HPC-oriented workloads and, ultimately, recoup the potential performance and efficiency lost by reducing the interconnect complexity. We proposed several reduced interconnect topologies based on the usage statistic. Then we evaluate the tradeoffs regarding hardware cost, routability of DFGs, and computational throughput.
4.	Adhi, Boma, et al. (author) The Cost of Flexibility : Embedded versus Discrete Routers in CGRAs for HPC 2022 In: 2022 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2022). - : Institute of Electrical and Electronics Engineers (IEEE). ; , s. 347-356 Conference paper (peer-reviewed)abstract Coarse-Grained Reconfigurable Arrays (CGRAs) are a class of reconfigurable architectures that inherit the performance and usability properties of Central Processing Units (CPUs) and the reconfigurability aspects of Field-Programmable Gate Arrays (FPGAs). Historically, CGRAs have been successfully used to accelerate embedded applications and are today also being considered to accelerate High-Performance Computing (HPC) applications in future supercomputers. However, embedded systems and supercomputers are two vastly different domains with different applications and constraints, and it is today not fully understood what CGRA design decisions adequately cater to the HPC market. One such unknown design decision is regarding the interconnect that facilitates intra-CGRA communication. Today, intra-CGRA communication comes in two flavors: using routers closely embedded into the compute units or using discrete routers outside the compute units. The former trades flexibility for a reduction in hardware cost, while the latter has greater flexibility but is more resource hungry. In this paper, we aspire to understand which of both designs best suits the CGRA HPC segment. We extend our previous methodology, which consists of both a parameterized CGRA design and an OpenMP-capable compiler, to accommodate both types of routing designs, including verification tests using RTL simulation. 
Our results show that the discrete router design can facilitate better use of processing elements (PEs) compared to embedded routers and can achieve up to 79.27% reduction in unnecessary PE occupancy for an aggressively unrolled stencil kernel on a 18 x 16 CGRA at a (estimated) hardware resource overhead cost of 6.3x. This reduction in PE occupancy can be used, for example, to exploit instruction-level parallelism (ILP) through even more aggressive unrolling.
5.	Alexandru, Iordan, et al. (author) Investigating the Potential of Energy-savings Using a Fine-grained Task Based Programming Model on Multi-cores 2011 Conference paper (peer-reviewed)abstract In this paper we study the relation between energy-efficiency and parallel executions when implemented with a fine-grained task-centric programming model. Using a simulation framework comprised of an architectural simulator and a power and area estimation tool, we have investigated the potential energy-savings when employing parallelism on multi-core systems. In our experiments with 2-8 multi-core systems, we employed frequency and voltage scaling in order to keep the relative performance of the systems constant and measured the energy-efficiency using the Energy-delay-product. Also, we compared the energy consumption of the parallel execution against the serial one. Our results show that through judicious choice of load balancing parameters, significant improvements of around 200% in energy consumption can be achieved.
6.	Andersson, Måns, et al. (author) Breaking Down the Parallel Performance of GROMACS, a High-Performance Molecular Dynamics Software 2023 In: PPAM 2022. Lecture Notes in Computer Science, vol 13826.. - : Springer Nature. ; , s. 333-345 Conference paper (peer-reviewed)abstract GROMACS is one of the most widely used HPC software packages using the Molecular Dynamics (MD) simulation technique. In this work, we quantify GROMACS parallel performance using different configurations, HPC systems, and FFT libraries (FFTW, Intel MKL FFT, and FFT PACK). We break down the cost of each GROMACS computational phase and identify non-scalable stages, such as MPI communication during the 3D FFT computation when using a large number of processes. We show that the Particle-Mesh Ewald phase and the 3D FFT calculation significantly impact the GROMACS performance. Finally, we discuss performance opportunities with a particular interest in developing GROMACS for the FFT calculations.
7.	Andersson, Måns (author) Leveraging Intermediate Representations for High-Performance Portable Discrete Fourier Transform Frameworks : with Application to Molecular Dynamics 2023 Licentiate thesis (other academic/artistic)abstract The Discrete Fourier Transform (DFT) and its improved formulations, the Fast Fourier Transforms (FFTs), are vital for scientists and engineers in a range of domains from signal processing to the solution of partial differential equations. A growing trend in Scientific Computing is heterogeneous computing, where accelerators are used instead of or together with CPUs. This has led to problems for developers in unifying portability, performance, and productivity. This thesis first motivates this work by showing the importance of having efficient DFT calculations, describes the DFT algorithm and a formulation based on matrix-factorizations which has been developed to formulate FFT algorithms and express their parallelism to exploit modern computer architectures, such as accelerators. The first paper is a motivating study of the breakdown of the performance and scalability of the high-performance Molecular Dynamics code GROMACS where DFT calculations are a main performance bottleneck. In particular, the long-range interactions are solved with the Particle-Mesh Ewald algorithm which uses a three-dimensional Fast Fourier Transform. The two following papers present two approaches to leverage factorization with the help of two different frameworks using Intermediate Representation and compiler technology, for the development of fast and portable code. The second paper presents a front-end and a pipeline for code generation in a domain-specific language based on Multi-Level Intermediate Representation (MLIR) for developing Fast Fourier Transform libraries. The last paper investigates and optimizes an implementation of an important kernel within the matrix-factorization framework: the batched DFT. 
It is implemented with data-centric programming and a data-centric intermediate representation called Stateful Dataflow multi-graphs (SDFG). The paper evaluates strategies for complex-valued data layout for performance and portability and we show that there is a trade-off between portability and maintainability in using the native complex data type and that an SDFG-level abstraction could be beneficial for developing higher-level applications.
8.	Bonnichsen, L., et al. (author) Using transactional memory to avoid blocking in OpenMP synchronization directives : Don’t wait, speculate! 2015 In: 11th International Workshop on OpenMP, IWOMP 2015. - Cham : Springer. - 9783319245942 ; , s. 149-161 Conference paper (peer-reviewed)abstract OpenMP applications with abundant parallelism are often characterized by their high performance. Unfortunately, OpenMP applications with a lot of synchronization or serialization points perform poorly because of blocking, i.e. the threads have to wait for each other. In this paper, we present methods based on hardware transactional memory (HTM) for executing OpenMP barrier, critical, and taskwait directives without blocking. Although HTM is still relatively new in the Intel and IBM architectures, we experimentally show a 73% performance improvement over traditional locking approaches, and 23% better than other HTM approaches on critical sections. Speculation over barriers can decrease execution time by up to 41%. We expect that future systems with HTM support and more cores will have a greater benefit from our approach as they are more likely to block.
9.	Borgström, Gustaf (author) Making Sampled Simulations Faster by Minimizing Warming Time 2022 Licentiate thesis (other academic/artistic)abstract A computer system simulator is a fundamental tool for computer architects to try out brand new ideas or explore the system’s response to different configurations when executing different program codes. However, even simulating the CPU core in detail is time-consuming as the execution rate slows down by several orders of magnitude compared to native execution. To solve this problem, previous work, namely SMARTS, demonstrates a statistical sampling methodology that records measurements only from tiny samples throughout the simulation. It spends only a fraction of the full simulation time on these sample measurements. In-between detailed sample simulations, SMARTS fast-forwards in the simulation using a greatly simplified and much faster simulation model (compared to full detail), which maintains only necessary parts of the architecture, such as cache memory. This maintenance process is called warming. While warming is mandatory to keep the simulation accuracy high, caches may be sufficiently warm for an accurate simulation long before reaching the sample. In other words, much time may be wasted on warming in SMARTS. In this work, we show that caches can be kept in an accurate state with much less time spent on warming. The first paper presents Adaptive Cache Warming, a methodology for identifying the minimum amount of warming in an iterative process for every SMARTS sample. The rest of the simulation time, previously spent on warming, can be skipped by fast-forwarding between samples using native hardware execution of the code. Doing so will thus result in significantly faster statistically sampled simulation while maintaining accuracy. The second paper presents Cache Merging, which mitigates the redundant warmings introduced in Adaptive Cache Warming. 
We solve this issue by going back in time and merging the existing warming with a cache warming session that comes chronologically before the existing warming. By removing the redundant warming, we yield even more speedup. Together, Adaptive Cache Warming and Cache Merging is a powerful boost for statistically sampled simulations.
10.	Brown, Nick, et al. (author) Utilising urgent computing to tackle the spread of mosquito-borne diseases 2021 In: Proceedings of Urgenthpc 2021. - : Institute of Electrical and Electronics Engineers (IEEE). ; , s. 36-44 Conference paper (peer-reviewed)abstract It is estimated that around 80% of the world's population live in areas susceptible to at least one major vector-borne disease, and approximately 20% of global communicable diseases are spread by mosquitoes. Furthermore, the outbreaks of such diseases are becoming more common and widespread, with much of this driven in recent years by socio-demographic and climatic factors. These trends are causing significant worry to global health organisations, including the CDC and WHO, and so an important question is the role that technology can play in addressing them. In this work we describe the integration of an epidemiology model, which simulates the spread of mosquito-borne diseases, with the VESTEC urgent computing ecosystem. The intention of this work is to empower human health professionals to exploit this model and more easily explore the progression of mosquito-borne diseases. Traditionally in the domain of the few research scientists, by leveraging state-of-the-art visualisation and analytics techniques, all supported by running the computational workloads on HPC machines in a seamless fashion, we demonstrate the significant advantages that such an integration can provide. Furthermore, we demonstrate the benefits of using an ecosystem such as VESTEC, which provides a framework for urgent computing, in supporting the easy adoption of these technologies by the epidemiologists and disaster response professionals more widely.



