Result list for search "L773:1544 3973 OR L773:1544 3566"

Search: L773:1544 3973 OR L773:1544 3566

Results 1-10 of 32

No.	Reference	Cover image	Find
1.	Alves, Ricardo, et al. (author) Early Address Prediction: Efficient Pipeline Prefetch and Reuse 2021 In: ACM Transactions on Architecture and Code Optimization (TACO). - : Association for Computing Machinery (ACM). - 1544-3566 .- 1544-3973. ; 18:3 Journal article (peer-reviewed) Abstract: Achieving low load-to-use latency with low energy and storage overheads is critical for performance. Existing techniques either prefetch into the pipeline (via address prediction and validation) or provide data reuse in the pipeline (via register sharing or L0 caches). These techniques provide a range of tradeoffs between latency, reuse, and overhead. In this work, we present a pipeline prefetching technique that achieves state-of-the-art performance and data reuse without additional data storage, data movement, or validation overheads by adding address tags to the register file. Our addition of register file tags allows us to forward (reuse) load data from the register file with no additional data movement, keep the data alive in the register file beyond the instruction's lifetime to increase temporal reuse, and coalesce prefetch requests to achieve spatial reuse. Further, we show that we can use the existing memory order violation detection hardware to validate prefetches and data forwards without additional overhead. Our design achieves the performance of existing pipeline prefetching while also forwarding 32% of the loads from the register file (compared to 15% in state-of-the-art register sharing), delivering a 16% reduction in L1 dynamic energy (1.6% total processor energy), with an area overhead of less than 0.5%.
2.	Anastasiadis, Petros, et al. (author) PARALiA: A Performance Aware Runtime for Auto-tuning Linear Algebra on Heterogeneous Systems 2023 In: Transactions on Architecture and Code Optimization. - 1544-3973 .- 1544-3566. ; 20:4 Journal article (peer-reviewed) Abstract: Dense linear algebra operations appear very frequently in high-performance computing (HPC) applications, rendering their performance crucial to achieve optimal scalability. As many modern HPC clusters contain multi-GPU nodes, BLAS operations are frequently offloaded on GPUs, necessitating the use of optimized libraries to ensure good performance. Unfortunately, multi-GPU systems are accompanied by two significant optimization challenges: data transfer bottlenecks as well as problem splitting and scheduling in multiple workers (GPUs) with distinct memories. We demonstrate that the current multi-GPU BLAS methods for tackling these challenges target very specific problem and data characteristics, resulting in serious performance degradation for any slightly deviating workload. Additionally, an even more critical decision is omitted because it cannot be addressed using current scheduler-based approaches: the determination of which devices should be used for a certain routine invocation. To address these issues we propose a model-based approach: using performance estimation to provide problem-specific autotuning during runtime. We integrate this autotuning into an end-to-end BLAS framework named PARALiA. This framework couples autotuning with an optimized task scheduler, leading to near-optimal data distribution and performance-aware resource utilization. We evaluate PARALiA in an HPC testbed with 8 NVIDIA-V100 GPUs, improving the average performance of GEMM by 1.7× and energy efficiency by 2.5× over the state-of-the-art in a large and diverse dataset and demonstrating the adaptability of our performance-aware approach to future heterogeneous systems.
3.	Angerd, Alexandra, 1988, et al. (author) A Framework for Automated and Controlled Floating-Point Accuracy Reduction in Graphics Applications on GPUs 2017 In: Transactions on Architecture and Code Optimization. - : Association for Computing Machinery (ACM). - 1544-3973 .- 1544-3566. ; 14:4 Journal article (peer-reviewed) Abstract: Reducing the precision of floating-point values can improve performance and/or reduce energy expenditure in computer graphics, among other applications. However, reducing the precision level of floating-point values in a controlled fashion needs support both at the compiler and at the microarchitecture level. At the compiler level, a method is needed to automate the reduction of precision of each floating-point value. At the microarchitecture level, a lower precision of each floating-point register can allow more floating-point values to be packed into a register file. This, however, calls for new register file organizations. This article proposes an automated precision-selection method and a novel GPU register file organization that can store floating-point register values at arbitrary precisions densely. The automated precision-selection method uses a data-driven approach for setting the precision level of floating-point values, given a quality threshold and a representative set of input data. By allowing a small, but acceptable, degradation in output quality, our method can remove a significant amount of the bits needed to represent floating-point values in the investigated kernels (between 28% and 60%). Our proposed register file organization exploits these lower-precision floating-point values by packing several of them into the same physical register. This reduces the register pressure per thread by up to 48%, and by 27% on average, for a negligible output-quality degradation. This can enable GPUs to keep up to twice as many threads in flight simultaneously.
4.	Armejach, A., et al. (author) Techniques to Improve Performance in Requester-Wins Hardware Transactional Memory 2013 In: Transactions on Architecture and Code Optimization. - 1544-3973 .- 1544-3566. ; 10:4, article no. 42 Journal article (peer-reviewed) Abstract: The simplicity of requester-wins Hardware Transactional Memory (HTM) makes it easy to incorporate in existing chip multiprocessors. Hence, such systems are expected to be widely available in the near future. Unfortunately, these implementations are prone to suffer severe performance degradation due to transient and persistent livelock conditions. This article shows that existing techniques are unable to mitigate this degradation effectively. It then proposes and evaluates four novel techniques (two software-based that employ information provided by the hardware and two that require simple core-local hardware additions), which have the potential to boost the performance of requester-wins HTM designs substantially.
5.	Azhar, Muhammad Waqar, 1986, et al. (author) Approx-RM: Reducing Energy on Heterogeneous Multicore processors under Accuracy and Timing Constraints 2023 In: Transactions on Architecture and Code Optimization. - 1544-3973 .- 1544-3566. ; 20:3 Journal article (peer-reviewed) Abstract: Reducing energy consumption while providing performance and quality guarantees is crucial for computing systems ranging from battery-powered embedded systems to data centers. This paper considers approximate iterative applications executing on heterogeneous multi-core platforms under user-specified performance and quality targets. We note that allowing a slight yet bounded relaxation in solution quality can considerably reduce the required iteration count and thereby can save significant amounts of energy. To this end, this paper proposes Approx-RM, a resource management scheme that reduces energy expenditure while guaranteeing a specified performance as well as accuracy target. Approx-RM predicts the number of iterations required to meet the relaxed accuracy target at run-time. The time saved generates execution-time slack, which allows Approx-RM to allocate fewer resources on a heterogeneous multi-core platform in terms of DVFS, core type, and core count to save energy while meeting the performance target. Approx-RM contributes with lightweight methods for predicting the iteration count needed to meet the accuracy target and the resources needed to meet the performance target. Approx-RM uses the aforementioned predictions to allocate just enough resources to comply with quality of service constraints to save energy. Our evaluation shows energy savings of 31.6%, on average, compared to Race-to-idle when the accuracy is only relaxed by 1%. Approx-RM incurs timing and energy overheads of less than 0.1%.
6.	Azhar, Muhammad Waqar, 1986, et al. (author) SLOOP: QoS-Supervised Loop Execution to Reduce Energy on Heterogeneous Architectures 2017 In: Transactions on Architecture and Code Optimization. - : Association for Computing Machinery (ACM). - 1544-3973 .- 1544-3566. ; 14:4, Article No. 41 Journal article (peer-reviewed) Abstract: Most systems allocate computational resources to each executing task without any actual knowledge of the application’s Quality-of-Service (QoS) requirements. Such best-effort policies lead to overprovisioning of the resources and increase energy loss. This work assumes applications with soft QoS requirements and exploits the inherent timing slack to minimize the allocated computational resources to reduce energy consumption. We propose a lightweight progress-tracking methodology based on the outer loops of application kernels. It builds on online history and uses it to estimate the total execution time. The prediction of the execution time and the QoS requirements are then used to schedule the application on a heterogeneous architecture with big out-of-order cores and small (LITTLE) in-order cores and select the minimum operating frequency, using DVFS, that meets the deadline. Our scheme is effective in exploiting the timing slack of each application. We show that it can reduce the energy consumption by more than 20% without missing any computational deadlines.
7.	Azhar, Muhammad Waqar, 1986, et al. (author) Task-RM: A Resource Manager for Energy Reduction in Task-Parallel Applications under Quality of Service Constraints 2022 In: Transactions on Architecture and Code Optimization. - : Association for Computing Machinery (ACM). - 1544-3973 .- 1544-3566. ; 19:1 Journal article (peer-reviewed) Abstract: Improving energy efficiency is an important goal of computer system design. This article focuses on a general model of task-parallel applications under quality-of-service requirements on the completion time. Our technique, called Task-RM, exploits the variance in task execution-times and imbalance between tasks to allocate just enough resources in terms of voltage-frequency and core-allocation so that the application completes before the deadline. Moreover, we provide a solution that can harness additional energy savings with the availability of additional processors. We observe that, for the proposed run-time resource manager to allocate resources, it requires specification of the soft deadlines to the tasks. This is accomplished by analyzing the energy-saving scenarios offline and by providing Task-RM with the performance requirements of the tasks. The evaluation shows an energy saving of 33% compared to race-to-idle and 22% compared to dynamic slack allocation (DSA) with an overhead of less than 1%.
8.	Bardizbanyan, Alen, 1986, et al. (author) Designing a Practical Data Filter Cache to Improve Both Energy Efficiency and Performance 2013 In: Transactions on Architecture and Code Optimization. - 1544-3973 .- 1544-3566. ; 10:4, 25 pages Journal article (peer-reviewed) Abstract: Conventional Data Filter Cache (DFC) designs improve processor energy efficiency, but degrade performance. Furthermore, the single-cycle line transfer suggested in prior studies adversely affects Level-1 Data Cache (L1 DC) area and energy efficiency. We propose a practical DFC that is accessed early in the pipeline and transfers a line over multiple cycles. Our DFC design improves performance and eliminates a substantial fraction of L1 DC accesses for loads, L1 DC tag checks on stores, and data translation lookaside buffer accesses for both loads and stores. Our evaluation shows that the proposed DFC can reduce the data access energy by 42.5% and improve execution time by 4.2%.
9.	Chen, Jing, 1995, et al. (author) ERASE: Energy Efficient Task Mapping and Resource Management for Work Stealing Runtimes 2022 In: Transactions on Architecture and Code Optimization. - : Association for Computing Machinery (ACM). - 1544-3973 .- 1544-3566. ; 19:2 Journal article (peer-reviewed) Abstract: Parallel applications often rely on work stealing schedulers in combination with fine-grained tasking to achieve high performance and scalability. However, reducing the total energy consumption in the context of work stealing runtimes is still challenging, particularly when using asymmetric architectures with different types of CPU cores. A common approach for energy savings involves dynamic voltage and frequency scaling (DVFS) wherein throttling is carried out based on factors like task parallelism, stealing relations, and task criticality. This article makes the following observations: (i) leveraging DVFS on a per-task basis is impractical when using fine-grained tasking and in environments with cluster/chip-level DVFS; (ii) task moldability, wherein a single task can execute on multiple threads/cores via work-sharing, can help to reduce energy consumption; and (iii) mismatch between tasks and assigned resources (i.e., core type and number of cores) can detrimentally impact energy consumption. In this article, we propose EneRgy Aware SchedulEr (ERASE), an intra-application task scheduler on top of work stealing runtimes that aims to reduce the total energy consumption of parallel applications. It achieves energy savings by guiding scheduling decisions based on per-task energy consumption predictions of different resource configurations. In addition, ERASE is capable of adapting to both given static frequency settings and externally controlled DVFS. Overall, ERASE achieves up to 31% energy savings and improves performance by 44% on average, compared to the state-of-the-art DVFS-based schedulers.
10.	Davari, Mahdad, et al. (author) The effects of granularity and adaptivity on private/shared classification for coherence 2015 In: ACM Transactions on Architecture and Code Optimization (TACO). - : Association for Computing Machinery (ACM). - 1544-3566 .- 1544-3973. ; 12:3 Journal article (peer-reviewed) Abstract: Classification of data into private and shared has proven to be a catalyst for techniques to reduce coherence cost, since private data can be taken out of coherence and resources can be concentrated on providing coherence for shared data. In this article, we examine how granularity (page-level versus cache-line level) and adaptivity (going from shared to private) affect the outcome of classification and its final impact on coherence. We create a classification technique, called Generational Classification, and a coherence protocol called Generational Coherence, which treats data as private or shared based on cache-line generations. We compare two coherence protocols based on self-invalidation/self-downgrade with respect to data classification. Our findings are enlightening: (i) Some programs benefit from finer granularity, some benefit further from adaptivity, but some do not benefit from either. (ii) Reducing the amount of shared data has no perceptible impact on coherence misses caused by self-invalidation of shared data, hence no impact on performance. (iii) In contrast, classifying more data as private has implications for protocols that employ write-through as a means of self-downgrade, resulting in network traffic reduction (up to 30%) by reducing write-through traffic.

Refine results

Publication type: journal article (32)

Content type: peer-reviewed (32)

Author/editor: Stenström, Per, 1957 (6); Pericas, Miquel, 197 ... (6); Sourdis, Ioannis, 19 ... (5); Kaxiras, Stefanos (4); Papaefstathiou, Vasi ... (4); Manivannan, Madhavan ... (4); Petersen Moura Tranc ... (4); Azhar, Muhammad Waqa ... (4); Black-Schaffer, Davi ... (3); Ros, Alberto (2); Lu, Zhonghai (2); Negi, Anurag, 1980 (2); Titos Gil, Ruben, 19 ... (2); Hagersten, Erik (2); Hemani, Ahmed, 1961- (1); Kumar, Rakesh (1); Zhou, You (1); Grahn, Håkan (1); Podobas, Artur (1); Huang, Ping (1); Larsson-Edefors, Per ... (1); Sakalis, Christos (1); Sjalander, Magnus (1); Kessler, Christoph (1); Alipour, Mehdi (1); Ejaz, Ahsen, 1986 (1); Alves, Ricardo (1); Anastasiadis, Petros (1); Papadopoulou, Nikela ... (1); Goumas, Georgios, 19 ... (1); Koziris, Nectarios (1); Hoppe, Dennis (1); Zhong, Li (1); Angerd, Alexandra, 1 ... (1); Sintorn, Erik, 1980 (1); Armejach, A. (1); Unsal, O.S. (1); Cristal, A. (1); Strydis, C. (1); Peris-Lopez, P. (1); Själander, Magnus, 1 ... (1); Whalley, David (1); Bardizbanyan, Alen, ... (1); McKee, Sally A, 1963 (1); Davari, Mahdad (1); Shin, H. (1); Keller, Joerg (1); Yao, Yuan (1); Chen, Peng (1); Koukos, Konstantinos (1)

Institution: Chalmers tekniska högskola (21); Uppsala universitet (6); Kungliga Tekniska Högskolan (4); Linköpings universitet (1); Blekinge Tekniska Högskola (1)

Language: English (32)

Research subject (UKÄ/SCB): Engineering and Technology (22); Natural Sciences (21)
