
Search results for "swepub ;lar1:(cth);pers:(Stenström Per 1957)"


Results 1-10 of 192

No.	Reference
1.	Magnusson, Peter S., et al. (author) SimICS/sun4m : A virtual workstation 2019 In: USENIX 1998 Annual Technical Conference. - New Orleans, LA, USA : USENIX Association. Conference paper (peer-reviewed). Abstract: System-level simulators allow computer architects and system software designers to recreate an accurate and complete replica of the program behavior of a target system, regardless of the availability, existence, or instrumentation support of such a system. Applications include evaluation of architectural design alternatives as well as software engineering tasks such as traditional debugging and performance tuning. We present an implementation of a simulator acting as a virtual workstation fully compatible with the sun4m architecture from Sun Microsystems. Built using the system-level SPARC V8 simulator SimICS, SimICS/sun4m models one or more SPARC V8 processors, supports user-developed modules for data cache and instruction cache simulation and execution profiling of all code, and provides a symbolic and performance debugging environment for operating systems. SimICS/sun4m can boot unmodified operating systems, including Linux 2.0.30 and Solaris 2.6, directly from snapshots of disk partitions. To support essentially arbitrary code, we implemented binary-compatible simulators for several devices, including SCSI, console, interrupt, timers, EEPROM, and Ethernet. The Ethernet simulation hooks into the host and allows the virtual workstation to appear on the local network with full services available (NFS, NIS, rsh, etc.). Ethernet and console traffic can be recorded for future playback. The performance of SimICS/sun4m is sufficient to run realistic workloads, such as the database benchmark TPC-D, scaling factor 1/100, or an interactive network application such as Mozilla. The slowdown in relation to native hardware is in the range of 25 to 75 (measured using SPECint95). We also demonstrate some applications, including modeling an 8-processor sun4m version (which does not exist), modeling future memory hierarchies, and debugging an operating system.
2.	Hollmann, Jochen, 1970, et al. (author) An Evaluation of Document Prefetching in a Distributed Digital Library 2003 In: Research and Advanced Technology for Digital Libraries / Lecture Notes In Computer Science. - Berlin, Heidelberg : Springer Berlin Heidelberg. - 0302-9743 .- 1611-3349. - 9783540407263 ; 2769, pp. 276-287 Report (other academic/artistic). Abstract: Latency is a fundamental problem for all distributed systems including digital libraries. To reduce user-perceived delays both caching -- keeping accessed objects for future use -- and prefetching -- transferring objects ahead of access time -- can be used. In a previous paper we have reported that caching is not worthwhile for digital libraries due to low re-access frequencies. In this paper we evaluate our previous findings that prefetching can be used instead. To do this we have set up an experimental prefetching proxy which is able to retrieve documents from remote fulltext archives before the user demands them. Using a simple prediction to keep the overhead of unnecessarily transferred data limited, we find that it is possible to cut the user-perceived average delay by a factor of two.
3.	Islam, Mafijul, 1975, et al. (author) Limits on Thread-Level Speculative Parallelism in Embedded Applications 2007 In: IEEE INTERACT 2007. Conference paper (peer-reviewed)
4.	Grahn, Håkan, et al. (author) A Comparative Evaluation of Hardware-Only and Software-Only Directory Protocols in Shared-Memory Multiprocessors 2004 In: Journal of Systems Architecture. - : Elsevier BV. - 1383-7621. ; 50:9, pp. 537-561 Journal article (peer-reviewed). Abstract: The hardware complexity of hardware-only directory protocols in shared-memory multiprocessors has motivated many researchers to emulate directory management by software handlers executed on the compute processors, called software-only directory protocols. In this paper, we evaluate the performance and design trade-offs between these two approaches in the same architectural simulation framework driven by eight applications from the SPLASH-2 suite. Our evaluation reveals some common case operations that can be supported by simple hardware mechanisms and can make the performance of software-only directory protocols competitive with that of hardware-only protocols. These mechanisms aim at either reducing the software handler latency or hiding it by overlapping it with the message latencies associated with inter-node memory transactions. Further, we evaluate the effects of cache block sizes between 16 and 256 bytes as well as two different page placement policies. Overall, we find that a software-only directory protocol enhanced with these mechanisms can reach between 63% and 97% of the baseline hardware-only protocol performance at a lower design complexity.
5.	Islam, Mafijul, 1975, et al. (author) Zero-Value Caches: Cancelling Loads that Return Zero 2009 In: Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT. - 1089-795X. - 9780769537719 ; , pp. 237-245 Report (other academic/artistic). Abstract: The speed gap between processor and memory continues to limit performance. To address this problem, we explore the potential of eliminating Zero Loads (loads accessing memory locations that contain the value “zero”) to improve performance and energy dissipation. Our study shows that such loads comprise as many as 18% of the total number of dynamic loads. We show that a significant fraction of zero loads ends up on the critical memory-access path in out-of-order cores. We propose a non-speculative microarchitectural technique, the Zero-Value Cache (ZVC), to capitalize on zero loads and explore critical design options of such caches. We show that with modest investment (typically a 576-byte structure), we can obtain speedups of up to 78% and reduce the overall energy dissipation by up to 39%. Most importantly, zero-value caches never cause performance loss.
6.	Manivannan, Madhavan, 1986, et al. (author) Efficient Forwarding of Producer-Consumer Data in Task-based Programs 2013 In: Proceedings of the International Conference on Parallel Processing. 40th International Conference on Parallel Processing, ICPP 2013, Lyon, 1-4 October 2013. - 0190-3918. - 9780769551173 ; , pp. 517-522 Conference paper (peer-reviewed). Abstract: Task-based programming models are increasingly being adopted due to their ability to express parallelism. They also lead to higher programmer productivity by delegating to the run-time system and the architecture demanding parallelism management tasks such as scheduling and staging of the communication between tasks. This paper focuses on techniques to optimize producer-consumer sharing in task-based programs. As the set of producer and consumer tasks can often be statically determined, coherence prediction techniques are expected to successfully optimize producer-consumer sharing. We show that they are ineffective because the mapping of tasks to cores changes based on runtime conditions. The paper contributes with a technique that forwards produced and spatially close blocks to the consumer in a single transaction when a consumer requests a first block. We also find that stride prefetching is competitive with our forwarding technique for sufficiently coarse tasks. However, its effectiveness deteriorates as the task granularity is reduced because of limited opportunities to train for the access pattern and to issue prefetches sufficiently ahead of time. This makes our forwarding scheme a robust alternative to reduce communication overhead in task-based programs.
7.	Alvarez, Lluc, et al. (author) eProcessor: European, Extendable, Energy-Efficient, Extreme-Scale, Extensible, Processor Ecosystem 2023 In: Proceedings of the 20th ACM International Conference on Computing Frontiers 2023, CF 2023. ; , pp. 309-314 Conference paper (peer-reviewed). Abstract: The eProcessor project aims at creating a RISC-V full stack ecosystem. The eProcessor architecture combines a high-performance out-of-order core with energy-efficient accelerators for vector processing and artificial intelligence with reduced-precision functional units. The design of this architecture follows a hardware/software co-design approach with relevant application use cases from the high-performance computing, bioinformatics and artificial intelligence domains. Two eProcessor prototypes will be developed based on two fabricated eProcessor ASICs integrated into a computer-on-module.
8.	Angerd, Alexandra, 1988, et al. (author) A Framework for Automated and Controlled Floating-Point Accuracy Reduction in Graphics Applications on GPUs 2017 In: Transactions on Architecture and Code Optimization. - : Association for Computing Machinery (ACM). - 1544-3973 .- 1544-3566. ; 14:4 Journal article (peer-reviewed). Abstract: Reducing the precision of floating-point values can improve performance and/or reduce energy expenditure in computer graphics, among other applications. However, reducing the precision level of floating-point values in a controlled fashion needs support both at the compiler and at the microarchitecture level. At the compiler level, a method is needed to automate the reduction of precision of each floating-point value. At the microarchitecture level, a lower precision of each floating-point register can allow more floating-point values to be packed into a register file. This, however, calls for new register file organizations. This article proposes an automated precision-selection method and a novel GPU register file organization that can store floating-point register values at arbitrary precisions densely. The automated precision-selection method uses a data-driven approach for setting the precision level of floating-point values, given a quality threshold and a representative set of input data. By allowing a small, but acceptable, degradation in output quality, our method can remove a significant amount of the bits needed to represent floating-point values in the investigated kernels (between 28% and 60%). Our proposed register file organization exploits these lower-precision floating-point values by packing several of them into the same physical register. This reduces the register pressure per thread by up to 48%, and by 27% on average, for a negligible output-quality degradation. This can enable GPUs to keep up to twice as many threads in flight simultaneously.
9.	Angerd, Alexandra, 1988, et al. (author) A GPU Register File using Static Data Compression 2020 In: ACM International Conference Proceeding Series. - New York, NY, USA : ACM. Conference paper (peer-reviewed). Abstract: GPUs rely on large register files to unlock thread-level parallelism for high throughput. Unfortunately, large register files are power hungry, making it important to seek new approaches to improve their utilization. This paper introduces a new register file organization for efficient register-packing of narrow integer and floating-point operands designed to leverage advances in static analysis. We show that the hardware/software co-designed register file organization yields a performance improvement of up to 79%, and 18.6% on average, at a modest output-quality degradation.
10.	Angerd, Alexandra, 1988, et al. (author) GBDI: Going Beyond Base-Delta-Immediate Compression with Global Bases 2022 In: Proceedings - International Symposium on High-Performance Computer Architecture. - 1530-0897. - 9781665420273 ; 2022-April, pp. 1115-1127 Conference paper (peer-reviewed). Abstract: Memory bandwidth is limiting performance for many emerging applications. While compression techniques can unlock a higher memory bandwidth, prior art offers only modestly better bandwidth. This paper contributes with a new compression method - Global Base Delta Immediate compression (GBDI) - that offers substantially higher memory bandwidth by, unlike prior art, selecting base values across memory blocks. GBDI uses a novel clustering algorithm through data analysis in the background. The presented accelerator infrastructure offers low area overhead and latency. This paper shows that GBDI offers a compression ratio of 2.3×, and yields 1.5× higher bandwidth and 1.1× higher performance compared with a baseline without compression support, on average, for SPEC2017 benchmarks requiring medium to high memory bandwidth.


