SwePub

Result list for the search "LAR1:bth ;lar1:(bth);srt2:(1995-1999);pers:(Grahn Håkan)"

Search: LAR1:bth > Blekinge Tekniska Högskola > (1995-1999) > Grahn Håkan

  • Results 1-9 of 9
1.
  • Broberg, Magnus, et al. (author)
  • Performance Optimization using Critical Path Analysis in Multithreaded Programs on Multiprocessors
  • 1999
  • Report (other academic/artistic), abstract:
    • Efficient performance tuning of parallel programs is often hard. Optimization is often done after the program has been written, as a last effort to increase the performance. With sequential programs each (executed) code segment will affect the total execution time of the program. Thus, any code segment that is optimized in a sequential program will decrease the execution time. In the case of a parallel program executed on a multiprocessor this is not always true. This is due to dependencies between the different threads. As a result, certain code segments of the execution may not affect the total execution time of the program. Thus, optimization of such code segments will not increase the performance. In this paper we present a new approach to perform the optimization phase. Our approach finds the critical path of the multithreaded program and the optimization is only done on those specific code segments of the program. We have implemented the critical path analysis in a performance optimization tool.
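    As a rough illustration of the critical-path idea in the entry above, the sketch below models a multithreaded run as a small dependency graph of code segments and computes the longest path through it; only segments on that path are worth optimizing. The graph, segment names, and durations are hypothetical, and this is not the tool's implementation.

        // Critical-path analysis over a toy task graph (hypothetical data).
        #include <algorithm>
        #include <cstdio>
        #include <vector>

        struct Segment {
            const char* name;
            double duration;            // execution time of this code segment
            std::vector<int> deps;      // segments that must finish before this one
        };

        int main() {
            // A small multithreaded execution: fork, two parallel workers, join.
            std::vector<Segment> seg = {
                {"main:fork", 1.0, {}},
                {"thread A",  8.0, {0}},
                {"thread B",  3.0, {0}},
                {"main:join", 1.0, {1, 2}},
            };

            // Longest-path computation; segments are assumed to appear in
            // topological order, as they would in a recorded trace.
            std::vector<double> finish(seg.size(), 0.0);
            std::vector<int> pred(seg.size(), -1);
            for (size_t i = 0; i < seg.size(); ++i) {
                double start = 0.0;
                for (int d : seg[i].deps)
                    if (finish[d] > start) { start = finish[d]; pred[i] = d; }
                finish[i] = start + seg[i].duration;
            }

            // Walk the critical path backwards from the segment finishing last.
            int cur = static_cast<int>(
                std::max_element(finish.begin(), finish.end()) - finish.begin());
            std::printf("predicted execution time: %.1f\n", finish[cur]);
            std::printf("critical path (optimize only these segments):\n");
            for (; cur != -1; cur = pred[cur])
                std::printf("  %s (%.1f)\n", seg[cur].name, seg[cur].duration);
            return 0;
        }

    In this toy graph, optimizing "thread B" would not shorten the run at all; only the segments printed by the sketch lie on the critical path.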
2.
  • Broberg, Magnus, et al. (author)
  • Visualization and performance prediction of multithreaded Solaris programs by tracing kernel threads
  • 1999
  • Conference paper (peer-reviewed), abstract:
    • Efficient performance tuning of parallel programs is often hard. We present a performance prediction and visualization tool called VPPB. Based on a monitored uni-processor execution, VPPB shows the (predicted) behaviour of a multithreaded program using any number of processors, and the program behaviour is visualized as a graph. The first version of VPPB was unable to handle I/O operations. Using an improved tracing technique, this version adds the ability to trace activities at the kernel level as well. Thus, VPPB is now able to trace various I/O activities, e.g., manipulation of OS internal buffers, physical disk I/O, socket I/O, and RPC. VPPB allows flexible performance tuning of parallel programs developed for shared memory multiprocessors using a standardized environment: C/C++ programs that use the thread package in Solaris 2.X.
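    The entry above depends on recording per-thread events, including kernel-level I/O activity, during a uni-processor run. The sketch below only shows the general shape of such a trace, using a hypothetical user-level traced_read() wrapper around POSIX read(); it is not VPPB's kernel-level tracing mechanism.

        // Per-thread, timestamped event trace with a wrapped I/O call (illustrative).
        #include <fcntl.h>
        #include <unistd.h>
        #include <chrono>
        #include <cstdio>
        #include <mutex>
        #include <thread>
        #include <vector>

        enum class Event { IoStart, IoEnd };

        struct TraceRecord {
            std::thread::id tid;        // which thread the event belongs to
            Event           ev;
            long long       ns;         // nanoseconds since tracing started
        };

        static std::vector<TraceRecord> g_trace;
        static std::mutex g_trace_lock;
        static const auto g_t0 = std::chrono::steady_clock::now();

        static void log_event(Event ev) {
            long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                               std::chrono::steady_clock::now() - g_t0).count();
            std::lock_guard<std::mutex> lock(g_trace_lock);
            g_trace.push_back({std::this_thread::get_id(), ev, ns});
        }

        // Hypothetical instrumented I/O call: record when the thread enters and
        // leaves the blocking read, so a later replay can account for I/O time.
        static ssize_t traced_read(int fd, void* buf, size_t n) {
            log_event(Event::IoStart);
            ssize_t r = read(fd, buf, n);
            log_event(Event::IoEnd);
            return r;
        }

        int main() {
            int fd = open("/dev/null", O_RDONLY);   // returns EOF immediately
            char buf[64];
            traced_read(fd, buf, sizeof buf);
            close(fd);
            std::printf("recorded %zu trace events\n", g_trace.size());
            return 0;
        }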
3.
  • Broberg, Magnus, et al. (author)
  • VPPB : A Visualization and Performance Prediction Tool for Multithreaded Solaris Programs
  • 1998
  • Conference paper (peer-reviewed), abstract:
    • Efficient performance tuning of parallel programs is often hard. In this paper we describe an approach that uses a uni-processor execution of a multithreaded program as a reference to simulate a multiprocessor execution. The speed-up is predicted, and the program behaviour is visualized as a graph, which can be used in the performance tuning process. The simulator considers scheduling as well as hardware parameters, e.g., the thread priority, no. of LWPs, and no. of CPUs. The visualization part shows the simulated execution in two graphs: one showing the threads’ behaviour over time and the other the amount of parallelism over time. In the first graph it is possible to relate an event in the graph to the code line causing the event. Validation using a Sun multiprocessor with eight processors and five scientific parallel applications shows that the speed-up predictions are within +/-6% of a real execution.
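    To give a flavour of the kind of prediction described above, the sketch below greedily schedules hypothetical per-thread busy times from a single-CPU run onto a simulated number of processors and reports the resulting speed-up. A real predictor, such as the tool in this entry, also models dependencies between threads, thread priorities, LWPs, and scheduler behaviour.

        // Speed-up prediction by replaying per-thread work on N simulated CPUs.
        #include <algorithm>
        #include <cstdio>
        #include <vector>

        // Predicted completion time when the measured per-thread busy times are
        // scheduled onto 'cpus' processors, longest work first (LPT heuristic).
        static double predict(const std::vector<double>& work, int cpus) {
            std::vector<double> cpu(cpus, 0.0);          // accumulated busy time per CPU
            std::vector<double> sorted = work;
            std::sort(sorted.rbegin(), sorted.rend());   // longest first
            for (double w : sorted)
                *std::min_element(cpu.begin(), cpu.end()) += w;  // least-loaded CPU
            return *std::max_element(cpu.begin(), cpu.end());    // predicted makespan
        }

        int main() {
            // Hypothetical per-thread busy times (seconds) from a uni-processor run.
            std::vector<double> work = {4.0, 3.5, 3.0, 2.5, 2.0, 1.0};
            double serial = 0.0;
            for (double w : work) serial += w;
            for (int cpus : {1, 2, 4, 8})
                std::printf("%d CPUs: predicted speed-up %.2f\n",
                            cpus, serial / predict(work, cpus));
            return 0;
        }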
4.
  • Grahn, Håkan, et al. (author)
  • Efficient Strategies for Software-Only Directory Protocols in Shared-Memory Multiprocessors
  • 1995
  • Conference paper (peer-reviewed), abstract:
    • The cost, complexity, and inflexibility of hardware-based directory protocols motivate us to study the performance implications of protocols that emulate directory management using software handlers executed on the compute processors. An important performance limitation of such software-only protocols is that software latency associated with directory management ends up on the critical memory access path for read miss transactions. We propose five strategies that support efficient data transfers in hardware whereas directory management is handled at a slower pace in the background by software handlers. Simulations show that this approach can remove the directory-management latency from the memory access path. Whereas the directory is managed in software, the hardware mechanisms must access the memory state in order to enable data transfers at a high speed. Overall, our strategies reach between 60% and 86% of the hardware-based protocol performance.
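    The sketch below reduces the split described above to its bare bones: a read miss is served right away (the part the proposed strategies keep fast, in hardware), while recording the new sharer in the directory is deferred to a software handler that runs off the critical path. The data structures, sizes, and queue-based hand-off are assumptions for illustration, not the paper's design.

        // Fast data transfer now, directory bookkeeping later (illustrative only).
        #include <cstdint>
        #include <cstdio>
        #include <queue>

        struct DirEntry {
            uint64_t sharers = 0;    // presence bit per node
            int      owner   = -1;   // node holding a dirty copy, or -1
        };

        struct PendingUpdate { int block; int requester; };

        static DirEntry directory[1024];                // one entry per memory block
        static std::queue<PendingUpdate> handler_work;  // deferred software work

        // "Fast path": deliver the block to the requester immediately.
        static void serve_read_miss(int block, int requester) {
            // (the data transfer to 'requester' would happen here, in hardware)
            handler_work.push({block, requester});      // directory update deferred
        }

        // "Slow path": the software handler drains the queue off the critical path.
        static void run_software_handler() {
            while (!handler_work.empty()) {
                PendingUpdate u = handler_work.front(); handler_work.pop();
                directory[u.block].sharers |= (uint64_t{1} << u.requester);
            }
        }

        int main() {
            serve_read_miss(7, 2);   // node 2 read-misses on block 7
            serve_read_miss(7, 5);   // node 5 read-misses on block 7
            run_software_handler();
            std::printf("block 7 sharer mask: 0x%llx\n",
                        (unsigned long long)directory[7].sharers);
            return 0;
        }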
5.
  • Grahn, Håkan, et al. (author)
  • Evaluation of a Competitive-Update Cache Coherence Protocol with Migratory Data Detection
  • 1996
  • In: Journal of Parallel and Distributed Computing. - San Diego : Academic. - ISSN 0743-7315, 1096-0848 ; 39:2, pp. 168-180
  • Journal article (peer-reviewed), abstract:
    • Although directory-based write-invalidate cache coherence protocols have a potential to improve the performance of large-scale multiprocessors, coherence misses limit the processor utilization. Therefore, so-called competitive-update protocols, hybrid protocols that on a per-block basis dynamically switch between write-invalidate and write-update, have been considered as a means to reduce the coherence miss rate and have been shown to be a better coherence policy for a wide range of applications. Unfortunately, such protocols may cause high traffic peaks for applications with extensive use of migratory objects. These traffic peaks can offset the performance gain of a reduced miss rate if the network bandwidth is not sufficient. We propose in this study to extend a competitive-update protocol with a previously published adaptive mechanism that can dynamically detect migratory objects and reduce the coherence traffic they cause. Detailed architectural simulations based on five scientific and engineering applications show that this adaptive protocol outperforms a write-invalidate protocol by reducing the miss rate and bandwidth needed by up to 71% and 26%, respectively.
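    A minimal sketch of the competitive-update policy the article builds on: a cache that keeps receiving remote updates to a block without touching it locally eventually invalidates its own copy and stops paying for update traffic. The threshold value and data layout are hypothetical, and the migratory-data detection added in the article is not modelled here.

        // Competitive update: self-invalidate after too many unused remote updates.
        #include <cstdio>

        struct CacheLine {
            bool valid       = false;
            int  updatesSeen = 0;        // remote updates since the last local access
        };

        constexpr int kCompetitiveThreshold = 4;   // hypothetical threshold

        // Called when a remote processor writes the block and sends an update.
        void on_remote_update(CacheLine& line) {
            if (!line.valid) return;               // nothing cached, nothing to do
            if (++line.updatesSeen >= kCompetitiveThreshold) {
                line.valid = false;                // stop receiving update traffic
                line.updatesSeen = 0;
            }
        }

        // Called when the local processor reads or writes the block.
        void on_local_access(CacheLine& line) {
            line.valid = true;                     // (re-fetched on a miss)
            line.updatesSeen = 0;                  // the copy is useful again
        }

        int main() {
            CacheLine line;
            on_local_access(line);
            for (int i = 0; i < 5; ++i) on_remote_update(line);
            std::printf("copy still valid after 5 unused remote updates: %s\n",
                        line.valid ? "yes" : "no");
            return 0;
        }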
6.
  • Grahn, Håkan (author)
  • Evaluation of design alternatives for a directory-based cache coherence protocol in shared-memory multiprocessors
  • 1995
  • Doctoral thesis (other academic/artistic), abstract:
    • In shared-memory multiprocessors, caches are attached to the processors in order to reduce the memory access latency. To keep the memory consistent, a cache coherence protocol is needed. A well known approach is to record which caches have copies of a memory block in a directory and only notify the caches having a copy when a processor modifies the block. Such a protocol is called a directory-based cache coherence protocol. This thesis, which is a summary of seven papers, identifies three problems in a directory-based protocol, and evaluates implementation and performance aspects of some design alternatives. The evaluation methodology is based on program-driven simulation. The write-invalidate policy, which is used in the baseline protocol, forces all other copies of a block to be invalidated when a processor modifies the block. This leads to a cache miss each time a processor accesses an invalidated block. To reduce the number of cache misses, a competitive-update policy is proposed in this thesis. The competitive-update policy is shown to reduce both the read stall and execution times as compared to write-invalidate under a relaxed memory consistency model. However, update-based policies need more buffering and hardware support in the caches. In the baseline protocol, the implementation cost of the directory is proportional to the number of caches. To reduce this cost, an alternative directory organization is proposed which distributes the directory information among the caches sharing the same memory block. To achieve a low write latency, the caches sharing a block are organized in a tree. The caches are linked into the tree in parallel with application execution to achieve a low read latency. The hardware-implemented directory controller in the baseline protocol may lead to high design complexity and implementation cost. This thesis evaluates a design alternative where the controller is implemented using software handlers executed on the compute processor. By using efficient strategies and proper architectural support, this design alternative is shown to be competitive with the baseline protocol. However, the performance of this alternative is more sensitive to other design choices, e.g., block size and latency tolerating techniques, than the baseline protocol.
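    For reference, the sketch below shows the baseline directory-based write-invalidate behaviour the thesis starts from: the directory records which nodes hold a copy of each block, and a write notifies only those nodes so that their copies are invalidated. The fixed sizes and function names are illustrative assumptions.

        // Baseline write-invalidate directory protocol, heavily simplified.
        #include <cstdio>

        constexpr int kNodes  = 4;
        constexpr int kBlocks = 8;

        bool cached[kNodes][kBlocks]   = {};   // per-node cache: valid copy of block?
        bool presence[kBlocks][kNodes] = {};   // directory: which nodes share a block

        void read_block(int node, int block) {
            cached[node][block]   = true;      // fill the cache on a read miss
            presence[block][node] = true;      // record the new sharer
        }

        void write_block(int node, int block) {
            for (int n = 0; n < kNodes; ++n)   // invalidate every other copy
                if (n != node && presence[block][n]) {
                    cached[n][block]   = false;
                    presence[block][n] = false;
                }
            cached[node][block]   = true;      // the writer keeps the only copy
            presence[block][node] = true;
        }

        int main() {
            read_block(0, 3);
            read_block(1, 3);
            write_block(2, 3);                 // invalidates the copies at nodes 0 and 1
            std::printf("copy valid at node 0: %d, node 1: %d, node 2: %d\n",
                        cached[0][3], cached[1][3], cached[2][3]);
            return 0;
        }

    Each such invalidation turns the next access by nodes 0 and 1 into a coherence miss, which is exactly the cost the competitive-update policy proposed in the thesis tries to reduce.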
7.
  • Grahn, Håkan, et al. (author)
  • Implementation and Evaluation of Update-Based Cache Protocols Under Relaxed Memory Consistency Models
  • 1995
  • In: Future Generations Computer Systems. - Amsterdam : North-Holland. - ISSN 0167-739X ; 11:3, pp. 247-271
  • Journal article (peer-reviewed), abstract:
    • Invalidation-based cache coherence protocols have been extensively studied in the context of large-scale shared-memory multiprocessors. Under a relaxed memory consistency model, most of the write latency can be hidden, whereas cache misses still constitute a severe performance problem. By contrast, update-based protocols have a potential to reduce both write and read penalties under relaxed memory consistency models because coherence misses can be completely eliminated. This paper compares update- and invalidation-based protocols for their ability to reduce or hide memory access latencies and for their ease of implementation under relaxed memory consistency models.
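    The comparison above hinges on relaxed memory consistency, under which ordinary writes may be buffered and overlapped while only synchronization accesses enforce ordering. The sketch below illustrates that idea with the C++11 memory model, using a release store paired with an acquire load as a stand-in for the hardware consistency models discussed in the article.

        // Release/acquire ordering: ordinary writes are only ordered at the sync point.
        #include <atomic>
        #include <cstdio>
        #include <thread>

        int data1 = 0, data2 = 0;             // ordinary shared data
        std::atomic<bool> ready{false};       // synchronization variable

        void producer() {
            data1 = 42;                       // these writes may be buffered and
            data2 = 17;                       // reordered among themselves...
            ready.store(true, std::memory_order_release);   // ...but not past this
        }

        void consumer() {
            while (!ready.load(std::memory_order_acquire)) {}   // wait for the release
            std::printf("data1=%d data2=%d\n", data1, data2);   // guaranteed 42 and 17
        }

        int main() {
            std::thread t1(producer), t2(consumer);
            t1.join();
            t2.join();
            return 0;
        }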
8.
  • Grahn, Håkan, et al. (author)
  • Relative Performance of Hardware and Software-Only Directory Protocols Under Latency Tolerating and Reducing Techniques
  • 1997
  • Conference paper (peer-reviewed), abstract:
    • In both hardware-only and software-only directory protocols the performance is often limited by memory access stall times. To increase the performance, several latency tolerating and reducing techniques have been proposed and shown effective for hardware-only directory protocols. For software-only directory protocols, the efficiency of a technique depends not only on how effective it is as seen by the local processor, but also on how it impacts the software handler execution overhead in the node where a memory block is allocated. Based on architectural simulations and case studies of three techniques, we find that prefetching can degrade the performance of software-only directory protocols due to useless prefetches. A relaxed memory consistency model hides all write latency for software-only directory protocols, but the software handler overhead is virtually unaffected and now constitutes a larger portion of the execution time. Overall, latency tolerating techniques for software-only directory protocols must be chosen with more care than for hardware-only directory protocols.
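    The sketch below illustrates the useless-prefetch effect the study measures for sequential prefetching: each miss also fetches the next block, and a prefetched block that is never referenced only adds software-handler and network work. The access pattern, cache model, and prefetch degree are toy assumptions.

        // Count useful vs. useless sequential prefetches for a strided access pattern.
        #include <cstdio>
        #include <unordered_set>
        #include <vector>

        int main() {
            std::unordered_set<int> cache;        // blocks currently present
            std::unordered_set<int> prefetched;   // present but not yet referenced
            int misses = 0, useful = 0;

            // A strided pattern: the sequential prefetch of block+1 is always wasted.
            std::vector<int> accesses = {0, 4, 8, 12, 16, 20, 24, 28};

            for (int b : accesses) {
                if (prefetched.erase(b)) ++useful;   // a prefetch paid off
                if (!cache.count(b)) {               // demand miss: fetch the block...
                    ++misses;
                    cache.insert(b);
                    cache.insert(b + 1);             // ...and prefetch the next block
                    prefetched.insert(b + 1);
                }
            }
            std::printf("misses: %d, useful prefetches: %d, useless prefetches: %zu\n",
                        misses, useful, prefetched.size());
            return 0;
        }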
9.
  • Stenström, Per, et al. (author)
  • Boosting the Performance of Shared Memory Multiprocessors
  • 1997
  • In: Computer. - Long Beach, Calif. : IEEE Computer Society. - ISSN 0018-9162, 1558-0814 ; 30:7, pp. 63-70
  • Journal article (peer-reviewed), abstract:
    • Shared memory multiprocessors make it practical to convert sequential programs to parallel ones in a variety of applications. An emerging class of shared memory multiprocessors are nonuniform memory access machines with private caches and a cache coherence protocol. Proposed hardware optimizations to CC-NUMA machines can shorten the time processors lose because of cache misses and invalidations. The authors look at cost-performance trade-offs for each of four proposed optimizations: release consistency, adaptive sequential prefetching, migratory sharing detection, and hybrid update/invalidate with a write cache. The four optimizations differ with respect to which application features they attack, what hardware resources they require, and what constraints they impose on the application software. The authors measured the degree of performance improvement using the four optimizations in isolation and in combination, looking at the trade-offs in hardware and programming complexities. Although one combination of the proposed optimizations (prefetching and migratory sharing detection) can boost a sequentially consistent machine to perform as well as a machine with release consistency, release consistency models offer significant performance improvements across a broad application domain at little extra complexity in the machine design. Moreover, a combination of sequential prefetching and hybrid update/invalidate with a write cache cuts the execution time of a sequentially consistent machine by half with fairly modest changes to the second-level cache and the cache protocol. The authors expect that designers will begin to turn more to the release consistency model.
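    Of the four optimizations evaluated above, migratory-sharing detection is sketched below in a much simplified form: if writes to a block keep coming from a processor other than the previous writer while only those two hold copies, the block is classified as migratory, and subsequent read misses hand over an exclusive copy so the following write needs no invalidation. The detection rule and state layout are simplified assumptions, not the exact published heuristic.

        // Simplified migratory-sharing detection at the directory.
        #include <cstdio>

        struct BlockState {
            int  lastWriter = -1;
            int  sharers    = 0;       // number of cached copies
            bool migratory  = false;
        };

        // Called by the directory when processor 'p' writes the block.
        void on_write(BlockState& b, int p) {
            // A read-modify-write by a new processor while exactly two copies exist
            // (the previous writer's and this reader's) looks like migratory data.
            if (b.sharers == 2 && b.lastWriter != -1 && b.lastWriter != p)
                b.migratory = true;
            b.lastWriter = p;
            b.sharers = 1;             // invalidation leaves only the writer's copy
        }

        // Called on a read miss; returns true if an exclusive copy is handed over.
        bool on_read_miss(BlockState& b, int /*p*/) {
            if (b.migratory) { b.sharers = 1; return true; }   // migrate ownership
            ++b.sharers;
            return false;
        }

        int main() {
            BlockState b;
            on_read_miss(b, 0); on_write(b, 0);   // processor 0: read, then write
            on_read_miss(b, 1); on_write(b, 1);   // processor 1: read, then write
            bool exclusive = on_read_miss(b, 2);  // detected as migratory by now
            std::printf("migratory: %s, read miss granted exclusive copy: %s\n",
                        b.migratory ? "yes" : "no", exclusive ? "yes" : "no");
            return 0;
        }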
Publication type
conference paper (4)
journal article (3)
report (1)
doctoral thesis (1)
Type of content
peer-reviewed (7)
other academic/artistic (2)
Author/editor
Stenström, Per (5)
Lundberg, Lars (3)
Broberg, Magnus (3)
Dubois, Michel (2)
Brorsson, Mats (1)
Dahlgren, Fredrik (1)
Language
English (9)
Research subject (UKÄ/SCB)
Natural sciences (9)