SwePub

Result list for search "WFRF:(Black Schaffer David Professor)"

Search: WFRF:(Black Schaffer David Professor)

  • Results 1-10 of 22
1.
  • Alipour, Mehdi (author)
  • Rethinking Dynamic Instruction Scheduling and Retirement for Efficient Microarchitectures
  • 2020
  • Doctoral thesis (other academic/artistic), abstract:
    • Out-of-order execution is one of the main microarchitectural techniques used to improve the performance of both single- and multi-threaded processors. Such processors are used in everything from mobile devices to server computers. The technique achieves higher performance by finding independent instructions and hiding execution latency, using cycles that would otherwise be wasted or spent stalled. To accomplish this, it uses scheduling resources, including the ROB, IQ, LSQ and physical registers, to store and prioritize instructions. The pipeline of an out-of-order processor has three macro-stages: the front-end, the scheduler, and the back-end. The front-end fetches instructions, places them in the out-of-order resources, and analyzes them to prepare for their execution. The scheduler identifies which instructions are ready for execution and prioritizes them for scheduling. The back-end updates the processor state with the results of the oldest completed instructions, deallocates the resources, and commits the instructions in program order to maintain correct execution. Since out-of-order execution must be able to choose any available instruction for execution, its scheduling resources need complex circuits for identifying and prioritizing instructions, which makes them expensive and therefore limited in size. This limited size leads to two stall points, at the front-end and the back-end of the pipeline. The front-end can stall when the resources are fully allocated and no new instructions can be placed in the scheduler. The back-end can stall when the instruction at the head of the ROB has not finished executing, which prevents the resources behind it from being deallocated and blocks new instructions from entering the pipeline. To address these two stalls, this thesis focuses on reducing the time instructions occupy the scheduling resources. Our front-end technique tackles IQ pressure, while our back-end approach considers the rest of the resources. To reduce front-end stalls, we reduce the pressure on the IQ for both storing (depth) and issuing (width) instructions by bypassing instructions to cheaper storage structures. To reduce back-end stalls, we explore how instructions can be retired earlier, and out of order, to reduce the pressure on the out-of-order resources. (A toy model illustrating these stall points follows this entry.)
  •  
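The stall behaviour described in this abstract can be illustrated with a toy occupancy model. The sketch below is not the thesis's simulator; the ROB size, latencies, and the idealized fully parallel execution are invented purely to show how one long-latency instruction at the head of the ROB backs up the window until the front-end stalls.

```python
# Toy model of the two stall points described above: the front-end stalls
# when the ROB is full, and the back-end stalls while the instruction at
# the head of the ROB has not finished executing. All parameters
# (ROB size, latencies) are illustrative, not taken from the thesis.
from collections import deque

ROB_SIZE = 8
# Each instruction is represented only by its remaining execution latency.
program = [1, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1]

rob = deque()          # in-flight instructions, oldest first
fetched = retired = 0
cycle = front_end_stalls = 0

while retired < len(program):
    cycle += 1
    # Front-end: allocate one instruction per cycle if the ROB has room.
    if fetched < len(program):
        if len(rob) < ROB_SIZE:
            rob.append(program[fetched])
            fetched += 1
        else:
            front_end_stalls += 1
    # Execute: every in-flight instruction makes progress (idealized fully
    # parallel execution, so only the ROB size and in-order retire limit us).
    rob = deque(max(lat - 1, 0) for lat in rob)
    # Back-end: retire completed instructions from the head, in program order.
    while rob and rob[0] == 0:
        rob.popleft()
        retired += 1

print(f"cycles={cycle}, front-end stalls caused by a full ROB={front_end_stalls}")
```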
2.
  • Alves, Ricardo (author)
  • Leveraging Existing Microarchitectural Structures to Improve First-Level Caching Efficiency
  • 2019
  • Doctoral thesis (other academic/artistic), abstract:
    • Low-latency data access is essential for performance. To achieve this, processors use fast first-level caches combined with out-of-order execution, to decrease and hide memory access latency, respectively. While these approaches are effective for performance, they cost significant energy, leading to the development of many techniques that require designers to trade off performance and efficiency. Way-prediction and filter caches are two of the most common strategies for improving first-level cache energy efficiency while still minimizing latency. Both involve compromises: way-prediction trades some latency for better energy efficiency, while filter caches trade some energy efficiency for lower latency. However, these strategies are not mutually exclusive. By borrowing elements from both, and taking into account SRAM memory layout limitations, we propose a novel MRU-L0 cache that mitigates many of their shortcomings while preserving their benefits. Moreover, while first-level caches are tightly integrated into the CPU pipeline, existing work on these techniques largely ignores the impact they have on instruction scheduling. We show that the variable hit latency introduced by way-mispredictions causes instruction replays of load-dependent instruction chains, which hurts performance and efficiency. We study this effect and propose a variable-latency cache-hit instruction scheduler that identifies potential mis-schedulings, reduces instruction replays and their negative performance impact, and further improves cache energy efficiency. Modern pipelines also employ sophisticated execution strategies to hide memory latency and improve performance. While their primary use is for performance and correctness, they require intermediate storage that can be used as a cache as well. In this work we demonstrate how the store buffer, paired with the memory dependency predictor, can be used to efficiently cache dirty data, and how the physical register file, paired with a value predictor, can be used to efficiently cache clean data. These strategies not only improve both performance and energy, but do so with no additional storage and minimal additional complexity, since they recycle existing CPU structures to detect reuse, memory ordering violations, and misspeculations. (An illustrative way-prediction sketch follows this entry.)
  •  
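A minimal sketch of the way-prediction trade-off mentioned above, under invented parameters: only the predicted (here, most-recently-used) way is probed first, and a wrong prediction costs an extra probe cycle. That variable hit latency is exactly what the abstract's scheduler work targets. This is an illustration, not the MRU-L0 design itself.

```python
# Toy sketch of way-prediction in a set-associative cache: only the
# predicted way is probed first, saving energy; a wrong prediction costs an
# extra-cycle fallback probe of the remaining ways. Sizes, latencies, and
# the MRU-based predictor are illustrative assumptions.
NUM_SETS, NUM_WAYS = 4, 4

tags = [[None] * NUM_WAYS for _ in range(NUM_SETS)]   # tag array
mru_way = [0] * NUM_SETS                              # predict the MRU way

def access(addr):
    """Return (hit, latency_cycles) for a load to `addr`."""
    s, tag = addr % NUM_SETS, addr // NUM_SETS
    way = mru_way[s]
    if tags[s][way] == tag:                 # predicted-way hit: 1 cycle
        return True, 1
    for w in range(NUM_WAYS):               # fallback: probe all ways
        if tags[s][w] == tag:
            mru_way[s] = w                  # retrain the predictor
            return True, 2                  # way-misprediction penalty
    victim = (mru_way[s] + 1) % NUM_WAYS    # naive replacement for the demo
    tags[s][victim] = tag
    mru_way[s] = victim
    return False, 10                        # miss: fetch from the next level

for a in [3, 3, 7, 3, 7, 11, 3]:
    print(a, access(a))
```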
3.
  • Borgström, Gustaf (author)
  • Making Sampled Simulations Faster by Minimizing Warming Time
  • 2022
  • Licentiate thesis (other academic/artistic), abstract:
    • A computer system simulator is a fundamental tool for computer architects to try out new ideas or explore a system's response to different configurations when executing different program codes. However, even simulating the CPU core in detail is time-consuming, as the execution rate slows down by several orders of magnitude compared to native execution. To solve this problem, previous work, namely SMARTS, demonstrates a statistical sampling methodology that records measurements only from tiny samples throughout the simulation. It spends only a fraction of the full simulation time on these sample measurements. Between detailed sample simulations, SMARTS fast-forwards using a greatly simplified and much faster simulation model (compared to full detail), which maintains only necessary parts of the architecture, such as the cache memory. This maintenance process is called warming. While warming is mandatory to keep the simulation accuracy high, the caches may be sufficiently warm for an accurate simulation long before reaching the sample. In other words, much time may be wasted on warming in SMARTS. In this work, we show that caches can be kept in an accurate state with much less time spent on warming. The first paper presents Adaptive Cache Warming, a methodology for identifying the minimum amount of warming in an iterative process for every SMARTS sample. The rest of the simulation time, previously spent on warming, can be skipped by fast-forwarding between samples using native hardware execution of the code. Doing so results in significantly faster statistically sampled simulation while maintaining accuracy. The second paper presents Cache Merging, which mitigates the redundant warming introduced by Adaptive Cache Warming. We solve this issue by going back in time and merging the existing warming with a cache warming session that comes chronologically before it. By removing the redundant warming, we yield even more speedup. Together, Adaptive Cache Warming and Cache Merging are a powerful boost for statistically sampled simulations. (A minimal sketch of the adaptive warming loop follows this entry.)
  •  
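A minimal sketch of the adaptive warming idea, assuming a hypothetical `run_sample_with_warming` helper that stands in for a detailed simulator: the warming window before a sample is repeatedly doubled until the sample's measurement stops changing, at which point further warming is treated as unnecessary.

```python
# Minimal sketch of adaptive cache warming: keep enlarging the warming
# window before a sample until the measured result stabilizes.
# `run_sample_with_warming` is a hypothetical stand-in for a detailed
# simulator; here it just converges artificially as warming grows.
def run_sample_with_warming(sample_start, warming_insts):
    # Placeholder: pretend accuracy improves with more warming and then
    # saturates once the caches are effectively warm.
    return 1.00 if warming_insts >= 800_000 else 1.00 - 0.2 / (warming_insts / 100_000)

def adaptive_cache_warming(sample_start, tol=0.01, initial=100_000, max_warming=10_000_000):
    warming = initial
    prev_ipc = run_sample_with_warming(sample_start, warming)
    while warming < max_warming:
        warming *= 2                      # redo the sample with twice the warming
        ipc = run_sample_with_warming(sample_start, warming)
        if abs(ipc - prev_ipc) < tol:     # estimate has stabilized: warm enough
            return warming, ipc
        prev_ipc = ipc
    return warming, prev_ipc

print(adaptive_cache_warming(sample_start=50_000_000))
```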
4.
  • Sandberg, Andreas, 1984- (author)
  • Understanding Multicore Performance : Efficient Memory System Modeling and Simulation
  • 2014
  • Doctoral thesis (other academic/artistic), abstract:
    • To increase performance, modern processors employ complex techniques such as out-of-order pipelines and deep cache hierarchies. While the increasing complexity has paid off in performance, it has become harder to accurately predict the effects of hardware/software optimizations in such systems. Traditional microarchitectural simulators typically execute code 10 000×–100 000× slower than native execution, which leads to three problems: First, high simulation overhead makes it hard to use microarchitectural simulators for tasks such as software optimizations, where rapid turn-around is required. Second, when multiple cores share the memory system, the resulting performance is sensitive to how memory accesses from the different cores interleave. This requires that applications be simulated multiple times with different interleavings to estimate their performance distribution, which is rarely feasible with today's simulators. Third, the high overhead limits the size of the applications that can be studied. This is usually solved by simulating only a relatively small number of instructions near the start of an application, with the risk of reporting unrepresentative results. In this thesis we demonstrate three strategies to accurately model multicore processors without the overhead of traditional simulation. First, we show how microarchitecture-independent memory access profiles can be used to drive automatic cache optimizations and to qualitatively classify an application's last-level cache behavior. Second, we demonstrate how high-level performance profiles, which can be measured on existing hardware, can be used to model the behavior of a shared cache. Unlike previous models, we predict the effective amount of cache available to each application and the resulting performance distribution due to different interleavings without requiring a processor model. Third, in order to model future systems, we build an efficient sampling simulator. By using native execution to fast-forward between samples, we reach new samples much faster than a single sample can be simulated. This enables us to simulate multiple samples in parallel, resulting in almost linear scalability and a maximum simulation rate close to native execution. (A small sampling sketch follows this entry.)
  •  
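The third contribution (sampling with native fast-forwarding and parallel samples) can be sketched as below. `simulate_sample` is a hypothetical placeholder for "fast-forward to the sample point and simulate it in detail"; the point of the sketch is only that independent samples can be farmed out in parallel and their results aggregated.

```python
# Sketch of the sampling idea: pick sample points spread over the execution,
# reach each one cheaply (a placeholder stands in for native fast-forwarding),
# and run the detailed samples independently so they can execute in parallel.
# All functions and numbers are illustrative, not the thesis's simulator.
from concurrent.futures import ProcessPoolExecutor

SAMPLE_LENGTH = 10_000          # detailed instructions per sample

def simulate_sample(start_inst):
    # Placeholder for: fast-forward natively to `start_inst`, then run
    # SAMPLE_LENGTH instructions in the detailed model and report IPC.
    return 1.0 + (start_inst % 7) * 0.05

def sampled_ipc(total_insts, num_samples=8):
    starts = [i * total_insts // num_samples for i in range(num_samples)]
    with ProcessPoolExecutor() as pool:          # samples are independent
        ipcs = list(pool.map(simulate_sample, starts))
    return sum(ipcs) / len(ipcs)                 # aggregate the estimate

if __name__ == "__main__":
    print(f"estimated IPC: {sampled_ipc(1_000_000_000):.3f}")
```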
5.
  • Aguilar, Xavier (author)
  • Towards Scalable Performance Analysis of MPI Parallel Applications
  • 2015
  • Licentiate thesis (other academic/artistic), abstract:
    • A considerable fraction of scientific discovery now relies on computer simulations. High Performance Computing (HPC) provides scientists with the means to simulate processes ranging from climate modeling to protein folding. However, achieving good application performance and making optimal use of HPC resources is a heroic task due to the complexity of parallel software. Therefore, performance tools and runtime systems that help users execute applications in the most efficient way are of utmost importance in the HPC landscape. In this thesis, we explore different techniques to tackle the challenges of collecting, storing, and using fine-grained performance data. First, we investigate the automatic use of real-time performance data in order to run applications in an optimal way. To that end, we present a prototype of an adaptive task-based runtime system that uses real-time performance data for task scheduling. This runtime system has a performance monitoring component that provides real-time access to the performance behavior of an application while it runs. The implementation of this monitoring component is presented and evaluated within this thesis. Second, we explore lossless compression approaches for MPI monitoring. One of the main problems performance tools face is the huge amount of fine-grained data that can be generated from an instrumented application. Collecting fine-grained data from a program is the best method to uncover the root causes of performance bottlenecks; however, it is infeasible for extremely parallel applications or applications with long execution times. On the other hand, collecting coarse-grained data is scalable but sometimes not enough to discern the root cause of a performance problem. Thus, we propose a new method for performance monitoring of MPI programs using event flow graphs. Event flow graphs add very low overhead in terms of execution time and storage size, and can be used to reconstruct fine-grained trace files of application events ordered in time. (A toy event flow graph example follows this entry.)
  •  
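A toy example of the event flow graph idea, assuming a single rank's event stream: consecutive events become graph edges with transition counts, so loops collapse onto a handful of edges while still allowing the original order to be reconstructed. The event list here is invented.

```python
# Toy sketch of an event flow graph: nodes are MPI call sites, edges count
# observed transitions between consecutive events in one rank's stream.
# Because loop iterations collapse onto the same edges, a long event trace
# becomes a small graph that can still be unrolled to recover event order.
from collections import defaultdict

events = ["MPI_Init", "MPI_Send", "MPI_Recv", "MPI_Send", "MPI_Recv",
          "MPI_Send", "MPI_Recv", "MPI_Finalize"]

edges = defaultdict(int)                 # (from_event, to_event) -> count
for src, dst in zip(events, events[1:]):
    edges[(src, dst)] += 1

for (src, dst), count in edges.items():
    print(f"{src} -> {dst}  x{count}")
# e.g. the three send/receive iterations become one edge: MPI_Send -> MPI_Recv  x3
```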
6.
  • Borgström, Gustaf, PhD Student, 1984-, et al. (author)
  • Faster Functional Warming with Cache Merging
  • 2022
  • Report (other academic/artistic), abstract:
    • SMARTS-like sampled hardware simulation techniques achieve good accuracy by simulating many small portions of an application in detail. However, while this reduces the detailed simulation time, it results in extensive cache warming times, as each of the many simulation points requires warming the whole memory hierarchy. Adaptive Cache Warming reduces this time by iteratively increasing warming until achieving sufficient accuracy. Unfortunately, each time the warming increases, the previous warming must be redone, nearly doubling the required warming. We address re-warming by developing a technique to merge the cache states from the previous and additional warming iterations. We demonstrate our merging approach on a multi-level LRU cache hierarchy and evaluate and address the introduced errors. By removing warming redundancy, we expect an ideal 2× warming speedup when using our Cache Merging solution together with Adaptive Cache Warming. Experiments show that Cache Merging delivers an average speedup of 1.44×, 1.84×, and 1.87× for 128 kB, 2 MB, and 8 MB L2 caches, respectively, with 95th-percentile absolute IPC errors of only 0.029, 0.015, and 0.006, respectively. These results demonstrate that Cache Merging yields significantly higher simulation speed with minimal losses. (A single-set merging sketch follows this entry.)
  •  
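A minimal sketch of merging two warming sessions for a single fully associative LRU set, under the assumption that the chronologically later session's recency order takes precedence and blocks seen only in the earlier session fill in behind it. The actual technique handles a multi-level hierarchy and the resulting errors; this only illustrates the single-set case, which is also where the ideal 2× figure comes from (the earlier warming no longer has to be re-executed, only merged).

```python
# Minimal sketch of cache merging for one LRU set: blocks from the
# chronologically later warming session keep their recency order, and
# blocks seen only in the earlier session are filled in behind them until
# the set is full. A real multi-level hierarchy needs more care (as the
# report discusses); this is only the single-set illustration.
def merge_lru_set(earlier, later, assoc):
    """earlier/later: block addresses ordered MRU -> LRU."""
    merged = list(later)                      # later warming wins on recency
    for blk in earlier:                       # append blocks only the older
        if blk not in merged and len(merged) < assoc:   # session touched
            merged.append(blk)
    return merged

# Earlier warming saw A,B,C (A most recent); later warming saw C,D.
print(merge_lru_set(earlier=["A", "B", "C"], later=["C", "D"], assoc=4))
# -> ['C', 'D', 'A', 'B']
```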
7.
  • Borgström, Gustaf, PhD Student, 1984-, et al. (author)
  • Faster Functional Warming with Cache Merging
  • 2023
  • In: Proceedings of System Engineering for Constrained Embedded Systems, DroneSE and RAPIDO 2023. - : Association for Computing Machinery (ACM). - 9798400700453 ; pp. 39-47
  • Conference paper (peer-reviewed), abstract:
    • SMARTS-like sampled hardware simulation techniques achieve good accuracy by simulating many small portions of an application in detail. However, while this reduces the simulation time, it results in extensive cache warming times, as each of the many simulation points requires warming the whole memory hierarchy. Adaptive Cache Warming reduces this time by iteratively increasing warming to achieve sufficient accuracy. Unfortunately, each increase requires that the previous warming be redone, nearly doubling the total warming. We address re-warming by developing a technique to merge the cache states from the previous and additional warming iterations. We demonstrate our merging approach on a multi-level LRU cache hierarchy and evaluate and address the introduced errors. Our experiments show that Cache Merging delivers an average speedup of 1.44×, 1.84×, and 1.87× for 128 kB, 2 MB, and 8 MB L2 caches, respectively (vs. a 2× theoretical maximum speedup), with 95th-percentile absolute IPC errors of only 0.029, 0.015, and 0.006, respectively. These results demonstrate that Cache Merging yields significantly higher simulation speed with minimal losses.
  •  
8.
  • Haddadi, Alireza, et al. (author)
  • Large-scale Graph Processing on Commodity Systems : Understanding and Mitigating the Impact of Swapping
  • 2023
  • In: The International Symposium on Memory Systems (MEMSYS '23). - : Association for Computing Machinery (ACM).
  • Conference paper (peer-reviewed), abstract:
    • Graph workloads are critical in many areas. Unfortunately, graph sizes have been increasing faster than DRAM capacity. As a result, large-scale graph processing necessarily falls back to virtual memory paging, resulting in tremendous performance losses. In this work we investigate how to get the best possible performance on commodity systems from graphs that cannot fit in DRAM by understanding, and adjusting, how the virtual memory system and the graph characteristics interact. To do so, we first characterize the graph applications, system, and SSD behavior as a function of how much of the graph fits in DRAM. From this analysis we see that, for multiple graph types, the system fails to fully utilize the bandwidth of the SSDs due to a lack of parallel page-in requests. We use this insight to motivate overcommitting CPU threads for graph processing. This allows us to significantly increase the number of parallel page-in requests for several graph types and recover much of the performance lost to paging. We show that overcommitting threads generally improves performance for various algorithms and graph types. However, we identify one graph that suffers from overcommitting, leading to the recommendation that overcommitting threads is generally good for performance, though certain graph inputs may suffer from it. (A small thread-overcommit sketch follows this entry.)
  •  
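A small sketch of why overcommitting threads helps when the graph does not fit in DRAM: a thread that is blocked paging in data does no useful work, so running more threads than cores keeps more page-in requests outstanding at the SSD. The "page fault" below is simulated with a sleep, and all numbers are illustrative assumptions.

```python
# Sketch of thread overcommitment for out-of-core graph processing: each
# vertex access may block on a page-in (simulated here with a sleep), so
# more threads than cores means more overlapping page-in requests.
import os
import time
from concurrent.futures import ThreadPoolExecutor

PAGE_FAULT_LATENCY = 0.01      # pretend 10 ms to page in a vertex's data

def process_vertex(v):
    time.sleep(PAGE_FAULT_LATENCY)    # stand-in for a blocking page-in
    return v * v                      # trivial compute once data is resident

def run(num_threads, num_vertices=256):
    start = time.time()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(process_vertex, range(num_vertices)))
    return time.time() - start

cores = os.cpu_count() or 4
print(f"1x threads: {run(cores):.2f}s")
print(f"8x threads (overcommitted): {run(8 * cores):.2f}s")
```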
9.
  • Hassan, Muhammad, 1990- (author)
  • Enhancing Processor Performance : Approaches for Memory Characterization, Efficient Dynamic Instruction Prefetching, and Optimized Instruction Caching
  • 2024
  • Doctoral thesis (other academic/artistic), abstract:
    • Low-latency access to both data and instructions is paramount for processor performance. However, memory speed has been trailing behind processor speed and is now a dominant bottleneck in execution. While both data and instruction misses cause performance losses, data misses can be overlapped with other useful work, whereas instruction misses stall the front-end of the processor, leading to greater performance loss. Memory access characterization is important for designing memory hierarchies. While many works have characterized the SPEC benchmarks' memory behaviour, the results have either been tied to a specific microarchitecture or have ignored the time-based behaviour of the benchmarks. In this thesis, we remove a majority of the microarchitectural features to characterize the intrinsic memory behaviour of the SPEC benchmarks and use this to understand how the workloads behave with various cache sizes and prefetching. To simplify the analysis of complex time-based results, we propose the use of MPKI bins, which divide the execution into distinct MPKI ranges. Using MPKI bins, we demonstrate that short memory-bound phases cause a significant percentage of the overall cache misses. For instructions, the growing instruction footprints of server workloads cause significant performance losses due to front-end stalls that cannot be overlapped or hidden by out-of-order execution. The second part of this thesis develops a technique to enable dedicated instruction prefetchers without the area cost of separate metadata storage structures. We propose to re-purpose the branch target buffer (BTB) to store prefetcher metadata, based on the insight that benchmarks that require a dedicated instruction prefetcher can tolerate increased BTB misses. Going further, we propose L2 instruction bypassing, based on the insight that decreased L2 data misses deliver more benefit than the slight instruction latency reduction of having instructions in the L2. We show that L2 instruction bypass delivers more performance than a dedicated instruction prefetcher and instruction-focused replacement policies. (A minimal MPKI-bin sketch follows this entry.)
  •  
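A minimal sketch of the MPKI-bin idea: execution is sliced into fixed instruction windows, each window's misses per kilo-instruction is computed, and the total misses are attributed to MPKI ranges, which makes short memory-bound phases visible. The per-window miss counts and bin boundaries below are invented.

```python
# Sketch of MPKI bins: slice execution into fixed windows, compute each
# window's misses per kilo-instruction (MPKI), and attribute the total
# misses to MPKI ranges. The miss counts are made up; the point is only
# the binning itself.
WINDOW_INSTS = 100_000
misses_per_window = [5, 8, 4, 6, 4200, 3900, 7, 5, 6, 4]   # two memory-bound windows

bins = {"0-1": 0, "1-10": 0, "10+": 0}      # MPKI range -> total misses
for misses in misses_per_window:
    mpki = misses / (WINDOW_INSTS / 1000)
    if mpki < 1:
        bins["0-1"] += misses
    elif mpki < 10:
        bins["1-10"] += misses
    else:
        bins["10+"] += misses

total = sum(misses_per_window)
for rng, m in bins.items():
    print(f"MPKI {rng:>5}: {m / total:6.1%} of all misses")
# The two short high-MPKI windows account for nearly all misses.
```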
10.
  • Hassan, Muhammad, 1990-, et al. (author)
  • Protean : Resource-efficient Instruction Prefetching
  • 2023
  • In: The International Symposium on Memory Systems (MEMSYS '23). - : Association for Computing Machinery (ACM).
  • Conference paper (peer-reviewed), abstract:
    • Increases in code footprint and control flow complexity have made low-latency instruction fetch challenging. Dedicated Instruction Prefetchers (DIPs) can provide performance gains (up to 5%) for a subset of applications that are poorly served by today's ubiquitous Fetch-Directed Instruction Prefetching (FDIP). However, DIPs incur the significant overhead of in-core metadata storage (for all workloads) and energy and performance loss from excess prefetches (for many workloads), leading to 11% of workloads actually losing performance. This work addresses how to provide the benefits of a DIP without its costs when the DIP cannot provide a benefit. Our key insight is that workloads that benefit from DIPs can tolerate increased Branch Target Buffer (BTB) misses. This allows us to dynamically re-purpose the existing BTB storage between the BTB and the DIP. We train a simple performance-counter-based decision tree to select the optimal configuration at runtime, which allows us to achieve different energy/performance optimization goals. As a result, we pay essentially no area overhead when a DIP is needed, and can use the larger BTB when it is beneficial, or even power it off when not needed. We look at our impact on two groups of benchmarks: those where the right configuration choice can improve performance or energy, and those where the wrong choice could hurt them. For the benchmarks with improvement potential, when optimizing for performance we obtain 86% of the oracle potential, and when optimizing for energy, 98% of the potential, both while avoiding essentially all performance and energy losses on the remaining benchmarks. This demonstrates that our technique is able to dynamically adapt to different performance/energy goals and obtain essentially all of the potential gains of DIPs without the overheads they experience today. (A sketch of the runtime selection follows this entry.)
  •  
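A minimal sketch of the runtime configuration choice, with an invented hand-written decision tree standing in for the trained one and made-up counter names, thresholds, and configuration labels: each interval, a few performance-counter rates decide whether the BTB storage stays a full BTB or is partly re-purposed for DIP metadata.

```python
# Minimal sketch of the runtime policy described above: a tiny decision
# tree over performance-counter deltas picks a front-end configuration
# each interval. The thresholds, counter names, and configurations are
# invented stand-ins for the trained tree, purely to show the mechanism.
def choose_config(counters):
    """counters: per-interval rates, e.g. misses per kilo-instruction."""
    if counters["l1i_mpki"] < 1.0:
        # Instruction footprint fits: FDIP alone is enough, so keep (or
        # power-gate part of) the full BTB rather than feeding a DIP.
        return "FULL_BTB"
    if counters["btb_mpki"] < 5.0:
        # Front-end misses are high but BTB pressure is low: give part of
        # the BTB storage to the dedicated instruction prefetcher's metadata.
        return "HALF_BTB_HALF_DIP"
    return "FULL_BTB"            # high BTB pressure: keep it for branch targets

interval = {"l1i_mpki": 12.4, "btb_mpki": 2.1}
print(choose_config(interval))   # -> HALF_BTB_HALF_DIP
```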
Type of publication
conference paper (6)
doctoral thesis (6)
journal article (4)
licentiate thesis (4)
report (1)
other publication (1)
Type of content
other academic/artistic (11)
peer-reviewed (10)
Author/editor
Black-Schaffer, Davi ... (16)
Hassan, Muhammad, 19 ... (5)
Park, Chang Hyun, Po ... (4)
Rohner, Christian, P ... (3)
Park, Chang Hyun, As ... (3)
Alipour, Mehdi (2)
Black-Schaffer, Davi ... (2)
Kaxiras, Stefanos, P ... (2)
Alves, Ricardo (2)
Hagersten, Erik (2)
Hagersten, Erik, Pro ... (2)
Borgström, Gustaf, P ... (2)
Wood, David A., Prof ... (2)
Jonsson, Bengt (1)
Kaxiras, Stefanos (1)
Kumar, Rakesh (1)
Aguilar, Xavier (1)
Laure, Erwin, Profes ... (1)
Black-Schaffer, Davi ... (1)
Podobas, Artur, 1982 ... (1)
H. Lipasti, Mikko, P ... (1)
Erez, Mattan, Profes ... (1)
Sandberg, Andreas (1)
Brorsson, Mats, Prof ... (1)
Sandberg, Andreas, 1 ... (1)
Borgström, Gustaf (1)
Sembrant, Andreas (1)
Podobas, Artur, Assi ... (1)
Black-Schaffer, Davi ... (1)
Moreto, Miquel (1)
Haddadi, Alireza (1)
Grot, Boris, Associa ... (1)
Popov, Mihail (1)
Sembrant, Andreas, 1 ... (1)
Khan, Muneeb, 1985- (1)
Stenström, Per, Prof ... (1)
Nematallah, Ahmed, 1 ... (1)
Vougioukas, Ilias (1)
Black-Schaffer, Davi ... (1)
Sánchez Barrera, Isa ... (1)
Marc, Casas (1)
Stupnikova, Anastasi ... (1)
Black-Schaffer, Davi ... (1)
University
Uppsala universitet (20)
Kungliga Tekniska Högskolan (2)
Language
English (22)
Research subject (UKÄ/SCB)
Natural sciences (13)
Engineering and technology (11)

