SwePub

Result list for search "WFRF:(Kaxiras Stefanos Professor)"


  • Results 1-8 of 8
1.
  • Alipour, Mehdi (author)
  • Rethinking Dynamic Instruction Scheduling and Retirement for Efficient Microarchitectures
  • 2020
  • Doctoral thesis (other academic/artistic), abstract
    • Out-of-order execution is one of the main microarchitectural techniques used to improve the performance of both single- and multi-threaded processors, in applications ranging from mobile devices to server computers. It achieves higher performance by finding independent instructions and hiding execution latency, using cycles that would otherwise be wasted or spent stalled. To accomplish this, it uses scheduling resources, including the reorder buffer (ROB), instruction queue (IQ), load/store queue (LSQ), and physical registers, to store and prioritize instructions.
      The pipeline of an out-of-order processor has three macro-stages: the front-end, the scheduler, and the back-end. The front-end fetches instructions, places them in the out-of-order resources, and analyzes them to prepare them for execution. The scheduler identifies which instructions are ready for execution and prioritizes them for scheduling. The back-end updates the processor state with the results of the oldest completed instructions, deallocates their resources, and commits instructions in program order to maintain correct execution.
      Since out-of-order execution must be able to choose any available instruction for execution, its scheduling resources require complex circuits for identifying and prioritizing instructions, which makes them expensive and therefore limited in size. This limited size leads to two stall points, at the front-end and at the back-end of the pipeline. The front-end can stall when the resources are fully allocated, so no new instructions can be placed in the scheduler. The back-end can stall when the instruction at the head of the ROB has not finished executing, which keeps resources from being deallocated and prevents new instructions from entering the pipeline.
      To address these two stalls, this thesis focuses on reducing the time instructions occupy the scheduling resources. Our front-end technique tackles IQ pressure, while our back-end approach considers the rest of the resources. To reduce front-end stalls, we relieve the pressure on the IQ, for both storing (depth) and issuing (width) instructions, by bypassing instructions to cheaper storage structures. To reduce back-end stalls, we explore retiring instructions earlier, and out of order, to relieve the pressure on the out-of-order resources (a toy model of the back-end stall appears after this entry).
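The back-end stall described in this abstract is easy to reproduce in a toy model. The Python sketch below compares conventional in-order retirement against an idealized early-release scheme for completed instructions; the ROB size, the 200-cycle miss latency, and the one-instruction-per-cycle front-end are invented parameters, not figures from the thesis.

```python
# Toy comparison of in-order retirement vs. idealized early release.
from collections import deque

def front_end_stalls(rob_size, latencies, early_release=False):
    rob = deque()      # per-entry cycles left until completion
    stalls, i = 0, 0
    while i < len(latencies) or rob:
        # Execute: age every in-flight instruction by one cycle.
        rob = deque(max(r - 1, 0) for r in rob)
        if early_release:
            # Free any completed entry, regardless of program order.
            rob = deque(r for r in rob if r > 0)
        else:
            # Conventional ROB: free completed entries only from the head.
            while rob and rob[0] == 0:
                rob.popleft()
        # Front-end: insert one instruction per cycle if space remains.
        if i < len(latencies):
            if len(rob) < rob_size:
                rob.append(latencies[i])
                i += 1
            else:
                stalls += 1   # allocation stall: the window is full
    return stalls

# One delinquent load (200-cycle miss) followed by short independent work.
stream = [200] + [1] * 63
print("in-order retire:", front_end_stalls(32, stream), "stall cycles")
print("early release  :", front_end_stalls(32, stream, early_release=True))
```

With in-order retirement, the delinquent load at the ROB head backs up the whole window; with early release, the short independent instructions free their entries immediately and the front-end never stalls.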
2.
  • Alves, Ricardo (author)
  • Leveraging Existing Microarchitectural Structures to Improve First-Level Caching Efficiency
  • 2019
  • Doctoral thesis (other academic/artistic), abstract
    • Low-latency data access is essential for performance. To achieve it, processors use fast first-level caches combined with out-of-order execution, to decrease and to hide memory access latency, respectively. While these approaches are effective for performance, they cost significant energy, which has led to many techniques that require designers to trade off performance against efficiency.
      Way-prediction and filter caches are two of the most common strategies for improving first-level cache energy efficiency while still minimizing latency. Both involve compromises: way-prediction trades some latency for better energy efficiency, while filter caches trade some energy efficiency for lower latency. The two strategies are not mutually exclusive, however. By borrowing elements from both, and taking into account SRAM memory-layout limitations, we propose a novel MRU-L0 cache that mitigates many of their shortcomings while preserving their benefits (a toy way predictor is sketched after this entry). Moreover, while first-level caches are tightly integrated into the CPU pipeline, existing work on these techniques largely ignores their impact on instruction scheduling. We show that the variable hit latency introduced by way-mispredictions causes instruction replays of load-dependent instruction chains, which hurts performance and efficiency. We study this effect and propose a variable-latency cache-hit instruction scheduler that identifies potential mis-schedulings, reduces instruction replays, limits the negative performance impact, and further improves cache energy efficiency.
      Modern pipelines also employ sophisticated execution strategies to hide memory latency and improve performance. While their primary use is for performance and correctness, they require intermediate storage that can serve as a cache as well. In this work we demonstrate how the store buffer, paired with the memory dependency predictor, can be used to efficiently cache dirty data, and how the physical register file, paired with a value predictor, can be used to efficiently cache clean data. These strategies improve both performance and energy, and do so with no additional storage and minimal additional complexity, since they recycle existing CPU structures to detect reuse, memory-ordering violations, and mis-speculations.
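As a rough illustration of the way-prediction half of this trade-off, here is a minimal MRU way predictor in Python. The cache geometry, the replacement policy, and the stat names are ours; the thesis's MRU-L0 design additionally accounts for SRAM layout and the scheduling effects discussed above.

```python
# Toy MRU way predictor for a set-associative L1 (illustrative only).
class MRUWayPredictedCache:
    def __init__(self, sets=64, ways=4, line=64):
        self.sets, self.ways, self.line = sets, ways, line
        self.tags = [[None] * ways for _ in range(sets)]
        self.mru = [0] * sets      # predicted (most recently used) way
        self.stats = {"fast_hit": 0, "slow_hit": 0, "miss": 0}

    def access(self, addr):
        s = (addr // self.line) % self.sets
        tag = addr // (self.line * self.sets)
        way = self.mru[s]
        if self.tags[s][way] == tag:
            self.stats["fast_hit"] += 1     # one way probed: low energy
            return
        for w in range(self.ways):          # misprediction: probe all ways,
            if self.tags[s][w] == tag:      # costing an extra cycle that can
                self.stats["slow_hit"] += 1 # replay dependent instructions
                self.mru[s] = w
                return
        self.stats["miss"] += 1
        victim = (way + 1) % self.ways      # naive replacement for brevity
        self.tags[s][victim] = tag
        self.mru[s] = victim

c = MRUWayPredictedCache()
for a in [0, 64, 0, 4096, 0, 64]:
    c.access(a)
print(c.stats)   # a mix of fast hits, slow hits, and misses
```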
3.
  • Shimchenko, Marina, 1994- (author)
  • Optimizing Energy Efficiency of Concurrent Garbage Collection
  • 2024
  • Doctoral thesis (other academic/artistic), abstract
    • The increasing energy consumption of the Information and Communication Technology sector amid climate-change concerns underscores the urgency of energy-efficiency improvements in computing. This thesis focuses on optimizing the energy efficiency of Java, a widely used programming language, and of its implementation in OpenJDK; specifically, on improving the energy efficiency of concurrent garbage collection.
      As a starting point, we assessed the energy consumption of the garbage-collection algorithms in OpenJDK and established that the concurrent collectors are the least energy-efficient, which prompted further investigation into how to reduce their energy consumption. We investigated dynamically adjusting the memory size an application requires based on how much of the machine's processor capacity is devoted to garbage collection. We also looked into scheduling garbage-collection tasks onto core types that use less energy, and into running these tasks while the computer is not actively used (the core-affinity idea is sketched after this entry). We implemented all of the above strategies in ZGC, one of Java's concurrent garbage collectors. Our experiments showed that these techniques can significantly reduce the energy used by garbage collection without slowing down the programs running on the computer. Overall, our research contributes to making computing more environmentally friendly by finding ways to use less energy while still getting the same results.
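One of the ideas above, steering garbage-collection work onto energy-efficient cores, can be sketched at the OS level. The Python fragment below is a Linux-only illustration using per-thread CPU affinity; the core IDs are hypothetical, and ZGC itself applies this kind of policy inside the JVM rather than from Python.

```python
# Linux-only sketch: pin background GC-style worker threads to
# energy-efficient cores, leaving the "big" cores to the application.
import os
import threading

EFFICIENT_CORES = {4, 5, 6, 7}   # assumption: this machine's little cores

def gc_worker(heap_chunks):
    # On Linux, a thread's native ID can be passed to sched_setaffinity,
    # pinning only this thread to the energy-efficient cores.
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(threading.get_native_id(), EFFICIENT_CORES)
    for chunk in heap_chunks:
        pass  # stand-in for concurrent marking/relocation work

workers = [threading.Thread(target=gc_worker, args=([b"obj"] * 1000,))
           for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```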
4.
  • Tran, Kim-Anh (author)
  • Finding and Exploiting Memory-Level-Parallelism in Constrained Speculative Architectures
  • 2020
  • Doctoral thesis (other academic/artistic), abstract
    • One of the main performance bottlenecks in processors today is the discrepancy between processor and memory speed, known as the memory wall. While the processor executes instructions at a high pace, memory is too slow to provide data in a timely manner. Load instructions that require an access to memory are referred to as long-latency or delinquent loads. To prevent the processor from stalling, independent instructions past such a load may execute, including independent loads. Overlapping load operations, and thus their latencies, is referred to as memory-level parallelism (MLP), and it can significantly improve performance. Today's out-of-order processors are therefore equipped with complex hardware that allows them to look into the future and select independent loads that can be overlapped. However, the ability to choose future instructions and speculatively execute them in advance introduces complexity, increased power consumption, and potential security risks.
      In this thesis we look at constrained speculative architectures that struggle to hide memory latency because they are constrained by design, by their resources, or by security. We investigate ways for the compiler to help them find MLP, with the ultimate goal of avoiding processor stalls as much as possible (a toy model of the benefit appears after this entry). This includes small energy-efficient processors that cannot look ahead far enough to find independent loads, but also large processors that are not allowed to speculatively execute independent loads because of security measures enforced to thwart side-channel attacks. We identify the reasons for their limitations and propose software transformations and hardware extensions to overcome their restrictions.
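A back-of-the-envelope model shows why the compiler transformations described above pay off. The Python sketch below contrasts delinquent loads whose misses serialize with loads hoisted into groups whose misses overlap; the 100-cycle miss latency and the MLP degree of 4 are illustrative assumptions, not numbers from the thesis.

```python
MISS_LATENCY = 100   # cycles per cache miss (illustrative)

def serialized(n_loads):
    # Baseline: each delinquent load stalls the pipeline in turn.
    return n_loads * MISS_LATENCY

def hoisted(n_loads, mlp=4):
    # After the transformation, up to `mlp` independent loads are issued
    # back to back, so the misses within each group overlap.
    groups = -(-n_loads // mlp)   # ceiling division
    return groups * MISS_LATENCY

print("serialized:", serialized(8), "cycles")   # 800
print("hoisted   :", hoisted(8), "cycles")      # 200
```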
5.
  • Sakalis, Christos, 1990- (author)
  • Rethinking Speculative Execution from a Security Perspective
  • 2021
  • Doctoral thesis (other academic/artistic), abstract
    • Speculative out-of-order execution is one of the fundamental building blocks of modern, high-performance processors. To maximize the utilization of the system's resources, hardware and software security checks in the speculative domain can be temporarily ignored without affecting the correctness of the application, as long as no architectural changes are made before transitioning to the non-speculative domain. Similarly, the microarchitectural state of the system, which is by necessity modified by every single operation (speculative or otherwise), also does not affect the correctness of the application, as such state is meant to be invisible at the architectural level. Unfortunately, while the microarchitectural state is indeed separate from the architectural state and is typically hidden from users, it can still be observed indirectly through its side-effects, through the use of "side-channels". Starting with Meltdown and Spectre, speculative execution, combined with existing side-channel attacks, can be abused to bypass both hardware and software security barriers and illegally gain access to data that would otherwise not be accessible.
      Embroiled in a battle between security and efficiency, computer architects have designed numerous microarchitectural solutions to this issue, all the while new attacks are constantly being discovered. This thesis proposes two such speculative side-channel defenses, Ghost loads and Delay-on-Miss, both of which protect against speculative side-channel attacks that use the cache and memory hierarchy as their side-channel. Ghost loads work by making speculative loads invisible in the memory hierarchy, while Delay-on-Miss, which is both simpler and more secure than Ghost loads, restricts speculative loads from even reaching many levels of the hierarchy (a behavioral sketch appears after this entry).
      At the same time, this thesis also tackles security problems brought on by speculative execution that are not themselves speculative side-channel attacks, namely microarchitectural replay attacks. In these, the attacker abuses speculative execution not to gain access to data but to amplify an otherwise already existing side-channel, by trapping the execution of a victim application in a repeating window of speculation and forcing it to squash and re-execute the same side-channel instructions again and again. To counter such attacks, Delay-on-Squash is introduced, which prevents instructions from being replayed in the same window of speculation, stopping any microarchitectural replay attempts.
      Overall, between Delay-on-Squash, Delay-on-Miss, and Ghost loads, this thesis covers a wide range of insecure microarchitectural behaviors and secure countermeasures, all the while balancing the trade-offs between security, performance, and complexity.
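A behavioral sketch of Delay-on-Miss as summarized in this abstract: speculative hits may complete because they change no cache state, while speculative misses are held back until the load turns non-speculative, so mis-speculated loads leave no attacker-visible trace. This Python model is our simplification of the policy, not the thesis's hardware design.

```python
# Behavioral simplification of Delay-on-Miss. Method names are ours.
class DelayOnMissL1:
    def __init__(self):
        self.lines = set()    # cached line addresses
        self.delayed = set()  # speculative misses awaiting commit

    def load(self, line, speculative):
        if line in self.lines:
            return "hit"              # safe: cache state unchanged
        if speculative:
            self.delayed.add(line)
            return "delayed"          # no fill, hence no side channel
        self.lines.add(line)          # non-speculative miss fills normally
        return "miss->fill"

    def commit(self, line):
        # The load became non-speculative: replay it, now allowed to fill.
        self.delayed.discard(line)
        return self.load(line, speculative=False)

    def squash(self, line):
        self.delayed.discard(line)    # mis-speculation leaves no trace

l1 = DelayOnMissL1()
print(l1.load(0x40, speculative=True))   # delayed
print(l1.commit(0x40))                   # miss->fill
```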
6.
  • Spiliopoulos, Vasileios (author)
  • Improving Energy-Efficiency of Multicores using First-Order Modeling
  • 2016
  • Doctoral thesis (other academic/artistic), abstract
    • In recent decades, power consumption has become one of the most critical resources in a computer system. Whether as the electricity bill in data centers, battery life in mobile devices, or thermal constraints in desktops and laptops, power consumption imposes serious limitations on today's processors, and improving power and energy efficiency is one of the most urgent research topics in computer architecture.
      Dynamic Voltage and Frequency Scaling (DVFS) and cache resizing are among the most popular energy-saving techniques. Previous work, however, has focused on heuristics and trial-and-error methods that yield acceptable savings but fail to provide insight into how these techniques affect the power and performance of a computer system. In contrast, this thesis proposes the use of first-order modeling to improve the energy efficiency of computer systems. A first-order model needs to be (i) accurate enough to efficiently drive DVFS and cache-resizing decisions, and (ii) simple enough to eliminate the overhead of collecting the required model inputs. We show that such models can be constructed and successfully applied in modern systems.
      For DVFS, we propose scaling frequency down to exploit applications' memory slack, i.e., periods that the processor spends waiting for data to be fetched from main memory. In such cases, the frequency can be lowered to save energy without an inordinate performance penalty (a first-order sketch of this model appears after this entry). Our DVFS models detect slack and predict the impact of DVFS on both power and performance with great accuracy. Cache resizing, on the other hand, relies on the fact that many applications do not benefit from the vast amount of cache that modern processors provide. In such cases, the cache can be resized to save static energy at a limited performance cost. Since both techniques are tied to the memory behavior of applications, we propose a unified model that manages the two in tandem and maximizes energy efficiency through synergistic DVFS and cache resizing.
      Finally, our experience with DVFS on real systems motivated us to contribute to the integration of DVFS into the gem5 simulator. Unlike other simulators, which ignore the role of the OS in DVFS, we extend gem5 with the hardware and software components that allow the existing Linux DVFS infrastructure to be seamlessly integrated into the simulator.
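The memory-slack argument can be captured in a few lines. Below is a first-order sketch in the spirit of the abstract: core cycles scale with frequency while memory stall time does not, and dynamic power follows P = C * V^2 * f. The cycle count, stall time, and voltage/frequency pairs are invented, and static power is ignored for brevity.

```python
def exec_time(f_ghz, core_cycles, mem_stall_s):
    # Core work scales with frequency; memory stall time does not.
    return core_cycles / (f_ghz * 1e9) + mem_stall_s

def dynamic_power(f_ghz, v_volt, c_eff=1e-9):
    # Classic first-order dynamic power: P = C * V^2 * f.
    return c_eff * v_volt ** 2 * f_ghz * 1e9

# Hypothetical memory-bound phase: 1e9 core cycles, 0.8 s of memory stalls.
for f, v in [(3.0, 1.2), (2.0, 1.0), (1.0, 0.8)]:
    t = exec_time(f, 1e9, 0.8)
    e = dynamic_power(f, v) * t
    print(f"{f:.1f} GHz: {t:.2f} s, {e:.2f} J of dynamic energy")
```

For this memory-bound phase, halving the frequency adds little execution time because the 0.8 s of stalls dominates, while the lower voltage and frequency cut dynamic energy substantially.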
7.
  • Alves, Ricardo, et al. (authors)
  • Early Address Prediction: Efficient Pipeline Prefetch and Reuse
  • 2021
  • In: ACM Transactions on Architecture and Code Optimization (TACO), Association for Computing Machinery (ACM), ISSN 1544-3566, E-ISSN 1544-3973; 18:3
  • Journal article (peer-reviewed), abstract
    • Achieving low load-to-use latency with low energy and storage overheads is critical for performance. Existing techniques either prefetch into the pipeline (via address prediction and validation) or provide data reuse in the pipeline (via register sharing or L0 caches). These techniques provide a range of trade-offs between latency, reuse, and overhead. In this work, we present a pipeline prefetching technique that achieves state-of-the-art performance and data reuse without additional data storage, data movement, or validation overheads, by adding address tags to the register file. These register-file tags allow us to forward (reuse) load data from the register file with no additional data movement, keep data alive in the register file beyond the instruction's lifetime to increase temporal reuse, and coalesce prefetch requests to achieve spatial reuse (a sketch of the forwarding idea appears after this entry). Further, we show that the existing memory-order violation detection hardware can validate prefetches and data forwards without additional overhead. Our design achieves the performance of existing pipeline prefetching while also forwarding 32% of the loads from the register file (compared to 15% for state-of-the-art register sharing), delivering a 16% reduction in L1 dynamic energy (1.6% of total processor energy) with an area overhead of less than 0.5%.
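The register-file tagging idea reads roughly as follows in simplified Python: each physical register written by a load also records the load's address, and a later load to the same address forwards the register value instead of accessing the L1. Structure and function names are ours; store invalidation, prefetch coalescing, and the memory-order validation the paper relies on are omitted.

```python
# Our heavily simplified reading of the core mechanism.
class TaggedRegisterFile:
    def __init__(self):
        self.value = {}      # physical register -> value
        self.addr_tag = {}   # physical register -> source load address

    def write_load(self, preg, addr, value):
        self.value[preg] = value
        self.addr_tag[preg] = addr

    def lookup(self, addr):
        # CAM-style search over the (small) set of address tags.
        for preg, tag in self.addr_tag.items():
            if tag == addr:
                return self.value[preg]
        return None

def load(rf, l1, preg, addr):
    forwarded = rf.lookup(addr)
    value = forwarded if forwarded is not None else l1[addr]
    rf.write_load(preg, addr, value)  # keep data alive for temporal reuse
    return value

rf, l1 = TaggedRegisterFile(), {0x100: 42}
print(load(rf, l1, "p1", 0x100))  # normal L1 access
print(load(rf, l1, "p2", 0x100))  # forwarded from p1, no L1 access
```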
8.
  • Davari, Mahdad (author)
  • Advances Towards Data-Race-Free Cache Coherence Through Data Classification
  • 2017
  • Doctoral thesis (other academic/artistic), abstract
    • Providing a consistent view of shared memory based on precise and well-defined semantics, the memory consistency model, has been an enabling factor in the widespread acceptance and commercial success of shared-memory architectures. Moreover, cache coherence protocols have been employed by the hardware to relieve programmers of the burden of dealing with the memory inconsistency that emerges in the presence of private caches. The principle behind all such cache coherence protocols is to guarantee that consistent values are read from the private caches at all times.
      In its most stringent form, a cache coherence protocol eagerly enforces two invariants before each data modification: (i) no other core has a copy of the data in its private caches, and (ii) all other cores know where to obtain the consistent data should they need it later. Nevertheless, by partly transferring the responsibility for maintaining these invariants to the programmers, commercial multicores have adopted weaker memory consistency models, namely Total Store Order (TSO), in order to optimize performance for the common cases.
      Moreover, memory models with even more relaxed invariants have been proposed, based on the observation that more and more software is written in compliance with Data-Race-Free (DRF) semantics. The semantics of DRF software can be leveraged by the hardware to infer when data in the private caches might be inconsistent. As a result, the hardware ignores the inconsistent data and retrieves the consistent data from the shared memory. DRF semantics thus relieves the hardware of the burden of eagerly enforcing the strong consistency invariants before each data modification; instead, consistency is guaranteed only when needed. This enables manifold optimizations, such as reduced energy consumption and improved performance and scalability. The efficiency of detecting and discarding inconsistent data is an important factor in the efficiency of such coherence protocols: discarding consistent data does not affect correctness, but results in performance loss and increased energy consumption.
      In this thesis we show how data classification can be leveraged as an effective tool to simplify cache coherence based on DRF semantics. In particular, we introduce simple but efficient hardware-based private/shared data classification techniques that can be used to efficiently detect inconsistent data, enabling low-overhead and scalable cache coherence solutions based on DRF semantics (a toy classifier is sketched after this entry).
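A toy version of page-granular private/shared classification, of the kind this abstract refers to: the first core to touch a page owns it as private, a second core's access reclassifies it as shared, and under DRF semantics only shared data needs self-invalidation at synchronization points. The page size and field names are illustrative, not the thesis's hardware design.

```python
PAGE = 4096  # classification granularity (illustrative)

class Classifier:
    def __init__(self):
        self.owner = {}       # page -> first-accessor core id
        self.shared = set()   # pages observed by more than one core

    def access(self, core, addr):
        page = addr // PAGE
        if page not in self.owner:
            self.owner[page] = core     # first touch: classified private
        elif self.owner[page] != core:
            self.shared.add(page)       # second core: reclassified shared
        return "shared" if page in self.shared else "private"

    def self_invalidate(self, core, cached_addrs):
        # At a synchronization point (e.g., an acquire), keep private
        # data and drop possibly stale shared data from the local cache.
        return {a for a in cached_addrs if a // PAGE not in self.shared}

c = Classifier()
print(c.access(0, 0x1000))                     # private
print(c.access(1, 0x1000))                     # shared
print(c.self_invalidate(0, {0x1000, 0x8000}))  # only 0x8000 survives
```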