SwePub

Result list for the search "WFRF:(Grot Boris)"


  • Results 1-4 of 4
1.
  • Awan, Ahsan Javed, 1988- (author)
  • Performance Characterization of In-Memory Data Analytics on a Scale-up Server
  • 2016
  • Licentiate thesis (other academic/artistic). Abstract:
    • The sheer increase in the volume of data over the last decade has triggered research into cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark defines the state of the art in big data analytics platforms for (i) exploiting data-flow and in-memory computing and (ii) exhibiting superior scale-out performance on commodity machines, little effort has been devoted to understanding the performance of in-memory data analytics with Spark on modern scale-up servers. This thesis characterizes the performance of in-memory data analytics with Spark on scale-up servers. Through empirical evaluation of representative benchmark workloads on a dual-socket server, we have found that in-memory data analytics with Spark exhibits poor multi-core scalability beyond 12 cores due to thread-level load imbalance and work-time inflation. We have also found that workloads are bound by the latency of frequent data accesses to DRAM. As input data size grows, application performance degrades significantly due to a substantial increase in wait time during I/O operations and garbage collection, despite a 10% better instruction retirement rate (due to lower L1 cache misses and higher core utilization). For data accesses, we have found that simultaneous multi-threading is effective in hiding the data latencies. We have also observed that (i) data locality on NUMA nodes can improve performance by 10% on average, and (ii) disabling next-line L1-D prefetchers can reduce execution time by up to 14%. To mitigate the GC impact, we match memory behaviour with the garbage collector, improving application performance by 1.6x to 3x, and recommend using multiple small executors, which can provide up to a 36% speedup over a single large executor (an illustrative configuration sketch follows this entry).
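The executor recommendation above is essentially a deployment-configuration choice. Below is a minimal PySpark sketch of splitting one large executor into several small ones; the property names are standard Spark configuration keys, but the instance counts and sizes are illustrative placeholders, not values evaluated in the thesis.

```python
# Illustrative sketch: several small executors instead of one large one,
# per the thesis's recommendation. All sizes are made-up placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("small-executors-sketch")
    # Hypothetical split: 4 executors x 6 cores instead of 1 x 24 cores.
    .config("spark.executor.instances", "4")
    .config("spark.executor.cores", "6")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)
```

Smaller executors also mean smaller per-JVM heaps, which plausibly eases the garbage-collection pressure the abstract reports for large inputs.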
2.
  • Hassan, Muhammad, 1990- (author)
  • Enhancing Processor Performance : Approaches for Memory Characterization, Efficient Dynamic Instruction Prefetching, and Optimized Instruction Caching
  • 2024
  • Doctoral thesis (other academic/artistic). Abstract:
    • Low-latency access to both data and instructions is paramount for processor performance. However, memory speed has been trailing behind processor speed and is now a dominant bottleneck in execution. While both data and instruction misses cause performance losses, data misses can be overlapped with other useful work, whereas instruction misses stall the front-end of the processor, leading to greater performance loss than data misses. Memory access characterization is important for designing memory hierarchies. While many works have characterized the SPEC benchmarks' memory behaviour, the results have been either tied to a specific micro-architecture or have ignored the time-based behaviour of the benchmarks. In this thesis, we remove a majority of the micro-architectural features to characterize the intrinsic memory behaviour of the SPEC benchmarks and use this to understand how the workloads behave with various cache sizes and prefetching. To simplify the analysis of complex time-based results, we propose the use of MPKI bins, which divide the execution into distinct MPKI ranges (a minimal sketch follows this entry). Using MPKI bins, we demonstrate that short memory-bound phases cause a significant percentage of the overall cache misses. For instructions, the growing instruction footprints of server workloads are causing significant performance losses due to front-end stalls that cannot be overlapped or hidden by out-of-order execution. The second part of this thesis develops a technique to enable dedicated instruction prefetchers without the area cost of separate metadata storage structures. We propose to re-purpose the branch target buffer (BTB) to store prefetcher metadata, based on the insight that benchmarks that require a dedicated instruction prefetcher can tolerate increased BTB misses. Going further, we propose L2 instruction bypassing, based on the insight that decreased L2 data misses deliver more benefit than the slight instruction latency reduction of having instructions in the L2. We show that L2 instruction bypass delivers more performance than a dedicated instruction prefetcher and instruction-focused replacement policies.
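To make the MPKI-bin idea concrete, here is a minimal sketch assuming a fixed instruction window and made-up bin edges (the thesis's actual window size and ranges are not given here): each window's misses-per-kilo-instruction value assigns it to a bin, and per-bin totals show how much of the overall miss count comes from short high-MPKI phases.

```python
# Minimal sketch of "MPKI bins": split execution into fixed instruction
# windows, compute misses-per-kilo-instruction (MPKI) per window, and group
# windows into discrete MPKI ranges. Window size and edges are assumed.
from collections import defaultdict

WINDOW_INSNS = 100_000                        # instructions per window (assumed)
BIN_EDGES = [0, 1, 5, 10, 25, float("inf")]   # MPKI range boundaries (assumed)

def mpki_bins(miss_counts_per_window):
    """miss_counts_per_window: cache misses observed in each window."""
    bins = defaultdict(lambda: {"windows": 0, "misses": 0})
    for misses in miss_counts_per_window:
        mpki = misses / (WINDOW_INSNS / 1000)
        for lo, hi in zip(BIN_EDGES, BIN_EDGES[1:]):
            if lo <= mpki < hi:
                bins[(lo, hi)]["windows"] += 1
                bins[(lo, hi)]["misses"] += misses
                break
    return dict(bins)

# Example: a short memory-bound phase (two high-MPKI windows) holds most of
# the total misses even though it spans few windows.
trace = [50, 40, 60, 3000, 3500, 45, 55]
for rng, stats in sorted(mpki_bins(trace).items()):
    print(rng, stats)
```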
3.
  • Kumar, Rakesh, et al. (authors)
  • Blasting Through The Front-End Bottleneck With Shotgun
  • 2018
  • In: ACM SIGPLAN Notices. New York, NY, USA: ACM, pp. 30-42
  • Conference paper (peer-reviewed). Abstract:
    • The front-end bottleneck is a well-established problem in server workloads owing to their deep software stacks and large instruction working sets. Despite years of research into effective L1-I and BTB prefetching, state-of-the-art techniques force a trade-off between performance and metadata storage costs. This work introduces Shotgun, a BTB-directed front-end prefetcher powered by a new BTB organization that maintains a logical map of an application's instruction footprint, which enables high-efficacy prefetching at low storage cost. To map active code regions, Shotgun precisely tracks an application's global control flow (e.g., function and trap routine entry points) and summarizes local control flow within each code region. Because the local control flow enjoys high spatial locality, with most functions comprising a handful of instruction cache blocks, it lends itself to a compact region-based encoding (a toy sketch of this encoding follows this entry). Meanwhile, the global control flow is naturally captured by the application's unconditional branch working set (calls, returns, traps). Based on these insights, Shotgun devotes the bulk of its BTB capacity to branches responsible for the global control flow and a spatial encoding of their target regions. By effectively capturing a map of the application's instruction footprint in the BTB, Shotgun enables highly effective BTB-directed prefetching. Using a storage budget equivalent to a conventional BTB, Shotgun outperforms the state-of-the-art BTB-directed front-end prefetcher by up to 14% on a set of varied commercial workloads.
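As a rough illustration of the region-based encoding described above, the toy model below stores, per unconditional branch, a target address plus a bit vector over the instruction-cache blocks following the target. The block size, footprint width, and data structure are assumptions for illustration, not Shotgun's actual organization.

```python
# Toy sketch of a spatial region encoding: each unconditional branch keeps
# its target and a bit vector marking which I-cache blocks near the target
# were fetched. Parameters below are assumed, not Shotgun's.
BLOCK = 64          # I-cache block size in bytes (assumed)
FOOTPRINT_BITS = 8  # blocks tracked starting at the target block (assumed)

class RegionBTB:
    def __init__(self):
        self.entries = {}  # branch PC -> (target, footprint bit vector)

    def record(self, branch_pc, target, fetched_pcs):
        """Summarize which blocks of the target region were fetched."""
        base = target // BLOCK
        vector = 0
        for pc in fetched_pcs:
            offset = pc // BLOCK - base
            if 0 <= offset < FOOTPRINT_BITS:
                vector |= 1 << offset
        self.entries[branch_pc] = (target, vector)

    def prefetch_blocks(self, branch_pc):
        """On a predicted branch, expand the footprint into block addresses."""
        target, vector = self.entries[branch_pc]
        base = target // BLOCK
        return [(base + i) * BLOCK
                for i in range(FOOTPRINT_BITS) if vector >> i & 1]
```

The compactness comes from summarizing an entire function-sized region in one entry instead of one BTB entry per branch inside it.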
4.
  • Roozbeh, Amir, 1983- (author)
  • Toward Next-generation Data Centers : Principles of Software-Defined “Hardware” Infrastructures and Resource Disaggregation
  • 2019
  • Licentiate thesis (other academic/artistic). Abstract:
    • The cloud is evolving due to additional demands introduced by new technological advancements and the wide movement toward digitalization. Next-generation data centers (DCs) and clouds are therefore expected, and need, to become cheaper, more efficient, and capable of offering more predictable services. Aligned with this, we examine the concept of software-defined “hardware” infrastructures (SDHI), based on hardware resource disaggregation, as one possible way of realizing next-generation DCs. We start with an overview of the functional architecture of a cloud based on SDHI. Following this, we discuss a series of use cases and deployment scenarios enabled by SDHI and explore the role of each functional block of SDHI's architecture, i.e., cloud infrastructure, cloud platforms, cloud execution environments, and applications. Next, we propose a framework to evaluate the impact of SDHI on the techno-economic efficiency of DCs, focusing specifically on application profiling, hardware dimensioning, and total cost of ownership (TCO). Our study shows that combining resource disaggregation and software-defined capabilities makes DCs less expensive and easier to expand; hence they can rapidly follow exponential demand growth. Additionally, we elaborate on the technologies behind SDHI, its challenges, and its potential future directions. Finally, to identify a suitable memory management scheme for SDHI and show its advantages, we focus on the management of the Last Level Cache (LLC) in currently available Intel processors. We investigate how better management of the LLC can provide higher performance, more predictable response time, and improved isolation between threads. More specifically, we take advantage of the LLC's non-uniform cache architecture (NUCA), in which the LLC is divided into “slices” and a core's access to the slice closer to it is faster than its access to other slices. Based upon this, we introduce a new memory management scheme, called slice-aware memory management, which carefully maps allocated memory to LLC slices based on their access latency rather than mapping it uniformly as the de facto scheme does (a toy model follows this entry). Many applications can benefit from our memory management scheme with relatively small changes. As an example, we show the potential benefits that Key-Value Store (KVS) applications gain by utilizing our memory management scheme. Moreover, we discuss how this scheme could be used to provide explicit CPU slicing, which is one of the expectations of SDHI and hardware resource disaggregation.
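A toy model of the slice-aware idea: classify addresses by the LLC slice their cache lines map to, and keep only buffers that land on the core's closest slice. Intel's real slice-selection hash is undocumented, so the XOR-fold hash below is a stand-in, and the whole sketch is illustrative rather than the thesis's scheme.

```python
# Toy model of slice-aware placement: keep only buffers whose cache lines
# map to the core's "local" LLC slice. The hash is a made-up stand-in for
# Intel's undocumented slice-selection function.
LINE = 64        # cache line size in bytes
N_SLICES = 8     # number of LLC slices (assumed)

def slice_of(addr):
    """Hypothetical slice hash: XOR-fold the line address into 3 bits."""
    line = addr // LINE
    h = 0
    while line:
        h ^= line & (N_SLICES - 1)
        line >>= 3
    return h

def allocate_near(local_slice, candidates):
    """Pick candidate buffer addresses whose first line maps to local_slice."""
    return [a for a in candidates if slice_of(a) == local_slice]

# Example: filter a pool of page-aligned addresses for the slice nearest
# some core, here slice 2.
pool = [i * 4096 for i in range(64)]
print(allocate_near(2, pool))
```

A real allocator would apply such a filter to physical pages at mapping time, which is why applications can adopt the scheme with relatively small changes.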