SwePub
Sök i SwePub databas

  Utökad sökning

Träfflista för sökning "WFRF:(Koukos Konstantinos) "

Sökning: WFRF:(Koukos Konstantinos)

  • Resultat 1-10 av 12
Sortera/gruppera träfflistan
   
NumreringReferensOmslagsbildHitta
1.
  •  
2.
  •  
3.
  • Koukos, Konstantinos, et al. (författare)
  • Building Heterogeneous Unified Virtual Memories (UVMs) without the Overhead
  • 2016
  • Ingår i: ACM Transactions on Architecture and Code Optimization (TACO). - : Association for Computing Machinery (ACM). - 1544-3566 .- 1544-3973. ; 13:1
  • Tidskriftsartikel (refereegranskat)abstract
    • This work proposes a novel scheme to facilitate heterogeneous systems with unified virtual memory. Research proposals implement coherence protocols for sequential consistency (SC) between central processing unit (CPU) cores and between devices. Such mechanisms introduce severe bottlenecks in the system; therefore, we adopt the heterogeneous-race-free (HRF) memory model. The use of HRF simplifies the coherency protocol and the graphics processing unit (GPU) memory management unit (MMU). Our protocol optimizes CPU and GPU demands separately, with the GPU part being simpler while the CPU is more elaborate and latency aware. We achieve an average 45% speedup and 45% energy-delay product reduction (20% energy) over the corresponding SC implementation.
  •  
4.
  • Koukos, Konstantinos, 1982- (författare)
  • Efficient Execution Paradigms for Parallel Heterogeneous Architectures
  • 2016
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • This thesis proposes novel, efficient execution-paradigms for parallel heterogeneous architectures. The end of Dennard scaling is threatening the effectiveness of DVFS in future nodes; therefore, new execution paradigms are required to exploit the non-linear relationship between performance and energy efficiency of memory-bound application-regions. To attack this problem, we propose the decoupled access-execute (DAE) paradigm. DAE transforms regions of interest (at program-level) in two coarse-grain phases: the access-phase and the execute-phase, which we can independently DVFS. The access-phase is intended to prefetch the data in the cache, and is therefore expected to be predominantly memory-bound, while the execute-phase runs immediately after the access-phase (that has warmed-up the cache) and is therefore expected to be compute-bound.DAE, achieves good energy savings (on average 25% lower EDP) without performance degradation, as opposed to other DVFS techniques. Furthermore, DAE increases the memory level parallelism (MLP) of memory-bound regions, which results in performance improvements of memory-bound applications. To automatically transform application-regions to DAE, we propose compiler techniques to automatically generate and incorporate the access-phase(s) in the application. Our work targets affine, non-affine, and even complex, general-purpose codes. Furthermore, we explore the benefits of software multi-versioning to optimize DAE in dynamic environments, and handle codes with statically unknown access-phase overheads. In general, applications automatically-transformed to DAE by our compiler, maintain (or even exceed in some cases) the good performance and energy efficiency of manually-optimized DAE codes.Finally, to ease the programming environment of heterogeneous systems (with integrated GPUs), we propose a novel system-architecture that provides unified virtual memory with low overhead. The underlying insight behind our work is that existing data-parallel programming models are a good fit for relaxed memory consistency models (e.g., the heterogeneous race-free model). This allows us to simplify the coherency protocol between the CPU – GPU, as well as the GPU memory management unit. On average, we achieve 45% speedup and 45% lower EDP over the corresponding SC implementation.
  •  
5.
  • Koukos, Konstantinos, et al. (författare)
  • Multiversioned decoupled access-execute : The key to energy-efficient compilation of general-purpose programs
  • 2016
  • Ingår i: Proc. 25th International Conference on Compiler Construction. - New York : ACM Press. - 9781450342414 ; , s. 121-131
  • Konferensbidrag (refereegranskat)abstract
    • Computer architecture design faces an era of great challenges in an attempt to simultaneously improve performance and energy efficiency. Previous hardware techniques for energy management become severely limited, and thus, compilers play an essential role in matching the software to the more restricted hardware capabilities. One promising approach is software decoupled access-execute (DAE), in which the compiler transforms the code into coarse-grain phases that are well-matched to the Dynamic Voltage and Frequency Scaling (DVFS) capabilities of the hardware. While this method is proved efficient for statically analyzable codes, general purpose applications pose significant challenges due to pointer aliasing, complex control flow and unknown runtime events. We propose a universal compile-time method to decouple general-purpose applications, using simple but efficient heuristics. Our solutions overcome the challenges of complex code and show that automatic decoupled execution significantly reduces the energy expenditure of irregular or memory-bound applications and even yields slight performance boosts. Overall, our technique achieves over 20% on average energy-delay-product (EDP) improvements (energy over 15% and performance over 5%) across 14 bench-marks from SPEC CPU 2006 and Parboil benchmark suites, with peak EDP improvements surpassing 70%.
  •  
6.
  • Koukos, Konstantinos, et al. (författare)
  • Towards more efficient execution : a decoupled access-execute approach
  • 2013
  • Ingår i: Proc. 27th ACM International Conference on Supercomputing. - New York : ACM Press. - 9781450321303 ; , s. 253-262
  • Konferensbidrag (refereegranskat)abstract
    • The end of Dennard scaling is expected to shrink the range of DVFS in future nodes, limiting the energy savings of this technique. This paper evaluates how much we can increase the effectiveness of DVFS by using a software decoupled access-execute approach. Decoupling the data access from execution allows us to apply optimal voltage-frequency selection for each phase and therefore improve energy efficiency over standard coupled execution.The underlying insight of our work is that by decoupling access and execute we can take advantage of the memory-bound nature of the access phase and the compute-bound nature of the execute phase to optimize power efficiency, while maintaining good performance. To demonstrate this we built a task based parallel execution infrastructure consisting of: (1) a runtime system to orchestrate the execution, (2) power models to predict optimal voltage-frequency selection at runtime, (3) a modeling infrastructure based on hardware measurements to simulate zero-latency, per-core DVFS, and (4) a hardware measurement infrastructure to verify our model's accuracy.Based on real hardware measurements we project that the combination of decoupled access-execute and DVFS has the potential to improve EDP by 25% without hurting performance. On memory-bound applications we significantly improve performance due to increased MLP in the access phase and ILP in the execute phase. Furthermore we demonstrate that our method can achieve high performance both in presence or absence of a hardware prefetcher.
  •  
7.
  • Koukos, Konstantinos, et al. (författare)
  • Towards Power Efficiency on Task-Based, Decoupled Access-Execute Models
  • 2013
  • Ingår i: PARMA 2013, 4th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures.
  • Konferensbidrag (refereegranskat)abstract
    • This work demonstrates the potential of hardware and software optimization to improve theeffectiveness of dynamic voltage and frequency scaling (DVFS). For software, we decouple data prefetch (access) and computation (execute) to enable optimal DVFS selectionfor each phase. For hardware, we use measurements from state-of-the-art multicore processors to accurately model the potential of per-core, zero-latency DVFS. We demonstrate that the combinationof decoupled access-execute and precise DVFS has the potential to decrease EDP by 25-30% without reducing performance.The underlying insight in this work is that by decoupling access and execute we can take advantageof the memory-bound nature of the access phase and the compute-bound nature of the execute phase to optimize power efficiency. For the memory-bound access phase, where we prefetch data into the cachefrom main memory, we can run at a reduced frequency and voltage without hurting performance. Thereafter, the execute phase can run much faster, thanks to the prefetching of the access phase, and achieve higher performance. This decoupled program behavior allows us to achieve more effective use of DVFS than standard coupled executions which mix data access and compute.To understand the potential of this approach, we measure application performance and power consumption on a modern multicore system across a range of frequencies and voltages. From this data we build a model that allows us to analyze the effects of per-core, zero-latency DVFS. The results of this work demonstrate the significant potential for finer-grain DVFS in combination with DVFS-optimized software.
  •  
8.
  • Tran, Kim-Anh, et al. (författare)
  • Clairvoyance : Look-ahead compile-time scheduling
  • 2017
  • Ingår i: Proc. 15th International Symposium on Code Generation and Optimization. - Piscataway, NJ : IEEE Press. - 9781509049318 ; , s. 171-184
  • Konferensbidrag (refereegranskat)
  •  
9.
  • Tran, Kim-Anh, et al. (författare)
  • Static Instruction Scheduling for High Performance on Limited Hardware
  • 2018
  • Ingår i: IEEE Transactions on Computers. - : IEEE Computer Society. - 0018-9340 .- 1557-9956. ; 67:4, s. 513-527
  • Tidskriftsartikel (refereegranskat)abstract
    • Complex out-of-order (OoO) processors have been designed to overcome the restrictions of outstanding long-latency misses at the cost of increased energy consumption. Simple, limited OoO processors are a compromise in terms of energy consumption and performance, as they have fewer hardware resources to tolerate the penalties of long-latency loads. In worst case, these loads may stall the processor entirely. We present Clairvoyance, a compiler based technique that generates code able to hide memory latency and better utilize simple OoO processors. By clustering loads found across basic block boundaries, Clairvoyance overlaps the outstanding latencies to increases memory-level parallelism. We show that these simple OoO processors, equipped with the appropriate compiler support, can effectively hide long-latency loads and achieve performance improvements for memory-bound applications. To this end, Clairvoyance tackles (i) statically unknown dependencies, (ii) insufficient independent instructions, and (iii) register pressure. Clairvoyance achieves a geomean execution time improvement of 14 percent for memory-bound applications, on top of standard O3 optimizations, while maintaining compute-bound applications' high-performance.
  •  
10.
  • Tran, Kim-Anh, et al. (författare)
  • SWOOP : software-hardware co-design for non-speculative, execute-ahead, in-order cores
  • 2018
  • Ingår i: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. - New York, NY, USA : Association for Computing Machinery (ACM). - 9781450356985 ; , s. 328-343
  • Konferensbidrag (refereegranskat)abstract
    • Increasing demands for energy efficiency constrain emerging hardware. These new hardware trends challenge the established assumptions in code generation and force us to rethink existing software optimization techniques. We propose a cross-layer redesign of the way compilers and the underlying microarchitecture are built and interact, to achieve both performance and high energy efficiency.In this paper, we address one of the main performance bottlenecks — last-level cache misses — through a software-hardware co-design. Our approach is able to hide memory latency and attain increased memory and instruction level parallelism by orchestrating a non-speculative, execute-ahead paradigm in software (SWOOP). While out-of-order (OoO) architectures attempt to hide memory latency by dynamically reordering instructions, they do so through expensive, power-hungry, speculative mechanisms.We aim to shift this complexity into software, and we build upon compilation techniques inherited from VLIW, software pipelining, modulo scheduling, decoupled access-execution, and software prefetching. In contrast to previous approaches we do not rely on either software or hardware speculation that can be detrimental to efficiency. Our SWOOP compiler is enhanced with lightweight architectural support, thus being able to transform applications that include highly complex control-flow and indirect memory accesses.
  •  
Skapa referenser, mejla, bekava och länka
  • Resultat 1-10 av 12

Kungliga biblioteket hanterar dina personuppgifter i enlighet med EU:s dataskyddsförordning (2018), GDPR. Läs mer om hur det funkar här.
Så här hanterar KB dina uppgifter vid användning av denna tjänst.

 
pil uppåt Stäng

Kopiera och spara länken för att återkomma till aktuell vy