SwePub
Search the SwePub database


Results list for the search "WFRF:(McKee Sally A 1963) srt2:(2010-2014)"


  • Results 1-10 of 31
1.
  • Fang, Z., et al. (authors)
  • Active memory controller
  • 2012
  • In: Journal of Supercomputing. - : Springer Science and Business Media LLC. - 1573-0484 .- 0920-8542. ; 62:1, pp. 510-549
  • Journal article (peer-reviewed), abstract:
    • Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper, we present the microarchitecture design of AMC, and the programming model of AMOs. We compare AMOs' performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50x faster barriers, 12x faster spinlocks, 8.5x-15x faster stream/array operations, and 3x faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with decent accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high performance processor, based on a standard cell implementation.
  •  
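The key idea in the abstract above can be illustrated with a toy model (all class and method names here are invented for illustration, not taken from the paper): instead of moving a cache line across the interconnect to the requesting core, an AMO ships the operation itself to the data's home memory controller, which performs the read-modify-write locally.

```python
# Toy sketch of the Active Memory Operation (AMO) idea: the home memory
# controller has a small ALU, so one message carries a whole operation.
# All names and message-counting rules are illustrative assumptions.

class HomeMemoryController:
    """Models memory plus a tiny ALU at the controller."""
    def __init__(self):
        self.mem = {}
        self.messages = 0  # round trips on the interconnect

    # Conventional path: the core fetches the data, updates it, writes back.
    def load(self, addr):
        self.messages += 1
        return self.mem.get(addr, 0)

    def store(self, addr, value):
        self.messages += 1
        self.mem[addr] = value

    # AMO path: a single message; the controller does the
    # read-modify-write locally and returns the old value.
    def amo_fetch_add(self, addr, value):
        self.messages += 1
        old = self.mem.get(addr, 0)
        self.mem[addr] = old + value
        return old

mc = HomeMemoryController()

# Conventional increment: two interconnect round trips.
v = mc.load("counter")
mc.store("counter", v + 1)

# AMO increment: one round trip.
mc.amo_fetch_add("counter", 1)

print(mc.mem["counter"], mc.messages)  # 2 3
```

The saving compounds under contention: a spinlock or barrier built on `amo_fetch_add` avoids bouncing the cache line between cores, which is the effect behind the speedups the abstract reports.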
2.
  • Larsen, P., et al. (authors)
  • Parallelizing more loops with compiler guided refactoring
  • 2012
  • In: Proceedings of the International Conference on Parallel Processing. 41st International Conference on Parallel Processing, ICPP 2012, Pittsburgh, PA, 10-13 September 2012. - 0190-3918. - 9780769547961 ; pp. 410-419
  • Conference paper (peer-reviewed), abstract:
    • The performance of many parallel applications relies not on instruction-level parallelism but on loop-level parallelism. Unfortunately, automatic parallelization of loops is a fragile process; many different obstacles can hinder or prevent it in practice. To address this predicament we developed an interactive compilation feedback system that guides programmers in iteratively modifying their application source code. This helps leverage the compiler's ability to generate loop-parallel code. We employ our system to modify two sequential benchmarks dealing with image processing and edge detection, resulting in scalable parallelized code that runs up to 8.3 times faster on an eight-core Intel Xeon 5570 system and up to 12.5 times faster on a quad-core IBM POWER6 system. Benchmark performance varies significantly between the systems. This suggests that semi-automatic parallelization should be combined with target-specific optimizations. Furthermore, comparing the first benchmark to manually parallelized, hand-optimized pthreads and OpenMP versions, we find that code generated using our approach typically outperforms the pthreads code (within 93-339%). It also performs competitively against the OpenMP code (within 75-111%). The second benchmark outperforms manually parallelized and optimized OpenMP code (within 109-242%).
  •  
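A hedged illustration of the kind of refactoring such a feedback system might suggest (this example is invented, not from the paper): a loop-carried dependence through a running best index blocks auto-parallelization, while rewriting it as independent per-chunk reductions plus a cheap sequential combine exposes loop-level parallelism.

```python
# Sketch: refactoring a loop-carried dependence into a parallelizable
# per-chunk reduction. Names and data are illustrative assumptions.

from concurrent.futures import ThreadPoolExecutor

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]

# Original: sequential; each iteration depends on the previous best.
def argmax_sequential(xs):
    best = 0
    for i in range(1, len(xs)):
        if xs[i] > xs[best]:
            best = i
    return best

# Refactored: each chunk computes a local argmax independently,
# then one short sequential pass combines the partial results.
def argmax_parallel(xs, chunks=4):
    n = len(xs)
    bounds = [(k * n // chunks, (k + 1) * n // chunks) for k in range(chunks)]
    def local_best(lo_hi):
        lo, hi = lo_hi
        return max(range(lo, hi), key=lambda i: xs[i])
    with ThreadPoolExecutor() as pool:
        partial = list(pool.map(local_best, bounds))
    return max(partial, key=lambda i: xs[i])

print(argmax_sequential(data), argmax_parallel(data))  # 5 5
```

The transformation preserves the result while making each chunk's work independent, which is exactly the property an auto-parallelizing compiler needs to prove before it can emit parallel code.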
3.
  • Puzovic, N., et al. (authors)
  • A multi-pronged approach to benchmark characterization
  • 2010
  • In: 2010 IEEE International Conference on Cluster Computing Workshops and Posters, Cluster Workshops 2010. - 9781424483969
  • Conference paper (other academic/artistic), abstract:
    • Understanding the behavior of current and future workloads is key for designers of future computer systems. If target workload characteristics are available, computer designers can use this information to optimize the system. This can lead to a chicken-and-egg problem: how does one characterize application behavior for an architecture that is a moving target and for which sophisticated modeling tools do not yet exist? We present a multi-pronged approach to benchmark characterization early in the design cycle. We collect statistics from multiple sources and combine them to create a comprehensive view of application behavior. We assume a fixed part of the system (service core) and a "to-be-designed" part that will gradually be developed, guided by measurements taken on the fixed part. Data are collected from measurements taken on existing hardware and statistics are obtained via emulation tools. These are supplemented with statistics extracted from traces and ILP information generated by the compiler. Although the motivation for this work is the classification of workloads for an embedded, reconfigurable, parallel architecture, the methodology can easily be adapted to other platforms. © 2010 IEEE.
  •  
4.
  • Bardizbanyan, Alen, 1986, et al. (authors)
  • Improving Data Access Efficiency by Using a Tagless Access Buffer (TAB)
  • 2013
  • In: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2013. - 9781467355254 ; pp. 269-279
  • Conference paper (peer-reviewed), abstract:
    • The need for energy efficiency continues to grow for many classes of processors, including those for which performance remains vital. Data cache is crucial for good performance, but it also represents a significant portion of the processor's energy expenditure. We describe the implementation and use of a tagless access buffer (TAB) that greatly improves data access energy efficiency while slightly improving performance. The compiler recognizes memory reference patterns within loops and allocates these references to a TAB. This combined hardware/software approach reduces energy usage by (1) replacing many level-one data cache (L1D) accesses with accesses to the smaller, more power-efficient TAB; (2) removing the need to perform tag checks or data translation lookaside buffer (DTLB) lookups for TAB accesses; and (3) reducing DTLB lookups when transferring data between the L1D and the TAB. Accesses to the TAB occur earlier in the pipeline, and data lines are prefetched from lower memory levels, which results in a small performance improvement. In addition, we can avoid many unnecessary block transfers between other memory hierarchy levels by characterizing how data in the TAB are used. With a combined size equal to that of a conventional 32-entry register file, a four-entry TAB eliminates 40% of L1D accesses and 42% of DTLB accesses, on average. This configuration reduces data-access related energy by 35% while simultaneously decreasing execution time by 3%.
  •  
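The accounting behind the TAB savings can be sketched with a toy model (line size and access counts here are assumptions, not the paper's parameters): when the compiler knows a loop walks an array with a fixed stride, only the occasional line refill needs an L1D access and a tag/DTLB lookup; every other access hits the small tagless buffer.

```python
# Toy model of why a tagless access buffer saves lookups: strided loop
# references need the L1D and tag/DTLB machinery only on line refills.
# LINE_WORDS and the counting rules are illustrative assumptions.

LINE_WORDS = 8  # words per cache line (assumed)

def strided_loop_accesses(n, use_tab):
    """Count L1D accesses and tag/DTLB lookups for n sequential words."""
    l1d_accesses = tag_dtlb_lookups = 0
    for i in range(n):
        if use_tab:
            if i % LINE_WORDS == 0:       # refill the TAB entry
                l1d_accesses += 1
                tag_dtlb_lookups += 1
            # all other accesses hit the TAB: no tag check, no DTLB
        else:
            l1d_accesses += 1             # every access goes to the L1D
            tag_dtlb_lookups += 1
    return l1d_accesses, tag_dtlb_lookups

base = strided_loop_accesses(64, use_tab=False)
tab = strided_loop_accesses(64, use_tab=True)
print(base, tab)  # (64, 64) (8, 8)
```

Real savings are smaller than this idealized 8x because only compiler-recognized references go through the TAB, which is consistent with the ~40% average reductions the abstract reports.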
5.
  • Bhadauria, Major, et al. (authors)
  • An approach to resource-aware co-scheduling for CMPs
  • 2010
  • In: Proceedings of the International Conference on Supercomputing. - New York, NY, USA : ACM. - 9781450300186 ; pp. 189-199
  • Conference paper (other academic/artistic), abstract:
    • We develop real-time scheduling techniques for improving performance and energy for multiprogrammed workloads that scale non-uniformly with increasing thread counts. Multithreaded programs generally deliver higher throughput than single-threaded programs on chip multiprocessors, but performance gains from increasing threads decrease when there is contention for shared resources. We use analytic metrics to derive local search heuristics for creating efficient multiprogrammed, multithreaded workload schedules. Programs are allocated fewer cores than requested, and scheduled to space-share the CMP to improve global throughput. Our holistic approach attempts to co-schedule programs that complement each other with respect to shared resource consumption. We find application co-scheduling for performance and energy in a resource-aware manner achieves better results than solely targeting total throughput or concurrently co-scheduling all programs. Our schedulers improve overall energy delay (E*D) by a factor of 1.5 over time-multiplexed gang scheduling. © 2010 ACM.
  •  
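The pairing intuition behind resource-aware co-scheduling can be sketched as follows (the workload names and bandwidth numbers below are invented profiling data, and this greedy heuristic is a simplification, not the paper's local-search algorithm): programs that stress the same shared resource contend, so a scheduler pairs a memory-heavy program with a compute-heavy one rather than two memory-heavy programs.

```python
# Sketch: complementary pairing for co-scheduling. Each program's
# memory-bandwidth demand is a fraction of what one chip supplies
# (invented numbers for illustration).

demand = {"mcf": 0.9, "lbm": 0.8, "povray": 0.1, "namd": 0.2}

def pair_complementary(progs):
    """Greedy: repeatedly pair the lightest and heaviest consumers."""
    order = sorted(progs, key=lambda p: demand[p])
    pairs = []
    while len(order) > 1:
        pairs.append((order.pop(0), order.pop(-1)))
    return pairs

pairs = pair_complementary(list(demand))
# No pair oversubscribes memory bandwidth (sum of demands <= 1.0).
print(pairs, all(demand[a] + demand[b] <= 1.0 for a, b in pairs))
```

Pairing the two memory-heavy programs instead would give one pair demanding 170% of available bandwidth, stalling both: the contention the resource-aware schedule avoids.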
6.
  • Cui, Z., et al. (authors)
  • DTail: A flexible approach to DRAM refresh management
  • 2014
  • In: 28th ACM International Conference on Supercomputing, ICS 2014; Munich; Germany; 10 June 2014 through 13 June 2014. - New York, NY, USA : ACM. - 9781450326421 ; pp. 43-52
  • Conference paper (peer-reviewed), abstract:
    • DRAM cells must be refreshed (or rewritten) periodically to maintain data integrity, and as DRAM density grows, so does the refresh time and energy. Not all data need to be refreshed with the same frequency, though, and thus some refresh operations can safely be delayed. Tracking such information allows the memory controller to reduce refresh costs by judiciously choosing when to refresh different rows. Solutions that store imprecise information miss opportunities to avoid unnecessary refresh operations, but the storage for tracking complete information scales with memory capacity. We therefore propose a flexible approach to refresh management that tracks complete refresh information within the DRAM itself, where it incurs negligible storage costs (0.006% of total capacity) and can be managed easily in hardware or software. Completely tracking multiple types of refresh information (e.g., row retention time and data validity) maximizes refresh reduction and lets us choose the most effective refresh schemes. Our evaluations show that our approach saves 25-82% of the total DRAM energy over prior refresh-reduction mechanisms.
  •  
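A small sketch of the tracking idea above (the row count, retention times, and validity flags are invented; the real scheme stores this metadata inside the DRAM itself): if the controller knows each row's retention time and whether the row holds valid data, it can skip refreshes that a fixed-interval scheme would issue regardless.

```python
# Toy comparison: fixed-interval refresh vs. refresh guided by per-row
# retention time and validity. All per-row values are assumptions.

BASE_INTERVAL_MS = 64  # standard worst-case refresh interval (assumed)

# Per-row metadata: (retention time in ms, holds valid data?)
rows = [(64, True), (256, True), (128, False), (512, True)]

def refreshes_per_second(rows, smart):
    total = 0.0
    for retention_ms, valid in rows:
        if smart:
            if not valid:
                continue                    # nothing to preserve: skip
            total += 1000.0 / retention_ms  # refresh at the row's own rate
        else:
            total += 1000.0 / BASE_INTERVAL_MS  # worst-case rate for all
    return total

naive = refreshes_per_second(rows, smart=False)
smart = refreshes_per_second(rows, smart=True)
print(naive, smart)  # 62.5 21.484375
```

The gap grows with density: the more rows whose retention exceeds the worst case, or whose contents are invalid, the more refresh operations complete tracking can elide.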
7.
  • Dickov, B., et al. (authors)
  • Analyzing Performance Improvements and Energy Savings in Infiniband Architecture using Network Compression
  • 2014
  • In: Proceedings - Symposium on Computer Architecture and High Performance Computing. - 1550-6533. - 9781479969043 ; pp. 73-80
  • Conference paper (peer-reviewed), abstract:
    • One of the greatest challenges in HPC is total system power and energy consumption. Whereas HPC interconnects have traditionally been designed with a focus on bandwidth and latency, there is an increasing interest in minimising the interconnect's energy consumption. This paper complements ongoing efforts related to power reduction and energy proportionality by investigating the potential benefits of MPI data compression. We apply lossy compression to two common communication patterns in HPC kernels, in conjunction with recently introduced InfiniBand (IB) power saving modes. The results for the Alya CG kernel and Gromacs PME solver kernels show improvements in both performance and energy. While performance improvements depend strongly on the type of communication pattern, energy savings in IB links are more consistent and proportional to the achievable compression rates. We estimated an upper bound for link energy savings of up to 71% for the Alya CG kernel, while for the Gromacs PME solver we obtained savings of up to 63% compared to nominal energy when a compression rate of 50% is used. We conclude that lossy compression is not always useful for performance improvements, but that it does reduce average IB link energy consumption.
  •  
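One simple lossy scheme for floating-point message payloads can be sketched as follows (this codec is a generic illustration, not the paper's method): zeroing the low mantissa bits of each IEEE-754 double bounds the relative error while producing bytes that compress far better.

```python
# Sketch: lossy float compression by mantissa truncation. keep_bits and
# the test payload are illustrative assumptions.

import struct
import zlib

def truncate_mantissa(x, keep_bits=20):
    """Zero all but the top keep_bits of the 52-bit mantissa."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    mask = ~((1 << (52 - keep_bits)) - 1) & 0xFFFFFFFFFFFFFFFF
    (y,) = struct.unpack("<d", struct.pack("<Q", bits & mask))
    return y

values = [1.0 + i / 7.0 for i in range(1024)]
raw = struct.pack("<1024d", *values)
lossy = struct.pack("<1024d", *(truncate_mantissa(v) for v in values))

# Truncation leaves runs of zero bytes, so the payload deflates better,
# while the relative error stays below 2**-keep_bits.
ratio = len(zlib.compress(lossy)) / len(zlib.compress(raw))
err = max(abs(v - truncate_mantissa(v)) / abs(v) for v in values)
print(ratio < 1.0, err < 2 ** -20)
```

This mirrors the trade-off the abstract describes: the error bound is controllable, and the energy saving tracks the achievable compression rate rather than the raw message count.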
8.
  • Franklin, D., et al. (authors)
  • PC chairs' welcome
  • 2014
  • In: Proceedings of the 11th ACM Conference on Computing Frontiers, CF 2014.
  • Conference paper (other academic/artistic)
  •  
9.
  • Frolov, Nikita, 1986, et al. (authors)
  • A SAT-Based Compiler for FlexCore
  • 2011
  • Report (other academic/artistic), abstract:
    • Much like VLIW, statically scheduled architectures that expose all control signals to the compiler offer much potential for highly parallel, energy-efficient performance. Bau is a novel compilation infrastructure that leverages the LLVM compilation tools and the MiniSAT solver to generate efficient code for one such exposed architecture. We first build a compiler construction library that allows scheduling and resource constraints to be expressed declaratively in a domain-specific language, and then use this library to implement a compiler that generates programs that are 1.2–1.5 times more compact than either a baseline MIPS R2K compiler or a basic-block-based, sequentially phased scheduler.
  •  
10.
  • Frolov, Nikita, 1986, et al. (authors)
  • Declarative, SAT-solver-based Scheduling for an Embedded Architecture with a Flexible Datapath
  • 2011
  • In: Swedish System-on-Chip Conference.
  • Conference paper (other academic/artistic), abstract:
    • Much like VLIW, statically scheduled architectures that expose all control signals to the compiler offer much potential for highly parallel, energy-efficient performance. Bau is a novel compilation infrastructure that leverages the LLVM compilation tools and the MiniSAT solver to generate efficient code for one such exposed architecture. We first build a compiler construction library that allows scheduling and resource constraints to be expressed declaratively in a domain-specific language, and then use this library to implement a compiler that generates programs that are 1.2–1.5 times more compact than either a baseline MIPS R2K compiler or a basic-block-based, sequentially phased scheduler.
  •  
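The declarative scheduling idea in the two Bau abstracts above can be sketched with a minimal stand-in (brute-force search instead of MiniSAT, and all operation names and constraints are invented): constraints state which operations must issue strictly earlier than others and which cannot share a cycle, and the solver finds any assignment satisfying all of them.

```python
# Minimal stand-in for SAT-based instruction scheduling: declare the
# constraints, then search for a satisfying cycle assignment. A real
# system would encode these as clauses for a SAT solver.

from itertools import product

ops = ["load", "add", "store"]
DEPENDS = [("load", "add"), ("add", "store")]  # must issue strictly earlier
SHARE_UNIT = [("load", "store")]               # cannot issue the same cycle
CYCLES = range(4)

def schedule():
    """Return the first cycle assignment satisfying every constraint."""
    for assign in product(CYCLES, repeat=len(ops)):
        t = dict(zip(ops, assign))
        if all(t[a] < t[b] for a, b in DEPENDS) and \
           all(t[a] != t[b] for a, b in SHARE_UNIT):
            return t
    return None

sched = schedule()
print(sched)  # {'load': 0, 'add': 1, 'store': 2}
```

The appeal of the declarative formulation is that adding a new architectural restriction means adding a constraint, not rewriting the scheduler; the solver absorbs the combinatorics.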
