SwePub
Search the SwePub database


Result list for search "WFRF:(McKee Sally A 1963)"


  • Results 1-10 of 55
1.
  • Fang, Z., et al. (author)
  • Active memory controller
  • 2012
  • In: Journal of Supercomputing. - : Springer Science and Business Media LLC. - 1573-0484 .- 0920-8542. ; 62:1, pp. 510-549
  • Journal article (peer-reviewed), abstract:
    • Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper, we present the microarchitecture design of AMC, and the programming model of AMOs. We compare AMOs' performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50x faster barriers, 12x faster spinlocks, 8.5x-15x faster stream/array operations, and 3x faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with decent accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high performance processor, based on a standard cell implementation.
  •  
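Where the coherence-message savings cited above come from can be illustrated with a toy traffic model (our own sketch with assumed per-update message counts for a simple write-invalidate protocol, not the paper's analytical model):

```python
# Toy model: n nodes take turns atomically updating one shared counter.
# Conventional: the cache line migrates to each updater in turn
# (request, data reply, invalidation, ack: ~4 messages per update).
# AMO: each update is a small request/reply pair executed in place at
# the counter's home memory controller (2 messages per update).

def conventional_msgs(n_updates: int) -> int:
    return 4 * n_updates

def amo_msgs(n_updates: int) -> int:
    return 2 * n_updates
```

Under these assumed counts the AMO scheme halves coherence traffic, and the gap widens further once contention-induced retries are modeled.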
2.
  • Brown, Martin K., et al. (author)
  • Agave: a Benchmark Suite for Exploring the Complexities of the Android Software Stack
  • 2016
  • In: ISPASS 2016 - International Symposium on Performance Analysis of Systems and Software. - 9781509019526 ; 31 May 2016, pp. 157-158
  • Conference paper (peer-reviewed), abstract:
    • Traditional suites used for benchmarking high-performance computing platforms or for architectural design space exploration use much simpler virtual memory layouts and multitasking/ multithreading schemes, which means that they cannot be used to study the complex interactions among the layers of the Android software stack. To demonstrate this, we present memory reference and concurrency data showing how Android applications differ from traditional C benchmarks. We propose the Agave suite of open-source applications as the basis for a standard, multipurpose Android benchmark suite. We make all sources and tools available in hopes that the community will adopt and build on this initial version of Agave.
  •  
3.
  • Larsen, P., et al. (author)
  • Parallelizing more loops with compiler guided refactoring
  • 2012
  • In: Proceedings of the International Conference on Parallel Processing. 41st International Conference on Parallel Processing, ICPP 2012, Pittsburgh, PA, 10-13 September 2012. - 0190-3918. - 9780769547961 ; , pp. 410-419
  • Conference paper (peer-reviewed), abstract:
    • The performance of many parallel applications relies not on instruction-level parallelism but on loop-level parallelism. Unfortunately, automatic parallelization of loops is a fragile process: many different obstacles affect or prevent it in practice. To address this predicament, we developed an interactive compilation feedback system that guides programmers in iteratively modifying their application source code. This helps leverage the compiler's ability to generate loop-parallel code. We employ our system to modify two sequential benchmarks dealing with image processing and edge detection, resulting in scalable parallelized code that runs up to 8.3 times faster on an eight-core Intel Xeon 5570 system and up to 12.5 times faster on a quad-core IBM POWER6 system. Benchmark performance varies significantly between the systems. This suggests that semi-automatic parallelization should be combined with target-specific optimizations. Furthermore, comparing the first benchmark to manually parallelized, hand-optimized pthreads and OpenMP versions, we find that code generated using our approach typically outperforms the pthreads code (within 93-339%). It also performs competitively against the OpenMP code (within 75-111%). The second benchmark outperforms manually parallelized and optimized OpenMP code (within 109-242%).
  •  
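The refactoring goal described above — reshape a loop until its iterations are provably independent so they can be split across cores — can be mimicked in miniature (the paper targets C compilers; this Python sketch with an explicit thread pool is our own illustration of the end state, not the paper's tool):

```python
from concurrent.futures import ThreadPoolExecutor

def edge_detect(a):
    # Each iteration reads only its neighbours and writes its own
    # output slot, so iterations are independent and disjoint chunks
    # may safely run concurrently.
    out = [0.0] * len(a)

    def do_chunk(lo, hi):
        for i in range(max(lo, 1), min(hi, len(a) - 1)):
            out[i] = a[i + 1] - a[i - 1]

    n, workers = len(a), 4
    bounds = [(k * n // workers, (k + 1) * n // workers) for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for f in [pool.submit(do_chunk, lo, hi) for lo, hi in bounds]:
            f.result()  # propagate any worker exception
    return out
```

Once iterations carry no cross-iteration dependence, the same loop structure is exactly what an auto-parallelizing compiler can turn into loop-parallel code.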
4.
  • Puzovic, N., et al. (author)
  • A multi-pronged approach to benchmark characterization
  • 2010
  • In: 2010 IEEE International Conference on Cluster Computing Workshops and Posters, Cluster Workshops 2010. - 9781424483969
  • Conference paper (other academic/artistic), abstract:
    • Understanding the behavior of current and future workloads is key for designers of future computer systems. If target workload characteristics are available, computer designers can use this information to optimize the system. This can lead to a chicken-and-egg problem: how does one characterize application behavior for an architecture that is a moving target and for which sophisticated modeling tools do not yet exist? We present a multi-pronged approach to benchmark characterization early in the design cycle. We collect statistics from multiple sources and combine them to create a comprehensive view of application behavior. We assume a fixed part of the system (service core) and a "to-be-designed" part that is gradually developed, guided by measurements taken on the fixed part. Data are collected from measurements on existing hardware, and statistics are obtained via emulation tools. These are supplemented with statistics extracted from traces and ILP information generated by the compiler. Although the motivation for this work is the classification of workloads for an embedded, reconfigurable, parallel architecture, the methodology can easily be adapted to other platforms. © 2010 IEEE.
  •  
5.
  • Bardizbanyan, Alen, 1986, et al. (author)
  • Improving Data Access Efficiency by Using a Tagless Access Buffer (TAB)
  • 2013
  • In: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2013. - 9781467355254 ; , pp. 269-279
  • Conference paper (peer-reviewed), abstract:
    • The need for energy efficiency continues to grow for many classes of processors, including those for which performance remains vital. The data cache is crucial for good performance, but it also represents a significant portion of the processor's energy expenditure. We describe the implementation and use of a tagless access buffer (TAB) that greatly improves data access energy efficiency while slightly improving performance. The compiler recognizes memory reference patterns within loops and allocates these references to a TAB. This combined hardware/software approach reduces energy usage by (1) replacing many level-one data cache (L1D) accesses with accesses to the smaller, more power-efficient TAB; (2) removing the need to perform tag checks or data translation lookaside buffer (DTLB) lookups for TAB accesses; and (3) reducing DTLB lookups when transferring data between the L1D and the TAB. Accesses to the TAB occur earlier in the pipeline, and data lines are prefetched from lower memory levels, which result in a small performance improvement. In addition, we can avoid many unnecessary block transfers between other memory hierarchy levels by characterizing how data in the TAB are used. With a combined size equal to that of a conventional 32-entry register file, a four-entry TAB eliminates 40% of L1D accesses and 42% of DTLB accesses, on average. This configuration reduces data-access related energy by 35% while simultaneously decreasing execution time by 3%.
  •  
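The source of the L1D access savings can be seen with a back-of-the-envelope model (our own toy calculation, not from the paper): for a stride-1 reference allocated to a TAB, every access hits the TAB and only one line fill per TAB line of data still touches the L1D.

```python
def l1d_accesses_saved(n_refs: int, words_per_line: int) -> float:
    """Fraction of L1D accesses eliminated for a stride-1 reference
    allocated to a TAB: all n_refs accesses hit the TAB, and only one
    line fill per words_per_line references still reaches the L1D."""
    fills = -(-n_refs // words_per_line)  # ceiling division
    return 1.0 - fills / n_refs
```

With an 8-word line this toy model saves 87.5% of L1D accesses for one reference stream; the paper's 40% average reflects that only some references in real programs have patterns the compiler can allocate to the TAB.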
6.
  • Bhadauria, Major, et al. (author)
  • Accomodating diversity in CMPs with heterogeneous frequencies
  • 2009
  • In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). - Berlin, Heidelberg : Springer Berlin Heidelberg. - 1611-3349 .- 0302-9743. - 9783540929895 ; 5409 LNCS, pp. 248-262
  • Conference paper (peer-reviewed), abstract:
    • Shrinking process technologies and growing chip sizes have profound effects on process variation. This leads to Chip Multiprocessors (CMPs) where not all cores operate at maximum frequency. Instead of simply disabling the slower cores or using guard banding (running all at the frequency of the slowest logic block), we investigate keeping them active, and examine performance and power efficiency of using frequency-heterogeneous CMPs on multithreaded workloads. With uniform workload partitioning, one might intuitively expect slower cores to degrade performance. However, with non-uniform workload partitioning, we find that using both low and high frequency cores improves performance and reduces energy consumption over just running faster cores. Thread scheduling and workload partitioning naturally play significant roles in these improvements. We find that using under-performing cores improves performance by 16% on average and saves CPU energy by up to 16% across the NAS and SPEC-OMP benchmarks on a quad-core AMD platform. Workload balancing via dynamic partitioning yields results within 5% of the overall ideal value. Finally, we show feasible methods to determine at run time whether using a heterogeneous configuration is beneficial. We validate our work through evaluation on a real CMP.
  •  
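The non-uniform workload partitioning at the heart of this result can be sketched simply (a minimal illustration of the idea, assuming static frequency-proportional chunking; the paper also studies dynamic partitioning):

```python
def partition(n_iters: int, freqs_mhz: list[int]) -> list[int]:
    """Split n_iters loop iterations across cores in proportion to
    their clock frequencies, so slower cores receive smaller chunks
    and all cores finish at roughly the same time."""
    total = sum(freqs_mhz)
    chunks = [n_iters * f // total for f in freqs_mhz]
    # Hand the rounding remainder to the fastest cores first.
    leftover = n_iters - sum(chunks)
    fastest_first = sorted(range(len(freqs_mhz)), key=lambda i: -freqs_mhz[i])
    for idx in fastest_first[:leftover]:
        chunks[idx] += 1
    return chunks
```

Uniform chunking would make every core wait for the slowest one; proportional chunking is what lets the under-performing cores contribute rather than drag down the whole run.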
7.
  • Bhadauria, Major, et al. (author)
  • An approach to resource-aware co-scheduling for CMPs
  • 2010
  • In: Proceedings of the International Conference on Supercomputing. - New York, NY, USA : ACM. - 9781450300186 ; , pp. 189-199
  • Conference paper (other academic/artistic), abstract:
    • We develop real-time scheduling techniques for improving performance and energy for multiprogrammed workloads that scale non-uniformly with increasing thread counts. Multithreaded programs generally deliver higher throughput than single-threaded programs on chip multiprocessors, but performance gains from increasing threads decrease when there is contention for shared resources. We use analytic metrics to derive local search heuristics for creating efficient multiprogrammed, multithreaded workload schedules. Programs are allocated fewer cores than requested, and scheduled to space-share the CMP to improve global throughput. Our holistic approach attempts to co-schedule programs that complement each other with respect to shared resource consumption. We find application co-scheduling for performance and energy in a resource-aware manner achieves better results than solely targeting total throughput or concurrently co-scheduling all programs. Our schedulers improve overall energy delay (E*D) by a factor of 1.5 over time-multiplexed gang scheduling. © 2010 ACM.
  •  
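The co-scheduling of complementary programs can be illustrated with a deliberately simplified greedy heuristic (our own sketch, standing in for the paper's local-search heuristics; the single "shared-cache demand" score per program is an assumed input):

```python
def co_schedule(progs: dict[str, float]) -> list[tuple[str, str]]:
    """Pair programs to space-share a CMP so that each pair's combined
    shared-resource demand is balanced: sort by demand, then repeatedly
    pair the heaviest remaining program with the lightest."""
    order = sorted(progs, key=progs.get)  # lightest first
    pairs = []
    while len(order) >= 2:
        light, heavy = order.pop(0), order.pop(-1)
        pairs.append((heavy, light))
    return pairs
```

Pairing heavy with light keeps any one schedule slot from being dominated by two cache-hungry programs contending for the same shared resources, which is the intuition behind resource-aware co-scheduling.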
8.
  • Bhadauria, Major, et al. (author)
  • Understanding PARSEC performance on contemporary CMPs
  • 2009
  • In: Proceedings of the 2009 IEEE International Symposium on Workload Characterization, IISWC 2009. - 9781424451562 ; , pp. 98-107
  • Conference paper (peer-reviewed), abstract:
    • PARSEC is a reference application suite used in industry and academia to assess new Chip Multiprocessor (CMP) designs. No investigation to date has profiled PARSEC on real hardware to better understand scaling properties and bottlenecks. This understanding is crucial in guiding future CMP designs for these kinds of emerging workloads. We use hardware performance counters, taking a systems-level approach and varying common architectural parameters: number of out-of-order cores, memory hierarchy configurations, number of simultaneous threads, number of memory channels, and processor frequencies. We find these programs to be largely compute-bound, and thus limited by number of cores, micro-architectural resources, and cache-to-cache transfers, rather than by off-chip memory or system bus bandwidth. Half the suite fails to scale linearly with increasing number of threads, and some applications saturate performance at few threads on all platforms tested. Exploiting thread level parallelism delivers greater payoffs than exploiting instruction level parallelism. To reduce power and improve performance, we recommend increasing the number of arithmetic units per core, increasing support for TLP, and reducing support for ILP.
  •  
9.
  • Bronevetsky, Greg, et al. (author)
  • Compiler-enhanced incremental checkpointing for OpenMP applications
  • 2009
  • In: 23rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2009; Rome; Italy; 23 May 2009 through 29 May 2009. - 9781424437504
  • Conference paper (peer-reviewed), abstract:
    • As modern supercomputing systems reach the petaflop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety of causes. Checkpointing is a popular technique for tolerating such failures, enabling applications to periodically save their state and restart computation after a failure. Although many automated system-level checkpointing solutions are currently available to HPC users, manual application-level checkpointing remains more popular due to its superior performance. This paper improves the performance of automated checkpointing via a compiler analysis for incremental checkpointing. This analysis, which works with both sequential and OpenMP applications, reduces checkpoint sizes by as much as 80% and enables asynchronous checkpointing.
  •  
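The core idea of incremental checkpointing — persist only the state that changed since the last checkpoint — can be sketched at runtime with content hashing (our own illustration; the paper instead uses a compiler analysis to identify unmodified data statically, avoiding this hashing cost):

```python
import hashlib

def incremental_checkpoint(memory: bytes, prev_hashes: dict[int, str],
                           block: int = 4096):
    """Return (changed_blocks, new_hashes): only blocks whose content
    hash differs from the previous checkpoint need to be written out."""
    changed, hashes = {}, {}
    for off in range(0, len(memory), block):
        chunk = memory[off:off + block]
        h = hashlib.sha256(chunk).hexdigest()
        hashes[off] = h
        if prev_hashes.get(off) != h:
            changed[off] = chunk  # dirty block: include in checkpoint
    return changed, hashes
```

After the first full checkpoint, a run that dirties one 4 KiB block writes one block instead of the whole address space, which is where the reported checkpoint-size reductions come from.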
10.
  • Chang, Y. S., et al. (author)
  • Extending on-chip interconnects for rack-level remote resource access
  • 2016
  • In: Proceedings of the 34th IEEE International Conference on Computer Design, ICCD 2016. - 1063-6404. - 9781509051427 ; , pp. 56-63
  • Conference paper (peer-reviewed), abstract:
    • The need to perform data analytics on exploding data volumes coupled with the rapidly changing workloads in cloud computing places great pressure on data-center servers. To improve hardware resource utilization across servers within a rack, we propose Direct Extension of On-chip Interconnects (DEOI), a high-performance and efficient architecture for remote resource access among server nodes. DEOI extends an SoC server node's on-chip interconnect to access resources in adjacent nodes with no protocol changes, allowing remote memory and network resources to be used as if they were local. Our results on a four-node FPGA prototype show that the latency of user-level, cross-node, random reads to DEOI-connected remote memory is as low as 1.16 µs, which beats current commercial technologies. We exploit DEOI remote access to improve performance of the Redis in-memory key-value framework by 47%. When using DEOI to access remote network resources, we observe an 8.4% average performance degradation and only a 2.52 µs ping-pong latency disparity compared to using local assets. These results suggest that DEOI can be a promising mechanism for increasing both performance and efficiency in next-generation data-center servers.
  •  
