SwePub
Search the SwePub database

  Extended search

Result list for search "L773:1530 0897"

Search: L773:1530 0897

  • Result 1-7 of 7
Sort/group results
   
1.
  • Angerd, Alexandra, 1988, et al. (author)
  • GBDI: Going Beyond Base-Delta-Immediate Compression with Global Bases
  • 2022
  • In: Proceedings - International Symposium on High-Performance Computer Architecture. - 1530-0897. - 9781665420273 ; 2022-April, pp. 1115-1127
  • Conference paper (peer-reviewed), abstract:
    • Memory bandwidth is limiting performance for many emerging applications. While compression techniques can unlock higher memory bandwidth, prior art offers only modestly better bandwidth. This paper contributes a new compression method, Global Base Delta Immediate compression (GBDI), that offers substantially higher memory bandwidth by, unlike prior art, selecting base values across memory blocks. GBDI selects its global bases with a novel clustering algorithm that analyzes data in the background. The presented accelerator infrastructure offers low area overhead and latency. This paper shows that GBDI offers a compression ratio of 2.3×, and yields 1.5× higher bandwidth and 1.1× higher performance compared with a baseline without compression support, on average, for SPEC2017 benchmarks requiring medium to high memory bandwidth. (A simplified sketch of the encoding follows this entry.)
  •  
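The base-plus-delta idea behind GBDI lends itself to a compact illustration. Below is a minimal sketch in C of the encoding step, assuming a fixed table of four global bases, 64-bit words, and signed 8-bit deltas; the actual design derives its bases from a background clustering pass over sampled data and supports other delta widths.

/* Minimal sketch of base+delta encoding in the spirit of GBDI
 * (Angerd et al., HPCA 2022). The base table, delta width, and
 * block size below are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define WORDS_PER_BLOCK 8
#define NUM_BASES 4

/* Global bases shared across memory blocks (assumed values; GBDI
 * derives these from clustering value distributions in the background). */
static const int64_t g_bases[NUM_BASES] = { 0, 0x1000, 0x7fff0000, -512 };

/* Try to encode one word as (base index, signed 8-bit delta).
 * Returns 1 on success, 0 if no base is close enough. */
static int encode_word(int64_t v, int *base_idx, int8_t *delta) {
    for (int b = 0; b < NUM_BASES; b++) {
        int64_t d = v - g_bases[b];
        if (d >= INT8_MIN && d <= INT8_MAX) {
            *base_idx = b;
            *delta = (int8_t)d;
            return 1;
        }
    }
    return 0;
}

int main(void) {
    int64_t block[WORDS_PER_BLOCK] = {
        0x1001, 0x1010, 0x7fff0004, 3, -510, 0x1005, 17, 0x7fff00ff
    };
    /* Compressed form: per word, 2 bits of base index + 8 bits of
     * delta instead of 64 bits, whenever a global base matches. */
    for (int i = 0; i < WORDS_PER_BLOCK; i++) {
        int b; int8_t d;
        if (encode_word(block[i], &b, &d))
            printf("word %d: base[%d] + %4d\n", i, b, (int)d);
        else
            printf("word %d: stored uncompressed\n", i);
    }
    return 0;
}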
2.
  • Dong, Jianbo, et al. (author)
  • Venice: Exploring Server Architectures for Effective Resource Sharing
  • 2016
  • In: Proceedings - International Symposium on High-Performance Computer Architecture. - 1530-0897. - 9781467392112 ; 2016-April, pp. 507-518
  • Conference paper (peer-reviewed), abstract:
    • Consolidated server racks are quickly becoming the backbone of IT infrastructure for science, engineering, and business alike. These servers are still largely built and organized as they were when servers existed as distributed, individual entities. Given that many fields increasingly rely on analytics of huge datasets, it makes sense to support flexible resource utilization across servers to improve cost-effectiveness and performance. We introduce Venice, a family of data-center server architectures that builds a strong communication substrate as a first-class resource for server chips. Venice provides a diverse set of resource-joining mechanisms that enables user programs to efficiently leverage non-local resources. To better understand the implications of design decisions about system support for resource sharing, we have constructed a hardware prototype that allows us to more accurately measure end-to-end performance of at-scale applications and to explore tradeoffs among performance, power, and resource-sharing transparency. We present results from our initial studies analyzing these tradeoffs when sharing memory, accelerators, or NICs. We find that it is particularly important to reduce or hide latency, that data-sharing access patterns should match the features of the communication channels employed, and that inter-channel collaboration can be exploited for better performance.
  •  
3.
  • Ejaz, Ahsen, 1986, et al. (author)
  • FastTrackNoC: A NoC with FastTrack Router Datapaths
  • 2022
  • In: Proceedings - International Symposium on High-Performance Computer Architecture. - 1530-0897. ; 2022-April, pp. 971-985
  • Conference paper (peer-reviewed), abstract:
    • This paper introduces FastTrackNoC, a Network-on-Chip (NoC) router architecture that reduces packet latency by bypassing its switch traversal (ST) stage. It is based on the observation that there is a bias in the direction a flit takes through a router; e.g., in a 2D mesh network, non-turning hops are preferred, especially when dimension-order routing is used. FastTrackNoC capitalizes on this observation and adds to a 2D mesh router a fast-track path between the head of a single input virtual channel (VC) buffer and its most popular, opposite output. This allows non-turning flits to bypass the ST logic, i.e., buffer, input, and output multiplexing, when the required router resources are available. FastTrackNoC combines ST bypassing with existing techniques for reducing latency, namely allocation bypassing, precomputed routing, and lookahead control signaling, to allow incoming flits in the best case to proceed directly to link traversal (LT). Moreover, it is applied to a Dual Data Rate (DDR) router in order to maximize network throughput. Post-place-and-route results in 28nm show the following: compared to previous DDR NoCs, FastTrackNoC offers 13-32% lower average packet latency; compared to previous multi-VC Single Data Rate (SDR) NoCs, FastTrackNoC reduces latency by 10-40% and achieves 18-21% higher throughput; and compared to a single-channel SDR NoC it offers up to 50% higher throughput at similar latency. (A simplified sketch of the fast-track condition follows this entry.)
  •  
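The fast-track condition itself, a non-turning flit skipping switch traversal when the needed resources are free, can be sketched compactly. The port encoding, resource flags, and decision function below are illustrative assumptions, not the paper's exact microarchitecture.

/* Minimal sketch of FastTrackNoC-style ST bypassing (Ejaz et al.,
 * HPCA 2022): a flit whose output port is opposite its input port
 * (a non-turning hop) may skip the switch stage when the fast-track
 * path and downstream buffer space are free. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { NORTH, SOUTH, EAST, WEST, NUM_PORTS } port_t;

/* Opposite port in a 2D mesh: NORTH<->SOUTH, EAST<->WEST. */
static port_t opposite(port_t p) {
    switch (p) {
    case NORTH: return SOUTH;
    case SOUTH: return NORTH;
    case EAST:  return WEST;
    default:    return EAST;
    }
}

typedef struct {
    bool fast_track_free[NUM_PORTS];   /* fast-track path idle at input */
    bool downstream_credit[NUM_PORTS]; /* space in next router's buffer */
} router_state_t;

/* A flit takes the fast track only for a non-turning hop with the
 * needed resources available; otherwise it goes through the regular
 * allocation and switch-traversal pipeline. */
static bool take_fast_track(const router_state_t *r, port_t in, port_t out) {
    return out == opposite(in)
        && r->fast_track_free[in]
        && r->downstream_credit[out];
}

int main(void) {
    router_state_t r = {
        .fast_track_free   = { true, true, true, true },
        .downstream_credit = { true, true, false, true },
    };
    printf("N->S: %s\n", take_fast_track(&r, NORTH, SOUTH) ? "fast track" : "pipeline");
    printf("N->E: %s\n", take_fast_track(&r, NORTH, EAST)  ? "fast track" : "pipeline");
    return 0;
}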
4.
  • Knyaginin, Dmitry, 1983, et al. (author)
  • ProFess: A Probabilistic Hybrid Main Memory Management Framework for High Performance and Fairness
  • 2018
  • In: Proceedings - International Symposium on High-Performance Computer Architecture. - 1530-0897. ; 2018-February, pp. 143-155
  • Conference paper (peer-reviewed), abstract:
    • Non-Volatile Memory (NVM) technologies enable cost-effective hybrid main memories with two partitions: M1 (DRAM) and the slower but larger M2 (NVM). This paper considers a flat, migrating organization of hybrid memories. A challenging and open issue in managing such memories is allocating M1 among co-running programs such that high fairness is achieved at the same time as high performance. This paper introduces ProFess: a Probabilistic hybrid main memory management Framework for high performance and fairness. It comprises: i) a Relative-Slowdown Monitor (RSM) that enables fair management by indicating which program suffers the most from competition for M1; and ii) a probabilistic Migration-Decision Mechanism (MDM) that unlocks high performance by performing an individual cost-benefit analysis for each pair of data blocks considered for migration. Within ProFess, RSM guides MDM toward high fairness. We show that for the multiprogrammed workloads evaluated, ProFess improves fairness by 15% (avg.; up to 29%) compared to the state-of-the-art, while outperforming it by 12% (avg.; up to 29%). (A simplified sketch of the probabilistic decision follows this entry.)
  •  
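The flavor of MDM's probabilistic cost-benefit decision can be sketched as below: a candidate swap between an M1 block and an M2 block is accepted with a probability that grows with its estimated net benefit, biased by an RSM-style relative-slowdown signal. The estimators, constants, and acceptance formula are illustrative assumptions, not the paper's mechanism.

/* Sketch of a probabilistic migration decision in the spirit of
 * ProFess's MDM (Knyaginin et al., HPCA 2018). All estimators and
 * constants are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    double hot_accesses;  /* predicted future accesses of the M2 block */
    double cold_accesses; /* predicted future accesses of the M1 block */
    double rel_slowdown;  /* RSM-style signal: >1 means this program
                             suffers more than others from M1 contention */
} candidate_t;

#define LATENCY_GAP_NS    60.0  /* assumed M2-minus-M1 access latency */
#define MIGRATION_COST_NS 800.0 /* assumed cost of swapping two blocks */

/* Accept the swap with probability that rises with net benefit and
 * is scaled up for programs RSM reports as most slowed down. */
static int decide_swap(const candidate_t *c) {
    double benefit = (c->hot_accesses - c->cold_accesses) * LATENCY_GAP_NS;
    double p = benefit * c->rel_slowdown / (benefit + MIGRATION_COST_NS);
    if (p <= 0.0) return 0;
    if (p >= 1.0) return 1;
    return (double)rand() / RAND_MAX < p;
}

int main(void) {
    candidate_t hot  = { 40.0, 2.0, 1.3 }; /* clear winner, suffering program */
    candidate_t cold = { 3.0, 2.5, 0.9 };  /* marginal pair */
    printf("hot pair:  %s\n", decide_swap(&hot)  ? "swap" : "keep");
    printf("cold pair: %s\n", decide_swap(&cold) ? "swap" : "keep");
    return 0;
}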
5.
  • Manivannan, Madhavan, 1986, et al. (author)
  • RADAR: Runtime-assisted dead region management for last-level caches
  • 2016
  • In: Proceedings - International Symposium on High-Performance Computer Architecture. - 1530-0897. - 9781467392112 ; 2016-April, pp. 644-656
  • Conference paper (peer-reviewed), abstract:
    • Last-level caches (LLCs) bridge the processor/memory speed gap and reduce energy consumed per access. Unfortunately, LLCs are poorly utilized because of the relatively large occurrence of dead blocks. We propose RADAR, a hybrid static/dynamic dead-block management technique that can accurately predict and evict dead blocks in LLCs. RADAR performs dead-block prediction and eviction at the granularity of the address regions supported in many of today's task-parallel programming models. The runtime system utilizes static control-flow information about future region accesses in conjunction with past region access patterns to make accurate predictions about dead regions. The runtime system instructs the cache to demote and eventually evict blocks belonging to such dead regions. This paper considers three RADAR schemes for predicting dead regions: a scheme that uses control-flow information provided by the programming model (look-ahead), a history-based scheme (look-back), and a combined scheme (look-ahead and look-back). Our evaluation shows that, on average, all RADAR schemes outperform state-of-the-art hardware dead-block prediction techniques, with the combined scheme always performing best. (A simplified sketch of the two predictors follows this entry.)
  •  
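The combined scheme can be sketched as two simple predicates over a per-region record: look-ahead fires when no known future task accesses the region, look-back when the accesses observed this pass reach the count seen in earlier passes. The table layout and both predicates are illustrative assumptions.

/* Sketch of RADAR-style dead-region prediction (Manivannan et al.,
 * HPCA 2016): a region declared dead by either predictor has its
 * blocks demoted and eventually evicted from the LLC. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    const char *name;
    int future_readers;    /* look-ahead: tasks still due to access it */
    int accesses_seen;     /* look-back: accesses observed this pass */
    int accesses_expected; /* look-back: accesses seen in earlier passes */
} region_t;

static bool lookahead_dead(const region_t *r) {
    return r->future_readers == 0;
}

static bool lookback_dead(const region_t *r) {
    return r->accesses_expected > 0
        && r->accesses_seen >= r->accesses_expected;
}

/* Combined scheme: either predictor may declare the region dead. */
static bool region_dead(const region_t *r) {
    return lookahead_dead(r) || lookback_dead(r);
}

int main(void) {
    region_t regions[] = {
        { "A", 0, 3, 4 }, /* look-ahead: no future readers -> dead */
        { "B", 2, 4, 4 }, /* look-back: all expected accesses seen */
        { "C", 1, 1, 4 }, /* still live by both predictors */
    };
    for (int i = 0; i < 3; i++)
        printf("region %s: %s\n", regions[i].name,
               region_dead(&regions[i]) ? "demote/evict" : "keep");
    return 0;
}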
6.
  • Negi, Anurag, 1980, et al. (author)
  • Pi-TM: Pessimistic Invalidation for Scalable Lazy Hardware Transactional Memory
  • 2012
  • In: Proceedings - International Symposium on High-Performance Computer Architecture. - 1530-0897. - 9781467308243 ; pp. 141-151
  • Conference paper (peer-reviewed), abstract:
    • Lazy hardware transactional memory has been shown to be more efficient at extracting available concurrency than its eager counterpart. However, it poses scalability challenges at commit time, as the existence of conflicts among concurrent transactions is not known prior to commit. Non-conflicting transactions may have to wait before committing, severely affecting performance in certain workloads. Early conflict detection can be employed to allow such transactions to commit simultaneously. In this paper we show that the potential of this technique has not yet been fully utilized, with design choices in prior work severely burdening common-case transactional execution to avoid some relatively uncommon correctness concerns. The paper quantifies the severity of the problem and develops Pi-TM, an early-conflict-detection, lazy-conflict-resolution design. This design highlights how, with modest extensions to existing directory-based coherence protocols, information regarding possible conflicts can be effectively used to achieve true parallelism at commit without burdening the common case. We leverage the observation that contention is typically seen on only a small fraction of the shared data accessed by coarse-grained transactions. Pessimistic invalidation of such lines when committing or aborting therefore enables fast common-case execution. Our results show that Pi-TM performs consistently well and, in particular, far better than previous work on early conflict detection in lazy HTM. We also identify a pathological scenario that lazy designs with early conflict detection suffer from and propose a simple hardware workaround to sidestep it. (A simplified sketch of the commit check follows this entry.)
  •  
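The commit-time payoff of early conflict detection can be sketched as follows: conflicts are recorded as coherence traffic exposes them during execution, so a transaction whose recorded conflict set is empty commits without global arbitration. The bitmask bookkeeping below is an illustrative assumption, not the paper's directory-protocol extension.

/* Sketch of the commit-parallelism idea behind Pi-TM (Negi et al.,
 * HPCA 2012): conflicts are recorded eagerly, so conflict-free
 * transactions commit at once while only conflicting ones arbitrate. */
#include <stdint.h>
#include <stdio.h>

#define MAX_TXNS 8

typedef struct {
    uint8_t conflicts; /* bit i set: possible conflict with txn i */
} txn_t;

/* Early detection: when txn a's access touches data speculatively
 * accessed by txn b, record the possible conflict on both sides. */
static void record_conflict(txn_t *txns, int a, int b) {
    txns[a].conflicts |= (uint8_t)(1u << b);
    txns[b].conflicts |= (uint8_t)(1u << a);
}

/* Lazy resolution at commit: a conflict-free transaction commits
 * immediately; a conflicting one must arbitrate, with the loser's
 * contended lines pessimistically invalidated. */
static const char *commit_action(const txn_t *txns, int id) {
    return txns[id].conflicts == 0 ? "commit immediately" : "arbitrate";
}

int main(void) {
    txn_t txns[MAX_TXNS] = { {0} };
    record_conflict(txns, 1, 3); /* coherence traffic exposed a conflict */
    for (int i = 0; i < 4; i++)
        printf("txn %d: %s\n", i, commit_action(txns, i));
    return 0;
}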
7.
  • Vasilakis, Evangelos, 1985, et al. (author)
  • Hybrid2: Combining Caching and Migration in Hybrid Memory Systems
  • 2020
  • In: Proceedings - International Symposium on High-Performance Computer Architecture. - 1530-0897. ; pp. 649-662
  • Conference paper (peer-reviewed), abstract:
    • This paper considers a hybrid memory system composed of memory technologies with different characteristics; in particular, a small near memory exhibiting high bandwidth, i.e., 3D-stacked DRAM, and a larger far memory offering capacity at lower bandwidth, i.e., off-chip DRAM. In the past, the near memory of such a system has been used either as a DRAM cache or as part of a flat address space combined with a migration mechanism. Caches and migration offer different tradeoffs (between performance, main memory capacity, data transfer costs, etc.) and share similar challenges related to data-transfer granularity and metadata management. This paper proposes Hybrid2, a new hybrid memory system architecture that combines a DRAM cache with a migration scheme. Hybrid2 does not deny valuable capacity from the memory system because it uses only a small fraction of the near memory as a DRAM cache; 64MB in our experiments. It further leverages the DRAM cache as a staging area to select the data most suitable for migration. Finally, Hybrid2 alleviates the metadata overheads of both DRAM caches and migration using a common mechanism. Using near-to-far memory ratios of 1:16, 1:8 and 1:4 in our experiments, Hybrid2 on average outperforms current state-of-the-art migration schemes by 7.9%, 9.1% and 6.4%, respectively. In the same system configurations, compared to DRAM caches, Hybrid2 gives away on average only 0.3%, 1.2%, and 5.3% of performance while offering 5.9%, 12.1%, and 24.6% more main memory capacity, respectively. (A simplified sketch of the staging mechanism follows this entry.)
  •  
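The staging mechanism, using the small DRAM cache to decide which blocks earn a permanent move to near memory, can be sketched as below. The hit-count threshold and data structures are illustrative assumptions.

/* Sketch of Hybrid2's staging idea (Vasilakis et al., HPCA 2020):
 * far-memory blocks are first installed in a small near-memory DRAM
 * cache; only blocks that prove hot there are migrated into the flat
 * near-memory region. */
#include <stdio.h>

#define MIGRATE_THRESHOLD 4 /* assumed: cache hits needed before migration */

typedef struct {
    long addr;
    int  hits;     /* accesses while resident in the DRAM cache */
    int  migrated; /* promoted to the flat near-memory region */
} cached_block_t;

/* On each DRAM-cache hit, bump the counter; once the block has proven
 * itself hot, migrate it so it is served from near memory directly
 * and stops occupying staging space. */
static void on_cache_hit(cached_block_t *b) {
    if (b->migrated) return;
    if (++b->hits >= MIGRATE_THRESHOLD) {
        b->migrated = 1;
        printf("block 0x%lx: migrated to near memory\n", b->addr);
    }
}

int main(void) {
    cached_block_t hot  = { 0x1000, 0, 0 };
    cached_block_t cool = { 0x2000, 0, 0 };
    for (int i = 0; i < 6; i++) on_cache_hit(&hot);  /* crosses threshold */
    for (int i = 0; i < 2; i++) on_cache_hit(&cool); /* stays staged */
    printf("block 0x%lx: %s\n", cool.addr,
           cool.migrated ? "migrated" : "still staged in DRAM cache");
    return 0;
}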
