SwePub
Search the SwePub database


Hit list for the search "(db:Swepub) pers:(Chen Xiaowen) srt2:(2019)"


  • Results 1-2 of 2
1.
  • Chen, Xiaowen, 1982- (author)
  • Efficient Memory Access and Synchronization in NoC-based Many-core Processors
  • 2019
  • Doctoral thesis (other academic/artistic), abstract:
    • In NoC-based many-core processors, the memory subsystem and the synchronization mechanism are two key design aspects, since mining parallelism and pursuing higher performance require not only optimized memory management but also an efficient synchronization mechanism. We are therefore motivated to research efficient memory access and synchronization under three topics: efficient on-chip memory organization, fair shared memory access, and efficient many-core synchronization.

      One major way of optimizing memory performance is to construct a suitable and efficient memory organization. A distributed memory organization suits NoC-based many-core processors, since it scales well. We consider it essential to support Distributed Shared Memory (DSM) because of the huge amount of legacy code and the ease of programming it offers. Therefore, we first adopt a microcoded approach to address DSM issues, aiming for hardware performance while maintaining the flexibility of programs. Second, we further optimize DSM performance by reducing the virtual-to-physical address translation overhead. In addition to general-purpose memory organizations such as DSM, special-purpose memory organizations can optimize application-specific memory access. We choose the Fast Fourier Transform (FFT) as the target application and propose a multi-bank data memory specialized for FFT computation.

      In 3D NoC-based many-core processors, processor cores and memories reside at different locations (center, corner, edge, etc.) of different layers, so memory accesses behave differently because of their different communication distances. As the network size increases, this difference in communication distance widens, resulting in unfair memory access performance among the processor cores. The unfairness can give some memory accesses high latency and thus degrade overall system performance. We therefore study on-chip memory and DRAM access fairness in 3D NoC-based many-core processors by narrowing the round-trip latency difference of memory accesses and reducing the maximum memory access latency.

      Barrier synchronization is used to synchronize the execution of parallel processor cores. Conventional barrier synchronization approaches such as master-slave, all-to-all, tree-based, and butterfly are algorithm oriented. When many processor cores are networked on a single chip, contended synchronization requests can cause a large performance penalty. Motivated by this, and in contrast to the algorithm-based approaches, we take another direction, exploiting efficient communication, to address the barrier synchronization problem. We propose cooperative communication as a means and combine it with the master-slave and all-to-all algorithms to achieve efficient many-core barrier synchronization. In addition, a multi-FPGA implementation case study of fast many-core barrier synchronization is conducted.
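The thesis abstract above contrasts its cooperative-communication barriers with conventional, algorithm-oriented schemes such as master-slave and all-to-all, whose contended synchronization requests become costly as the core count grows. Purely as a point of reference, the C sketch below (not taken from the thesis; NUM_CORES and all identifiers are illustrative) shows a conventional centralized, sense-reversing barrier: every arriving core atomically decrements one shared counter, so on a NoC all requests contend at a single memory location, which is the penalty the thesis targets by exploiting efficient communication.

    /* Minimal sketch of a conventional centralized, sense-reversing barrier
     * (illustrative baseline only, not the thesis design). Each core's
     * atomic update of `count` and each poll of `sense` is a shared-memory
     * transaction over the NoC, so contention grows with the core count. */
    #include <stdatomic.h>
    #include <stdbool.h>

    #define NUM_CORES 64                      /* assumed core count */

    static atomic_int  count = NUM_CORES;     /* cores still to arrive */
    static atomic_bool sense = false;         /* flips each barrier round */

    /* Each core keeps its own local_sense, initialized to false. */
    void barrier_wait(bool *local_sense)
    {
        *local_sense = !*local_sense;         /* sense expected this round */

        if (atomic_fetch_sub(&count, 1) == 1) {
            /* Last arrival: reset the counter, then release all waiters. */
            atomic_store(&count, NUM_CORES);
            atomic_store(&sense, *local_sense);
        } else {
            /* Spin until the last arrival flips the global sense. */
            while (atomic_load(&sense) != *local_sense)
                ;                             /* each poll is a NoC read */
        }
    }

In such a baseline all NUM_CORES cores hammer the same counter and flag; the thesis instead moves the gather-and-release work into cooperative communication in the network, combined with the master-slave and all-to-all algorithms.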
2.
  • Wang, Z., et al. (author)
  • Load-balanced link distribution in mesh-based many-core systems
  • 2019
  • In: 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City, and 5th IEEE International Conference on Data Science and Systems (HPCC/SmartCity/DSS 2019), 10-12 Aug. 2019. Institute of Electrical and Electronics Engineers (IEEE), pp. 1028-1034.
  • Conference paper (peer-reviewed), abstract:
    • The Network-on-Chip (NoC) is becoming the fundamental interconnect infrastructure of modern chip multiprocessors (CMPs). As a mesh-based network scales up, the unequal positions of its links gradually cause an unbalanced traffic load across the links. In a mesh network the central regions easily become hotspots, and in the context of a non-uniform cache architecture (NUCA) the central links are more heavily utilized than the peripheral links. In contrast to the traditional uniform interconnection between network nodes, we propose a load-balanced link distribution scheme that assigns physical channels in accordance with the traffic load of each link. In this paper, we analyze the traffic load distribution for mesh networks of different scales and give the corresponding load-balanced link distributions. The simulation results indicate that the load-balanced scheme achieves not only lower physical channel cost but also better network and system performance than the traditional uniform scheme. Experiments with synthetic traffic show that the load-balanced scheme exhibits 57.33%/60.23%/47.56% lower network latency at the saturation point on average compared with the uniform scheme for 8x8/10x10/12x12 mesh networks, respectively. At the same time, the load-balanced link distribution scheme uses fewer physical channels, with reductions in physical channel cost of 7.14%/5.56%/15.15% for 8x8/10x10/12x12 mesh networks, respectively. Experiments with PARSEC benchmarks reveal that the load-balanced scheme achieves a 2.1% improvement in system throughput.
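The premise of the paper above is that, under dimension-ordered XY routing and roughly uniform traffic, the central links of a mesh carry far more load than the peripheral ones, so physical channels should be assigned per link according to that load. The C sketch below is not the paper's code (N and all identifiers are illustrative); it merely counts, for an N x N mesh, how many source/destination pairs cross each horizontal link under XY routing, which makes the central hotspot pattern the abstract describes visible.

    /* Illustrative sketch (not the paper's method): count how many of the
     * N*N x N*N source/destination pairs cross each horizontal link of an
     * N x N mesh under dimension-ordered XY routing. The skewed counts show
     * why central links saturate before peripheral ones. */
    #include <stdio.h>

    #define N 8   /* mesh size; 8x8 matches one configuration in the paper */

    int main(void)
    {
        /* load[y][c] = flows crossing between columns c and c+1 on row y */
        static long load[N][N - 1] = {{0}};

        for (int sy = 0; sy < N; sy++)
            for (int sx = 0; sx < N; sx++)
                for (int dy = 0; dy < N; dy++)
                    for (int dx = 0; dx < N; dx++) {
                        if (sx == dx)
                            continue;              /* no X hops needed */
                        /* XY routing: all X hops happen on the source row */
                        int step = (dx > sx) ? 1 : -1;
                        for (int x = sx; x != dx; x += step) {
                            int c = (step > 0) ? x : x - 1;
                            load[sy][c]++;
                        }
                    }

        for (int y = 0; y < N; y++) {
            for (int c = 0; c < N - 1; c++)
                printf("%5ld ", load[y][c]);
            printf("\n");
        }
        return 0;
    }

In this simple model with N = 8, each row's outermost column pairs are crossed 112 times while the central pair is crossed 256 times, roughly a 2.3x imbalance; vertical links show the same pattern along columns. That skew is the kind of imbalance a load-balanced channel assignment compensates for.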
