SwePub

Result list for search "(db:Swepub) pers:(Chen Xiaowen) srt2:(2010)"

  • Result 1-10 of 10
1.
  • Candaele, Bernard, et al. (author)
  • Mapping Optimisation for Scalable multi-core ARchiTecture : The MOSART approach
  • 2010
  • In: Proceedings - IEEE Annual Symposium on VLSI, ISVLSI 2010. - 9780769540764 ; pp. 518-523
  • Conference paper (peer-reviewed)
    • The project will address two main challenges of prevailing architectures: 1) The global interconnect and memory bottleneck due to a single, globally shared memory with high access times and power consumption; 2) The difficulties in programming heterogeneous, multi-core platforms, in particular in dynamically managing data structures in distributed memory. MOSART aims to overcome these through a multi-core architecture with distributed memory organisation, a Network-on-Chip (NoC) communication backbone and configurable processing cores that are scaled, optimised and customised together to achieve diverse energy, performance, cost and size requirements of different classes of applications. MOSART achieves this by: A) Providing platform support for management of abstract data structures, including middleware services and a run-time data manager for NoC based communication infrastructure; B) Developing tool support for parallelizing and mapping applications on the multi-core target platform and customizing the processing cores for the application.
2.
  • Chen, Xiaowen, et al. (author)
  • Area and Performance Optimization of Barrier Synchronization on Multi-core Network-on-Chips
  • 2010
  • In: 3rd IEEE International Conference on Computer and Electrical Engineering (ICCEE).
  • Conference paper (peer-reviewed)
    • Barrier synchronization is commonly and widely used to synchronize the execution of parallel processor cores on multi-core Network-on-Chips (NoCs). Since its global nature may cause heavy serialization resulting in a large performance penalty, barrier synchronization should be carefully designed to have low latency communication and to minimize overall completion time. Therefore, in this paper, we propose a fast barrier synchronization mechanism, targeting multi-core NoCs. The fast barrier synchronization mechanism includes a dedicated hardware module, named Fast Barrier Synchronizer (FBS), integrated with each processor node. It offers a set of barrier counters and can concurrently process synchronization requests issued by the local node and remote nodes via the on-chip network. The salient feature of our fast barrier synchronization mechanism is that, once the barrier condition is reached, the "barrier release" acknowledgement is routed to all processor nodes in a broadcast way in order to save chip area by avoiding storing source node information and to minimize completion time by avoiding serialization of barrier releasing. Synthesis results suggest that the FBS can run over 1 GHz in SMIC® 130nm technology with small area overhead. We implemented an FBS-enhanced multi-core NoC architecture on our FPGA platform using the Xilinx® Virtex 5 as the FPGA chip. FPGA utilization and simulation results show that our fast barrier synchronization demonstrates both area and performance advantages over the barrier synchronization counterpart with unicast barrier releasing.
3.
  • Chen, Xiaowen, et al. (author)
  • Handling Shared Variable Synchronization in Multi-core Network-on-Chips with Distributed Memory
  • 2010
  • In: Proceedings. - 9781424466832 ; pp. 467-472
  • Conference paper (peer-reviewed)
    • Parallelized shared variable applications running on multi-core Network-on-Chips (NoCs) require efficient support for synchronization, since communication is on the critical path of system performance and contended synchronization requests may cause a large performance penalty. In this paper, we propose a dedicated hardware module for synchronization management. This module is called Synchronization Handler (SH), integrated with each processor-memory node on the multi-core NoCs. It uses two physical buffers to concurrently process synchronization requests issued by the local processor and remote processors via the on-chip network. One salient feature is that the two physical buffers are dynamically allocated to form multiple virtual buffers (a virtual buffer is related to a shared synchronization variable) so as to improve the buffer utilization and alleviate the head-of-line blocking. Synthesis results suggest that the SH can run over 900 MHz in 130nm technology with small area overhead. To justify the SH-enhanced multi-core NoCs, we employ synthetic workloads to evaluate synchronization cost and buffer utilization, and run synchronization-intensive applications to investigate speedup. The results show that our approach is viable.
4.
  • Chen, Xiaowen, et al. (author)
  • Kinetics and mechanism of autohydrolysis of hardwoods
  • 2010
  • In: Bioresource Technology. - : Elsevier BV. - 0960-8524 .- 1873-2976. ; 101:20, pp. 7812-7819
  • Journal article (peer-reviewed)
    • Autohydrolysis using water is a promising method to extract hemicelluloses from wood prior to pulping in order to make co-products such as ethanol and acetic acid besides pulp. Many studies have been carried out on the kinetics and mechanism of autohydrolysis using batch reactors. The present study was performed in a continuous mixed flow reactor where the wood chips are retained in a basket inside the reactor. This reactor is well suited to determine intrinsic kinetics of hemicellulose dissolution because the dissolved products are rapidly removed from the reactor, thus minimizing further hydrolysis and degradation of the hemicelluloses in solution. The xylan removal rate follows an S-shaped behavior. GPC analysis of the continuously removed extract shows that the dissolved xylan oligomers have a DP smaller than about 25. Lignin-free xylan oligomers and cellulose oligomers are the major components dissolved in the initial stage of autohydrolysis, while xylan covalently bound to lignin (i.e. an LCC) is the major component removed during the later stage of autohydrolysis. The molecular weight of the dissolved components decreases with time in the second stage. The kinetics of xylan removal are explained in terms of a mechanism based on recent knowledge of the ultrastructure of the cell fibre wall.
5.
  • Chen, Xiaowen, et al. (author)
  • Multi-FPGA Implementation of a Network-on-Chip Based Many-core Architecture with Fast Barrier Synchronization Mechanism
  • 2010
  • In: Proceedings of the IEEE Norchip Conference. - 9781424489732
  • Conference paper (peer-reviewed)
    • In this paper, we propose a fast barrier synchronization mechanism, targeting Network-on-Chip based many-core architectures. Its salient feature is that, once the barrier condition is reached, the "barrier release" acknowledgement is routed to all processor nodes in a broadcast way in order to save area by avoiding storing source node information and to minimize completion time by eliminating serialization of barrier releasing. Then, we construct a multi-FPGA platform using Xilinx® Virtex 5 as FPGA chips and implement a NoC based many-core architecture on it. FPGA utilization and simulation results show that our mechanism demonstrates both area and performance advantages over the barrier synchronization counterpart with unicast barrier releasing.
6.
  • Chen, Xiaowen, 1982-, et al. (author)
  • Run-time Partitioning of Hybrid Distributed Shared Memory on Multi-core Network-on-Chips
  • 2010
  • In: The 3rd IEEE International Symposium on Parallel Architectures, Algorithms and Programming (PAAP 2010). - : Institute of Electrical and Electronics Engineers (IEEE). ; pp. 39-46
  • Conference paper (peer-reviewed)
    • On multi-core Network-on-Chips (NoCs), memories are preferably distributed, and supporting Distributed Shared Memory (DSM) is essential for the sake of reusing the huge amount of legacy code and easing programming. However, the DSM organization imports the inherent overhead of translating virtual memory addresses into physical memory addresses, which degrades performance. We observe that, in parallel applications, different data have different properties (private or shared). For private data accesses, it is unnecessary to perform virtual-to-physical address translations. Even for the same datum, its property may change in different phases of the program execution. Therefore, this paper focuses on decreasing the overhead of virtual-to-physical address translation, and hence improving the system performance, by introducing a hybrid DSM organization and supporting run-time partitioning according to the data property. The hybrid DSM organization aims at supporting fast physical memory accesses for private data and maintaining a global, single virtual memory space for shared data. Based on the data property of parallel applications, the run-time partitioning supports changing the hybrid DSM organization during the program execution. It ensures fast physical memory addressing on private data and conventional virtual memory addressing on shared data, improving the performance of the entire system by reducing virtual-to-physical address translation overhead as much as possible. We formulate the run-time partitioning of the hybrid DSM organization in order to analyze its performance. A real DSM based multi-core NoC platform is also constructed. The experimental results of real applications show that the hybrid DSM organization with run-time partitioning demonstrates a performance advantage over the conventional DSM counterpart. The percentage of performance improvement depends on the problem size, the way of data partitioning and the computation/communication ratio of parallel applications, the network size of the system, etc. In our experiments, the maximal improvement is 34.42% and the minimal improvement is 3.68%.
7.
  • Chen, Xiaowen, et al. (author)
  • Supporting Distributed Shared Memory on Multi-core Network-on-Chips Using a Dual Microcoded Controller
  • 2010
  • In: Proceedings of the conference for Design Automation and Test in Europe. ; pp. 39-44
  • Conference paper (peer-reviewed)
    • Supporting Distributed Shared Memory (DSM) is essential for multi-core Network-on-Chips for the sake of reusing the huge amount of legacy code and easy programmability. We propose a microcoded controller as a hardware module in each node to connect the core, the local memory and the network. The controller is programmable, and the DSM functions such as virtual-to-physical address translation, memory access and synchronization are realized using microcode. To enable concurrent processing of memory requests from the local and remote cores, our controller features two mini-processors, one dealing with requests from the local core and the other with requests from remote cores. Synthesis results suggest that the controller consumes 51k gates for the logic and can run up to 455 MHz in 130 nm technology. To evaluate its performance, we use synthetic and application workloads. Results show that, when the system size is scaled up, the delay overhead incurred by the controller may become less significant when compared with the network delay. In this way, the delay efficiency of our DSM solution is close to hardware solutions on average but still has all the flexibility of software solutions.
8.
  • Chen, Xiaowen, 1982-, et al. (author)
  • Supporting Efficient Synchronization in Multi-core NoCs Using Dynamic Buffer Allocation Technique
  • 2010
  • In: Proceedings of the IEEE Annual Symposium on VLSI. - : Institute of Electrical and Electronics Engineers (IEEE). ; pp. 462-463
  • Conference paper (peer-reviewed)
    • This paper explores a dynamic buffer allocation technique to guide a distributed synchronization architecture to support efficient synchronization on multi-core Network-on-Chips (NoCs). The synchronization architecture features two physical buffers to be able to concurrently queue and handle synchronization requests issued by the local processor and remote processors via the on-chip network. Using the dynamic buffer allocation technique, the two physical buffers are dynamically allocated to form multiple virtual buffers in order to improve buffers' utilization. Experiments are carried out to evaluate buffers' utilization.
9.
  • Chen, Yancang, et al. (author)
  • A Trace-driven Hardware-level Simulator for Design and Verification of Network-on-Chips
  • 2010
  • In: 2011 INTERNATIONAL CONFERENCE ON COMPUTERS, COMMUNICATIONS, CONTROL AND AUTOMATION (CCCA 2011), VOL II. - : IEEE. ; pp. 32-35
  • Conference paper (peer-reviewed)
    • Traditional communications of general-purpose multi-core processor and application-specific System-on-Chip face challenges in terms of scalability and complexity. Network-on-Chip (NoC) has been the most promising solution for the communications of multi-core and many-core chips. In this paper, we present a trace-driven hardware-level simulator (noted HS) based on SystemVerilog for the design and verification of NoCs. Different from the state-of-the-art NoC simulators, the HS owns three important characteristics in addition to the capability of creating simulation and synthesizable NoC descriptions: 1) hardware-level simulation can be done, which means more implementation details of hardware than flit-level simulation; 2) router debugging and verification can be done at RTL by inserting assertions and coverage; 3) trace-based application simulations can be done besides synthetic workloads. A 4 X 4 2D mesh NoC with output virtual-channel routers verifies the capability of our HS.
10.
  • Naeem, Abdul, et al. (author)
  • Scalability of Weak Consistency in NoC based Multicore Architectures
  • 2010
  • In: IEEE INT SYMP CIRC SYST PROC. - New York : IEEE. - 9781424453085 ; pp. 3497-3500
  • Conference paper (peer-reviewed)
    • In Multicore Network-on-Chip, it is preferable to realize distributed but shared memory (DSM) in order to reuse the huge amount of legacy code and ease programming. Within DSM systems, memory consistency is a critical issue since it affects not only performance but also the correctness of programs. In this paper, we investigate the scalability of the weak consistency model, which may be implemented using a transaction counter. The experimental results compare synchronization latencies for various network sizes, topologies and lock positions in the network. Average synchronization latency rises exponentially for mesh and torus topologies as the network size grows. However, torus improves the synchronization latency in comparison to mesh. For a mesh topology network, average synchronization latency is also slightly affected by the lock position with respect to the network center.