Hit list for search "WFRF:(Chen Xiaowen)"

Search: WFRF:(Chen Xiaowen)

Results 1-35 of 35

No.	Reference	Cover image	Find
1.	Beal, Jacob, et al. (author) Robust estimation of bacterial cell count from optical density 2020 In: Communications Biology. - : Springer Science and Business Media LLC. - 2399-3642. ; 3:1 Journal article (peer-reviewed). Abstract: Optical density (OD) is widely used to estimate the density of cells in liquid culture, but cannot be compared between instruments without a standardized calibration protocol and is challenging to relate to actual cell count. We address this with an interlaboratory study comparing three simple, low-cost, and highly accessible OD calibration protocols across 244 laboratories, applied to eight strains of constitutive GFP-expressing E. coli. Based on our results, we recommend calibrating OD to estimated cell count using serial dilution of silica microspheres, which produces highly precise calibration (95.5% of residuals <1.2-fold), is easily assessed for quality control, also assesses instrument effective linear range, and can be combined with fluorescence calibration to obtain units of Molecules of Equivalent Fluorescein (MEFL) per cell, allowing direct comparison and data fusion with flow cytometry measurements: in our study, fluorescence per cell measurements showed only a 1.07-fold mean difference between plate reader and flow cytometry data.
2.	Chen, Jialin, et al. (author) Characterization and comparison of post-natal rat Achilles tendon-derived stem cells at different development stages 2016 In: Scientific Reports. - : Springer Science and Business Media LLC. - 2045-2322. ; 6 Journal article (peer-reviewed). Abstract: Tendon stem/progenitor cells (TSPCs) are a potential cell source for tendon tissue engineering. The striking morphological and structural changes of tendon tissue during development indicate the complexity of TSPCs at different stages. This study aims to characterize and compare post-natal rat Achilles tendon tissue and TSPCs at different stages of development. The tendon tissue showed distinct differences during development: the tissue structure became denser and more regular, the nuclei became spindle-shaped and the cell number decreased with time. TSPCs derived from 7 day Achilles tendon tissue showed the highest self-renewal ability, cell proliferation, and differentiation potential towards mesenchymal lineage, compared to TSPCs derived from 1 day and 56 day tissue. Microarray data showed up-regulation of several groups of genes in TSPCs derived from 7 day Achilles tendon tissue, which may account for the unique cell characteristics during this specific stage of development. Our results indicate that TSPCs derived from 7 day Achilles tendon tissue are a superior cell source as compared to TSPCs derived from 1 day and 56 day tissue, demonstrating the importance of choosing a suitable stem cell source for effective tendon tissue engineering and regeneration.
3.	Chen, Xiaowen, et al. (author) Performance analysis of homogeneous on-chip large-scale parallel computing architectures for data-parallel applications 2015 In: Journal of Electrical and Computer Engineering. - : Hindawi Limited. - 2090-0147 .- 2090-0155. ; 2015 Journal article (peer-reviewed). Abstract: On-chip computing platforms are evolving from single-core bus-based systems to many-core network-based systems, which are referred to as On-chip Large-scale Parallel Computing Architectures (OLPCs) in the paper. Homogeneous OLPCs feature strong regularity and scalability due to their identical cores and routers. Data-parallel applications have their parallel data subsets that are handled individually by the same program running in different cores. Therefore, data-parallel applications are able to obtain good speedup in homogeneous OLPCs. The paper addresses modeling the speedup performance of homogeneous OLPCs for data-parallel applications. When establishing the speedup performance model, the network communication latency and the ways of storing data of data-parallel applications are modeled and analyzed in detail. Two abstract concepts (equivalent serial packet and equivalent serial communication) are proposed to construct the network communication latency model. The uniform and hotspot traffic models are adopted to reflect the ways of storing data. Some useful suggestions are presented during the performance model's analysis. Finally, three data-parallel applications are performed on our cycle-accurate homogeneous OLPC experimental platform to validate the analytic results and demonstrate that our study provides a feasible way to estimate and evaluate the performance of data-parallel applications on homogeneous OLPCs.
4.	Chen, Xiaowen, et al. (author) Reducing Virtual-to-Physical address translation overhead in Distributed Shared Memory based multi-core Network-on-Chips according to data property 2013 In: Computers & electrical engineering. - : Elsevier BV. - 0045-7906 .- 1879-0755. ; 39:2, pp. 596-612 Journal article (peer-reviewed). Abstract: In Network-on-Chip (NoC) based multi-core platforms, Distributed Shared Memory (DSM) preferably uses virtual addressing in order to hide the physical locations of the memories. However, this incurs a performance penalty due to the Virtual-to-Physical (V2P) address translation overhead for all memory accesses. Based on the data property which can be either private or shared, this paper proposes a hybrid DSM which partitions a local memory into a private and a shared part. The private part is accessed directly using physical addressing and the shared part using virtual addressing. In particular, the partitioning boundary can be configured statically at design time and dynamically at runtime. The dynamic configuration further removes the V2P address translation overhead for those data with changeable property when they become private at runtime. In the experiments with three applications (matrix multiplication, 2D FFT, and H.264/AVC encoding), compared with the conventional DSM, our techniques show performance improvement up to 37.89%.
5.	Chen, Xiaowen, et al. (author) A Variable-Size FFT Hardware Accelerator Based on Matrix Transposition 2018 In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems. - : Institute of Electrical and Electronics Engineers (IEEE). - 1063-8210 .- 1557-9999. ; 26:10, pp. 1953-1966 Journal article (peer-reviewed). Abstract: Fast Fourier transform (FFT) is the kernel and the most time-consuming algorithm in the domain of digital signal processing, and the FFT sizes of different applications are very different. Therefore, this paper proposes a variable-size FFT hardware accelerator, which fully supports the IEEE-754 single-precision floating-point standard and the FFT calculation with a wide size range from 2 to 2^20 points. First, a parallel Cooley-Tukey FFT algorithm based on matrix transposition (MT) is proposed, which can efficiently divide a large size FFT into several small size FFTs that can be executed in parallel. Second, guided by this algorithm, the FFT hardware accelerator is designed, and several FFT performance optimization techniques such as hybrid twiddle factor generation, multibank data memory, block MT, and token-based task scheduling are proposed. Third, its VLSI implementation is detailed, showing that it can work at 1 GHz with an area of 2.4 mm² and a power consumption of 91.3 mW at 25 °C, 0.9 V. Finally, several experiments are carried out to evaluate the proposal's performance in terms of FFT execution time, resource utilization, and power consumption. Comparative experiments show that our FFT hardware accelerator achieves at most 18.89x speedups in comparison to two software-only solutions and two hardware-dedicated solutions.
6.	Chen, Xiaowen, et al. (author) Area and Performance Optimization of Barrier Synchronization on Multi-core Network-on-Chips 2010 In: 3rd IEEE International Conference on Computer and Electrical Engineering (ICCEE). Conference paper (peer-reviewed). Abstract: Barrier synchronization is commonly and widely used to synchronize the execution of parallel processor cores on multi-core Network-on-Chips (NoCs). Since its global nature may cause heavy serialization resulting in large performance penalty, barrier synchronization should be carefully designed to have low latency communication and to minimize overall completion time. Therefore, in the paper, we propose a fast barrier synchronization mechanism, targeting Multi-core NoCs. The fast barrier synchronization mechanism includes a dedicated hardware module, named Fast Barrier Synchronizer (FBS), integrated with each processor node. It offers a set of barrier counters and can concurrently process synchronization requests issued by the local node and remote nodes via the on-chip network. The salient feature of our fast barrier synchronization mechanism is that, once the barrier condition is reached, the “barrier release” acknowledgement is routed to all processor nodes in a broadcast way in order to save chip area by avoiding storing source node information and to minimize completion time by avoiding serialization of barrier releasing. Synthesis results suggest that the FBS can run over 1 GHz in SMIC® 130 nm technology with small area overhead. We implemented an FBS-enhanced multi-core NoC architecture on our FPGA platform using the Xilinx® Virtex 5 as the FPGA chip. FPGA utilization and simulation results show that our fast barrier synchronization demonstrates both area and performance advantages over the barrier synchronization counterpart with unicast barrier releasing.
7.	Chen, Xiaowen, et al. (author) Cooperative communication based barrier synchronization in on-chip mesh architectures 2011 In: IEICE Electronics Express. - : Institute of Electronics, Information and Communications Engineers (IEICE). - 1349-2543. ; 8:22, pp. 1856-1862 Journal article (peer-reviewed). Abstract: We propose cooperative communication as a means to enable efficient and scalable barrier synchronization on mesh-based many-core architectures. Our approach is different from but orthogonal to conventional algorithm-based optimizations. It relies on collaborating routers to provide efficient gather and multicast communication. In conjunction with a master-slave algorithm, it exploits the mesh regularity to achieve efficiency. The gather and multicast functions have been implemented in our router. Synthesis results suggest marginal area overhead. With synthetic and benchmark experiments, we show that our approach significantly reduces synchronization completion time and increases speedup.
8.	Chen, Xiaowen, et al. (author) Cooperative communication for efficient and scalable all-to-all barrier synchronization on mesh-based many-core NoCs 2014 In: IEICE Electronics Express. - : Institute of Electronics, Information and Communications Engineers (IEICE). - 1349-2543. ; 11:18, pp. 20140542- Journal article (peer-reviewed). Abstract: On many-core Network-on-Chips (NoCs), communication is on the critical path of system performance and contended synchronization requests may cause large performance penalty. Different from conventional algorithm-based approaches, the paper addresses the barrier synchronization problem from the angle of optimizing its communication performance and proposes cooperative communication as a means to achieve efficient and scalable all-to-all barrier synchronization on mesh-based many-core NoCs. With the cooperative communication, routers collaborate with one another to accomplish a fast barrier synchronization task. The cooperative communication is implemented in our router at low cost. Through comparative experiments, our approach evidently exhibits high efficiency and good scalability.
9.	Chen, Xiaowen, et al. (author) Handling Shared Variable Synchronization in Multi-core Network-on-Chips with Distributed Memory 2010 In: Proceedings. - 9781424466832 ; , pp. 467-472 Conference paper (peer-reviewed). Abstract: Parallelized shared variable applications running on multi-core Network-on-Chips (NoCs) require efficient support for synchronization, since communication is on the critical path of system performance and contended synchronization requests may cause large performance penalty. In this paper, we propose a dedicated hardware module for synchronization management. This module is called Synchronization Handler (SH), integrated with each processor-memory node on the multi-core NoCs. It uses two physical buffers to concurrently process synchronization requests issued by the local processor and remote processors via the on-chip network. One salient feature is that the two physical buffers are dynamically allocated to form multiple virtual buffers (a virtual buffer is related to a shared synchronization variable) so as to improve the buffer utilization and alleviate the head-of-line blocking. Synthesis results suggest that the SH can run over 900 MHz in 130 nm technology with small area overhead. To justify the SH-enhanced multi-core NoCs, we employ synthetic workloads to evaluate synchronization cost and buffer utilization, and run synchronization-intensive applications to investigate speedup. The results show that our approach is viable.
10.	Chen, Xiaowen, et al. (author) Hybrid distributed shared memory space in multi-core processors 2011 In: Journal of Software. - : International Academy Publishing (IAP). - 1796-217X. ; 6:12 SPEC. ISSUE, pp. 2369-2378 Journal article (peer-reviewed). Abstract: On multi-core processors, memories are preferably distributed and supporting Distributed Shared Memory (DSM) is essential for the sake of reusing huge amount of legacy code and easy programming. However, the DSM organization imports the inherent overhead of translating virtual memory addresses into physical memory addresses, resulting in negative performance. We observe that, in parallel applications, different data have different properties (private or shared). For the private data accesses, it's unnecessary to perform Virtual-to-Physical address translations. Even for the same datum, its property may be changeable in different phases of the program execution. Therefore, this paper focuses on decreasing the overhead of Virtual-to-Physical address translation and hence improving the system performance by introducing hybrid DSM organization and supporting run-time partitioning according to the data property. The hybrid DSM organization aims at supporting fast and physical memory accesses for private data and maintaining a global and single virtual memory space for shared data. Based on the data property of parallel applications, the run-time partitioning supports changing the hybrid DSM organization during the program execution. It ensures fast physical memory addressing on private data and conventional virtual memory addressing on shared data, improving the performance of the entire system by reducing virtual-to-physical address translation overhead as much as possible. We formulate the run-time partitioning of hybrid DSM organization in order to analyze its performance. A real DSM based multi-core platform is also constructed. The experimental results of real applications show that the hybrid DSM organization with run-time partitioning demonstrates performance advantage over the conventional DSM counterpart. The percentage of performance improvement depends on problem size, way of data partitioning and computation/communication ratio of parallel applications, network size of the system, etc. In our experiments, the maximal improvement is 34.42%, the minimal improvement 3.68%.
11.	Chen, Xiaowen, et al. (author) Multi-bit Transient Fault Control for NoC Links Using 2D Fault Coding Method 2016 In: 2016 TENTH IEEE/ACM INTERNATIONAL SYMPOSIUM ON NETWORKS-ON-CHIP (NOCS). - : IEEE. - 9781467390309 Conference paper (peer-reviewed). Abstract: In deep nanometer scale, Network-on-Chip (NoC) links are more prone to multi-bit transient faults. Conventional ECC techniques bring heavy area, power, and timing overheads when correcting and detecting multiple transient faults. Therefore, a cost-effective ECC technique, named 2D fault coding method, is adopted to overcome the multi-bit transient fault issue of NoC links. Its key innovation is that the wires of a link are treated as a matrix and light-weight Parity Check Coding (PCC) is performed on the matrix's two dimensions (horizontal matrix rows and vertical matrix columns). Horizontal PCCs and vertical PCCs work together to find the faults' position and then correct them by simply inverting them. The procedure of using the 2D fault coding method to protect a NoC link is proposed, its correction and detection capability is analyzed, and its hardware implementation is carried out. Comparative experiments show that the proposal can largely reduce the ECC hardware cost, have much higher fault detection coverage, maintain almost zero silent fault percentages, and have higher fault correction percentages normalized under the same area, demonstrating that it is cost-effective and suitable to the multi-bit transient fault control for NoC links.
12.	Chen, Xiaowen, et al. (author) Multi-FPGA Implementation of a Network-on-Chip Based Many-core Architecture with Fast Barrier Synchronization Mechanism 2010 In: Proceedings of the IEEE Norchip Conference. - 9781424489732 Conference paper (peer-reviewed). Abstract: In this paper, we propose a fast barrier synchronization mechanism, targeting Network-on-Chip based many-core architectures. Its salient feature is that, once the barrier condition is reached, the "barrier release" acknowledgement is routed to all processor nodes in a broadcast way in order to save area by avoiding storing source node information and to minimize completion time by eliminating serialization of barrier releasing. Then, we construct a multi-FPGA platform using Xilinx® Virtex 5 as FPGA chips and implement a NoC based many-core architecture on it. FPGA utilization and simulation results show that our mechanism demonstrates both area and performance advantages over the barrier synchronization counterpart with unicast barrier releasing.
13.	Chen, Xiaowen, et al. (author) Round-trip DRAM access fairness in 3D NoC-based many-core systems 2017 In: ACM Transactions on Embedded Computing Systems. - : Association for Computing Machinery. - 1539-9087 .- 1558-3465. ; 16:5s Journal article (peer-reviewed). Abstract: In 3D NoC-based many-core systems, DRAM accesses behave differently due to their different communication distances and the latency gap of different DRAM accesses becomes bigger as the network size increases, which leads to unfair DRAM access performance among different nodes. This phenomenon may lead to high latencies for some DRAM accesses that become the performance bottleneck of the system. The paper addresses the DRAM access fairness problem in 3D NoC-based many-core systems by narrowing the latency difference of DRAM accesses as well as reducing the maximum latency. Firstly, the latency of a round-trip DRAM access is modeled and the factors causing DRAM access latency difference are discussed in detail. Secondly, the DRAM access fairness is further quantitatively analyzed through experiments. Thirdly, we propose to predict the network latency of round-trip DRAM accesses and use the predicted round-trip DRAM access time as the basis to prioritize the DRAM accesses in DRAM interfaces so that the DRAM accesses with potential high latencies can be transferred as early and fast as possible, thus achieving fair DRAM access. Experiments with synthetic and application workloads validate that our approach can achieve fair DRAM access and outperform the traditional First-Come-First-Serve (FCFS) scheduling policy and the scheduling policies proposed by reference [7] and [24] in terms of maximum latency, Latency Standard Deviation (LSD) and speedup. In the experiments, the maximum improvement of the maximum latency, LSD, and speedup are 12.8%, 6.57%, and 8.3% respectively. Besides, our proposal brings very small extra hardware overhead (<0.6%) in comparison to the three counterparts.
14.	Chen, Xiaowen, 1982-, et al. (author) Run-time Partitioning of Hybrid Distributed Shared Memory on Multi-core Network-on-Chips 2010 In: The 3rd IEEE International Symposium on Parallel Architectures, Algorithms and Programming (PAAP 2010). - : Institute of Electrical and Electronics Engineers (IEEE). ; , pp. 39-46 Conference paper (peer-reviewed). Abstract: On multi-core Network-on-Chips (NoCs), memories are preferably distributed and supporting Distributed Shared Memory (DSM) is essential for the sake of reusing huge amount of legacy code and easy programming. However, the DSM organization imports the inherent overhead of translating virtual memory addresses into physical memory addresses, resulting in negative performance. We observe that, in parallel applications, different data have different properties (private or shared). For the private data accesses, it's unnecessary to perform Virtual-to-Physical address translations. Even for the same datum, its property may be changeable in different phases of the program execution. Therefore, this paper focuses on decreasing the overhead of Virtual-to-Physical address translation and hence improving the system performance by introducing hybrid DSM organization and supporting run-time partitioning according to the data property. The hybrid DSM organization aims at supporting fast and physical memory accesses for private data and maintaining a global and single virtual memory space for shared data. Based on the data property of parallel applications, the run-time partitioning supports changing the hybrid DSM organization during the program execution. It ensures fast physical memory addressing on private data and conventional virtual memory addressing on shared data, improving the performance of the entire system by reducing virtual-to-physical address translation overhead as much as possible. We formulate the run-time partitioning of hybrid DSM organization in order to analyze its performance. A real DSM based multi-core NoC platform is also constructed. The experimental results of real applications show that the hybrid DSM organization with run-time partitioning demonstrates performance advantage over the conventional DSM counterpart. The percentage of performance improvement depends on problem size, way of data partitioning and computation/communication ratio of parallel applications, network size of the system, etc. In our experiments, the maximal improvement is 34.42%, the minimal improvement 3.68%.
15.	Chen, Xiaowen, et al. (author) Speedup Analysis of Data-parallel Applications on Multi-core NoCs 2009 In: Proceedings of the IEEE International Conference on ASIC (ASICON). - 9781424438686 ; , pp. 105-108 Conference paper (peer-reviewed). Abstract: As more computing cores are integrated onto a single chip, the effect of network communication latency is becoming more and more significant on Multi-core Network-on-Chips (NoCs). For data-parallel applications, we study the model of parallel speedup by including network communication latency in Amdahl's law. The speedup analysis considers the effect of network topology, network size, traffic model and computation/communication ratio. We also study the speedup efficiency. In our Multi-core NoC platform, a real data-parallel application, i.e. matrix multiplication, is used to validate the analysis. Our theoretical analysis and the application results show that the speedup improvement is nonlinear and the speedup efficiency decreases as the system size is scaled up. Such analysis can be used to guide architects and programmers to improve parallel processing efficiency by reducing network latency with optimized network design and increasing computation proportion in the program.
16.	Chen, Xiaowen, et al. (author) Supporting Distributed Shared Memory on Multi-core Network-on-Chips Using a Dual Microcoded Controller 2010 In: Proceedings of the conference for Design Automation and Test in Europe. ; , pp. 39-44 Conference paper (peer-reviewed). Abstract: Supporting Distributed Shared Memory (DSM) is essential for multi-core Network-on-Chips for the sake of reusing huge amount of legacy code and easy programmability. We propose a microcoded controller as a hardware module in each node to connect the core, the local memory and the network. The controller is programmable where the DSM functions such as virtual-to-physical address translation, memory access and synchronization etc. are realized using microcode. To enable concurrent processing of memory requests from the local and remote cores, our controller features two mini-processors, one dealing with requests from the local core and the other from remote cores. Synthesis results suggest that the controller consumes 51k gates for the logic and can run up to 455 MHz in 130 nm technology. To evaluate its performance, we use synthetic and application workloads. Results show that, when the system size is scaled up, the delay overhead incurred by the controller may become less significant when compared with the network delay. In this way, the delay efficiency of our DSM solution is close to hardware solutions on average while retaining all the flexibility of software solutions.
17.	Chen, Xiaowen, 1982-, et al. (author) Supporting Efficient Synchronization in Multi-core NoCs Using Dynamic Buffer Allocation Technique 2010 In: Proceedings of the IEEE Annual Symposium on VLSI. - : Institute of Electrical and Electronics Engineers (IEEE). ; , pp. 462-463 Conference paper (peer-reviewed). Abstract: This paper explores a dynamic buffer allocation technique to guide a distributed synchronization architecture to support efficient synchronization on multi-core Network-on-Chips (NoCs). The synchronization architecture features two physical buffers to be able to concurrently queue and handle synchronization requests issued by the local processor and remote processors via the on-chip network. Using the dynamic buffer allocation technique, the two physical buffers are dynamically allocated to form multiple virtual buffers in order to improve buffers' utilization. Experiments are carried out to evaluate buffers' utilization.
18.	Chen, Yancang, et al. (author) A Trace-driven Hardware-level Simulator for Design and Verification of Network-on-Chips 2010 In: 2011 INTERNATIONAL CONFERENCE ON COMPUTERS, COMMUNICATIONS, CONTROL AND AUTOMATION (CCCA 2011), VOL II. - : IEEE. ; , pp. 32-35 Conference paper (peer-reviewed). Abstract: Traditional communications of general-purpose multi-core processor and application-specific System-on-Chip face challenges in terms of scalability and complexity. Network-on-Chip (NoC) has been the most promising solution for the communications of multi-core and many-core chips. In this paper, we present a trace-driven hardware-level simulator (noted HS) based on SystemVerilog for the design and verification of NoCs. Different from the state-of-the-art NoC simulators, the HS has three important characteristics in addition to the capability of creating simulation and synthesizable NoC descriptions: 1) hardware-level simulation can be done, which means more implementation details of hardware than flit-level simulation; 2) router debugging and verification can be done at RTL by inserting assertions and coverage; 3) trace-based application simulations can be done besides synthetic workloads. A 4 X 4 2D mesh NoC with output virtual-channel routers verifies the capability of our HS.
19.	Candaele, Bernard, et al. (author) Mapping Optimisation for Scalable multi-core ARchiTecture : The MOSART approach 2010 In: Proceedings - IEEE Annual Symposium on VLSI, ISVLSI 2010. - 9780769540764 ; , pp. 518-523 Conference paper (peer-reviewed). Abstract: The project will address two main challenges of prevailing architectures: 1) The global interconnect and memory bottleneck due to a single, globally shared memory with high access times and power consumption; 2) The difficulties in programming heterogeneous, multi-core platforms, in particular in dynamically managing data structures in distributed memory. MOSART aims to overcome these through a multi-core architecture with distributed memory organisation, a Network-on-Chip (NoC) communication backbone and configurable processing cores that are scaled, optimised and customised together to achieve diverse energy, performance, cost and size requirements of different classes of applications. MOSART achieves this by: A) Providing platform support for management of abstract data structures including middleware services and a run-time data manager for NoC based communication infrastructure; B) Developing tool support for parallelizing and mapping applications on the multi-core target platform and customizing the processing cores for the application.
20.	Candaele, Bernard, et al. (author) The MOSART Mapping Optimization for multi-core Architectures 2011 In: VLSI 2010 Annual Symposium. - Dordrecht : Springer Publishing Company. ; , pp. 181-195 Conference paper (peer-reviewed). Abstract: MOSART project addresses two main challenges of prevailing architectures: (i) The global interconnect and memory bottleneck due to a single, globally shared memory with high access times and power consumption; (ii) The difficulties in programming heterogeneous, multi-core platforms. MOSART aims to overcome these through a multi-core architecture with distributed memory organization, a Network-on-Chip (NoC) communication backbone and configurable processing cores that are scaled, optimized and customized together to achieve diverse energy, performance, cost and size requirements of different classes of applications. MOSART achieves this by: (i) Providing platform support for management of abstract data structures including middleware services and a run-time data manager for NoC based communication infrastructure; (ii) Developing tool support for parallelizing and mapping applications on the multi-core target platform and customizing the processing cores for the application.
21.	Chen, Xiaowen, 1982- (författare) Efficient Memory Access and Synchronization in NoC-based Many-core Processors 2019 Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract In NoC-based many-core processors, memory subsystem and synchronization mechanism are always the two important design aspects, since mining parallelism and pursuing higher performance require not only optimized memory management but also efficient synchronization mechanism. Therefore, we are motivated to research on efficient memory access and synchronization in three topics, namely, efficient on-chip memory organization, fair shared memory access, and efficient many-core synchronization.One major way of optimizing the memory performance is constructing a suitable and efficient memory organization. A distributed memory organization is more suitable to NoC-based many-core processors, since it features good scalability. We envision that it is essential to support Distributed Shared Memory (DSM) because of the huge amount of legacy code and easy programming. Therefore, we first adopt the microcoded approach to address DSM issues, aiming for hardware performance but maintaining the flexibility of programs. Second, we further optimize the DSM performance by reducing the virtual-to-physical address translation overhead. In addition to the general-purpose memory organization such as DSM, there exists special-purpose memory organization to optimize the performance of application-specific memory access. We choose Fast Fourier Transform (FFT) as the target application, and propose a multi-bank data memory specialized for FFT computation.In 3D NoC-based many-core processors, because processor cores and memories reside in different locations (center, corner, edge, etc.) of different layers, memory accesses behave differently due to their different communication distances. 
As the network size increases, the communication distance difference of memory accesses becomes larger, resulting in unfair memory access performance among different processor cores. This unfair memory access phenomenon may lead to high latencies of some memory accesses, thus negatively affecting the overall system performance. Therefore, we are motivated to study on-chip memory and DRAM access fairness in 3D NoC-based many-core processors through narrowing the round-trip latency difference of memory accesses as well as reducing the maximum memory access latency. Barrier synchronization is used to synchronize the execution of parallel processor cores. Conventional barrier synchronization approaches such as master-slave, all-to-all, tree-based, and butterfly are algorithm oriented. As many processor cores are networked on a single chip, contended synchronization requests may cause a large performance penalty. Motivated by this, different from the algorithm-based approaches, we choose another direction (i.e., exploiting efficient communication) to address the barrier synchronization problem. We propose cooperative communication as a means and combine it with the master-slave algorithm and the all-to-all algorithm to achieve efficient many-core barrier synchronization. Besides, a multi-FPGA implementation case study of fast many-core barrier synchronization is conducted.
22.	Chen, Xiaowen, et al. (författare) Kinetics and mechanism of autohydrolysis of hardwoods 2010 Ingår i: Bioresource Technology. - : Elsevier BV. - 0960-8524 .- 1873-2976. ; 101:20, s. 7812-7819 Tidskriftsartikel (refereegranskat)abstract Autohydrolysis using water is a promising method to extract hemicelluloses from wood prior to pulping in order to make co-products such as ethanol and acetic acid besides pulp. Many studies have been carried out on the kinetics and mechanism of autohydrolysis using batch reactors. The present study was performed in a continuous mixed flow reactor where the wood chips are retained in a basket inside the reactor. This reactor is well suited to determine intrinsic kinetics of hemicellulose dissolution because the dissolved products are rapidly removed from the reactor, thus minimizing further hydrolysis and degradation of the hemicelluloses in solution. The xylan removal rate follows an S-shaped behavior. GPC analysis of the continuously removed extract shows that the dissolved xylan oligomers have a DP smaller than about 25. Lignin-free xylan oligomers and cellulose oligomers are the major components dissolved in the initial stage of autohydrolysis, while xylan covalently bound to lignin (i.e. an LCC) is the major component removed during the later stage of autohydrolysis. The molecular weight of the dissolved components decreases with time in the second stage. The kinetics of xylan removal are explained in terms of a mechanism based on recent knowledge of the ultrastructure of the cell fibre wall.
23.	Gong, Xiaowen, et al. (författare) Guest Editorial Special Section on Distributed Edge Learning in Wireless Networks 2023 Ingår i: IEEE Open Journal of the Communications Society. - : Institute of Electrical and Electronics Engineers Inc.. - 2644-125X. ; 4, s. 2729-2732 Tidskriftsartikel (övrigt vetenskapligt/konstnärligt)
24.	Hu, Kaibo, et al. (författare) Highly selective recovery of rare earth elements from mine wastewater by modifying kaolin with phosphoric acid 2023 Ingår i: Separation and Purification Technology. - : Elsevier. - 1383-5866 .- 1873-3794. ; 309 Tidskriftsartikel (refereegranskat)abstract Recovery of rare earth elements (REEs) from mine wastewater is essential for maintaining rare earth reserves and sustainable application of REEs. In the present study, we prepared a phosphoric acid modified kaolin (P-K) adsorbent by a simple mechanochemical process for the selective recovery of REEs from rare earth wastewater. The impacts of phosphoric acid dosage, milling duration, initial pH, temperature, initial ion concentration, and adsorbent dosage on the selective adsorption of REEs were investigated. The findings demonstrate that the adsorption of REEs by P-K follows the pseudo-second-order kinetic model and the Langmuir isotherm model, and is dominated by chemical adsorption, with a maximum adsorption capacity of 19.82 mg/g at 50 °C. Additionally, in an original mine wastewater, the recovery rate of REEs can reach more than 90%, whereas the adsorption rates of calcium, magnesium, and ammonia nitrogen (whose concentration is 18 times that of REEs) are nearly zero, indicating that P-K has extremely high selectivity for REEs. Furthermore, the feedstock solution containing 40 mg/L of REEs may be concentrated to 3510 mg/L following enrichment treatment, and 99.9% of the REEs are eluted using a low concentration of hydrochloric acid. The findings illustrate that P-K has a wide range of potential applications in the treatment of rare earth industrial effluents.
25.	Jantsch, Axel, et al. (författare) Memory Architecture and Management in an NoC Platform 2011. - 1 Ingår i: Scalable Multi-core Architectures. - New York, NY : Springer. - 9781441967770 ; , s. 3-28 Bokkapitel (refereegranskat)abstract The memory organization and the management of the memory space is a critical part of every NoC based platform design. We propose a Data Management Engine (DME), that is a block of programmable hardware and part of every processing element. It off-loads the processing element (CPU, DSP, etc.) by managing the memory space, memory access and the communication over the on-chip network. The DME’s main functions are virtual address translation, private and shared memory management, cache coherence protocol, support for memory consistency models, synchronization and protection mechanisms for shared memory communication. The DME is fully programmable and configurable thus allowing for customized support for high level data management functions such as dynamic memory allocation and abstract data types. This chapter describes the main concepts, design and functionality of the DME and presents case studies illustrating its usage and performance.
26.	Li, Yang, et al. (författare) Round-trip latency prediction for memory access fairness in mesh-based many-core architectures 2014 Ingår i: IEICE Electronics Express. - : Institute of Electronics, Information and Communications Engineers (IEICE). - 1349-2543. ; 11:24, s. 20141027- Tidskriftsartikel (refereegranskat)abstract In mesh-based many-core architectures, processor cores and memories reside in different locations (center, corner, edge, etc.), therefore memory accesses behave differently due to their different communication distances. The latency difference leads to unfair memory access and some memory accesses with very high latencies, degrading the system performance. However, improving one memory access's latency can worsen the latency of another since memory accesses contend in the network. Therefore, the goal should focus on memory access fairness through balancing the latencies of memory accesses while ensuring a low average latency. In this paper, we address the goal by proposing to predict the round-trip latencies of memory access related packets and use the predicted round-trip latencies to prioritize the packets. The router supporting fair memory access is designed and its hardware cost is given. Experiments are carried out with a variety of network sizes and packet injection rates and show that our approach outperforms the classic round-robin arbitration in terms of average latency and LSD. In the experiments, the maximum improvements of the average latency and the LSD are 16% and 48% respectively.
27.	Naeem, Abdul, et al. (författare) Realization and Performance Comparison of Sequential and Weak Memory Consistency Models in Network-on-Chip based Multi-core Systems 2011 Ingår i: Proceedings of 16th ACM/IEEE Asia and South Pacific Design Automation Conference(ASP-DAC) 2011. - : IEEE Press. ; , s. 154-159 Konferensbidrag (refereegranskat)abstract This paper studies the realization and performance comparison of the sequential and weak consistency models in network-on-chip (NoC) based distributed shared memory (DSM) multi-core systems. Memory consistency constrains the order of shared memory operations for the expected behavior of the multi-core systems. Both consistency models are realized in the NoC based multi-core systems. The performance of the two consistency models is compared for various sizes of networks using regular mesh topologies and a deflection routing algorithm. The results show that weak consistency improves the performance by 46.17% and 33.76% on average in the code and consistency latencies over the sequential consistency model, due to relaxation in the program order, as the system grows from single core to 64 cores.
28.	Naeem, Abdul, et al. (författare) Realization and Scalability of Release and Protected Release Consistency Models in NoC based Systems 2011 Ingår i: Proceeding of 14th Euromicro Conference on Digital System Design, 2011. - Oulu : IEEE Computer Society. - 9781457710483 ; , s. 47-54 Konferensbidrag (refereegranskat)abstract This paper studies the realization and scalability of release and protected release consistency models in Network-on-Chip (NoC) based Distributed Shared Memory (DSM) multi-core systems. The protected release consistency (PRC) model is proposed as an extension of the release consistency (RC) model and provides further relaxation in the shared memory operations. The realization schemes of RC and PRC models use a transaction counter in each node of the NoC based multi-core (McNoC) systems. Further, we study the scalability of these RC and PRC models and evaluate their performance in the McNoC platform. A configurable NoC based platform with 2D mesh topology and deflection routing algorithm is used in the tests. We experiment with both synthetic and application workloads. The performance of the RC and PRC models is compared using sequential consistency (SC) as the baseline. The experiments show that the average code execution time for the PRC model in an 8x8 network (64 cores) is reduced by 30.5% over SC, and by 6.5% over the RC model. Average data execution time in the 8x8 network for the PRC model is reduced by almost 37% over SC and by 8.8% over RC. The increase in area for PRC over RC is about 880 gates in the network interface (1.7%).
29.	Naeem, Abdul, et al. (författare) Scalability of Relaxed Consistency Models in NoC based Multicore Architectures 2009 Ingår i: SIGARCH Computer Architecture News. - : ACM Press. - 0163-5964 .- 1943-5851. ; 37:5, s. 8-15 Tidskriftsartikel (övrigt vetenskapligt/konstnärligt)abstract This paper studies the realization of relaxed memory consistency models in network-on-chip based distributed shared memory (DSM) multi-core systems. Within DSM systems, memory consistency is a critical issue since it affects not only the performance but also the correctness of programs. We investigate the scalability of the relaxed consistency models (weak, release consistency) implemented by using transaction counters. Our experimental results compare the average and maximum code, synchronization and data latencies of the two consistency models for various network sizes with regular mesh topologies. The observed latencies rise for both consistency models as the network size grows. However, the scaling behaviors are different. With the release consistency model these latencies grow significantly slower than with weak consistency due to better optimization potential by means of overlapping, reordering and program order relaxations. Release consistency improves the performance by 15.6% and 26.5% on average in the code and consistency latencies over the weak consistency model for the specific application, as the system grows from single core to 64 cores. The latency of data transactions grows 2.2 times faster on average with a weak consistency model than with a release consistency model when the system scales from single core to 64 cores.
30.	Naeem, Abdul, et al. (författare) Scalability of Weak Consistency in NoC based Multicore Architectures 2010 Ingår i: IEEE INT SYMP CIRC SYST PROC. - New York : IEEE. - 9781424453085 ; , s. 3497-3500 Konferensbidrag (refereegranskat)abstract In Multicore Network-on-Chip, it is preferable to realize distributed but shared memory (DSM) in order to reuse the huge amount of legacy code and easy programming. Within DSM systems, memory consistency is a critical issue since it affects not only performance but also the correctness of programs. In this paper, we investigate the scalability of the weak consistency model, which may be implemented using a transaction counter. The experimental results compare synchronization latencies for various network sizes, topologies and lock positions in the network. Average synchronization latency rises exponentially for mesh and torus topologies as the network size grows. However, torus improves the synchronization latency in comparison to mesh. For mesh topology network average synchronization latency is also slightly affected by the lock position with respect to the network center.
31.	Wang, Zicong, et al. (författare) Cache Access Fairness in 3D Mesh-Based NUCA 2018 Ingår i: IEEE Access. - : Institute of Electrical and Electronics Engineers (IEEE). - 2169-3536. ; 6, s. 42984-42996 Tidskriftsartikel (refereegranskat)abstract Given the increase in cache capacity over the past few decades, cache access efficiency has come to play a critical role in determining system performance. To ensure efficient utilization of the cache resources, non-uniform cache architecture (NUCA) has been proposed to allow for a large capacity and a short access latency. With the support of networks-on-chip (NoC), NUCA is often employed to organize the last level cache. However, this method also hurts cache access fairness, which denotes the degree of non-uniformity for cache access latencies. This drop in fairness can result in an increased number of cache accesses with overhigh latency, which leads to a bottleneck in system performance. This paper investigates the cache access fairness in the context of NoC-based 3-D chip architecture, and provides new insights into 3-D architecture design. We propose fair-NUCA (F-NUCA), a co-design scheme intended to optimize cache access fairness. In F-NUCA, we strive to improve fairness by equalizing cache access latencies. To achieve this goal, the memory mapping and the channel width are both redistributed non-uniformly, thereby equalizing the non-contention and contention latencies, respectively. The experimental results reveal that F-NUCA can effectively improve cache access fairness. When F-NUCA is compared with the traditional static NUCA in a simulation with PARSEC benchmarks, the average reductions in average latency and latency standard deviation are 4.64%/9.38% for a 4 x 4 x 2 mesh network, as well as 6.31%/13.51% for a 4 x 4 x 4 mesh network. In addition, a 4.0%/6.4% improvement in system throughput can be achieved for the two scales of mesh networks, respectively.
32.	Wang, Z., et al. (författare) Fairness-oriented and location-aware NUCA for many-core SoC 2017 Ingår i: 2017 11th IEEE/ACM International Symposium on Networks-on-Chip, NOCS 2017. - New York, NY, USA : Association for Computing Machinery (ACM). - 9781450349840 Konferensbidrag (refereegranskat)abstract Non-uniform cache architecture (NUCA) is often employed to organize the last level cache (LLC) by Networks-on-Chip (NoC). However, as the network size of Systems-on-Chip (SoC) scales up, two trends gradually begin to emerge. First, the network latency is becoming the major source of the cache access latency. Second, the communication distance and latency gap between different cores is increasing. Such a gap can cause a serious network latency imbalance problem, aggravate the degree of non-uniformity of cache access latencies, and thereby worsen the system performance. In this paper, we propose a novel NUCA-based scheme, named fairness-oriented and location-aware NUCA (FL-NUCA), to alleviate the network latency imbalance problem and achieve more uniform cache access. We strive to equalize network latencies which are measured by three metrics: average latency (AL), latency standard deviation (LSD), and maximum latency (ML). In FL-NUCA, the memory-to-LLC mapping and links are both non-uniformly distributed to better fit the network topology and traffic, thereby equalizing network latencies from two aspects, i.e., non-contention latencies and contention latencies, respectively. The experimental results show that FL-NUCA can effectively improve the fairness of network latencies. Compared with the traditional static NUCA (SNUCA), in simulation with synthetic traffic, the average improvements for AL, LSD, and ML are 20.9%, 36.3%, and 35.0%, respectively. In simulation with PARSEC benchmarks, the average improvements for AL, LSD, and ML are 6.3%, 3.6%, and 11.2%, respectively.
33.	Wang, Z., et al. (författare) Fairness-oriented switch allocation for networks-on-chip 2017 Ingår i: 2017 30th IEEE International System-on-Chip Conference (SOCC). - : IEEE Computer Society. - 9781538640333 ; , s. 304-309 Konferensbidrag (refereegranskat)abstract Networks-on-Chip (NoC) is becoming the backbone of modern chip multiprocessor (CMP) systems. However, with the number of integrated cores increasing and the network size scaling up, the network-latency imbalance is becoming an important problem, which seriously influences the performance of the network and system. In this paper, we aim to alleviate this problem by optimizing the design of switch allocation. We propose fairness-oriented switch allocation (FOSA), a novel switch allocation strategy to achieve uniform network latencies. FOSA can improve system performance by achieving remarkable improvement in balancing network latencies. We evaluate the network and system performance of FOSA with synthetic traffics and SPEC CPU2006 benchmarks in a full-system simulator. Compared with the canonical separable switch allocator (Round-Robin) and the recently proposed switch allocator (TS-Router), the experiments with benchmarks show that our approach decreases maximum latency (ML) by 45.6% and 15.1%, respectively, as well as latency standard deviation (LSD) by 13.8% and 3.9%, respectively. Besides this, FOSA improves system throughput by 0.8% over that of TS-Router. Finally, we synthesize FOSA and give an evaluation of the additional consumption of area and power.
34.	Wang, Z., et al. (författare) Load-balanced link distribution in mesh-based many-core systems 2019 Ingår i: 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019 10-12 Aug. 2019. - : Institute of Electrical and Electronics Engineers (IEEE). ; , s. 1028-1034 Konferensbidrag (refereegranskat)abstract Networks-on-Chip (NoC) is becoming the fundamental infrastructure of modern chip multiprocessors (CMPs). As a mesh-based network scales up, the inequivalence of link locations gradually causes an unbalanced traffic load across the links. In a mesh network, the central regions easily become hotspots, and the central links are more heavily utilized than the peripheral links in the context of non-uniform cache architecture (NUCA). Different from the traditional uniform interconnection between network nodes, we propose the load-balanced link distribution scheme, which aims at assigning physical channels in accordance with the traffic load of each link. In this paper, we analyze the traffic load distribution for mesh networks of different scales and give the corresponding load-balanced link distributions. The simulation results indicate that the load-balanced scheme achieves not only lower physical channel costs but also better network and system performance than the traditional uniform scheme. The experiments with synthetic traffic show that the load-balanced scheme exhibits 57.33%/60.23%/47.56% lower network latency at saturation point on average compared with the uniform scheme for 8x8/10x10/12x12 mesh networks respectively. By contrast, the load-balanced link distribution scheme uses fewer physical channels, and the reductions in physical channel cost are 7.14%/5.56%/15.15% for 8x8/10x10/12x12 mesh networks respectively. 
The experiments with PARSEC benchmarks reveal that a 2.1% improvement of system throughput can be achieved by the load-balanced scheme.
35.	Wang, Z., et al. (författare) VP-Router : On balancing the traffic load in on-chip networks 2018 Ingår i: IEICE Electronics Express. - : Institute of Electronics Information Communication Engineers. - 1349-2543. ; 15:22 Tidskriftsartikel (refereegranskat)abstract As networks-on-chip (NoC) scale up, network traffic keeps growing, and the central region generally tends to become a traffic hotspot. Unbalanced traffic can turn a subset of network links into the bottleneck of network communication, and thus hurt the network and system performance. In this paper, we propose a load-balanced link distribution method, which is intended to allocate physical channels according to the traffic load on each link. To support connecting multiple physical channels between two routers, we propose a novel concept of virtual port, and design a low-cost multi-port router called virtual port router (VP-Router). Compared to the network with traditional routers, the network with VP-Routers can effectively balance the network traffic load on links. The experiments with SPLASH2 benchmarks show that VP-Router performs 6.3% and 9.0% better in energy-delay-product (EDP) for 4 × 4 and 8 × 8 mesh networks respectively. As for system throughput, VP-Router improves by about 3.5% and 5.8% on average respectively.
