SwePub

Search results for "WFRF:(Chen Xiaowen) srt2:(2015-2019)"


  • Results 1-10 of 11
1.
  • Chen, Jialin, et al. (author)
  • Characterization and comparison of post-natal rat Achilles tendon-derived stem cells at different development stages
  • 2016
  • In: Scientific Reports. - : Springer Science and Business Media LLC. - 2045-2322. ; 6
  • Journal article (peer-reviewed)
    • Tendon stem/progenitor cells (TSPCs) are a potential cell source for tendon tissue engineering. The striking morphological and structural changes of tendon tissue during development indicate the complexity of TSPCs at different stages. This study aims to characterize and compare post-natal rat Achilles tendon tissue and TSPCs at different stages of development. The tendon tissue showed distinct differences during development: the tissue structure became denser and more regular, the nuclei became spindle-shaped and the cell number decreased with time. TSPCs derived from 7 day Achilles tendon tissue showed the highest self-renewal ability, cell proliferation, and differentiation potential towards mesenchymal lineage, compared to TSPCs derived from 1 day and 56 day tissue. Microarray data showed up-regulation of several groups of genes in TSPCs derived from 7 day Achilles tendon tissue, which may account for the unique cell characteristics during this specific stage of development. Our results indicate that TSPCs derived from 7 day Achilles tendon tissue are a superior cell source as compared to TSPCs derived from 1 day and 56 day tissue, demonstrating the importance of choosing a suitable stem cell source for effective tendon tissue engineering and regeneration.
2.
  • Chen, Xiaowen, et al. (author)
  • Performance analysis of homogeneous on-chip large-scale parallel computing architectures for data-parallel applications
  • 2015
  • In: Journal of Electrical and Computer Engineering. - : Hindawi Limited. - 2090-0147 .- 2090-0155. ; 2015
  • Journal article (peer-reviewed)
    • On-chip computing platforms are evolving from single-core bus-based systems to many-core network-based systems, which are referred to as On-chip Large-scale Parallel Computing Architectures (OLPCs) in the paper. Homogeneous OLPCs feature strong regularity and scalability due to their identical cores and routers. Data-parallel applications consist of parallel data subsets that are handled individually by the same program running on different cores. Therefore, data-parallel applications are able to obtain good speedup on homogeneous OLPCs. The paper addresses modeling the speedup performance of homogeneous OLPCs for data-parallel applications. When establishing the speedup performance model, the network communication latency and the ways of storing data of data-parallel applications are modeled and analyzed in detail. Two abstract concepts (equivalent serial packet and equivalent serial communication) are proposed to construct the network communication latency model. The uniform and hotspot traffic models are adopted to reflect the ways of storing data. Some useful suggestions are presented during the performance model's analysis. Finally, three data-parallel applications are run on our cycle-accurate homogeneous OLPC experimental platform to validate the analytic results and demonstrate that our study provides a feasible way to estimate and evaluate the performance of data-parallel applications on homogeneous OLPCs.
3.
  • Chen, Xiaowen, et al. (author)
  • A Variable-Size FFT Hardware Accelerator Based on Matrix Transposition
  • 2018
  • In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems. - : Institute of Electrical and Electronics Engineers (IEEE). - 1063-8210 .- 1557-9999. ; 26:10, pp. 1953-1966
  • Journal article (peer-reviewed)
    • Fast Fourier transform (FFT) is the kernel and the most time-consuming algorithm in the domain of digital signal processing, and the FFT sizes of different applications are very different. Therefore, this paper proposes a variable-size FFT hardware accelerator, which fully supports the IEEE-754 single-precision floating-point standard and the FFT calculation with a wide size range from 2 to 2^20 points. First, a parallel Cooley-Tukey FFT algorithm based on matrix transposition (MT) is proposed, which can efficiently divide a large size FFT into several small size FFTs that can be executed in parallel. Second, guided by this algorithm, the FFT hardware accelerator is designed, and several FFT performance optimization techniques such as hybrid twiddle factor generation, multibank data memory, block MT, and token-based task scheduling are proposed. Third, its VLSI implementation is detailed, showing that it can work at 1 GHz with the area of 2.4 mm² and the power consumption of 91.3 mW at 25 °C, 0.9 V. Finally, several experiments are carried out to evaluate the proposal's performance in terms of FFT execution time, resource utilization, and power consumption. Comparative experiments show that our FFT hardware accelerator achieves at most 18.89x speedups in comparison to two software-only solutions and two hardware-dedicated solutions.
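The idea of dividing a large FFT into small FFTs by way of matrix transposition can be illustrated with the textbook "four-step" Cooley-Tukey factorization, where a transform of length n = n1 × n2 becomes independent small DFTs separated by transpositions and a twiddle-factor multiply. This is a minimal sketch under that assumption and is not taken from the paper; the names `dft`, `transpose`, and `fft_by_transposition` are hypothetical, and the naive `dft` stands in for whatever small-FFT kernel a real design would use.

```python
import cmath
from typing import List, Sequence

def dft(x: Sequence[complex]) -> List[complex]:
    """Naive O(n^2) DFT, used here as the 'small FFT' kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def transpose(m: List[List[complex]]) -> List[List[complex]]:
    """Swap rows and columns of a rectangular matrix."""
    return [list(col) for col in zip(*m)]

def fft_by_transposition(x: List[complex], n1: int, n2: int) -> List[complex]:
    """Length n1*n2 DFT built from small DFTs and matrix transpositions."""
    n = n1 * n2
    assert len(x) == n
    # View the input as an n1 x n2 matrix in row-major order: a[r][c] = x[n2*r + c].
    a = [x[r * n2:(r + 1) * n2] for r in range(n1)]
    # Transpose so each original column becomes a row, then run n2 DFTs of length n1.
    cols = [dft(row) for row in transpose(a)]            # cols[c][k1]
    # Multiply by the twiddle factors exp(-2*pi*i * c * k1 / n).
    cols = [[v * cmath.exp(-2j * cmath.pi * c * k1 / n) for k1, v in enumerate(row)]
            for c, row in enumerate(cols)]
    # Transpose back and run n1 DFTs of length n2.
    rows = [dft(row) for row in transpose(cols)]         # rows[k1][k2]
    # A final transpose puts the result in natural order: X[k1 + n1*k2].
    return [v for row in transpose(rows) for v in row]
```

Each small `dft` call reads only its own row, so the calls within one stage are independent of each other and could in principle be executed in parallel.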
4.
  • Chen, Xiaowen, et al. (author)
  • Multi-bit Transient Fault Control for NoC Links Using 2D Fault Coding Method
  • 2016
  • In: 2016 Tenth IEEE/ACM International Symposium on Networks-on-Chip (NOCS). - : IEEE. - 9781467390309
  • Conference paper (peer-reviewed)
    • At deep nanometer scale, Network-on-Chip (NoC) links are more prone to multi-bit transient faults. Conventional ECC techniques bring heavy area, power, and timing overheads when correcting and detecting multiple transient faults. Therefore, a cost-effective ECC technique, named 2D fault coding method, is adopted to overcome the multi-bit transient fault issue of NoC links. Its key innovation is that the wires of a link are treated as a matrix and light-weight Parity Check Coding (PCC) is performed on the matrix's two dimensions (horizontal matrix rows and vertical matrix columns). Horizontal PCCs and vertical PCCs work together to find the faults' position and then correct them by simply inverting them. The procedure of using the 2D fault coding method to protect a NoC link is proposed, its correction and detection capability is analyzed, and its hardware implementation is carried out. Comparative experiments show that the proposal can largely reduce the ECC hardware cost, have much higher fault detection coverage, maintain almost zero silent fault percentages, and have higher fault correction percentages normalized under the same area, demonstrating that it is cost-effective and suitable for multi-bit transient fault control for NoC links.
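The row-and-column parity idea described in the abstract above can be sketched in a few lines. This is a simplified illustration, not the paper's procedure: it assumes even parity, assumes the parity bits themselves arrive intact, and only corrects faults confined to a single row or a single column. The names `encode` and `correct` and the status strings are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[int]]

def encode(bits: Matrix) -> Tuple[List[int], List[int]]:
    """Compute one even-parity bit per matrix row and per matrix column."""
    row_par = [sum(row) % 2 for row in bits]
    col_par = [sum(col) % 2 for col in zip(*bits)]
    return row_par, col_par

def correct(received: Matrix, row_par: List[int],
            col_par: List[int]) -> Tuple[Matrix, str]:
    """Locate faulty bits from failing row/column checks and invert them.

    Returns the (possibly repaired) matrix and one of
    'clean', 'corrected', or 'detected' (fault seen but not located).
    """
    bad_rows = [r for r, row in enumerate(received) if sum(row) % 2 != row_par[r]]
    bad_cols = [c for c, col in enumerate(zip(*received)) if sum(col) % 2 != col_par[c]]
    fixed = [row[:] for row in received]
    if not bad_rows and not bad_cols:
        return fixed, "clean"
    # Assumption: faults sit in one row (or one column), so each failing
    # check on the other dimension marks exactly one faulty position.
    # Other fault patterns can alias to the same syndrome and be miscorrected.
    if len(bad_rows) == 1 and bad_cols:
        for c in bad_cols:
            fixed[bad_rows[0]][c] ^= 1
        return fixed, "corrected"
    if len(bad_cols) == 1 and bad_rows:
        for r in bad_rows:
            fixed[r][bad_cols[0]] ^= 1
        return fixed, "corrected"
    return fixed, "detected"
```

In this sketch an even number of faults in one row leaves the row check silent, so the column checks still flag the error but it is reported as detected rather than corrected.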
5.
  • Chen, Xiaowen, et al. (author)
  • Round-trip DRAM access fairness in 3D NoC-based many-core systems
  • 2017
  • In: ACM Transactions on Embedded Computing Systems. - : Association for Computing Machinery. - 1539-9087 .- 1558-3465. ; 16:5s
  • Journal article (peer-reviewed)
    • In 3D NoC-based many-core systems, DRAM accesses behave differently due to their different communication distances and the latency gap of different DRAM accesses becomes bigger as the network size increases, which leads to unfair DRAM access performance among different nodes. This phenomenon may lead to high latencies for some DRAM accesses that become the performance bottleneck of the system. The paper addresses the DRAM access fairness problem in 3D NoC-based many-core systems by narrowing the latency difference of DRAM accesses as well as reducing the maximum latency. Firstly, the latency of a round-trip DRAM access is modeled and the factors causing DRAM access latency difference are discussed in detail. Secondly, the DRAM access fairness is further quantitatively analyzed through experiments. Thirdly, we propose to predict the network latency of round-trip DRAM accesses and use the predicted round-trip DRAM access time as the basis to prioritize the DRAM accesses in DRAM interfaces so that the DRAM accesses with potential high latencies can be transferred as early and fast as possible, thus achieving fair DRAM access. Experiments with synthetic and application workloads validate that our approach can achieve fair DRAM access and outperform the traditional First-Come-First-Serve (FCFS) scheduling policy and the scheduling policies proposed by references [7] and [24] in terms of maximum latency, Latency Standard Deviation (LSD) and speedup. In the experiments, the maximum improvements of the maximum latency, LSD, and speedup are 12.8%, 6.57%, and 8.3% respectively. Besides, our proposal brings very small extra hardware overhead (<0.6%) in comparison to the three counterparts.
6.
  • Chen, Xiaowen, 1982- (author)
  • Efficient Memory Access and Synchronization in NoC-based Many-core Processors
  • 2019
  • Doctoral thesis (other academic/artistic)
    • In NoC-based many-core processors, memory subsystem and synchronization mechanism are always the two important design aspects, since mining parallelism and pursuing higher performance require not only optimized memory management but also efficient synchronization mechanism. Therefore, we are motivated to research on efficient memory access and synchronization in three topics, namely, efficient on-chip memory organization, fair shared memory access, and efficient many-core synchronization.
      One major way of optimizing the memory performance is constructing a suitable and efficient memory organization. A distributed memory organization is more suitable to NoC-based many-core processors, since it features good scalability. We envision that it is essential to support Distributed Shared Memory (DSM) because of the huge amount of legacy code and easy programming. Therefore, we first adopt the microcoded approach to address DSM issues, aiming for hardware performance but maintaining the flexibility of programs. Second, we further optimize the DSM performance by reducing the virtual-to-physical address translation overhead. In addition to the general-purpose memory organization such as DSM, there exists special-purpose memory organization to optimize the performance of application-specific memory access. We choose Fast Fourier Transform (FFT) as the target application, and propose a multi-bank data memory specialized for FFT computation.
      In 3D NoC-based many-core processors, because processor cores and memories reside in different locations (center, corner, edge, etc.) of different layers, memory accesses behave differently due to their different communication distances. As the network size increases, the communication distance difference of memory accesses becomes larger, resulting in unfair memory access performance among different processor cores. This unfair memory access phenomenon may lead to high latencies of some memory accesses, thus negatively affecting the overall system performance. Therefore, we are motivated to study on-chip memory and DRAM access fairness in 3D NoC-based many-core processors through narrowing the round-trip latency difference of memory accesses as well as reducing the maximum memory access latency.
      Barrier synchronization is used to synchronize the execution of parallel processor cores. Conventional barrier synchronization approaches such as master-slave, all-to-all, tree-based, and butterfly are algorithm oriented. As many processor cores are networked on a single chip, contended synchronization requests may cause large performance penalty. Motivated by this, different from the algorithm-based approaches, we choose another direction (i.e., exploiting efficient communication) to address the barrier synchronization problem. We propose cooperative communication as a means and combine it with the master-slave algorithm and the all-to-all algorithm to achieve efficient many-core barrier synchronization. Besides, a multi-FPGA implementation case study of fast many-core barrier synchronization is conducted.
7.
  • Wang, Zicong, et al. (author)
  • Cache Access Fairness in 3D Mesh-Based NUCA
  • 2018
  • In: IEEE Access. - : Institute of Electrical and Electronics Engineers (IEEE). - 2169-3536. ; 6, pp. 42984-42996
  • Journal article (peer-reviewed)
    • Given the increase in cache capacity over the past few decades, cache access efficiency has come to play a critical role in determining system performance. To ensure efficient utilization of the cache resources, non-uniform cache architecture (NUCA) has been proposed to allow for a large capacity and a short access latency. With the support of networks-on-chip (NoC), NUCA is often employed to organize the last level cache. However, this method also hurts cache access fairness, which denotes the degree of non-uniformity for cache access latencies. This drop in fairness can result in an increased number of cache accesses with overhigh latency, which leads to a bottleneck in system performance. This paper investigates the cache access fairness in the context of NoC-based 3-D chip architecture, and provides new insights into 3-D architecture design. We propose fair-NUCA (F-NUCA), a co-design scheme intended to optimize cache access fairness. In F-NUCA, we strive to improve fairness by equalizing cache access latencies. To achieve this goal, the memory mapping and the channel width are both redistributed non-uniformly, thereby equalizing the non-contention and contention latencies, respectively. The experimental results reveal that F-NUCA can effectively improve cache access fairness. When F-NUCA is compared with the traditional static NUCA in a simulation with PARSEC benchmarks, the average reductions in average latency and latency standard deviation are 4.64%/9.38% for a 4 x 4 x 2 mesh network, as well as 6.31%/13.51% for a 4 x 4 x 4 mesh network. In addition, a 4.0%/6.4% improvement in system throughput can be achieved for the two scales of mesh networks, respectively.
8.
  • Wang, Z., et al. (author)
  • Fairness-oriented and location-aware NUCA for many-core SoC
  • 2017
  • In: 2017 11th IEEE/ACM International Symposium on Networks-on-Chip, NOCS 2017. - New York, NY, USA : Association for Computing Machinery (ACM). - 9781450349840
  • Conference paper (peer-reviewed)
    • Non-uniform cache architecture (NUCA) is often employed to organize the last level cache (LLC) by Networks-on-Chip (NoC). However, as the network size of Systems-on-Chip (SoC) scales up, two trends gradually begin to emerge. First, the network latency is becoming the major source of the cache access latency. Second, the communication distance and latency gap between different cores is increasing. Such a gap can cause a serious network latency imbalance problem, aggravate the non-uniformity of cache access latencies, and thus worsen system performance. In this paper, we propose a novel NUCA-based scheme, named fairness-oriented and location-aware NUCA (FL-NUCA), to alleviate the network latency imbalance problem and achieve more uniform cache access. We strive to equalize network latencies which are measured by three metrics: average latency (AL), latency standard deviation (LSD), and maximum latency (ML). In FL-NUCA, the memory-to-LLC mapping and links are both non-uniformly distributed to better fit the network topology and traffic, thereby equalizing network latencies from two aspects, i.e., non-contention latencies and contention latencies, respectively. The experimental results show that FL-NUCA can effectively improve the fairness of network latencies. Compared with the traditional static NUCA (SNUCA), in simulation with synthetic traffic, the average improvements for AL, LSD, and ML are 20.9%, 36.3%, and 35.0%, respectively. In simulation with PARSEC benchmarks, the average improvements for AL, LSD, and ML are 6.3%, 3.6%, and 11.2%, respectively.
9.
  • Wang, Z., et al. (author)
  • Fairness-oriented switch allocation for networks-on-chip
  • 2017
  • In: 2017 30th IEEE International System-on-Chip Conference (SOCC). - : IEEE Computer Society. - 9781538640333 ; , pp. 304-309
  • Conference paper (peer-reviewed)
    • Networks-on-Chip (NoC) is becoming the backbone of modern chip multiprocessor (CMP) systems. However, with the number of integrated cores increasing and the network size scaling up, the network-latency imbalance is becoming an important problem, which seriously influences the performance of the network and system. In this paper, we aim to alleviate this problem by optimizing the design of switch allocation. We propose fairness-oriented switch allocation (FOSA), a novel switch allocation strategy to achieve uniform network latencies. FOSA can improve system performance by achieving remarkable improvement in balancing network latencies. We evaluate the network and system performance of FOSA with synthetic traffics and SPEC CPU2006 benchmarks in a full-system simulator. Compared with the canonical separable switch allocator (Round-Robin) and the recently proposed switch allocator (TS-Router), the experiments with benchmarks show that our approach decreases maximum latency (ML) by 45.6% and 15.1%, respectively, as well as latency standard deviation (LSD) by 13.8% and 3.9%, respectively. Besides this, FOSA improves system throughput by 0.8% over that of TS-Router. Finally, we synthesize FOSA and give an evaluation of the additional consumption of area and power.
10.
  • Wang, Z., et al. (author)
  • Load-balanced link distribution in mesh-based many-core systems
  • 2019
  • In: 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019 10-12 Aug. 2019. - : Institute of Electrical and Electronics Engineers (IEEE). ; , pp. 1028-1034
  • Conference paper (peer-reviewed)
    • Networks-on-Chip (NoC) is becoming the fundamental infrastructure of modern chip multiprocessors (CMPs). As a mesh-based network scales up, the unequal locations of the links gradually cause unbalanced traffic load across the links. In a mesh network, the central regions easily become hotspots, and the central links are more heavily utilized than the peripheral links in the context of non-uniform cache architecture (NUCA). Different from the traditional uniform interconnection between network nodes, we propose the load-balanced link distribution scheme, which aims at assigning physical channels in accordance with the traffic load of each link. In this paper, we analyze the traffic load distribution for the mesh network with different scales and give the corresponding load-balanced link distributions. The simulation results indicate that the load-balanced scheme achieves not only lower physical channel costs but also better network and system performance than the traditional uniform scheme. The experiments with synthetic traffic show that the load-balanced scheme exhibits 57.33%/60.23%/47.56% lower network latency at saturation point on average compared with the uniform scheme for 8x8/10x10/12x12 mesh networks respectively. At the same time, the load-balanced link distribution scheme uses fewer physical channels, and the reductions in physical channel cost are 7.14%/5.56%/15.15% for 8x8/10x10/12x12 mesh networks respectively. The experiments with PARSEC benchmarks reveal that a 2.1% improvement of system throughput can be achieved by the load-balanced scheme.